python


Writing and reading to and from BigQuery in one Dataflow job


I have a seemingly simple problem when constructing my pipeline for Dataflow. I have multiple pipelines that fetch data from external sources, transform the data and write it to several BigQuery tables. When this process is done I would like to run a query that queries the just generated tables. Ideally I would like this to happen in the same job.
Is this the way Dataflow is meant to be used, or should the loading to BigQuery and the querying of the tables be split up between jobs?
If this is possible in the same job how would one solve this, as the BigQuerySink does not generate a PCollection? If this is not possible in the same job, is there some way to trigger a job on the completion of another job (i.e. the writing job and the querying job)?
You alluded to what would need to happen to do this in a single job -- the BigQuerySink would need to produce a PCollection. Even if it is empty, you could then use it as the input to the step that reads from BigQuery in a way that made that step wait until the first sink was done.
You would need to create your own version of the BigQuerySink to to do this.
If possible, an easier option might be to have the second step read from the collection that you wrote to BigQuery rather than reading the table you just put into BigQuery. For example:
PCollection<TableRow> rows = ...;
rows.apply(BigQuery.Write.to(...));
rows.apply(/* rest of the pipeline */);
You could even do this earlier if you wanted to continue processing the elements written to BigQuery rather than the table rows.

Related Links

Getting Slice of one dimension in ndarray
Parsing Tex using python re library
Fix the seed for the community module in Python that uses networkx module
select first n items of a list without using a loop.
when I initiate second button click -> AttributeError: Application instance has no attribute 'readfile'
Python 3.4 with an older Python script for use in Blender, TypeErrors
error when installing numpy for pypy2.2.1
How to find out in which locale it was encoded to?
alembic revision - multiple heads (due branching) error
Python set value multiindex Pandas
Multiplication table for double digit numbers using nested loops in Python
pandas time_range does not start from start date
Django How about my method to solve anti spam post request? Is there better solutions?
How can I write a csv file with multiple header lines with pandas to_csv()?
Connect to RDS from EC2 instance with Python
Make a function that extract value from a list and use it to match with another list that has a tuple in it

Categories

HOME
visual-studio-2015
caching
tinyos
system-verilog
uibutton
wxwidgets
reportportal
rocketmq
jframe
wicket
jboss-eap-7
game-physics
rapidjson
lenskit
opera-mini
pyyaml
pycrypto
tortoisegit
cronet
fabric
philips-hue
pipelinedb
mef2
multichoiceitems
referenceerror
pycparser
rhomobile
npm-shrinkwrap
modulo
shibboleth
nodeclipse
zero
stdclass
paho
rkt
worker
jtds
dimple.js
istorage
pinvoke
azureportal
uicollectionviewlayout
date-range
adler32
bettercms
efxclipse
gulp-typescript
master
debugdiag
scalar
freetype2
instruments
pagekit
django-south
aerogear
sun-codemodel
fill
anythingslider
insertion-sort
eyeql
textkit
windowlistener
divide-by-zero
twitter-rest-api
guzzle6
ember-cli-addons
word-2013
google-admin-audit-api
firebug-lite
surrogate-key
kraken.js
db4o
trimming
ng-hide
carddav
multiple-conditions
grunt-contrib-compass
git-filter-branch
autostart
symfony-2.0
macruby
hibernate3
msn
osx-leopard
sitemappath
jquery-ui-button
idictionary
multibyte-functions
swfloader
odbc-sql-server-driver
handheld
castle-validators
document-library

Resources

Mobile Apps Dev
Database Users
javascript
java
csharp
php
android
MS Developer
developer works
python
ios
c
html
jquery
RDBMS discuss
Cloud Virtualization
Database Dev&Adm
javascript
java
csharp
php
python
android
jquery
ruby
ios
html
Mobile App
Mobile App
Mobile App