Writing and reading to and from BigQuery in one Dataflow job
I have a seemingly simple problem when constructing my pipeline for Dataflow. I have multiple pipelines that fetch data from external sources, transform the data and write it to several BigQuery tables. When this process is done I would like to run a query that queries the just generated tables. Ideally I would like this to happen in the same job. Is this the way Dataflow is meant to be used, or should the loading to BigQuery and the querying of the tables be split up between jobs? If this is possible in the same job how would one solve this, as the BigQuerySink does not generate a PCollection? If this is not possible in the same job, is there some way to trigger a job on the completion of another job (i.e. the writing job and the querying job)?
You alluded to what would need to happen to do this in a single job -- the BigQuerySink would need to produce a PCollection. Even if it is empty, you could then use it as the input to the step that reads from BigQuery in a way that made that step wait until the first sink was done. You would need to create your own version of the BigQuerySink to to do this. If possible, an easier option might be to have the second step read from the collection that you wrote to BigQuery rather than reading the table you just put into BigQuery. For example: PCollection<TableRow> rows = ...; rows.apply(BigQuery.Write.to(...)); rows.apply(/* rest of the pipeline */); You could even do this earlier if you wanted to continue processing the elements written to BigQuery rather than the table rows.
Getting Slice of one dimension in ndarray
Parsing Tex using python re library
Fix the seed for the community module in Python that uses networkx module
select first n items of a list without using a loop.
when I initiate second button click -> AttributeError: Application instance has no attribute 'readfile'
Python 3.4 with an older Python script for use in Blender, TypeErrors
error when installing numpy for pypy2.2.1
How to find out in which locale it was encoded to?
alembic revision - multiple heads (due branching) error
Python set value multiindex Pandas
Multiplication table for double digit numbers using nested loops in Python
pandas time_range does not start from start date
Django How about my method to solve anti spam post request? Is there better solutions?
How can I write a csv file with multiple header lines with pandas to_csv()?
Connect to RDS from EC2 instance with Python
Make a function that extract value from a list and use it to match with another list that has a tuple in it