For development purposes, I'd like to cache the results of queries made in BigQuery by the beam.io.ReadFromBigQuery connector, so that I can load them quickly from the local file system when running the exact same query on subsequent runs.
The problem is that I cannot run any PTransform before beam.io.ReadFromBigQuery to check whether a cache exists and, if it does, skip the read from BigQuery.
So far I have come up with two possible solutions:
- Creating a custom beam.DoFn for reading from BigQuery. It would include the caching mechanism, but it might underperform compared to the existing connector. One variation might be to inherit from the existing connector, but that would require knowledge of Beam "under the hood", which might be overwhelming. A rough sketch follows this list.
- Implementing the caching while building the pipeline, so that the read step is chosen according to whether or not the cache exists (apache_beam.io.textio.ReadAllFromText if it does, beam.io.ReadFromBigQuery otherwise). A sketch of this follows as well.
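
Here is a minimal sketch of the first option, assuming the google-cloud-bigquery client library is available. The class name CachedBigQueryReadFn, the cache directory, and the JSON-lines cache format are all placeholders of mine, and since the whole query runs inside a single DoFn invocation it would not parallelize the way the real connector does:

```python
import hashlib
import json
import os

import apache_beam as beam


class CachedBigQueryReadFn(beam.DoFn):
    # Hypothetical DoFn: replays rows from a local JSON-lines cache when
    # the same query was run before; otherwise runs the query through the
    # google-cloud-bigquery client and writes the cache on the way out.
    def __init__(self, query, cache_dir="/tmp/bq_cache"):
        self.query = query
        self.cache_dir = cache_dir

    def process(self, _):
        from google.cloud import bigquery  # imported on the worker

        digest = hashlib.sha256(self.query.encode("utf-8")).hexdigest()[:16]
        path = os.path.join(self.cache_dir, digest + ".jsonl")

        if os.path.exists(path):  # cache hit: replay rows from disk
            with open(path) as f:
                for line in f:
                    yield json.loads(line)
            return

        # Cache miss: run the query and persist each row as a JSON line.
        client = bigquery.Client()
        os.makedirs(self.cache_dir, exist_ok=True)
        with open(path, "w") as f:
            for row in client.query(self.query).result():
                record = dict(row)
                # default=str crudely handles dates/timestamps
                f.write(json.dumps(record, default=str) + "\n")
                yield record


# Usage: seed with a single dummy element so the DoFn runs exactly once.
# rows = (p | beam.Create([None])
#           | beam.ParDo(CachedBigQueryReadFn("SELECT ...")))
```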
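And a minimal sketch of the second option, where the branch is decided at pipeline-construction time. The cache-key scheme and paths are placeholders; I use beam.io.ReadFromText instead of ReadAllFromText here only because the file pattern is already known when the pipeline is built:

```python
import glob
import hashlib
import json
import os

import apache_beam as beam

CACHE_DIR = "/tmp/bq_cache"  # placeholder location


def cache_prefix(query):
    # Key the cache on a hash of the query text.
    digest = hashlib.sha256(query.encode("utf-8")).hexdigest()[:16]
    return os.path.join(CACHE_DIR, digest)


def read_query(pipeline, query):
    prefix = cache_prefix(query)
    # WriteToText produces shards like <prefix>-00000-of-00001.
    if glob.glob(prefix + "-*"):
        # Cache hit: the branch is chosen here, while building the graph.
        return (pipeline
                | "ReadCache" >> beam.io.ReadFromText(prefix + "-*")
                | "ParseJson" >> beam.Map(json.loads))

    # Cache miss: read from BigQuery and write the cache as a side output.
    rows = pipeline | "ReadBQ" >> beam.io.ReadFromBigQuery(
        query=query, use_standard_sql=True)
    _ = (rows
         | "ToJson" >> beam.Map(json.dumps, default=str)
         | "WriteCache" >> beam.io.WriteToText(prefix))
    return rows
```

On the first run the pipeline reads from BigQuery and writes the cache files; on the next run with the same query the glob matches, and the BigQuery step never enters the graph at all. Note that the glob check only sees the local file system, which is fine for the development use case.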