Cloud Data Warehouse sources to File Storage destinations transfers
- Updated on 21 Jan 2025
Diagram representing Table sources (e.g., Google BigQuery), and File destinations (e.g., Google Cloud Storage).
To deliver data from a data warehouse (source tables) to a file storage destination (destination files), Bobsled seeks to replicate the source tables as a well-structured folder with files.
Replication methods
There are three primary ways to replicate from the data warehouse to file storage, plus a Snowflake-only change-tracking option. The right choice depends on your ETL/ELT approach in your data warehouse, as well as the scale of the data.
Full-table replication
Using the full-table replication method, Bobsled will extract the full table from the source data warehouse each time the data changes, then deliver the data to a new path. Bobsled delivers compressed file(s) that represent the full state of the table.
This pattern is well-suited for:
Tables where the ETL overwrites much of the data each time the pipeline runs, or where no timestamps are maintained to identify which records have changed, and,
Small tables where the contents don't change too often.
On each extraction and data transfer, Bobsled will write data parts into a new folder and write a _SUCCESS file after all the data has been successfully extracted from the source.
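Consumers can use the _SUCCESS marker to avoid reading a folder while a transfer is still in flight. A minimal sketch (the helper name and object-key layout are illustrative, not part of Bobsled's API):

```python
# Illustrative sketch: treat a delivery folder as readable only once its
# _SUCCESS marker exists. Keys mimic the documented folder layout.

def is_drop_complete(object_keys, folder_prefix):
    """Return True once the folder contains its _SUCCESS marker file."""
    return (folder_prefix + "_SUCCESS") in object_keys

keys = [
    "table_a/version=1/2025/01/21/09_30_00/20250121_part_0.parquet",
    "table_a/version=1/2025/01/21/09_30_00/_SUCCESS",
]
```

Because the marker is written only after all parts land, its presence guarantees the folder is safe to read.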
Cursor to control full-table replication on views in BigQuery
If needed, you can use a last_modified_at field to help manage data copying from Google BigQuery views. This acts like a bookmark for the last time the data was changed.
To set this up, make a new source share in the BigQuery UI, and choose the necessary column as the last_modified_at bookmark in the replication settings. Bobsled will take the maximum value of the last_modified_at field and check whether it has increased since the last sync.
NOTE:
This only works for views, not tables. If you don’t specify a last_modified_at column, the system will automatically update full-table replication views every 24 hours.
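The cursor check can be pictured as follows; this is a sketch of the comparison, with illustrative row dicts and a stored cursor value standing in for Bobsled's internal state:

```python
# Sketch: sync a BigQuery view only when max(last_modified_at) has
# advanced past the cursor recorded at the last sync.

def needs_sync(rows, last_cursor):
    """Return (should_sync, new_cursor) from max(last_modified_at)."""
    current_max = max(row["last_modified_at"] for row in rows)
    return current_max > last_cursor, current_max

rows = [
    {"id": 1, "last_modified_at": "2025-01-20T08:00:00"},
    {"id": 2, "last_modified_at": "2025-01-21T09:30:00"},
]
```

ISO-8601 timestamps compare correctly as strings, which keeps the sketch dependency-free.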
Incremental replication - Append only
Bobsled will identify new records based on a created at field in the table. This field is used as a bookmark; on each sync, Bobsled extracts the rows whose value is higher than the bookmark and transfers them to the destination.
This pattern is well-suited for:
Cases where the rows in the table are immutable records and old records are not updated (e.g. event data or logs).
You will be required to provide Bobsled with one value in your dataset: a created at field.
Created at: A key or composite key that Bobsled will use to identify new records. If your dataset does not track when new records are created, consider using a different loading pattern.
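The bookmark behaviour can be sketched like this; the row shape and field name `created_at` are illustrative:

```python
# Append-only sketch: the created at field acts as a bookmark, and each
# sync picks up only rows past it, then advances the bookmark.

def extract_new_rows(rows, bookmark):
    """Return rows newer than the bookmark, plus the advanced bookmark."""
    new_rows = [r for r in rows if r["created_at"] > bookmark]
    new_bookmark = max((r["created_at"] for r in new_rows), default=bookmark)
    return new_rows, new_bookmark

events = [
    {"event": "a", "created_at": "2025-01-20"},
    {"event": "b", "created_at": "2025-01-21"},
]
```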
Incremental replication - Update and Append
Bobsled will identify newly updated records based on an updated at field in the table. Bobsled will then replicate the newly added or updated records to the destination and ensure that only the latest version of each record, identified by its unique identifier, is present in the destination.
This method is well-suited for datasets where records evolve or move through various states such as e-commerce orders where the status can change. With each automation, all new data is promoted transactionally regardless of whether the records were newly appended or updated.
You will be required to provide Bobsled with two values in your dataset: a unique identifier and a last modified at field.
Unique identifier(s): A key or composite key that should be unique and distinctive for each record that will be updated.
Last modified at: A key or composite key that Bobsled will use to identify the most recent version of each record. If your dataset does not have a last modified value for Bobsled to identify, consider using a different loading pattern.
Delete flag (optional): A key that Bobsled will use to determine whether a given row should be deleted from the destination. Bobsled will ensure rows marked for deletion are not added, and will remove any existing row with the same unique identifier when a new record arrives marked for deletion.
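From the consumer's side, the contract amounts to collapsing records to the newest version per key and dropping flagged rows. A minimal sketch, assuming illustrative field names (`id`, `last_modified_at`, `is_deleted`):

```python
# Update-and-append sketch: keep only the latest version of each record
# and drop records whose delete flag is set.

def latest_state(records):
    """Collapse records to the latest version per unique identifier."""
    latest = {}
    for rec in sorted(records, key=lambda r: r["last_modified_at"]):
        latest[rec["id"]] = rec  # later timestamps overwrite earlier ones
    return [r for r in latest.values() if not r.get("is_deleted", False)]

orders = [
    {"id": 1, "status": "placed", "last_modified_at": "2025-01-20"},
    {"id": 1, "status": "shipped", "last_modified_at": "2025-01-21"},
    {"id": 2, "status": "x", "last_modified_at": "2025-01-21", "is_deleted": True},
]
```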
Snowflake source only: Row based replication - Change tracking
Bobsled will identify newly updated records by querying the Snowflake change stream to look for newly inserted, updated, or deleted rows. This requires change tracking to be turned on for the table, or for the table(s) referenced within a view.
This method is well-suited for:
Datasets where records evolve or move through various states such as e-commerce orders where the status can change. With each automation, all new data is promoted transactionally regardless of whether the records were newly appended or updated.
This method should be used if rows are deleted from your source table and you need to make sure they are deleted downstream.
You will be required to provide Bobsled with one value in your dataset: a unique identifier.
Unique identifier: A key or composite key that should be unique and distinctive for each record that will be updated.
Created at (optional): An optional key that Bobsled will use to filter down the MERGE statement run in the destination data warehouse to improve the performance of the load by pruning files containing historical records. This should be used in cases where the table is large or updates very frequently, and when your created_at field does not change and is not nullable. When using the created at key, it is suggested to cluster the table by that field for optimal loading performance.
NOTE:
If a share is paused for longer than the data retention window, a backfill will be automatically kicked off. As such, Bobsled recommends increasing the table's time travel window on tables used in the change tracking replication pattern to enable pausing when required.
Folder structure approach
Bobsled file storage delivery from data warehouse sources employs a versioning system to manage data updates for each table delivered to consumers. Each table's data is stored in a versioned folder structure such as table_a/version=n/ and table_b/version=n/, where n represents the version number for that specific table.
Each version number represents a new backfill from the source table, and within the latest version number, all data needed to rebuild the latest table state is present. Each data replication pattern employs the same versioning approach; however, processing the data in the different replication patterns requires slightly different techniques for consumers.
Overview
Bobsled transfers files into the folder structure: /table_name/version=n/YYYY/MM/DD/HH_MM_SS/file_name.file_type
- table_name: the source table name or a provided name
- version: the version number
- YYYY: four-digit year of extraction time
- MM: two-digit month of extraction time
- DD: two-digit day of extraction time
- HH: two-digit hour of extraction time
- MM: two-digit minute of extraction time
- SS: two-digit second of extraction time
- file_name: the file name is structured as YYYYMMDD_part_num.file_type
NOTE:
All times are in UTC.
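The path components above can be pulled apart mechanically. A sketch assuming exactly the documented layout (the function name is illustrative):

```python
# Parse a delivery path of the form
# /table_name/version=n/YYYY/MM/DD/HH_MM_SS/file_name.file_type

def parse_delivery_path(path):
    """Split a delivery path into its documented components (times in UTC)."""
    table, version, yyyy, mm, dd, hh_mm_ss, file_name = path.strip("/").split("/")
    return {
        "table": table,
        "version": int(version.split("=")[1]),
        "extracted_at_utc": f"{yyyy}-{mm}-{dd} {hh_mm_ss.replace('_', ':')}",
        "file_name": file_name,
    }
```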
Folder structure by data replication patterns
Full-table Replication:
- Each time there's a change in the source data, a complete copy of the table is sent to file storage with a new version number.
Incremental replication - Append Only:
- New rows represent inserts.
- When a table backfill occurs, the version increments.
- Backfills can be triggered manually if a correction to the data is required.
- All incremental changes are appended to the existing data in new files in date-partitioned folders.
Incremental - Update and Append and Change tracking:
- New rows represent INSERTS, UPDATES, or DELETES.
- Rows should be MERGED into the destination:
  - MATCHED ON the unique_identifier(s) field(s)
  - DEDUPED to find the latest record by last_modified_at field DESC, file name DESC, or any other key that is updated in your source data to reflect the latest update
  - DELETED when _META_IS_DELETED = true
- When a table backfill occurs, the version increments.
- Backfills can be triggered manually if a correction to the data is required.
- Backfills can also be automatically triggered in the case of change tracking loss or editing the replication configuration.
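The MERGE steps above can be sketched from the consumer's side. `_META_IS_DELETED` comes from the delivered files; the other field names (`id`, `last_modified_at`) are illustrative, and for brevity this sketch orders by timestamp only rather than the full timestamp-then-file-name tie-break:

```python
# Consumer-side sketch: match on the unique identifier, keep the latest
# record per key, and honour the _META_IS_DELETED flag.

def merge_changes(existing, changes):
    """Apply a batch of change rows to the previously built state."""
    state = {row["id"]: row for row in existing}
    # Process oldest first so the latest last_modified_at wins.
    for row in sorted(changes, key=lambda r: r["last_modified_at"]):
        if row.get("_META_IS_DELETED"):
            state.pop(row["id"], None)
        else:
            state[row["id"]] = row
    return state

existing = [{"id": 1, "status": "placed", "last_modified_at": "2025-01-19"}]
changes = [
    {"id": 1, "status": "shipped", "last_modified_at": "2025-01-20"},
    {"id": 2, "status": "placed", "last_modified_at": "2025-01-20"},
    {"id": 2, "last_modified_at": "2025-01-21", "_META_IS_DELETED": True},
]
state = merge_changes(existing, changes)
```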
Accessing the latest state of each table
To determine the latest state of each table, follow these steps:
Find the latest version folder
Apply the appropriate logic depending on the replication pattern chosen to recreate the latest table state as described above.
Note - when a new version folder is created, a backfill should be run in any destination database or data warehouse before the new version is loaded
Example
To access the latest data for table_a:
- Go to s3://[your-bucket]/table_a/
- Identify the folder with the highest version=n (e.g., version=5)
- Access the data files within s3://[your-bucket]/table_a/version=5/ with appropriate transformation logic.
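Step 2 of the example can be sketched as a scan of listed object keys for the highest version number; the key list and function name are illustrative:

```python
import re

# Sketch: pick the highest version=n folder for a table from listed keys.

def latest_version(object_keys, table):
    """Return the highest version number seen for the given table."""
    pattern = re.compile(re.escape(table) + r"/version=(\d+)/")
    versions = [int(m.group(1)) for k in object_keys if (m := pattern.match(k))]
    return max(versions)

keys = [
    "table_a/version=4/2025/01/20/10_00_00/20250120_part_0.parquet",
    "table_a/version=5/2025/01/21/09_30_00/20250121_part_0.parquet",
    "table_b/version=9/2025/01/21/09_30_00/20250121_part_0.parquet",
]
```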
File format
Bobsled supports delivering files to consumers in three formats:
Parquet (snappy): snappy compressed
CSV (gzip): gzip compressed
JSON (gzip): gzip compressed
Further file format delivery options are also available; however, these vary by source and format.
Single file: option to deliver data within one file. Only choose this for small tables (not available in BigQuery);
Target file size: target file size for large tables. The default is 250 MB (not available in BigQuery);
CSV format options
- Delimiter: choose between comma (default), semi-colon, tab, and pipe delimiters
- Escape character to escape double quotes: double quote “ (default) or backslash \ (not available in BigQuery)
- CSV data is optionally enclosed by double quotes “
Success files
Success files can optionally be excluded from the delivery to consumers. Bobsled encourages the use of success files alongside the version folder prefixes to enable transactional consistency when delivering data.
Empty source tables
In Bobsled, if you initiate an operation with empty tables in your data warehouse source, the operation will complete successfully but no data will be delivered to these destination folders. This behavior allows operations to proceed without requiring any error management, even when starting with empty source tables.