Cloud Data Warehouse sources to Cloud Data Warehouse destinations transfers
Updated on 30 Oct 2024
Diagram representing table sources (e.g., Snowflake) and table destinations (e.g., Databricks).
To deliver data from a source data warehouse (source tables) to a destination data warehouse (destination tables), Bobsled mirrors the source tables in the destination, prioritizing correctness and efficiency.
Table replication patterns
There are three different ways to replicate tables from one data warehouse to another. The right choice depends on the ETL/ELT approach in your data warehouse as well as the scale of the data.
Full-table replication
Using the full-table replication method, Bobsled will extract the full table from the source data warehouse and replace the contents in the destination data warehouse (see the sketch below).
This pattern is well-suited for:
- Tables where the ETL overwrites much of the data on each pipeline run, or where no timestamps are maintained to identify which records have changed.
- Small tables whose contents don't change often.
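Conceptually, each full-table sync rebuilds the destination table from a fresh extract of the source. The sketch below shows the equivalent SQL; the table names are illustrative, and Bobsled handles the actual staging and transactional replacement for you.

```sql
-- A minimal sketch of a full-table refresh, assuming illustrative
-- table names; Bobsled's actual load mechanics may differ.
CREATE OR REPLACE TABLE destination_db.public.orders AS
SELECT *
FROM source_share.public.orders;
```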
Cursor to control full-table replication on views in BigQuery
It is recommended to use a last_modified_at field to help manage data copying from Google BigQuery views. This field acts as a bookmark for the last time the data changed. To set this up, create a new source share in the BigQuery UI and choose the appropriate column as the last_modified_at bookmark in the replication settings. Bobsled will take the maximum value of the last_modified_at field and check whether it has increased since the last sync (illustrated after the note below).
If a last_modified_at column is not specified, Bobsled will automatically refresh full-table replication views every 24 hours.
NOTE:
This only works for views, not tables.
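For illustration, the cursor check described above amounts to a query like the following; the view and column names are hypothetical, and the actual query Bobsled issues may differ.

```sql
-- Compare the current maximum of the bookmark column against the value
-- recorded at the previous sync (names are hypothetical).
SELECT MAX(last_modified_at) AS current_cursor
FROM `project.dataset.orders_view`;
-- If current_cursor has increased since the last sync, the view is
-- copied again in full; otherwise the sync is skipped.
```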
Incremental replication - Append only
Using the incremental replication - append only method, Bobsled will identify new records based on a created at field in the table. This field is used as a bookmark: each time Bobsled runs, rows with a higher value in that field are identified and transferred to the destination (see the sketch below).
This pattern is well-suited for:
- Cases where the rows in the table are immutable records and old records are never updated (e.g., event data or logs).
You will be required to provide Bobsled with one value in your dataset:
- Created at: A key that Bobsled will use to identify new records. If your dataset does not track when new records are created, consider using a different loading pattern.
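As a sketch, each append-only sync boils down to a query like the following, where the table name and the :last_bookmark placeholder are hypothetical stand-ins for Bobsled's internal bookmark handling.

```sql
-- Fetch only rows created since the previous sync (the table name and
-- the :last_bookmark placeholder are hypothetical).
SELECT *
FROM source_share.public.events
WHERE created_at > :last_bookmark;
```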
Incremental replication - Update and append
Using the incremental replication - update and append method, Bobsled will identify newly updated records based on an updated at field in the table. Bobsled will then replicate the newly added or updated records to the destination warehouse and ensure that only the latest version of each record, identified by its unique identifier, is present in the destination table (a sketch of the resulting MERGE follows the field list below).
This method is well-suited for:
- Datasets where records evolve or move through various states, such as e-commerce orders where the status can change. With each automation, all new data is promoted transactionally, regardless of whether the records were newly appended or updated.
You will be required to provide Bobsled with two values in your dataset: a unique identifier(s) and a last modified at field.
- Unique identifier(s): A key or composite key that should be unique and distinctive for each record that will be updated.
- Last modified at: A key that Bobsled will use to identify the most recent version of each record. If your dataset does not have a last modified value for Bobsled to identify, consider using a different loading pattern.
- Delete flag (optional): A key that Bobsled will use to determine whether a given row should be deleted from the destination table. Bobsled will make sure any row marked for deletion is not added, and will remove any existing row with the same unique identifier as a new record marked for deletion.
- Created at (optional): A key that Bobsled will use to filter down the MERGE statement run in the destination data warehouse, improving load performance by pruning files that contain historical records. This should be used in cases where the table is large or updates very frequently, and only when your created_at field doesn't change and is not nullable; nulls in the source data can lead to duplicates. When using the created at key, it is suggested to cluster the table by that field for optimal loading performance and efficiency.
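For illustration, the destination-side MERGE looks roughly like the sketch below, assuming a hypothetical orders table with order_id as the unique identifier, an is_deleted delete flag, and a :window_start placeholder for the created-at pruning bound; Bobsled's actual statement may differ.

```sql
-- A sketch of the update-and-append MERGE (all names are hypothetical).
MERGE INTO destination_db.public.orders AS dst
USING staged_changes AS src
  ON dst.order_id = src.order_id             -- unique identifier(s)
  AND dst.created_at >= :window_start        -- optional created-at pruning
WHEN MATCHED AND src.is_deleted = TRUE THEN
  DELETE                                     -- optional delete flag
WHEN MATCHED THEN UPDATE SET
  dst.status = src.status,
  dst.updated_at = src.updated_at
WHEN NOT MATCHED AND src.is_deleted = FALSE THEN
  INSERT (order_id, status, created_at, updated_at)
  VALUES (src.order_id, src.status, src.created_at, src.updated_at);
```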
Snowflake source only: Row-based replication - Change tracking
Bobsled will identify newly updated records by querying the Snowflake change stream for newly inserted, updated, or deleted rows. This requires change tracking to be turned on for the shared table, or for every table referenced within a shared view.
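Enabling change tracking uses standard Snowflake DDL; the table name below is illustrative, and the follow-up CHANGES query is only a sketch of the kind of change-stream read involved.

```sql
-- Turn on change tracking for the shared table (for a shared view,
-- enable it on every table the view references).
ALTER TABLE source_db.public.orders SET CHANGE_TRACKING = TRUE;

-- Sketch of reading the change stream since the last sync; the
-- $last_sync_ts session variable is a hypothetical placeholder.
SELECT *
FROM source_db.public.orders
  CHANGES (INFORMATION => DEFAULT)
  AT (TIMESTAMP => $last_sync_ts);
```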
This method is well-suited for:
- Datasets where records evolve or move through various states, such as e-commerce orders where the status can change. With each automation, all new data is promoted transactionally, regardless of whether the records were newly appended or updated.
This method should be used if any rows are deleted from your source table and you need to make sure they are deleted downstream.
You will be required to provide Bobsled with one value in your dataset: a unique identifier(s).
- Unique identifier(s): A key or composite key that should be unique and distinctive for each record that will be updated.
- Created at (optional): A key that Bobsled will use to filter down the MERGE statement run in the destination data warehouse, improving load performance by pruning files that contain historical records. This should be used in cases where the table is large or updates very frequently, and only when your created_at field doesn't change and is not nullable. When using the created at key, it is suggested to cluster the table by that field for optimal loading performance.
When change tracking is enabled on views, it is recommended that you always add a column alias for each column in the SELECT statement. This improves Snowflake’s ability to parse and relay the schema properly.
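For example, a shared view might alias every column explicitly (the view and column names here are illustrative):

```sql
-- Explicit aliases help Snowflake parse and relay the view's schema.
CREATE OR REPLACE VIEW source_db.shared.orders_v AS
SELECT
  o.order_id   AS order_id,
  o.status     AS status,
  o.updated_at AS updated_at
FROM source_db.public.orders AS o;
```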
NOTE:
If a share is paused for longer than the data retention window, a backfill will be automatically kicked off. As such, Bobsled recommends increasing the Time Travel retention window on tables used in the change tracking replication pattern, so that shares can be paused when required.
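Increasing the retention window uses standard Snowflake DDL; the 30-day value and table name below are illustrative.

```sql
-- Extend Time Travel so the share can stay paused longer without
-- triggering an automatic backfill.
ALTER TABLE source_db.public.orders SET DATA_RETENTION_TIME_IN_DAYS = 30;
```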
Backfilling tables
Bobsled lets you manually trigger a backfill of a table when you need to restate it. On the next scheduled sync, Bobsled will extract the full table and transactionally replace the existing table in the destination.
Backfills can be manually requested in the UI. Bobsled will also automatically trigger a backfill when required in certain cases (e.g. replication pattern change).
Empty source tables
In Bobsled, if you initiate an operation with empty tables in your data warehouse source, the operation will complete successfully, but no data will be delivered to the corresponding destination tables. This behavior allows operations to proceed without requiring any error management, even when starting with empty source tables.