Cloud Data Warehouse sources to File Storage destinations transfers
  • 21 Jan 2025
  • 7 Minutes to read

Diagram representing table sources (e.g., Google BigQuery) and file destinations (e.g., Google Cloud Storage).

To deliver data from a data warehouse (source tables) to a file storage destination (destination files), Bobsled seeks to replicate each source table as a well-structured folder of files.


Replication methods

There are three different ways to replicate from the data warehouse to file storage. This selection depends on your ETL/ELT approach in your data warehouse, as well as the scale of the data.

Full-table replication

Using the full-table replication method, Bobsled will extract the full table from the source data warehouse each time the data changes. Bobsled will then deliver the data to a new path as compressed file(s) that represent the full state of the table.

This pattern is well-suited for:

  • Tables where the ETL overwrites much of the data each time the pipeline runs, or where no timestamps are maintained to indicate which records have changed, and,

  • Small tables where the contents don't change too often.

On each extraction and data transfer, Bobsled will write data parts into a new folder and write a _SUCCESS file after all the data has been successfully extracted from the source.
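
If you consume these folders programmatically, the _SUCCESS file can serve as the signal that an extraction folder is complete before you read its data parts. The following is a minimal sketch assuming an Amazon S3 destination; the bucket and prefix names are placeholders.

```python
# Minimal sketch: wait for the _SUCCESS marker before reading an extraction folder.
# Bucket and prefix names are illustrative placeholders, not Bobsled defaults.
import boto3

s3 = boto3.client("s3")
BUCKET = "your-bucket"
PREFIX = "table_a/version=3/2025/01/21/07_30_00/"  # one extraction folder


def folder_is_complete(bucket: str, prefix: str) -> bool:
    """Return True once the _SUCCESS file has been written for this extraction."""
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix + "_SUCCESS", MaxKeys=1)
    return resp.get("KeyCount", 0) > 0


if folder_is_complete(BUCKET, PREFIX):
    # Safe to read every data part under PREFIX; the extraction finished successfully.
    parts = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
    data_keys = [o["Key"] for o in parts.get("Contents", [])
                 if not o["Key"].endswith("_SUCCESS")]
```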

Cursor to control full-table replication on views in BigQuery

If needed, you can use a 'last_modified_at' field to help manage data copying from Google BigQuery views. This acts like a bookmark for the last time the data was changed.

To set this up, create a new source share in the BigQuery UI and choose the relevant column as the last_modified_at bookmark in the replication settings. Bobsled will check whether the maximum value of the last_modified_at field has increased since the last sync.

NOTE:
This only works for views, not tables. If you don’t specify a last_modified_at column, the system will automatically update full-table replication views every 24 hours.
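
Conceptually, the check this cursor enables resembles the sketch below. This is not Bobsled's internal implementation; the project, dataset, view, and column names are placeholders, and it assumes the google-cloud-bigquery client library and a TIMESTAMP column.

```python
# Minimal sketch: compare MAX(last_modified_at) in a BigQuery view against the
# bookmark recorded at the previous sync. All identifiers are placeholders.
from datetime import datetime, timezone

from google.cloud import bigquery

client = bigquery.Client()
previous_bookmark = datetime(2025, 1, 20, tzinfo=timezone.utc)  # value saved at the last sync

sql = """
    SELECT MAX(last_modified_at) AS max_modified
    FROM `my_project.my_dataset.my_view`
"""
row = next(iter(client.query(sql).result()))

# A higher maximum than the stored bookmark means the view has new or changed data,
# so a fresh full-table extraction is warranted.
needs_refresh = row.max_modified is not None and row.max_modified > previous_bookmark
```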


Incremental replication - Append only

Bobsled will identify new records based on a created at field in the table. This field is used as a bookmark; on each sync, Bobsled extracts the rows whose created at value is higher than the bookmark and transfers them to the destination.

This pattern is well-suited for:

  • Cases where the rows in the table are immutable records and old records are not updated (e.g. event data or logs).

You will be required to provide Bobsled with one value in your dataset: a created at field.

  • Created at: A key or composite key that Bobsled will use to identify new records. If your dataset does not track when new records are created, consider using a different loading pattern.
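
Conceptually, the created at bookmark drives an extraction query like the sketch below. The table and column names are placeholders, and in practice the bookmark is persisted between syncs rather than hard-coded.

```python
# Minimal sketch: bookmark-based extraction for append-only replication.
# Table and column names are placeholders.
last_bookmark = "2025-01-20T00:00:00Z"  # highest created_at value seen in the previous sync

extract_sql = f"""
    SELECT *
    FROM my_schema.events                   -- source table with immutable rows
    WHERE created_at > '{last_bookmark}'    -- only rows added since the last sync
    ORDER BY created_at
"""
# After a successful transfer, the new bookmark becomes MAX(created_at) of the
# extracted rows, so the next sync starts where this one left off.
```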


Incremental replication - Update and Append

Bobsled will identify newly updated records based on an updated at field in the table. Bobsled will then replicate the newly added or updated records to the destination and ensure that only the latest version of each record, identified by its unique identifier, is present in the destination table.

This method is well-suited for datasets where records evolve or move through various states, such as e-commerce orders where the status can change. With each automation, all new data is promoted transactionally, regardless of whether the records were newly appended or updated.

You will be required to provide Bobsled with two values in your dataset: a unique identifier and a last modified at field.

  • Unique identifier(s): A key or composite key that should be unique and distinctive for each record that will be updated.

  • Last modified at: A key or composite key that Bobsled will use to identify the most recent version of each record. If your dataset does not have a last modified value for Bobsled to identify, consider using a different loading pattern.

  • Delete flag (optional): A key that Bobsled will use to determine whether a given row should be deleted from the destination. Bobsled will ensure that new records marked for deletion are not added, and will remove any existing row with the same unique identifier.
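
To illustrate how these fields interact when a consumer rebuilds the latest table state from the delivered files, here is a minimal sketch using pandas. All column names and values are hypothetical.

```python
# Minimal sketch: reconstruct the latest state of each record from delivered rows.
# Column names (order_id, updated_at, is_deleted) are placeholders.
import pandas as pd

rows = pd.DataFrame([
    {"order_id": 1, "status": "created",   "updated_at": "2025-01-20", "is_deleted": False},
    {"order_id": 1, "status": "shipped",   "updated_at": "2025-01-21", "is_deleted": False},
    {"order_id": 2, "status": "created",   "updated_at": "2025-01-20", "is_deleted": False},
    {"order_id": 2, "status": "cancelled", "updated_at": "2025-01-21", "is_deleted": True},
])

latest = (
    rows.sort_values("updated_at")                 # last modified at orders the versions
        .drop_duplicates("order_id", keep="last")  # unique identifier keeps one row per record
)
latest = latest[~latest["is_deleted"]]             # delete flag removes tombstoned records
# Result: order 1 survives in its "shipped" state; order 2 is dropped entirely.
```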


Snowflake source only: Row-based replication - Change tracking

Bobsled will identify newly updated records by querying the Snowflake change stream for newly inserted, updated, or deleted rows. This requires change tracking to be turned on for the table, or for any table(s) referenced within a view.

This method is well-suited for:

  • Datasets where records evolve or move through various states such as e-commerce orders where the status can change. With each automation, all new data is promoted transactionally regardless of whether the records were newly appended or updated.

This method should be used if rows are deleted from your source table and those deletions need to be reflected downstream.

You will be required to provide Bobsled with one value in your dataset: a unique identifier.

  • Unique identifier: A key or composite key that should be unique and distinctive for each record that will be updated.

  • Created at (optional): An optional key that Bobsled will use to filter down the MERGE statement run in the destination data warehouse to improve the performance of the load by pruning files containing historical records. This should be used in cases where the table is large or updates very frequently and when your created_at field doesn’t change and is not nullable. When using the created at key, it is suggested to cluster the table by that field for optimal loading performance.

NOTE:
If a share is paused for longer than the data retention window, a backfill will be automatically kicked off. As such, Bobsled recommends increasing the time travel window on tables used in the change tracking replication pattern, to enable pausing when required.
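
As an illustration, enabling change tracking (and, per the note above, extending the time travel window) on a Snowflake source table can be done with statements like the following sketch. The connection details and table name are placeholders, and the retention value is only an example.

```python
# Minimal sketch: prepare a Snowflake table for change tracking replication.
# Connection details and the table name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",
    user="your_user",
    password="your_password",
    warehouse="your_warehouse",
)
cur = conn.cursor()

# Change tracking must be enabled on the table (or on every table behind a view)
# before this replication method can read the change stream.
cur.execute("ALTER TABLE analytics.public.orders SET CHANGE_TRACKING = TRUE")

# A longer time travel window keeps change data available if the share is paused;
# 7 days is only an example value.
cur.execute("ALTER TABLE analytics.public.orders SET DATA_RETENTION_TIME_IN_DAYS = 7")
```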


Folder structure approach

Bobsled file storage delivery from data warehouse sources employs a versioning system to manage data updates for each table delivered to consumers. Each table's data is stored in a versioned folder structure like table_a/version=n/ and table_b/version=n/, where n represents the version number for that specific table.

Each version number represents a new backfill from the source table, and within the latest version number, all data needed to rebuild the latest table state is present. Each data replication pattern employs the same versioning approach; however, processing the data in the different replication patterns requires slightly different techniques for consumers.

Overview

Bobsled transfers files into the folder structure: /table_name/version=n/YYYY/MM/DD/HH_MM_SS/file_name.file_type

  • table_name: the source table name or a provided name

  • version: the version number

  • YYYY: four-digit year of extraction time

  • MM: two-digit month of extraction time

  • DD: two-digit day of extraction time

  • HH: two-digit hour of extraction time

  • MM: two-digit minute of extraction time

  • SS: two-digit second of extraction time

  • file_name: the file name is structured as YYYYMMDD_part_num.file_type

NOTE:
All times are in UTC.
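
For programmatic consumers, a delivered object key can be decomposed back into these components. The sketch below is illustrative; the example key is hypothetical but follows the convention above.

```python
# Minimal sketch: parse a delivered object key into its components.
# The example key is hypothetical but follows the documented layout.
import re

key = "orders/version=3/2025/01/21/07_30_00/20250121_000.parquet"

pattern = (
    r"^(?P<table_name>[^/]+)/version=(?P<version>\d+)/"
    r"(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/"
    r"(?P<hour>\d{2})_(?P<minute>\d{2})_(?P<second>\d{2})/"
    r"(?P<file_name>.+)$"
)
parts = re.match(pattern, key).groupdict()
# parts["table_name"] == "orders", parts["version"] == "3",
# extraction time components are in UTC, file_name is YYYYMMDD_part_num.file_type
```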

Folder structure by data replication patterns

  1. Full-table Replication:

    • Each time there's a change in the source data, a complete copy of the table is sent to file storage with a new version number

  2. Incremental replication - Append Only:

    • New rows represent inserts.

    • When a table backfill occurs, the version increments.

      • Backfills can be triggered manually if a correction to the data is required

    • All incremental changes are appended to the existing data in new files in date-partitioned folders

  3. Incremental replication - Update and Append, and Change tracking:

    • New rows represent INSERTS, UPDATES or DELETES;

    • Rows should be MERGED into the destination (see the sketch after this list) using:

      • MATCHED ON the unique_identifier(s) field(s)

      • DEDUPED to find the latest record, ordering by the last_modified_at field DESC, file name DESC, or any other key that is updated in your source data to reflect the latest update

      • DELETED when _META_IS_DELETED = true

    • When a table backfill occurs, the version increments.

      • Backfills can be triggered manually if a correction to the data is required

      • Backfills can also be automatically triggered in the case of change tracking loss or editing replication configuration
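
The following is a minimal sketch of the merge a consumer might run after loading newly delivered files into a staging table in their own warehouse, as referenced in the list above. Apart from _META_IS_DELETED, every table and column name is a placeholder, and the exact syntax will vary by warehouse.

```python
# Minimal sketch: merge newly delivered rows into a consumer-side table.
# Table and column names (other than _META_IS_DELETED) are placeholders; the syntax
# shown is Snowflake-style and will differ for other warehouses.
merge_sql = """
MERGE INTO orders AS target
USING (
    SELECT *
    FROM staged_orders
    QUALIFY ROW_NUMBER() OVER (
        PARTITION BY order_id                            -- unique identifier(s)
        ORDER BY last_modified_at DESC, file_name DESC   -- dedupe to the latest record
    ) = 1
) AS source
ON target.order_id = source.order_id                     -- matched on the unique identifier
WHEN MATCHED AND source._META_IS_DELETED THEN DELETE     -- deleted when flagged
WHEN MATCHED THEN UPDATE SET
    status = source.status,
    last_modified_at = source.last_modified_at
WHEN NOT MATCHED AND NOT source._META_IS_DELETED THEN
    INSERT (order_id, status, last_modified_at)
    VALUES (source.order_id, source.status, source.last_modified_at)
"""
```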

Accessing the latest state of each table

To determine the latest state of each table, follow these steps:

  1. Find the latest version folder

  2. Apply the appropriate logic depending on the replication pattern chosen to recreate the latest table state as described above.

NOTE:
When a new version folder is created, a backfill should be run in any destination database or data warehouse before the new version is loaded.

Example

To access the latest data for table_a:

  • Go to s3://[your-bucket]/table_a/;

  • Identify the folder with the highest version=n (e.g., version=5);

  • Access the data files within s3://[your-bucket]/table_a/version=5/ with appropriate transformation logic.
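
Putting the example together, the sketch below locates the highest version folder for table_a in an S3 bucket using boto3. The bucket name is a placeholder, and pagination is omitted for brevity.

```python
# Minimal sketch: find the highest version=n prefix for table_a.
# Bucket name is a placeholder; pagination is omitted for brevity.
import boto3

s3 = boto3.client("s3")
resp = s3.list_objects_v2(Bucket="your-bucket", Prefix="table_a/", Delimiter="/")

versions = []
for cp in resp.get("CommonPrefixes", []):
    prefix = cp["Prefix"]                  # e.g. "table_a/version=5/"
    if "version=" in prefix:
        versions.append(int(prefix.rstrip("/").split("version=")[1]))

latest_prefix = f"table_a/version={max(versions)}/"
# Read the files under latest_prefix and apply the replication-pattern logic above.
```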


File format

Bobsled supports delivering files to consumers in three formats:

  • Parquet (snappy):  snappy compressed

  • CSV (gzip): gzip compressed

  • JSON (gzip): gzip compressed

Further file format delivery options are also available; however, they vary by source and format.

  • Single file: option to deliver data within one file. Only choose this for small tables (not available in BigQuery);

  • Target file size: target file size for large tables. The default is 250 MB (not available in BigQuery);

  • CSV format options

    • Delimiter: choose between comma (default), semi-colon, tab, and pipe delimiters

    • Escape character to escape double quotes: double quote (default) or backslash \ (not available in BigQuery)

      • CSV data is optionally enclosed in double quotes.

Success files can be excluded from delivery to consumers. However, Bobsled encourages using success files alongside the version folder prefixes to enable transactional consistency when delivering data.


Empty source tables

In Bobsled, if you initiate an operation with empty tables in your data warehouse source, the operation will complete successfully but no data will be delivered to the destination folders. This behavior allows operations to proceed without requiring any error handling, even when starting with empty source tables.

