Overview: Schema Inference and Mapping
Data types are not identical across file formats and databases. When transferring structured data from a source to a destination, Bobsled takes care of reading the columns and data types of the source data and mapping them accurately to the destination.
When Bobsled reads the data from a source, we infer the schema based on the file format and data source.
For self-describing file formats such as Parquet, we read the schema directly from the files. The same is true of data from a data warehouse source.
For formats like CSV and JSON, we auto-infer the schema.
NOTE:
For source schemas that contain the same column name in different casings (e.g. ‘hash’ and ‘HASH’), Bobsled will infer only one of the columns, chosen at random.
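To make the auto-inference idea concrete, here is a minimal, illustrative sketch of CSV type inference in Python. The `infer_type` helper is hypothetical and greatly simplified; it is not Bobsled's actual inference algorithm.

```python
import csv
import io

def infer_type(values):
    """Guess a column type from its sampled string values.
    Hypothetical helper -- not Bobsled's actual inference logic."""
    def all_parse(parse):
        for v in values:
            try:
                parse(v)
            except ValueError:
                return False
        return True

    if all(v.upper() in ("TRUE", "FALSE") for v in values):
        return "BOOLEAN"
    if all_parse(int):
        return "INTEGER"
    if all_parse(float):
        return "FLOAT"
    return "STRING"

sample = io.StringIO("id,price,name\n1,9.99,widget\n2,4.50,gadget\n")
rows = list(csv.DictReader(sample))
schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'id': 'INTEGER', 'price': 'FLOAT', 'name': 'STRING'}
```

Self-describing formats like Parquet skip this step entirely, since the schema travels with the file.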
This schema is represented using our internal Bobsled Data Types. These Bobsled data types are designed to provide an interchange layer between various Sources and Destinations and to help providers better understand the way data will be delivered to downstream destinations.
Providers should therefore familiarize themselves with the Bobsled data types and can use this documentation to understand how Bobsled types determine destination data types.
Bobsled Data Types
Bobsled Data Types include all the well-known SQL Data Types.
Primitive Data Types
| Data Type | Explanation |
|---|---|
| BINARY | Variable-length binary data. |
| BOOLEAN | TRUE, FALSE, or NULL |
| DECIMAL(precision,scale) | A fixed-point decimal number. The precision and scale are limited by the source and destination systems or file formats. See the sections on mappings for details. |
| FLOAT | Double-precision (64-bit) floating-point number. |
| FLOAT<FLOAT4 \| FLOAT8> | Parameterized floating point. Can represent either a 32-bit or 64-bit floating-point number. Currently supported only when transferring data from Parquet to Databricks. |
| INTEGER | Integer data type. The range of values depends on the source and destination systems or file formats. |
| INTEGER<TINYINT \| SMALLINT \| INT \| BIGINT> | Parameterized integer data type. Represents 1-, 2-, 4-, and 8-byte integers. Currently supported only when transferring data from Parquet to Databricks. |
| STRING | UTF-8 encoded string of varying length. |
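As a concrete reading of DECIMAL(precision,scale): precision bounds the total number of digits and scale bounds the digits after the decimal point. The `fits` helper below is a hypothetical illustration of that constraint, not a Bobsled API.

```python
from decimal import Decimal

def fits(value, precision, scale):
    """Check whether a value is representable as DECIMAL(precision, scale).
    Hypothetical illustration, not a Bobsled API."""
    d = Decimal(value)
    quantized = d.quantize(Decimal(1).scaleb(-scale))
    if quantized != d:
        return False  # more fractional digits than the scale allows
    return len(quantized.as_tuple().digits) <= precision

print(fits("123.45", 5, 2))   # True: 5 digits total, 2 after the point
print(fits("1234.5", 5, 2))   # False: needs 6 digits once scaled to 2 places
print(fits("123.456", 5, 2))  # False: 3 fractional digits exceed scale 2
```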
Date and Time
| Data Type | Explanation |
|---|---|
| DATE | SQL date in the Gregorian calendar. |
| TIME_NTZ | A wall-clock TIME value (e.g. 10:23), irrespective of any time zone. |
| TIME_TZ | A wall-clock TIME value in a specific time zone (e.g. 10:23 CET). |
| TIMESTAMP_NTZ | A wall-clock date-time value (e.g. 2024-04-01 10:23), irrespective of any time zone. Equivalent to DATETIME in some SQL dialects. |
| TIMESTAMP_TZ | A specific point in time (e.g. 2024-04-01 10:23 UTC). |
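The _NTZ/_TZ distinction maps directly onto naive vs. time-zone-aware values in most languages; a quick Python sketch:

```python
from datetime import datetime, timezone, timedelta

# TIMESTAMP_NTZ: a wall-clock reading with no zone attached (a naive datetime).
ntz = datetime(2024, 4, 1, 10, 23)

# TIMESTAMP_TZ: the same wall-clock reading anchored to a zone, i.e. a
# specific point in time that can be converted between zones.
cest = timezone(timedelta(hours=2), "CEST")
tz = datetime(2024, 4, 1, 10, 23, tzinfo=cest)

print(ntz.tzinfo)                   # None: no zone to convert with
print(tz.astimezone(timezone.utc))  # 2024-04-01 08:23:00+00:00
```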
Geospatial Data Types
Complex Data Types
| Data Type | Explanation |
|---|---|
| ARRAY<Primitive> | Variable-length list of elements of simple data types. Contained elements must all be of the same data type. |
| ARRAY<COMPLEX> | COMPLEX is a notional data type indicating that the contained element is itself an array, struct, or map. |
| JSON | JSON-formatted string. Also used for nested types such as structs and maps. |
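The practical consequence: an ARRAY<Primitive> column must be element-homogeneous, while anything nested (structs, maps, arrays of arrays) travels as a JSON-formatted string. A small Python illustration:

```python
import json

# ARRAY<Primitive>: every element shares one primitive type.
tags = ["a", "b", "c"]
assert len({type(v) for v in tags}) == 1  # homogeneous

# Nested structs/maps fall back to the JSON Bobsled type: the value
# is carried as a JSON-formatted string and parsed downstream.
nested = {"user": {"id": 7, "roles": ["admin", "ops"]}}
as_json = json.dumps(nested)

print(as_json)
print(json.loads(as_json)["user"]["roles"])  # ['admin', 'ops']
```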
Mapping into Bobsled Data Types
Snowflake → Bobsled
| Snowflake | Bobsled Data Type |
|---|---|
| ARRAY | ARRAY<Primitive>. Snowflake doesn't provide element types for arrays, so the primitive type is always mapped to STRING here (see Snowflake documentation ↗). |
| VARIANT / OBJECT | JSON |
| BINARY / VARBINARY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| NUMBER(p,s) | DECIMAL(p,s) |
| INTEGER | DECIMAL(38,0). Snowflake internally stores integers as NUMBERs with maximum precision. |
| FLOAT | FLOAT |
| GEOGRAPHY | GEOGRAPHY |
| GEOMETRY | GEOMETRY |
| TIME | TIME_NTZ |
| TIMESTAMP / TIMESTAMP_NTZ | TIMESTAMP_NTZ |
| TIMESTAMP_TZ | TIMESTAMP_TZ |
| VARCHAR | STRING |
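The table above can be read as a simple lookup plus one parameterized case for NUMBER(p,s). The sketch below restates it in Python for illustration; `to_bobsled` is a hypothetical helper, not a Bobsled API.

```python
import re

# Illustrative lookup built from the mapping table above -- not a Bobsled API.
SNOWFLAKE_TO_BOBSLED = {
    "ARRAY": "ARRAY<Primitive>",
    "VARIANT": "JSON", "OBJECT": "JSON",
    "BINARY": "BINARY", "VARBINARY": "BINARY",
    "BOOLEAN": "BOOLEAN",
    "DATE": "DATE",
    "INTEGER": "DECIMAL(38,0)",
    "FLOAT": "FLOAT",
    "GEOGRAPHY": "GEOGRAPHY", "GEOMETRY": "GEOMETRY",
    "TIME": "TIME_NTZ",
    "TIMESTAMP": "TIMESTAMP_NTZ", "TIMESTAMP_NTZ": "TIMESTAMP_NTZ",
    "TIMESTAMP_TZ": "TIMESTAMP_TZ",
    "VARCHAR": "STRING",
}

def to_bobsled(snowflake_type):
    m = re.fullmatch(r"NUMBER\((\d+),(\d+)\)", snowflake_type)
    if m:  # NUMBER(p,s) carries its precision and scale across
        return f"DECIMAL({m.group(1)},{m.group(2)})"
    return SNOWFLAKE_TO_BOBSLED[snowflake_type]

print(to_bobsled("NUMBER(10,2)"))  # DECIMAL(10,2)
print(to_bobsled("INTEGER"))       # DECIMAL(38,0)
```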
BigQuery → Bobsled
| BigQuery | Bobsled Type |
|---|---|
| ARRAY<Primitive> | ARRAY<Primitive> |
| ARRAY<STRUCT/MAP> | ARRAY<COMPLEX> |
| BYTES | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| NUMERIC(p,s) | DECIMAL(p,s) |
| NUMERIC(38,0) / NUMERIC | DECIMAL(38,0) |
| BIGNUMERIC(p,s) | DECIMAL(p,s) |
| BIGNUMERIC(76,0) / BIGNUMERIC | FLOAT |
| INTEGER | INTEGER |
| FLOAT | FLOAT |
| GEOGRAPHY | GEOGRAPHY |
| JSON | JSON |
| MAP | JSON |
| STRUCT | JSON |
| VARCHAR | STRING |
| TIME | TIME_NTZ |
| DATETIME / TIMESTAMP | TIMESTAMP_NTZ |
Parquet/Delta-Lake → Bobsled
| Parquet | Bobsled Type |
|---|---|
| LIST<Primitive> | ARRAY<Primitive> |
| LIST<LIST/STRUCT/MAP> | ARRAY<COMPLEX> |
| BYTE_ARRAY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| NUMERIC(p,s) | DECIMAL(p,s) |
| TINYINT | INTEGER<TINYINT> |
| SMALLINT / UTINYINT | INTEGER<SMALLINT> |
| INTEGER / USMALLINT | INTEGER<INT> |
| BIGINT / UINTEGER | INTEGER<BIGINT> |
| INTERVAL | Not supported |
| FLOAT | FLOAT<FLOAT4> |
| DOUBLE | FLOAT<FLOAT8> |
| MAP | JSON |
| STRUCT | JSON |
| JSON | JSON |
| VARCHAR | STRING |
| TIME_NTZ (isAdjustedToUTC=false) | TIME_NTZ |
| TIME_TZ (isAdjustedToUTC=true) | TIME_TZ |
| TIMESTAMP_NTZ (isAdjustedToUTC=false) | TIMESTAMP_NTZ |
| TIMESTAMP_TZ (isAdjustedToUTC=true) | TIMESTAMP_TZ |
| UUID | STRING |
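The widening of unsigned integer types in the table follows from bit widths: an unsigned value can exceed the signed maximum of the same width, so it is promoted to the next signed size. For example:

```python
def signed_range(bits):
    """Inclusive range of a signed two's-complement integer of the given width."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1

# UTINYINT (unsigned 8-bit) can hold up to 255, which overflows a signed
# TINYINT (max 127) -- hence the mapping up to INTEGER<SMALLINT>.
print(signed_range(8))   # (-128, 127)
print(signed_range(16))  # (-32768, 32767)

assert 255 > signed_range(8)[1]    # doesn't fit in TINYINT
assert 255 <= signed_range(16)[1]  # fits in SMALLINT
```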
CSV/JSON → Bobsled
| CSV | Bobsled Type |
|---|---|
| BINARY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| NUMERIC(p,s) | DECIMAL(p,s) |
| INT64 (and all other signed and unsigned INT types) | INTEGER |
| FLOAT (and other FLOAT types) | FLOAT |
| VARCHAR | STRING |
| TIMENTZ | TIME_NTZ |
| TIMETZ | TIME_NTZ |
| TIMESTAMPNTZ | TIMESTAMP_NTZ |
| TIMESTAMP_TZ | TIMESTAMP_TZ |
| ARRAY<Primitive> | ARRAY<Primitive> |
| ARRAY<COMPLEX> | ARRAY<COMPLEX> |
| JSON | JSON |
Bobsled Types to Destinations Mapping
Bobsled → Snowflake
| Bobsled Type | Snowflake |
|---|---|
| ARRAY<Primitive> | ARRAY |
| ARRAY<COMPLEX> | ARRAY |
| BINARY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| DECIMAL(p,s), p<=38 AND s<=37 | DECIMAL(p,s) |
| DECIMAL(p,s), p>38 OR s>37 | STRING. If the source data contains a DECIMAL that can't be stored as a DECIMAL in Snowflake, we use a STRING. |
| FLOAT | FLOAT |
| GEOGRAPHY | GEOGRAPHY |
| GEOMETRY | GEOMETRY |
| INTEGER | INTEGER / DECIMAL(38,0) |
| JSON | VARIANT |
| STRING | STRING |
| TIME_NTZ | TIME |
| TIME_TZ | TIME |
| TIMESTAMP_NTZ | TIMESTAMP_NTZ |
| TIMESTAMP_TZ | TIMESTAMP_TZ |
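The two DECIMAL rows above amount to a single predicate; here is a hypothetical restatement of the rule in Python (an illustration, not Bobsled's implementation):

```python
def snowflake_decimal_target(precision, scale):
    """Destination type for a Bobsled DECIMAL(p,s) landing in Snowflake.
    Hypothetical restatement of the table above."""
    if precision <= 38 and scale <= 37:
        return f"DECIMAL({precision},{scale})"
    return "STRING"  # can't be stored as a Snowflake DECIMAL

print(snowflake_decimal_target(38, 9))   # DECIMAL(38,9)
print(snowflake_decimal_target(76, 38))  # STRING
```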
Bobsled → BigQuery
| Bobsled Type | BigQuery |
|---|---|
| ARRAY<Primitive> | JSON |
| ARRAY<COMPLEX> | JSON |
| BINARY | BYTES |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| DECIMAL(p,s) | NUMERIC / BIGNUMERIC, depending on BigQuery's precision rules (Learn more ↗). |
| FLOAT | FLOAT |
| GEOGRAPHY | GEOGRAPHY |
| GEOMETRY | BYTES |
| INTEGER | INTEGER |
| JSON | JSON |
| STRING | STRING |
| TIME_NTZ | TIME |
| TIME_TZ | TIME |
| TIMESTAMP_NTZ | TIMESTAMP |
| TIMESTAMP_TZ | TIMESTAMP |
Bobsled → Databricks
| Bobsled Type | Databricks |
|---|---|
| ARRAY<Primitive> | ARRAY<Primitive> |
| ARRAY<COMPLEX> | ARRAY<JSON> |
| BINARY | BINARY |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| DECIMAL(p,s), p<=38 AND s<=37 | DECIMAL(p,s) |
| DECIMAL(p,s), p>38 OR s>37 | STRING. If the source data contains a DECIMAL that can't be stored as a DECIMAL in Databricks, we use STRING. |
| FLOAT<FLOAT4> | FLOAT |
| FLOAT<FLOAT8> | DOUBLE |
| FLOAT | DOUBLE |
| GEOGRAPHY | BINARY |
| GEOMETRY | BINARY |
| INTEGER<TINYINT> | TINYINT |
| INTEGER<SMALLINT> | SMALLINT |
| INTEGER<INT> | INT |
| INTEGER<BIGINT> | BIGINT |
| INTEGER | BIGINT |
| JSON | STRING |
| STRING | STRING |
| TIME_NTZ (Parquet) | Cannot be mapped. A source Parquet file containing TIME_NTZ cannot be loaded to Databricks. |
| TIME_NTZ (CSV) | STRING |
| TIME_TZ (Parquet) | Cannot be mapped. A source Parquet file containing TIME_TZ cannot be loaded to Databricks. |
| TIME_TZ (CSV) | STRING |
| TIMESTAMP_NTZ | TIMESTAMP |
| TIMESTAMP_TZ | TIMESTAMP |
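Why the FLOAT4/FLOAT8 split matters: a 32-bit float cannot round-trip all the precision a 64-bit double carries. A quick check with Python's struct module:

```python
import struct

# FLOAT<FLOAT4> is a 32-bit float; FLOAT<FLOAT8> is a 64-bit double.
value = 0.1234567890123456789

as_f4 = struct.unpack("f", struct.pack("f", value))[0]  # through 32 bits
as_f8 = struct.unpack("d", struct.pack("d", value))[0]  # through 64 bits

print(as_f8 == value)  # True: the double round-trips exactly
print(as_f4 == value)  # False: single precision truncates the mantissa
```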
Bobsled → Redshift
| Bobsled Type | Redshift |
|---|---|
| ARRAY<Primitive> | SUPER |
| ARRAY<COMPLEX> | SUPER |
| BINARY | VARBYTE |
| BOOLEAN | BOOLEAN |
| DATE | DATE |
| DECIMAL(p,s), p<=38 AND s<=37 | DECIMAL(p,s) |
| DECIMAL(p,s), p>38 OR s>37 | VARCHAR |
| FLOAT | FLOAT |
| GEOGRAPHY | GEOGRAPHY |
| GEOMETRY | GEOMETRY |
| INTEGER | BIGINT |
| JSON | SUPER |
| STRING | VARCHAR |
| TIME_NTZ | TIME |
| TIME_TZ | TIMETZ |
| TIMESTAMP_NTZ | TIMESTAMP |
| TIMESTAMP_TZ | TIMESTAMPTZ |
Bobsled → Parquet
Parquet mappings depend on the source of the data, as we rely on the Parquet writers of the source systems.
| Bobsled Type | Snowflake → Parquet | BigQuery → Parquet |
|---|---|---|
| ARRAY<Primitive> | JSON | ARRAY<Primitive> |
| ARRAY<COMPLEX> | JSON | ARRAY<COMPLEX> |
| BINARY | BYTE_ARRAY | BYTE_ARRAY |
| BOOLEAN | BOOLEAN | BOOLEAN |
| DATE | DATE | DATE |
| DECIMAL(p,s) | DECIMAL(p,s) | DECIMAL(38,9) or DECIMAL(76,38). BigQuery widens the precision/scale to the maximum when unloading to Parquet. |
| FLOAT | FLOAT | FLOAT |
| GEOGRAPHY | BYTE_ARRAY | BYTE_ARRAY |
| GEOMETRY | BYTE_ARRAY | N/A |
| INTEGER | DECIMAL(38,0). Snowflake stores INTs as DECIMAL(38,0). | INT64 |
| JSON | STRING | STRING |
| STRING | STRING | STRING |
| TIME_NTZ | TIME_NTZ (isAdjustedToUTC=false) | TIME_NTZ (isAdjustedToUTC=false) |
| TIME_TZ | TIME_TZ (isAdjustedToUTC=true) | TIME_TZ (isAdjustedToUTC=true) |
| TIMESTAMP_NTZ | TIMESTAMP_NTZ (isAdjustedToUTC=false) | TIMESTAMP_NTZ (isAdjustedToUTC=false) |
| TIMESTAMP_TZ | TIMESTAMP_TZ (isAdjustedToUTC=true) | TIMESTAMP_TZ (isAdjustedToUTC=true) |