- 11 Oct 2024
- 4 Minutes to read
- PDF
Setting globs (Public preview)
To give you flexibility when creating a data transfer from file storage sources, Bobsled supports glob patterns for advanced data selection. This article describes how to set up globs in the Bobsled application (including through Developer mode) and via the Bobsled REST API ↗.
GLOBS PUBLIC PREVIEW (September 2024)
This feature is in public preview and is available for Cloud Storage sources. It is suitable for certain production workloads but may not be appropriate for all use cases. For guidance, contact your account representative.
Overview
When applying Globs with Bobsled, please consider the following:
- Globs are applied in the order they are provided in the `globs` array. This means that earlier globs can be overridden by later ones. For example, given `["**/*.*", "!**/*.csv", "**/important.csv"]`, the file `important.csv` will be included even though `csv` files were excluded by `"!**/*.csv"`.
- Positive patterns (e.g. `**/*.csv` or `**/*.*`) add to the results of what will be loaded to the destination, while negative patterns (e.g. `!**/file.csv`) subtract from the results. Therefore, a single negation (e.g. `["!**/file.csv"]`) will never match anything since nothing was added to the results. Use `["**/*.*", "!**/file.csv"]` instead.
- Globs are case-sensitive.
- Selecting the root folder and targeting the desired subfolders with `globs` is slower and more expensive. Whenever possible, use `pathNames` to restrict the transfer to only the paths that contain the files you need to deliver.
- All globs should start with either the bucket’s URL or glob stars (`**`).

For example, consider the source path `gs://bucket-name/user_data/` with the following globs:
| Glob expression | Validity | Explanation |
|---|---|---|
| | ✗ Invalid | This won’t match anything because |
| | ✔ Valid | All files and folders in the |
| | ✔ Valid | CSV files in the |
NOTE:
Bobsled uses the micromatch ↗ library to apply globs. All features of the library are available in Bobsled. If you have any questions, reach out to your account team.
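The ordering rules described in the overview can be sketched in a few lines of Python. This is a simplified illustration, not micromatch itself: the `glob_to_regex` and `select` helpers are hypothetical, support only `*` and `**`, and assume globs are matched against plain path strings.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a tiny glob subset (`**/`, `**`, `*`) into a regex.

    Hypothetical helper for illustration; micromatch supports far more.
    """
    i, parts = 0, []
    while i < len(glob):
        if glob.startswith("**/", i):
            parts.append(r"(?:.*/)?")   # zero or more leading folders
            i += 3
        elif glob.startswith("**", i):
            parts.append(r".*")         # anything, including slashes
            i += 2
        elif glob[i] == "*":
            parts.append(r"[^/]*")      # anything within one path segment
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))   # case-sensitive by default


def select(paths, globs):
    """Apply globs in order: positives add matches, negatives remove them."""
    selected = set()
    for glob in globs:
        negated = glob.startswith("!")
        pattern = glob_to_regex(glob[1:] if negated else glob)
        matches = {p for p in paths if pattern.fullmatch(p)}
        if negated:
            selected -= matches
        else:
            selected |= matches
    return sorted(selected)


paths = ["data/a.csv", "data/important.csv", "data/b.json"]
print(select(paths, ["**/*.*", "!**/*.csv", "**/important.csv"]))
print(select(paths, ["!**/important.csv"]))  # a lone negation selects nothing
```

Because each glob acts on the running result, moving `"**/important.csv"` before `"!**/*.csv"` in this sketch would exclude the file again.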
Setup instructions
Globs via the REST API
In the Create a transfer ↗ endpoint, add the `globs` property in the `transferEntity`:
```json
{
  "transferEntity": [
    {
      "entityType": "files",
      "pathNames": [
        "gs://bucket-name/user_data/ids/"
      ],
      "globs": ["**/appID001/**/userClicks/*.csv"]
    }
  ]
}
```
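For scripted setups, a request body of this shape can be assembled programmatically. The following is a minimal sketch in Python that assumes only the fields shown in the example; the `build_transfer_body` helper is hypothetical, and sending the request (URL, authentication) is left out because those details are not covered in this article.

```python
import json


def build_transfer_body(path_names, globs):
    """Build a body with the transferEntity shape shown in the example.

    Hypothetical helper: it only mirrors the fields from the example and
    rejects glob lists made up solely of negations, which never match.
    """
    if globs and all(g.startswith("!") for g in globs):
        raise ValueError(
            "globs need at least one positive pattern, e.g. '**/*.*'"
        )
    return {
        "transferEntity": [
            {
                "entityType": "files",
                "pathNames": list(path_names),
                "globs": list(globs),
            }
        ]
    }


body = build_transfer_body(
    ["gs://bucket-name/user_data/ids/"],
    ["**/appID001/**/userClicks/*.csv"],
)
print(json.dumps(body, indent=2))
```

Keeping `pathNames` as narrow as possible, as in this example, follows the recommendation to restrict the transfer to the paths that contain the files you need.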
Globs via the Bobsled Application
Option 1: Via the path selector
From the side panel, select Shares
Click the Create share button
Choose a File Storage source and a destination of your choice
Click create transfer
In the wizard, select the paths you wish to transfer. Once you select a path, locate and select the glob icon (a star with a document).
By default, Bobsled sets the “select all” glob expression, `"**/*.*"`. Edit it to match your desired target. Once no errors are shown, select Save. Bobsled displays a check icon to indicate that the path has a glob expression.
NOTE:
For Cloud Data Warehouse destinations, if you are merging multiple folders, these must share the same Glob expression(s).
Continue with the wizard depending on your destination, review your transfer configuration, and select Save transfer.
Option 2: Via the developer mode
From the side panel, select Shares
Click the Create share button
Choose a File Storage source and a destination of your choice
Click create transfer
In the wizard, locate and select the Developer mode toggle. This mode accepts JSON with the same shape as the `transferEntity` in the Bobsled API ↗.
In Developer mode, edit the `transferEntity` body: enter the details for your selection as per the Bobsled REST API ↗, or select paths via the path selector first and then switch to this mode. By default, Bobsled sets the “select all” glob expression, `"**/*.*"`. Edit it to match your desired target.
When no errors are observed, click Continue to review your transfer configuration.
Click Save transfer
NOTE:
Bobsled does not allow you to switch between modes while there are errors, but it lets you discard the changes you’ve made in Developer mode before switching back to the interactive mode.
Common Use-Cases with Globs
Here are several common scenarios where globs can be used to select files during a data transfer in Bobsled:
Include files of a certain extension:
Example: Include all `.json` files in the selected path.
Glob: `"**/*.json"`
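Exclude files of a certain extension:
Example: Exclude all `.tmp` files from the selected path. Because a negation on its own never matches anything, pair it with a positive pattern.
Glob: `["**/*.*", "!**/*.tmp"]`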
Share files from a specific start date with no end date (assuming `yyyy/mm/dd` partitioning):
Example: Include all files from August 1, 2023, onwards.
Glob: `["**/2023/08/{01..31}/**", "**/2023/{09..12}/**", "**/{2024..}/**"]`
Share files from a specific start date with no end date (Hive-style partitioning):
Example: Include all files from August 1, 2023, onwards, assuming Hive-style partitioning (`/year=yyyy/month=mm/day=dd/`).
Glob: `["**/year=2023/month=08/day={01..31}/**", "**/year=2023/month={09..12}/**", "**/year={2024..}/**"]`
Share files from a specific start date and end date (assuming `yyyy/mm/dd` partitioning):
Example 1: Include files from January 2023.
Glob: `"**/2023/01/**"`
Example 2: Include files from March 1, 2022, to March 31, 2022.
Glob: `"**/2022/03/{01..31}/**"`
Choose a certain partition named after a stock ticker (e.g., `stock=ibm`):
Example: Include all files under the partition where `stock=ibm`.
Glob: `"**/stock=ibm/**"`
Choose a given region when paths are named with regions (e.g., `/US/`):
Example: Include all files under the `/US/` region.
Glob: `"**/US/**"`
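To check which partitions a date-range glob covers before saving a transfer, it can help to expand the numeric brace ranges by hand. The sketch below is a hypothetical Python helper (`expand_numeric_ranges`), not part of Bobsled or micromatch. It handles only closed ranges of the form `{start..end}` and assumes zero-padding is kept when either bound starts with `0`; any other pattern is returned unchanged.

```python
import re

# Matches a closed numeric brace range such as {01..31} or {9..12}.
_RANGE = re.compile(r"\{(\d+)\.\.(\d+)\}")


def expand_numeric_ranges(glob: str) -> list[str]:
    """Expand every closed {start..end} range in a glob into explicit globs."""
    match = _RANGE.search(glob)
    if match is None:
        return [glob]
    start, end = match.group(1), match.group(2)
    # Assumption: keep zero-padding when either bound is zero-padded.
    padded = start.startswith("0") or end.startswith("0")
    width = max(len(start), len(end)) if padded else 0
    step = 1 if int(start) <= int(end) else -1
    expanded = []
    for value in range(int(start), int(end) + step, step):
        candidate = (
            glob[: match.start()] + str(value).zfill(width) + glob[match.end():]
        )
        # Recurse so globs with several ranges are fully expanded.
        expanded.extend(expand_numeric_ranges(candidate))
    return expanded


for pattern in expand_numeric_ranges("**/2023/{09..12}/**"):
    print(pattern)
```

Listing the expanded patterns makes it easier to spot a missing month or day in a multi-glob date range.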