# Data Source Configuration Best Practices
## Introduction

Transform takes the physical data layer of your data warehouse (rows and columns) and creates a semantic data layer on top of it. The framework builds this semantic layer from YAML data source configuration files that describe how the data is stored in your warehouse. Below are common questions and best practices for optimizing the performance of the framework.
## When should I use a SQL Table vs SQL Query?

### SQL Table

The table name in your data warehouse.

```yaml
data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_table: schema.table
```
### SQL Query

A query that runs against your data warehouse.

```yaml
data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_query: Select * from schema.table
```
**Under the hood:** A SQL Table is always preferred to a SQL Query because it pushes more logic construction into the framework and enables more optimization. A SQL Query is materialized to a table through a feature called "priming," and that table is then treated as a SQL Table by the framework.
**Best Practices:**

- Whenever possible, use `sql_table`. This will be substantially more performant.
- Whenever possible, avoid joins in a `sql_query`. Instead, create a new data source; Transform's semantic layer will automate the joins for you (see the sketch after this list).
- If a `sql_query` is used, consider the most efficient mutability setting. By default, the primed table is considered invalid on an hourly basis, which makes an efficient mutability setting even more important.
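As a sketch of the join advice above: instead of one data source built on a `sql_query` join, declare each underlying table as its own data source and connect them with a shared identifier so the semantic layer can perform the join. The table names, columns, and `identifiers` fields below are illustrative, not taken from this guide.

```yaml
# Hypothetical: one data source per physical table, joined by the framework
# through a shared identifier instead of a hand-written join in sql_query.
data_source:
  name: transactions
  description: One row per transaction
  owners:
    - owner_email1@company.com
  sql_table: schema.transactions
  identifiers:
    - name: user_id # foreign key; the framework joins on this
      type: foreign
---
data_source:
  name: users
  description: One row per user
  owners:
    - owner_email1@company.com
  sql_table: schema.users
  identifiers:
    - name: user_id # primary key; join target for transactions
      type: primary
```

Declared separately, each table can also take the more performant `sql_table` path rather than forcing the joined result through priming.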
## What is an appropriate mutability setting for my data_source?

**Under the hood:** The mutability setting determines when the cache considers a dataset invalid and needs to look at the underlying dataset to pull fresh data.
### Full Mutation

By default, the framework will assume all data constructed off of a `full_mutation` data source is invalid every hour. If you know when datasets are typically updated, say by 12 PM UTC each day, you might consider setting an `update_cron` for 1 PM UTC.

```yaml
mutability:
  type: full_mutation
  type_params:
    # 6:00 Pacific Time or 9:00 Eastern Time is 13:00 UTC
    update_cron: 0 13 * * *
```
### Append Only

Assumes that new rows are only ever appended to the dataset along the specified timestamp column.

```yaml
mutability:
  type: append_only
  type_params:
    along: ds # or some other timestamp column defined as a dimension.
```
### DS Append Only

Assumes that datasets are batch exported at midnight UTC every day. Note that if you export datasets daily but not at a midnight UTC cut-off, you should use `append_only` instead.

```yaml
mutability:
  type: ds_append_only # Appends on the primary time column by default.
```
**Best Practices:**

- Wherever possible, use `append_only` and `ds_append_only`. They are significantly more performant than `full_mutation`.
- If you must use `full_mutation`, specify an `update_cron` time to schedule the priming (see the sketch after this list). If you do not, a priming query will run every hour, which can be expensive and may impact the experience your end users have consuming data through the APIs.
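Putting these two points together, here is a sketch of a `sql_query` data source that cannot avoid `full_mutation` but schedules its priming once a day instead of hourly. The data source name, query, and schedule are illustrative assumptions:

```yaml
data_source:
  name: revenue_query
  description: Primed from a query; re-primed once daily rather than hourly
  owners:
    - owner_email1@company.com
  sql_query: Select * from schema.revenue
  mutability:
    type: full_mutation
    type_params:
      # Run priming shortly after the upstream export typically lands.
      update_cron: 0 13 * * *
```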
## When do I use denormalized vs normalized tables?

**Best Practices:**

- Either works! But you will likely get more from the framework by using normalized tables as data sources. A normalized table will maximize your ability to produce metrics at a variety of granularities consistently (a sketch follows this list).
- A denormalized table is already aggregated and therefore cannot produce metrics at nearly as many granularities.
- On the other hand, a raw table may not be consistent with other datasets in your warehouse.
- Transform's semantic layer is capable of most denormalization tasks (aggregation, filtering, joining), and we're constantly seeking to fill any gaps.
## Example

For an example data source and metrics written against a data model using these best practices, see the Introduction to Transform's Metric Framework.