# Data Source Configuration Best Practices
## Introduction
Transform takes the physical data layer of your data warehouse (rows and columns) and creates a semantic data layer on top of it. The framework builds this semantic layer from YAML data source configuration files that describe how the data is stored in your warehouse. Below is a list of common questions and best practices for optimizing the performance of the framework.
## When should I use a SQL Table vs a SQL Query?
### SQL Table
The table name in your data warehouse.
```yaml
data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_table: schema.table
```

### SQL Query
A query that runs against your data warehouse.
```yaml
data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_query: SELECT * FROM schema.table
```

### Under the hood
A SQL Table is always preferred to a SQL Query because it pushes more logic construction into the framework and enables more optimization. A SQL Query is materialized to a table through a process called "priming," and that table is then treated as a SQL Table by the framework.
### Best Practices

- Whenever possible, use `sql_table`. This will be substantially more performant.
- Whenever possible, avoid joins in a `sql_query`. Instead, create a new data source; Transform's semantic layer will automate the joins for you (see the sketch after this list)!
- If a `sql_query` is used, consider the most efficient mutability setting. By default, the primed table is considered invalid on an hourly basis, which makes an efficient mutability setting even more important.
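For instance, rather than joining two tables inside a `sql_query`, each table can be its own data source linked by a shared identifier, and the framework joins them on demand. The sketch below is illustrative only: the table names, identifier fields, and `identifiers` syntax are assumptions, not taken from the configs above.

```yaml
# Hypothetical: two data sources linked by an identifier,
# letting the framework perform the join instead of a sql_query.
data_source:
  name: transactions
  description: One row per transaction
  owners:
    - owner_email1@company.com
  sql_table: schema.transactions
  identifiers:
    - name: customer_id # foreign key into the customers data source
      type: foreign
---
data_source:
  name: customers
  description: One row per customer
  owners:
    - owner_email1@company.com
  sql_table: schema.customers
  identifiers:
    - name: customer_id # primary key for this data source
      type: primary
```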
## What is an appropriate mutability setting for my data_source?
### Under the hood
The mutability setting determines when the cache considers a dataset invalid, i.e., when the framework needs to go back to the underlying table to pull fresh data.
### Full Mutation
By default, the framework will assume all data constructed off of a `full_mutation` data source is invalid every hour. If you know when datasets are typically updated, say by 12 PM (noon) UTC each day, you might consider setting an `update_cron` for 1 PM UTC.
```yaml
mutability:
  type: full_mutation
  type_params:
    # 6:00 Pacific Time or 9:00 Eastern Time is 13:00 UTC
    update_cron: 0 13 * * *
```

### Append Only
```yaml
mutability:
  type: append_only
  type_params:
    along: ds # or some other timestamp column defined as a dimension
```
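The column named in `along` must be defined as a dimension on the same data source. A minimal sketch of what that dimension definition might look like, assuming a daily `ds` date column (the `type_params` fields shown are illustrative):

```yaml
# Hypothetical dimension definition for the ds column referenced by along.
dimensions:
  - name: ds
    type: time
    type_params:
      is_primary: true        # ds is the primary time dimension
      time_granularity: day   # ds is a daily date column
```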
### DS Append Only

Assumes that datasets are batch exported at midnight UTC every day. Note that if you export datasets daily but not on a midnight UTC cutoff, you should use `append_only` instead.
```yaml
mutability:
  type: ds_append_only # Appends on the primary time column by default.
```

### Best Practices
- Wherever possible, use `append_only` and `ds_append_only`. They are significantly more performant than `full_mutation`.
- If you must use `full_mutation`, specify an `update_cron` time to schedule the priming (see the sketch after this list). If you do not, a priming query will run every hour, which can be expensive and may impact the experience your end users have consuming data through the APIs.
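Putting the pieces together, a `sql_query` data source primed once per day rather than hourly might look like the sketch below. How the fragments above compose (nesting `mutability` inside the data source) and the schedule itself are illustrative assumptions.

```yaml
# Hypothetical: a sql_query data source with priming scheduled daily.
data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_query: SELECT * FROM schema.table
  mutability:
    type: full_mutation
    type_params:
      # Prime shortly after the upstream table's daily update lands.
      update_cron: 0 13 * * *
```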
## When do I use denormalized vs normalized tables?
### Best Practices
- Either works! But you will likely get more from the framework by using normalized tables as data sources. A normalized table maximizes your ability to produce metrics at a variety of granularities consistently (see the sketch after this list).
- A denormalized table is already aggregated and therefore cannot produce metrics at nearly as many granularities.
- On the other hand, a raw table may not be consistent with other datasets in your warehouse.
- Transform's semantic layer is capable of most denormalization tasks (aggregation, filtering, joining), and we're constantly seeking to fill any gaps.
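For example, a normalized, row-level table lets the framework aggregate a single measure to whatever grain a consumer asks for (by day, by country, and so on), whereas a table pre-aggregated by day can only serve daily rollups. A minimal sketch with illustrative table, measure, and dimension names:

```yaml
# Hypothetical normalized data source: one row per transaction.
# The framework can aggregate transaction_total across any of the
# dimensions below; a table pre-aggregated by ds could not.
data_source:
  name: transactions
  sql_table: schema.transactions
  measures:
    - name: transaction_total
      expr: amount
      agg: sum
  dimensions:
    - name: ds
      type: time
      type_params:
        is_primary: true
        time_granularity: day
    - name: country
      type: categorical
```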
## Example
For an example data source and metrics written against a data model using these best practices, see the Introduction to Transform's Metric Framework.