
Data Source Configuration Best Practices

Introduction

Transform takes the physical data layer of your data warehouse (rows and columns) and creates a semantic data layer on top of it. Our framework builds this semantic layer from your config files, which describe how the data is stored in the data warehouse. This configuration is accomplished in YAML data source configuration files. Below is a list of common questions and best practices for optimizing the performance of the framework.

When should I use a SQL Table vs SQL Query?

SQL Table

The table name in your data warehouse.

data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_table: schema.table

SQL Query

A query that runs against your data warehouse.

data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_query: Select * from schema.table

Under the hood

A SQL Table is always preferred to a SQL Query because it pushes more of the logic construction into the framework and enables more optimization. A SQL Query is materialized to a table by a feature called "priming", and that table is then treated as a SQL Table by the framework.

Best Practices

  • Whenever possible, use sql_table. This will be substantially more performant.
  • Whenever possible, avoid joins in a sql_query. Instead, create a new data source for each table and let Transform's semantic layer automate the joins for you (see the sketch after this list).
  • If a sql_query is used, choose the most efficient mutability setting. By default, the primed table is considered invalid on an hourly basis, which makes an efficient mutability setting even more important.
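
As a sketch of the join point above: rather than one data source built on a joined sql_query, define one data source per table and relate them with identifiers so the framework can perform the join itself. The table names and identifier fields below are hypothetical and intended only to illustrate the pattern:

# Avoid a joined query such as:
#   sql_query: Select * from schema.orders o join schema.customers c on o.customer_id = c.id
# Instead, define two data sources (typically one per config file)
# and let the semantic layer join them through a shared identifier.

data_source:
  name: orders
  description: One row per order
  owners:
    - owner_email1@company.com
  sql_table: schema.orders
  identifiers:
    - name: customer # foreign key into the customers data source
      type: foreign
      expr: customer_id

data_source:
  name: customers
  description: One row per customer
  owners:
    - owner_email1@company.com
  sql_table: schema.customers
  identifiers:
    - name: customer # primary key; matches the identifier name above
      type: primary
      expr: id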

What is an appropriate mutability setting for my data_source?

Under the hood

The mutability setting determines when the cache considers a dataset invalid, i.e., when the framework needs to look at the underlying dataset to pull fresh data.
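
For orientation, the mutability block in the examples that follow nests inside a data source configuration. A minimal sketch, reusing the illustrative names from earlier:

data_source:
  name: data_source_name
  description: My description
  owners:
    - owner_email1@company.com
  sql_table: schema.table
  mutability:
    type: full_mutation
    type_params:
      update_cron: 0 13 * * *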

Full Mutation

By default, the framework assumes all data constructed off of a full_mutation data source is invalid every hour. If you know when datasets are typically updated, say by 12 PM UTC each day, you might set an update_cron for 1 PM UTC so priming runs once, shortly after the data lands.

mutability:
  type: full_mutation
  type_params:
    # 6:00 Pacific Time or 9:00 Eastern Time is 13:00 UTC
    update_cron: 0 13 * * *

Append Only

Assumes that rows are only ever appended to the dataset, tracked along a timestamp column that you specify.

mutability:
  type: append_only
  type_params:
    along: ds # or some other timestamp column defined as a dimension

DS Append Only

Assumes that datasets are batch-exported at midnight UTC every day. Note that if you export datasets daily but not on a midnight UTC cutoff, you should use append_only instead.

mutability:
  type: ds_append_only # Appends on the primary time column by default.

Best Practices

  • Wherever possible, use append_only or ds_append_only. They are significantly more performant than full_mutation.
  • If you must use full_mutation, specify an update_cron time to schedule the priming. If you do not, a priming query will run every hour, which can be expensive and may impact the experience of end users consuming data through the APIs.

When do I use denormalized vs normalized tables?

Best Practices

  • Either works! But you will likely get more from the framework by using normalized tables as data sources. A normalized table maximizes your ability to produce metrics at a variety of granularities consistently (see the sketch after this list).
  • A denormalized table is already aggregated and therefore cannot produce metrics at nearly as many granularities.
  • On the other hand, a raw table may not be consistent with other datasets in your warehouse.
  • Transform's semantic layer is capable of most denormalization tasks (aggregation, filtering, joining), and we're constantly seeking to fill any gaps.
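
As a sketch of the normalized-table point above, consider a hypothetical row-level transactions table; the measure and dimension names below are illustrative:

data_source:
  name: transactions
  description: One row per transaction (normalized, not pre-aggregated)
  owners:
    - owner_email1@company.com
  sql_table: schema.transactions
  measures:
    - name: transaction_amount
      expr: amount
      agg: sum
  dimensions:
    - name: ds
      type: time
      type_params:
        is_primary: true
        time_granularity: day
    - name: country
      type: categorical

From this one source the framework can aggregate transaction_amount to daily, weekly, or monthly grains and slice it by country, whereas a table pre-aggregated to, say, daily country totals could never go finer than that.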

Example

For an example data source and metrics written against a data model using these best practices, see the Introduction to Transform's Metric Framework.