Materializations

What are materializations in Transform?#

Materializations enable you to produce denormalized tables in a data warehouse. These are pre-computed and never evicted from the cache, and they can be surfaced in downstream tools directly from the data warehouse.

You can define a set of metrics and corresponding dimensions that are most commonly used and need to be accessed frequently by their organizations, which Transform can compute ahead of time at a scheduled cadence and write them to the data warehouse.

The primary benefit of materializations is that you don't need to manually "warm-up" the cache for the most common metrics and dimensions instead, this will happen at a cadence automatically. Also, you can hook these materialized tables into third-party tools from the data warehouse.

Materialization Components#

Configuration & Definition: The Materialization definition occurs inside our framework, which is similar to how you currently define metrics.
Calling a Materialization: You call a Materialization through MQL via mql materialize
Writing the Materialized table: Materialize will write data to your data warehouse or in a specified location. You can optionally write it to a fast cache.
Scheduling a Materialization: To get the most value out of materializations, we recommend you schedule the materialize on a given cadence to ensure your data is ready all the time.

Requirements#

The metrics you define in the materialization configuration must already be defined in Transform.
There must be a primary time dimension that is consistently named across your data sources. This time dimension will be used to set the date parameters of the materialization.

Examples#

Defining a Materialization#

Here is an example of a materialization definition that reflects data from a booking service company.

materialization:  name: user_bookings_summary # name your materialization. this will be the table name written in your data warehouse.  description: # add an optional description.  metrics: # list all the metrics you want to materialize in your materialization. all metrics must be defined in your metrics.yaml.    - bookings    - booking_value    - customer_service_tickets    - guest_host_messages
  dimensions: # list all corresponding dimensions you want to include for these metrics. these must all be defined in your data sources.    - ds    - guest__signup_channel
# Optionally, specify the name of the schema and the table for the materialization.# MQL will try to create / drop this table as needed. It's recommended that the# permissions of the user associated with the MQL server be configured to have# limited permissions.  destination_table: transform_prod.my_table
# Optional destinations:# Materializations always write to the datawarehouse.# Optionally provide additional destinations for the materialization and rollup  destinations:    - location: DW      format: WIDE    - location: FAST_CACHE      format: WIDE      rollups:        - [ds]
 #   format: WIDE # WIDE is the only supported option currently #   dimension_sets: # rollups allow you to create separate tables that group by a subset of the dimensions in your materialization #   [ds] # add a dimension you want to a rollup by. this should be a subset of the dimensions you list in your materialization. These rollups will be stored in separate tables than the core materializations

Materializations have a few inputs

Name: The name of the materialization
Owners: The owners of the materialization
Description: An optional description
Metrics: The important metrics that you want to be pre-computed in your materialization. These must already be defined in Transform to be referenced by the materialization.
Dimensions: The corresponding dimensions that are important to you and that associate with your metrics.
Destinations: By default, we write data to your data warehouse. You can optionally write to more destinations, which includes destinations of your rollup tables. Options for destinations are DW and FAST_CACHE.
- Rollups: Rollups are an additional argument to each destination which allow you to create separate tables for a subset of the dimensions for all of the metrics listed in the materialization.

Calling a Materialization#

Once you define a materialization, you can invoke it by using MQL. While not required, it is best practice to pass in start-time and end-time arguments. If you do not, the entire table will be materialized. Depending on the size of your datasets, this materialization could be expensive to build. The materialization command is idempotent and will incrementally add any gaps in the time range specified.

mql materialize --name my_materialization --start-time <date> --end-time <date>

We recommend you use an existing scheduling tool to schedule bash commands from the CLI or python interface. You can also use the materialize command to run a manual backfill, or anytime you want to re-run a materialization, even if you already have the command scheduled.

Command Options

Options:  --materialization-name TEXT  Name of materialization to materialize                               [required]
  --start-time TEXT            iso8601 timestamp to materialize from  --end-time TEXT              iso8601 timestamp to materialize to  -t, --timeout INTEGER        Sets the timeout to wait for query completion. Not applicable in --detach mode.
  --detach BOOLEAN             Returns the created query ID to allow for                               asynchronous querying.
  --help                       Show this message and exit.

Materialization Location#

When materialize is run, a table is written to your data warehouse.

The table name is the same name as the materialization name.
Unless there is a warehouse destination specified, the materialization will be written to the schema specified in the project configs with the name of the materialization.
To write to an arbitary schema, pass in the schema and table name as a location.
There is a write lock when writing data. You can only materialize a single time range at a time for each materialization.

By default, materializations are stored in the cache schema, but you can also write them to an arbitrary schema.

We recommend you specify the environment variable which indicates which schema we should write the data in, and it should be a schema that Transform is aware of. You can use the TFD_MQL_MATERIALIZATIONS_SCHEMA environment variable to set this. For example, if you're launching the MQL server container with the Docker run command and the -env-file flag, the environment variable file should contain a line similar to:

TFD_MQL_MATERIALIZATIONS_SCHEMA = tfd_materializations;

You can write to a different schema, but if you don't write to a Transform-managed schema, Transform does not have a way to manage or update this table. It is preferred that you write to Transform managed schema. In order to materialize to a different schema, the schema must be provided as the output_table parameter in the materialize call.

Scheduling a Materialization#

We're currently working on an airflow-operator for companies running airflow as their scheduler. In the meantime, we recommend you use an existing scheduling tool.

Invalidation#

If your underlying metric configuration changes and you re-materialize the dataset, we will invalidate any previous information stored that associates with the old configuration.

When a configuration change occurs, all materialized data created from the old definition will be ignored, until you manually run materialize or the scheduled jobs kick in. When the materialize command is run and detects out-of-date data in the table, it will drop those rows or the entire table as necessary.

If your underlying data changes (e.g., updates to the data warehouse), Transform will not be aware of these changes, so you should re-materialize your data for the affected time periods by running the materialize command.

Rollups#

You can optionally create additional rollup tables with data in your materialization using the rollup configuration.

The value of rollups is for creating and quickly accessing a ready dataset from the materialization that contains a subset of the dimensions defined in the materialization config. For example, if you have two dimensions in your materialization, country, and date, but you only want to group by date in a certain table, you can store that combination in a rollup table. Rollups stores data in tables that are prefixed with the same name as your materialization, with an additional hash or identifier to uniquely identify the table. The rollup tables will be used to quickly respond to queries that are an exact match of the dimensions contained in the roll-up.

Backfills#

Backfills should be done manually by running materialize with the backfill dates.

Drop a Materialization#

If you need to drop a materialization, you can use the mql drop-materialization --materialization-name command. It is best practice to drop materializations you are no longer using. Your metrics will try and pull from the materialization first, so if a materialization is outdated it could lead to incorrect metric values.