# Data Sources

## Data Source YAMLs

For each data source, create a configuration YAML in your Transform repository. A data source is usually a table within your data warehouse, but you can create as many data sources from each table as you want.

See the example below for a full config, followed by a more detailed description of each field.
Example:

```yaml
data_source: # This section includes metadata on the data source and the source definition.
  name: transactions # Define the name of the source. This name can be updated in the future.
  # Provide a detailed description of the data source here and include any important details.
  # Other configuration contributors will primarily use this description.
  # It is only surfaced in the UI under the lineage view on the metric page.
  description: |
    This table captures every transaction starting July 02, 2014. Each row represents one
    transaction. There will be a new row for any cancellations or alterations. There is a
    transaction, order, and user id for every transaction. There is only one transaction id
    per transaction, but there can be many rows per order id and user id. The `ds` or date
    is reflected in UTC.
  # List the emails of the owners of this data source. This list is currently not used for
  # notifications, but it will be in the future.
  owners:
    - support@transformdata.io
  sql_table: demo_schema.transactions # The table from which the data source is constructed.
                                      # You can also pass in the database name first.
  # sql_query: You can also pass in a SQL query here. This allows you to aggregate or
  # filter the data before passing it into Transform.

  # Define the primary, foreign, or unique key columns in your source that can be used to
  # join to other data sources. Each join key should have a type of either primary, foreign,
  # or unique. Additionally, keys can be referenced directly by column name from the data
  # source, or they can be altered using an expression.
  identifiers:
    - name: id_transaction
      type: primary
    - name: id_order
      type: foreign
    - name: id_user
      type: foreign
      expr: SUBSTRING(id_order from 2)

  # Define the fields from your data source to be aggregated as inputs to metrics (e.g. in
  # metrics.yaml). Each measure has an aggregation and optional description. A measure can
  # simply reference a column or may be calculated using a SQL expression. Measures have a
  # default expression of their name. The available aggregations are sum, max, min,
  # count_distinct, and sum_boolean.
  measures:
    - name: transaction_amount_usd
      description: The total USD value of the transaction.
      agg: sum
    - name: transactions
      description: The total number of transactions.
      expr: "1"
      agg: sum
    - name: quick_buy_amount_usd
      description: The total USD value of the transactions that were purchased using the "quick buy" button.
      expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
      agg: sum
    - name: quick_buy_transactions
      description: The total transactions bought as quick buy.
      expr: quick_buy_flag
      agg: sum_boolean

  # Define the dimensions from your source. Dimensions are qualitative values such as names,
  # dates, or geographical data. Dimensions provide context to measures and are associated
  # with metrics created from those measures to provide "metric by dimension" data slicing.
  # Dimensions can either directly reference a column or may be calculated using a SQL
  # expression.
  dimensions:
    - name: ds
      type: time
      type_params:
        is_primary: # true or false
        time_format: YYYY-MM-DD # only format currently supported
        time_granularity: day # only granularity currently supported
    - name: quick_buy_transaction
      type: categorical
      expr: |
        CASE
          WHEN transaction_type_name = 'quick_buy' THEN 'quick_buy'
          ELSE 'not_quick_buy'
        END

  # Define the appropriate mutability for this data source. Mutability refers to how the
  # data underlying this configuration (a SQL table or the results of a SQL query) changes.
  # See the Mutability Configuration section below for the available options.
  mutability:
    type: append_only
    type_params:
      along: ts
```
## Data Source

This section includes metadata on the data source and the source definition.

### Name

Define the name of the source. This name can be updated in the future.

### Description

Provide a detailed description of the data source here and include any important details. Other configuration contributors will primarily use this description. It is only surfaced in the UI under the lineage view on the metric page.
Example:

```yaml
description: |
  This table captures every transaction starting July 02, 2014. Each row represents one
  transaction. There will be a new row for any cancellations or alterations. There is a
  transaction, order, and user id for every transaction. There is only one transaction id
  per transaction, but there can be many rows per order id and user id. The `ds` or date
  is reflected in UTC.
```
### Owners

List the emails of the owners of this data source. This list is currently not used for notifications, but it will be in the future.

Example:

```yaml
owners:
  - support@transformdata.io
```
### sql_table or sql_query

There are two types of data sources: `sql_table` and `sql_query`.

Tip: A table is preferred over a SQL query, as it allows us to perform more efficient queries against source tables. Where possible, use a table and move logic into `expr` fields instead.

If using `sql_table`, simply add the schema and table in the warehouse.
Example:

```yaml
sql_table: demo_schema.transactions
```
If you want to reference the database in data warehouses such as Snowflake, you can do so directly in the `sql_table` or `sql_query`:

```yaml
sql_table: demo_db.demo_schema.transactions
```
If using a full query, use your local data warehouse's query language.

Example:

```yaml
sql_query: |
  SELECT
    t.id AS id_transaction
    , t.id_order
    , t.id_user
    , t.transaction_amount
    , t.transaction_amount_usd
    , tt.transaction_type_name
    , t.ds
  FROM demo_schema.transactions t
  JOIN demo_schema.transaction_type tt ON tt.id = t.id_transaction_type
```
## Identifiers

Define the primary, foreign, or unique key columns in your source that can be used to join to other data sources. Each join key should have a type of either primary, foreign, or unique. Additionally, keys can be referenced directly by column name from the data source, or they can be altered using an expression.

Identifier types:

- **Primary**: A primary key has one and only one record for each row in the table, and it includes every record in the data warehouse.
- **Unique**: A unique key has one and only one record for each row in the table, but it may cover only a subset of the records in the data warehouse. Unique keys can also contain nulls.
- **Foreign**: A foreign key can have zero, one, or many instances of the same record. Foreign keys can also contain nulls.
Example:

```yaml
identifiers:
  - name: transaction
    type: primary
    expr: id_transaction
  - name: order
    type: foreign
    expr: id_order
  - name: user
    type: foreign
    expr: SUBSTRING(id_order FROM 2)
```
**Note about dimensions and identifiers**

Dimensions are properties of primary or unique identifiers: dimensions cannot be aggregated, so they are considered properties of the primary or unique identifiers of the table. For example, a dimension such as `is_bulk_transaction` would be considered an attribute of the transaction identifier. That dimension is then usable in the Metrics Framework by any metric that can be aggregated to the transaction granularity.
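A minimal sketch of this idea, using a hypothetical `is_bulk_transaction` column and threshold that are not part of the example configs in this document:

```yaml
data_source:
  name: transactions
  identifiers:
    - name: transaction
      type: primary
      expr: id_transaction
  dimensions:
    # Because `transaction` is the primary identifier, this dimension is treated
    # as a property of each transaction and is available to any metric that can
    # be aggregated to the transaction granularity.
    - name: is_bulk_transaction # hypothetical dimension for illustration
      type: categorical
      expr: CASE WHEN transaction_amount_usd >= 1000 THEN 'bulk' ELSE 'not_bulk' END
```

The `1000` cutoff is illustrative only; any expression over the source's columns would work the same way.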
### Composite Keys

```yaml
data_source:
  name: users
  identifiers:
    - name: message
      expr: message_id
      type: foreign
    - name: user_message
      type: primary
      identifiers:
        - ref: message
        - name: user_id
```
To configure a composite key:

- Name the composite key using the same `name` field as when naming any other identifier. In this example's case, it's `user_message`.
- Define the fields that make up your key ahead of defining the key itself.
- Specify the composite primary key via the identifier type `primary`.
- The fields that make up the key should be listed under `identifiers`. A `ref` may be specified instead to reference another identifier, in which case the `name` and `expr` are inherited from the referenced identifier.
Note: If keys that make up an identifier cannot stand alone or be joined to other data sources, as in this example, we recommend that you define those fields directly in the composite statement, as we did with `user_id`. Notice in this example that we defined only one field before the composite key definition (`message_id`) and defined the other (`user_id`) directly in the composite key.

You can still choose to explicitly define all identifiers that make up your key ahead of defining the composite primary key itself; our recommended approach is to consider whether the field can stand alone or whether it is only usable as part of a composite key.
Note: Transform will never implicitly create a composite key. If a data source happens to have `team_id` and `user_id`, we won't assume that data source can be joined with `user_team` unless explicitly defined.
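A hedged sketch of what that explicit definition might look like, reusing the `team_id`/`user_id` example above (the `team` and `user_team` identifier names here are hypothetical):

```yaml
identifiers:
  - name: team
    expr: team_id
    type: foreign
  # The composite key must be spelled out explicitly; Transform will not infer
  # it from the mere presence of team_id and user_id columns.
  - name: user_team
    type: primary
    identifiers:
      - ref: team
      - name: user_id
```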
## Measures

Define the fields from your data source to be aggregated as inputs to metrics (e.g. in `metrics.yaml`). Each measure has an aggregation and an optional description. A measure can simply reference a column or may be calculated using a SQL expression. Measures have a default expression of their name. The available aggregations are `sum`, `max`, `min`, `count_distinct`, and `sum_boolean`.
Example:

```yaml
measures:
  - name: transaction_amount_usd
    description: The total USD value of the transaction.
    agg: sum
  - name: transactions
    description: The total number of transactions.
    expr: "1"
    agg: sum
  - name: quick_buy_amount_usd
    description: The total USD value of the transactions that were purchased using the "quick buy" button.
    expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
    agg: sum
  - name: quick_buy_transactions
    description: The total transactions bought as quick buy.
    expr: quick_buy_flag
    agg: sum_boolean
```
## Dimensions

Define the dimensions from your source. Transform currently supports dimensions of type time and categorical. Dimensions are qualitative values such as names, dates, or geographical data. They provide context to measures and are associated with metrics created from those measures to provide "metric by dimension" data slicing. Dimensions can either directly reference a column or may be calculated using a SQL expression.

Example:

```yaml
dimensions:
  - name: ds
    type: time
    type_params:
      is_primary: # true or false
      time_format: YYYY-MM-DD # only format currently supported
      time_granularity: day # only granularity currently supported
  - name: quick_buy_transaction
    type: categorical
    expr: |
      CASE
        WHEN transaction_type_name = 'quick_buy' THEN 'quick_buy'
        ELSE 'not_quick_buy'
      END
```
### Time dimensions

It's a requirement to add the type parameters to all time dimensions. In addition, for data sources with a measure, a primary time dimension is required.

- The primary time dimension is used for graphing the x-axis. For materializations, it's used as the common time dimension for all metrics.
- Granularity currently supports `day`; in the future, we will support additional granularities.
- Format currently supports `YYYY-MM-DD`; in the future, we will support additional formats.

NOTE: The name of your primary time dimension must be the same across all your data sources.
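For example, if one data source names its primary time dimension `ds`, every other data source must use `ds` as well. A sketch of the stanza each source would repeat:

```yaml
dimensions:
  - name: ds # must match the primary time dimension name in every other data source
    type: time
    type_params:
      is_primary: true
      time_format: YYYY-MM-DD # only format currently supported
      time_granularity: day # only granularity currently supported
```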
For data sources that use a SQL query, Transform's framework performs an optimization to prevent the query from executing more times than necessary. For example, if multiple metrics need data from a single data source, the query is not re-run once per metric. The process by which this table is produced is called priming, and it is especially beneficial when queries do full table scans and are expensive to run. In certain cases, some configuration in the data source is needed to tell Transform's framework how to prime in order to get the best performance.
## Mutability Configuration

Caution: An incorrect mutability setting will lead to inaccurate metric values. To avoid this, be sure to understand the various options and the update process for the underlying dataset.

Mutability refers to how the underlying data in this configuration (a SQL table or the results of a SQL query) changes.

The options are as follows:

- `immutable`: the data source rarely changes, and the framework only runs the query once.
- `full_mutation`: the data changes frequently, and we will update it on a periodic basis.
- `append_only`: the data source is updated with only new rows appended to it, and there's a mutability column that indicates when there are new rows.
- `ds_append_only`: similar to `append_only`, except that the mutability column that indicates there are new rows is `ds` by default.
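A sketch of two of these options, using only the fields shown in the earlier example (the example above used `append_only` with `along: ts`); whether `ds_append_only` accepts further parameters is not covered here:

```yaml
# immutable: the framework runs the query once and does not refresh it.
mutability:
  type: immutable
---
# ds_append_only: only new rows are appended, indicated by the default `ds`
# column, so no mutability column needs to be named.
mutability:
  type: ds_append_only
```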
For more information, please visit Mutability Settings.