# Data Sources

## Data Source YAMLs

For each data source, create a configuration YAML in your Transform repository. A data source is usually a table within your data warehouse, but you can create as many data sources from each table as you want.

See the example below for a full config, followed by a more detailed description of each field.
Example:

```yaml
data_source: # This section includes metadata on the data source and the source definition.
  name: transactions # Define the name of the source. This name can be updated in the future.
  # Provide a detailed description of the data source here and include any important details.
  # Other configuration contributors will primarily use this description.
  # It is only surfaced in the UI under the lineage view on the metric page.
  description: |
    This table captures every transaction starting July 02, 2014. Each row represents one
    transaction. There will be a new row for any cancellations or alterations. There is a
    transaction, order, and user id for every transaction. There is only one transaction id
    per transaction, but there can be many rows per order id and user id. The `ds` or date
    is reflected in UTC.
  # List the emails of the owners of this data source. This list is currently not used for
  # notifications, but it will be in the future.
  owners:
    - support@transformdata.io
  sql_table: demo_schema.transactions # The table from which the data source is constructed.
                                      # You can also pass in the database name first.
  # sql_query: You can also pass in a SQL query here. This allows you to aggregate or
  # filter the data before passing it into Transform.

  # Define the primary, foreign, or unique key columns in your source that can be used to
  # join to other data sources. Each join key should have a type of either primary, foreign,
  # or unique. Additionally, keys can be referenced directly by column name from the data
  # source, or they can be altered using an expression.
  identifiers:
    - name: id_transaction
      type: primary
    - name: id_order
      type: foreign
    - name: id_user
      type: foreign
      expr: SUBSTRING(id_order from 2)

  # Define the fields from your data source to be aggregated as inputs to metrics (e.g. in
  # metrics.yaml). Each measure has an aggregation and optional description. A measure can
  # simply reference a column or may be calculated using a SQL expression. Measures have a
  # default expression of their name. The available aggregations are sum, max, min,
  # count_distinct, and sum_boolean.
  measures:
    - name: transaction_amount_usd
      description: The total USD value of the transaction.
      agg: sum
    - name: transactions
      description: The total number of transactions.
      expr: "1"
      agg: sum
    - name: quick_buy_amount_usd
      description: The total USD value of the transactions that were purchased using the "quick buy" button.
      expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
      agg: sum
    - name: quick_buy_transactions
      description: The total transactions bought as quick buy.
      expr: quick_buy_flag
      agg: sum_boolean

  # Define the dimensions from your source. Dimensions are qualitative values such as names,
  # dates, or geographical data. Dimensions provide context to measures and are associated
  # with metrics created from those measures to provide "metric by dimension" data slicing.
  # Dimensions can either directly reference a column or may be calculated using a SQL
  # expression.
  dimensions:
    - name: ds
      type: time
      type_params:
        is_primary: # true or false
        time_format: YYYY-MM-DD # only format currently supported
        time_granularity: day # only granularity currently supported
    - name: quick_buy_transaction
      type: categorical
      expr: |
        CASE
          WHEN transaction_type_name = 'quick_buy' THEN 'quick_buy'
          ELSE 'not_quick_buy'
        END

  # Define the appropriate mutability for this data source. Mutability refers to how the
  # data underlying this configuration (a SQL table or the results of a SQL query) changes.
  # See the Mutability Configuration section below for the available options.
  mutability:
    type: append_only
    type_params:
      along: ts
```
## Data Source

This section includes metadata on the data source and the source definition.

### Name

Define the name of the source. This name can be updated in the future.

### Description

Provide a detailed description of the data source here and include any important details. Other configuration contributors will primarily use this description. It is only surfaced in the UI under the lineage view on the metric page.
Example:

```yaml
description: |
  This table captures every transaction starting July 02, 2014. Each row represents one
  transaction. There will be a new row for any cancellations or alterations. There is a
  transaction, order, and user id for every transaction. There is only one transaction id
  per transaction, but there can be many rows per order id and user id. The `ds` or date
  is reflected in UTC.
```
### Owners

List the emails of the owners of this data source. This list is currently not used for notifications, but it will be in the future.

Example:

```yaml
owners:
  - support@transformdata.io
```
### sql_table or sql_query

There are two types of data sources: `sql_table` and `sql_query`.

Tip: A table is preferred over a SQL query, as it allows us to perform more efficient queries against source tables. Where possible, use a table and move logic into `expr` fields instead.

If using `sql_table`, simply add the schema and table in the warehouse.
Example:

```yaml
sql_table: demo_schema.transactions
```
If you want to reference the database in data warehouses such as Snowflake, you can do so directly in the `sql_table` or `sql_query`:

```yaml
sql_table: demo_db.demo_schema.transactions
```
If using a full query, use your local data warehouse's query language.

Example:

```yaml
sql_query: |
  SELECT
    t.id AS id_transaction
    , t.id_order
    , t.id_user
    , t.transaction_amount
    , t.transaction_amount_usd
    , tt.transaction_type_name
    , t.ds
  FROM demo_schema.transactions t
  JOIN demo_schema.transaction_type tt ON tt.id = t.id_transaction_type
```
## Identifiers

Define the primary, foreign, or unique key columns in your source that can be used to join to other data sources. Each join key should have a type of either primary, foreign, or unique. Additionally, keys can be referenced directly by column name from the data source, or they can be altered using an expression.

Identifier types:

- **Primary**: A primary key has one and only one record for each row in the table, and it includes every record in the data warehouse.
- **Unique**: A unique key has one and only one record for each row in the table, but it may cover only a subset of the records in the data warehouse. Unique keys can also contain nulls.
- **Foreign**: A foreign key can have zero, one, or many instances of the same record. Foreign keys can also contain nulls.
Example:

```yaml
identifiers:
  - name: transaction
    type: primary
    expr: id_transaction
  - name: order
    type: foreign
    expr: id_order
  - name: user
    type: foreign
    expr: SUBSTRING(id_order FROM 2)
```
**Note about dimensions and identifiers**

Dimensions are properties of primary or unique identifiers: dimensions cannot be aggregated, so they are considered properties of the primary or unique identifiers of the table. For example, a dimension such as `is_bulk_transaction` would be considered an attribute of the transaction identifier. That dimension is then usable in the Metrics Framework by any metric that can be aggregated to the transaction granularity.
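A minimal sketch of this idea, using a hypothetical `is_bulk_transaction` column and threshold that are not part of the example configs in this document:

```yaml
data_source:
  name: transactions
  identifiers:
    - name: transaction
      type: primary
      expr: id_transaction
  dimensions:
    # Because `transaction` is the primary identifier, this dimension is treated
    # as a property of each transaction and is available to any metric that can
    # be aggregated to the transaction granularity.
    - name: is_bulk_transaction # hypothetical dimension for illustration
      type: categorical
      expr: CASE WHEN transaction_amount_usd >= 1000 THEN 'bulk' ELSE 'not_bulk' END
```

The `1000` cutoff is illustrative only; any expression over the source's columns would work the same way.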
### Composite Keys

```yaml
data_source:
  name: users
  identifiers:
    - name: message
      expr: message_id
      type: foreign
    - name: user_message
      type: primary
      identifiers:
        - ref: message
        - name: user_id
```
To configure a composite key:

- Name the composite key using the same `name` field as when naming any other identifier. In this example's case, it's `user_message`.
- Define the fields that make up your key ahead of defining the key itself.
- Specify the composite primary key via the identifier type `primary`.
- The fields that make up the key should be listed under `identifiers`. A `ref` may be specified instead to reference another identifier, in which case the `name` and `expr` are inherited from the referenced identifier.
Note: If keys that make up an identifier cannot stand alone or be joined to other data sources, as in this example, we recommend that you define those fields directly in the composite statement, as we did with `user_id`. Notice in this example that we defined only one field before the composite key definition (`message_id`) and defined the other (`user_id`) directly in the composite key.

You can still choose to explicitly define all identifiers that make up your key ahead of defining the composite primary key itself; our recommended approach is to consider whether the field can stand alone or whether it is only usable as part of a composite key.
Note: Transform will never implicitly create a composite key. If a data source happens to have `team_id` and `user_id`, we won't assume that data source can be joined with `user_team` unless explicitly defined.
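A hedged sketch of what that explicit definition might look like, reusing the `team_id`/`user_id` example above (the `team` and `user_team` identifier names here are hypothetical):

```yaml
identifiers:
  - name: team
    expr: team_id
    type: foreign
  # The composite key must be spelled out explicitly; Transform will not infer
  # it from the mere presence of team_id and user_id columns.
  - name: user_team
    type: primary
    identifiers:
      - ref: team
      - name: user_id
```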
## Measures

Define the fields from your data source to be aggregated as inputs to metrics (e.g. in `metrics.yaml`). Each measure has an aggregation and an optional description. A measure can simply reference a column or may be calculated using a SQL expression. Measures have a default expression of their name. The available aggregations are `sum`, `max`, `min`, `count_distinct`, and `sum_boolean`.
Example:

```yaml
measures:
  - name: transaction_amount_usd
    description: The total USD value of the transaction.
    agg: sum
  - name: transactions
    description: The total number of transactions.
    expr: "1"
    agg: sum
  - name: quick_buy_amount_usd
    description: The total USD value of the transactions that were purchased using the "quick buy" button.
    expr: CASE WHEN transaction_type_name = 'quick buy' THEN transaction_amount_usd ELSE 0 END
    agg: sum
  - name: quick_buy_transactions
    description: The total transactions bought as quick buy.
    expr: quick_buy_flag
    agg: sum_boolean
```
## Dimensions

Define the dimensions from your source. Transform currently supports dimensions of type time and categorical. Dimensions are qualitative values such as names, dates, or geographical data. They provide context to measures and are associated with metrics created from those measures to provide "metric by dimension" data slicing. Dimensions can either directly reference a column or may be calculated using a SQL expression.

Example:

```yaml
dimensions:
  - name: ds
    type: time
    type_params:
      is_primary: # true or false
      time_format: YYYY-MM-DD # only format currently supported
      time_granularity: day # only granularity currently supported
  - name: quick_buy_transaction
    type: categorical
    expr: |
      CASE
        WHEN transaction_type_name = 'quick_buy' THEN 'quick_buy'
        ELSE 'not_quick_buy'
      END
```
### Time dimensions

It's a requirement to add the type parameters to all time dimensions. In addition, for data sources with a measure, a primary time dimension is required.

- The primary time dimension is used for graphing the x-axis. For materializations, it's used as the common time dimension for all metrics.
- Granularity currently supports `day`; in the future, we will support additional granularities.
- Format currently supports `YYYY-MM-DD`; in the future, we will support additional formats.

NOTE: The name of your primary time dimension must be the same across all your data sources.
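For example, if one data source names its primary time dimension `ds`, every other data source must use `ds` as well. A sketch of the stanza each source would repeat:

```yaml
dimensions:
  - name: ds # must match the primary time dimension name in every other data source
    type: time
    type_params:
      is_primary: true
      time_format: YYYY-MM-DD # only format currently supported
      time_granularity: day # only granularity currently supported
```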
For data sources that use a SQL query, Transform's framework performs an optimization to prevent the query from executing more times than necessary. For example, if multiple metrics need data from a single data source, the query is not re-run once per metric. The process by which this table is produced is called priming, and it is especially beneficial when queries do full table scans and are expensive to run. In certain cases, some configuration in the data source is needed to tell Transform's framework how to prime in order to get the best performance.
## Mutability Configuration

Caution: An incorrect mutability setting will lead to inaccurate metric values. To avoid this, be sure to understand the various options and the update process for the underlying dataset.

Mutability refers to how the underlying data in this configuration (a SQL table or the results of a SQL query) changes.

The options are as follows:

- `immutable`: the data source rarely changes, and the framework only runs the query once.
- `full_mutation`: the data changes frequently, and we will update it on a periodic basis.
- `append_only`: the data source is updated with only new rows appended to it, and there's a mutability column that indicates when there are new rows.
- `ds_append_only`: similar to `append_only`, except that the mutability column that indicates there are new rows is `ds` by default.
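A sketch of two of these options, using only the fields shown in the earlier example (the example above used `append_only` with `along: ts`); whether `ds_append_only` accepts further parameters is not covered here:

```yaml
# immutable: the framework runs the query once and does not refresh it.
mutability:
  type: immutable
---
# ds_append_only: only new rows are appended, indicated by the default `ds`
# column, so no mutability column needs to be named.
mutability:
  type: ds_append_only
```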
For more information, please visit Mutability Settings.