Skip to main content

Mutability

caution

An incorrect mutability setting will lead to inaccurate metric values. To avoid this, be sure to understand the various options and the update process for the underlying dataset.

Mutability defines how your data source changes through an update process. Mutability settings create a contract between the update process for your datasets and the Metrics Framework. Mutability settings allow the Framework to create the most efficient queries possible while returning accurate datasets.

The options for mutability are below:

Immutable#

With the immutable mutability configuration setting, the Framework only creates the dataset once and uses the resulting data for all queries going forward. This applies to both cached datasets downstream as well as primed tables (e.g., in the case of data sources produced through a sql_query).

mutability:  type: immutable

Full Mutation#

With the full_mutation configuration setting, the Framework updates the data on periodic basis. By default, it's configured to refresh the data hourly. Optionally, the time of day and the freqency in which Transform picks up the updates to your data can be configured in the data source using cron. If you are specifying a cron string, we recommend you set the timing of your update_cron setting to be scheduled for whenever the upstream data processes are definitively complete, as this will ensure Transform is picking up the updates to your data in a timely manner. You can find additional details on cron syntax at the bottom of this document.

mutability:  type: full_mutation  type_params:    # 6:00 Pacific is 13:00 UTC    update_cron: 0 13 * * *

Append-Only#

With the append-only mutability configuration setting, an additional column is required in the configuration that tells the framework which column can be used to determine when new data exists. The query can be parameterized on this column to help produce faster queries. For example:

sql_query: |  SELECT    bookings  FROM    fct_bookings  WHERE    ds BETWEEN '{{ ds.min }}' AND '{{ ds.max}}'

Initially, the query is run as supplied to produce a primed table. The min and max values for the append-only column can be configured in the data source as well. The min and max values tell the framework what are the possible values for the append-only column. Generally, the min and max values should be left at the default and should cover all values that are possible to exist in the data set for all time. e.g. min=1000-01-01 and max=9999-12-31. It's unlikely that these min / max values in the mutability configuration need to be changed. On subsequent priming (based on the schedule), the framework knows what's the largest value of ds that was seen, and runs the SQL query to only get the new values. As an example, the initial query to produce the primed table might be

  SELECT    bookings  FROM    fct_bookings  WHERE    ds BETWEEN '1000-01-01' AND '9999-12-31'

Let's say that the min and max values turn out to be min=2000-01-01 and max=2020-01-01 after running this query on 2020-01-02. After a day passes and it's 2020-01-03, there might be new data in the fct_bookings table and it includes rows for ds=2020-01-02 . On the next priming operation, the framework will run the query to get the new rows for the data source.


SELECT  *FROM (  SELECT    bookings  FROM    fct_bookings  WHERE    ds BETWEEN '2020-01-01' AND '9999-12-31')-- Where clause is added to cover the case where SQL query is not parameterized.WHERE  ds > '2020-01-01'

It will then insert those rows into the existing primed table. The primed table with the new rows added will be used for subsequent queries. Note that since older values are not scanned, updates to fct_bookings where ds<='2020-01-02' will not be seen in the metrics produced by the framework. With append-only mutability, cached data sources are not invalidated.

Because the framework uses the existence of a value in the column to prune the query, there can be issues if partial data lands in the table. For example the following set of operations will result in inconsistent results

1. fct_bookings with append only mutability on `ds` contains
booking_id ds1 2020-01-012 2020-01-013 2020-01-02
2. Priming occurs.
3. fct_bookings is updated to include a few more rows:
booking_id ds1 2020-01-012 2020-01-013 2020-01-024 2020-01-025 2020-01-03
4. Priming occurs.
5. booking_id 4 is ignored by the framework, and subsquent metrics   reflect partial data.

The priming schedule can also be specified for this mutability configuration - please see the section in the full mutation mutability section.

DS Append Only#

The DS-append-only mutability setting is similar to append-only mutability, only that it has the default configuration that the mutability column is named ds in the data source.

Cron Options#

In addition to specifying mutability, you can also specify the time of day for the cache to be "invalidated". By default, this occurs every hour for full_mutation, append_only, and ds_append_only. After the data is "invalidated", the next time a query is executed against the data source in question, the data source will be reloaded from the original source and stored in the cache for future queries. This will then be invalidated upon the expiry of the next cron period. To specify a cadence, use update_cron. In the example below, the invalidation will occur at 00:00 UTC every day for every month.

mutability:  type: ds_append_only  type_params:    update_cron: 0 0 * * *

Crontab Syntax and Operators

For more information checkout Crontab Guru.

Each line contains five fields separated by a space

- - - - -| | | | || | | | ----- Day of week (0 - 7) (Sunday=0 or 7)| | | ------- Month (1 - 12)| | --------- Day of month (1 - 31)| ----------- Hour (0 - 23)------------- Minute (0 - 59)

The first five fields may contain one or more values, separated by a comma or a range of values separated by a hyphen.

  • * - The asterisk operator means any value or always. If you have the asterisk symbol in the Hour field, it means the task will be performed each hour.
  • , - The comma operator allows you to specify a list of values for repetition. For example, if you have 1,3,5 in the Hour field, the task will run at 1 am, 3 am and 5 am.
  • - - The hyphen operator allows you to specify a range of values. If you have 1-5 in the Day of week field, the task will run every weekday (From Monday to Friday).
  • / - The slash operator allows you to specify values that will be repeated over a certain interval between them. For example, if you have /4 in the Hour field, it means the action will be performed every four hours. It is same as specifying 0,4,8,12,16,20. Instead of asterisk before the slash operator, you can also use a range of values, 1-30/10 means the same as 1,11,21.