Skip to main content

Intervals and Windows

When defining a new job, you need to define the interval (cron) at which it will be triggered. This parameter can give you a precise value when the job is scheduled for execution but only a rough estimate exactly when the job is executing. It is very common in a ETL pipeline to know when the job is exactly executing as well as for what time window the current transformation will consume the data.

For example, assume there is a job that querying from a table using below statement:

SELECT * FROM table WHERE
created_at >= '{{.START_DATE}}' AND created_at < '{{.END_DATE}}'

START_DATE and END_DATE could be replaced at the time of compilation with based on its window configuration. Without the provided filter, we will have to consume all the records which are created till date inside the table even though the previous rows might already been processed.

The DSTART and DEND values of the input window could vary depending on the ETL job requirement.

  • For a simple transformation job executing daily, it would need to consume full day work of yesterday’s data.
  • A job might be consuming data for a week/month for an aggregation job, but the data boundaries should be complete, not consuming any partial data of a day.

Window Configuration

Optimus allows user to define the amount of data window to consume through window configurations. The configurations act on the scheduletime of the job and applied in order to compute _DSTART and DEND.

The following is the list of available configuration the user can setup a window:

  • Size: size enables the user to define the duration for which the data needs to be consumed by job. Size can be defined in in units like "1h", "1d", "1w", "1M" to define the respective size of data to consume.
  • Shift By: optional configuration to allow shifting the window by the specified value, e.g., a configuration with size 1d, with shift by -1d, means that it will consume data of 1 day, the day before yesterday.
  • Truncate_to: optional configuration to override the time unit for the window interval, e.g., a config with size 1d, with truncate_to "h", will mean data for last 24 hours, to the end of previous hour.
  • Location: optional configuration to define the time zone to be used for this window configuration, if not defined the default value of location will be UTC.
window:
size: 1d
shift_by: -1d
truncate_to: "h"
location: "Asia/Jakarta"

Will provide output for reference time 2023-12-01T03:00:00Z (2023-12-01T10:00:00+07:00)

  • DEND:'2023-11-30T10:00:00+07:00'
  • END_DATE: '2023-11-30'
  • START_DATE='2023-11-29'
  • DSTART: '2023-11-29T10:00:00+07:00'

Window configuration version 1 and 2

  • Truncate_to: The data window on most of the scenarios needs to be aligned to a well-defined time window like month start to month end, or week start to weekend with week start being monday, or a complete day. Inorder to achieve that the truncate_to option is provided which can be configured with either of these values "h", "d", "w", "M" through which for a given schedule_time the end_time will be the end of last hour, day, week, month respectively.
  • Offset: Offset is time duration configuration which enables user to move the end_time post truncation. User can define the duration like "24h", "2h45m", "60s", "-45m24h", "0", "", "2M", "45M24h", "45M24h30m" where "h","m","s","M" means hour, month, seconds, Month respectively.
  • Size: Size enables user to define the amount of data to consume from the end_time again defined through the duration same as offset.

To further understand, the following is an example with its explanation. Important note, the following example uses window version: 2 because version: 1 will soon be deprecated.

For example, previous-mentioned job has 0 2 * * * schedule interval and is scheduled to run on 2023-03-07 at 02.00 UTC with following details:

ConfigurationValueDescription
Truncate_todEven though it is scheduled at 02.00 AM, data window will be day-truncated (00.00 AM).
Offset-24hShifts the window to be 1 day earlier.
Size24hGap between DSTART and DEND is 24h.

Above configuration will produce below window:

  • DSTART: 2023-03-05T00:00:00Z
  • DEND: 2023-03-06T00:00:00Z

This means, the query will be compiled to the following query

SELECT * FROM table WHERE
created_at >= DATE('2023-03-05T00:00:00Z') AND
created_at < DATE('2023-03-06T00:00:00Z')

Assume the table content is as the following:

namecreated_at
Rick2023-03-05
Sanchez2023-03-06
Serious2023-03-07
Sam2023-03-07

When the job that scheduled at 2023-03-07 runs, the job will consume Rick as the input of the table.

Window configuration can be specified in two ways, through custom window configuration and through window preset.

Custom Window

Through this option, the user can directly configure the window that meets their requirement in the job spec YAML. The following is an example of its usage:

version: 3 # decides window version
name: sample-project.playground.table1
owner: sample_owner
schedule:
...
behavior:
...
task:
name: bq2bq
config:
...
window:
size: 1d
labels:
...
hooks: []
dependencies: []

Notice the window configuration is specified under field task.window. Important note, the version field decides which version of window capability to be used. Currently available window version are version: 1, version: 2 and version: 3. Version 3 is recommended to be used as version 1 and version 2 will soon be deprecated. To play around with window configuration, try playground feature for window:

# executes playground for v3
optimus playground window

# executes playground for v1 and v2
optimus playground window --v1-v2

Window Preset (since v0.10.0)

Window preset is a feature that allows easier setup of window configuration while also maintaining consistency. Through this feature, the user can configure a definition of window once, then use it multiple times through the jobs which require it. Important note, window preset always use window version: 2. The main components of window preset are as follow.

  • Window Preset File

Presets configuration is put in a dedicated YAML file. The way to configure it still uses the same window configuration like truncate_to, shift_by, location and size. Though, there are some additions, like the name of the preset and the description to explain this preset. The following is an example of how to define a preset under presets.yaml file (note that the file name does not have to be this one).

presets:
yesterday:
description: defines yesterday window
window:
size: 1d
last_month:
description: defines last 30 days window
window:
size: 1M

In the above example, the file presets.yaml defines two presets, named yesterday and last_month. The name of preset SHOULD be in lower case. All of the fields are required, unless specified otherwise.

  • Preset Reference under Project

If the preset file is already specified, the next thing to do is to ensure that the preset file is referenced under project configuration. The following is an example to refer the preset file under project configuration:

version: 1
log:
...
host: localhost:9100
project:
name: development_project
preset_path: ./preset.yaml # points to preset file
config:
...
namespaces:
...

In the above example, a new field is present, named preset_path. This path refers to where the preset file is located.

  • Preset Reference for Job Specification

Now, if the other two components are met, where the window preset file is specified and this file is referenced by the project, it means it is ready to be used. And the way to use it is by referencing which preset to be used in whichever job requires it. The following is an example of its usage:

version: 1 # preset always use window version 3
name: sample-project.playground.table1
owner: sample_owner
schedule:
...
behavior:
...
task:
name: bq2bq
config:
...
window:
preset: yesterday
labels:
...
hooks: []
dependencies: []

Important note, preset is optional in nature. It means that even if the preset is specified, the user can still use the custom window configuration depending on their need.