Ask AI

You are viewing an unreleased or outdated version of the documentation

Backfills#

Backfilling is the process of running partitions for assets or ops that either don’t exist or updating existing records. Dagster supports backfills for each partition or a subset of partitions.

After defining a partition, you can launch a backfill that will submit runs to fill in multiple partitions at the same time.

Backfills are common when setting up a pipeline for the first time. The assets you want to materialize might have historical data that needs to be materialized to get the assets up to date. Another common reason to run a backfill is when you’ve changed the logic for an asset and need to update historical data with the new logic.


Launching backfills for partitioned assets#

To launch backfills for a partitioned asset, click the Materialize button on either the Asset details or the Global asset lineage page. The backfill modal will display.

Backfills can also be launched for a selection of partitioned assets as long as the most upstream assets share the same partitioning. For example: All partitions use a DailyPartitionsDefinition.

backfills-launch-modal

To observe the progress of an asset backfill, navigate to the Runs details page for the run. This page can be accessed by clicking Runs tab, then clicking the ID of the run. To see all runs, including runs launched by a backfill, check the Show runs within backfills box:

backfills-launch-modal

Launching single-run backfills using backfill policies
Experimental
#

By default, if you launch a backfill that covers N partitions, Dagster will launch N separate runs, one for each partition. This approach can help avoid overwhelming Dagster or resources with large amounts of data. However, if you're using a parallel-processing engine like Spark and Snowflake, you often don't need Dagster to help with parallelism, so splitting up the backfill into multiple runs just adds extra overhead.

Dagster supports backfills that execute as a single run that covers a range of partitions, such as executing a backfill as a single Snowflake query. After the run completes, Dagster will track that all the partitions have been filled.

Single-run backfills only work for backfills that target assets directly, i.e. those launched from the asset graph or asset page. Backfills launched from the Job page will not respect the backfill policies of assets included in the job.

To get this behavior, you need to:

  • Set the asset's backfill_policy to BackfillPolicy.single_run

  • Write code that operates on a range of partitions instead of just single partitions. This means that, if your code uses the partition_key context property, you'll need to update it to use one of the following properties instead:

    Which property to use depends on whether it's most convenient for you to operate on start/end datetime objects, start/end partition keys, or a list of partition keys.

If you're using an I/O manager, you'll also need to make sure that the I/O manager is using these methods. Note: File system I/O managers do not support single-run backfills, but Dagster's built-in database I/O managers - like Snowflake, BigQuery, or DuckDB - include this functionality out of the box.

Use the following tabs to check out examples both with and without I/O managers.

Without I/O managers#

from dagster import (
    AssetExecutionContext,
    AssetKey,
    BackfillPolicy,
    DailyPartitionsDefinition,
    asset,
)


@asset(
    partitions_def=DailyPartitionsDefinition(start_date="2020-01-01"),
    backfill_policy=BackfillPolicy.single_run(),
    deps=[AssetKey("raw_events")],
)
def events(context: AssetExecutionContext) -> None:
    start_datetime, end_datetime = context.partition_time_window

    input_data = read_data_in_datetime_range(start_datetime, end_datetime)
    output_data = compute_events_from_raw_events(input_data)

    overwrite_data_in_datetime_range(start_datetime, end_datetime, output_data)

Launching backfills for partitioned jobs#

To launch and monitor backfills for a job, use the Partitions tab in the job's Details page:

  1. Click the Launch backfill button in the Partitions tab. This opens the Launch backfill modal.

  2. Select the partitions to backfill. A run will be launched for each partition.

  3. Click Submit [N] runs button on the bottom right to submit the runs. What happens when you click this button depends on your Run Coordinator:

    • For the default run coordinator, the modal will exit after all runs have been launched
    • For the queued run coordinator, the modal will exit after all runs have been queued

Note: If the job targets assets that have backfill policies, the assets' backfill policies will control which runs are launched.

After all the runs have been submitted, you'll be returned to the Partitions page, with a filter for runs inside the backfill. This page refreshes periodically and allows you to see how the backfill is progressing. Boxes will become green or red as steps in the backfill runs succeed or fail:

partitions-page

Launching backfills of jobs using the CLI#

Backfilling all partitions in a job#

Backfills can also be launched using the backfill CLI.

Let's say we defined a date-partitioned job named trips_update_job. To execute the backfill for this job, we can run the dagster job backfill command as follows:

$ dagster job backfill -j trips_update_job

This will display a list of all the partitions in the job, ask you if you want to proceed, and then launch a run for each partition.

Backfilling a subset of partitions#

To execute a subset of a partition set, use the --partitions argument and provide a comma-separated list of partition names you want to backfill:

$ dagster job backfill -j do_stuff_partitioned --partitions 2021-04-01,2021-04-02

Alternatively, you can also specify ranges of partitions using the --from and --to arguments:

$ dagster job backfill -j do_stuff_partitioned --from 2021-04-01 --to 2021-05-01