Microbatch Incremental Models: A New Approach to Large Time-Series Data

What is a Microbatch?

Microbatch is an innovative incremental strategy designed for large time-series datasets. Introduced in dbt Core version 1.9 (currently in beta), it complements existing incremental strategies by offering a structured and efficient way to process data in batches.



Key features of Microbatch include:

  • Utilizes a time column to define batch ranges.
  • Supports reprocessing failed batches.
  • Auto-detects whether batches can run in parallel.
  • Eliminates complex conditional logic for backfilling.

However, it’s not suitable for datasets lacking a reliable time column or requiring fine-grained control over processing logic.

How Microbatches Work

Microbatching works by splitting model processing into multiple queries (batches) based on:

  • event_time: The time column defining batch ranges.
  • batch_size: The time period for each batch (hour, day [default], month, year).

Each batch functions as an independent, atomic unit, meaning:

  • Batches can be processed, retried, or replaced individually.
  • Parallel execution enables separate, idempotent batch processing.
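
For example, with a daily batch_size, each batch covers one day of data as defined by the event_time column. Conceptually, every batch boils down to a bounded query along these lines (an illustrative sketch, not the exact SQL dbt generates; the table and column names are made up):

-- One daily batch, processed independently of all other batches
select *
from stg_events
where event_occurred_at >= '2025-02-01'  -- batch start (inclusive)
  and event_occurred_at <  '2025-02-02'  -- batch end (exclusive, batch_size = day)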

Batch replacement strategies vary by database adapter:

  • Postgres: Uses merge.
  • BigQuery, Spark: Uses insert_overwrite.
  • Databricks: Uses replace_where.
  • Redshift, Snowflake: Uses delete + insert.

Microbatch Model Configurations

When setting up a Microbatch model, the following configurations are required:

  • event_time: Specifies the time column in UTC.
  • batch_size: Defines batch granularity (hour, day, month, year).
  • begin: Sets the start point for initial or full-refresh builds.

Optional configurations include:

  • lookback: Processes prior batches for late-arriving records.
  • concurrent_batches: Controls parallel execution (auto-detected by default).
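
Taken together, a model-level configuration might look like the following sketch (model and column names are hypothetical; event_time, batch_size, and begin are the required settings):

-- models/fct_events.sql: a minimal microbatch model
-- lookback=1 additionally reprocesses the previous batch to catch late-arriving records
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_occurred_at',
        batch_size='day',
        begin='2024-01-01',
        lookback=1
    )
}}

select * from {{ ref('stg_events') }}

Note that dbt filters an upstream ref or source by batch only when that input also has an event_time configured.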

Running Batches in Parallel

Parallel execution is automatically detected based on batch conditions and adapter support. However, users can override this behavior using the concurrent_batches setting.

Parallel execution is possible when:

  • The batch is neither the first nor last in the sequence.
  • The database adapter supports parallel execution.
  • The model logic does not depend on execution order.
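
If a model's logic does require ordered processing, the auto-detected behavior can be overridden in the model config. A sketch, reusing the hypothetical model from above:

-- Force sequential batch execution for this model
{{
    config(
        materialized='incremental',
        incremental_strategy='microbatch',
        event_time='event_occurred_at',
        batch_size='day',
        begin='2024-01-01',
        concurrent_batches=false
    )
}}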

How to Backload Data

Backloading allows reprocessing historical data within a specific time range using the following command:

dbt run --event-time-start "2025-02-01" --event-time-end "2025-02-03"

This ensures that only the batches within the defined range are processed, each as an independent query.
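
Individual failed batches can be reprocessed on their own as well: rerunning the project with the following command picks up only the failed batches rather than rebuilding the whole model (assuming a dbt Core version that records batch-level results).

dbt retry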

Microbatch vs. Other Incremental Strategies

Microbatch differs from traditional incremental strategies by:

  • Using independent queries for time-based batches.
  • Eliminating the need for is_incremental() and complex SQL logic.
  • Automatically selecting the most efficient operation (insert, update, replace) for each platform.
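
As a rough side-by-side sketch (two separate model files, with hypothetical table and column names), a traditional incremental model carries its own filter logic, while a microbatch model leaves the slicing to dbt:

-- models/fct_events_classic.sql: traditional incremental model with a manual filter
{{ config(materialized='incremental', unique_key='event_id') }}

select * from {{ ref('stg_events') }}
{% if is_incremental() %}
where event_occurred_at > (select max(event_occurred_at) from {{ this }})
{% endif %}

-- models/fct_events_microbatch.sql: no is_incremental() block needed
{{ config(materialized='incremental', incremental_strategy='microbatch',
          event_time='event_occurred_at', batch_size='day', begin='2024-01-01') }}

select * from {{ ref('stg_events') }}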

Conclusion

Microbatch is a powerful new approach to incremental data processing in dbt Core. By breaking down large datasets into manageable, parallelizable chunks, it simplifies data modeling while improving efficiency and scalability. However, it is essential to consider whether Microbatch suits your data pipeline’s requirements before implementing it.


Meet the Speaker


Dmytro Polishchuk
Senior BI Consultant

Dmytro Polishchuk has seven years of experience in business intelligence and works as a Senior BI Consultant at Scalefree. He is a proven Data Vault 2.0 expert with excellent knowledge of various (cloud) architectures, data modeling, and the implementation of automation frameworks. He excels in team integration and structured project work, and holds a bachelor’s degree in Finance and Financial Management.
