Why Incremental Loading Matters
Full loads reprocess every row on every run, which is slow and expensive. Incremental loading processes only the rows that are new or changed since the last run, reducing both cost and runtime.
In this guide, you will learn:
- Why incremental loading is important
- The pros and cons of popular approaches
- The best method for handling 30–40 BigQuery tables
- Sample metadata table design
- ADF pipeline workflow
Approach 1: CDC JSON File in ADLS
Some teams use a JSON file stored in ADLS to maintain watermark values, such as:
{
  "Sales": "2025-01-10T10:00:00Z",
  "Orders": "2025-01-09T08:00:00Z"
}
Advantages:
- Simple to set up
- No database required
Disadvantages:
- No transaction control
- Risk of partial updates
- Poor scalability for 30–40 tables
- No audit or history tracking
➡ Not recommended for production pipelines.
Approach 2: SQL Metadata Table (Recommended)
This is the most reliable and scalable solution for incremental loading.
Sample metadata table structure (the is_active flag, used by the lookup query below, lets you pause a table without deleting its row):
| table_name | incremental_column | last_watermark | last_successful_run | in_progress | is_active |
|---|---|---|---|---|---|
| Sales | updated_at | 2025-01-10 10:00:00 | 2025-01-10 10:02:00 | 0 | 1 |
| Orders | updated_at | 2025-01-09 08:00:00 | 2025-01-09 08:05:00 | 0 | 1 |
Key Benefits:
- Full transaction support
- Atomic updates ensure data safety
- Prevents concurrency conflicts
- Supports retries & error logging
- Scales cleanly to 30–40 tables (one metadata row per table; see the DDL sketch below)
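A minimal T-SQL sketch of this table (the column names follow the sample above; the exact types are assumptions and should be adjusted to your database):
CREATE TABLE metadata_table (
    table_name          VARCHAR(128) NOT NULL PRIMARY KEY,
    incremental_column  VARCHAR(128) NOT NULL,
    last_watermark      DATETIME2    NOT NULL,
    last_successful_run DATETIME2    NULL,
    in_progress         BIT          NOT NULL DEFAULT 0,  -- 1 while a load is running
    is_active           BIT          NOT NULL DEFAULT 1   -- 0 pauses a table without deleting its row
);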
How the ADF Incremental Pipeline Works
1. Lookup Metadata
ADF reads table name, incremental column, and last watermark from the metadata table.
SELECT table_name, incremental_column, last_watermark
FROM metadata_table
WHERE is_active = 1;
2. ForEach Loop
- Mark start (in_progress = 1; see the sketch after this list)
- Query BigQuery for rows updated after last watermark
- Copy data to ADLS
- Update SQL metadata table
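The mark-start step is a single UPDATE, typically issued from a Script or Stored Procedure activity. A sketch, assuming @table_name is mapped from item().table_name:
UPDATE metadata_table
SET in_progress = 1
WHERE table_name = @table_name
  AND in_progress = 0;  -- no-op if a previous run is still in flight
If the update affects zero rows (check @@ROWCOUNT), a previous run is still in flight and the iteration can be skipped rather than run concurrently.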
3. BigQuery Incremental Query
Place this inside the Source → Query field of the Copy Data activity. The @ placeholders are conceptual; ADF fills them in via dynamic content, as shown in the dynamic version below:
SELECT *
FROM `project.dataset.sales`
WHERE updated_at > @last_watermark
AND updated_at <= @current_watermark;
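One way to pin @current_watermark for the whole run is a one-row Lookup against BigQuery before the ForEach loop (the alias below is illustrative):
SELECT CURRENT_TIMESTAMP() AS current_watermark;
Store the result in a pipeline variable; a Set Variable activity with utcnow() works too and avoids the extra BigQuery round trip.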
ADF dynamic SQL version (this sketch assumes the current_watermark pipeline variable described above, so every table in the loop shares the same upper bound):
@concat(
    'SELECT * FROM `project.dataset.',
    item().table_name,
    '` WHERE ',
    item().incremental_column,
    ' > TIMESTAMP("',
    item().last_watermark,
    '") AND ',
    item().incremental_column,
    ' <= TIMESTAMP("',
    variables('current_watermark'),
    '")'
)
4. Update Metadata Table
UPDATE metadata_table
SET last_watermark = @new_watermark,
last_successful_run = GETUTCDATE(),
in_progress = 0
WHERE table_name = @table_name;
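Because the metadata lives in SQL, this step is naturally atomic. A stored-procedure version (names are illustrative) can be called from ADF's Stored Procedure activity:
CREATE PROCEDURE dbo.update_watermark
    @table_name    VARCHAR(128),
    @new_watermark DATETIME2
AS
BEGIN
    SET NOCOUNT ON;
    -- A single UPDATE is already atomic; the procedure form gives you
    -- one place to add error logging or retry bookkeeping later.
    UPDATE metadata_table
    SET last_watermark      = @new_watermark,
        last_successful_run = GETUTCDATE(),
        in_progress         = 0
    WHERE table_name = @table_name;
END;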
Why Metadata Table Wins
- Reliable and safe compared to JSON files
- No partial-write corruption, unlike a rewritten JSON file
- Better debugging and logging
- Enterprise-grade architecture
Conclusion
For incremental loading of 30–40 BigQuery tables into ADLS using Azure Data Factory, the SQL metadata table approach is the most robust, scalable, and production-ready method.
Frequently Asked Questions
1. Should the incremental BigQuery query be added in the Source of the Copy Data activity?
Yes. You must place the incremental SQL inside Copy Data → Source → Query mode.
SELECT *
FROM `project.dataset.table`
WHERE updated_at > TIMESTAMP(@last_watermark);
ADF dynamic version:
@concat(
'SELECT * FROM `project.dataset.', item().table_name,
'` WHERE updated_at > TIMESTAMP("',
activity('LookupWatermark').output.firstRow.last_watermark, '")'
)
2. Do we need to manually add metadata details for each table?
Yes, but only once per table, during initial setup.
INSERT INTO metadata_table (table_name, incremental_column, last_watermark)
VALUES ('Sales', 'updated_at', '2025-01-01T00:00:00Z');
After this, ADF updates the watermark automatically.
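For 30–40 tables, the seed rows can go in as one multi-row insert (the third table name here is illustrative):
INSERT INTO metadata_table (table_name, incremental_column, last_watermark)
VALUES
    ('Sales',     'updated_at', '2025-01-01T00:00:00Z'),
    ('Orders',    'updated_at', '2025-01-01T00:00:00Z'),
    ('Customers', 'updated_at', '2025-01-01T00:00:00Z');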
3. Why am I getting empty files with only column names?
This happens when BigQuery returns 0 rows — meaning no new data was found after the last watermark.
To avoid empty output files, add a row-count check in ADF:
Step 1 — Lookup row count
SELECT COUNT(*) AS rowcount
FROM `project.dataset.table`
WHERE updated_at > @last_watermark;
Step 2 — If Condition
@greater(int(activity('LookupRowCount').output.firstRow.rowcount), 0)
Step 3 — Only run Copy Data if rows exist
If false → skip Copy activity → no empty file.