Why Incremental Loading Matters
Full loads reprocess every row on every run, which is slow and expensive. Incremental loading processes only the rows that are new or changed since the last run, reducing both cost and runtime.
In this guide, you will learn:
- Why incremental loading is important
- The pros and cons of popular approaches
- The best method for handling 30–40 BigQuery tables
- Sample metadata table design
- ADF pipeline workflow
Approach 1: CDC JSON File in ADLS
Some teams use a JSON file stored in ADLS to maintain watermark values, such as:
{
  "Sales": "2025-01-10T10:00:00Z",
  "Orders": "2025-01-09T08:00:00Z"
}
Advantages:
- Simple to set up
- No database required
Disadvantages:
- No transaction control
- Risk of partial updates
- Poor scalability for 30–40 tables
- No audit or history tracking
➡ Not recommended for production pipelines.
Approach 2: SQL Metadata Table (Recommended)
This is the most reliable and scalable solution for incremental loading.
Sample metadata table structure (the is_active flag, used by the lookup query below, lets you pause a table without deleting its row):
| table_name | incremental_column | last_watermark | last_successful_run | in_progress | is_active |
|---|---|---|---|---|---|
| Sales | updated_at | 2025-01-10 10:00:00 | 2025-01-10 10:02:00 | 0 | 1 |
| Orders | updated_at | 2025-01-09 08:00:00 | 2025-01-09 08:05:00 | 0 | 1 |
Key Benefits:
- Full transaction support
- Atomic updates ensure data safety
- Prevents concurrency conflicts
- Supports retries & error logging
- Scales cleanly to 30–40 tables (one metadata row per table; see the DDL sketch below)
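A minimal T-SQL sketch of this table (the column names follow the sample above; the exact types are assumptions and should be adjusted to your database):
CREATE TABLE metadata_table (
    table_name          VARCHAR(128) NOT NULL PRIMARY KEY,
    incremental_column  VARCHAR(128) NOT NULL,
    last_watermark      DATETIME2    NOT NULL,
    last_successful_run DATETIME2    NULL,
    in_progress         BIT          NOT NULL DEFAULT 0,  -- 1 while a load is running
    is_active           BIT          NOT NULL DEFAULT 1   -- 0 pauses a table without deleting its row
);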
How the ADF Incremental Pipeline Works
1. Lookup Metadata
ADF reads table name, incremental column, and last watermark from the metadata table.
SELECT table_name, incremental_column, last_watermark
FROM metadata_table
WHERE is_active = 1;
2. ForEach Loop
- Mark start (in_progress = 1; see the sketch after this list)
- Query BigQuery for rows updated after last watermark
- Copy data to ADLS
- Update SQL metadata table
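The mark-start step is a single UPDATE, typically issued from a Script or Stored Procedure activity. A sketch, assuming @table_name is mapped from item().table_name:
UPDATE metadata_table
SET in_progress = 1
WHERE table_name = @table_name
  AND in_progress = 0;  -- no-op if a previous run is still in flight
If the update affects zero rows (check @@ROWCOUNT), a previous run is still in flight and the iteration can be skipped rather than run concurrently.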
3. BigQuery Incremental Query
Place this inside the Source → Query field of the Copy Data activity. The @ placeholders are conceptual; ADF fills them in via dynamic content, as shown in the dynamic version below:
SELECT *
FROM `project.dataset.sales`
WHERE updated_at > @last_watermark
AND updated_at <= @current_watermark;
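One way to pin @current_watermark for the whole run is a one-row Lookup against BigQuery before the ForEach loop (the alias below is illustrative):
SELECT CURRENT_TIMESTAMP() AS current_watermark;
Store the result in a pipeline variable; a Set Variable activity with utcnow() works too and avoids the extra BigQuery round trip.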
ADF dynamic SQL version (this sketch assumes the current_watermark pipeline variable described above, so every table in the loop shares the same upper bound):
@concat(
    'SELECT * FROM `project.dataset.',
    item().table_name,
    '` WHERE ',
    item().incremental_column,
    ' > TIMESTAMP("',
    item().last_watermark,
    '") AND ',
    item().incremental_column,
    ' <= TIMESTAMP("',
    variables('current_watermark'),
    '")'
)
4. Update Metadata Table
UPDATE metadata_table
SET last_watermark = @new_watermark,
last_successful_run = GETUTCDATE(),
in_progress = 0
WHERE table_name = @table_name;
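Because the metadata lives in SQL, this step is naturally atomic. A stored-procedure version (names are illustrative) can be called from ADF's Stored Procedure activity:
CREATE PROCEDURE dbo.update_watermark
    @table_name    VARCHAR(128),
    @new_watermark DATETIME2
AS
BEGIN
    SET NOCOUNT ON;
    -- A single UPDATE is already atomic; the procedure form gives you
    -- one place to add error logging or retry bookkeeping later.
    UPDATE metadata_table
    SET last_watermark      = @new_watermark,
        last_successful_run = GETUTCDATE(),
        in_progress         = 0
    WHERE table_name = @table_name;
END;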
Why Metadata Table Wins
- Reliable and safe compared to JSON files
- No partial-write corruption, unlike a rewritten JSON file
- Better debugging and logging
- Enterprise-grade architecture
Conclusion
For incremental loading of 30–40 BigQuery tables into ADLS using Azure Data Factory, the SQL metadata table approach is the most robust, scalable, and production-ready method.
Frequently Asked Questions
1. Should the incremental BigQuery query be added in the Source of the Copy Data activity?
Yes. You must place the incremental SQL inside Copy Data → Source → Query mode.
SELECT *
FROM `project.dataset.table`
WHERE updated_at > TIMESTAMP(@last_watermark);
ADF dynamic version:
@concat(
'SELECT * FROM `project.dataset.', item().table_name,
'` WHERE updated_at > TIMESTAMP("',
activity('LookupWatermark').output.firstRow.last_watermark, '")'
)
2. Do we need to manually add metadata details for each table?
Yes, but only once per table, during initial setup.
INSERT INTO metadata_table (table_name, incremental_column, last_watermark)
VALUES ('Sales', 'updated_at', '2025-01-01T00:00:00Z');
After this, ADF updates the watermark automatically.
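For 30–40 tables, the seed rows can go in as one multi-row insert (the third table name here is illustrative):
INSERT INTO metadata_table (table_name, incremental_column, last_watermark)
VALUES
    ('Sales',     'updated_at', '2025-01-01T00:00:00Z'),
    ('Orders',    'updated_at', '2025-01-01T00:00:00Z'),
    ('Customers', 'updated_at', '2025-01-01T00:00:00Z');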
3. Why am I getting empty files with only column names?
This happens when BigQuery returns 0 rows — meaning no new data was found after the last watermark.
To avoid empty output files, add a row-count check in ADF:
Step 1 — Lookup row count
SELECT COUNT(*) AS rowcount
FROM `project.dataset.table`
WHERE updated_at > @last_watermark;
Step 2 — If Condition
@greater(int(activity('LookupRowCount').output.firstRow.rowcount), 0)
Step 3 — Only run Copy Data if rows exist
If false → skip Copy activity → no empty file.