Why Most Data Pipelines Drift From Reality
Your data warehouse says you had 14,200 active users last month. Your application database says 15,800 users logged in. The product team uses the warehouse number. The engineering team uses the database number. Neither team knows the other's number is different.
This is data drift. It is not a bug in one system. It is the natural, inevitable consequence of maintaining two copies of the same data with different update mechanisms, different schemas, and different timing.
Every data pipeline drifts from its source of truth. The question is whether you detect the drift before it affects decisions.
ETL vs Operational Truth
A data pipeline extracts data from production systems, transforms it, and loads it into a warehouse. The warehouse is a copy. Copies diverge.
The divergence starts small. A pipeline run takes 45 minutes. During those 45 minutes, the production system continues processing transactions. The warehouse snapshot is already stale when the load completes.
Then it compounds. A pipeline fails at 3 AM. The on-call engineer restarts it at 7 AM. Four hours of data arrive late. The morning dashboard shows yesterday's numbers as if they were complete, and decisions get made on incomplete data.
Production DB           Data Warehouse
(source of truth)       (derived copy)
─────────────────       ──────────────────
Day 1:  15,000 users    15,000 users  ✓
        Pipeline runs normally

Day 2:  15,200 users    15,150 users  ✗
        Pipeline lag: 50 users arrived
        during the extraction window

Day 3:  15,800 users    15,150 users  ✗✗
        Pipeline failed at 3 AM,
        restarted at 7 AM.
        Dashboard showed Day 2 numbers all morning

Day 4:  16,100 users    15,950 users  ✗
        Backfill recovered most data,
        but 150 records had a schema mismatch
        and were silently dropped

Cumulative drift: 150 users (0.9%)
Appears small. Affects every metric derived from user counts.

The warehouse is never exactly right. It is at best a delayed, filtered approximation of the production system. Most teams treat it as the authoritative source for analytics, reporting, and ML features without acknowledging this fundamental limitation.
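The drift math in the table above is worth making explicit. A minimal sketch (the function name and output fields are illustrative, not from any particular tool):

```python
def drift_report(source_count: int, warehouse_count: int) -> dict:
    """Compare a source-of-truth count against its warehouse copy."""
    missing = source_count - warehouse_count
    drift_pct = 100.0 * missing / source_count if source_count else 0.0
    return {
        "source": source_count,
        "warehouse": warehouse_count,
        "missing": missing,
        "drift_pct": round(drift_pct, 1),
    }

# Day 4 from the table: 16,100 users in production, 15,950 in the warehouse
print(drift_report(16_100, 15_950))
# → {'source': 16100, 'warehouse': 15950, 'missing': 150, 'drift_pct': 0.9}
```

Published as a daily metric, this number turns invisible drift into a trend you can alert on.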
Schema Drift Is Inevitable
Source systems change. Columns are added, renamed, retyped, and deprecated. Each change has the potential to break downstream pipelines.
The dangerous changes are not the ones that break the pipeline - those are caught immediately. The dangerous changes are the ones the pipeline handles gracefully but incorrectly.
# Production schema change: added nullable column
# Before:
{"user_id": 123, "plan": "pro"}
# After:
{"user_id": 123, "plan": "pro", "trial_end": null}
# Pipeline transformation (unchanged):
def compute_paying_users(rows):
    return [r for r in rows if r["plan"] != "free"]
# Still works. But now "trial" users with trial_end
# dates are counted as paying users.
# No error. Wrong number.
Schema drift does not always break pipelines. Often it changes the semantics of existing transformations without producing any errors. The pipeline runs green. The numbers are wrong. Nobody notices for weeks.
Schema registries, contract tests between producers and consumers, and automated schema comparison between the source and warehouse catch these issues. Few teams implement all three. Most discover schema drift when a stakeholder says "these numbers don't look right."
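Automated schema comparison does not require heavy tooling; at its core it is a diff of two column maps pulled from each system's catalog. A sketch, assuming the schema dicts stand in for real catalog queries:

```python
def diff_schemas(source: dict, warehouse: dict) -> dict:
    """Classify schema drift between a source table and its warehouse copy.

    Each schema maps column name -> type string, e.g. {"user_id": "bigint"}.
    """
    added = sorted(set(source) - set(warehouse))    # new columns upstream
    removed = sorted(set(warehouse) - set(source))  # columns dropped upstream
    retyped = sorted(
        c for c in set(source) & set(warehouse) if source[c] != warehouse[c]
    )
    return {"added": added, "removed": removed, "retyped": retyped}

# The trial_end example from above: a nullable column added upstream
source = {"user_id": "bigint", "plan": "text", "trial_end": "date"}
warehouse = {"user_id": "bigint", "plan": "text"}
print(diff_schemas(source, warehouse))
# → {'added': ['trial_end'], 'removed': [], 'retyped': []}
```

Run on a schedule, this catches the "gracefully but incorrectly" cases: an added column is logged for audit, while a removal or type change can block the pipeline before it produces wrong numbers.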
Partial Data Is Worse Than Missing Data
When a pipeline fails completely, it is obvious. The dashboard is empty. Alerts fire. Someone fixes it.
When a pipeline partially fails - processing 80% of records and silently dropping 20% - the dashboard looks plausible. The numbers are in the right ballpark. But every metric is systematically wrong by a consistent margin.
Common causes of partial data:
- Encoding errors. A source system starts emitting records with non-UTF-8 characters. The pipeline's JSON parser rejects them. Only records with special characters are lost.
- Join failures. A pipeline joins events with user profiles. Users created in the last hour do not have profiles yet. Their events are dropped from the join.
- Filter drift. A WHERE clause filters data by date. The source system changes its timezone. Records near midnight are included or excluded incorrectly.
- Size limits. A record exceeds a column's size limit. The database truncates or rejects it. The pipeline continues with an incomplete record.
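The join-failure case from the list is easy to reproduce. A sketch showing how an inner-join-style filter silently drops events for users without profiles, while a left-join-style merge preserves them for inspection (the data is illustrative):

```python
events = [{"user_id": 1, "event": "click"}, {"user_id": 2, "event": "signup"}]
profiles = {1: {"user_id": 1, "plan": "pro"}}  # user 2 is new: no profile yet

# Inner-join style: events without a profile vanish with no error
joined = [e | profiles[e["user_id"]] for e in events if e["user_id"] in profiles]
print(len(joined))  # 1 — user 2's signup event was silently dropped

# Left-join style: keep every event, flag the missing profile
preserved = [e | profiles.get(e["user_id"], {"plan": None}) for e in events]
unmatched = [e for e in preserved if e["plan"] is None]
print(len(preserved), len(unmatched))  # 2 1
```

The first variant runs green every day while undercounting; the second makes the gap countable.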
-- Detecting partial data: compare row counts
-- between source and warehouse
SELECT
    COALESCE(DATE(source.created_at), DATE(warehouse.created_at)) AS day,
    COUNT(source.id) AS source_count,
    COUNT(warehouse.id) AS warehouse_count,
    COUNT(source.id) - COUNT(warehouse.id) AS missing,
    ROUND(100.0 * (COUNT(source.id) - COUNT(warehouse.id))
          / NULLIF(COUNT(source.id), 0), 1) AS missing_pct
FROM production.orders source
FULL OUTER JOIN warehouse.orders warehouse
    ON source.id = warehouse.id
    AND DATE(source.created_at) = DATE(warehouse.created_at)
GROUP BY COALESCE(DATE(source.created_at), DATE(warehouse.created_at))
HAVING COUNT(source.id) != COUNT(warehouse.id)
ORDER BY day DESC;
This reconciliation query is simple. Most teams do not run it. They discover partial data loss when a quarterly business review produces numbers that do not match the financial audit.
Pipeline Freshness Is an SLA
The business expects updated dashboards by a specific time. This is not a preference. It is an operational requirement on the same level as service uptime.
When a pipeline misses its freshness SLA, the business operates on stale data. Marketing allocates budget based on yesterday's incomplete numbers. Operations staffs based on outdated demand forecasts. Product teams prioritize features based on metrics that do not reflect reality.
from datetime import datetime

class FreshnessMonitor:
    def check(self, pipeline_name):
        last_run = self.get_last_successful_run(pipeline_name)
        expected_freshness = self.get_sla(pipeline_name)
        staleness = datetime.utcnow() - last_run.completed_at
        if staleness > expected_freshness:
            self.alert(
                severity="high",
                message=f"{pipeline_name} is {staleness} stale "
                        f"(SLA: {expected_freshness})",
            )
        self.metrics.gauge(
            "pipeline.staleness_seconds",
            staleness.total_seconds(),
            tags={"pipeline": pipeline_name},
        )
Freshness monitoring should trigger alerts before the business notices. If the SLA is "data ready by 8 AM," the alert should fire at 7:30 AM when the pipeline is running late, not at 9 AM when the VP asks why the dashboard shows yesterday's numbers.
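Firing before the deadline means predicting the miss, not just observing it. One simple approach: project the current run's finish time from its start time and typical duration, and alert as soon as the projection passes the deadline. A sketch under those assumptions (the 25% padding factor is an illustrative choice):

```python
from datetime import datetime, timedelta

def predict_sla_miss(started_at, typical_duration, deadline, now):
    """Flag a run early if its projected finish (start time plus typical
    duration, padded 25% for variance) lands past the freshness deadline."""
    projected_finish = started_at + typical_duration * 1.25
    return now < deadline and projected_finish > deadline

# Pipeline kicked off at 6:50, usually takes 70 minutes, data due at 8:00.
# At 7:30 the projected finish (~8:17) already misses the SLA: alert now.
day = datetime(2024, 1, 15)
late = predict_sla_miss(
    started_at=day.replace(hour=6, minute=50),
    typical_duration=timedelta(minutes=70),
    deadline=day.replace(hour=8),
    now=day.replace(hour=7, minute=30),
)
print(late)  # True
```

This is the difference between the on-call engineer fixing the pipeline at 7:30 and the VP discovering stale numbers at 9.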
How to Build Pipelines That Stay Honest
Data pipelines will always drift from their sources. The goal is not to prevent drift - it is to detect and measure it.
- Run reconciliation daily. Compare row counts, aggregate metrics, and key distributions between the source and the warehouse. Publish the drift as a metric.
- Monitor schema changes. Detect column additions, removals, and type changes before they reach the pipeline. Alert on breaking changes and log non-breaking changes for audit.
- Treat freshness as an SLA. Define expected freshness per pipeline. Alert when pipelines are late, not when they are missing.
- Quarantine instead of drop. When records fail validation, write them to a quarantine table instead of silently discarding them. Quarantined records are inspectable and recoverable.
- Version pipeline transformations. When business logic changes, version the transformation so historical data remains interpretable under the old logic.
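The quarantine pattern above can be sketched as a validate-and-route step (the function and the validation rule are illustrative):

```python
def route(records, is_valid):
    """Split records into loadable rows and quarantined rows.

    Invalid records are kept with a reason, not dropped, so they can be
    inspected and replayed after the upstream issue is fixed.
    """
    load, quarantine = [], []
    for rec in records:
        try:
            if is_valid(rec):
                load.append(rec)
            else:
                quarantine.append({"record": rec, "reason": "failed validation"})
        except Exception as exc:  # a crash in validation is also quarantined
            quarantine.append({"record": rec, "reason": repr(exc)})
    return load, quarantine

rows = [{"user_id": 1, "plan": "pro"}, {"user_id": 2, "plan": None}]
good, bad = route(rows, lambda r: isinstance(r["plan"], str))
print(len(good), len(bad))  # 1 1
```

In production the quarantine list would be written to its own table with a timestamp and pipeline version, so the dropped 20% from the partial-failure scenario becomes a queryable backlog instead of a silent loss.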
The pipeline is not the source of truth. The production system is. Every number from the warehouse is an approximation. The best you can do is measure how far the approximation drifts and keep that distance small enough to trust.