Three months into a production migration, we discovered that 14 of our 47 deployments had quietly drifted from their declared state. Not in a dramatic, pager-firing way. In the slow, invisible way that turns a Tuesday afternoon into a Friday incident.
That's the thing about configuration drift. It doesn't announce itself. It accumulates.
Here's what happened, what we built to fix it, and why I think most teams are one bad deploy away from the same problem.
The Setup
We were running a mid-sized Kubernetes cluster across three environments: dev, staging, and production. Standard GitOps workflow. ArgoCD handling deployments. Helm charts checked into Git. Everything was "declarative." Everything was "source-of-truth."
Except it wasn't.
Engineers were patching things manually under pressure. kubectl edit became a habit. Resource limits got tweaked directly on pods. ConfigMaps were updated in-cluster without touching the repo. Nobody flagged it because nothing broke. The cluster kept humming. The dashboards stayed green.
Then we started seeing weird behavior:
- A service running at 2Gi memory when its limit was declared at 512Mi
- A deployment with two replicas when the Helm chart declared three
- A sidecar container version six weeks behind what we'd intended to ship
None of it was catastrophic. All of it was real. And we had no idea how long it had been that way.
01 GitOps Sync Status Isn't the Same as Drift Detection
This is the part that trips people up. ArgoCD told us our apps were "Synced." And technically, they were, at the moment of last sync. But sync status is a snapshot, not a continuous assertion. If someone runs kubectl edit after a sync, ArgoCD doesn't know. It's not watching for that.
Drift detection means continuously comparing what's running in the cluster against what's declared in Git, and alerting when they diverge. That's a different problem than deployment sync. Most teams conflate them and pay for it later.
We built a reconciliation loop using ArgoCD's resource tracking combined with a custom controller that scraped live cluster state on a 5-minute interval, diffed it against our Helm-rendered manifests, and pushed the deltas into a monitoring pipeline.
// simplified drift detection loop
func reconcileLoop(ctx context.Context) {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return // shut down cleanly when the controller stops
        case <-ticker.C:
            liveState := scrapeClusterState(ctx)
            declState := renderHelmManifests()
            deltas := diff(liveState, declState)
            pushMetrics(deltas) // → Prometheus exporter
        }
    }
}
Nothing fancy. About 400 lines of Go and a Prometheus exporter. The first run returned 14 drifted resources. Four of them in production.
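The diff step in the loop above can be as simple as a field-by-field comparison. Here's a minimal sketch; the Resource and Delta types are hypothetical stand-ins for the real rendered-manifest structures, and the real controller compares full manifests rather than flattened string maps:

```go
package main

import "fmt"

// Resource holds a flattened view of the fields we compare.
// (Illustrative only; the real controller diffs full rendered manifests.)
type Resource struct {
	Spec map[string]string
}

// Delta records one divergence between declared and live state.
type Delta struct {
	Resource string
	Field    string
	Declared string
	Live     string
}

// diff compares declared state against live state, field by field,
// and flags resources that exist in Git but not in the cluster.
func diff(declared, live map[string]Resource) []Delta {
	var deltas []Delta
	for key, d := range declared {
		l, ok := live[key]
		if !ok {
			deltas = append(deltas, Delta{Resource: key, Field: "(missing)"})
			continue
		}
		for field, want := range d.Spec {
			if got := l.Spec[field]; got != want {
				deltas = append(deltas, Delta{key, field, want, got})
			}
		}
	}
	return deltas
}

func main() {
	declared := map[string]Resource{
		"production/deployment/api-gateway": {Spec: map[string]string{
			"memory_limit": "512Mi", "replicas": "3",
		}},
	}
	live := map[string]Resource{
		"production/deployment/api-gateway": {Spec: map[string]string{
			"memory_limit": "2Gi", "replicas": "3",
		}},
	}
	for _, d := range diff(declared, live) {
		fmt.Printf("%s %s declared=%s live=%s\n", d.Resource, d.Field, d.Declared, d.Live)
	}
	// prints: production/deployment/api-gateway memory_limit declared=512Mi live=2Gi
}
```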
02 The Real Problem Is Toil and Pressure, Not Malicious Intent
Every one of those manual edits had a story:
- A memory limit bumped because an OOMKill was happening at 2 AM
- A replica count changed because load spiked and autoscaling hadn't kicked in fast enough
- A ConfigMap updated because a third-party API changed its endpoint and we needed 30 seconds to fix it, not 30 minutes to run a pipeline
These aren't reckless engineers. These are engineers solving real problems with the tools in front of them.
Without drift detection, that 2 AM fix becomes permanent. Nobody goes back. The PR never gets opened. The Helm chart never gets updated. And six weeks later, someone deploys from Git and rolls back the fix that's been holding production together.
The fix isn't telling people to stop using kubectl edit. It's making the correct path faster than the escape hatch, and making drift visible so it can't quietly accumulate.
03 Alerting on Drift Changes the Culture
Once engineers could see a drift dashboard, broken down by namespace, by team, by resource type, behavior shifted. Not because we mandated it. Because visibility creates accountability in a way that process documents never do.
We tagged each drift event with three pieces of metadata:
drift_event:
  resource: deployment/api-gateway
  namespace: production
  last_modifier: engineer@company.com  # from audit logs
  time_since: 6d 14h
  severity: high  # resource limit 4x declared
  delta:
    memory_limit: { declared: 512Mi, live: 2Gi }
Severity was tiered deliberately:
- Low — replica count variance, annotation changes
- Medium — resource limit changes within 2x declared
- High — security context changes, limits beyond 2x, missing probes
- PagerDuty alert — anything in production flagged high severity
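The resource-limit tiers above reduce to a small classifier. A hypothetical helper (classifyLimitDrift is not the real function name), assuming limits have already been normalized to a common unit:

```go
package main

import "fmt"

type Severity string

const (
	Low    Severity = "low"
	Medium Severity = "medium"
	High   Severity = "high"
)

// classifyLimitDrift applies the tiers above to a resource-limit change:
// within 2x of declared is Medium, beyond 2x is High. Low-tier changes
// (replica variance, annotations) are classified elsewhere.
func classifyLimitDrift(declared, live float64) Severity {
	if declared == live {
		return Low // no limit drift at all
	}
	ratio := live / declared
	if ratio < 1 {
		ratio = 1 / ratio // shrinking a limit counts the same as growing it
	}
	if ratio <= 2 {
		return Medium
	}
	return High
}

func main() {
	fmt.Println(classifyLimitDrift(512, 2048)) // 4x declared → prints "high"
}
```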
Within three weeks of launching the dashboard, the team had self-corrected 11 of the 14 original drifted resources without us asking. They just didn't want to see red in their namespace.
04 Drift Detection Has to Be Cheap to Maintain
Here's where most homegrown solutions fall apart. You build the thing, it works, then it becomes another system someone has to babysit. We kept ours deliberately simple.
- No custom UI. A Grafana dashboard pulling from Prometheus.
- The controller runs as a standard Kubernetes deployment with a ServiceAccount scoped to read-only cluster access.
- The diff logic uses server-side apply dry-runs, which Kubernetes gives you for free.
- Total compute overhead is negligible.
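For a sense of scale: the Prometheus side can be a handful of lines. This sketch hand-writes the text exposition format to make the point; a real exporter would more likely use the client_golang library, but nothing about this part of the system is heavyweight:

```go
package main

import (
	"fmt"
	"net/http"
)

// renderMetrics hand-writes the Prometheus text exposition format.
// Illustrative only; a production exporter would use client_golang.
func renderMetrics(drifted int) string {
	return fmt.Sprintf(
		"# HELP drift_resources_total Resources diverged from declared state.\n"+
			"# TYPE drift_resources_total gauge\n"+
			"drift_resources_total %d\n", drifted)
}

func main() {
	http.HandleFunc("/metrics", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, renderMetrics(14)) // 14: the first-run count from above
	})
	fmt.Print(renderMetrics(14))
	// http.ListenAndServe(":9090", nil) // uncomment to actually serve
}
```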
We've been running it for eight months. It's needed exactly two bug fixes and one config update when we migrated Helm chart versions. Complexity is debt. Every additional feature you bolt on is another thing that can fail or get abandoned.
↗ The Takeaway
Drift detection isn't a Kubernetes problem. It's a systems problem. The cluster just happens to be where the drift lives.
If you're running GitOps and you've never run a diff between your declared manifests and your live cluster state, you probably have drift. You just don't know what it looks like yet.
What decisions are you currently making based on cluster state that you think matches Git, but doesn't?