My drift detector keeps a history so you can see a trend line: how much your AWS
diverged from Terraform over time, run by run. To draw that line I record a
DriftSnapshot after every scan, every tfstate import, every S3 sync.
Here's the part I didn't think through: I record one even when there's zero
drift. That's on purpose — a flat line is data too; "nothing drifted for three
weeks" is exactly what you want the chart to show. But it also means a scheduler
ticking every minute writes a row every minute, forever, whether or not anything
changed. Append-only, no ceiling. A slow-motion time bomb in a table nobody looks at.
Why I didn't notice for a while
The history view caps what it shows:
snapshots = list(DriftSnapshot.objects.filter(environment=env)[:100])
So the UI looked fine — a tidy 100 rows. Meanwhile the table underneath had tens
of thousands. The display limit was quietly hiding the growth from the one place
I'd have spotted it. The lesson that stuck: a LIMIT in your read path is not
retention. It bounds the query, not the data.
The fix has to live where the rows are born
My first instinct was a nightly cron that trims old rows. That works, but it
leaves the table unbounded between runs, and it's one more moving part to forget.
The rows all come from a single function — every scan/import/sync funnels through
the same _record_drift_snapshot(). So the cleanest place to enforce a ceiling is
right there, at the moment of creation:
snapshot = DriftSnapshot.objects.create(
environment=environment,
source=source,
changed_count=len(changed),
added_count=len(added),
unchanged_count=unchanged,
detail={'changed': changed, 'added': added},
)
# we append a row every run (even at zero drift) — trim the oldest beyond the cap
DriftSnapshot.prune(environment)
return snapshot
One chokepoint, so the table is bounded continuously — not "eventually, when the
cron next fires."
Keeping the newest N, per environment
The prune itself is two queries. Find the PKs of the newest N you want to keep,
then delete everything else for that environment:
@classmethod
def prune(cls, environment, keep=None):
if keep is None:
from django.conf import settings
keep = getattr(settings, 'DRIFT_SNAPSHOT_RETENTION', 0)
if not keep or keep <= 0:
return 0 # 0 / unset = unlimited, opt out cleanly
keep_ids = list(
cls.objects.filter(environment=environment)
.order_by('-detected_at')
.values_list('pk', flat=True)[:keep]
)
deleted, _ = (
cls.objects.filter(environment=environment)
.exclude(pk__in=keep_ids)
.delete()
)
return deleted
Two things I'd flag if you copy this:
-
Scope by
environment. The cap is "newest N per environment," not N total. A global[:keep]would let a busy prod environment evict a quiet staging one's entire history. -
.values_list('pk', ...)[:keep]then.exclude(pk__in=...). The slice becomes aLIMITin SQL, so you never pull thousands of rows into Python just to decide what to delete. For per-write pruning wherekeepis small, that matters.
The cap is a setting, defaulting to something sane, with <= 0 meaning "unlimited"
so anyone who actually wants the full append-only log can opt out:
DRIFT_SNAPSHOT_RETENTION = int(os.getenv('DRIFT_SNAPSHOT_RETENTION', '500') or '0')
Don't forget the rows you already have
The chokepoint guard fixes new growth. It does nothing for the table that's
already bloated — those old rows were written before the cap existed and will sit
there until something deletes them. So I added a management command to backfill the
cleanup across every environment, with a dry run because deleting history deserves a
look-before-you-leap:
$ python manage.py prune_drift_snapshots --dry-run
prod: would delete 41,902 (of 42,402)
staging: would delete 0 (of 88)
Done: would delete 41,902 snapshot(s), keeping 500 per environment.
$ python manage.py prune_drift_snapshots
prod: deleted 41,902 (kept 500)
Done: deleted 41,902 snapshot(s), keeping 500 per environment.
Same prune() underneath, just iterated over all environments — the per-write
guard and the bulk cleanup share one implementation.
One nice side effect: none of this needed a migration. It's a classmethod and a
call site, not a schema change. The existing rows are fine; they just stop being
immortal.
Takeaways
- An "append-only history" feature that records even when nothing changed is unbounded by design. Decide its retention the day you build it, not the day the disk fills.
- A
LIMIT/[:N]in your read path bounds the query, not the table. It will happily hide unbounded growth from your own UI. - Enforce the ceiling at the single point where rows are created, so the table is bounded continuously — not at a periodic job that leaves gaps.
- Ship a separate cleanup command for the rows that accumulated before the cap, and
give it a
--dry-run. New-growth guard and historical backfill aren't the same job.
This is the drift-history view in a self-hosted tool that tracks how your live AWS
drifts from Terraform over time — open source (MIT), one docker compose up:
syncvey.com. How do you handle append-only tables that exist
to be charted — hard cap, time-based TTL, or roll-ups into a summary table?










