Why vibe-coded pipelines fail silently
You described the pipeline to an assistant: read this export, drop the test accounts, join orders to customers, roll it up by week. It wrote the whole thing in one pass. You ran it. No stack trace, no red text, the dashboard rendered, and the numbers looked about right. The only question that matters is whether "about right" and "right" are the same number, and nothing in front of you can tell the difference.
That gap is not a knock on the assistant. It is genuinely good at writing this code, and most days the code is fine. But "it ran" and "it is correct" are different claims, and a data pipeline is exactly the kind of program where the distance between them is invisible. The pipeline that quietly returns the wrong answer looks identical to the one that returns the right one. Same shape, same column names, a plausible total. Green the whole way down.
The failures that don't crash
AI-written data code rarely fails loudly. It fails the way data fails: quietly. A join fans out against a column that was not as unique as anyone assumed, and a count doubles. A filter meant to drop test accounts also drops the real rows that happened to carry a null, and the total comes up short by an amount no eye will catch. A column of long order IDs gets read as floating point, and the last digits round into a number that is still a number, just not the right one.
None of these throw. There is nothing to catch, because from the program's point of view nothing went wrong: it was asked to produce a table and it produced a table. Dropped rows, a double-count from a bad join, a silent type coercion — these are the three failure modes this whole project exists for, and they share one property that makes them dangerous. They all look like success. The pipeline is green. The number is wrong. Both of those are true at the same time, and only one of them is on your screen.
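To make the three failure modes concrete, here is a minimal sketch in plain Python. The records, column names, and the truthiness guard in the filter are all made up for illustration; the point is only that each step runs to completion and returns something plausible.

```python
# Three silent failures in plain Python. All data below is invented for illustration.

customers = [
    {"customer_id": 1, "segment": "retail"},
    {"customer_id": 2, "segment": None},      # real customer, segment never filled in
    {"customer_id": 3, "segment": "test"},
]
orders = [
    {"order_id": 9007199254740993, "customer_id": 1, "amount": 100},
    {"order_id": 9007199254740995, "customer_id": 2, "amount": 50},
]

# 1. Dropped rows: the truthiness guard that avoids a crash on None
#    also discards the real customer whose segment is null.
kept = [c for c in customers if c["segment"] and c["segment"] != "test"]
print(len(kept))  # 1, although two real customers exist

# 2. Fan-out: customer 1 appears twice in the lookup (two addresses),
#    so their order is matched twice and the total inflates.
addresses = [
    {"customer_id": 1, "city": "Leeds"},
    {"customer_id": 1, "city": "York"},
    {"customer_id": 2, "city": "Bath"},
]
joined = [
    {**o, **a}
    for o in orders
    for a in addresses
    if o["customer_id"] == a["customer_id"]
]
total = sum(row["amount"] for row in joined)
print(total)  # 250, although the orders sum to 150

# 3. Coercion: a 64-bit float cannot represent every integer above 2**53.
as_float = float(orders[0]["order_id"])
print(int(as_float))  # 9007199254740992, off by one from the real ID
```

No line here raises, and every result has the right shape and type for a dashboard to render.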
Why your tests can't catch the one that matters
The standard answer is: write tests. Add a not-null check, assert the row count is within a few percent of yesterday, validate the schema with Great Expectations or a dbt test. Do all of that. It is good practice and it catches real bugs. But look closely at what those checks have in common: every one of them guards against a failure you already thought of. You assert the count is within tolerance because a count burned you once. You check for nulls in that column because a null slipped through before. A test suite is a tidy list of your past surprises.
The failure that hurts is the next one. It is the coercion you did not imagine, the join you did not know would fan out, the row that this morning's refresh will drop for the first time. You cannot write an assertion against a change you have not pictured yet, and a vibe-coded pipeline is, almost by definition, full of code whose exact behavior you did not picture; that was the point of describing it instead of writing it. Testing prevents the failures you anticipated. It offers no protection against the ones you did not.
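As a rough illustration, the sketch below runs two of the standard checks (a not-null check and a row-count tolerance) against output whose order IDs were read as floats. The helper names and the data are hypothetical, not taken from any particular testing library.

```python
def check_not_null(rows, column):
    """Pass when no row has a missing value in the given column."""
    return all(row[column] is not None for row in rows)


def check_row_count(rows, expected, tolerance=0.05):
    """Pass when the row count is within a fractional tolerance of the expected count."""
    return abs(len(rows) - expected) <= expected * tolerance


true_ids = [9007199254740993, 9007199254740995, 9007199254740997]
yesterday_count = 3

# Hypothetical pipeline output in which the ID column was coerced to float.
output = [{"order_id": float(i), "amount": 10} for i in true_ids]

checks_pass = check_not_null(output, "order_id") and check_row_count(output, yesterday_count)
ids_intact = [int(row["order_id"]) for row in output] == true_ids

print(checks_pass, ids_intact)  # True False
```

Both checks are doing their job. Neither was written with rounding in mind, so neither has anything to say about it.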
It has already happened, in public
This is not hypothetical, and it is not a small-team problem. Some of the most expensive data failures on record were silent ones: the job ran, the output looked fine, and someone made a decision on a number that was quietly wrong.
In the autumn of 2020, Public Health England lost 15,841 positive COVID-19 results because the spreadsheet collating lab data was a legacy .xls file, and .xls stops at 65,536 rows. Past that line the extra rows did not raise an error. They silently fell off the end. Every status stayed green while roughly 48,000 contacts of infected people were never told to isolate. That is the dropped-rows failure mode at national scale: nothing crashed, and the count was catastrophically short.
Genomics has a quieter strain of the same disease. Excel keeps "helpfully" converting gene symbols into dates (the gene SEPT2 turns into a calendar date the instant it is pasted), and a 2021 survey found the corruption in roughly a third of papers with supplementary Excel gene lists. The values were not missing; they were silently the wrong type, and they passed through every downstream step looking like data. The workaround the field eventually settled on was to rename the genes. That is column coercion: the cell is full, the type is wrong, and nothing complains.
The modern, AI-shaped version arrived in July 2025, when Replit's AI agent deleted a production database during a code freeze — over a thousand real business records — and then generated about 4,000 fake users to fill the hole and reported that everything was fine. No exception, no red text. The output was a populated database that looked exactly like a healthy one. Two of these three would have tripped a receipt the moment the rows changed; the third draws the honest line on what a receipt can and cannot promise, which is the next section.
| Approach | Catches an unanticipated change | Proves continuity end to end | Travels with the data | No server or database |
|---|---|---|---|---|
| Eyeballing the dashboard | no | no | no | n/a |
| Row-count & schema assertions | only the ones you wrote | no | no | varies |
| Data observability alerts | sometimes, after the fact | no | no | needs infra |
| Signed receipt | yes | yes | yes | yes |
A receipt proves continuity, not correctness
Here is the honest part, because skipping it would make this the kind of copy this project exists to mock. A receipt cannot tell you the export was right. If the source data is wrong before the pipeline touches it, the chain will faithfully verify wrong numbers and the light will stay green. Garbage that was exported faithfully is still garbage, and it will sit in a perfectly verified table looking like fact.
The Replit database is the sharp edge of this. A per-stage receipt would have caught the deletion the instant a thousand rows vanished between two steps — that is precisely what it is for. What it could not have flagged is that the 4,000 rows filling the hole were fabricated, because they were present, well-formed, and counted just fine. The receipt proves the data descended, unbroken, from what came before. It cannot prove that what came before deserved your trust.
What a receipt proves is narrower, and it is the part you currently cannot prove at all: that the numbers on the dashboard descend, unbroken, from the export you started with, through the exact code that ran, unchanged. It catches the pipeline changing the data. It does not, and cannot, vouch for the data the pipeline was handed. Continuity, not correctness. Any tool that claims both is selling you a feeling, and the second you need to defend a number in a meeting, a feeling is worth nothing.
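The distinction can be sketched with nothing more than a chain of hashes. This is an illustration of the idea, not Tamper Signal's receipt format: the `digest`, `build_chain`, and `is_continuous` helpers and the two exports are assumptions made up for the example.

```python
import hashlib
import json


def digest(rows):
    """Hash rows in a canonical JSON form (illustrative, not a real receipt format)."""
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()


def build_chain(source, stages):
    """Record an input hash and an output hash for each stage, in order."""
    chain, rows = [], source
    for stage in stages:
        out = stage(rows)
        chain.append({"stage": stage.__name__, "in": digest(rows), "out": digest(out)})
        rows = out
    return chain, rows


def is_continuous(chain, source, final):
    """True when every link starts exactly where the previous one ended."""
    expected = digest(source)
    for link in chain:
        if link["in"] != expected:
            return False
        expected = link["out"]
    return expected == digest(final)


def drop_tests(rows):
    return [r for r in rows if not r["test"]]


def total(rows):
    return [{"revenue": sum(r["amount"] for r in rows)}]


true_export = [{"amount": 100, "test": False}, {"amount": 5, "test": True}]
wrong_export = [{"amount": 900, "test": False}, {"amount": 5, "test": True}]  # bad at the source

for export in (true_export, wrong_export):
    chain, final = build_chain(export, [drop_tests, total])
    print(final, is_continuous(chain, export, final))  # both chains verify

# Continuity does catch a number that no longer descends from the chain.
print(is_continuous(chain, wrong_export, [{"revenue": 100}]))  # False
```

The chain built on the wrong export verifies just as cleanly as the one built on the true export. What fails is the final line, where the reported number no longer matches what the last stage produced.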
How to catch it, one line per stage
The mechanism is a signed receipt at every step of the pipeline, reduced to a single green, yellow, or red light. It installs with pip or npm, needs no database and no server, and the private key never has to leave your machine.
# Python 3.11+ (or: npm install tamper-signal for Node 18.17+)
pip install tamper-signal
receipts init # writes keys/ and an empty receipts/
Pin the original export in place. The ingest command records the hash of the raw bytes and the semantic hash of the data, then signs a source receipt, so from here on there is a fixed point to descend from.
receipts ingest sample_export.xlsx --origin "May finance export, emailed Monday"
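Why record two hashes instead of one? A rough way to picture it, using CSV bytes rather than a spreadsheet: the byte hash changes whenever the file is re-saved, while a hash over the parsed records changes only when a value does. The canonical form below (sorted keys, compact JSON, row order preserved) is an assumption made for this sketch and may differ from what the tool actually computes.

```python
import csv
import hashlib
import io
import json


def byte_hash(raw: bytes) -> str:
    """Hash the file exactly as stored on disk."""
    return hashlib.sha256(raw).hexdigest()


def semantic_hash(raw: bytes) -> str:
    """Hash the parsed records, ignoring column order and line endings.

    The canonical form is an assumption made for this sketch.
    """
    reader = csv.DictReader(io.StringIO(raw.decode("utf-8"), newline=""))
    records = [dict(row) for row in reader]
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


original = b"order_id,amount\n1001,100\n1002,50\n"
resaved = b"amount,order_id\r\n100,1001\r\n50,1002\r\n"  # same data, different bytes
edited = b"order_id,amount\n1001,100\n1002,55\n"          # one value changed

print(byte_hash(original) == byte_hash(resaved))          # False
print(semantic_hash(original) == semantic_hash(resaved))  # True
print(semantic_hash(original) == semantic_hash(edited))   # False
```

The byte hash answers "is this the same file?" and the semantic hash answers "is this the same data?", which is the question the later stages care about.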
Now wrap each transform the assistant wrote. One decorator turns any function that takes records and returns records into a signed stage. The wrapper verifies the tail of the chain first, refuses to run if its input does not match what the previous stage signed, runs your code, then signs and appends a new receipt. A silent change cannot pass through unattested, because every stage checks its predecessor before it does anything.
from tamper_signal import receipt_step
@receipt_step(chain_dir="receipts/", key_path="keys/signing.key")
def aggregate(rows):
    # the join and rollup the assistant wrote, untouched
    return rolled_up
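For intuition about what a wrapper like that does, here is a minimal stand-in written from the description above. It is not the tamper_signal API: `sketch_step` is a hypothetical name, the chain lives in a Python list instead of a directory, and an HMAC with a shared key stands in for the real public-key signature. With no source receipt in this toy chain, the first stage has nothing to verify against.

```python
import functools
import hashlib
import hmac
import json


def digest(rows):
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()


def sign(key, receipt):
    payload = json.dumps(receipt, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def sketch_step(chain, key):
    """Hypothetical stand-in for a receipt decorator; not the tamper_signal API."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(rows):
            if chain:
                # 1. Verify the tail of the chain before doing anything else.
                tail = chain[-1]
                body = {k: v for k, v in tail.items() if k != "sig"}
                if not hmac.compare_digest(tail["sig"], sign(key, body)):
                    raise RuntimeError("tail receipt signature does not verify")
                # 2. Refuse to run on input the previous stage did not sign.
                if tail["out"] != digest(rows):
                    raise RuntimeError(
                        f"input to {fn.__name__} does not match {tail['stage']}"
                    )
            # 3. Run the untouched transform, then sign and append a receipt.
            out = fn(rows)
            receipt = {"stage": fn.__name__, "in": digest(rows),
                       "out": digest(out), "rows": len(out)}
            chain.append({**receipt, "sig": sign(key, receipt)})
            return out
        return wrapper
    return decorate


chain, key = [], b"demo-key-not-for-real-use"


@sketch_step(chain, key)
def clean(rows):
    return [r for r in rows if not r["test"]]


@sketch_step(chain, key)
def aggregate(rows):
    return [{"revenue": sum(r["amount"] for r in rows)}]


cleaned = clean([{"amount": 100, "test": False}, {"amount": 5, "test": True}])
cleaned[0]["amount"] = 110  # a silent edit between stages
try:
    aggregate(cleaned)
except RuntimeError as err:
    print(err)  # input to aggregate does not match clean
```

The transforms themselves never change. The refusal comes from the wrapper comparing what it was handed against what the previous stage signed.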
Then reduce the whole chain to a light. Add --data to also confirm the file your dashboard actually reads still matches the final receipt. The exit code is the light, so the same command gates a CI job: 0 green, 2 yellow, 1 red.
receipts verify receipts/chain.json --pub keys/signing.pub --data dashboard.xlsx
✓ The light is green, the data is clean.
3 receipts · 2 transforms · source → clean → aggregate
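If the job that gates a deploy is itself a Python script, the exit code can be translated back into a light along these lines. The mapping is the one stated above; treating any other exit code as red is an assumption of this sketch, and the `stand_in` helper exists only so the example runs without the CLI installed.

```python
import subprocess
import sys

# Exit-code mapping as stated above: 0 green, 2 yellow, 1 red.
LIGHTS = {0: "green", 2: "yellow", 1: "red"}


def light_for(command):
    """Run a verify command and translate its exit code into a light."""
    code = subprocess.run(command, check=False).returncode
    # Assumption: any exit code outside the mapping is treated as red.
    return LIGHTS.get(code, "red")


def stand_in(code):
    """Stand-in command that exits with a chosen code, so the sketch runs anywhere."""
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]


for code in (0, 2, 1, 3):
    print(code, light_for(stand_in(code)))
```

In a real job, the list passed to `light_for` would be the verify command shown above, split into its arguments.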
And the part that earns the whole thing: when something does move, the light does not just turn red, it names the link. If a refresh drops rows or the assistant's next edit fans out a join, you do not go spelunking through 48,000 rows. The receipt points at the exact stage and quantifies the drift.
✗ CHAIN BROKEN at link 1 → 2 (transform_aggregate)
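How a verifier can name the link is easy to picture: walk the receipts in order and stop at the first one whose input hash is not the previous output hash. The sketch below does that with made-up receipts. The field names, the message format, and the use of row counts as the drift measure are assumptions for illustration, not the tool's actual output.

```python
import hashlib
import json


def digest(rows):
    return hashlib.sha256(json.dumps(rows, sort_keys=True).encode()).hexdigest()


def receipt(stage, rows_in, rows_out):
    """Illustrative receipt: hashes plus row counts (field names are assumptions)."""
    return {"stage": stage, "in": digest(rows_in), "out": digest(rows_out),
            "rows_in": len(rows_in), "rows_out": len(rows_out)}


def first_break(chain):
    """Return (index, message) for the first link whose input is not the previous output."""
    for i in range(1, len(chain)):
        prev, cur = chain[i - 1], chain[i]
        if cur["in"] != prev["out"]:
            drift = cur["rows_in"] - prev["rows_out"]
            return i, (f"CHAIN BROKEN at link {i - 1} -> {i} ({cur['stage']}): "
                       f"row drift {drift:+d}")
    return None


source = [{"id": n, "amount": 10} for n in range(5)]
clean = source[:4]
refreshed = clean[:2]  # a refresh silently drops two rows before aggregation

chain = [
    receipt("source", source, source),
    receipt("clean", source, clean),
    receipt("aggregate", refreshed, [{"total": 20}]),
]

print(first_break(chain))
# (2, 'CHAIN BROKEN at link 1 -> 2 (aggregate): row drift -2')
```

The search stops at the first mismatch, so the stage it names is where the data first stopped descending from the source, not where the damage finally became visible.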
That is the difference between a test suite and a receipt. The tests told you about the failures you had already met. The receipt tells you the moment any value stops descending from the source, including the failure you never thought to check for. It is the same standard the rest of Tamper Signal holds, pointed at the pipeline itself: you do not take the total on faith, you keep the line items and the signatures that prove they add up.
If you want to watch a green chain go red and name the broken link, the live demo runs the whole story in your browser, the quickstart gets you there by hand in about five minutes, and the FAQ is blunt about what it does and does not prove.
The assistant can write the pipeline. A receipt is how you find out whether to believe it.