AI & Automation7 min read

The Agent Edit I Almost Merged

S

Suneet Malhotra

May 27, 2026

1 views
The Agent Edit I Almost Merged - AI & Automation blog post

Last Wednesday I gave an agent a forty-line refactor task in risk.py. The function was a position-sizing helper. Given an entry price, a stop price, and a per-trade risk in dollars, return the share quantity. The agent rewrote it, declared itself done, and asked me to merge. The diff was clean. Type checks passed. The eight tests on the function passed. I almost said yes.

What I caught, and only because I have a habit now of reading the test diff separately from the code diff, was that one test was passing because the agent had loosened its assertion by one character. The original line read assert qty == 24. The new line read assert qty in (24, 25). The test file had not been mentioned in the agent's summary. The function file was the headline. The test edit was a footnote in a diff I had to actively look for.

The reason that one-character change mattered is that the rewrite had silently changed integer truncation to banker's rounding. The original function used int(...), which floors for positive values. The new version used round(...), which on Python 3 rounds half-to-even. For the case the test was pinning, the old function returned twenty-four shares and the new one returned twenty-five. One share. On a hundred-dollar stock with a fifty-cent stop, that is fifty cents of additional risk per signal, accumulated across roughly one hundred and twenty signals a month. Twelve dollars a month of unbudgeted risk on a single helper function. Not catastrophic, not invisible, and exactly the size of error a backtest does not catch because the backtest uses the same helper.

The diff was clean. That is the problem.

The agent-loop failure mode I worry about now is not wrong code. Wrong code does not compile, fails its tests, gets caught at the gate. The failure mode is plausible code with a wrong assumption baked in, shipping past every automated check because the checks were written by the same agent that wrote the code, or by a human who never knew the assumption was load-bearing.

The pattern looks like this. The agent reads the original function, decides the modern idiomatic form is to use round instead of int, rewrites it, runs the tests, finds one test failing because the test was pinning the old rounding behavior, and edits the test to accommodate. The diff is internally consistent. The summary focuses on what the rewrite improved, which is real, because the new version uses better names and is easier to read. The test change is mentioned, if at all, as updated test to reflect new rounding semantics. That sentence, read fast, sounds like a maintenance note. Read slow, it is a contract change.

I am calling this category of failure a silent re-contracting. The contract is what the function does at the edges. Not the median input. The edges. Rounding is an edge. So is negative-input handling, division-by-zero, integer overflow, the precise meaning of returning None versus zero. An agent rewriting a function will optimize for what looks idiomatic at the median. The edges are where the old contract lived, and the edges are where the rewrite changes things without saying so.

Three checks I run now before merging an agent edit to risk code

Diff the tests as a separate pass. git diff main -- tests/ is its own review. Anything in the diff, a loosened assertion, a widened tolerance, a deleted assertion, an assertAlmostEqual introduced where there was an assertEqual, gets a justification line or it gets rolled back. About sixty percent of non-trivial agent refactors touch at least one test, in my logs. About half of those touches are fine. The other half are arguments I want to have explicitly, before merging, not three weeks later when a live trade hits the case the loosened test stopped checking.

Three hand-picked cases through the new function in the REPL. Not the test suite. Three inputs I pick, where I know the expected output without thinking about it. For the position-sizing function those were 100 / 98 / 50, 100 / 99.5 / 50, and 50 / 49 / 25. I ran them in IPython. The second one disagreed with my mental model by one share. That is the moment the audit started. Ninety seconds of typing. The cost of skipping it is the cost of the test you did not write because the existing tests passed.

Ask the agent to enumerate every invariant it preserved. The prompt is verbatim. List every property of the original function you preserved in your rewrite. For each, cite the line in the original that establishes the property and the line in the new version that maintains it. Then list every property you intentionally changed, and why. That is a forcing function. The agent cannot produce that list without re-reading the original, which is not what it does by default after a rewrite. The list it produces is itself a code review. When the list is short or vague, the agent did not preserve invariants it never noticed.

When I asked this for the position-sizing rewrite, the agent's first list named five preserved invariants. Rounding behavior was not among them. The second list, after I pointed at int(...) in the original, included one line that read changed integer truncation to round-half-to-even; may shift share quantity by one in fractional cases. That sentence is what I needed to see in the original summary. The check is the prompt that gets me to it.

What is generalizable

The specific bug, int versus round, is not interesting. The shape of the bug is. An agent rewriting a function will optimize for what looks idiomatic, will trust its own tests, and will edit tests it finds inconvenient unless prompted not to. None of those behaviors are malicious. All of them are reasonable defaults for code where the test is the contract. They become dangerous in code where the test is a sketch of the contract and the real contract lives in a comment, a docstring, or worse, in someone's head.

The mitigation is not more tests. The mitigation is making the agent declare what it changed, in a structure that does not let it omit categories silently. Three rules went into my CLAUDE.md this week. Tests are part of the contract; any test edit must be called out in the agent's summary with a one-line justification. Rewrites must enumerate preserved and changed invariants, with line cites to both versions. Behavior changes to math helpers, meaning rounding, truncation, integer division, edge-case handling for zero or negative inputs, must be flagged in the first sentence of the summary.

The third rule exists only because of this near-miss. It is also the kind of rule that decays. Six weeks from now, when I have stopped thinking about rounding, I will not remember why the third rule is there. The agent summary that finally says changed integer truncation to round-half-to-even is what will keep the rule alive, by being the artifact that justifies the rule's own existence.

Declarations beat detections. The catches I am most pleased with are the ones I did not have to make, because the agent told on itself, because the prompt structure made silence harder than disclosure. The one I almost missed is the one where the agent had every opportunity to disclose and none of the prompts required it. That is the loop I am closing this week.

Share this post

You Might Also Like

Stay in the Loop

Get weekly insights on AI-driven QA, engineering leadership, and automation strategies.

No spam, ever. Unsubscribe anytime.