All Our Tests Passed. The Agent Was Still Broken.

The future of testing looks a lot like 1959.

Agentic COBOL: how a 1959 idea shipped our 2026 LLM plugin


Disclosure: This post reflects independent personal experimentation and my own hands-on work on personal open-source projects. It reflects only my personal views, is not professional advice, and does not represent any organization, employer, or official position.

Last week, CI was a wall of green. Every Python unit test passed. The plugin manifest validated. The deterministic backend ran cleanly when invoked directly. And every time a user typed "what's in my portfolio right now?" into Claude Code, the agent would shrug and ask to install the plugin. The plugin was already installed. The plugin was already loaded.

The bug was not in the code. The bug was upstream of the code, in the LLM's tool-selection layer. Modern testing pyramids assume the system under test is the code. For Claude Code plugins, MCP servers, Cursor extensions, the whole post-2024 ecosystem of "natural language calls the right function," the system is the LLM-plus-tools combo. Half of it is which tool the LLM picks when a human types a sentence at it.

We needed a different kind of test. The pattern we landed on dates back to 1959.

Agentic COBOL. Testing agent systems by feeding real natural-language prompts into real runtimes, then scoring whether the correct tool was invoked. No mocks, no SDK fixtures, no faith.

This is the story of how we got there, what it cost us, and how we shipped the InvestorClaude plugin to the Anthropic Marketplace this week with 100% routing accuracy across both consult-provider modes. Six trials, 180 routing decisions, zero failures.

If you came for the code, it is at the bottom. Full harness, corpus, and scorer in the public repo.


This is the bug your tests don't see

Concrete example from our InvestorClaw fleet last cycle.

A user asks: "Any big mergers or acquisitions in the news today?"

The plugin has a market-news command with a tested Python implementation that fetches M&A headlines correctly. Unit tests on the function pass.

The agent does not invoke that command. It does not invoke anything. It answers from training data. The user gets stale, hallucinated output that sounds current.

How did this happen? The command's description was three words: "News headlines fetcher." Accurate, technically. Invisible to the LLM routing layer. "M&A" was not in the description. "Mergers" was not. "Acquisitions" was not. "Today" was not. The agent had no signal that this command was relevant to a question about M&A.

This is the silent-misroute bug class. Not a code bug. Not a logic bug. A description-as-API bug. The LLM-facing surface of the tool was undercommunicated. Unit tests cannot see it because the LLM is not in their loop. End-to-end Selenium-style tests cannot enumerate enough natural-language phrasings to catch it. LLM-eval frameworks such as RAGAS, DeepEval, and LangSmith evaluate output quality. They grade the agent's final answer for accuracy and coherence. They do not measure tool selection. The agent can produce a beautifully fluent hallucinated answer with zero tool calls and score perfectly on output-quality metrics while shipping wrong data to the user.

The class is ubiquitous in plugin, skill, and MCP-server ecosystems. Every product whose value proposition is "agent picks the right tool from natural language" has it. Every CI suite that does not include the LLM in the test loop is blind to it.


Why we didn't want a Python-driven API test

Our first instinct was to write Python tests. We are Python people. The test pyramid says: stub the LLM with a mock that returns "you should call news_fetch," then assert your harness dispatches correctly to the right function.

We wrote some of those tests. They were the worst we ever shipped.

The bug was the LLM not calling the right function. Mock the LLM with a fixture that says "call news_fetch," and you have asserted that, given a perfect oracle, your dispatcher works. It does. That is not the class of bug we needed to catch. The mocked-LLM test passes identically whether the agent in production routes well or routes badly. It is a confidence trap.

API-level Python tests have the same disease one layer down. You write assert agent_client.complete(prompt).tool_call.name == "news_fetch". That looks like a real test. It is mocking the LLM at the SDK boundary. The actual agent process never runs. Not Claude Code's slash-command planner. Not OpenClaw's tool-use loop. Not Hermes Agent's ReAct loop. You are testing your shim, not the system.

The only test that catches silent misrouting is one where:

  1. A real human-style prompt goes in, character for character.
  2. The actual agent runtime processes it, with the actual model, the actual tool catalog, the actual description text users see.
  3. The agent emits a tool invocation, or does not.
  4. We score whether the invocation matches what we expected.

That test is slow. Each iteration is a real LLM call. Ten to thirty seconds, real API spend, non-deterministic output across runs. You cannot run it in a 50ms unit-test loop. You cannot even run it in a five-minute CI step without thinking carefully about parallelization and budget.

That slowness is the point. The test must be slow, deliberate, and real because the surface under test is exactly that. A real human typing a real sentence at a real LLM, hoping the right tool fires. There is no faster substitute that retains the signal.
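
Concretely, here is a minimal sketch of that loop for a single prompt. It assumes the claude CLI invocation described later in this post (claude -p --plugin-dir --output-format=stream-json) and a stream-json event shape in which tool calls surface as tool_use content blocks; both are assumptions to verify against your installed runtime, not a drop-in copy of our harness.

# Minimal sketch: one real prompt through the real agent, scored for routing.
# Flags and event shapes are assumptions; check them against your CLI version.
import json
import subprocess

def route_check(prompt: str, expected_route: str, plugin_dir: str) -> bool:
    proc = subprocess.run(
        ["claude", "-p", prompt,
         "--plugin-dir", plugin_dir,
         "--output-format", "stream-json"],
        capture_output=True, text=True, timeout=120,
    )
    invoked = []
    for line in proc.stdout.splitlines():
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue                      # skip non-JSON noise in the stream
        if not isinstance(event, dict):
            continue
        message = event.get("message")
        if not isinstance(message, dict):
            continue
        # Assumption: tool calls appear as content blocks shaped like
        # {"type": "tool_use", "name": ..., "input": {...}}.
        for block in message.get("content", []):
            if isinstance(block, dict) and block.get("type") == "tool_use":
                name = block.get("name", "")
                skill = block.get("input", {}).get("skill", "")
                invoked.append(f"{name} {skill}".strip())
    # Step 4: does any real invocation match what we expected?
    return any(expected_route in call for call in invoked)

# route_check("Any big mergers or acquisitions in the news today?",
#             "ask", "./plugins/investorclaude")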

If this sounds familiar, it should. Every generation rediscovers it.


The 1959 idea we never got rid of

You thought we got rid of all the COBOL. We didn't. Banks still run on it. The IRS still runs on it. Every recession, a state unemployment system buckles and reminds everyone that we just hid the COBOL, never replaced it.

Turns out it had one more job in it. Testing the systems meant to replace it.

COBOL was designed for a specific audience: domain experts who were not programmers. An accountant should be able to read the source and verify the program did what they asked.

ADD MONTHLY-PAY TO YEAR-TO-DATE-EARNINGS GIVING NEW-TOTAL.
IF NEW-TOTAL > BONUS-THRESHOLD THEN PERFORM CALCULATE-BONUS.

That is not pseudocode. That is executable COBOL. The acceptance test was the source's readability.

English-as-interface, machine-as-router. The domain expert speaks. The machine routes to the right operation. The acceptance test is "read the prompt aloud and check the system did what was asked."

That is exactly the problem we have with agent-skill products in 2026. Except the parser is now stochastic. Same prompt, different routing across runs, with empirical noise floors around 80% on a tuned surface. Compile-time guarantees become empirical sampling. Multi-trial averaging. Per-runtime gates that respect each model's floor.

The methodology transfers cleanly. Write down the prompt. Write down what the system should do. Run it. Score whether it did.

The lineage is not unique to COBOL. Knuth's literate programming was after the same thing: source should be readable to non-programmers. BDD's Given/When/Then is the same. Cucumber's .feature files: the spec IS the test IS the readable English. They were all pointing at the same insight. For systems whose correctness includes a human-language layer, the test must include that layer.

We forgot it for two decades because most systems lacked a human-language layer. APIs took JSON, tests sent JSON, both spoke the same syntax. The test pyramid (unit, integration, e2e) was sufficient. Agent-skill products bring the human-language layer back, and the 1959 acceptance pattern is exactly what we need.


How we got here: harness v6 → v13

The methodology did not arrive whole. Each version was forced into existence by the failure mode of the previous one.

v6.x. Drive the agent, not the code. Single runtime, single host, phased workflow (W0 lifecycle through W8 reporting). The move that mattered: stop calling Python skill scripts directly, drive the agent with natural-language prompts, and record everything. The result row schema was already 150 fields wide. Routing verification, model-config readback, raw I/O hashes, fabrication flags. The COBOL discipline of "the spec IS the test" first appeared as "the test record IS the audit trail."

What v6 missed: silent misrouting on the unenumerated long tail. Tightening one tool description shifted attention across the catalog. The canonical 15-prompt suite would pass while the agent was still misrouted in production.

v7.x. Multi-host plus reviewer separation. Added Raspberry Pi targets so the same 30-prompt corpus exercised ZeroClaw on a 2GB edge device, an 8GB Pi, and the x86 dev host simultaneously. The finding: routing accuracy on the 2GB Pi matched the 8GB Pi. The routing layer does not care about edge memory pressure. Only synthesis does. You only catch that when the same test runs across substrates.

v7 also separated the reviewer session from the harness session. Earlier versions mixed orchestration with oversight in one context-accumulating session, making late-run results unreliable. v7.1 made fresh reviewer sessions mandatory and recorded the model-under-test before any prompts hit the agent. Fixes a whole class of "the test passed, but we don't know which model it tested" reporting bugs.

v13. The matrix becomes the spec. v13 retired the Python harness entirely. The spec itself is the runbook. Claude reads it, calls each runtime's CLI in sequence, validates the deterministic envelope, and records exact evidence. Eight capabilities, three hosts, three runtimes, with an explicit drift-stop list for the "I'll just write a Python script to check" temptation that everyone falls into when the real test feels slow.

The slowness is the point. v13 is openly an orchestration matrix: 30 prompts × 3 runtimes × 3 hosts × N model combinations × N consult-provider modes. Each cell is a real agent invocation, recorded in raw form. You cannot run the whole matrix in under a couple of hours. That is the cost of catching bugs the test pyramid cannot see.
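
The cost is easy to sanity-check. Back-of-envelope, using the ten-to-thirty-second real-call latency from earlier and leaving the model and consult-mode dimensions out (the labels below are illustrative placeholders, not the harness's actual identifiers):

# Illustrative arithmetic only: how big the base matrix is before the
# model-combination and consult-mode dimensions multiply it further.
from itertools import product

prompts  = range(30)                                   # the 30-prompt corpus
runtimes = ["runtime-a", "runtime-b", "runtime-c"]     # placeholder names
hosts    = ["x86", "pi-2gb", "pi-8gb"]                 # the three hosts above

cells = list(product(prompts, runtimes, hosts))
seconds_per_call = 25                                  # assumed mid-range latency
print(len(cells))                                      # 270 real agent invocations
print(round(len(cells) * seconds_per_call / 3600, 1))  # ~1.9 hours of wall clock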

The current cycle (v2.6.0 release for InvestorClaude) used a dimensional slice of v13. Just Claude Code on the Linux x86 host, two consult-provider modes (Together AI cloud, local GPU), N=3 trials each. 180 routing decisions. Zero variance. That is six cells of the larger matrix. The rest is what we run quarterly and what catches the cross-runtime regressions any single cell would miss.

The progression follows a familiar testing-discipline arc. Unit, integration, end-to-end. The COBOL framing did not make our v13 harness more complex than it needed to be. It gave us a stable pattern that scaled cleanly as we added dimensions. The schema we wrote in v6.x is mostly unchanged in v13, just with more dimensions of context per record.


The harness in practice

The corpus is a single JSON file: harness/cobol/nlq-prompts.json. 30 prompts, each with an ID, the natural-language utterance a user might type, and the canonical tool the agent should invoke. Keyed per-runtime, because the same product exposes a slightly different surface in OpenClaw vs. Claude Code vs. Hermes.

{
  "id": "p16-news-merger",
  "intent": "news-merger",
  "prompt": "Any big mergers or acquisitions in the news today?",
  "expected_routes": {
    "investorclaude": ["ask"],
    "investorclaw":   ["portfolio_market section=news topic=merger"]
  }
}

There is a runner per runtime family. For Claude Code, it is a 60-line bash script that drives claude -p --plugin-dir <path> --output-format=stream-json against each prompt and pipes the streamed JSON events through a small Python parser that extracts tool_use invocations and scores against the expected route. For OpenClaw, ZeroClaw, and Hermes, a similar Python runner that docker execs into the running agent container.

Per-runtime acceptance gates live in the same JSON, calibrated to the standard inference-stack configurations we recommend for each runtime. The configs are reproducible by anyone. No internal fleet-only orchestration involved. End-users running with different stacks (smaller models, different providers, etc.) should expect different noise floors.

| Runtime  | Stack                                                                                           | strict | publish |
|----------|-------------------------------------------------------------------------------------------------|--------|---------|
| OpenClaw | Together MiniMax-M2.7 narrative + local Gemma4 consult, primary+fallback (no consensus voting)   | 21/30  | 24/30   |
| ZeroClaw | Single-provider LLM-driven routing                                                               | 21/30  | 24/30   |
| Hermes   | Smaller open-weights models                                                                      | 17/30  | 20/30   |
| Claude   | Anthropic-hosted                                                                                 | 21/30  | 24/30   |

The fleet verdict is the conjunction: every runtime at or above its publish_bar ships, any runtime below its strict floor blocks the release, and the band in between is a strict pass. The gates are empirical, re-derived whenever we change a stack default. They are not universal targets.
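
A sketch of that verdict logic, with the gate numbers from the table above. The function names, the dict layout, and the ship-with-caveats branch for the strict-pass band are illustrative shorthand, not the harness's actual API:

# Fleet verdict: conjunction of per-runtime gate checks.
# Gate values mirror the table above; everything else is illustrative.
GATES = {
    "openclaw": {"strict": 21, "publish": 24},
    "zeroclaw": {"strict": 21, "publish": 24},
    "hermes":   {"strict": 17, "publish": 20},
    "claude":   {"strict": 21, "publish": 24},
}

def runtime_verdict(score: int, gate: dict) -> str:
    if score >= gate["publish"]:
        return "PUBLISH"
    if score >= gate["strict"]:
        return "STRICT_PASS"
    return "FAIL"

def fleet_verdict(scores: dict) -> str:
    verdicts = {rt: runtime_verdict(s, GATES[rt]) for rt, s in scores.items()}
    if any(v == "FAIL" for v in verdicts.values()):
        return "BLOCK"               # any runtime below its strict floor blocks
    if all(v == "PUBLISH" for v in verdicts.values()):
        return "SHIP"                # every runtime at or above its publish bar
    return "SHIP_WITH_CAVEATS"       # strict passes get called out in the report

# fleet_verdict({"openclaw": 22, "zeroclaw": 25, "hermes": 17, "claude": 30})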

That is it. A JSON corpus, a runner per runtime, a markdown report that prints PASS/FAIL per prompt, and a verdict. No new framework. No SDK. No vendored mock library. The whole methodology fits in 800 LOC.

If "Agentic COBOL" sounds like marketing fluff so far, here is what the methodology actually looks like in real COBOL syntax. Written the way Grace Hopper's committee would have written it in 1959, because the pattern translates directly:

       IDENTIFICATION DIVISION.
       PROGRAM-ID. AGENT-ROUTING-ACCEPTANCE.
       
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       01  TEST-CASE.
           05  PROMPT-ID                PIC X(20).
           05  USER-PROMPT              PIC X(200).
           05  EXPECTED-ROUTE           PIC X(40).
       01  AGENT-RESPONSE.
           05  TOOLS-INVOKED            PIC X(200).
       01  COUNTERS.
           05  PASS-COUNT               PIC 99 VALUE ZERO.
           05  FAIL-COUNT               PIC 99 VALUE ZERO.
           05  STRICT-FLOOR             PIC 99 VALUE 21.
           05  PUBLISH-BAR              PIC 99 VALUE 24.
       
       PROCEDURE DIVISION.
       
       EVALUATE-EACH-PROMPT.
           READ TEST-CORPUS INTO TEST-CASE
               AT END GO TO REPORT-VERDICT.
           CALL "SEND-PROMPT-TO-AGENT"
               USING USER-PROMPT, AGENT-RESPONSE.
           IF EXPECTED-ROUTE IS PRESENT IN TOOLS-INVOKED
               ADD 1 TO PASS-COUNT
               DISPLAY "  PASS " PROMPT-ID
           ELSE
               ADD 1 TO FAIL-COUNT
               DISPLAY "  FAIL " PROMPT-ID
                       " expected=" EXPECTED-ROUTE
                       " detected=" TOOLS-INVOKED
           END-IF.
           GO TO EVALUATE-EACH-PROMPT.
       
       REPORT-VERDICT.
           IF PASS-COUNT >= PUBLISH-BAR
               DISPLAY "VERDICT: PUBLISH"
           ELSE IF PASS-COUNT >= STRICT-FLOOR
               DISPLAY "VERDICT: STRICT-PASS"
           ELSE
               DISPLAY "VERDICT: FAIL"
           END-IF.
           STOP RUN.

Read that aloud. An accountant could verify whether it represents what they asked for. Exactly the original COBOL design ethos. Swap the CALL "SEND-PROMPT-TO-AGENT" for an HTTPS call to Claude or OpenClaw or Hermes, and the IS PRESENT IN check for a substring match against tool-name and Skill invocations in the returned transcript, and you have our actual harness translated back to the 1959 idiom. The Python implementation is shorter; the control flow and test contract are preserved beat for beat.

That is the point of "Agentic COBOL." Not the language. The acceptance pattern. The pattern survives 67 years of substrate churn (punched cards, mainframes, UNIX, JVM, Python, and stochastic LLMs) because the human-language layer it tests remains the one constant.


How it shipped InvestorClaude v2.6.0 to the Anthropic Marketplace

The empirical narrative since the v2.3.x cycle is a textbook case for why this pattern earns its keep:

  • v2.3.4 baseline on a 15-prompt set: 9/15 = 60% on Claude Code. Six failures, all silent misroutes. The agent answered without invoking the right tool. Description-as-API debt was ubiquitous.
  • v2.3.5 description tuning: 12/15 = 80%. Three remaining failures looked like over-routing. The setup command was greedily matching every portfolio query.
  • v2.3.6 narrowed setup: 11/15 = 73%. Three new regressions in commands whose descriptions were not even touched. Important discovery: LLM routing has a global attention layer. Tightening one description shifts attention across the whole catalog.
  • v2.3.7 rebalanced description weights: 12/15 = 80%, plateau confirmed.
  • v2.4.0 architectural correction: 27 granular commands → 9 consolidated tools. Claude Code 19/30 = 63% on the new 30-prompt set. Better surface, lower score. The new corpus exposed failure modes the 15-prompt set did not reach.
  • v2.5.0 adapter consolidation onto ic-engine v2.5.0: 24/30 = 80%. The six failures clustered on news/market deflects and cross-skill ambiguity (investorclaude vs investorclaw).
  • v2.5.2 initial published release: Claude Code 30/30 = 100%, with a scorer-correctness story worth its own aside.
  • v2.6.0 Anthropic Marketplace submission cycle. Codex adversarial review caught a pile of pre-submission issues: allowed-tools too permissive, plugin cache path wrong in the install docs, NVIDIA-specific config leaking, license SPDX form inconsistent. We fixed those, then ran the canonical 30-prompt barrage three times under each of two consult-provider modes: Together AI's google/gemma-4-31B-it for the cloud path, and a local llama-server running Gemma 4 on a CERBERUS GPU host for the local path. Six trials. 180 routing decisions.

The result:

| Mode                              | Mean  | Stdev | Min | Max | Verdict |
|-----------------------------------|-------|-------|-----|-----|---------|
| cloud (Together / Gemma-4-31B)    | 30/30 | 0.0   | 30  | 30  | PUBLISH |
| local (CERBERUS / Gemma4-consult) | 30/30 | 0.0   | 30  | 30  | PUBLISH |

Zero variance. Zero deterministic failures. Zero noisy prompts. Zero cross-mode delta. Every single prompt was a deterministic pass in both consult modes. Breaking the 12/15 plateau from the v2.3.x cycle with v2.4.0's structural consolidation, then holding the gain all the way through to a clean release, is the kind of empirical evidence a 30-second chat with the agent never gives you.

Cross-runtime snapshot at v2.6.0, in the publicly reproducible configurations (no internal-only orchestration, no Anthropic-as-Claw, no XAI):

| Runtime     | Stack                                                        | Score            | Verdict                          |
|-------------|--------------------------------------------------------------|------------------|----------------------------------|
| Claude Code | Anthropic-hosted, default model                              | 30/30 × 6 trials | PUBLISH                          |
| OpenClaw    | Together MiniMax-M2.7 + local Gemma4 consult + Groq fallback | 22/30            | STRICT_PASS                      |
| Hermes      | Google Gemini-2.5-Flash + local Gemma4 consult               | 17/30            | STRICT_PASS (at the strict gate) |
| ZeroClaw    | (provider config under repair)                               | n/a              | DEFERRED                         |

The heterogeneity is the signal. Same product, same prompts, very different routing accuracy depending on the LLM substrate. Claude Code's perfect score is not what every Claw-family runtime sees. Publishing to a marketplace requires being honest about what end-users will actually experience.

We submitted to the Anthropic Marketplace on 2026-04-29 with the pinned SHA from the cleansed v2.6.0 release as our shipping artifact. The COBOL evidence (180/180 on Claude Code across two provider modes, six trials, zero variance) was the ship signal for the marketplace listing. Not "the unit tests passed." Not "the manifest validated." The empirical fact that, sampled across 180 real LLM invocations, the routing layer behaved deterministically.


What it's catching across the Claw fleet

Same harness, same corpus, four runtimes. Each fails differently.

OpenClaw runs Together MiniMax-M2.7 plus a local Gemma4 consult; no consensus orchestration. It is open source with a fully configurable inference stack, so anyone with the same provider keys can replicate the setup directly. Failure mode: conversation-state contamination in the agent's session memory. Once the skill returns "no data" twice, the agent stops invoking it. Routing accuracy depends entirely on the configured models. The gate sits at 24/30 publish, the same bar as the other full-size stacks.

ZeroClaw on Raspberry Pi 4 and 5 edge hardware. Rust runtime, single-provider LLM routing, 16GB-or-less RAM. No failure mode is specific to the edge. The routing layer does not care about memory pressure. Compute matters only for the downstream synthesis step. Routing is identical to the x86 host's score on the same provider.

Hermes Agent with smaller open-weight models. The context window is too tight to weigh many competing tool descriptions. Falls back to its training data prior on familiar topics. Produces fluent-sounding answers about news, prices, etc. with zero tool calls. If your pretraining "knows" the topic, the smaller model will skip your tool. Diagnostic, and unfixable from the prompt-tuning side.

Claude Code, Anthropic-hosted. The marketplace target. No failure mode observed at v2.6.0. 30/30 across six trials spanning two consult-provider modes (Together cloud + local CERBERUS Gemma4). The consult endpoint does not influence routing. The decision is upstream. We tested both modes to verify. The assumption held.

Same product, same prompts, same tool catalog, behaving differently across LLM substrates. That is not a bug. That is the signal. A single-runtime harness lets a vendor hide model-specific weaknesses. A cross-runtime one, with gates tuned per substrate, makes the comparison legible to operators choosing where to deploy.


Aside: when the test fixture lies

The v2.5.1 → v2.5.2 jump from an apparent 1/30 (far below the publish bar) to 30/30 on the same recorded agent runs was the most uncomfortable lesson of this cycle.

The harness records the full agent transcript for every prompt. The v2.5.1 scorer parsed claude -p --output-format=stream-json events looking for tool invocations on Bash or directly-named slash-command tool calls. It missed the actual shape Claude Code now ships. Plugin slash commands surface as a Skill tool call with input.skill = "investorclaude:ask". The slash name lives in the input, not the tool name.

The agent had been routing perfectly the entire time. The scorer had not.

Rescoring the captured stream JSON with a Skill-aware extractor turned the apparent 1/30 failure into a real 30/30 pass. Same agent. Same prompts. Same model. Different lens.

The discipline that protects you here: always commit the raw artifact. The tool_invocations field in the JSONL is the truth. detected is the interpretation. When you can prove the interpretation was wrong without re-running the agent, you have a shippable fix in minutes. When you have to re-run, you have already lost a day to provider quotas.

The test pyramid for agentic systems needs a layer that the unit-test era never had. Scorer correctness. Treat the scorer like production code. Test it against recorded transcripts. Version its detection logic. Make it auditable. Our scorer regression test (test_parse_stream_json.py) runs the buggy v2.5.1 logic alongside the fixed v2.5.2 logic against a synthetic fixture. The old logic must reproduce its failure mode. The new logic must catch every shape. Both invariants are asserted on every CI run.
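
A condensed sketch of what that pair of invariants looks like. The event shape and the investorclaude:ask skill string follow the description above; the function names and the fixture are illustrative, not the actual test_parse_stream_json.py:

# Scorer regression sketch: old tool-name-only logic must keep missing the
# Skill shape; new Skill-aware logic must catch it. Fixture is synthetic.
def detect_v251(events):
    # Old lens: records only the top-level tool name, so a plugin slash
    # command surfacing as Skill(input.skill=...) never matches.
    return [e.get("name") for e in events if e.get("type") == "tool_use"]

def detect_v252(events):
    # New lens: for Skill calls, read the slash-command name from input.skill.
    found = []
    for e in events:
        if e.get("type") != "tool_use":
            continue
        if e.get("name") == "Skill":
            found.append(e.get("input", {}).get("skill", ""))
        else:
            found.append(e.get("name"))
    return found

def test_skill_shape_detection():
    fixture = [{"type": "tool_use", "name": "Skill",
                "input": {"skill": "investorclaude:ask"}}]
    assert "investorclaude:ask" not in detect_v251(fixture)   # old bug reproduced
    assert "investorclaude:ask" in detect_v252(fixture)       # new lens catches it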

That is the second lesson Agentic COBOL teaches. Measurement is also code, also subject to bugs, also worth its own discipline.


When this matters, who should use it

Anyone whose product is "the agent picks the right tool from natural language." Claude Code plugin authors, MCP server developers, Cursor and Windsurf and Codex extension builders, and the next wave of agent ecosystems we do not yet have names for.

The trade-off is unavoidable. A v2.6.0-scale release run is 180 real LLM calls, ~100 minutes of wall clock time, and roughly $20 to $30 in tokens. Run weekly, that is a $100/month line item. Run on every PR, budget accordingly.

The cost of not running it is silently shipping wrong answers to users. The bugs this catches are exactly the ones the test pyramid cannot see: descriptions that read fine to humans but are invisible to the LLM, slip past every Python unit test you have written, and manifest only when a real user types a real sentence.

After the v2.4.0 plateau break, every subsequent round of description tuning we tried either improved the score or regressed it visibly. The methodology ratchets you forward instead of letting you spin in circles. That is the value proposition.


The deeper takeaway

The lesson is not "use COBOL." It is the discipline COBOL was after. For systems whose correctness includes a human-language layer, the test must include that layer.

The test pyramid we inherited from Beck and Fowler was built for a generation of software whose interfaces took JSON and emitted JSON. Both sides spoke the same syntax. The human language was the spec, not the runtime. When something broke, the breaking surface was always machine-readable.

Agent-skill products break that assumption. The runtime now includes human language as a first-class input. The agent is the parser. The parser is stochastic. Tests have to follow the parser into its vernacular.

We did not invent a new testing paradigm. We rediscovered one. The difference is that now the parser is stochastic, and pretending it is not is how these systems fail.


Code + corpus

The full Agentic COBOL harness (corpus, runners, scorer, aggregators) ships in the public InvestorClaw repository.

If you are shipping agent-skill products and are considering this approach, the corpus is a reasonable starting point. Fork it, swap the per-prompt expected routes for your tool surface, and you have a working harness inside an afternoon. If you want to compare notes on runtime noise floors, find me on Bluesky.
