All four agents tied at F1 ~0.95. The number was a mirage.

Three harnesses, two driver models, four runs. The traces show why an outcome metric on this dataset doesn’t mean what the agents reported it means — and a 22-point F1 gap when you actually test for generalisation.

Categories: coding-agents, hugging-face, agent-traces, glam, datasets
Author

Daniel van Strien

Published

May 6, 2026

Last week I gave two coding agents the same one-line prompt — fine-tune a classifier on biglam/on_the_books, train via hf jobs, push to the Hub. Both shipped working models with F1 within 1.5pp.

This week I added two more runs through the ml-intern harness — one driven by Claude Opus 4.6, one by Kimi K2.6. The point of the follow-up wasn’t to add data: the original two-run race couldn’t separate harness effects from model effects, and a closer read of the dataset surfaced a structural issue the original post missed. Four runs total, traces all public at davanstrien/agent-race-traces.

The 2×2:

             Claude Opus 4.6   Kimi K2.6
Claude Code  F1 0.962
Pi                             F1 0.947
ml-intern    F1 0.949          F1 0.947

By the metric that nominally defines success on this task, they tied. Each run used a different random 80/20 split — so the 1.5pp spread is also within seed noise.

The dataset has a source field

biglam/on_the_books is 1,785 North Carolina session laws (1866–1967), labelled binary for whether each section is a Jim Crow law. ~29% positive. It’s a merge from three different curation sources:

paschal and murray are pre-existing curated lists of segregation laws — Pauli Murray’s landmark 1951 States’ Laws on Race and Color, and the Paschal compilation. They are 100% and 92% positive by curatorial design, not by any signal the model could learn. project experts is the broader human-annotated sample with both classes.
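
That breakdown is one groupby away. A minimal sketch, assuming a single train split and that the column names above (source, label) match the dataset’s actual schema:

```python
# Per-source label rates: the same breakdown the agents computed in EDA.
# Assumptions: a single "train" split and columns named "source" and
# "label"; verify both against the dataset card.
from datasets import load_dataset

df = load_dataset("biglam/on_the_books", split="train").to_pandas()
print(df.groupby("source")["label"].agg(["mean", "count"]))
# Expected: paschal ≈ 1.00, murray ≈ 0.92, project experts mixed.
```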

So source is correlated with the label because of how the dataset was assembled. The classic “label leak” framing isn’t quite right — none of the agents trained on source as a feature; all four trained on section_text only. But every agent’s reported F1 rests on a random 80/20 split, which puts paschal and murray rows in both the train and eval sets. That doesn’t test generalisation across sources — it tests memorisation within a single distribution.

The eval the agents didn’t run

To check whether the headline F1 numbers are honest, train and evaluate the same way the agents did, then again with a held-out source. A simple TF-IDF + logistic regression baseline is enough — the eval-methodology gap matters more than the model:
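
Something like this. It’s a sketch rather than the exact script: the column names, the source values, and the choice to hold out both curated compilations are assumptions, flagged again in the comments.

```python
# A sketch of both evals on the same TF-IDF + logistic regression baseline.
# Assumptions (check the dataset card): one "train" split, text in
# "section_text", curator in "source", binary 0/1 labels in "label".
from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

ds = load_dataset("biglam/on_the_books", split="train")
texts, labels, sources = ds["section_text"], ds["label"], ds["source"]

def fit_and_score(train_texts, train_labels, test_texts, test_labels):
    clf = make_pipeline(
        TfidfVectorizer(max_features=20_000),
        LogisticRegression(max_iter=1000),
    )
    clf.fit(train_texts, train_labels)
    return f1_score(test_labels, clf.predict(test_texts))

# Eval 1: what the agents did. A random 80/20 split, all sources mixed,
# so paschal and murray rows land on both sides of the split.
tr_x, te_x, tr_y, te_y = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)
print("random-split F1:", fit_and_score(tr_x, tr_y, te_x, te_y))

# Eval 2: source-stratified. Train on project-experts rows only and
# evaluate on the curated compilations the model never saw. (Holding out
# both compilations at once is one possible protocol, not the only one.)
held_out = {"paschal", "murray"}
tr_i = [i for i, s in enumerate(sources) if s not in held_out]
te_i = [i for i, s in enumerate(sources) if s in held_out]
print("held-out-source F1:", fit_and_score(
    [texts[i] for i in tr_i], [labels[i] for i in tr_i],
    [texts[i] for i in te_i], [labels[i] for i in te_i],
))
```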

The result: a 22 F1-point gap between the random and the source-stratified eval, on the same baseline model. The stronger architectures the agents picked (ModernBERT, RoBERTa, legal-BERT) will narrow this gap but won’t close it — the issue is the eval setup, not the architecture. None of the four agents reported a number that tells you whether their model would handle a new compilation of laws curated under a different process. They all reported the random-split number.

This isn’t a bug in the dataset — biglam/on_the_books does what its README says. It’s a bug in how all four agents thought about the eval. They reported the metric they were given, not the metric the task actually needed.

What each agent did notice

All four agents trained on section_text only (three of them merely by default) and dodged the worst of it. None of the four ran the source-stratified eval. Only one agent surfaced the issue at all:

The wild bit: ml-intern + Kimi computed the per-source breakdown during EDA. The numbers (paschal: 100.0%, murray: 92.1%) were on its screen at event 26. It then trained, evaluated on a random split, pushed a card with metrics, and never mentioned the field. Claude Code computed the same breakdown and immediately noted in its model card that source would leak the label (paschal is 100% positive, murray is 92% positive).

Same data, same evidence, very different downstream outcomes.

Process tells four different stories

Outcome F1 was uniform. The traces are not:

  • Pi + Kimi has the leanest tool palette (bash, edit, write only) and the most reactive failure mode: four successive transformers API errors, fixed one at a time (KeyError: 'label', evaluation_strategy → eval_strategy, tokenizer → processing_class, "model did not return a loss") — see the sketch after this list.
  • ml-intern + Kimi is the noisiest — 67 tool calls, five tracebacks, multiple PATH issues, two cancelled jobs, a write→edit→compile→bash debugging cycle, and a legal-BERT pick it never justified against that model’s 512-token context budget.
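
For concreteness, here is roughly what two of those fixes look like: a minimal sketch against a recent transformers release, with a toy two-row dataset standing in for the real one and an arbitrary model choice (distilbert-base-uncased).

```python
# Reproducing the two renames on current transformers, with a toy dataset.
# Version assumption: a release recent enough to have eval_strategy and
# processing_class (older releases want the pre-rename names instead).
from datasets import Dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2
)

# Keeping a column literally named "label" is what sidesteps both the
# KeyError and the "model did not return a loss" error.
ds = Dataset.from_dict(
    {"section_text": ["An act to segregate...", "An act to fund..."],
     "label": [1, 0]}
).map(lambda b: tokenizer(b["section_text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="out",
    eval_strategy="epoch",   # was evaluation_strategy= before the rename
    num_train_epochs=1,
    report_to="none",
)
trainer = Trainer(
    model=model,
    args=args,
    processing_class=tokenizer,  # was tokenizer= before the rename
    train_dataset=ds,
    eval_dataset=ds,
)
trainer.train()
```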

The 2×2 partly isolates harness from model. Same model, different harness (Kimi in Pi vs ml-intern) burned iterations on different friction surfaces — transformers API drift in Pi, environment plumbing in ml-intern. Same harness, different model (Opus 4.6 vs Kimi in ml-intern) split 54 vs 127 events for the same task, with Opus starting from the data and Kimi from the docs. Both Kimi runs missed the eval-methodology issue. With N=4 these are gestures at attribution, not measurements — but tentatively: harness shapes process; model shapes proactive judgment, the part that catches what wasn’t asked for.

What this is for

Last month I argued that sentiment is a thin lens on agent traces and that behavioural patterns hold richer signal. F1 on a coding-agent eval is the same kind of thin lens — easy to read, hard to learn from, sometimes outright misleading. The signal is in the trace.

If you’ve used a coding agent for an ML task recently — especially one that nearly worked or failed in an interesting way — share the trace. The agent-trace viewer renders Claude Code, Codex, and Pi sessions directly from your local session directories: hf upload your-username/your-traces ~/.claude/projects --repo-type dataset, then tag the dataset format:agent-traces. The next iteration of this kind of analysis can include yours.