Two agents, one prompt
Pi + Kimi K2.6 and Claude Code both fine-tuned a Jim Crow law classifier from a single sentence. The interesting gap wasn’t F1.
Agents are getting more and more capable of training models. This means domain experts can have agents fine-tune models for them without writing code themselves. But how do different agents approach the same task? Do they make the same choices? How good are the models they produce?
I gave two coding agents the same one-line task and watched what each one decided to do.
The task: fine-tune a model on biglam/on_the_books — UNC Chapel Hill Libraries’ labelled training set from On the Books: Jim Crow and Algorithms of Resistance — and push the trained classifier to the Hub. Train via hf jobs.
One agent was Pi running on Kimi K2.6 (open weights, via API). The other was Claude Code on Opus 4.7. Same prompt, parallel runs, ~13 minutes each.
The prompt
Fine-tune a model on biglam/on_the_books to identify Jim Crow laws.
Train via hf jobs and push the trained model to my namespace.
Run `hf --help` to understand the Hub CLI and `hf jobs uv run --help`
to understand how to submit uv scripts. You can use `uv run --with`
to run small scripts for exploring the dataset.
Start by exploring the dataset structure, then proceed to choose
and fine-tune an appropriate model.
Push the final model to davanstrien/<repo-name>.
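To make that exploration step concrete, here’s a minimal sketch of the kind of throwaway script an agent can run via `uv run --with datasets` before committing to a model. The split name is an assumption; the script just prints whatever the dataset exposes.

```python
# explore.py: quick look at the dataset before picking a model.
# Run locally with: uv run --with datasets explore.py
from datasets import load_dataset

ds = load_dataset("biglam/on_the_books", split="train")  # split name assumed
print(ds)            # row count and column names
print(ds.features)   # feature types, including any label mapping
print(ds[0])         # one raw example, to eyeball text length and fields
```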
The race
What each one shipped
Both produced working binary classifiers. The headline numbers are close:
| | Claude Code (Opus 4.7) | Pi + Kimi K2.6 |
|---|---|---|
| Base model | ModernBERT-base (8K context) | RoBERTa-base |
| Wall-clock | ~13 min | ~13 min |
| F1 (jim_crow) | 0.962 | 0.947 |
| Accuracy | 0.978 | (not reported) |
| Hardware | L4 via hf jobs | L4 via hf jobs |
| Class imbalance handling | inverse-frequency weighted loss | not addressed |
| Model card | full (use, limits, ethics, hparams, citation) | auto-generated Trainer placeholder |
| Domain tags | 7 | 0 |
- Claude’s model: `davanstrien/jim-crow-laws-claude-code`
- Pi+Kimi’s model: `davanstrien/jim-crow-laws-pi-kimi`
The interesting bit
The F1 gap is real but small. What’s more striking is what each agent decided to do beyond producing weights:
- Base model choice. Claude Code picked ModernBERT — the obvious 2025 choice for legal text given its 8K context window. Pi+Kimi went with RoBERTa-base. Both produce viable classifiers; only one of those picks is current.
- The label-leak gotcha. The dataset’s `source` field is 100% positive for `paschal` and 92% positive for `murray` — using it as a feature would leak the label. Claude noticed this in the dataset card and explicitly excluded `source`. Pi+Kimi didn’t mention it.
- Class imbalance. ~29% of the data is positive. Claude added inverse-frequency class weights to the loss; Pi+Kimi trained with default cross-entropy (a sketch of the weighted-loss approach follows this list).
- The model card. Claude wrote a full card with intended use, limitations, OCR-noise caveats, ethical framing carried over from the dataset, full hyperparameters, per-epoch metrics, and a citation back to the On the Books project. Pi+Kimi shipped the auto-generated `Trainer` template with three “More information needed” placeholders.
- Discoverability. Claude added seven domain tags (`legal`, `glam`, `jim-crow`, `north-carolina`, `history`, etc.). Pi+Kimi added zero. One of these models is findable by an archivist; the other isn’t.
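For the class-imbalance point, here’s a minimal sketch of inverse-frequency weighting via a `Trainer` subclass. It illustrates the technique, not Claude’s actual code (that lives in `claude-workspace/train_jim_crow.py`), and the helper names are mine:

```python
# Sketch: weight the cross-entropy loss by inverse class frequency.
from collections import Counter

import torch
from transformers import Trainer

def inverse_frequency_weights(labels, num_labels):
    # Weight each class by total / (num_labels * count) so rarer classes count more.
    counts = Counter(labels)
    total = len(labels)
    return torch.tensor(
        [total / (num_labels * counts[i]) for i in range(num_labels)],
        dtype=torch.float,
    )

class WeightedLossTrainer(Trainer):
    def __init__(self, *args, class_weights=None, **kwargs):
        super().__init__(*args, **kwargs)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device)
        )
        loss = loss_fct(outputs.logits, labels)
        return (loss, outputs) if return_outputs else loss
```

With ~29% positives, this scheme weights the positive class at roughly 1.7 and the negative class at roughly 0.7, i.e. mistakes on the positive class cost about 2.4x as much.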
What this means
Agents can already produce working models with minimal prompting and no hand-written code. Using Hugging Face Jobs means you don’t need local GPU access or ML expertise to get a model trained and pushed to the Hub. One reason people historically didn’t bother training task- and domain-specific models was the friction of setting up training pipelines; agents + hf jobs are removing that friction.
IMO the biggest gap is still datasets (sorry, broken record). In my experience agents still struggle when working with domain-specific data “from scratch”, but with some hand-holding from a domain expert, the path from unlabelled data -> labelled dataset -> trained model can be run with agents by someone who is not an ML engineer.
Browse the agent traces
Both agents’ full session traces are on the Hub using the new agent trace viewer. You can step through each turn, tool call, and model response:
Direct links to each session file:
- Claude Code: `claude-code.jsonl`
- Pi + Kimi: `pi-kimi.jsonl`
Reproduce / inspect
- Dataset: `biglam/on_the_books`
- Claude Code’s training script: `claude-workspace/train_jim_crow.py`
- Pi+Kimi’s training script: `pi-workspace/train.py`
- Claude’s model: `davanstrien/jim-crow-laws-claude-code`
- Pi+Kimi’s model: `davanstrien/jim-crow-laws-pi-kimi`
- Agent traces: `davanstrien/agent-race-traces`
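If you just want to poke at the outputs, here’s a quick inference sketch, assuming the repos are standard text-classification checkpoints (which the training setup suggests); the label names come from each model’s config:

```python
# Minimal inference check against one of the pushed classifiers.
from transformers import pipeline

clf = pipeline("text-classification", model="davanstrien/jim-crow-laws-claude-code")
print(clf("Paste the text of a statute section here."))  # returns a label + score
```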
Try it yourself
- HF Jobs — launch GPU training from the CLI without managing infrastructure
- Agent trace viewer — host and browse coding-agent sessions on the Hub
Caveats
N=1, single dataset, single prompt, single run per agent. Run-to-run variance from random seeds and agent non-determinism would shift the numbers within a few points either way. Don’t read this as a benchmark — read it as a snapshot of what each agent chose to do given the same minimal brief. A fairer comparison would repeat both runs across multiple datasets and seeds, and would also try frontier-open (e.g. Claude Code on Kimi K2.6) and open-frontier (Pi on Opus 4.7) crosses to disentangle agent-vs-model effects.
Credit to the On the Books project at UNC Chapel Hill Libraries for the underlying data and the algorithms of resistance framing.