Open and local models for digital humanities and cultural heritage
Hugging Face — Machine Learning Librarian
2026-05-05
When someone asks: “What can we do with AI for our collection?”
The default answer in 2026 is still:
“Let’s build a chatbot.”
A frontier model like GPT-4 is like a Ferrari. Obvious triumph of engineering, designed to win races. But it takes a special pit crew just to change the tires.
A smaller specialised model is like a Honda Civic. Engineered to be affordable, reliable, extremely useful. And that’s why they’re absolutely everywhere.
— Adapted from “Finally, a Replacement for BERT”
In 2026, our community has the tools to build its own.
What infrastructure does our community want built — for ML and AI for and with DH and cultural heritage collections?
Open weights
Closed weights
There’s a stricter sense — “fully open” includes training code, data, and evaluation. Important, but a separate conversation. (Appendix.)
Privacy — unpublished and sensitive material stays on your infrastructure
Longevity — vendors deprecate; archives don’t. The artifact has to outlast the company.
Reproducibility — research depends on the model still being available, unchanged, in five years.
Many DH/CH tasks don’t need a chat model:
| Shape | What it looks like | Example tools |
|---|---|---|
| Local | Model file on your machine | llama.cpp, transformers, MLX |
| Portable | OpenAI-compatible API, many providers | HF Inference Providers, Together, Fireworks |
| Rent on-demand | Cloud GPU for one job | hf jobs, vLLM for batch inference |
Open ≠ local. Three shapes — different ways to not depend on one provider.
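A minimal sketch of the first two shapes on the same kind of task. The model ids and the endpoint URL are placeholder assumptions, not recommendations; any small zero-shot classifier and any OpenAI-compatible provider will do.

```python
# Shape 1: local -- the model file is downloaded and runs on your machine.
from transformers import pipeline

local_clf = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/deberta-v3-base-zeroshot-v2.0",  # assumed model id
)
print(local_clf(
    "An Act to require separate railway carriages for the white and colored races.",
    candidate_labels=["discriminatory", "not discriminatory"],
))

# Shape 2: portable -- the same question sent to any OpenAI-compatible endpoint.
# The base_url and model id are placeholders; swap in HF Inference Providers,
# Together, Fireworks, or a self-hosted vLLM server.
from openai import OpenAI

client = OpenAI(base_url="https://router.huggingface.co/v1", api_key="hf_xxx")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed model id
    messages=[{"role": "user", "content": "Is this law discriminatory? Answer yes or no: ..."}],
)
print(resp.choices[0].message.content)
```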
This is what web-scale data curation looks like (FineWeb-Edu, WebOrganizer).
It’s also what your archive looks like — scaled the same way.
The Hub is the platform where the open AI community builds together — and where the durable artifacts get parked.
Live walkthrough — switch to browser.
Today’s example: classify historical legal text as discriminatory or not.
Same shape, applied differently:
All variants of: assign a label to text.
You could send each document to a closed API. Works for prototyping.
For research infrastructure:
Train a small classifier instead. Used to be hard. With data and agents, much less so.
Dataset: biglam/on_the_books
Labels: jim_crow / no_jim_crow
The core ask is one sentence:
“Fine-tune a model on biglam/on_the_books to identify Jim Crow laws.
Train via hf jobs and push to my namespace.”
No notebook. No training script. Replace the dataset and the label, and it’s your task.
Fine-tune a model on biglam/on_the_books to identify Jim Crow laws.
Train via hf jobs and push the trained model to my namespace.
Run `hf --help` to understand the Hub CLI and `hf jobs uv run --help`
to understand how to submit uv scripts. You can use `uv run --with`
to run small scripts for exploring the dataset.
Start by exploring the dataset structure, then proceed to choose
and fine-tune an appropriate model.
Push the final model to davanstrien/dhd-demo.
The extra lines point the agent at the right docs and naming. The core task is still the first line.
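For orientation, a rough sketch of the kind of script such an agent writes and then submits with `hf jobs uv run`. The column names, label encoding, and base model are assumptions, not what either agent actually produced; check the dataset card for biglam/on_the_books rather than trusting this sketch.

```python
# Rough sketch of an agent-written fine-tuning script. Assumes a "train" split
# with a "text" column and an integer-encoded "label" column; verify both
# against the dataset card before running.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

dataset = load_dataset("biglam/on_the_books")
base_model = "distilbert-base-uncased"  # assumed; any small encoder works
tokenizer = AutoTokenizer.from_pretrained(base_model)

labels = ["no_jim_crow", "jim_crow"]
model = AutoModelForSequenceClassification.from_pretrained(
    base_model,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="jim-crow-classifier",
        num_train_epochs=3,
        push_to_hub=True,                                  # push to your namespace
        hub_model_id="your-username/jim-crow-classifier",  # placeholder repo id
    ),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
trainer.push_to_hub()
```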
Trained from the same one-line prompt, different agents:
davanstrien/jim-crow-laws-claude-code
Claude Code + Opus 4.7 (closed)
davanstrien/jim-crow-laws-pi-kimi
Pi + Kimi K2.6 (open-weight)
Full writeup: danielvanstrien.xyz/posts/2026/agent-race
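Reusing either result is a single pipeline call. The model id below is one of the two named above; the label names it returns depend on how that training run mapped them, so the printed values are illustrative.

```python
# Load an agent-trained classifier straight from the Hub and apply it to new text.
from transformers import pipeline

clf = pipeline("text-classification", model="davanstrien/jim-crow-laws-claude-code")
result = clf("Separate schools shall be maintained for children of the two races.")
print(result)  # e.g. [{"label": "jim_crow", "score": 0.97}] -- illustrative output
```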
Plurality, not monoculture. Use the big general-purpose tools to craft the small, durable ones.
A different bottleneck — same kind of solution.
hf jobs (~31% accurate) → 99% mAP in three rounds, one afternoon.
“The bottleneck was always collecting training data. This is becoming much less of a barrier.”
Curated, well-documented collections.
Domain expertise.
Multilingual / under-represented languages.
Move past “vibe checks”.
Domain experts know what good output is.
Edge cases models routinely miss.
Fine-tuned for your domain.
OCR for historical scripts.
Domain-specific embeddings.
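One way to move past vibe checks, as the items above suggest: a small expert-labelled test set scored the same way every time. A minimal sketch; the texts, labels, and model id are placeholders, not a real benchmark.

```python
# Score a classifier against a small expert-labelled test set.
from transformers import pipeline

expert_labelled = [
    ("Separate waiting rooms shall be provided for the two races.", "jim_crow"),
    ("An Act to fund the construction of a county bridge.", "no_jim_crow"),
    # ...the edge cases domain experts know models routinely miss
]

clf = pipeline("text-classification", model="your-username/jim-crow-classifier")

correct = sum(clf(text)[0]["label"] == expected for text, expected in expert_labelled)
print(f"accuracy: {correct / len(expert_labelled):.2%}")
```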
Back to the question we opened with.
We don’t need fancy AI infrastructure projects.
We need datasets.
YOLO26n, 99.1% mAP
Built with the same Claude Code + hf jobs stack you just saw
— the agent-built path
Yiddish OCR
150k training lines
→ ~99% accuracy uplift
→ Mass digitisation now feasible
— the community-curated path
Europeana Newspapers — 4M+ documents, multiple languages
→ Powers projects like German Commons (154B tokens of training data)
FineWeb-C — 500+ contributors building multilingual training infrastructure together
Beyond Words / Newspaper Navigator — 16M pages with visual annotations from Library of Congress
Most LLM benchmarks are:
DH/CH benchmarks would test things models actually struggle with:
smolagents/ml-intern
huggingface.co/biglam
huggingface.co/small-models-for-glam
huggingface.co/learn
If you want to put a dataset on the Hub, get in touch — that’s part of what I’m here for.
Daniel van Strien
davanstrien on Hugging Face
linktr.ee/danielvanstrien
daniel@huggingface.co
Questions?
Reference slides — not part of the linear talk. Reachable for Q&A.
flowchart LR
W[Weights<br/>Learned numbers] --> M((Model))
C[Code<br/>Instructions] --> M
M --> T[Does tasks]
The weights are the “brain” — patterns learned from training data.
flowchart TD
D[(Training Data<br/>Books, web, OCR'd archives)] --> L[Learning Process]
L --> W[Weights File<br/>.safetensors]
W --> E["Billions of numbers:<br/>[0.023, -0.891, 0.442, ...]"]
These numbers encode everything the model “knows”.
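To make that concrete, a sketch that downloads one small model's weight file and lists a few of its tensors. The model id is a placeholder; the sketch assumes the repo ships a model.safetensors file, which most recent Hub models do.

```python
# Peek inside a weights file: fetch a small model's .safetensors file and
# print the names, shapes, and first few numbers of a handful of tensors.
from huggingface_hub import hf_hub_download
from safetensors import safe_open

path = hf_hub_download("distilbert-base-uncased", "model.safetensors")  # assumed repo/file

with safe_open(path, framework="pt") as f:
    for name in list(f.keys())[:3]:
        tensor = f.get_tensor(name)
        print(name, tuple(tensor.shape), tensor.flatten()[:4].tolist())
```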
Truly open source AI also includes: training data, training code, and evaluation details.
Pros
Cons
Pros
Cons
Pros
Cons
mindmap
root((Local model<br/>ecosystem))
Python Libraries
Transformers
Sentence Transformers
Diffusers
ONNX Runtime
JavaScript
Transformers.js
Inference Frameworks
llama.cpp
LM Studio
Ollama
MLX (Apple Silicon)
mlx-lm
mlx-vlm
vLLM
SGLang
Quantization
GGUF
AWQ
GPTQ
UIs
Open WebUI
Jan
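As one concrete path through this ecosystem, a sketch of fully local chat inference over a GGUF quantisation via llama-cpp-python. The repo id and filename pattern are assumptions; substitute any GGUF model you can download.

```python
# Fully local inference with a quantised GGUF model via llama-cpp-python.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-0.5B-Instruct-GGUF",  # assumed repo id
    filename="*q4_k_m.gguf",                     # glob for a 4-bit quantisation
    n_ctx=2048,
)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarise this catalogue record: ..."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```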
DHd AGKI Webinar · 5 May 2026 · danielvanstrien.xyz