Why collaboration and shared data matter more than servers
Hugging Face — Machine Learning Librarian
2026-06-26
Most of this already exists:
Lots of options: commercial cloud, the European Cloud for Heritage Open Science, the data space for Cultural Heritage, the Hub.
Building infrastructure is hard. Maintaining it is harder.
Is it worth building your own platform to host 100 TB datasets?
Where does your effort actually add something no one else will?
Preservation might be the exception, where keeping your own copy matters. For most things, use what already exists.
My honest answer: government-funded is not a safe bet in 2026 either.
The Data Rescue Project exists to preserve public data at risk of disappearing.
Why does it need to exist?
LOCKSS — Lots Of Copies Keep Stuff Safe.
What actually helps:
Use good infrastructure, commercial or public, but don’t rely on only one provider.
The way we build is changing too.
AI agents can now do a lot of the building work themselves.
So the hard part isn’t really the building. It’s having the data, and the people who know how to make it.
Collaboration and shared datasets, for evaluation and for training, are infrastructure.
They have a long tail of plausible benefit: you can’t predict who will reuse them, or how.
So fund and maintain them like anything else you depend on.
In 2022 I put ~25 million pages of digitised British Library books on the Hub, through BigLAM.
Four years later, someone trained a Victorian chatbot from scratch on them.
The Europeana Newspapers dataset turned up inside German Commons, a 154-billion-token German-language dataset.
Put data somewhere people can find it, and they build things you never imagined.
For libraries, whether a given model works on their collections is mostly an open question.
Does this OCR model read 18th-century print? Handwriting? Gaelic? Your catalogue cards?
There’s usually no benchmark that tells you. Shared evaluation sets for real collections are themselves infrastructure.
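One way a shared evaluation set answers the questions above is by scoring OCR output against hand-checked transcriptions. The sketch below computes character error rate (CER), a common OCR metric, using plain edit distance. The sample lines and the function names are illustrative assumptions, not taken from any real benchmark or from the talk.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the ground-truth text."""
    if not reference:
        raise ValueError("reference transcription must not be empty")
    return levenshtein(reference, hypothesis) / len(reference)


def corpus_cer(pairs) -> float:
    """CER over a whole evaluation set: total edits / total reference chars."""
    pairs = list(pairs)
    total_edits = sum(levenshtein(ref, hyp) for ref, hyp in pairs)
    total_chars = sum(len(ref) for ref, _ in pairs)
    return total_edits / total_chars


# Hypothetical ground truth vs. model output for two lines of old print.
# The long s (ſ) is a typical failure point for models trained on modern text.
sample = [
    ("the ſame ſubject", "the fame fubject"),
    ("of the Arts", "of the Arts"),
]
print(f"corpus CER: {corpus_cer(sample):.3f}")
```

Running the same script over several models on the same shared set is what turns "does it read 18th-century print?" into a number you can compare.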
FineWeb: a 15-trillion-token open dataset. What made it good was evaluation: train small models, run benchmarks, keep what helps. Open data, open recipe, open evals.
A FineWeb-style effort for digitised library collections. Still early.
Not one big corpus, but a series of reusable artifacts: tools libraries can run, example datasets, small models. The corpus emerges from adoption, not central control.
Already doable: re-OCR’d Britannica (1771), 2,724 pages, for ~$5.
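As a rough worked example, the per-page cost implied by those figures can be computed directly. This is a back-of-envelope sketch: it treats the approximate "~$5" as exactly 5 US dollars and assumes cost scales linearly with page count, which real pricing may not do. The names `cost_per_page` and `estimate_cost` are illustrative.

```python
# Figures from the line above: 2,724 pages re-OCR'd for roughly $5.
PAGES = 2_724
TOTAL_COST_USD = 5.0  # approximate; the source says "~$5"

cost_per_page = TOTAL_COST_USD / PAGES


def estimate_cost(pages: int) -> float:
    """Naive linear extrapolation from the Britannica run.
    Real costs vary with page size, model choice and pricing."""
    return pages * cost_per_page


print(f"per page: ${cost_per_page:.4f}")
print(f"per 1,000 pages: ${estimate_cost(1_000):.2f}")
```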
This already happens:
With shared data and shared benchmarks, labs can choose tools based on evidence.
Building a shared dataset teaches you how AI actually works.
For GLAM labs, that understanding matters more and more.
And it stays with your staff, even when funding or platforms change.
Infrastructure is the friends you made along the way.
Pick one dataset or benchmark, and build it with another institution.
Daniel van Strien · Machine Learning Librarian, Hugging Face
GLAM Labs Futures · 26 June 2026 · Edinburgh · danielvanstrien.xyz