13  Infrastructure Considerations

One of the first questions institutions face when planning AI work is: “What infrastructure do we need?” The answer depends less on specific products — which change constantly — and more on a few key factors: how you’ll use AI, how often, and what constraints your institution operates under.

This chapter won’t recommend specific hardware or services. Instead, it lays out the decisions you need to make and the trade-offs involved, so you can evaluate your options as the landscape evolves.

13.1 Two Things to Understand First

13.1.1 Models Need Memory

AI models need to be loaded into memory to run. A model requiring 40GB of memory won’t run on hardware with only 32GB available. Workarounds exist — offloading to CPU RAM, splitting models across multiple GPUs — but they add complexity and reduce speed. For most practical use cases, it’s simpler to match hardware to model requirements.

Model memory requirements vary enormously and are falling rapidly. A small text classifier might need under 1GB; a large vision-language model might need 80GB or more. Techniques like quantisation (storing model weights at lower numerical precision) can significantly reduce requirements. The key point isn’t memorising specific numbers — it’s checking the model card or documentation for your chosen model before making infrastructure decisions.
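As a rough rule of thumb (an approximation only — always check the model card), a model’s weights occupy roughly its parameter count multiplied by the bytes stored per parameter, plus some runtime overhead. A sketch, with an assumed 20% overhead and a hypothetical 7-billion-parameter model:

```python
def estimated_memory_gb(params_billions: float, bytes_per_param: float,
                        overhead: float = 1.2) -> float:
    """Rough memory needed to load a model's weights.

    bytes_per_param: 4 for fp32, 2 for fp16/bf16, ~0.5 for 4-bit quantisation.
    overhead: multiplier for runtime and activation overhead (assumed 20% here).
    """
    return params_billions * bytes_per_param * overhead

# A hypothetical 7-billion-parameter model:
print(estimated_memory_gb(7, 2))    # fp16: roughly 16.8 GB
print(estimated_memory_gb(7, 0.5))  # 4-bit quantised: roughly 4.2 GB
```

The same model that needs a high-end GPU at full precision can fit on modest hardware once quantised — which is why checking the documented requirements for the specific variant you plan to run matters more than rules of thumb.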

13.1.2 Usage Patterns Shape Everything

The type of AI work you’re doing matters more than which model you pick:

Interactive use — staff tools, live demonstrations, chat interfaces — needs instant response. The model stays loaded and ready. Infrastructure must be always-on or fast to start.

One-time batch processing — digitising a historical collection, processing 50,000 index cards — needs throughput, not instant response. You can rent capacity for days or weeks, then stop paying.

Regular batch processing — 300 PDFs per week, daily OCR corrections — needs reliable, predictable capacity. The economics favour fixed-cost infrastructure since you’ll be running it indefinitely.

Most institutions end up with a mix. Understanding which pattern each project follows is the first step toward sensible infrastructure decisions.

13.2 Four Ways to Access AI Infrastructure

Cloud infrastructure for AI isn’t simply “rent a GPU.” There are distinct service models, each suited to different use cases.

13.2.1 Pay-Per-Use APIs

You send data to a model hosted by someone else and pay per request (usually per token processed). No infrastructure to manage. You can switch between models easily and only pay for what you use.

Best for: Testing different models. Variable or unpredictable workloads. Getting started without infrastructure investment.

Trade-offs: Costs scale linearly with volume — economical for light use, expensive at scale. Your data leaves your premises. You’re limited to models the provider offers.

13.2.2 Dedicated Endpoints

A model runs continuously on reserved infrastructure. You pay for uptime regardless of how many requests you make — whether you process 10 items or 10,000 in an hour, the cost is the same.

Best for: Regular API usage where you need guaranteed availability. Predictable costs at moderate-to-high volumes.

Trade-offs: Wasteful if usage is sporadic, though more economical than pay-per-use at high volumes. Still requires sending data externally unless you use a provider with strong data policies.
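The break-even point between pay-per-use and a dedicated endpoint is simple arithmetic. A sketch with illustrative figures only (real prices vary widely by provider and model, so treat every number below as an assumption):

```python
def monthly_cost_api(requests: int, cost_per_request: float) -> float:
    """Pay-per-use: cost scales linearly with volume."""
    return requests * cost_per_request

def monthly_cost_dedicated(hourly_rate: float, hours: float = 730) -> float:
    """Dedicated endpoint: flat fee for uptime, regardless of volume."""
    return hourly_rate * hours

# Illustrative numbers: 200,000 requests at $0.01 each vs a $1.50/hour endpoint
api = monthly_cost_api(200_000, 0.01)        # $2,000/month
endpoint = monthly_cost_dedicated(1.50)      # about $1,095/month
print("dedicated cheaper" if endpoint < api else "pay-per-use cheaper")
```

At this assumed volume the dedicated endpoint wins; halve the request count and pay-per-use wins. Running this comparison with your provider’s actual prices and your measured volume is more useful than any general advice.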

13.2.3 GPU Rental

You rent GPU-equipped virtual machines and manage the software stack yourself. Full control over the environment — install any model, any framework, any configuration.

Best for: Batch processing. Development and experimentation. Projects requiring specific software setups.

Trade-offs: Requires technical expertise to set up and maintain, and you are responsible for the whole software stack. On the other hand, it is the most flexible option, and you only pay while actively using the hardware.

13.2.4 Job-Based Services

You submit a job — your code and data — and the service runs it and returns results. No infrastructure stays running between jobs. You pay for compute time used.

Best for: One-time batch processing. Sporadic workloads. Automating pipelines that run on a schedule.

Trade-offs: Startup overhead (minutes, not seconds) makes it less suited to interactive work, though it is excellent for intermittent needs.

13.2.5 Which Model Fits?

Usage Pattern                       Best Service Models
Testing and exploration             Pay-per-use APIs
Interactive staff tool (ongoing)    Dedicated endpoint or local hardware
One-time large batch                GPU rental or job-based
Regular weekly processing           Local hardware or job-based
Sporadic, unpredictable             Job-based or pay-per-use

13.3 Buying Your Own Hardware

Local hardware means a one-time purchase and fixed capacity. No ongoing costs, no data leaving your premises, no dependence on external services. But it also means maintenance, depreciation, and capacity limits.

Two broad categories of hardware exist for running AI models:

Unified memory systems share a single pool of memory between the CPU and GPU. This simplifies running large models — if the system has 128GB of unified memory, a model that needs 100GB just works. The trade-off is usually lower throughput compared to discrete GPUs.

Discrete GPU systems use separate, specialised GPU memory (VRAM). This memory is faster but limited — a GPU with 24GB of VRAM can only run models that fit in 24GB. Multiple GPUs can be combined, but this adds complexity.

For batch processing — the most common pattern for collections work — either approach works. For interactive use cases that demand fast response, discrete GPUs typically perform better.

When local hardware makes sense:

  • You’ll run AI workloads regularly across multiple collections
  • Data privacy or copyright restrictions prevent using external services
  • You want to encourage experimentation without per-use costs
  • The institution is committing to AI as an ongoing capability, not a one-off project

When it doesn’t:

  • You have a single, one-time project (rent instead)
  • You’re still exploring whether AI is useful (use APIs first)
  • You lack IT capacity to maintain specialised hardware

13.4 Off-the-Shelf vs Custom Models

Your infrastructure needs differ depending on whether you’re using existing models or training your own.

Off-the-shelf models — pre-trained models used as-is — only need infrastructure for inference (running the model). Any of the service models above work. This is the lower-friction starting point, and many collections tasks work well with existing models.

Custom fine-tuned models — models adapted to your specific collections — require training infrastructure (typically more demanding than inference) and somewhere to host the resulting model. This favours GPU rental, dedicated endpoints, or local hardware.

Most institutions should start with off-the-shelf models and only consider fine-tuning when evaluation shows the general models aren’t accurate enough for a specific task.

13.5 Institutional Factors

Technical specifications are only half the picture. Several organisational factors should inform your decision — and in practice, these often matter more than raw performance.

13.5.1 Existing Infrastructure and Expertise

If your institution already uses cloud platforms for other services, you have existing accounts, expertise, compliance approvals, and cost management practices. Starting there is usually easier than adopting a new platform specifically for AI.

If you have minimal cloud experience, the friction of cloud adoption may favour local hardware initially — at least you can hand it to your IT team rather than building cloud expertise from scratch.

13.5.2 Budget and Procurement

Cloud services are operational expenditure (ongoing monthly costs). Hardware is capital expenditure (one-time purchase). Most institutions have different approval processes and budget lines for each, and this administrative reality can matter more than the raw economics.

Cloud costs are variable and harder to predict. Hardware costs are fixed and front-loaded. For ongoing work, hardware often works out cheaper over 2-3 years — but only if you’ll actually use it consistently.
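The capital-versus-operational comparison reduces to a break-even calculation. A deliberately simplified sketch (illustrative prices; it ignores power, maintenance, and depreciation, all of which lengthen the real break-even):

```python
def breakeven_months(hardware_cost: float, cloud_monthly: float) -> float:
    """Months of consistent use after which owned hardware becomes
    cheaper than the equivalent cloud spend. Simplified: ignores
    power, maintenance, staff time, and depreciation."""
    return hardware_cost / cloud_monthly

# Illustrative: a $15,000 workstation vs $600/month of steady cloud usage
print(breakeven_months(15_000, 600))  # 25 months, i.e. just over two years
```

The caveat in the prose is the important part: the hardware only pays for itself if the $600/month of usage actually happens every month. Sporadic use pushes the break-even out indefinitely.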

13.5.3 Organisational Culture

Some institutions find that having physical hardware signals commitment to AI as a capability, encourages staff experimentation (no per-use costs to worry about), and builds internal expertise. Others prefer the flexibility of cloud services and the ability to scale up or down without hardware commitments.

Neither approach is wrong. The best choice depends on how your institution operates.

13.6 Matching Infrastructure to Use Case

Key question: Is AI work occasional projects or an ongoing institutional capability?

  • Occasional projects → Cloud services (pay only when needed, no maintenance)
  • Ongoing capability → Local hardware or dedicated endpoints (predictable costs, full control)

Most institutions benefit from a hybrid: cloud for experimentation and one-off projects, local or dedicated infrastructure for regular production work.

Five decision factors, in rough order of importance:

  1. Data constraints: Privacy, copyright, and governance requirements may rule out external services entirely
  2. Volume and frequency: Regular high-volume work favours fixed-cost infrastructure
  3. Response time: Interactive tools need always-on infrastructure; batch processing doesn’t
  4. Existing infrastructure: Leverage current cloud platforms if available
  5. Budget structure: Capital vs operational expense approval paths differ

A quick decision tree:

Is the workload interactive?
├─ Yes → Needs fast response
│  ├─ High volume / regular use → Dedicated endpoint or local
│  └─ Low volume / testing → Pay-per-use API
│
└─ No → Batch processing
   ├─ One-time large job → GPU rental or job-based service
   ├─ Regular moderate volume → Local hardware or job-based
   └─ Sporadic / unpredictable → Job-based service

Consider data constraints:
├─ Copyright/privacy restrictions → Local or enterprise-tier cloud
├─ Already using cloud platforms → Leverage existing infrastructure
└─ Starting fresh → Start with APIs, invest in hardware when patterns are clear
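The first decision tree above can be encoded as a small function — useful as a starting point for internal planning documents, though real decisions also need the data-constraint check. The volume categories are the ones used in the tree:

```python
def suggest_service_model(interactive: bool, volume: str) -> str:
    """Encode the decision tree above.

    volume: 'high' or 'regular' for steady use, 'low' for testing,
    'one-time' for a single large batch, 'sporadic' for unpredictable work.
    Data constraints (privacy, copyright) must be checked separately.
    """
    if interactive:
        if volume in ("high", "regular"):
            return "dedicated endpoint or local hardware"
        return "pay-per-use API"
    if volume == "one-time":
        return "GPU rental or job-based service"
    if volume == "regular":
        return "local hardware or job-based service"
    return "job-based service"

print(suggest_service_model(False, "one-time"))
# GPU rental or job-based service
```

The point of writing it down this way is that the logic becomes auditable: when circumstances change (say, a one-time project becomes regular), re-running the decision is trivial.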

13.7 Scenarios

13.7.1 A Large Digitisation Project

Context: 250,000 index cards needing metadata extraction, one-time.

This is classic batch processing — high volume, flexible timeline, no need for infrastructure afterwards. GPU rental or a job-based service makes sense: rent capacity for the duration of the project, then stop paying. If the materials contain personal information or unpublished research, check whether the service provider’s data policies are acceptable — or process locally.
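Back-of-envelope throughput arithmetic helps size the rental. The per-item processing time below is an assumption for illustration — measure it on a small sample before committing:

```python
def batch_days(items: int, seconds_per_item: float,
               hours_per_day: float = 24) -> float:
    """Days of compute time to process a batch at a given per-item rate."""
    return items * seconds_per_item / 3600 / hours_per_day

# 250,000 cards at an assumed 2 seconds each, running around the clock:
print(batch_days(250_000, 2))  # roughly 5.8 days of continuous processing
```

A week or so of rented GPU time is a very different procurement conversation from a hardware purchase, which is exactly why measuring throughput on a pilot batch first is worth the effort.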

If your institution anticipates similar projects in the future, local hardware may be worth the upfront investment.

13.7.2 Ongoing Metadata Enrichment

Context: 300 PDFs per week requiring metadata extraction, indefinitely.

Regular, predictable volume with no end date. The economics favour fixed-cost infrastructure — either local hardware or a dedicated endpoint. Local hardware has the added advantage of keeping copyrighted materials on-premises.

A job-based service also works if the weekly volume is modest and you prefer not to manage hardware.

13.7.3 An Interactive Staff Tool

Context: A tool for staff to query collections or get AI-assisted suggestions, needing instant responses.

Interactive use demands always-on infrastructure with fast response. A dedicated endpoint works if data policies allow external processing. Local hardware works if privacy is a concern or you plan multiple interactive tools.

Pay-per-use APIs can work for low-volume interactive use during a pilot phase, but costs add up if usage grows.

13.8 Key Takeaways

  1. Start with usage patterns — interactive, one-time batch, or regular batch? This shapes everything else.

  2. Four service models exist — pay-per-use, dedicated endpoints, GPU rental, and job-based services each suit different patterns.

  3. Institutional factors often dominate — data privacy, budget structures, existing expertise, and organisational culture matter as much as technical specifications.

  4. Local hardware suits ongoing capability — if AI is becoming a regular part of your operations, fixed-cost local infrastructure usually makes sense.

  5. Start with APIs, invest when patterns are clear — use pay-per-use services to validate your approach before committing to hardware.

  6. Hybrid approaches work well — cloud for experimentation and burst capacity, local for routine and sensitive work.

13.9 Questions to Guide Your Decision

  1. Are your AI applications interactive or batch processing?
  2. Is this a one-off project or an ongoing capability?
  3. What are your data privacy and copyright constraints?
  4. Does your institution already use cloud platforms?
  5. Is your budget structured for capital or operational expenditure?
  6. Will you use existing models or need to host custom fine-tuned models?
  7. What technical capacity does your team have for managing infrastructure?
  8. Could you start with cloud services to validate your approach before investing in hardware?