Why the Framework Desktop

Mike Saunders · 4th March 2026

The first decision is whether to have a local machine or a cloud-based solution. The advantages of cloud are spin-up time, the ability to leverage cloud services without managing infrastructure, and scalability. But from the point of view of a public organisation, you lose three things: control over the data; the ability to experiment freely without the risk of spiralling token costs; and insight into energy consumption and carbon footprint. I go into a bit more detail on this in kickoff.

It's a big leap for heritage organisations to think about buying a local machine exclusively for doing AI work. The Framework Desktop (at least when we bought it - prices have risen twice since the global RAM shortage) was comparatively cheap for AI-capable hardware, but it was still a non-trivial amount that needed to be justified and found in ever-shrinking budgets. It's also kind of scary because it's a physical commitment - both practically for the IT staff managing physical machines, and for the Library to indicate 'yes, we're trying this work, and in a considered and particular way'.

From a technical point of view, the Framework Desktop's arrival on the scene was really fortuitous for our discussions around this. It doesn't have a dedicated GPU, and was one of the first machines to integrate AMD's 'Strix Halo' (now confusingly called Ryzen AI Max+ 395) CPU/APU, meaning the iGPU and CPU both have access to the full 128GB of RAM. What initially seems like a drawback became a set of key advantages:

- Memory. On a normal setup you're limited by VRAM - a consumer graphics card tops out around 24GB, and to get anywhere near 80GB you're into datacenter cards that cost many times more than the whole machine. Because the iGPU can address the full unified pool (we get about 112GB usable), we can keep a 35B model and a separate embedding model loaded at the same time, with a 128k-token context window on top of that, all in memory. This would be a real push on a single consumer GPU, and you would probably end up buying several cards.
- Power draw. During both inference and fine-tuning it seems to be orders of magnitude smaller than the equivalent with a CPU/GPU combo - we see around 60W while inferencing.
- No copying. There's no copying weights back and forth between system RAM and the GPU, because there's only one pool of memory. If a model doesn't fit in VRAM on a dedicated card, this would become a potential bottleneck.
- Flexible sharing. The memory isn't carved up into a fixed VRAM partition - the OS and GPU share the 128GB flexibly, so whatever the model doesn't need is still available to the rest of the system.
- One box for everything. Because it does both inference and training, the whole pipeline lives on one box. We trained the card detector from scratch on this machine in about an hour and a half, no cloud GPU involved. One physical commitment covers both jobs.
- Form factor. It's small, quiet, and headless - a mini PC (built on a standard Mini ITX board) that sits in the server room rather than a workstation-sized rig with the cooling and power draw to match. That makes it a much easier thing for IT to actually house and look after.
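
To make the memory point concrete, here is a rough back-of-envelope budget in Python. It is a sketch under stated assumptions, not a measurement of our setup: only the 35B parameter count, the 128k context, the roughly 112GB usable pool and the 24GB consumer card come from the text above. The quantisation levels, the size of the embedding model and the layer/head counts used for the KV cache are all hypothetical placeholders, and the model is treated as a plain dense transformer.

```python
"""Rough memory budget for keeping two models and a long context resident."""

GB = 1e9  # decimal gigabytes, matching the figures quoted in the post


def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in GB."""
    return params_billion * 1e9 * bits_per_weight / 8 / GB


def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV cache size: one key and one value vector
    per KV head, per layer, per token."""
    values = 2 * n_layers * n_kv_heads * head_dim * context_tokens
    return values * bytes_per_value / GB


def total_gb(bits_per_weight: float) -> float:
    """Total footprint for the main model, embedding model and KV cache."""
    main = weights_gb(35, bits_per_weight)  # 35B model (from the post)
    embed = weights_gb(0.6, 16)             # hypothetical 0.6B embedding model
    cache = kv_cache_gb(
        128 * 1024,    # reading "128k" as 131,072 tokens (assumption)
        n_layers=48,   # hypothetical architecture
        n_kv_heads=8,  # hypothetical architecture
        head_dim=128,  # hypothetical architecture
    )
    return main + embed + cache


if __name__ == "__main__":
    for bits in (4, 8, 16):  # hypothetical quantisation levels
        need = total_gb(bits)
        print(f"{bits:>2}-bit weights: {need:6.1f} GB needed | "
              f"fits 112 GB pool: {need <= 112} | "
              f"fits 24 GB card: {need <= 24}")
```

With these placeholder numbers the long context alone is larger than a 24GB card, which is the intuition behind the 'you would probably end up buying several cards' remark. Real figures depend heavily on the model architecture and runtime, so treat this as a way to reason about the budget rather than a prediction.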

A brief note on token generation speed - it won't beat a top-end dedicated GPU on single-request speed (we see roughly 53 tokens/second when a request runs on its own, and per-request throughput drops as you run things in parallel). But for the work we're actually doing, holding larger models matters much more than how fast a single response streams. We're working on Library Time, after all.
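
Since insight into energy use is one of the reasons for going local, the two figures quoted in this post can be combined into a rough energy-per-token estimate. This is a sketch only: it assumes the roughly 60W draw and the roughly 53 tokens/second describe the same single-request workload, which we haven't measured side by side, and it ignores idle draw and prompt processing.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: watts are joules per second,
    so dividing by tokens per second gives joules per token."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


def watt_hours_per_million_tokens(power_watts: float,
                                  tokens_per_second: float) -> float:
    """Energy for one million generated tokens, in watt-hours (1 Wh = 3600 J)."""
    return joules_per_token(power_watts, tokens_per_second) * 1_000_000 / 3600


if __name__ == "__main__":
    power = 60.0  # W, approximate draw while inferencing (from the post)
    speed = 53.0  # tokens/second, single request (from the post)
    print(f"{joules_per_token(power, speed):.2f} J per token")
    print(f"{watt_hours_per_million_tokens(power, speed):.0f} Wh per million tokens")
```

Swapping in readings from a wall-socket meter and a real benchmark run would turn this from an illustration into something reportable.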

The other main trade-off is the current maturity of the compute stack - AMD's ROCm against Nvidia's much more established CUDA ecosystem. For inference we can sidestep ROCm altogether - Vulkan is vendor-neutral and works completely fine, and it's our usual backend. Fine-tuning is a different story: there's no Vulkan route, so you need ROCm, which for us meant a fairly frustrating day of trial and error using nightly releases of TheRock. This isn't really a 'benefit' but it did make finally training a model feel like more of an achievement. AMD are also actively contributing to Lemonade, our inference server of choice, to improve all this over time - I think we'll see significant improvements in stability and performance over the next year.

This isn't an ad for Framework and there are a bunch of other hardware choices out there - the most notable Nvidia parallel being the DGX Spark, which we also considered, and there are more and more machines released each week using Strix Halo chips. But so far it fits the size of our ambitions, and allows experimentation and rapid development.