Why AI Inference Hardware Must Specialize - Starting with Prefill

May 21, 2026
Lumai Editorial Team

The debate over whether AI inference should be disaggregated into prefill and decode is over. The market has recently delivered several unmistakable signals:

  • Cerebras’ blockbuster IPO became one of the largest and most oversubscribed technology offerings of the year, driven largely by investor confidence in inference infrastructure demand.  
  • NVIDIA’s recent acquisition activity around inference-focused infrastructure companies underscores the strategic importance of deploying and serving models at scale.
  • Anthropic continues to report explosive enterprise usage growth, reflecting the reality that AI workloads are rapidly shifting from model training into production deployment.

Disaggregated serving is shipping in vLLM, SGLang, and llm-d, running in production at major labs and hyperscalers, and validated by a substantial research literature. Prefill and decode are different workloads: they stress hardware in different ways, and running them on the same hardware is a compromise.

The follow-up question is: how different should the hardware be for each stage?

The trajectory of specialisation

There is a useful pattern in how computing architectures evolve. A workload starts out running on general-purpose hardware. As it grows in economic importance, the inefficiencies of running it on hardware not designed specifically for it become harder to justify. Specialisation arrives in stages. First the workload moves to a dedicated pool of the same general-purpose hardware. Then the pool is tuned with different memory and interconnect profiles. Eventually, purpose-built hardware arrives that no longer pretends to be general at all.

Graphics took this path. Networking took this path. Training took this path. Inference is doing the same.

Right now, most (but not all) production disaggregation runs prefill on one pool of hardware and decode on another pool of the same type of hardware - perhaps with different parallelism strategies, perhaps with different memory configurations, but fundamentally the same hardware design doing two different jobs. This is the first stage towards specialisation, and it is delivering real gains. But it is not the destination.

Why prefill is the side that needs new hardware

The two phases of large language model inference place opposite demands on the system underneath.

Prefill is compute-bound. It processes the entire input prompt in parallel, saturating the available arithmetic throughput, and the cost of its attention computation scales quadratically with context length. Long-context workloads (which are now the norm rather than the exception) turn prefill into the dominant cost in the pipeline. This is the stage of inference where purpose-built hardware offers the greatest advantage over general-purpose hardware.
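
A rough back-of-the-envelope estimate shows where the quadratic term comes from. The sketch below is an illustration under stated assumptions, not a model of any particular system. It uses the common approximation of about 2 FLOPs per parameter per token for the weight matmuls, and about 4 x n^2 x d FLOPs per layer for the attention scores and weighted values (ignoring any saving from causal masking). The function name and the model dimensions (32 layers, width 4096, 7 billion parameters) are hypothetical.

```python
def prefill_flops(n_tokens, n_layers, d_model, n_params):
    """Rough FLOP estimate for one prefill pass: (linear_flops, attention_flops)."""
    # Weight matmuls: about 2 FLOPs per parameter per token, linear in prompt length.
    linear = 2 * n_params * n_tokens
    # Attention: QK^T and the attention-weighted sum of V each cost about
    # 2 * n^2 * d FLOPs per layer, so this term is quadratic in prompt length.
    attention = 4 * n_layers * n_tokens ** 2 * d_model
    return linear, attention


# Hypothetical model dimensions, chosen only for illustration.
N_LAYERS, D_MODEL, N_PARAMS = 32, 4096, 7_000_000_000

for n in (1_024, 8_192, 65_536, 131_072):
    linear, attention = prefill_flops(n, N_LAYERS, D_MODEL, N_PARAMS)
    print(f"{n:>7} tokens: attention/linear FLOP ratio = {attention / linear:.2f}")
```

Under these assumptions the attention term overtakes the linear term at roughly n = P / (2 x L x d), which is about 27,000 tokens for the hypothetical model above. Beyond that point, doubling the prompt length roughly quadruples the dominant cost.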

Decode is memory-bandwidth-bound. It generates tokens one at a time, each step requiring the full model weights and growing KV cache to be streamed from memory. The arithmetic is modest; the memory bandwidth is not. Current general-purpose solutions can manage these steps effectively.  
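
A simple roofline-style estimate makes the contrast between the two phases concrete. The sketch below compares the arithmetic intensity (FLOPs per byte moved) of a single weight matmul when it processes a whole prompt at once, as in prefill, and when it processes one token, as in an unbatched decode step. It is a simplification under stated assumptions: the function names are hypothetical, it counts only one layer's weights and activations at 2 bytes per element, it ignores the KV cache, and the accelerator figures (1e15 FLOP/s peak, 3e12 bytes/s memory bandwidth) are illustrative rather than those of any specific product.

```python
def matmul_intensity(m, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for an (m x d_in) @ (d_in x d_out) matmul."""
    flops = 2 * m * d_in * d_out
    # Bytes moved: read the weights once, read the inputs, write the outputs.
    bytes_moved = bytes_per_elem * (d_in * d_out + m * d_in + m * d_out)
    return flops / bytes_moved


def bound(intensity, peak_flops, mem_bandwidth):
    """Classify a workload against the hardware's roofline ridge point."""
    ridge = peak_flops / mem_bandwidth  # FLOPs per byte needed to saturate compute
    return "compute-bound" if intensity >= ridge else "memory-bound"


# Illustrative accelerator figures (assumptions, not a real device).
PEAK_FLOPS, MEM_BANDWIDTH = 1e15, 3e12

prefill_ai = matmul_intensity(m=4096, d_in=4096, d_out=4096)  # whole prompt at once
decode_ai = matmul_intensity(m=1, d_in=4096, d_out=4096)      # one token per step

print("prefill:", round(prefill_ai, 1), bound(prefill_ai, PEAK_FLOPS, MEM_BANDWIDTH))
print("decode: ", round(decode_ai, 1), bound(decode_ai, PEAK_FLOPS, MEM_BANDWIDTH))
```

In this sketch, prefill reuses each weight across thousands of tokens, so its intensity sits well above the ridge point, while a single-token decode step performs about one FLOP per byte and is limited by memory bandwidth. Batching several decode requests together raises m and narrows the gap, which is one reason the two phases are often tuned differently.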

This isn’t about replacing GPUs - their HBM bandwidth makes them well-suited to the memory-bound decode workload. It is about specialised prefill hardware that sits alongside existing decode hardware to make the whole inference path better.  

This is the architectural premise behind the Lumai Iris Nova server: a prefill engine that uses optical compute to deliver the throughput and efficiency that compute-bound prefill demands.

What comes next

Inference demand is growing faster than data centre power budgets can expand. New inference infrastructure has to deliver substantially more compute within the same power envelope. This requires rethinking compute from the physics upwards - asking what operations the workload actually needs, and what physical mechanisms can perform those operations with the least energy.

This is the question Lumai was founded to answer. Optical compute performs linear algebra, the operation that dominates prefill, using photons rather than electrons. Lumai Iris is a family of servers designed to deliver the prefill compute capacity that inference now requires, within the power budgets that operators actually have.

The disaggregation debate is settled and the specialisation trajectory is underway. The constraint now defining the next phase of AI infrastructure is power, and it will not be overcome by doing the same thing slightly better. It will be overcome by rethinking the fundamental physics of compute and building the hardware that lets operators serve the next phase of inference within the power budget they actually have.