Your inference stack is optimized on one side. The other is running on hope.


June 25, 2026

Phil Burr

One stack. Two fundamentally different problems.

Every LLM inference request runs two distinct workloads back-to-back, and the AI infrastructure community has developed purpose-built hardware for only one of them.

Specialized hardware for the decode phase has attracted serious investment. Memory-optimized, bandwidth-focused accelerators are commercially deployed at scale across hyperscalers and neoclouds right now. The workload was understood, and the hardware followed.

Prefill is a different story. It is running on the same silicon as everything else, on hardware designed for a fundamentally different compute profile. That mismatch compounds silently as context windows grow and model sizes push into the hundreds of billions of parameters.

What each phase actually demands from hardware

Prefill is compute-bound. When a prompt arrives, the model processes every input token in parallel through attention and feed-forward layers. For a long prompt on a frontier model, that is trillions of operations, dominated by large matrix multiplications applied to fixed weight matrices, layer after layer. Raw arithmetic throughput and energy are the ceiling.

Decode is memory-bandwidth-bound. Each new output token is generated sequentially. Producing it means streaming the full model weights and the growing KV cache through the compute unit. The ceiling is data movement speed, not multiply-accumulate throughput. More FLOP/s changes nothing here.

These are structurally different problems. A chip optimized for fast memory access is the wrong chip for the matrix-heavy, parallelizable work of prefill. A chip designed for dense matrix computation is overkill, and expensive overkill, for decode. Running both on identical hardware means neither phase runs on optimal infrastructure.
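
The contrast between the two phases can be made concrete with a rough roofline-style estimate. The sketch below is illustrative only: it assumes about 2 FLOPs per parameter per token, counts only weight traffic (KV-cache and activation reads are ignored), and treats decode as one token per step with no batching. The `Accelerator` figures are hypothetical placeholders, not the specifications of any real device.

```python
from dataclasses import dataclass


@dataclass
class Accelerator:
    """Hypothetical accelerator; both figures are placeholders, not vendor specs."""
    peak_flops: float     # peak arithmetic throughput, FLOP/s
    mem_bandwidth: float  # memory bandwidth, bytes/s

    @property
    def ridge_point(self) -> float:
        """Arithmetic intensity (FLOP/byte) above which work is compute-bound."""
        return self.peak_flops / self.mem_bandwidth


def arithmetic_intensity(tokens_per_pass: int, bytes_per_param: float) -> float:
    """FLOPs performed per byte of weights read in one forward pass.

    Assumes ~2 FLOPs (one multiply-accumulate) per parameter per token and
    that the weights are read from memory once per pass. The parameter count
    cancels out, so it does not appear here.
    """
    return 2.0 * tokens_per_pass / bytes_per_param


def bound(intensity: float, hw: Accelerator) -> str:
    """Classify a workload against the accelerator's ridge point."""
    return "compute-bound" if intensity > hw.ridge_point else "memory-bandwidth-bound"


# Assumed hardware: 1 PFLOP/s and 2 TB/s, giving a ridge point of 500 FLOP/byte.
hw = Accelerator(peak_flops=1e15, mem_bandwidth=2e12)

# Assumed workloads: an 8K-token prompt and single-token decode, INT8 weights.
prefill = arithmetic_intensity(tokens_per_pass=8192, bytes_per_param=1.0)
decode = arithmetic_intensity(tokens_per_pass=1, bytes_per_param=1.0)

print(f"prefill: {prefill:.0f} FLOP/byte -> {bound(prefill, hw)}")
print(f"decode:  {decode:.0f} FLOP/byte -> {bound(decode, hw)}")
```

Under these assumptions, prefill sits far above the ridge point and decode far below it, which is the structural difference described above. Batching several decode requests together raises decode intensity in this simple model, so the exact crossover depends on the deployment.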

Why this matters more as you scale

At low inference volumes, the inefficiency is tolerable. At scale, it becomes structural.

Frontier AI companies project roughly a 1,000x increase in effective compute demand over the next five years. Delivering that on conventional digital accelerators requires approximately $100 trillion in infrastructure investment and around 1,000 GW of additional electrical capacity. Neither number is achievable.

The constraint is not transistor density. It is energy. And prefill is where that energy problem bites hardest as contexts become longer. Matrix multiplication at a scale of hundreds of billions of parameters is relentlessly power-hungry on silicon. Every prefill request burns watts that a purpose-built architecture could reclaim. Right now, the choice of prefill hardware is being made by default rather than by design.

The gap in the ecosystem nobody has closed

Decode has a maturing ecosystem of specialist hardware. Memory-optimized accelerators, custom silicon for KV cache access, and dedicated decode pools are running in production today. The market identified the workload and built for it.

Prefill does not have an equivalent. There is no commercially deployed, purpose-built prefill processor. It runs on whatever capacity is available, tolerated rather than designed for. For architects building disaggregated inference infrastructure, that means one half of the stack is an engineering decision and the other is a gap waiting to be filled.

Closing that gap requires something specific: the ability to execute very large matrix multiplications at high throughput, with low energy per operation, at data center scale. Incrementally faster silicon is not the answer. The architecture itself needs to change.

Light is the right compute for a matrix problem

Lumai exists to close this gap. Our Lumai Iris Server uses light rather than electricity to perform matrix multiplications, the operation that defines the prefill workload. Each vector-matrix multiplication completes in a single optical cycle. The energy scaling of optical compute is fundamentally different from that of silicon: as matrix size increases, compute scales quadratically while energy grows at most linearly.
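
One way to read the scaling claim above: if an n-by-n vector-matrix product performs n² multiply-accumulates while its energy grows no faster than n, then the energy per operation is bounded by something proportional to 1/n. The sketch below only illustrates that arithmetic. The constant `energy_per_element` is a hypothetical value in arbitrary units, not a measured figure for any hardware.

```python
def macs(n: int) -> int:
    """Multiply-accumulates in one n-by-n vector-matrix product."""
    return n * n


def energy_upper_bound(n: int, energy_per_element: float = 1.0) -> float:
    """Energy for one product under the at-most-linear claim: E(n) <= k * n.

    energy_per_element (k) is a hypothetical constant in arbitrary units.
    """
    return energy_per_element * n


def energy_per_mac(n: int, energy_per_element: float = 1.0) -> float:
    """Upper bound on energy per MAC: (k * n) / n**2 = k / n."""
    return energy_upper_bound(n, energy_per_element) / macs(n)


for n in (256, 1024, 4096):
    print(f"n={n:5d}  MACs={macs(n):>10,d}  energy/MAC <= {energy_per_mac(n):.6f}")
```

In this simple model, quadrupling the matrix dimension cuts the per-operation energy bound by a factor of four, which is why larger matrices favor an architecture with this scaling.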

The result is approximately 10x less energy per inference versus silicon-based equivalents, validated on LLMs at billion-parameter scale. Iris Nova, the first generation of Lumai Iris Server, deploys in existing air-cooled data center racks with no liquid cooling required. It supports INT4 and INT8 precision and integrates with standard ML frameworks including PyTorch.

This is a complete server deployment, not a co-processor or a research chip. It is built on similar technology already operating at high volume in data center communications. The supply chain exists. The hardware is available for evaluation today.

The prefill decision is in front of you now

Disaggregated inference is becoming the default architecture for production LLM deployments. As prefill shifts from an implicit silicon workload to an explicit infrastructure decision, the question is no longer whether to specialize it. It is what hardware you choose.

If you are building or optimizing a disaggregated inference stack and want to understand what purpose-built prefill hardware changes about your system, we would like to talk. Lumai Iris Nova is available for evaluation now.

