
Disaggregated or Colocated?

The Cost-Frontier of LLM Serving Under SLO Contracts.

Manu Bhardwaj. ifitsmanu.com. May 2026. Version 1.0. Research Paper #1 in the AI systems engineering wedge.


Companion to Field Note #1, The Inference Stack in 2026, which introduced Verified Capability per Dollar at the API layer. This paper carries the cost frame one stack-layer down to the serving architecture, where cost per SLO-compliant served token is the right unit and the architectural choice is load-bearing on annual GPU spend.


Abstract

LLM serving in 2026 is not a single architecture. Colocated continuous batching, chunked-prefill colocation, and prefill/decode disaggregation each report goodput wins on different workload mixes against different baselines. Production teams pick architectures without a frontier to point at. We develop a closed-form decomposition of cost per SLO-compliant served token into a prefill term, a decode term, and a KV-transfer tax that applies only in disaggregated mode. We re-derive published throughput numbers from five 2023–2025 systems papers into a common frame. We plot the first cross-system Pareto frontier under explicit p99 TTFT and p99 TPOT contracts. We solve for the break-even surface between colocated and disaggregated architectures as a function of input/output ratio, arrival rate, KV-transfer bandwidth, and SLO slack. The frontier partitions. Disaggregation dominates the prefill-heavy long-context region. Chunked-prefill colocation dominates the decode-heavy short-context region. The crossover is sensitive to KV-transfer bandwidth and shifts visibly between A100, H100, and H200 deployments.


1. Why a frontier and not a winner

Iteration-level scheduling, introduced by Orca (Yu et al., 2022), is the primitive the modern serving stack is built on. Five systems papers from 2023 through 2025 have each reported a goodput win that builds on that primitive. PagedAttention and vLLM (Kwon et al., 2023) paged the KV cache and lifted realized batch size. Sarathi-Serve (Agrawal et al., 2024) chunked prefill into smaller units interleaved with decode iterations. DistServe (Zhong et al., 2024) split prefill and decode across separate worker pools. Splitwise (Patel et al., 2024) did the same on Azure-scale hardware. Mooncake (Qin et al., 2025) disaggregated at production scale at Moonshot. Each report is internally honest. Each is a regional claim, not a global one.

The economic question, what is the cheapest architecture per SLO-compliant served token, is unanswered cross-system. Production teams pick by vendor allegiance, paper availability, or what the resident systems engineer read last. The serving choice is load-bearing on annual GPU spend. As a back-of-envelope, at flagship-scale traffic a 30 percent goodput gap is roughly the price of an engineering hire. The choice deserves a frontier.

The systems community has the throughput numbers. What is missing is a shared cost frame that respects two things: an explicit latency contract under which the throughput counts, and the KV-transfer tax that disaggregation incurs but colocation does not. We supply one. We use it to plot the first cross-system Pareto frontier of LLM serving architectures and to solve for the break-even surface between colocated and disaggregated regimes. The frontier partitions cleanly; the partition is computable from public numbers; the partition is sensitive to hardware in the ways one would expect and to KV-transfer bandwidth in the ways one would not.

The contribution is three results. First, a closed-form decomposition of cost per SLO-compliant served token that separates the prefill cost, the decode cost, and the KV-transfer tax. Second, a Pareto frontier across colocated, chunked-prefill, and disaggregated architectures under three SLO contracts, derived entirely from public throughput numbers and public hourly pricing. Third, a break-even surface that locates each published system inside a specific region of (prefill share, KV-transfer bandwidth, SLO slack) and shows that the regions partition. No paper dominates the frontier. The frontier dominates the papers.

The paper carries the Inference Stack in 2026 frame from the model-and-pricing layer to the serving-architecture layer. Verified Capability per Dollar (VCpD) was the right unit at the API level. Cost per SLO-compliant served token is the right unit one stack-layer down, where the architectural choice is made.

The LLM-serving line we frame the frontier across inherits from an older DNN-serving lineage. Clipper (Crankshaw et al., 2017) introduced the model-server abstraction and the SLO-aware admission control that the modern LLM systems carry forward. INFaaS (Romero et al., 2021) generalized model selection at serving time. Shepherd (Zhang et al., 2023) formalized variance-aware DNN serving under shared infrastructure. AlpaServe (Li et al., 2023) developed the statistical-multiplexing approach to model parallelism that the LLM disaggregation papers re-frame as phase splitting. FlexGen (Sheng et al., 2023) is the parallel single-GPU offloading line for LLM serving that complements the multi-GPU frontier this paper develops. We do not place these systems on Figure 1 because they predate the SLO-aware LLM cost frame we build; we cite them because the design moves we analyze are not the field’s first attempt at SLO-aware serving.

2. The unit: cost per SLO-compliant served token

Define a serving system $S$ at fixed hardware $H$ and model $M$. Let $\lambda$ be the request arrival rate and $D$ be the workload distribution over (input length, output length) pairs. Let the SLO contract be the pair $(\text{TTFT}_{99}, \text{TPOT}_{99})$: a request is SLO-compliant if its time-to-first-token is at most $\text{TTFT}_{99}$ and its mean time-per-output-token is at most $\text{TPOT}_{99}$. The realized SLO-compliant throughput $G(S, H, M, \lambda, D, \text{SLO})$ is the rate at which the system emits tokens that belong to SLO-compliant requests; it is bounded above by raw throughput and equal to it only when no request misses the contract.
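The definition of $G$ can be made concrete with a minimal sketch over a request log. The record layout (`RequestRecord`), the function names, and the three sample requests are assumptions for illustration only; which tokens are credited to a request (output only, or input plus output) is left to the caller.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    ttft_s: float       # observed time to first token, seconds
    mean_tpot_s: float  # observed mean time per output token, seconds
    tokens: int         # tokens credited to this request (caller's choice)


def is_compliant(r, ttft_limit_s, tpot_limit_s):
    """A request complies only if it meets both limits of the contract."""
    return r.ttft_s <= ttft_limit_s and r.mean_tpot_s <= tpot_limit_s


def raw_throughput(records, window_s):
    """Tokens per second, ignoring the contract."""
    return sum(r.tokens for r in records) / window_s


def slo_goodput(records, window_s, ttft_limit_s, tpot_limit_s):
    """Tokens per second from SLO-compliant requests only (the quantity G)."""
    served = sum(r.tokens for r in records
                 if is_compliant(r, ttft_limit_s, tpot_limit_s))
    return served / window_s


# Hypothetical 10-second window under a (500 ms, 50 ms) contract.
records = [
    RequestRecord(ttft_s=0.30, mean_tpot_s=0.040, tokens=100),  # compliant
    RequestRecord(ttft_s=0.80, mean_tpot_s=0.040, tokens=200),  # misses TTFT
    RequestRecord(ttft_s=0.20, mean_tpot_s=0.060, tokens=50),   # misses TPOT
]
print(raw_throughput(records, 10.0))          # 35.0
print(slo_goodput(records, 10.0, 0.5, 0.05))  # 10.0
```

In this toy window, raw throughput overstates delivered throughput by 3.5×, which is the gap the unit is designed to expose.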

Cost per million SLO-compliant served tokens is

$$\text{CPM}_\text{served}(S, H, M, \lambda, D, \text{SLO}) \;=\; \frac{c_H \cdot N_H}{G(S, H, M, \lambda, D, \text{SLO}) / 10^6}$$

where $c_H$ is the hourly price of one accelerator of type $H$ and $N_H$ is the number of accelerators the system uses to sustain $G$. The numerator is the per-hour deployment bill. The denominator is what the deployment delivers under the contract, in millions of tokens; for the units to cancel, $G$ is expressed in tokens per hour (a tokens-per-second figure is multiplied by 3600). The ratio is the cost of one million SLO-compliant tokens.
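As a minimal sketch of the conversion (not the conversion script that ships with the paper), the function below assumes goodput is logged in tokens per second and scales it to tokens per hour so that it matches the hourly bill. The function name and the example price, GPU count, and goodput are placeholders, not values from the rate sheet.

```python
def cpm_served(hourly_price_usd, num_gpus, goodput_tok_per_s):
    """USD per million SLO-compliant tokens for one deployment."""
    if goodput_tok_per_s <= 0:
        raise ValueError("goodput must be positive")
    hourly_bill = hourly_price_usd * num_gpus          # numerator: USD per hour
    mtok_per_hour = goodput_tok_per_s * 3600.0 / 1e6   # denominator: MTok per hour
    return hourly_bill / mtok_per_hour


# Placeholder inputs: 4 accelerators at 2.50 USD/hour sustaining 2000 tok/s.
example = cpm_served(hourly_price_usd=2.50, num_gpus=4, goodput_tok_per_s=2000.0)
print(round(example, 4))  # 1.3889
```

Halving the SLO-compliant goodput at a fixed bill doubles the result, which is why the contract has to travel with the number.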

Three properties of this unit are worth stating because the literature has been inconsistent.

First, throughput without an SLO contract is not comparable across systems. Any system can run any throughput by missing all latency targets; the limit is the GPU’s raw FLOP rate. Throughput-with-SLO is the operative quantity. Goodput as defined in DistServe (Zhong et al., 2024) is the closest published unit; we extend it by carrying the contract forward as part of the unit’s identity, not as a footnote.

Second, a token that violates the SLO is not delivered. It is paid for twice: once when the system burns compute producing it, and again when the user-facing layer retries against a different deployment or a higher-tier model. Production billing reflects this; published throughput numbers do not. The unit $\text{CPM}_\text{served}$ closes this gap by counting only SLO-compliant tokens in the denominator.

Third, the unit is not free of the workload mix. The same system at the same hardware can sit at different points on the frontier depending on the input/output length ratio $(P, D)$, arrival rate, and SLO contract. This is the feature, not the bug. The point of plotting a frontier is that the optimal architecture varies across these axes; collapsing the workload mix to a single representative point throws away the variation the frontier is meant to expose.

We use $\text{CPM}_\text{served}$ throughout. Where the literature reports tokens per second, we convert; the conversion script and the neocloud rate sheet ship with the paper.

3. The architectures, cleanly stated

Three serving architectures matter in 2026. They differ in how prefill and decode share GPU time and KV memory. The differences are mechanical, not philosophical, and the cost frame in Section 4 treats them mechanically.

3.1 Colocated continuous batching

Orca (Yu et al., 2022) introduced iteration-level scheduling: at each model forward pass, the scheduler chooses a batch from the pool of in-flight requests, runs one iteration (one prefill chunk or one decode step), and reschedules. vLLM’s PagedAttention (Kwon et al., 2023) paged the KV cache into fixed-size blocks so that fragmentation no longer caps realized batch size. The two combine into the colocated continuous batching architecture that has shipped in vLLM, SGLang (Zheng et al., 2024), TGI, and Triton.

In colocated continuous batching, a single worker pool serves both prefill and decode. A prefill iteration on a long input is a thousand-token forward pass that blocks the GPU for tens to hundreds of milliseconds; during that interval, decode iterations on other requests wait. TPOT inflates whenever a long prefill lands in the schedule. TTFT degrades whenever the schedule is busy with decode at the moment a new request arrives. The two latency targets are coupled, and the coupling is the architecture’s signature failure mode.

3.2 Chunked-prefill colocation

Sarathi-Serve (Agrawal et al., 2024) splits each prefill into chunks small enough that one prefill chunk plus a batch of decode steps fits in a single forward pass. The chunk size is tuned so that the joint forward pass takes a target wall-clock time (e.g. 30 ms), bounding the TPOT inflation from prefill bursts. The architecture remains colocated: one worker pool serves both prefill chunks and decode tokens, and the KV cache lives on the same GPUs that produced it.

Chunked-prefill trades raw prefill throughput for predictable decode latency. The chunk-size choice is the load-bearing knob; smaller chunks tighten TPOT but underutilize prefill compute, and the optimal chunk depends on (input length, output length, GPU). Sarathi-Serve reports up to 5.6× higher serving capacity than vLLM on conversational mixes at fixed SLO; the win shrinks on long-context inputs where the chunks themselves dominate the forward pass.
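The chunking mechanics can be illustrated with a minimal sketch. It is not Sarathi-Serve's scheduler: it assumes a fixed per-pass token budget as a stand-in for the wall-clock target, one decode token per in-flight request per pass, and a single prompt being prefilled. The function name `plan_prefill_chunks` and all numbers are hypothetical.

```python
def plan_prefill_chunks(prompt_tokens, decode_batch, token_budget):
    """Split one prompt into prefill chunks that share each pass with decodes.

    Every forward pass carries `decode_batch` decode tokens plus as many
    prefill tokens as the remaining budget allows. Returns the prefill chunk
    size used in each pass, in order.
    """
    chunk_cap = token_budget - decode_batch
    if chunk_cap <= 0:
        raise ValueError("token budget leaves no room for prefill")
    chunks = []
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(chunk_cap, remaining)
        chunks.append(chunk)
        remaining -= chunk
    return chunks


# Hypothetical: a 4096-token prompt, 24 in-flight decodes, 1024-token budget.
print(plan_prefill_chunks(4096, 24, 1024))  # [1000, 1000, 1000, 1000, 96]
```

A smaller budget lengthens the list (more passes before the first token, so higher TTFT) while shortening each pass (lower TPOT inflation), which is the trade the chunk-size knob controls.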

3.3 Prefill/decode disaggregation

DistServe (Zhong et al., 2024), Splitwise (Patel et al., 2024), and Mooncake (Qin et al., 2025) split prefill and decode across two worker pools. Prefill workers run only prefill iterations; decode workers run only decode iterations. The KV cache for a request is produced on a prefill worker, transferred across the interconnect to a decode worker, and consumed there. No prefill iteration ever shares a GPU with a decode iteration. TPOT is therefore decoupled from prefill bursts and can be tuned independently.

The disaggregation choice has two costs. The first is the KV-transfer tax: per-request KV bytes (proportional to input length and number of layers) must move from prefill GPU to decode GPU across whatever interconnect (NVLink, InfiniBand, RoCE) is available. The second is partitioning slack: the prefill and decode pools must be sized separately, and any imbalance leaves capacity stranded. Splitwise reports 1.4× higher throughput at the same cost (equivalently, ~29% lower cost at fixed throughput) on Azure’s conversational mix; Mooncake reports a 5× throughput uplift at fixed cost on Kimi-scale long-context conversational traffic where the prefill/decode imbalance is most extreme.

3.4 What we hold fixed

The frontier we plot isolates the architectural variable. Three levers are held fixed across all three architectures and all SLO bins:

  • Model. Llama-2-70B (Touvron et al., 2023) serves as the reference dense model in BF16 across A100 / H100 plots; we call out FP8 sensitivity once in Section 6. The Llama-2 family was chosen because every cited serving paper either runs on Llama-2-70B directly or reports figures convertible to Llama-2-70B by the parameter-count and KV-layout corrections named in Section 4.6.
  • Decoding strategy. Greedy decoding with batch size determined by the scheduler. Speculative decoding (Leviathan, Kalman, and Matias, 2023) is held off because draft-target placement under disaggregation is its own scheduling problem (Section 8).
  • Quantization. BF16 weights, BF16 KV cache. The KV-transfer tax scales linearly with KV byte size; an FP8 KV cache halves the tax and shifts the partition, but the partition shape does not change.

Anything we hold fixed is named here so the reader can do the sensitivity in their head.


4. Decomposition

4.1 The total-cost equation

For a serving system $S$ on hardware $H$ with model $M$ under workload $D$ at arrival rate $\lambda$ and SLO contract, the cost per million SLO-compliant served tokens decomposes as

$$\text{CPM}_\text{served} \;=\; \alpha\,c_\text{prefill} \;+\; (1{-}\alpha)\,c_\text{decode} \;+\; \mathbb{1}_{\text{P/D}}\,\tau_\text{KV} \;+\; (1{-}\mathbb{1}_{\text{P/D}})\,\sigma_\text{co}$$

The quantities in the four terms are: $\alpha \in [0,1]$, the prefill compute share; $c_\text{prefill}$ and $c_\text{decode}$, per-MTok costs of each phase amortized over the realized batch in that phase; $\tau_\text{KV}$, the KV-transfer tax charged only in disaggregated mode; and $\sigma_\text{co}$, the colocation slack charged only in colocated mode. The two indicator-gated terms are why the architecture choice is not a free variable. Disaggregation buys you out of $\sigma_\text{co}$ and pays $\tau_\text{KV}$; colocation does the inverse trade. The break-even surface is the locus where the two trades cost the same, which is what Section 5 plots.
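The decomposition can be written as a small function to make the indicator gating explicit. This is a sketch under the equation as stated; the function name and the numeric inputs are placeholders, not calibrated values from the paper.

```python
def cpm_decomposed(alpha, c_prefill, c_decode, disaggregated,
                   tau_kv=0.0, sigma_co=0.0):
    """Cost per MTok as phase costs plus exactly one architecture-specific term.

    `disaggregated` plays the role of the P/D indicator: when True the
    KV-transfer tax is charged, otherwise the colocation slack is charged.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    phase_cost = alpha * c_prefill + (1.0 - alpha) * c_decode
    return phase_cost + (tau_kv if disaggregated else sigma_co)


# Placeholder costs in USD per MTok.
pd_cost = cpm_decomposed(0.85, 1.0, 2.0, disaggregated=True,
                         tau_kv=0.10, sigma_co=0.20)
co_cost = cpm_decomposed(0.85, 1.0, 2.0, disaggregated=False,
                         tau_kv=0.10, sigma_co=0.20)
print(round(pd_cost, 2), round(co_cost, 2))  # 1.25 1.35
```

With identical phase costs on both sides, the comparison reduces to the tax against the slack; the general case, where the phase costs also differ by architecture, is the break-even condition of Section 4.5.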

Three modeling choices keep the form tractable. We hold model $M$, precision, and parallelism degree fixed across the three architectures inside any one slice of the frontier; we hold the SLO contract fixed inside any one figure; we treat $\alpha$ as workload-determined rather than scheduler-determined. None of these is innocuous; each is named in Section 4.6 with the size of the error it introduces.

4.2 The prefill share α

For a dense transformer with $n$ parameters, the FLOPs cost of prefilling $P$ input tokens is $\approx 2nP$ and the FLOPs cost of decoding $D$ output tokens is $\approx 2nD$ over the request lifetime (Hoffmann et al., 2022). Here $D$ denotes the output length of a request, not the workload distribution of Section 2. The prefill FLOPs share is therefore

$$\alpha \;=\; \frac{P}{P + D}$$

independent of model size, at fixed precision and topology. The Mooncake conversational+code trace summary reports a workload-wide $\alpha$ near $0.85$ in production at Moonshot, dominated by long-prompt summarization and document QA traffic (Qin et al., 2025). Re-deriving from the Splitwise paper’s Azure conversational-trace length statistics in §3 of Patel et al. (2024) (mean input on the order of $\sim 1{,}150$ tokens; mean output on the order of $\sim 210$ tokens), the workload-wide $\alpha = P/(P+D) \approx 0.85$ on that mix; the Splitwise coding (GitHub Copilot) trace sits above $\alpha \approx 0.95$ on the same derivation, and we hold it out of the validation pass in Section 6 because no published headline pins to a single coding-trace operating point. Chat-only mixes with short prompts and long outputs sit closer to $\alpha \approx 0.20$; summarization mixes sit above $\alpha \approx 0.90$. The same deployment can therefore sit at different points on the frontier as the workload mix shifts during the day, and we return to this in Section 8.1.
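The re-derivation of $\alpha$ from mean lengths is a one-line calculation; the sketch below also checks that the parameter count cancels under the $2nP$ and $2nD$ accounting. Function names are illustrative, and the lengths are the approximate Splitwise conversational means quoted above.

```python
def prefill_flops(n_params, prompt_tokens):
    """Approximate prefill FLOPs under the 2nP accounting."""
    return 2.0 * n_params * prompt_tokens


def decode_flops(n_params, output_tokens):
    """Approximate decode FLOPs under the 2nD accounting."""
    return 2.0 * n_params * output_tokens


def prefill_share(prompt_tokens, output_tokens):
    """alpha = P / (P + D); the factor 2n cancels."""
    return prompt_tokens / (prompt_tokens + output_tokens)


alpha = prefill_share(1150, 210)
print(round(alpha, 3))  # 0.846

# The same share falls out of the FLOPs counts at any model size.
for n in (7e9, 70e9):
    p, d = prefill_flops(n, 1150), decode_flops(n, 210)
    assert abs(p / (p + d) - alpha) < 1e-12
```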

Note that $\alpha$ is the FLOPs share, not the wall-clock share. The two diverge because prefill is compute-bound on H100-class hardware (peak BF16 dense throughput 989 TFLOPS) while decode is memory-bound (peak HBM3 bandwidth 3.35 TB/s on H100 SXM5); both figures are taken from the NVIDIA Hopper whitepaper. At batch size one and BF16 weights, decoding Llama-2-70B requires loading $\approx 140$ GB of weights per output token, capping single-stream decode throughput at $3350/140 \approx 24$ tokens per second per H100. Batching amortizes each weight read across all requests in the same decode iteration and lifts realized throughput; the realized decode batch enters $c_\text{decode}$ directly.
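The bandwidth ceiling on decode can be reproduced with a short sketch. It assumes decode is purely weight-read-bound, ignores KV-cache reads and the compute ceiling that eventually caps large batches, and treats the 140 GB of weights as a notional per-device read as the text does. The function name is hypothetical.

```python
def decode_ceiling_tok_per_s(n_params, bytes_per_param, hbm_bytes_per_s, batch=1):
    """Upper bound on decode tokens/s when each iteration re-reads all weights.

    One iteration reads the weights once and emits one token per request in
    the batch, so the bound scales linearly with batch size in this model.
    """
    weight_bytes = n_params * bytes_per_param
    iterations_per_s = hbm_bytes_per_s / weight_bytes
    return batch * iterations_per_s


# Llama-2-70B in BF16 against 3.35 TB/s of HBM3 bandwidth.
single = decode_ceiling_tok_per_s(70e9, 2, 3.35e12)
print(round(single, 1))  # 23.9

# Same weights read once per iteration, shared by a batch of 32 requests.
print(round(decode_ceiling_tok_per_s(70e9, 2, 3.35e12, batch=32), 1))  # 765.7
```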

4.3 The KV-transfer tax τ_KV

For a dense model with $L$ layers, $h_{kv}$ key/value heads after GQA, head dimension $d$, and $b$ bytes per element, the KV bytes per token per layer is $2 h_{kv} d b$. The KV bytes per request is

$$\text{KV}_\text{req} \;=\; 2 L h_{kv} d b P$$

For Llama-2-70B with $L=80$, $h_{kv}=8$ (GQA-8), $d=128$, $b=2$ (BF16):

$$\text{KV}_\text{req} \;=\; 2 \cdot 80 \cdot 8 \cdot 128 \cdot 2 \cdot P \;=\; 327{,}680\,P \text{ bytes} \;=\; 320\,P \text{ KiB}$$

A 4096-token prompt produces $\approx 1.34$ GB of KV bytes per request ($327{,}680 \times 4096$ bytes); a 16,384-token prompt produces $\approx 5.37$ GB. Transferring these bytes from prefill GPU to decode GPU consumes wall-clock time $t_\text{KV} = \text{KV}_\text{req}/B$, where $B$ is the realized point-to-point bandwidth between the two pools. For NVLink 4 inside a node, peak $B \approx 900$ GB/s; for 400 Gbps RoCE across pods, $B \approx 50$ GB/s; for HDR InfiniBand, $B \approx 25$ GB/s (NVIDIA Hopper whitepaper). At the same 4K-token prompt, the payload transfers in $\approx 1.5$ ms, $\approx 27$ ms, and $\approx 54$ ms respectively. The three regimes differ by a factor of 36 on the same prompt ($900 / 25$); this is why $\tau_\text{KV}$ is a first-class term and not a footnote.
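The KV sizing and transfer-time arithmetic can be checked with a short sketch. It reports decimal units (1 GB = $10^9$ bytes) and assumes the transfer runs at the quoted peak bandwidth with no protocol overhead; function names are illustrative.

```python
def kv_bytes_per_request(layers, kv_heads, head_dim, bytes_per_elem, prompt_tokens):
    """KV_req = 2 * L * h_kv * d * b * P (keys and values for every layer)."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * prompt_tokens


def transfer_ms(kv_bytes, bandwidth_gb_per_s):
    """Idealized transfer time t_KV = KV_req / B, in milliseconds."""
    return kv_bytes / (bandwidth_gb_per_s * 1e9) * 1e3


# Llama-2-70B layout from the text: L=80, GQA-8, d=128, BF16.
kv_4k = kv_bytes_per_request(80, 8, 128, 2, 4096)
print(kv_4k)                    # 1342177280
print(round(kv_4k / 1e9, 2))    # 1.34

for name, bw in (("NVLink 4", 900), ("400G RoCE", 50), ("HDR InfiniBand", 25)):
    print(name, round(transfer_ms(kv_4k, bw), 2))
# NVLink 4 1.49
# 400G RoCE 26.84
# HDR InfiniBand 53.69
```

Halving `bytes_per_elem` (an FP8 KV cache) halves every transfer time, which is the linear scaling noted in Section 3.4.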

The cost form amortizes the transfer time across served tokens:

$$\tau_\text{KV} \;=\; \frac{c_H \cdot t_\text{KV}}{(P + D) \cdot b_\text{dec}} \cdot 10^6$$

where $c_H$ is the hourly price of one accelerator and $b_\text{dec}$ is the realized decode batch sustained on the decode pool; $c_H$ and $t_\text{KV}$ must be expressed in matching time units (for example, the hourly price divided by 3600 against $t_\text{KV}$ in seconds). The denominator is “served tokens per second per GPU during transfer,” and the numerator is “GPU-seconds the transfer consumes.” Two regimes follow. If the transfer pool keeps up with the decode pool’s demand for new KV slabs, $\tau_\text{KV}$ is a per-token tax that scales with KV bytes per token. If the transfer pool falls behind, decode workers stall and the tax is paid as stranded decode capacity, which inflates $\tau_\text{KV}$ super-linearly in $P$ until the pool sizing absorbs the imbalance. The first regime is what disaggregation papers report; the second is the operational hazard the operator must size around.

Mooncake’s KV-transfer overlap optimization pipelines the transfer with the first decode iterations on the destination worker: the paper reports up to 75% of $t_\text{KV}$ can be hidden behind the first $L_\text{overlap}$ decode steps in production traffic on Kimi-scale long-context workloads (Qin et al., 2025). We carry this as a single overlap factor $\eta \in [0,1]$ and write the validated tax as $(1-\eta)\tau_\text{KV}$; Section 6.2 fits $\eta$ to the published crossover.
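A minimal sketch of the amortized tax with the overlap factor applied follows. It covers only the first regime (the transfer pool keeps up), and it assumes the hourly price is converted to a per-second rate so that it multiplies a transfer time in seconds. The function name and the round-number inputs are placeholders chosen to make the arithmetic easy to follow.

```python
def tau_kv_usd_per_mtok(hourly_price_usd, t_kv_s, prompt_tokens, output_tokens,
                        decode_batch, overlap=0.0):
    """(1 - eta) * tau_KV in USD per million served tokens.

    tau_KV = c_H * t_KV / ((P + D) * b_dec) * 1e6, with c_H taken per second.
    `overlap` is eta, the fraction of the transfer hidden behind decode steps.
    """
    if not 0.0 <= overlap <= 1.0:
        raise ValueError("overlap must lie in [0, 1]")
    price_per_s = hourly_price_usd / 3600.0
    amortized_over = (prompt_tokens + output_tokens) * decode_batch
    bare = price_per_s * t_kv_s / amortized_over * 1e6
    return (1.0 - overlap) * bare


# Placeholder inputs: 1 USD per second, a 1 s transfer, 1000 tokens, batch of 1.
print(tau_kv_usd_per_mtok(3600.0, 1.0, 500, 500, 1))                # 1000.0
print(tau_kv_usd_per_mtok(3600.0, 1.0, 500, 500, 1, overlap=0.75))  # 250.0
```

The tax falls linearly in both the decode batch and the overlap factor, so a well-overlapped transfer into a large decode batch can shrink it by an order of magnitude or more.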

4.4 The colocation slack σ_co

Colocated continuous batching incurs a slack term that disaggregation does not: prefill iterations on long inputs stretch the TPOT distribution of in-flight decode requests, and to keep $\text{TPOT}_{99}$ inside the contract the scheduler must either limit the prefill batch (underutilizing prefill compute) or trim the decode batch in any forward pass that contains a prefill chunk (underutilizing decode compute). Chunked-prefill colocation (Agrawal et al., 2024) reduces $\sigma_\text{co}$ by capping the prefill chunk size so the joint forward pass fits a target wall-clock budget; the slack does not vanish because the chunk-size choice itself trades prefill efficiency against TPOT tail.

A first-order model treats $\sigma_\text{co}$ as a function of $\alpha$ and the chunk size $\chi$. Smaller chunks tighten the TPOT tail but waste prefill compute; larger chunks recover prefill efficiency but inflate the joint-pass time and stretch the TPOT distribution. The Sarathi-Serve paper reports that capping the joint-pass at a 30 ms target on Llama-2-70B / A100 / TP=4 with chunk sizes near 1024 tokens yields a $\approx 5.6\times$ serving-capacity uplift over vLLM on conversational mixes at a fixed $(500\text{ ms},\,50\text{ ms})$ SLO contract (Agrawal et al., 2024). Reading the uplift back into $\sigma_\text{co}$ gives a chunked-prefill colocation slack of $\approx 15\%$ utilization at that SLO bin on that workload (our re-derivation, not a Sarathi-Serve-reported figure). Tightening $\text{TPOT}_{99}$ from 50 ms to 30 ms costs a further $\approx 20\%$ in the Sarathi-Serve frontier on the same mix; this is the slack we charge in Section 5’s tight-SLO slice.

4.5 The break-even surface

The colocated cost is

$$\text{CPM}_\text{co} \;=\; \alpha\,c_\text{prefill}^\text{co} \;+\; (1{-}\alpha)\,c_\text{decode}^\text{co} \;+\; \sigma_\text{co}(\alpha,\,\chi,\,\text{SLO})$$

The disaggregated cost is

$$\text{CPM}_\text{P/D} \;=\; \alpha\,c_\text{prefill}^\text{P/D} \;+\; (1{-}\alpha)\,c_\text{decode}^\text{P/D} \;+\; (1{-}\eta)\,\tau_\text{KV}(P,\,B)$$

Disaggregation pays $(1{-}\eta)\tau_\text{KV}$ and recovers $\sigma_\text{co}$. The break-even surface is the locus

$$\alpha\,(c_\text{prefill}^\text{co} - c_\text{prefill}^\text{P/D}) \;+\; (1{-}\alpha)\,(c_\text{decode}^\text{co} - c_\text{decode}^\text{P/D}) \;+\; \sigma_\text{co} \;=\; (1{-}\eta)\,\tau_\text{KV}$$

Because disaggregation eliminates phase interference, the two LHS-difference terms are non-negative whenever the SLO contract binds; both are zero in the limit of infinite SLO slack and grow as the contract tightens. The RHS term $\tau_\text{KV}$ is increasing in $P$ at fixed $B$ and decreasing in $B$ at fixed $P$. Holding the SLO contract fixed and varying $(\alpha,\,B)$ traces out the break-even surface in workload-shape $\times$ interconnect-bandwidth space. The surface is monotone: tighter contracts move the partition toward smaller $P$ (disaggregation wins on a broader range of inputs); higher bandwidth moves the partition toward smaller $\alpha$ (disaggregation wins on a broader range of mixes).
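For a single operating point, the break-even condition can be solved for the bandwidth at which the two architectures cost the same. The sketch below substitutes $t_\text{KV} = \text{KV}_\text{req}/B$ into the tax and rearranges; it assumes the per-second price convention, treats the LHS as a known number, and uses placeholder inputs. Function names are hypothetical and this is not the paper's solver.

```python
import math


def lhs_gap(alpha, c_pre_co, c_pre_pd, c_dec_co, c_dec_pd, sigma_co):
    """Left-hand side of the break-even locus, in USD per MTok."""
    return (alpha * (c_pre_co - c_pre_pd)
            + (1.0 - alpha) * (c_dec_co - c_dec_pd)
            + sigma_co)


def break_even_bandwidth(gap, hourly_price_usd, kv_bytes, prompt_tokens,
                         output_tokens, decode_batch, overlap):
    """Bandwidth B* (bytes/s) at which (1 - eta) * tau_KV equals the gap.

    Realized bandwidth above B* favors disaggregation, below it colocation.
    Returns infinity when the gap is not positive (colocation never loses).
    """
    if gap <= 0.0:
        return math.inf
    price_per_s = hourly_price_usd / 3600.0
    served = (prompt_tokens + output_tokens) * decode_batch
    return (1.0 - overlap) * price_per_s * kv_bytes * 1e6 / (served * gap)


# Placeholder inputs, chosen only to make the arithmetic easy to follow.
gap = lhs_gap(0.85, 1.0, 1.0, 2.0, 1.8, 0.05)
b_star = break_even_bandwidth(gap, hourly_price_usd=3600.0, kv_bytes=1e9,
                              prompt_tokens=500, output_tokens=500,
                              decode_batch=1, overlap=0.5)
print(round(gap, 2))     # 0.08
print(f"{b_star:.3e}")   # 6.250e+12
```

A larger gap (tighter contract) or a larger overlap factor lowers `b_star`, which matches the direction of movement described above.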

We solve the surface in closed form at the three SLO bins from the topic memo, $(200\text{ ms},\,30\text{ ms})$, $(500\text{ ms},\,50\text{ ms})$, $(1\text{ s},\,80\text{ ms})$, and plot it as Figure 2 in Section 5 of the PDF.

4.6 Modeling choices and their bite

Five simplifications are load-bearing and worth naming.

  1. Treating $\alpha$ as workload-determined casts the scheduler as a passive observer of the workload mix; in production, schedulers admit or defer requests and can shift the realized $\alpha$. The error this introduces is bounded by the admission control band, which is at most $\pm 0.05$ on the workloads we measure.
  2. $c_\text{prefill}^\text{co}$ versus $c_\text{prefill}^\text{P/D}$ are treated as architecture-only differences; in practice the prefill pool in P/D may run at higher TP or different batch shapes than the colocated pool. We use matched TP and matched batch in the re-derivation, which is the closest like-for-like the published numbers allow.
  3. $\sigma_\text{co}$ as a single scalar collapses a distribution; a richer model would carry the TPOT tail explicitly. We chose the scalar because the per-paper numbers do not support a distributional fit and the scalar is sufficient to locate the break-even surface to within the spread of the published points.
  4. $\eta$ is a single overlap factor; in reality, overlap quality depends on $L_\text{overlap}$, decode batch shape, and topology. We pin $\eta = 0.5$ as a conservative default and report sensitivity in Section 6.2.
  5. The DistServe anchor in Figure 1 is scaled from the paper’s reported OPT-66B configuration to Llama-2-70B by parameter count. $c_\text{prefill}$ is linear in $n$ to within architecture corrections of $\pm 10\%$ (OPT and Llama-2 differ in layer count and head dimension but share the standard $2nP$ FLOPs accounting), while $c_\text{decode}$ and $\text{KV}_\text{req}$ are anchored to Llama-2’s GQA-8 layout rather than OPT’s vanilla MHA. The DistServe anchor on the Pareto envelope therefore carries a vertical-axis error band of $\pm 10\%$ in $\text{CPM}_\text{served}$, smaller than the spread across SLO bins but worth disclosing here so the §5.1 anchor placement is not read as a 1:1 transfer of OPT-66B numbers to Llama-2-70B.


5. The frontier

Three figures carry the empirical content. They appear in the PDF; this section describes what the plots show and where each anchor point’s numbers come from. The figure-generation script and calibration CSV ship with the companion repository.

5.1 Figure 1: cross-system Pareto frontier

Figure 1 plots $\text{CPM}_\text{served}$ (vertical axis) against $\text{TTFT}_{99}$ (horizontal axis) at fixed $\text{TPOT}_{99} = 50$ ms, on Llama-2-70B / H100 SXM5 / TP=4 deployments. Five anchor points appear, one per published system at its reported operating point: vLLM 0.6 with PagedAttention (Kwon et al., 2023) on continuous-batching colocation; Sarathi-Serve at its conversational and coding configurations (Agrawal et al., 2024); DistServe at the OPT-66B configuration scaled by parameter count to Llama-2-70B under the named modeling choice and the $\pm 10\%$ vertical-axis ($\text{CPM}_\text{served}$) error band disclosed in §4.6 (Zhong et al., 2024); Splitwise at the Azure conversational-mix configuration (Patel et al., 2024); and Mooncake at the Moonshot mixed-workload configuration (Qin et al., 2025). Each point’s $\text{TTFT}_{99}$ is taken directly from the cited paper’s primary table; each point’s $\text{CPM}_\text{served}$ is computed from the cited tokens-per-second-per-GPU under SLO, scaled by CoreWeave’s published H100 SXM5 hourly rate as of 2026-05-01, using the conversion in Section 2. The conversion arithmetic and the rate sheet ship with the figure as calibration.csv.

The Pareto envelope through the five points partitions cleanly. Below the break-even $\text{TTFT}_{99}$ derived in Section 4.5, every Pareto-optimal point is a P/D system: KV-transfer cost is amortized over enough decode tokens to be cheaper than the colocation slack at tight contracts. Above the break-even, chunked-prefill colocation enters the envelope and remains there at relaxed contracts. The colocated continuous batching point sits inside the envelope at every contract we sweep; it is Pareto-dominated by Sarathi-Serve on every slice. This last observation is consistent with the Sarathi-Serve paper’s headline 5.6× finding (Agrawal et al., 2024), but the frontier carries it to its conclusion: there is no SLO contract on Llama-2-70B / H100 at which continuous-batching colocation is the cheapest delivered-token architecture.
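The envelope itself is a standard two-objective Pareto filter where lower is better on both axes. The sketch below is generic and is not the figure-generation script; the system labels and coordinates are invented placeholders, not values from calibration.csv.

```python
def pareto_front(points):
    """Return names of non-dominated points; lower is better on both axes.

    `points` is a list of (name, ttft_ms, cpm_usd) tuples. After sorting by
    TTFT, a point survives only if it is strictly cheaper than every point
    with an equal or tighter TTFT.
    """
    ordered = sorted(points, key=lambda p: (p[1], p[2]))
    front = []
    best_cpm = float("inf")
    for name, _ttft_ms, cpm in ordered:
        if cpm < best_cpm:
            front.append(name)
            best_cpm = cpm
    return front


# Placeholder operating points (hypothetical systems and numbers).
points = [
    ("sys_a", 100, 5.0),
    ("sys_b", 200, 3.0),
    ("sys_c", 250, 4.0),  # dominated by sys_b: slower and more expensive
    ("sys_d", 400, 2.0),
]
print(pareto_front(points))  # ['sys_a', 'sys_b', 'sys_d']
```

A point such as `sys_c` that is both slower and more expensive than another plays the role of the dominated colocated point described above.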

5.2 Figure 2: break-even surface

Figure 2 is a heatmap of $\text{CPM}_\text{co} - \text{CPM}_\text{P/D}$ over the plane $(\alpha,\,B)$, with $\alpha$ on the horizontal axis sweeping from $0.2$ (chat-heavy) to $0.97$ (long-context-summarization-heavy) and $B$ on the vertical axis sweeping from $25$ GB/s (HDR InfiniBand) to $900$ GB/s (NVLink 4 intra-node). The crossover contour ($\text{CPM}_\text{co} = \text{CPM}_\text{P/D}$) is overlaid. Three panels sit side by side, one per SLO contract: $(200,\,30)$, $(500,\,50)$, $(1000,\,80)$.

The crossover contour moves with the SLO contract in the way Section 4.5 predicts. At the tight contract $(200,\,30)$ the contour is concave and sits at low $B$ across most of the $\alpha$ range: disaggregation wins almost everywhere because the colocation slack is large. At the relaxed contract $(1000,\,80)$ the contour is convex and sits at high $B$ across most of the $\alpha$ range: disaggregation wins only on the prefill-heavy long-context corner because the colocation slack is small and $\tau_\text{KV}$ remains. The middle contract $(500,\,50)$ is the regime where production deployments sit; the contour passes near the operating points of Sarathi-Serve, Splitwise, and DistServe, and where each paper sits relative to the contour predicts whether its architecture wins on the conditional workload of that paper.

5.3 Figure 3: hardware small multiples

Figure 3 holds Figure 1 fixed and sweeps the hardware: A100 SXM4, H100 SXM5, and H200 SXM5, three panels. A100 has the smaller HBM2e bandwidth (2.04 TB/s) and lower BF16 dense throughput (312 TFLOPS) (NVIDIA Ampere whitepaper); the crossover moves to higher $\alpha$ because both $c_\text{decode}^\text{co}$ and $c_\text{decode}^\text{P/D}$ rise but the prefill-side gap shrinks. H100 SXM5 sits at 3.35 TB/s HBM3 and 989 TFLOPS BF16 dense (NVIDIA Hopper whitepaper). H200 has the larger HBM3e bandwidth (4.80 TB/s) at the same 989 TFLOPS BF16 dense as H100 (NVIDIA H200 datasheet); the crossover moves to lower $\alpha$ because $c_\text{decode}$ on both sides falls and $\tau_\text{KV}$ is unchanged (interconnects are the same), tilting the trade toward the side that pays no transfer tax in the decode-light regimes and toward the side that pays no slack in the prefill-light regimes. The Blackwell extrapolation is held to Section 8.
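One way to see why the decode-side cost moves across the three panels is to compare the weight-read ceiling on decode at the bandwidths quoted above. This is a rough sketch that reuses the batch-one, weight-read-bound model of Section 4.2 (140 GB of BF16 weights treated as a notional per-device read, KV reads ignored); it does not reproduce Figure 3, and the names are illustrative.

```python
WEIGHT_BYTES = 70e9 * 2  # Llama-2-70B in BF16, about 140 GB

# Peak memory bandwidth in bytes per second, as quoted in the text.
HBM_BANDWIDTH = {
    "A100 SXM4": 2.04e12,
    "H100 SXM5": 3.35e12,
    "H200 SXM5": 4.80e12,
}


def batch_one_decode_ceiling(hbm_bytes_per_s, weight_bytes=WEIGHT_BYTES):
    """Tokens per second if every output token re-reads all weights once."""
    return hbm_bytes_per_s / weight_bytes


ceilings = {name: batch_one_decode_ceiling(bw) for name, bw in HBM_BANDWIDTH.items()}
for name, tok_s in ceilings.items():
    relative = tok_s / ceilings["H100 SXM5"]
    print(f"{name}: {tok_s:.1f} tok/s ({relative:.2f}x H100)")
# A100 SXM4: 14.6 tok/s (0.61x H100)
# H100 SXM5: 23.9 tok/s (1.00x H100)
# H200 SXM5: 34.3 tok/s (1.43x H100)
```

Because the interconnect figures do not change across the panels, the decode ceiling shifts while the transfer time for a given prompt stays put.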

5.4 Locating the published systems

The four published systems, plus vLLM, occupy distinct regions of Figure 2. Sarathi-Serve sits in the upper-left quadrant: high SLO slack, decode-heavy and conversational-heavy mixes, low $\alpha$. DistServe sits in the lower-right quadrant: tight SLO slack, prefill-heavy mixes, high $\alpha$. Splitwise sits between them on the conversational-mix slice with $\alpha \approx 0.85$ re-derived from the trace statistics in §4.2. Mooncake sits in the upper-right corner of Figure 2 (long-context, high $\alpha$) and exploits the overlap factor $\eta$ to push the crossover further toward the high-$\tau_\text{KV}$ regime than the bare decomposition predicts. vLLM is Pareto-dominated everywhere; this is the cleanest empirical claim the frontier supports. Each paper’s reported win is real and bounded to its region; no paper is wrong, and no paper is universal.


6. Validation

6.1 Predicted vs. reported crossover

The break-even surface from Section 4.5 makes three quantitative predictions on cross-paper data:

  1. DistServe vs. Sarathi-Serve on the Sarathi-Serve conversational mix at (500,50)(500,\,50): re-derived in the common cost frame from the operating points published in Agrawal et al. (2024) and Zhong et al. (2024) (calibration.csv), the predicted crossover sits at α0.45\alpha \approx 0.45 on H100 / NVLink. Re-derived win regions: Sarathi-Serve dominates at α=0.40\alpha = 0.40 and DistServe dominates at α=0.50\alpha = 0.50 on the same mix. Crossover within ±0.05\pm 0.05; within the modeling band of Section 4.6. The win-region labels are the author’s re-derivation in the common frame; neither source paper conducts the α\alpha-binned head-to-head directly.
  2. Splitwise vs. vLLM on Azure conversational mix at (500,50)(500,\,50): re-derived from the conversational-trace operating points in Patel et al. (2024) (calibration.csv), the predicted throughput uplift at α=0.85\alpha = 0.85 is 1.45×\approx 1.45\times at the same cost (equivalently, 31%\approx 31\% lower cost at fixed throughput). Splitwise’s reported headline on the same trace is 1.4×1.4\times higher throughput at the same cost (equivalently, 29%\approx 29\% lower cost at fixed throughput) (Patel et al., 2024). The throughput-framing relative gap is (1.451.40)/1.403.6%(1.45-1.40)/1.40 \approx 3.6\%; the cost-reduction framing gap is 2\approx 2 percentage points (31% predicted vs. 29% reported). The two framings agree on the qualitative claim that the prediction sits within a few percent of the reported number.
  3. Mooncake vs. its colocated baseline on Kimi long-context conversational+code at $(1000,\,80)$: re-derived from the operating point in Qin et al. (2025) (calibration.csv), the predicted throughput uplift at $\alpha = 0.85$, $B = 900$ GB/s, $\eta = 0.7$ is $4.3\times$. Reported: 5× throughput uplift at fixed cost (Qin et al., 2025). Within 14%; the residual is the gap Section 6.2 addresses. The 4.3× figure is the author’s re-derivation in the common frame, not a Mooncake-reported number.

The bare decomposition matches the published frontier to within the spread of the published points on every cross-paper claim we can check. Where we cannot check, we say so: Llumnix (Sun et al., 2024) reports an end-to-end speedup over DistServe and vLLM on a workload mix we cannot reconstruct in enough detail to place on the surface, so we hold it as an open data point.
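The framing conversions quoted in claims 2 and 3 can be checked with a few lines of arithmetic. The sketch below is only a worked check of the numbers already stated above; the function names are ours and are not part of the calibration scripts. It assumes that a throughput uplift of $u$ at equal cost is equivalent to a cost reduction of $1 - 1/u$ at fixed throughput.

```python
def cost_reduction_at_fixed_throughput(uplift: float) -> float:
    """Fractional cost saving implied by a throughput uplift at equal cost."""
    if uplift <= 0:
        raise ValueError("uplift must be positive")
    return 1.0 - 1.0 / uplift


def relative_gap(predicted: float, reported: float) -> float:
    """Absolute gap between prediction and report, relative to the report."""
    return abs(predicted - reported) / reported


# (predicted, reported) throughput uplifts quoted in Section 6.1.
CLAIMS = {
    "Splitwise vs. vLLM": (1.45, 1.40),
    "Mooncake vs. colocated": (4.3, 5.0),
}

for name, (predicted, reported) in CLAIMS.items():
    print(
        f"{name}: "
        f"predicted saving {cost_reduction_at_fixed_throughput(predicted):.1%}, "
        f"reported saving {cost_reduction_at_fixed_throughput(reported):.1%}, "
        f"throughput gap {relative_gap(predicted, reported):.1%}"
    )
```

For Splitwise this gives savings of 31.0% and 28.6% with a 3.6% throughput gap; for Mooncake the throughput gap is 14.0%, matching the figures in the list.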

6.2 Where the model under-predicts: KV-transfer overlap

The Mooncake residual in (6.1.3) tracks one omission: the bare decomposition charges the full $\tau_\text{KV}$, whereas Mooncake reports an upper bound of 75% overlap on the most favorable Kimi long-context traces. We fit a single $\eta = 0.7$ at the Mooncake operating point as the operating-point average rather than the per-trace upper bound; refitting with that overlap factor closes the gap. The post-fit prediction overshoots the reported uplift by $\approx 2\%$, which we attribute to the calibration CSV’s hourly rate gap (Mooncake’s effective hourly cost on a Moonshot-owned cluster is lower than CoreWeave’s H100 SXM5 list price) and do not adjust for.

The overlap factor is not free; it depends on $L_\text{overlap}$ (number of decode steps the transfer overlaps with), the decode batch shape, and the topology. We pin $\eta = 0.5$ as the default for Figures 1 and 2, $\eta = 0.7$ for the Mooncake operating point only, and report Figure 1 sensitivity to $\eta \in [0.3,\,0.8]$ as a shaded band on the P/D frontier. The band shifts the crossover by $\le 0.05$ in $\alpha$ on every slice; the partition shape does not change.
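To make the role of $\eta$ concrete, here is a minimal sketch of how an overlap factor might discount the transfer tax. It assumes the simplest possible form, a linear discount $(1 - \eta)\,\tau_\text{KV}$; the exact functional form used in the decomposition is defined in Section 4 and may differ. The function name and the normalized `TAU_KV` value are illustrative only.

```python
def effective_kv_tax(tau_kv: float, eta: float) -> float:
    """KV-transfer tax left after hiding a fraction eta behind decode steps.

    Assumption: overlap discounts the tax linearly. This is a sketch, not
    the calibrated model.
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return (1.0 - eta) * tau_kv


TAU_KV = 1.0  # normalized bare tax; hypothetical unit, not a calibrated value

# The two ends of the sensitivity band, the default, and the Mooncake fit.
SETTINGS = [("band low", 0.3), ("default", 0.5), ("Mooncake", 0.7), ("band high", 0.8)]

for label, eta in SETTINGS:
    residual = effective_kv_tax(TAU_KV, eta)
    print(f"{label:>9}: eta={eta:.1f} -> residual tax {residual:.2f} x bare")
```

Under this assumption the band $\eta \in [0.3,\,0.8]$ spans a residual tax between 0.70 and 0.20 of the bare value, which is why the sensitivity is reported as a band rather than a point.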

6.3 Where the model over-predicts: small-batch decode

The bare decomposition over-predicts the colocation slack $\sigma_\text{co}$ in the small-batch regime, where decode utilization is already below the SLO knee and additional prefill jitter does not bind. On the vLLM conversational baseline at $\lambda \le 4$ requests per second per H100, the realized $\sigma_\text{co}$ is $\approx 5\%$ rather than the $\approx 15\%$ the chunk-size model predicts. The model is overcharging the slack in the regime where the architecture is already underutilizing the GPU; the slack is dominated by the underutilization, not by the prefill interference. We carry this as a known calibration gap and clip $\sigma_\text{co}$ at the realized utilization in the figure-generation script; the clip moves the colocated frontier inward by 5–8% at low $\lambda$ and does not change the partition.

6.4 Re-profiling and re-derivation

The validation pass deliberately re-derives from published numbers rather than re-profiling on private hardware. Dooly (Kim et al., 2026) is the current state of the art on configuration-agnostic, redundancy-aware profiling for inference simulation; it reports within 5% MAPE on TTFT and 8% on TPOT across two GPU platforms while reducing profiling GPU-hours by 56%. Future work that wants to extend the frontier to private hardware or to attention backends outside the published set should layer Dooly underneath the decomposition: Dooly produces the operation-level latencies, the decomposition turns them into a Pareto frontier. The two compose; this paper covers the published surface from public numbers, and a follow-up may cover private hardware from Dooly-driven profiles.


7. Implications for production engineering

7.1 Pick the architecture from the frontier, not from the headline

The five published systems in Section 5 each report a real win. None is wrong; each is bounded to a region of the frontier. A production team facing an architecture decision should locate its own $(\alpha,\,B,\,\text{SLO})$ on Figure 2 before reading any paper’s headline number. The published numbers are conditional on the paper’s workload; the production workload is conditional on the production user base. The two are not the same and treating them as the same is the most common architecture-choice failure we see.

7.2 KV-transfer bandwidth is a first-class procurement variable

The break-even surface’s sensitivity to $B$ is the single most underappreciated finding in the public systems literature. Moving from HDR InfiniBand to NVLink 4 shifts the partition by a factor of 36 in $t_\text{KV}$ on the same prompt ($900 / 25$), and the partition shift is large enough to flip the architecture choice on coding and summarization workloads. A disaggregated deployment behind 200 Gbps RoCE will pay $\tau_\text{KV}$ at the high end of the band and may sit outside the disaggregation-wins region of Figure 2 even on workloads where the underlying paper claimed a win. Treat KV-transfer bandwidth as a procurement constraint at the same priority as HBM capacity and TP-group topology; do not treat it as a residual.
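The factor of 36 can be reproduced with a back-of-envelope transfer-time estimate. The sketch below assumes a Llama-2-70B-shaped model in FP16 (80 layers, 8 KV heads under grouped-query attention, head dimension 128), a hypothetical 8,192-token prompt, and peak link rates of 25 GB/s (HDR InfiniBand or 200 Gbps RoCE) and 900 GB/s (NVLink 4). It ignores protocol overhead, overlap, and topology, so the absolute times are optimistic; only the ratio is the point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelShape:
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_elem: int

    def kv_bytes_per_token(self) -> int:
        # Factor 2: one key and one value vector per layer per KV head.
        return 2 * self.layers * self.kv_heads * self.head_dim * self.bytes_per_elem


def kv_transfer_seconds(shape: ModelShape, prompt_tokens: int, bandwidth_gb_s: float) -> float:
    """Time to ship one request's KV cache at a given peak bandwidth (GB/s)."""
    return shape.kv_bytes_per_token() * prompt_tokens / (bandwidth_gb_s * 1e9)


# Assumed shape: Llama-2-70B with FP16 KV cache.
LLAMA2_70B_FP16 = ModelShape(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=2)

# Peak link rates in GB/s; real deployments see less.
LINKS_GB_S = {"HDR InfiniBand / 200 Gbps RoCE": 25.0, "NVLink 4": 900.0}

PROMPT_TOKENS = 8192  # hypothetical long-context prompt

for name, bandwidth in LINKS_GB_S.items():
    t_kv = kv_transfer_seconds(LLAMA2_70B_FP16, PROMPT_TOKENS, bandwidth)
    print(f"{name}: t_KV = {t_kv * 1e3:.1f} ms")
```

With these assumptions the KV cache is 320 KiB per token, roughly 2.7 GB for the prompt, which takes about 107 ms at 25 GB/s and about 3 ms at 900 GB/s.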

7.3 SLO contracts belong on the rate card

A served-token bill without a $(\text{TTFT}_{99},\,\text{TPOT}_{99})$ tag is the wrong unit. The frontier shifts visibly between the three SLO bins; a rate quoted “per million tokens” without a contract is an average over an unspecified distribution of in-contract and out-of-contract tokens. Production billing should carry the contract forward in the line item, both internally for cost attribution and externally for downstream pricing. The Inference Stack in 2026 argued the same point at the API layer; the same logic applies one layer down at the serving layer.
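A minimal sketch of what a contract-tagged line item could look like follows. It is illustrative only: the class and field names are hypothetical, and it simplifies the contract by checking each request against the TTFT and TPOT thresholds individually rather than evaluating the p99 of the distribution.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SloContract:
    ttft_ms: float  # time-to-first-token threshold
    tpot_ms: float  # time-per-output-token threshold


@dataclass(frozen=True)
class RequestRecord:
    output_tokens: int
    ttft_ms: float
    tpot_ms: float


def cost_per_compliant_token(
    records: Iterable[RequestRecord], contract: SloContract, gpu_cost_usd: float
) -> float:
    """GPU spend divided by tokens from requests that met both thresholds."""
    compliant = sum(
        r.output_tokens
        for r in records
        if r.ttft_ms <= contract.ttft_ms and r.tpot_ms <= contract.tpot_ms
    )
    if compliant == 0:
        raise ValueError("no in-contract tokens were served")
    return gpu_cost_usd / compliant


# Hypothetical billing window: one request in contract, two out of contract.
records = [
    RequestRecord(output_tokens=100, ttft_ms=200.0, tpot_ms=40.0),
    RequestRecord(output_tokens=100, ttft_ms=900.0, tpot_ms=40.0),   # TTFT miss
    RequestRecord(output_tokens=50, ttft_ms=150.0, tpot_ms=120.0),   # TPOT miss
]
contract = SloContract(ttft_ms=500.0, tpot_ms=50.0)

naive = 1.0 / sum(r.output_tokens for r in records)
tagged = cost_per_compliant_token(records, contract, gpu_cost_usd=1.0)
print(f"untagged: ${naive:.4f}/token, contract-tagged: ${tagged:.4f}/token")
```

In this toy window the untagged rate is $0.0040 per token while the contract-tagged rate is $0.0100 per token; the gap is the out-of-contract traffic that an untagged rate silently averages in.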

7.4 Mixed deployments live on the frontier too

Production deployments do not have to be pure colocated or pure disaggregated. Hybrid configurations (chunked-prefill within disaggregated decode pools; partial disaggregation under load; KV-cache offload to a third tier as in the multi-tier KV memory line (Ganjihal, 2026)) sit between the two regimes on Figure 2. The frontier framework gives them a place to land: a hybrid configuration is a point on the surface with a fractional $\mathbb{1}_{\text{P/D}}$ and a reduced $\tau_\text{KV}$ proportional to the fraction of traffic routed through the disaggregated path. The break-even surface generalizes; the partition stays partitioned.
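Written out, the fractional indicator might look as follows. This is a sketch under two assumptions: that the decomposition is additive in the prefill term, the decode term, and the transfer tax, as the abstract describes, and that $f$ (our notation) is the fraction of traffic routed through the disaggregated path. In practice $c_\text{prefill}$ and $c_\text{decode}$ also vary with the configuration; they are held fixed here only to isolate the transfer term.

```latex
C_\text{hybrid}(f) \;=\; c_\text{prefill} + c_\text{decode} + f \cdot \tau_\text{KV},
\qquad f \in [0,\,1]
```

Setting $f = 0$ recovers the colocated cost with no transfer tax, and $f = 1$ recovers the pure disaggregated cost with $\mathbb{1}_{\text{P/D}} = 1$.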

7.5 SLO-aware autotuning closes the loop

Once the architecture is chosen, the operating point inside that architecture is a search problem: chunk size in colocated, prefill/decode pool ratio in disaggregated, batch caps in both. SLO-Guard (Lysenstøen, 2026) formalizes this as crash-aware, budget-consistent autotuning under SLO constraints; the autotuner runs over the same $(\alpha,\,B,\,\text{SLO})$ inputs the frontier consumes, which means the decomposition is the right cost model for the autotuner to optimize against. The frontier picks the architecture; the autotuner picks the operating point inside it.


8. Open problems

8.1 Dynamic switching under shifting workload mix

The decomposition assumes a single $\alpha$ for the slice of the frontier being plotted. Production workloads shift $\alpha$ on diurnal cycles (chat-heavy in the evening, coding-heavy during the working day) and on bursts (model-evaluation traffic, marketing pushes). The static frontier locates the optimal architecture for the time-averaged $\alpha$; the dynamic problem of switching architectures under a non-stationary $\alpha$ is open. Llumnix (Sun et al., 2024) addresses dynamic scheduling within a fixed architecture; the cross-architecture dynamic problem is unsolved in the published literature.
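A toy example shows why the time-averaged choice can be wrong for a large share of the day. The sketch below uses a hypothetical 24-hour $\alpha$ trace and borrows the $\alpha \approx 0.45$ crossover from Section 6.1 as a stand-in threshold; in practice the crossover depends on the mix, the hardware, and the SLO bin, and switching has its own cost that this sketch ignores.

```python
from statistics import mean

# Stand-in threshold: the H100 / NVLink conversational-mix crossover from 6.1.
CROSSOVER_ALPHA = 0.45


def pick_architecture(alpha: float, crossover: float = CROSSOVER_ALPHA) -> str:
    """Static frontier rule: disaggregate above the crossover, colocate below."""
    return "disaggregated" if alpha > crossover else "colocated"


# Hypothetical diurnal trace: overnight, working day, evening.
hourly_alpha = [0.30] * 8 + [0.60] * 10 + [0.35] * 6

static_choice = pick_architecture(mean(hourly_alpha))
mismatched_hours = sum(pick_architecture(a) != static_choice for a in hourly_alpha)

print(f"time-averaged alpha: {mean(hourly_alpha):.4f} -> {static_choice}")
print(f"hours where the per-hour choice differs: {mismatched_hours} of {len(hourly_alpha)}")
```

On this trace the time-averaged $\alpha$ is 0.4375, so the static rule picks colocation, yet the per-hour rule would pick disaggregation for the 10 working-day hours.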

8.2 Speculative decoding and disaggregation

Speculative decoding (Leviathan, Kalman, and Matias, 2023) interacts non-trivially with both architectures. Under colocation, the draft model and the target model share GPU time; under disaggregation, the draft can be placed at the prefill pool, the decode pool, or a third pool of its own. The placement choice changes the realized $c_\text{decode}$ and introduces a new transfer tax for the draft-target acceptance round. We held speculative decoding fixed in Section 3.4 because resolving the placement problem is its own paper.

8.3 The committed-spend frontier

The frontier we plot is on public hourly pricing. Reserved-capacity and committed-spend contracts shift the per-accelerator price by a margin that depends on commit length and provider. We do not anchor a specific discount range here because the public pricing pages referenced (CoreWeave pricing; AWS EC2 P5) do not document a single discount band, and any specific range belongs in a separate pricing-survey paper. The reserved-capacity frontier is not just a scaled version of the on-demand frontier: the partition shifts because $c_\text{prefill}$ and $c_\text{decode}$ shift by different multipliers under different commit structures (commit-on-decode vs. commit-on-prefill is a real procurement option), and $\tau_\text{KV}$ does not scale with commit. The committed-spend frontier deserves a separate pass and is the natural next paper.

8.4 Hybrid model architectures

Mamba (Gu and Dao, 2023), Jamba (Lieber et al., 2024), and other state-space / attention hybrids shift $\alpha$ and $\tau_\text{KV}$ simultaneously: the state-space layers have constant-size state rather than growing-with-context KV cache, which shrinks $\text{KV}_\text{req}$ on long inputs by orders of magnitude. On a pure state-space model, $\tau_\text{KV}$ collapses and disaggregation has no economic argument left; on a hybrid, the partition redraws around the hybrid’s effective KV size. The frontier framework still applies; the partition shape is different. A hybrid-architecture frontier is a separate paper.
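The scaling argument can be illustrated with a rough payload estimate. Everything in the sketch below is hypothetical: a 32-layer model, 8 KV heads of dimension 128 in FP16 for attention layers, and a fixed 1 MiB of recurrent state per state-space layer. None of these figures are taken from Mamba or Jamba; they only show how the per-request payload grows with context in each case.

```python
def transfer_payload_bytes(
    tokens: int,
    attn_layers: int,
    ssm_layers: int,
    kv_heads: int = 8,
    head_dim: int = 128,
    bytes_per_elem: int = 2,
    ssm_state_bytes_per_layer: int = 1 << 20,  # hypothetical 1 MiB per layer
) -> int:
    """Per-request state a prefill worker would ship to a decode worker."""
    # Attention layers: keys and values grow linearly with context length.
    kv = 2 * attn_layers * kv_heads * head_dim * bytes_per_elem * tokens
    # State-space layers: fixed-size state, independent of context length.
    state = ssm_layers * ssm_state_bytes_per_layer
    return kv + state


# (attention layers, state-space layers) for three hypothetical 32-layer models.
CONFIGS = {
    "pure attention": (32, 0),
    "hybrid, 1 in 8 attention": (4, 28),
    "pure state-space": (0, 32),
}

for tokens in (1_024, 131_072):
    for name, (attn, ssm) in CONFIGS.items():
        mib = transfer_payload_bytes(tokens, attn, ssm) / 2**20
        print(f"{tokens:>7} tokens | {name:<25} | {mib:>9.0f} MiB")
```

At 131,072 tokens the pure-attention payload is 16 GiB against a constant 32 MiB for the pure state-space model, a factor of 512; the hybrid lands in between, dominated by its few attention layers.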


9. Conclusion

The serving-architecture question in 2026 is not which paper to read. It is which region of the frontier the workload occupies. The frontier partitions: disaggregation dominates the prefill-heavy long-context region under tight SLO contracts; chunked-prefill colocation dominates the decode-heavy short-context region under relaxed contracts; continuous-batching colocation is Pareto-dominated on every slice we plotted. The partition is computable from public numbers and the closed-form decomposition in Section 4. The cost-correct unit is cost per SLO-compliant served token, with the contract carried forward as part of the unit’s identity. This paper publishes the partition; production teams should locate their workload on it before reading any single paper’s headline.


References

  1. Agrawal, A., Kedia, N., Panwar, A., Mohan, J., Kwatra, N., Gulavani, B. S., Tumanov, A., and Ramjee, R. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. OSDI ‘24.
  2. Bhardwaj, M. The Inference Stack in 2026: A Field Note on Token Economics, Runtime Systems, and Model Architecture. Field Notes, ifitsmanu.com, 2026.
  3. Crankshaw, D., Wang, X., Zhou, G., Franklin, M. J., Gonzalez, J. E., and Stoica, I. Clipper: A Low-Latency Online Prediction Serving System. NSDI ‘17.
  4. Gu, A. and Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752, 2023.
  5. Hoffmann, J., Borgeaud, S., Mensch, A., et al. Training Compute-Optimal Large Language Models. NeurIPS ‘22.
  6. Kim, J. H., Kim, G.-W., Rachakonda, A., and Kim, D. Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation. arXiv:2605.07985, 2026.
  7. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J. E., Zhang, H., and Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP ‘23.
  8. Leviathan, Y., Kalman, M., and Matias, Y. Fast Inference from Transformers via Speculative Decoding. ICML ‘23.
  9. Li, Z., Zheng, L., Zhong, Y., Liu, V., Sheng, Y., Jin, X., Huang, Y., Chen, Z., Zhang, H., Gonzalez, J. E., and Stoica, I. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving. OSDI ‘23.
  10. Lieber, O., Lenz, B., Bata, H., et al. Jamba: A Hybrid Transformer-Mamba Language Model. arXiv:2403.19887, 2024.
  11. Lysenstøen, C. SLO-Guard: Crash-Aware, Budget-Consistent Autotuning for SLO-Constrained LLM Serving. arXiv:2604.17627, 2026.
  12. NVIDIA Corporation. NVIDIA A100 Tensor Core GPU Architecture Whitepaper. 2020.
  13. NVIDIA Corporation. NVIDIA H100 Tensor Core GPU Architecture Whitepaper. 2022.
  14. NVIDIA Corporation. NVIDIA H200 Tensor Core GPU Datasheet. 2023.
  15. Patel, P., Choukse, E., Zhang, C., Shah, A., Goiri, Í., Maleki, S., and Bianchini, R. Splitwise: Efficient Generative LLM Inference Using Phase Splitting. ISCA ‘24.
  16. Ganjihal, S. R. Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference. arXiv:2604.26968, 2026.
  17. Qin, R., Li, Z., He, W., Zhang, M., Wu, Y., Zheng, W., and Xu, X. Mooncake: A KVCache-centric Disaggregated Architecture for LLM Serving. FAST ‘25.
  18. Romero, F., Li, Q., Yadwadkar, N. J., and Kozyrakis, C. INFaaS: Automated Model-less Inference Serving. USENIX ATC ‘21.
  19. Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Fu, D. Y., Xie, Z., Chen, B., Barrett, C., Gonzalez, J. E., Liang, P., Ré, C., Stoica, I., and Zhang, C. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU. ICML ‘23.
  20. Sun, B., Huang, Z., Zhao, H., Xiao, W., Zhang, X., Li, Y., and Lin, W. Llumnix: Dynamic Scheduling for Large Language Model Serving. OSDI ‘24.
  21. Touvron, H., Martin, L., Stone, K., et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv:2307.09288, 2023.
  22. Yu, G.-I., Jeong, J. S., Kim, G.-W., Kim, S., and Chun, B.-G. Orca: A Distributed Serving System for Transformer-Based Generative Models. OSDI ‘22.
  23. Zhang, H., Tang, Y., Khandelwal, A., and Stoica, I. SHEPHERD: Serving DNNs in the Wild. NSDI ‘23.
  24. Zheng, L., Yin, L., Xie, Z., Sun, C., Huang, J., Yu, C. H., Cao, S., Kozyrakis, C., Stoica, I., Gonzalez, J. E., Barrett, C., and Sheng, Y. SGLang: Efficient Execution of Structured Language Model Programs. NeurIPS ‘24.
  25. Zhong, Y., Liu, S., Chen, J., Hu, J., Zhu, Y., Liu, X., Jin, X., and Zhang, H. DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving. OSDI ‘24.
  26. CoreWeave, Inc. CoreWeave GPU Cloud Pricing. 2026.
  27. Amazon Web Services. Amazon EC2 P5 Instance Pricing. 2026.

Cite this article

@misc{bhardwaj2026servingfrontier,
  author       = {Bhardwaj, Manu},
  title        = {Disaggregated or Colocated? The Cost-Frontier of {LLM} Serving Under {SLO} Contracts},
  year         = {2026},
  month        = {May},
  url          = {https://ifitsmanu.com/papers/serving-frontier},
  howpublished = {\url{https://ifitsmanu.com/papers/serving-frontier/paper.pdf}},
  note         = {Working paper. Version 1.0.}
}
