The Inference Stack in 2026.
A field note on token economics, runtime systems, model architecture, and the stack changes behind public LLM API price compression.
Research Program / emerging
Runtime systems, state, scheduling, observability, and reliability for AI workloads.
Speculative decoding: a decoding-time acceleration technique in which a smaller model drafts tokens and a larger model verifies them.
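The draft-then-verify loop described above is commonly called speculative decoding. What follows is a minimal sketch of its greedy variant, not a production implementation. The names `speculative_decode`, `greedy_decode`, `toy_target`, and `toy_draft` are hypothetical, and the "models" are plain functions from a token prefix to the next token. A real system would verify all drafted positions in one batched forward pass of the larger model, and sampling-based variants use a probabilistic acceptance rule that is not shown here.

```python
from typing import Callable, List

# Assumption for illustration: a "model" is a function that maps a token
# prefix to its single greedy next token.
Model = Callable[[List[int]], int]


def greedy_decode(model: Model, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       max_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft up to k tokens, keep the verified ones."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The small model drafts up to k tokens autoregressively.
        n = min(k, limit - len(out))
        drafted: List[int] = []
        for _ in range(n):
            drafted.append(draft(out + drafted))

        # 2. The large model checks each drafted position in order.
        #    (Called per position here; batched in a real runtime.)
        accepted: List[int] = []
        for tok in drafted:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the large model's token, drop the rest.
                accepted.append(expected)
                break
        out.extend(accepted)
    return out


# Hypothetical toy models: the draft agrees with the target except when the
# prefix length is a multiple of 3.
def toy_target(prefix: List[int]) -> int:
    return (sum(prefix) * 31 + len(prefix)) % 50


def toy_draft(prefix: List[int]) -> int:
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50
```

In this greedy form every kept token is one the larger model would have chosen itself, so the output matches plain greedy decoding with the larger model. Any speedup comes from how many drafted tokens are accepted per verification step.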