Mamba and state-space models

Mamba is a selective state-space model (SSM) architecture for sequence modeling. It achieves transformer-class quality on language tasks while running in time linear in sequence length and, at inference, carrying a fixed-size state rather than a cache that grows with context. An attention layer, by contrast, costs quadratic time and linear memory in sequence length.

Definition

State-space models are a family of sequence models drawn from classical control theory. They maintain a continuous-time hidden state that evolves through linear differential equations, discretized for use on token sequences, and summarize the entire sequence history in a fixed-size representation. Mamba (Gu and Dao, 2023) made SSMs competitive with transformers by introducing selectivity: the parameters of the state-space dynamics (the discretization step size and the input and output projections) depend on the input, allowing the model to selectively retain or forget information per token.
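
The recurrence described above can be sketched in a few lines of NumPy. This is a toy, sequential illustration rather than the authors' implementation: the function name `selective_scan`, the weight names `W_delta`, `W_B`, and `W_C`, the diagonal form of `A`, and the simplified discretization of `B` are all assumptions made for the example. What it shows is that the step size and the input and output projections are recomputed from each token, while the state `h` keeps the same shape however long the sequence is.

```python
import numpy as np

def softplus(z):
    # Smooth positive map; keeps the per-token step size above zero.
    return np.logaddexp(0.0, z)

def selective_scan(x, A, W_delta, W_B, W_C):
    """Run a toy selective SSM over one sequence, one token at a time.

    x:       (seq_len, d_model) input sequence
    A:       (d_model, d_state) negative diagonal dynamics, one row per channel
    W_delta: (d_model, d_model) hypothetical projection for the step size
    W_B:     (d_model, d_state) hypothetical projection for the input matrix
    W_C:     (d_model, d_state) hypothetical projection for the output matrix
    """
    seq_len, d_model = x.shape
    d_state = A.shape[1]
    h = np.zeros((d_model, d_state))      # fixed-size hidden state
    y = np.empty((seq_len, d_model))
    for t in range(seq_len):
        # Selectivity: delta, B and C are functions of the current token.
        delta = softplus(x[t] @ W_delta)  # (d_model,)
        B_t = x[t] @ W_B                  # (d_state,)
        C_t = x[t] @ W_C                  # (d_state,)
        # Discretize the continuous dynamics with the per-token step size.
        A_bar = np.exp(delta[:, None] * A)     # decay factor in (0, 1)
        B_bar = delta[:, None] * B_t[None, :]  # simplified (Euler-style) input scaling
        # Linear state update, then read out through C.
        h = A_bar * h + B_bar * x[t][:, None]
        y[t] = h @ C_t
    return y, h

rng = np.random.default_rng(0)
d_model, d_state = 4, 8
A = -np.exp(rng.normal(size=(d_model, d_state)))  # negative entries give decaying dynamics
W_delta = rng.normal(size=(d_model, d_model))
W_B = rng.normal(size=(d_model, d_state))
W_C = rng.normal(size=(d_model, d_state))
for seq_len in (16, 256):
    y, h = selective_scan(rng.normal(size=(seq_len, d_model)), A, W_delta, W_B, W_C)
    print(seq_len, y.shape, h.shape)  # the state shape does not depend on seq_len
```

A small step size leaves the decay factor near 1, so the state is mostly preserved and the token is largely ignored; a large step size shrinks the old state and lets the current token dominate. Practical implementations replace the Python loop with a hardware-aware parallel scan.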

Why it matters

Self-attention in a pure transformer costs O(n²) time in sequence length n and, during generation, requires a key-value (KV) cache that grows linearly with context. At long contexts (128K, 256K, 1M tokens), this becomes economically punishing in both memory and compute.
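
To make the linear cache growth concrete, here is a back-of-the-envelope calculation. The model configuration (32 layers, 8 KV heads, head dimension 128, 2-byte values) is hypothetical and chosen only for illustration; it does not describe any particular model, and `kv_cache_bytes` is a helper written for this example.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    # One key vector and one value vector (hence the factor of 2) are stored
    # per layer, per KV head, per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration, for illustration only.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n in (128 * 1024, 256 * 1024, 1024 * 1024):
    gib = kv_cache_bytes(n, **cfg) / 2**30
    print(f"{n:>9,} tokens: {gib:6.1f} GiB of KV cache per sequence")
```

Under these assumptions the cache costs 128 KiB per token, so doubling the context doubles the memory held for every concurrent sequence.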

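Mamba and related selective SSMs cost O(n) time in sequence length, with constant per-step memory because each new token only updates a fixed-size state. For long-context workloads, the throughput advantage is substantial: Gu and Dao (2023) report roughly 5× higher inference throughput than transformers of similar size.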
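
The difference in scaling can be illustrated with a rough operation count for generating n tokens one at a time. Both functions below are simplified proxies invented for this example: they ignore constant factors, projections, and hardware effects, and the dimensions are arbitrary.

```python
def attention_decode_ops(seq_len, d_model):
    # Generating token t attends over all t tokens seen so far, so the total
    # is d_model * (1 + 2 + ... + seq_len).
    return d_model * seq_len * (seq_len + 1) // 2

def ssm_decode_ops(seq_len, d_model, d_state):
    # Each step updates a fixed (d_model, d_state) state, so the work per
    # token does not depend on how many tokens came before.
    return seq_len * d_model * d_state

d_model, d_state = 1024, 16  # arbitrary illustrative sizes
for n in (1_000, 2_000, 4_000):
    print(n, attention_decode_ops(n, d_model), ssm_decode_ops(n, d_model, d_state))
```

Doubling the sequence length roughly quadruples the attention count and exactly doubles the SSM count, which is the O(n²) versus O(n) gap in miniature.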

Hybrid is the production frontier

Pure Mamba models underperform pure transformers on some short-context retrieval-style tasks. Attention can directly query any prior token, which is the right primitive for these tasks, whereas an SSM can only recover what it chose to keep in its compressed state. The 2025 to 2026 production frontier is therefore hybrid: transformer attention layers interleaved with Mamba layers, often with Mixture-of-Experts (MoE) on top.

The flagship example is Jamba 1.5 Large (AI21): 398B total parameters, 94B active, 256K-token context, with attention and Mamba layers at a 1:7 ratio (one attention layer for every seven Mamba layers) and MoE every two blocks. Mamba-3 was published in 2026.
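
A layer schedule with this kind of interleaving can be sketched as follows. The function `hybrid_layer_plan`, the 32-layer depth, and the offsets that decide where the attention and MoE layers fall within each group are assumptions for illustration; only the 1:7 attention-to-Mamba ratio and the MoE-every-two cadence come from the description above.

```python
from collections import Counter

def hybrid_layer_plan(n_layers, attn_period=8, attn_offset=4,
                      moe_period=2, moe_offset=1):
    """Return a list of (mixer, feed_forward) pairs, one per layer."""
    plan = []
    for i in range(n_layers):
        # One attention layer in every group of `attn_period` layers;
        # the rest are Mamba layers (1:7 when attn_period is 8).
        mixer = "attention" if i % attn_period == attn_offset else "mamba"
        # Every `moe_period`-th layer uses a mixture-of-experts feed-forward
        # block instead of a dense MLP.
        ffn = "moe" if i % moe_period == moe_offset else "mlp"
        plan.append((mixer, ffn))
    return plan

plan = hybrid_layer_plan(32)  # hypothetical depth
print(Counter(mixer for mixer, _ in plan))
print(Counter(ffn for _, ffn in plan))
for i, layer in enumerate(plan[:8]):
    print(i, *layer)
```

Keeping most layers as Mamba bounds the KV cache to the few attention layers, while those attention layers still give the model direct access to earlier tokens.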

Tradeoffs

References

