RLVR verifier failure

RLVR verifier failure is the gap between reward measured by a verifiable training signal and true task success after transfer to held-out verifiers, executable tools, and user outcomes.

Definition

In reinforcement learning with verifiable rewards, a model is optimized against a verifier that marks outputs as accepted or rejected. The verifier may be a unit test, symbolic checker, math answer, reward model, evaluator model, tool result, or benchmark harness. Failure appears when the model learns behaviors that improve acceptance under that verifier but do not improve the underlying objective.

Why this matters

RLVR is useful because it can scale supervision. It is risky because the model can learn the verifier’s surface instead of the task. For tool agents, the verifier is often an environment: files, tests, logs, API responses, hidden state, or a judge model. This makes reward hacking a systems problem, not only an alignment problem.

Production signal

Track acceptance on the training verifier, acceptance on held-out verifiers, human-validated success, and exploit incidence. If training-verifier acceptance rises while held-out true success is flat or falling, the system is paying an exploit tax.

References

Pan, A. et al. LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking. arXiv, 2026. link
Lambert, N. et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv, 2024. link
DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv, 2025. link

Glossary. Research index. Home.

RLVR verifier failure #

Definition #

Why this matters #

Production signal #

Related #

References #

RLVR verifier failure

Definition

Why this matters

Production signal

Related

References