Failure modes in reinforcement learning with verifiable rewards where a model improves measured reward while degrading transfer, robustness, or true task success.
RLVR can train models to satisfy the verifier rather than the user objective. This topic tracks the gap between benchmark acceptance and deployed truth.
A research paper on verifier failure under RLVR and tool agents. It defines the exploit tax: the cost paid when verifier-guided reasoning increases local acceptance faster than true transferred success.
Archive command
Search the public field-note archive.
Results are local, static, and index the same public surfaces exposed to crawlers.