Accepted exploit incidence
Accepted exploit incidence is the share of verifier-accepted outputs or trajectories that fail a held-out or human-grounded audit.
Definition
An exploit is accepted when the local verifier scores it as valid even though an external audit rejects it. Accepted exploit incidence is computed over accepted outputs only, not over all attempts, so the denominator is the number of audited outputs the verifier accepted. That makes it a direct measure of how often the verifier's acceptance is a false success.
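The definition can be sketched in a few lines of Python. The `AuditedOutput` record and its field names are hypothetical, chosen only for illustration, and the sample data is made up. Returning `None` when nothing was accepted is one possible convention, not something the definition prescribes.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AuditedOutput:
    # Hypothetical record: one output with both judgments attached.
    verifier_accepted: bool  # did the local verifier score it as valid?
    audit_passed: bool       # did the held-out or human audit agree?


def accepted_exploit_incidence(outputs: Iterable[AuditedOutput]) -> Optional[float]:
    """Share of verifier-accepted outputs that the audit rejects."""
    accepted = [o for o in outputs if o.verifier_accepted]
    if not accepted:
        return None  # undefined when the verifier accepted nothing
    exploits = sum(1 for o in accepted if not o.audit_passed)
    return exploits / len(accepted)


# Made-up sample: four accepted outputs, one of which fails the audit.
sample = [
    AuditedOutput(verifier_accepted=True, audit_passed=True),
    AuditedOutput(verifier_accepted=True, audit_passed=False),
    AuditedOutput(verifier_accepted=True, audit_passed=True),
    AuditedOutput(verifier_accepted=True, audit_passed=True),
    AuditedOutput(verifier_accepted=False, audit_passed=False),
    AuditedOutput(verifier_accepted=False, audit_passed=True),
]
incidence = accepted_exploit_incidence(sample)
```

The two rejected outputs do not enter the calculation at all, which is what separates this metric from an overall failure rate.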
Why this matters
Verifier-guided systems often improve by generating more candidates and selecting one that passes. If the verifier has a loophole, more search gives that loophole more chances to be found, so the number of accepted exploits can grow with the search budget. The incidence metric detects that failure mode.
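A toy calculation can make the effect concrete. The sketch below is not a model of any real system. It assumes candidates are drawn independently, that the first candidate passing the verifier is selected, and that tasks come in two hypothetical classes with invented per-candidate probabilities. Under those assumptions it computes the incidence in closed form for several search budgets.

```python
def accept_probabilities(p_true, p_exploit, n):
    """For one task with n independent candidates, return
    (P(some candidate is accepted), P(the accepted candidate is an exploit)).

    p_true:    chance a candidate passes both the verifier and the audit
    p_exploit: chance a candidate passes the verifier but fails the audit
    """
    p_pass = p_true + p_exploit
    if p_pass == 0:
        return 0.0, 0.0
    p_accept = 1.0 - (1.0 - p_pass) ** n
    # Given that something passed, it is an exploit with probability
    # p_exploit / p_pass (candidates are assumed independent and identical).
    return p_accept, p_accept * p_exploit / p_pass


def incidence_at(n, task_mix):
    """Accepted exploit incidence over a weighted mix of task classes."""
    accepted = 0.0
    exploits = 0.0
    for weight, p_true, p_exploit in task_mix:
        p_accept, p_bad = accept_probabilities(p_true, p_exploit, n)
        accepted += weight * p_accept
        exploits += weight * p_bad
    return exploits / accepted if accepted else None


# Invented numbers: (weight, p_true, p_exploit) per task class.
TASK_MIX = [
    (0.5, 0.50, 0.01),  # solvable tasks, loophole rarely hit
    (0.5, 0.00, 0.05),  # tasks the policy cannot solve, only exploit
]

curve = {n: incidence_at(n, TASK_MIX) for n in (1, 4, 16)}
```

In this toy setup the incidence rises with `n` because a larger budget mostly adds acceptances on the tasks that can only be "solved" through the loophole. With a single homogeneous task class the ratio would stay flat while the absolute count of accepted exploits still grows, so the shape of the curve depends on the assumed task mix.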
Production signal
Sample accepted trajectories for held-out audit. Break down incidence by task class, tool type, verifier version, and failure mechanism. Treat rising incidence as a release blocker for autonomous tool agents.
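One way to organize the breakdown is sketched below, assuming each audited, verifier-accepted trajectory is stored as a dictionary. The field names, the two audit samples, and the `tolerance` parameter are all hypothetical, and a real gate would also need to account for sampling noise in small audit batches.

```python
from collections import defaultdict


def incidence_by(audited, key):
    """Accepted exploit incidence per value of `key`.

    `audited` holds only trajectories the verifier accepted and the
    audit has judged; each row needs `key` and an "audit_passed" flag.
    """
    totals = defaultdict(int)
    exploits = defaultdict(int)
    for row in audited:
        group = row[key]
        totals[group] += 1
        if not row["audit_passed"]:
            exploits[group] += 1
    return {group: exploits[group] / totals[group] for group in totals}


def rising_groups(baseline, current, tolerance=0.0):
    """Groups whose incidence exceeds the baseline by more than `tolerance`.

    A group missing from the baseline is compared against 0.0.
    """
    return sorted(
        group
        for group, rate in current.items()
        if rate > baseline.get(group, 0.0) + tolerance
    )


# Made-up audit samples for two releases.
previous_audit = [
    {"tool_type": "web_search", "audit_passed": True},
    {"tool_type": "web_search", "audit_passed": True},
    {"tool_type": "shell", "audit_passed": True},
    {"tool_type": "shell", "audit_passed": False},
]
current_audit = [
    {"tool_type": "web_search", "audit_passed": True},
    {"tool_type": "web_search", "audit_passed": False},
    {"tool_type": "shell", "audit_passed": True},
    {"tool_type": "shell", "audit_passed": False},
]

baseline = incidence_by(previous_audit, "tool_type")
current = incidence_by(current_audit, "tool_type")
rising = rising_groups(baseline, current)
block_release = bool(rising)
```

The same `incidence_by` call works for any other field on the rows, such as task class, verifier version, or failure mechanism, provided the audit records carry it.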
Related
- RLVR verifier failure
- Tool-agent reward hacking
- Verifier transfer audit
- Cost per true success
- The Exploit Tax