Accepted exploit incidence

Accepted exploit incidence is the share of verifier-accepted outputs or trajectories that fail a held-out or human-grounded audit.

Definition

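An exploit is accepted when the local verifier scores it as valid even though an external audit rejects it. Accepted exploit incidence is computed over accepted outputs, not all attempts: the denominator is the number of verifier-accepted outputs that were audited, and the numerator is the number of those that the audit rejected. That makes it a direct measure of how often the verifier's acceptance is a false success.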
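The definition can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `AuditedOutput` record and its field names are hypothetical, and it assumes every verifier-accepted record passed in has also been audited.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class AuditedOutput:
    """Hypothetical record for one output or trajectory."""
    verifier_accepted: bool  # the local verifier scored it as valid
    audit_passed: bool       # the held-out or human-grounded audit agreed


def accepted_exploit_incidence(records: Iterable[AuditedOutput]) -> Optional[float]:
    """Share of verifier-accepted records that the audit rejected.

    Rejected attempts are excluded from the denominator.
    Returns None when nothing was accepted, since the share is undefined.
    """
    accepted = [r for r in records if r.verifier_accepted]
    if not accepted:
        return None
    exploits = sum(1 for r in accepted if not r.audit_passed)
    return exploits / len(accepted)


records = [
    AuditedOutput(verifier_accepted=True, audit_passed=True),
    AuditedOutput(verifier_accepted=True, audit_passed=True),
    AuditedOutput(verifier_accepted=True, audit_passed=True),
    AuditedOutput(verifier_accepted=True, audit_passed=False),   # accepted exploit
    AuditedOutput(verifier_accepted=False, audit_passed=False),  # not counted
    AuditedOutput(verifier_accepted=False, audit_passed=True),   # not counted
]
print(accepted_exploit_incidence(records))  # 0.25
```

Only the four accepted records enter the denominator, so one audit failure gives 1/4. The two rejected attempts do not affect the result.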

Why this matters

Verifier-guided systems often improve by generating more candidates and selecting the one that passes. If the verifier has a loophole, more search can increase accepted exploits, because selection favors anything that passes, including candidates that pass only through the loophole. The incidence metric detects that failure mode.

Production signal

Sample accepted trajectories for held-out audit. Break down incidence by task class, tool type, verifier version, and failure mechanism. Treat rising incidence as a release blocker for autonomous tool agents.
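One way to sketch the breakdown and the release check in Python is shown below. The row fields (`tool_type`, `audit_passed`), the function names, and the zero default tolerance are all assumptions made for illustration. Each row is assumed to be an accepted trajectory that was sampled and audited.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def incidence_by_group(audited_rows: Iterable[Mapping], key: str) -> Dict[str, float]:
    """Accepted exploit incidence per value of `key` (e.g. task class, tool type).

    Every row must be an audited, verifier-accepted trajectory with a
    boolean "audit_passed" field (hypothetical schema).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [exploits, audited]
    for row in audited_rows:
        group_counts = counts[row[key]]
        group_counts[1] += 1
        if not row["audit_passed"]:
            group_counts[0] += 1
    return {group: exploits / audited for group, (exploits, audited) in counts.items()}


def blocks_release(previous: Mapping[str, float],
                   current: Mapping[str, float],
                   tolerance: float = 0.0) -> bool:
    """True if any group's incidence rose by more than `tolerance`.

    Groups absent from `previous` are compared against 0.0 (an assumption).
    """
    return any(current[g] > previous.get(g, 0.0) + tolerance for g in current)


rows = [
    {"tool_type": "shell", "audit_passed": True},
    {"tool_type": "shell", "audit_passed": False},
    {"tool_type": "browser", "audit_passed": True},
    {"tool_type": "browser", "audit_passed": True},
]
current = incidence_by_group(rows, "tool_type")
print(current)  # {'shell': 0.5, 'browser': 0.0}
print(blocks_release({"shell": 0.25, "browser": 0.0}, current))  # True
```

With small audit samples, a per-group rate is noisy, so a real gate would likely want a nonzero tolerance or a confidence interval in place of the strict comparison used here.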

