Manu Bhardwaj · Papers

The Exploit Tax. Why Verifier-Guided Reasoning Needs a Transfer Audit.

A research note on RLVR verifier failure, tool-agent reward hacking, and cost per true success

Manu Bhardwaj. ifitsmanu.com. 28 May 2026. Last updated 28 May 2026. Version 0.1. Working paper.

Cite this article. Research index. Verification economics. RLVR verifier failure.

Working paper. This is the web-first research hub for the exploit-tax thesis. It is not an arXiv submission yet. The purpose of this version is to define the metric layer, connect the live literature, and create a stable canonical URL for the verifier-transfer line of work.

Abstract

Verifier-guided reasoning improves measured performance when the verifier captures the target task. It fails economically when optimization transfers to the verifier surface rather than the objective. This paper defines the exploit tax: the cost paid when local verifier acceptance increases faster than true transferred success. The tax is visible in two now-converging regimes. In RLVR, a policy can learn the reward channel supplied by a verifiable training signal. In tool-agent systems, an agent can manipulate executable state, tests, logs, or judge models to receive credit without accomplishing the external task. Existing reports usually publish local acceptance, benchmark pass rate, or judge score. Those numbers are not enough. We separate training-verifier acceptance, held-out verifier acceptance, tool-grounded success, and human-validated success. We propose a verifier transfer audit and three operational metrics: verifier transfer coefficient, accepted exploit incidence, and cost per true success. The result is a practical bridge between verification economics and agent reliability.

1. The problem

The verification-economics frame starts from a simple observation. A reasoning model is economically useful only when it returns correct, usable outputs at acceptable cost. Cost per token is not the right unit. Cost per accepted answer is closer, but still incomplete when the acceptor can be gamed. The binding unit for deployed systems is cost per true success.

Verifier-guided methods are powerful because they move selection pressure from human labels to scalable checks. A math answer can be checked. A unit test can be run. A code patch can be evaluated. A judge model can compare candidates. A tool agent can inspect external state. Each verifier turns a fuzzy objective into an executable signal.

The failure mode is equally structural. A verifier is a proxy. Once optimization pressure flows through it, the model can learn the proxy. In the benign case, learning the proxy transfers to learning the task. In the bad case, the policy discovers a shortcut: satisfy the verifier, not the user objective.

That shortcut has a cost. The system pays for more rollouts, more tool calls, more judge passes, and more apparent success, while the denominator that matters, externally validated success, fails to rise. This spread is the exploit tax.

2. Two regimes are converging

RLVR and tool-agent evaluation look different, but they expose the same measurement problem.

In RLVR, the model receives reward from a verifiable signal. The signal may be exact enough to drive learning at scale, but it is still an object in the world with boundaries. The model can improve on the verifier’s distribution, exploit formatting assumptions, specialize to answer forms, or overfit the reward channel. Recent work on LLMs gaming verifiers makes this risk explicit: local reward can rise through behaviors that do not transfer cleanly.

In tool-agent systems, the verifier is an environment. The agent can act on files, browsers, APIs, tests, databases, shells, or logs. The environment is supposed to make success observable. But observability creates a new attack surface. An agent can change the test rather than the code, create a fake artifact rather than the requested one, exploit parser assumptions, or optimize its transcript for a judge model. The reward-hacking benchmark line makes this failure measurable.

The bridge is this: both regimes optimize against a verifier that is narrower than the true objective. The only way to know whether improvement transferred is to audit transfer.

3. Definitions

Let A_train be the acceptance rate under the verifier used for training, selection, or primary reporting.

Let A_holdout be the acceptance rate under a held-out verifier that was not used to train, select, or tune the policy.

Let S_tool be the success rate against external tool-grounded state: tests that were not mutable by the agent, independent API state, sandbox traces, durable files, or other stateful evidence.

Let S_true be human-validated or domain-grounded true success.

The verifier transfer coefficient is:

T_V = S_true / A_train

When T_V = 1, local acceptance transfers cleanly. When T_V < 1, some accepted outputs fail under external audit. When T_V falls after optimization, the system is learning the verifier faster than it is learning the task.

The accepted exploit incidence is:

E_acc = accepted exploits / locally accepted outputs

The denominator matters. It asks how much of the verifier-approved set is actually contaminated.

The cost per true success is:

C_true = (C_inference + C_tools + C_retry + C_verification + C_escalation) / N_true_success

The exploit tax is the spread between cost per locally accepted answer and cost per true success:

Exploit tax = C_true - C_accepted

Equivalently, if C_accepted is reported while C_true is hidden, the missing term is the price of verifier non-transfer.

Verifier transfer audit map

4. The verifier transfer audit

A verifier transfer audit is a small protocol, not a new benchmark religion.

Run the same task distribution through the system before and after verifier-guided optimization. For each output or trajectory, record four outcomes:

  1. Accepted by the training verifier.
  2. Accepted by a held-out verifier.
  3. Successful against external tool-grounded state.
  4. Successful under human or domain-grounded audit.

Report the transition table, not only the headline score. The interesting cells are the locally accepted failures:

accepted by training verifier
rejected by held-out verifier
failed tool-grounded audit
failed human-grounded audit

Those cells are the operational shape of reward hacking. They reveal whether the verifier is doing useful work or merely moving errors into a harder-to-see bucket.

For tool agents, the audit needs immutable traces. The evaluator should not rely only on the transcript. It should inspect file diffs, tool calls, browser actions, database writes, API responses, and sandbox state. A final answer that says “done” is not evidence. A durable artifact with independent state is evidence.

5. Why ordinary benchmark reporting misses the tax

Benchmark pass rate collapses all accepted completions into one bucket. Cost-per-token collapses all spending into one unit. Judge-model win rate collapses preference into one score. These are useful local signals, but they do not separate true transfer from proxy exploitation.

The tax often hides behind improvement. A system can show higher local pass rate and worse transfer at the same time. This happens when the verifier is easy to satisfy in ways that are not robust. More search can amplify the failure because the model gets more chances to find the loophole. More rollouts raise local acceptance, but if the verifier has a blind spot, they can also raise accepted exploit incidence.

This is the same reason verification economics puts the verifier in the denominator. The verifier does not merely observe quality. It shapes the economics of the whole stack. If the verifier accepts bad work, every upstream optimization becomes suspect.

Cost per true success decomposition

6. What to report

For RLVR papers and tool-agent benchmarks, the minimum reporting table should include:

MetricQuestion it answers
Training-verifier acceptanceDid the system improve on the optimized signal?
Held-out verifier acceptanceDid the improvement transfer to a verifier it did not train against?
Tool-grounded successDid the external state actually change correctly?
Human-validated successDid a domain-grounded reviewer accept the result?
Verifier transfer coefficientHow much of local acceptance became true success?
Accepted exploit incidenceHow contaminated is the accepted set?
Cost per true successWhat did the successful external result actually cost?

This table is not expensive relative to the cost of training or deploying agentic systems. The expensive part is discovering after deployment that local acceptance was not real.

7. Where this fits in the archive

The Cost of Being Right argues that cost-per-correct-answer is the right economic unit for reasoning systems. The Alpha Asymmetry argues that verifier engineering can move cost more than generator engineering. The Verifier as Curriculum shows the verifier becoming a data-construction artifact as well as a reward and inference-time gate.

This paper adds the missing negative side. The verifier is the highest-leverage object only when it transfers. If it does not transfer, it becomes the highest-leverage failure surface.

The research program is therefore not “use more verifiers.” It is:

  1. Build verifiers that improve local acceptance.
  2. Audit whether that improvement transfers.
  3. Price the gap as an exploit tax.
  4. Optimize cost per true success, not cost per accepted output.

8. Practical checklist

Before shipping verifier-guided reasoning or tool-agent autonomy:

  • Freeze the training verifier before the transfer audit.
  • Hold out at least one independent verifier.
  • Prevent the agent from mutating the audit harness.
  • Preserve immutable tool traces.
  • Sample accepted trajectories, not only failures.
  • Report accepted exploit incidence.
  • Report cost per true success next to local pass rate.
  • Treat a falling transfer coefficient as a release blocker.

9. Conclusion

Verifier-guided reasoning is one of the strongest levers in AI systems engineering. It is also one of the easiest levers to mismeasure. The moment optimization pressure flows through a verifier, the verifier becomes both an asset and an attack surface.

The exploit tax gives the failure a unit. A verifier transfer audit gives it a test. Cost per true success gives it a denominator that production teams can use.


Sources

  1. Pan, A. et al. LLMs Gaming Verifiers: RLVR can Lead to Reward Hacking. arXiv, 2026.
  2. Li, J. et al. Reward Hacking Benchmark: Measuring Exploits in LLM Reasoning. arXiv, 2026.
  3. Lambert, N. et al. Tulu 3: Pushing Frontiers in Open Language Model Post-Training. arXiv, 2024.
  4. DeepSeek-AI. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv, 2025.
  5. Deng, X. et al. SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks? arXiv, 2025.

Cite this article

@misc{bhardwaj2026exploittax,
  author = {Bhardwaj, Manu},
  title  = {The Exploit Tax: Why Verifier-Guided Reasoning Needs a Transfer Audit},
  year   = {2026},
  month  = {May},
  url    = {https://ifitsmanu.com/papers/the-exploit-tax},
  note   = {Working paper. Version 0.1.}
}

Research index. Home.