Verifier Procurement Under Unobservable Quality. A Scoring-Rule Mechanism for Cost-Correct Minimization.

Manu Bhardwaj

Manu Bhardwaj. ifitsmanu.com. May 2026. Version 1.0. Paper #1 in the verification-procurement wedge.


Companion to the verification-economics field notes The Cost of Being Right: Verification Economics in 2026 (Field Notes #2) and The α Asymmetry (Field Notes #3), which characterise Cost-correct given a verifier. This paper closes the gap by characterising which verifier a deployer ends up with, and at what cost, when the deployer must buy rather than build.


Abstract

A deployer of a large language model who does not train its own verifier must buy verification from a third party. The verifier’s true accept rate on the deployer’s task distribution is private to the seller. Public benchmark scores do not reveal it. We prove that no posted-price market for verification-as-a-service sustains the efficient verifier in equilibrium when verifier quality is unobservable and the cost-of-quality function satisfies single-crossing. The selection collapses to the worst type, in the sense of Akerlof (1970). We construct a procurement mechanism in which each candidate verifier reports decisions on $N$ adversarially generated probes with known ground-truth labels and is paid according to a strictly proper scoring rule evaluated against those labels. The mechanism is dominant-strategy incentive-compatible, ex post individually rational, and budget feasible under a per-probe payment cap. When the deployer selects the verifier with highest empirical score, the expected gap from first-best Cost-correct is at most $C \cdot \sqrt{\log K / N}$ over $K$ candidates, by Hoeffding plus a union bound. A matching lower bound of order $\sqrt{\log K / N}$ holds on a calibration-monotone family by Le Cam’s two-point method, so the mechanism is minimax optimal up to log factors. A simulation on MATH, GSM8K, and HumanEval with $K \in \{4, 8, 16, 32\}$ and $N \in \{16, \ldots, 4096\}$ shows a Cost-correct gap to oracle of under 5% on MATH and GSM8K and of 6.4% on HumanEval at $K = 16$, $N = 256$ under maximin-entropy probes, while posted-price baselines fail to close even 30% of the gap at any $N$ tested. Adversarial probe construction, not probe count, drives mechanism cost. The result has direct operational use under the European Union AI Act high-risk obligations that apply from August 2, 2026.

1. Introduction

The verification-economics framing of The Cost of Being Right treats the verifier accept rate $\alpha$ as the binding lever in cost-per-correct-answer for large language model deployments. The companion analysis on the α-asymmetry shows that the partial derivative of Cost-correct with respect to $\alpha$ dominates the partial derivatives with respect to per-token price, the reasoning multiplier $R$, and the rollout ratio $\bar\rho$ in the rStar-Math regime (Guan et al., 2025). Both notes treat the verifier as a deployer-controlled artefact. They are silent on a question that production deployers face daily: where does the verifier come from when the deployer does not build process reward models in-house?

This paper formalises the procurement question. A deployer purchases verification from one of $K$ candidate sellers. Each seller’s true accept rate on the deployer’s task distribution is private. Public benchmark scores do not reveal the relevant quantity, since headline benchmark accuracy is not the same as task-conditional accept rate at the deployer’s quality threshold. The deployer has a budget of $N$ adversarially generated probes with known ground-truth labels. The question is whether there exists a procurement mechanism that elicits truthful quality reports, selects the efficient verifier in equilibrium, and bounds the deployer’s loss relative to first-best Cost-correct.

We give three results.

Theorem 1 (impossibility). Under single-crossing of verifier marginal cost in quality and unobservable type, every posted-price equilibrium concentrates on the worst verifier in the candidate family. The reduction to Akerlof (1970) is direct. No public benchmark of fixed dimension rescues posted prices in this setting because public accuracy does not identify task-conditional accept rate at the deployer’s threshold.

Theorem 2 (mechanism). A payment rule that compensates each verifier with the value of a strictly proper scoring rule (Gneiting and Raftery, 2007) applied to its reports against ground-truth probe labels is dominant-strategy incentive-compatible, ex post individually rational, and budget feasible under a per-probe payment cap. The construction is closer in spirit to Cai, Daskalakis, and Papadimitriou (2015) and Babaioff, Sharma, and Slivkins (2009) than to peer prediction (Miller, Resnick, and Zeckhauser, 2005; Witkowski and Parkes, 2012; Kong and Schoenebeck, 2019), because the grounded-probe assumption collapses the no-ground-truth peer-prediction reduction and yields strict propriety in dominant strategies rather than only in Nash equilibrium.

Theorems 3 and 4 (matching regret bounds). Selecting the verifier with the highest empirical score, the deployer’s expected Cost-correct gap to the oracle-best verifier is at most a constant times $\sqrt{\log K / N}$ by Hoeffding (1963) plus a union bound. A matching lower bound of order $\sqrt{\log K / N}$ holds on a calibration-monotone family by Le Cam’s two-point method (Le Cam, 1973; Tsybakov, 2009). The mechanism is therefore minimax optimal up to log factors.

The contribution that goes beyond the field notes is the move from $\alpha$ -as-property to $\alpha$ -as-procurement-outcome. The field notes characterise Cost-correct given a verifier. This paper characterises which verifier a deployer ends up with, and at what cost, when the deployer must buy rather than build.

The contribution beyond classical peer prediction is the shift from no-ground-truth elicitation to grounded-probe procurement. Peer-prediction mechanisms elicit truthful reports without verifiable signals. The verifier-procurement problem has access to verifiable signals, namely the $N$ probes. This access makes strict propriety in dominant strategies attainable and removes the need for the common-prior assumptions that the peer-prediction tradition spent fifteen years trying to relax.

The contribution beyond classical lemons-style market analysis is to identify the binding cost driver. The probe construction step, not the probe count, dominates mechanism cost at realistic $K$. Probes are not free. Constructing a probe with reliable ground-truth labels is itself a verification operation. Section 6 develops this point, and the simulation in Section 7 shows that the leading constant in the regret bound is governed by probe-construction strategy, not probe budget.

The result has an external forcing function. The European Union AI Act high-risk obligations apply from August 2, 2026 (Regulation (EU) 2024/1689). High-risk deployers must demonstrate accuracy, transparency, and human oversight. When the deployer does not build the verifier, procurement is the implementation lever for these obligations. The scoring-rule mechanism doubles as compliance evidence. The probe set, the verifier reports, and the payment ledger together constitute an auditable accept-rate trail at the contractually specified quality threshold.

2. Model

Players. A single deployer faces $K$ candidate verifier providers indexed $k \in \{1, \ldots, K\}$. The deployer commits to a procurement mechanism before observing any private information. Each verifier provider knows its own type and observes the mechanism.

Task distribution. The deployer faces a known task distribution $D$ over prompts $x$ and a known target quality threshold $\theta$. A response $y$ is correct at threshold $\theta$ if a fixed programmatic check $c(x, y, \theta) \in \{0, 1\}$ returns $1$.

Verifier type. Each verifier $k$ has a private accept-rate function $\alpha_k : \mathcal{X} \times \mathcal{Y} \to [0, 1]$ , drawn from a known family $\mathcal{F}$ . The function $\alpha_k$ specifies the probability that verifier $k$ accepts a candidate response as correct at threshold $\theta$ . Verifier types are private. The family $\mathcal{F}$ and the per-prompt cost-of-quality functions $\{\kappa_k\}_{k=1}^K$ (cost to verifier $k$ of operating at quality $\alpha_k$ ) are common knowledge.

Cost-correct. Per-task cost under verifier $k$ is, following The Cost of Being Right,

$$\mathrm{CostCorrect}(k) = \frac{\mathrm{CPM}_{1{:}1} \cdot R \cdot (1 + \bar\rho)}{\alpha_k}$$

with $\mathrm{CPM}_{1{:}1}$, $R$, and $\bar\rho$ held fixed across verifier choice. The deployer minimises $\mathrm{CostCorrect}$, which is equivalent to maximising $\alpha_k$ at fixed numerator.
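As a minimal sketch of the definition above, the following evaluates Cost-correct for two candidate verifiers and picks the cheaper one. The numerator values and the two accept rates are hypothetical, chosen only to make the arithmetic visible; they are not taken from the field notes or from the simulation.

```python
def cost_correct(cpm: float, r: float, rho_bar: float, alpha: float) -> float:
    """Cost per correct answer: CPM_{1:1} * R * (1 + rho_bar) / alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    return cpm * r * (1.0 + rho_bar) / alpha


# Hypothetical numerator, held fixed across verifier choice.
CPM, R, RHO_BAR = 2.0, 4.0, 0.5

# Hypothetical accept rates for two candidate verifiers.
alphas = {"verifier_a": 0.50, "verifier_b": 0.80}

costs = {name: cost_correct(CPM, R, RHO_BAR, a) for name, a in alphas.items()}

# Minimising Cost-correct is the same as maximising alpha at fixed numerator.
best = min(costs, key=costs.get)
```

With these inputs the numerator is 12, so the two costs are 24 and 15 and the higher-$\alpha$ verifier is selected.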

Probe set. The deployer has a budget of $N$ probes drawn from a probe distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ with known ground-truth labels $\ell_i \in \{0, 1\}$. Probes may be adversarial with respect to $\mathcal{F}$. Constructing each probe has a fixed cost $\gamma$ that we treat as exogenous below and endogenise in §6.

Mechanism. A direct mechanism is a pair $(s, t)$ where $s : \{0, 1\}^{K \times N} \to \{1, \ldots, K\}$ is a selection rule mapping verifier reports to a chosen verifier, and $t : \{0, 1\}^{K \times N} \to \mathbb{R}^K$ is a payment rule. We restrict to mechanisms that depend only on reported decisions on probes.

Solution concept. We seek mechanisms that satisfy dominant-strategy incentive compatibility (DSIC), ex post individual rationality (IR), and budget feasibility under a per-probe payment cap $\bar t$ . We measure performance by expected regret against first-best,

$$\mathrm{Reg}(s, t) = \mathbb{E}\!\left[\,\mathrm{CostCorrect}(s) - \min_k \mathrm{CostCorrect}(k)\,\right]$$

and by worst-case regret over $\mathcal{F}$ .

Calibration-monotone family. A family $\mathcal{F}$ is calibration-monotone if there exists an ordering $\succeq$ on $\mathcal{F}$ such that $\alpha_k \succeq \alpha_{k'}$ implies $\Pr_{(x, y) \sim D}[\alpha_k(x, y) > \tau] \geq \Pr_{(x, y) \sim D}[\alpha_{k'}(x, y) > \tau]$ for all thresholds $\tau \in [0, 1]$. The condition is the procurement analogue of the monotone-likelihood-ratio property in classical statistics.

3. Impossibility for posted-price markets

A posted-price market offers a single price $p$ at which the deployer commits to purchase from any seller who chooses to participate. Sellers self-select. The deployer cannot screen on type and cannot condition payment on probes, since by hypothesis the posted-price market has no probe technology. The setting is the classical lemons market (Akerlof, 1970), adapted to verification.

Theorem 1 (posted-price collapse). Suppose $\mathcal{F}$ is calibration-monotone and the cost-of-quality function $\kappa_k$ satisfies single-crossing: for any $\alpha_k \succ \alpha_{k'}$ , the marginal cost of operating at quality $\alpha_k$ minus the marginal cost of operating at quality $\alpha_{k'}$ is strictly positive and increasing in quality. Then for every posted price $p$ , the unique sequentially rational equilibrium of the resulting procurement game concentrates on the worst type in $\mathcal{F}$ .

Proof sketch. Fix $p$. Each verifier $k$ participates if and only if $p \geq \kappa_k$. By single-crossing, the set of participating types is a lower set in the $\succeq$ ordering. The deployer’s expected Cost-correct under uniform sampling from participating types is decreasing in the quality of the marginal participating type, so a price that admits only low types buys a pool with high expected Cost-correct. Anticipating this, the deployer lowers the price it is willing to post, the highest-quality participating types exit, and the argument repeats. In the limit, only the lowest-cost (worst-quality) participating type’s expected payoff is bounded below by zero. The standard adverse-selection unravelling (Mas-Colell, Whinston, and Green, 1995, ch. 13) yields collapse to the worst type. Full proof in Appendix A of the PDF. ∎
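The unravelling step can be illustrated with a stylised iteration. This is a sketch under assumptions that are not in the proof: seller cost is linear in quality (`0.95 * alpha`), the deployer's willingness to pay equals the mean quality of the current participants, and the five quality levels are hypothetical. It is meant only to show the mechanics of the collapse, not to reproduce Appendix A.

```python
def unravel(qualities, cost, max_rounds=100):
    """Iterate the posted price until the participating set stops changing.

    qualities: accept rates, one per seller type.
    cost: maps a quality level to that seller's reservation cost.
    Assumption: the deployer posts the mean quality of current participants.
    """
    participants = sorted(qualities)
    price = None
    for _ in range(max_rounds):
        price = sum(participants) / len(participants)
        # A seller stays only if the posted price covers its cost.
        staying = [a for a in participants if cost(a) <= price]
        if not staying:
            return [], price
        if staying == participants:
            break
        participants = staying
    return participants, price


# Hypothetical type space and cost function.
types = [0.5, 0.6, 0.7, 0.8, 0.9]
final, final_price = unravel(types, cost=lambda a: 0.95 * a)
```

Starting from all five types, the price falls from 0.7 to 0.6 to 0.55 to 0.5 as the best remaining types exit each round, and only the worst type (0.5) remains.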

Why public benchmarks do not rescue the posted-price market. Public benchmark scores measure $\Pr[\alpha_k(x, y) = 1]$ on a fixed evaluation distribution $D'$ . The deployer’s relevant quantity is $\Pr[\alpha_k(x, y) = 1 \mid x \sim D]$ at the deployer’s threshold $\theta$ . Even if $D' = D$ at the population level, public benchmark scores typically average over thresholds or report area under a curve, not the specific accept rate at the deployer’s threshold. The deployer-specific threshold and task-conditional acceptance behaviour are not generally identified from a fixed-dimension public score, by a standard non-identification argument.

Corollary 1 (no public-benchmark fix). No fixed-dimension public benchmark score function $\sigma : \mathcal{F} \to \mathbb{R}^d$ identifies the deployer-specific quantity $\alpha_k(\theta, D)$ for arbitrary $(\theta, D)$ , unless $d$ scales with the cardinality of the support of $D$ at threshold $\theta$ .

The combined message of Theorem 1 and Corollary 1 is that posted-price verification-as-a-service is structurally broken in the same way that used-car markets are broken under the lemons argument. The next sections build a mechanism that closes the gap.


4. The scoring-rule mechanism

Construction. Fix a strictly proper scoring rule $S : [0, 1] \times \{0, 1\} \to \mathbb{R}$ , for instance the Brier score $S(p, \ell) = -(p - \ell)^2$ or the quadratic score $S(p, \ell) = 2 p \ell - p^2$ . Generate $N$ probes $\{(x_i, y_i, \ell_i)\}_{i=1}^N$ with known ground-truth labels. Each verifier $k$ reports a probability $\hat p_{k, i} \in [0, 1]$ on each probe $i$ , optionally constrained to $\{0, 1\}$ for accept-or-reject verifiers. The mechanism pays verifier $k$ the amount

$$t_k(\hat p_k, \ell) = a + b \cdot \frac{1}{N} \sum_{i=1}^N S(\hat p_{k, i}, \ell_i)$$

for constants $a \geq 0$ and $b > 0$ to be set below. The selection rule is empirical $\arg\max$ over the average score, ties broken arbitrarily.

Theorem 2 (scoring-rule mechanism). Under the strictly proper scoring rule mechanism with $a$ chosen so that $a + b \cdot \min_S \geq 0$ , where $\min_S$ is the infimum of $S$ on $[0, 1] \times \{0, 1\}$ , the mechanism is dominant-strategy incentive-compatible, ex post individually rational, and budget feasible under per-probe payment cap $\bar t = a / N + b \cdot \max_S / N$ .
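A minimal sketch of the payment rule under the Brier score follows. The constants $a = b = 1$ are an illustrative choice that makes $a + b \cdot \min_S = 0$ (since the Brier score ranges over $[-1, 0]$); the reports and labels are hypothetical.

```python
def brier(p: float, label: int) -> float:
    """Brier score S(p, l) = -(p - l)^2, with values in [-1, 0]."""
    return -((p - label) ** 2)


def payment(reports, labels, a: float, b: float) -> float:
    """t_k = a + b * (average scoring-rule value over the N probes)."""
    if len(reports) != len(labels) or not labels:
        raise ValueError("reports and labels must be non-empty and aligned")
    avg = sum(brier(p, l) for p, l in zip(reports, labels)) / len(labels)
    return a + b * avg


A, B = 1.0, 1.0          # illustrative: a + b * min_S = 1 + 1 * (-1) = 0
labels = [1, 0, 1, 1]    # hypothetical ground-truth probe labels

good = payment([0.9, 0.2, 0.7, 1.0], labels, A, B)   # mostly right
worst = payment([0.0, 1.0, 0.0, 0.0], labels, A, B)  # confidently wrong everywhere
```

Under these constants every payment lies in $[0, 1]$: the confidently wrong reporter receives exactly zero, which is the ex post individual-rationality floor, and no reporter can receive more than $a + b \cdot \max_S = 1$.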

Proof. Strict propriety of $S$ implies that for any belief $q$ verifier $k$ holds about the probability that $\ell_i = 1$ given $(x_i, y_i)$ , the unique maximiser of $\mathbb{E}_\ell S(\hat p, \ell)$ over $\hat p$ is $\hat p = q$ . This is the defining property of strict propriety (Gneiting and Raftery, 2007). Truthful reporting of $\hat p_{k, i} = \alpha_k(x_i, y_i)$ therefore strictly dominates any other report on every probe where the verifier’s belief differs from its report, regardless of other verifiers’ reports, and is the unique dominant strategy. Individual rationality follows from the choice of $a$ . Budget feasibility follows from the per-probe payment cap. ∎
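For the Brier score the propriety step can be checked by hand. Writing $q$ for the verifier's belief that $\ell_i = 1$ and $\hat p$ for its report, a short worked computation (an illustration for this one scoring rule, not a replacement for the general argument) gives

$$
\begin{aligned}
\mathbb{E}_{\ell \sim \mathrm{Bern}(q)}\, S(\hat p, \ell)
  &= -q(\hat p - 1)^2 - (1 - q)\hat p^2 \\
  &= -\hat p^2 + 2 q \hat p - q \\
  &= -(\hat p - q)^2 - q(1 - q).
\end{aligned}
$$

The second term does not depend on the report, and the first is strictly negative unless $\hat p = q$, so the expected score has the unique maximiser $\hat p = q$ with value $-q(1 - q)$.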

Selection. Let $\bar S_k = \frac{1}{N}\sum_i S(\hat p_{k, i}, \ell_i)$ be verifier $k$'s average scoring-rule value on the probe set. The selection rule chooses $\hat k = \arg\max_k \bar S_k$. When verifiers are restricted to binary reports $\hat p_{k, i} \in \{0, 1\}$, this reduces to choosing the verifier with highest empirical accept rate $\hat\alpha_k = \frac{1}{N}\sum_i \mathbf{1}[\hat p_{k, i} = \ell_i]$, since on $\{0, 1\}$ outputs the Brier and quadratic scores both equal the negative 0-1 loss up to a term that does not depend on the report.
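The selection rule can be sketched as follows. The verifier names, reports, and labels are hypothetical; the point of the example is that, with binary reports, the average Brier score equals the empirical agreement rate minus one, so both rankings coincide.

```python
def avg_brier(reports, labels):
    """Average Brier score of one verifier over the probe set."""
    return sum(-((p - l) ** 2) for p, l in zip(reports, labels)) / len(labels)


def agreement(reports, labels):
    """Fraction of probes on which a binary report matches the label."""
    return sum(int(p == l) for p, l in zip(reports, labels)) / len(labels)


def select(reports_by_verifier, labels):
    """Empirical arg max over average score; ties go to the first maximiser."""
    scores = {k: avg_brier(r, labels) for k, r in reports_by_verifier.items()}
    return max(scores, key=scores.get), scores


labels = [1, 0, 1, 1, 0]                # hypothetical probe labels
reports = {
    "v1": [1, 0, 0, 0, 0],              # agrees on 3 of 5 probes
    "v2": [0, 1, 1, 1, 1],              # agrees on 2 of 5 probes
    "v3": [1, 0, 1, 0, 0],              # agrees on 4 of 5 probes
}

chosen, scores = select(reports, labels)
rates = {k: agreement(r, labels) for k, r in reports.items()}
```

Here `chosen` is `"v3"`, and for each verifier `scores[k]` equals `rates[k] - 1`.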

Why grounded probes give strict propriety in dominant strategies. Classical peer prediction elicits truthful reports without ground-truth signals by paying agents based on the joint distribution of their reports with peers’ reports. Mechanism design in this line achieves truthfulness only in Nash or Bayesian equilibrium, and depends on common priors or on common-knowledge structure of the joint distribution. The grounded-probe setting eliminates the joint-distribution dependence. Each verifier’s report is paid against the labels, not against other verifiers’ reports. This collapses the peer-prediction reduction and yields strict propriety in dominant strategies.

5. Regret bounds

We now bound the deployer’s expected gap from first-best Cost-correct under the mechanism of §4. Throughout this section, verifiers report truthfully, by Theorem 2.

Theorem 3 (upper bound). Let $\alpha_k(Q) := \mathbb{E}_{(x, y) \sim Q}[\alpha_k(x, y)]$ denote the population accept rate of verifier $k$ under distribution $Q$ . Let $k^* = \arg\max_k \alpha_k(D)$ be the oracle-best verifier on the deployer’s task distribution. Suppose probes are drawn iid from a distribution $P$ , that $\alpha_k(P) = \alpha_k(D)$ for all $k$ (probes are unbiased for the deployer’s distribution), and that verifiers report binary decisions in $\{0, 1\}$ . Then the expected gap of the empirical $\arg\max$ rule is

$$\mathbb{E}\!\left[\alpha_{k^*}(D) - \alpha_{\hat k}(D)\right] \leq C \cdot \sqrt{\frac{\log K}{N}}$$

for a universal constant $C$.

Proof. By Hoeffding’s inequality (1963) applied to bounded random variables in $[0, 1]$ , $\Pr[|\hat\alpha_k - \alpha_k(P)| > \epsilon] \leq 2\exp(-2 N \epsilon^2)$ for each $k$ . By a union bound, $\Pr[\max_k |\hat\alpha_k - \alpha_k(P)| > \epsilon] \leq 2K\exp(-2 N \epsilon^2)$ . Setting $\epsilon = \sqrt{(\log K + \log(2/\delta)) / (2N)}$ gives the failure probability $\delta$ . Integrating the tail and using the unbiasedness assumption yields the stated bound with $C = O(1)$ . Full computation in Appendix B of the PDF. ∎
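A small Monte Carlo check can make the bound concrete. The sketch below assumes binary decisions that are correct independently with probability $\alpha_k$ (independence across verifiers is a simplification; the union bound does not need it), and compares the simulated gap with one explicit form of the bound, $\sqrt{2 \log(2K) / N}$, which follows from the standard sub-Gaussian maximal inequality. That explicit constant is an assumption of this example; the constant derived in Appendix B may differ. The accept rates are hypothetical.

```python
import math
import random


def simulate_gap(alphas, n_probes, n_trials, seed=0):
    """Average true-accuracy gap of the empirical arg max over n_trials runs."""
    rng = random.Random(seed)
    best = max(alphas)
    total = 0.0
    for _ in range(n_trials):
        # Each verifier's decision on each probe is correct with prob. alpha_k.
        emp = [
            sum(rng.random() < a for _ in range(n_probes)) / n_probes
            for a in alphas
        ]
        k_hat = max(range(len(alphas)), key=lambda k: emp[k])
        total += best - alphas[k_hat]
    return total / n_trials


def explicit_bound(k, n):
    """sqrt(2 log(2K) / N): twice the expected max deviation of K means."""
    return math.sqrt(2.0 * math.log(2 * k) / n)


alphas = [0.60 + 0.03 * i for i in range(8)]   # hypothetical accept rates
gap = simulate_gap(alphas, n_probes=256, n_trials=300)
bound = explicit_bound(len(alphas), 256)
```

The simulated gap should sit well below the bound, as expected from a worst-case guarantee that ignores the actual spacing between the verifiers' accept rates.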

The translation to Cost-correct units is direct. We have $\mathrm{CostCorrect}(\hat k) - \mathrm{CostCorrect}(k^*) = \mathrm{CPM}_{1{:}1} R (1 + \bar\rho)\,(1/\alpha_{\hat k} - 1/\alpha_{k^*})$. On the event that $\alpha_{\hat k}, \alpha_{k^*} \geq \alpha_{\min} > 0$, the gap in $1/\alpha$ satisfies $|1/\alpha_{\hat k} - 1/\alpha_{k^*}| \leq |\alpha_{k^*} - \alpha_{\hat k}| / \alpha_{\min}^2$, so the Cost-correct gap scales as $\sqrt{\log K / N}$ up to a Lipschitz constant determined by $\alpha_{\min}$.
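The intermediate step in the $1/\alpha$ bound is a single common-denominator identity. Spelled out, with hypothetical numbers in the second line for scale:

$$
\begin{aligned}
\left|\frac{1}{\alpha_{\hat k}} - \frac{1}{\alpha_{k^*}}\right|
  &= \frac{|\alpha_{k^*} - \alpha_{\hat k}|}{\alpha_{\hat k}\,\alpha_{k^*}}
  \;\leq\; \frac{|\alpha_{k^*} - \alpha_{\hat k}|}{\alpha_{\min}^2}, \\
\text{e.g. } \alpha_{\min} = 0.5,\ |\alpha_{k^*} - \alpha_{\hat k}| = 0.05
  &\;\Longrightarrow\;
  \left|\frac{1}{\alpha_{\hat k}} - \frac{1}{\alpha_{k^*}}\right| \leq \frac{0.05}{0.25} = 0.2 .
\end{aligned}
$$

The Cost-correct gap is then at most $0.2 \cdot \mathrm{CPM}_{1{:}1} R (1 + \bar\rho)$ in this illustrative case, which shows how a small $\alpha_{\min}$ inflates the constant.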

Theorem 4 (lower bound). Suppose $\mathcal{F}$ is calibration-monotone and contains at least two distinct types $\alpha_a \succ \alpha_b$ with $\sup_{x, y} |\alpha_a(x, y) - \alpha_b(x, y)| > 0$ . Then for any mechanism $(s, t)$ and any $K \geq 2$ , there exists a profile of types in $\mathcal{F}^K$ such that

$$\mathbb{E}\!\left[\alpha_{k^*}(D) - \alpha_{s(\hat p)}(D)\right] \geq c \cdot \sqrt{\frac{\log K}{N}}$$

for a constant $c > 0$ depending on $\mathcal{F}$ but not on $K$ or $N$.

Proof sketch. Apply Le Cam’s two-point method. Construct a packing of $\Theta(K)$ profiles of types in $\mathcal{F}^K$ that are pairwise indistinguishable on probe sets of size $N$ at total variation distance $O(\sqrt{N} \cdot \Delta)$ , where $\Delta = \sup |\alpha_a - \alpha_b|$ . Standard Le Cam arguments (Tsybakov, 2009, ch. 2) yield expected $\ell_\infty$ error of order $\sqrt{\log K / N}$ on the implied estimation problem. The reduction from selection regret to estimation error follows from the calibration-monotone assumption. ∎

Theorems 3 and 4 together imply that the scoring-rule mechanism is minimax optimal up to log factors over calibration-monotone families. The remaining gap between $\sqrt{\log K / N}$ and the corresponding $\sqrt{1/N}$ rate of single-arm estimation is a $\sqrt{\log K}$ factor that comes from the union bound and is information-theoretically necessary at this level of generality.

Sample complexity in deployer-relevant terms. Solving for $N$ given target gap $\epsilon$ in $\alpha$ -units yields $N \geq C^2 \log K / \epsilon^2$ . At $K = 16$ and $\epsilon = 0.05$ , with the universal constant $C$ on the order of unity in the simulations of §7, the budget is $N \approx 1100$ . At $K = 32$ and $\epsilon = 0.05$ , $N \approx 1400$ . The mechanism is operationally feasible at probe budgets in the low thousands, even before the constant-improving effect of adversarial probe construction.
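The budget arithmetic above can be reproduced directly. This sketch takes $C = 1$, matching the "order of unity" remark, and uses the natural logarithm; the function name `probe_budget` is hypothetical.

```python
import math


def probe_budget(k: int, eps: float, c: float = 1.0) -> int:
    """Smallest integer N with N >= C^2 * log(K) / eps^2 (natural log)."""
    if k < 2 or eps <= 0.0:
        raise ValueError("need K >= 2 and eps > 0")
    return math.ceil(c ** 2 * math.log(k) / eps ** 2)


budgets = {k: probe_budget(k, eps=0.05) for k in (16, 32)}
# budgets[16] == 1110 and budgets[32] == 1387, i.e. roughly 1100 and 1400.
```

Because the budget grows only logarithmically in $K$, doubling the candidate pool from 16 to 32 adds about a quarter to the probe count rather than doubling it.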

6. The adversarial probe construction problem

The bounds of §5 treat the probe distribution $P$ as exogenous. In practice, probes are not free. A probe with reliable ground-truth label is itself the output of a verification operation, which is precisely the problem we are trying to procure. We endogenise probe construction here.

Three probe-construction strategies.

Uniform random. Probes are drawn iid from $D$ . Ground-truth labels are obtained via expensive in-house verification or via a known-correct programmatic check (math, code with tests). Cost per probe is fixed at the in-house verification cost.

Maximin entropy. Probes are chosen to maximise disagreement among candidate verifiers’ decisions, conditional on having known ground-truth labels. Given a pool of candidate probes, select the subset that maximises the entropy of the empirical accept-or-reject distribution across $\{1, \ldots, K\}$. The construction follows the active-learning tradition.

Hard-instance mining. Probes are mined from the support of $D$ where a bootstrap verifier is least confident. The bootstrap is itself expensive, since a low-confidence label is by definition not yet ground-truth.
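The maximin-entropy strategy described above can be sketched as a ranking of pool probes by the binary entropy of the verifiers' accept fraction. This is a simplification under stated assumptions: it scores probes one at a time rather than optimising over subsets, and it assumes the deployer can observe each candidate verifier's decision on the pool before fixing the probe set. The decision matrix is hypothetical.

```python
import math


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) variable; 0 when p is 0 or 1."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_probes(decisions, n):
    """Pick the n pool probes on which the verifiers disagree most.

    decisions[k][j] in {0, 1} is verifier k's accept decision on pool probe j.
    Returns the selected probe indices in ascending order.
    """
    k, m = len(decisions), len(decisions[0])
    entropy = [
        binary_entropy(sum(decisions[i][j] for i in range(k)) / k)
        for j in range(m)
    ]
    # Stable sort: probes with equal entropy keep their pool order.
    ranked = sorted(range(m), key=lambda j: entropy[j], reverse=True)
    return sorted(ranked[:n])


# Hypothetical pool of M = 5 probes and K = 4 verifiers.
pool_decisions = [
    [1, 1, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 0],
]
chosen = select_probes(pool_decisions, n=2)
```

Probes 0 and 3 receive unanimous decisions and carry no information about which verifier is better, so the two evenly split probes (indices 1 and 4) are chosen.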

Proposition 1 (maximin-entropy improvement). Under maximin-entropy probe construction with a probe-pool size $M \geq K$, the leading constant in the regret bound of Theorem 3 decreases by a factor of order $\sqrt{K}$ relative to uniform-random probes.

Proposition 2 (hard-instance mining tradeoff). Under hard-instance mining with bootstrap verifier of accept rate $\alpha_0$ , the leading constant in the regret bound decreases by a factor of $\Omega(1 / (1 - \alpha_0))$ , at the cost of per-probe construction cost scaling as $1 / (1 - \alpha_0)$ .

Propositions 1 and 2 together identify the operational tradeoff. Maximin entropy gives a sublinear-in-$K$ improvement at no per-probe cost increase. Hard-instance mining gives an arbitrarily large constant improvement at proportionate per-probe cost increase. The choice depends on the deployer’s marginal cost of probe construction relative to the marginal cost of mechanism payments.

Operational implication. Probe construction is the binding cost driver at realistic $K$, not probe count. The simulation in §7 quantifies this: at $K = 16$ and a target Cost-correct gap of 5%, per-probe construction cost exceeds the per-probe scoring-rule payments by a factor of approximately seven, across all three datasets.

7. Simulation

We test the mechanism and the regret bounds on three public eval datasets with known ground-truth labels.

Datasets. MATH (Hendrycks et al., 2021) is the standard benchmark for competition math. GSM8K (Cobbe et al., 2021) is the standard benchmark for grade-school math word problems. HumanEval (Chen et al., 2021) is the standard benchmark for Python code generation. All three admit programmatic verification: math problems with known numerical or symbolic answers, code with hidden unit tests.

Verifier population synthesis. We synthesise $K \in \{4, 8, 16, 32\}$ candidate verifiers as logistic-regression heads over trajectory features, calibrated on different fractions $\beta_k \in (0, 1]$ of held-out data. Features are length-normalised log-probabilities, step-count, and self-consistency agreement. Calibration fractions are spaced log-uniformly between $0.05$ and $1.0$ to span the calibration-monotone family.
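One way to read "spaced log-uniformly" is as a geometric grid between the two endpoints, which the sketch below generates. That reading is an assumption (the phrase could also mean random draws from a log-uniform distribution), and the function name is hypothetical; the harness in Appendix D is the authoritative version.

```python
def calibration_fractions(k: int, lo: float = 0.05, hi: float = 1.0):
    """K fractions on a geometric grid from lo to hi (inclusive)."""
    if k < 1 or not 0.0 < lo <= hi:
        raise ValueError("need k >= 1 and 0 < lo <= hi")
    if k == 1:
        return [hi]
    ratio = hi / lo
    # Equal spacing in log space means a constant ratio between neighbours.
    return [lo * ratio ** (i / (k - 1)) for i in range(k)]


betas = {k: calibration_fractions(k) for k in (4, 8, 16, 32)}
```

For $K = 4$ the grid runs from 0.05 to 1.0 with a constant ratio of $20^{1/3}$ between neighbouring fractions, so the weakest verifier is calibrated on one twentieth of the data used by the strongest.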

Sweep. Probe budget $N \in \{16, 64, 256, 1024, 4096\}$ . Three scoring rules: Brier, quadratic, log. Three probe-construction strategies: uniform random, maximin entropy, hard-instance mining. Three baselines: posted-price uniform purchase, random verifier choice, public-benchmark ranking by headline accuracy on the standard eval split. Each cell is repeated over 200 seeds.

Headline finding. At $N = 256$ and $K = 16$ with maximin-entropy probes, the scoring-rule mechanism achieves Cost-correct within 5% of the oracle on MATH and GSM8K (4.7% and 4.9%) and within 6.4% on HumanEval, averaged over seeds (Table 1). The Brier and quadratic scoring rules give indistinguishable results. The log scoring rule penalises overconfident wrong reports more heavily and produces 1.2% higher payment dispersion at no accuracy benefit. We report Brier as the operational default.

**Table 1.** Cost-correct gap to oracle by procurement mechanism, at $K = 16$, $N = 256$, maximin-entropy probes, averaged over 200 seeds. The scoring-rule mechanism (Brier) stays within 6.4% of the oracle on all three datasets. Posted-price purchase leaves large gaps on every dataset. Public-benchmark ranking is competitive on MATH and GSM8K but leaves a large gap on HumanEval, the calibration-monotone violation case (see negative finding below).

| Mechanism | MATH | GSM8K | HumanEval |
|---|---|---|---|
| Oracle (first-best) | 0.0% | 0.0% | 0.0% |
| Random verifier choice | 27.4% | 25.1% | 31.8% |
| Posted-price (uniform purchase) | 30.6% | 28.9% | 34.2% |
| Public-benchmark ranking | 5.8% | 4.1% | 18.4% |
| Scoring-rule (Brier, this work) | 4.7% | 4.9% | 6.4% |

Posted-price baseline. Across all cells tested, the posted-price baseline does not close more than 30% of the Cost-correct gap to the oracle. At $K = 16$ on MATH, the posted-price equilibrium concentrates on the worst two verifier types in 72% of seeds, consistent with Theorem 1.

Public-benchmark baseline. Headline-accuracy ranking closes 40 to 60% of the gap to oracle on MATH and GSM8K, but at $K = 16$ it leaves a residual gap of 18.4% on HumanEval (Table 1). The HumanEval gap reflects calibration-monotone violation: two of the synthesised verifiers achieve high headline accuracy on the public split but underperform at the deployer’s threshold on the held-out distribution.

Probe-cost decomposition. At $K = 16$ , $N = 256$ , maximin-entropy probes: per-probe construction cost (in dollars of in-house verification) is $7\times$ the per-probe scoring-rule payment, summed across $K$ verifiers. Aggregate probe construction is 87% of total mechanism cost. The decomposition matches the operational claim of §6.
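The two figures in the decomposition are consistent with each other. Writing $c_{\text{probe}}$ for per-probe construction cost and $c_{\text{pay}}$ for the per-probe scoring-rule payment summed across the $K$ verifiers, and taking these as the only two cost components (an assumption of this check),

$$
\frac{c_{\text{probe}}}{c_{\text{probe}} + c_{\text{pay}}}
  = \frac{7\,c_{\text{pay}}}{7\,c_{\text{pay}} + c_{\text{pay}}}
  = \frac{7}{8} = 0.875,
$$

which matches the reported share of roughly 87%.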

Negative finding. On HumanEval, the calibration-monotone family assumption is violated for 2 of 16 synthesised verifiers in the population we generated. Two verifiers achieve high $\alpha$ on long programs but lower $\alpha$ on short programs than two other verifiers with weaker overall headline accuracy. The empirical $\arg\max$ rule still selects a verifier within 6.4% of oracle Cost-correct, but the constant in the regret bound is approximately three times larger than on MATH and GSM8K. This is consistent with the calibration-monotone assumption being load-bearing in the lower bound of Theorem 4 and a useful but not necessary condition for the upper bound of Theorem 3.

The full simulation harness is in Appendix D of the PDF. Total compute is 120 CPU-hours on a single 16-core machine; no GPU required.


8. The August 2026 EU AI Act forcing function

The European Union AI Act high-risk obligations apply from August 2, 2026 (Regulation (EU) 2024/1689). Article 9 requires risk management. Article 13 requires transparency and provision of information to deployers. Article 14 requires human oversight. Article 15 requires demonstrable accuracy at a documented level, plus operational reliability and security. Implementation of all four articles for a high-risk LLM deployment requires demonstrable accept-rate measurement at a defined quality threshold.

The scoring-rule mechanism doubles as compliance evidence. The probe set is the auditable test set. The verifier reports are the auditable measurement. The payment ledger is the auditable accept-rate trail. The combination is sufficient evidence under Article 15(1), which requires that “high-risk AI systems shall be designed and developed in such a way that they achieve an appropriate level of accuracy, robustness and cybersecurity, and perform consistently in those respects throughout their lifecycle.” The phrase “appropriate level of accuracy” is operationalised in deployer compliance practice as accuracy at a documented threshold against a documented test set. The mechanism produces both as primitives.

A second connection is to the Article 13 transparency requirement. The deployer must report verifier accept-rate at threshold $\theta$ to downstream operators. The scoring-rule mechanism produces $\hat\alpha_k$ as a primitive. The reporting interface follows directly from the mechanism’s output.

We do not claim the mechanism is sufficient for Act compliance overall, since the Act covers risk management and human oversight beyond accept-rate measurement. We claim only that, where the Act requires accept-rate evidence, the mechanism produces it as a side effect and at low marginal cost.

9. Limitations and future work

Programmatic-verifier scope. The strict-propriety argument requires bounded and known label noise on probes. Math, formal logic, and code with strict tests satisfy this. LLM-as-judge verifiers do not, since the judge’s own accept rate is endogenous and unbounded. The dominant-strategy IC argument breaks under unbounded label noise. The extension to LLM-judge probes is the next paper in the wedge plan and connects to the recent literature on judge calibration (Zheng et al., 2023).

Static verifier population. We model a one-shot procurement. Reputation dynamics over repeated rounds are out of scope. The natural extension connects to Holmström (1979) on moral hazard with observable outcomes and to Crémer and McLean (1988) on full surplus extraction in dynamic settings.

Single deployer. Probe sharing across deployers introduces a public-goods structure with free-rider incentives. The natural extension is a private-value mechanism design analysis with conflicting deployer interests, in the spirit of the bilateral-trade impossibility of Myerson and Satterthwaite (1983).

Strategic deployer. The mechanism assumes the deployer reports probes truthfully. A strategic deployer who selectively withholds adversarial probes can manipulate the mechanism.

Calibration-monotone assumption. The lower bound of Theorem 4 requires calibration-monotone $\mathcal{F}$ . The upper bound of Theorem 3 does not. The simulation flags two verifiers on HumanEval where the assumption fails. We have not characterised the worst-case regret on non-calibration-monotone families. This is a direct open problem.

10. Conclusion

Verifier procurement is the missing lever in the verification-economics framing. The companion field notes establish that the verifier accept rate is the binding term in cost-per-correct-answer. They are silent on how a deployer who does not build verifiers in-house ends up with one. This paper closes the gap.

Posted-price markets cannot sustain verification-as-a-service under unobservable quality. A scoring-rule mechanism with adversarially constructed probes can, in dominant strategies, at provable regret of order $\sqrt{\log K / N}$. The mechanism is minimax optimal up to log factors. Adversarial probe construction, not probe count, is the binding operational cost. The mechanism doubles as compliance evidence under the EU AI Act high-risk obligations that apply from August 2, 2026.

The next paper in the wedge plan extends the mechanism to LLM-as-judge probes with unbounded label noise.

References

Cite this article

@misc{bhardwaj2026verifierprocurement,
  author = {Bhardwaj, Manu},
  title  = {Verifier Procurement Under Unobservable Quality: A Scoring-Rule Mechanism for Cost-Correct Minimization},
  year   = {2026},
  month  = {May},
  url    = {https://ifitsmanu.com/papers/verifier-procurement},
  howpublished = {\url{https://ifitsmanu.com/papers/verifier-procurement/paper.pdf}},
  note   = {Working paper. Version 1.0.}
}
