
Calibration Drift Under Verifier Composition.

A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization.

Manu Bhardwaj. ifitsmanu.com. May 2026. Version 1.0. Research Paper #2 in the verification-economics wedge.

The PDF version contains full proofs, figures, simulation pseudocode, and appendices A through H.

Companion to Verifier Procurement. Verifier Procurement Under Unobservable Quality (Research Paper #1 in the verification-economics wedge) procures one verifier under unobservable quality. This paper procures the composed pipeline. The companion field notes develop the Cost-correct decomposition (The Cost of Being Right, Field Notes #2) and the verifier-dominance result (The α Asymmetry, Field Notes #3) that make verifier accept rate the binding lever.


Abstract

Production large language model verification is composed. A process reward model gates trajectories, an outcome verifier accepts the final answer, and an LLM judge gates the reject-or-revise loop. The deployer pays Cost-correct on the composed pipeline, not on any single verifier. The procurement mechanism of Verifier Procurement Under Unobservable Quality elicits one verifier at a time. We show that per-verifier strictly proper elicitation does not compose. Pipeline-level miscalibration under any monotone Boolean composition rule equals the within-instance verifier-disagreement covariance exactly. Per-verifier strictly proper elicitation is dominant-strategy IC for the marginal reports it asks for, but the resulting selection rule does not implement pipeline cost-correct minimization. Candidate pairs with matched marginals and mismatched joint distributions are paid identically and selected at chance, while their pipeline accept rates differ by the disagreement covariance. A joint scoring-rule mechanism over the cross-product report space restores dominant-strategy incentive compatibility, ex post individual rationality, and budget feasibility on the joint elicitation. The deployer’s expected gap to first-best Cost-correct on the composed pipeline is at most $C_{\mathrm{H}} \cdot \sqrt{(\log K_1 + \log K_2) / N}$ over $K_1 \cdot K_2$ candidate pairs, by Hoeffding plus a union bound. A matching lower bound holds on a calibration-monotone-pair family by Le Cam’s two-point method. The mechanism is therefore minimax optimal up to log factors. Simulation on MATH, GSM8K, and HumanEval with $K_1, K_2 \in \{4, 8, 16\}$ and probe budget $N \in \{16, \ldots, 4096\}$ shows the joint mechanism reaching Paper #1’s $5\%$-of-first-best operational target at $N = 512$ under unknown joint correlation, roughly double Paper #1’s $N = 256$, and at $N = 256$ when correlation is supplied as a side channel. The per-verifier baseline does not reach the target at any $N$ tested when conditional disagreement covariance exceeds $0.1$. The compliance corollary is sharp. Per-component procurement records are not sufficient evidence under the European Union AI Act high-risk obligations that apply from August 2, 2026. The audit trail must include the joint-report ledger.


1. Introduction

The verification-economics framing of The Cost of Being Right treats the verifier accept rate $\alpha$ as the binding lever in cost-per-correct-answer for large language model deployments. The companion analysis on the α-asymmetry shows that the partial of Cost-correct with respect to $\alpha$ dominates the partials with respect to per-token price, the reasoning multiplier $R$, and the rollout ratio $\bar\rho$ in the rStar-Math regime (Guan et al., 2025). The procurement mechanism of Verifier Procurement Under Unobservable Quality gives a dominant-strategy incentive-compatible scoring-rule mechanism that selects a single verifier with provable regret $\sqrt{\log K / N}$ versus the oracle-best in a candidate population of size $K$ on $N$ adversarially constructed probes.

A typical production verification stack is not a single verifier. The deployer runs a process reward model that scores intermediate trajectories (Lightman et al., 2023; Uesato et al., 2022), an outcome verifier that accepts the final answer (Cobbe et al., 2021), and one or more LLM judges that gate a reject-or-revise loop (Zheng et al., 2023). Each component can be procured under the one-verifier mechanism. The composed pipeline is what the deployer pays Cost-correct on. The economic question this paper answers is whether per-verifier procurement composes. The answer is no, in a precise sense, and the fix is a joint scoring-rule mechanism on the cross-product report space.

Four contributions.

Theorem 1 (composition identity). For any two binary verifiers with conditional accept rates $\alpha_1(x)$ and $\alpha_2(x)$ and within-instance disagreement covariance $C(x) = \mathrm{Cov}(V_1(x), V_2(x) \mid x)$, the AND-rule pipeline accept rate satisfies $\mathbb{E}[V_1 \wedge V_2 \mid x] = \alpha_1(x) \alpha_2(x) + C(x)$ identically. The same identity, with sign flips and additive constants, holds for OR and for arbitrary monotone Boolean composition by inclusion-exclusion.

Theorem 2 (non-implementation of pipeline cost-correct). Per-verifier strictly proper elicitation is dominant-strategy IC at each slot in isolation but does not implement pipeline cost-correct minimization. Under any non-degenerate joint distribution over verifier reports, applying the one-verifier scoring-rule mechanism of Paper #1 independently to each slot and composing the selected verifiers under a monotone Boolean rule yields a selection rule that, under truthful marginal reporting, does not separate candidate pairs with matched marginal accept rates and mismatched joint distributions. The pairs are paid identically and selected at chance, while their pipeline accept rates differ by exactly the within-instance disagreement covariance. The non-implementation is ex ante undetectable from marginal reports.

Theorems 3, 4, and 5 (joint mechanism with matching regret bounds). A joint scoring-rule mechanism that pays each candidate verifier-pair the value of a strictly proper scoring rule (Gneiting and Raftery, 2007; Frongillo and Kash, 2021) applied to the joint report distribution on the cross-product space $\{0, 1\}^2$ is dominant-strategy IC, ex post IR, and budget feasible under a per-probe payment cap (Theorem 3). The deployer who selects the verifier-pair with highest empirical joint score incurs expected regret of at most $C_{\mathrm{H}} \cdot \sqrt{(\log K_1 + \log K_2) / N}$ versus the oracle-best pair, by Hoeffding’s inequality plus a union bound (Theorem 4). A matching lower bound holds on a calibration-monotone-pair family by Le Cam’s two-point method (Le Cam, 1973; Tsybakov, 2009) (Theorem 5). The mechanism is minimax optimal up to log factors.

Simulation result. Synthesized verifier pairs on MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and HumanEval (Chen et al., 2021), with controlled disagreement covariance $C \in \{-0.2, -0.1, 0, +0.1, +0.2\}$ and $K_1, K_2 \in \{4, 8, 16\}$. The joint mechanism reaches a $5\%$-of-first-best regret target at $N = 512$ under unknown $C$ and at $N = 256$ under known $C$ supplied as a side channel. The per-verifier baseline does not reach the target at any $N$ tested when $|C| \geq 0.1$.

The contribution that goes beyond Paper #1 is the move from single-verifier procurement to pipeline procurement. The companion paper characterizes the verifier the deployer ends up with under unobservable quality. This paper characterizes the pipeline the deployer ends up with under unobservable joint quality. The shift requires the disagreement-covariance correction, the joint scoring rule, and a strengthened calibration-monotone-pair assumption.

The contribution beyond classical peer prediction (Miller, Resnick, and Zeckhauser, 2005; Witkowski and Parkes, 2012; Kong and Schoenebeck, 2019; Frongillo and Kash, 2021) is the procurement framing. Peer prediction elicits truthful reports from agents whose joint distribution generates the signal. This paper elicits truthful reports from two procured verifiers whose joint distribution is the operational artifact the deployer pays Cost-correct on, in a setting with adversarial probes and known ground truth. The grounded-probe assumption inherited from Paper #1 rules in strict propriety in dominant strategies, not Nash, and rules out the common-prior assumptions that the peer-prediction tradition spent fifteen years removing.

The contribution beyond the recent process-reward-modeling literature (Lightman et al., 2023; Uesato et al., 2022; Cobbe et al., 2021) is the composition analysis. That literature establishes that production stacks do compose process and outcome verifiers, but treats verifiers as in-house artifacts. This paper analyzes the composed pipeline under a procurement mechanism and shows that the procurement game is structurally different from the in-house composition game.

The result has an external forcing function. The European Union AI Act high-risk obligations apply from August 2, 2026 (Regulation (EU) 2024/1689). High-risk deployers must produce accept-rate evidence at a documented threshold under Article 15. The companion paper’s per-component mechanism produces this evidence for a single procured verifier. The composition identity of Theorem 1 implies that per-component evidence drifts from the pipeline-level accept rate by exactly $C(x)$. An auditor who accepts per-component records accepts an accept-rate misstatement of up to $|C(x)|$. The joint-mechanism audit trail closes that gap.

The rest of the paper is organized as follows. Section 2 sets up the model. Section 3 proves the composition identity. Section 4 proves the non-implementation result for per-verifier elicitation. Section 5 constructs the joint scoring-rule mechanism. Section 6 proves matching regret bounds. Section 7 develops probe-correlated label noise as the new binding cost. Section 8 reports the simulation. Section 9 returns to the EU AI Act forcing function. Section 10 records limitations and future work.


2. Model

We extend the single-verifier setup of Paper #1 to a two-slot setting. Three-and-up composition follows by induction for AND and OR; the general monotone case is handled in Appendix E of the PDF.

Players. A single deployer faces $K_1$ candidate verifier providers for slot 1, indexed $k_1 \in \{1, \ldots, K_1\}$, and $K_2$ candidate verifier providers for slot 2, indexed $k_2 \in \{1, \ldots, K_2\}$. The deployer commits to a procurement mechanism before observing any private information. Each verifier provider knows its own type and observes the mechanism.

Task distribution. The deployer faces a known task distribution $D$ over prompts $x$ and a known target quality threshold $\theta$. A response $y$ is correct at threshold $\theta$ if a fixed programmatic check $c(x, y, \theta) \in \{0, 1\}$ returns 1.

Verifier type. Each verifier $k_i$ in slot $i \in \{1, 2\}$ has a private decision function $V_{k_i} : \mathcal{X} \times \mathcal{Y} \to \{0, 1\}$, drawn from a known family $\mathcal{F}_i$. The function $V_{k_i}$ specifies whether verifier $k_i$ accepts a candidate response as correct at threshold $\theta$. Verifier types are private. The families $\mathcal{F}_1, \mathcal{F}_2$ and the per-prompt cost-of-quality functions are common knowledge.

Joint distribution. Verifier reports from the two slots are not assumed independent. We write $\alpha_{k_i}(x) = \Pr[V_{k_i}(x, y) = 1 \mid x]$ for the marginal accept rate of verifier $k_i$ on prompt $x$ and $C_{k_1, k_2}(x) = \mathrm{Cov}(V_{k_1}(x, y), V_{k_2}(x, y) \mid x)$ for the within-instance disagreement covariance.

Composition rule. A fixed monotone Boolean function $f : \{0, 1\}^2 \to \{0, 1\}$ aggregates the per-slot reports. The default rule is AND, $f(r_1, r_2) = r_1 \wedge r_2$. The OR rule and the generic monotone case are treated in appendices.

Pipeline accept rate. Under composition rule $f$ and verifier pair $(k_1, k_2)$,

$$\alpha^{\mathrm{pipe}}_{k_1, k_2}(x) = \mathbb{E}\!\left[f(V_{k_1}(x, y), V_{k_2}(x, y)) \mid x\right].$$

For the AND rule, $\alpha^{\mathrm{pipe}}_{k_1, k_2}(x) = \alpha_{k_1}(x) \alpha_{k_2}(x) + C_{k_1, k_2}(x)$ by Theorem 1 below.

Cost-correct on the pipeline. Per-task cost under pair $(k_1, k_2)$ is, extending The Cost of Being Right,

$$\mathrm{CostCorrect}(k_1, k_2) = \frac{\mathrm{CPM}_{1:1} \cdot R \cdot (1 + \bar\rho)}{\mathbb{E}_x[\alpha^{\mathrm{pipe}}_{k_1, k_2}(x)]},$$

with $\mathrm{CPM}_{1:1}$, $R$, and $\bar\rho$ held fixed across pair choice. The deployer minimizes $\mathrm{CostCorrect}$, which is equivalent to maximizing the expected pipeline accept rate.
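The equivalence between minimizing Cost-correct and maximizing the expected pipeline accept rate can be sketched in a few lines. The function names, pair labels, and numeric inputs below are hypothetical and chosen only for illustration; the formula is the one stated above.

```python
def cost_correct(cpm: float, r: float, rho_bar: float, alpha_pipe: float) -> float:
    """CPM_{1:1} * R * (1 + rho_bar) / E_x[alpha_pipe], as defined in the text."""
    if not 0.0 < alpha_pipe <= 1.0:
        raise ValueError("expected pipeline accept rate must lie in (0, 1]")
    return cpm * r * (1.0 + rho_bar) / alpha_pipe


def best_pair(alpha_pipe_by_pair: dict, cpm: float, r: float, rho_bar: float):
    """Return (cheapest pair, highest-accept-rate pair); the two coincide
    because the numerator is held fixed across pair choice."""
    costs = {
        pair: cost_correct(cpm, r, rho_bar, alpha)
        for pair, alpha in alpha_pipe_by_pair.items()
    }
    cheapest = min(costs, key=costs.get)
    highest = max(alpha_pipe_by_pair, key=alpha_pipe_by_pair.get)
    return cheapest, highest


# Hypothetical pair labels and accept rates, for illustration only.
pairs = {("prm_a", "orm_a"): 0.40, ("prm_b", "orm_a"): 0.36}
cheapest, highest = best_pair(pairs, cpm=1.0, r=4.0, rho_bar=0.5)
```

Because the numerator does not depend on the pair, the two selections agree for any positive values of the fixed parameters.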

Probe set. The deployer has a budget of $N$ probes drawn from a probe distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ with known ground-truth labels $\ell_i \in \{0, 1\}$. Probes may be adversarial with respect to $\mathcal{F}_1 \times \mathcal{F}_2$. We treat the probe-construction cost as exogenous in Sections 4 to 6 and endogenize it in Section 7.

Mechanism. A direct mechanism is a pair $(s, t)$ where $s$ is a selection rule mapping joint reports to a chosen verifier-pair and $t$ is a payment rule. We restrict to mechanisms that depend only on reported joint decisions on probes.

Solution concept. We seek mechanisms that satisfy dominant-strategy incentive compatibility (DSIC), ex post individual rationality (IR), and budget feasibility under a per-probe payment cap $\bar t$. We measure performance by expected regret to first-best on the composed pipeline.

Calibration-monotone-pair family. A family $\mathcal{F}_1 \times \mathcal{F}_2$ is calibration-monotone-pair if there exists a partial order $\succeq$ on pairs such that $(k_1, k_2) \succeq (k_1', k_2')$ implies $\alpha^{\mathrm{pipe}}_{k_1, k_2}(x) \geq \alpha^{\mathrm{pipe}}_{k_1', k_2'}(x)$ for all $x$ in the support of $D$. The condition is a strict strengthening of the calibration-monotone assumption of Paper #1. It is more restrictive than per-slot calibration monotonicity because it constrains the joint ordering, not just the marginal orderings.


3. The composition identity

Theorem 1 (composition identity for AND). Let $V_1, V_2 : \mathcal{X} \times \mathcal{Y} \to \{0, 1\}$ be binary verifiers with marginal accept rates $\alpha_1(x), \alpha_2(x)$ and within-instance disagreement covariance $C(x)$. Then

$$\mathbb{E}[V_1 \wedge V_2 \mid x] = \alpha_1(x) \alpha_2(x) + C(x).$$

Proof. For binary random variables, $V_1 \wedge V_2 = V_1 \cdot V_2$ pointwise. Take conditional expectation given $x$,

$$\mathbb{E}[V_1 V_2 \mid x] = \mathbb{E}[V_1 \mid x] \mathbb{E}[V_2 \mid x] + \mathrm{Cov}(V_1, V_2 \mid x) = \alpha_1(x) \alpha_2(x) + C(x).$$

The first equality rearranges the definition of conditional covariance, $\mathrm{Cov}(V_1, V_2 \mid x) = \mathbb{E}[V_1 V_2 \mid x] - \mathbb{E}[V_1 \mid x] \mathbb{E}[V_2 \mid x]$. The second substitutes the definitions of $\alpha_i$ and $C$. $\square$

Corollary 1 (composition identity for OR). Under the same hypotheses,

$$\mathbb{E}[V_1 \vee V_2 \mid x] = \alpha_1(x) + \alpha_2(x) - \alpha_1(x) \alpha_2(x) - C(x).$$

Proof. $V_1 \vee V_2 = V_1 + V_2 - V_1 V_2$ pointwise for binary $V_i$. Apply linearity and Theorem 1. $\square$
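Both identities can be checked exactly on any joint distribution over $\{0, 1\}^2$ for a fixed prompt. The sketch below uses a hypothetical joint pmf (the four probabilities are illustrative, not taken from the paper) and compares the directly computed AND and OR accept rates with the right-hand sides of Theorem 1 and Corollary 1.

```python
def moments(joint):
    """joint maps (v1, v2) in {0,1}^2 to a probability; returns (alpha1, alpha2, C)."""
    alpha1 = joint[(1, 1)] + joint[(1, 0)]
    alpha2 = joint[(1, 1)] + joint[(0, 1)]
    cov = joint[(1, 1)] - alpha1 * alpha2  # E[V1 V2] - E[V1] E[V2]
    return alpha1, alpha2, cov


def accept_rate(joint, rule):
    """Probability that the composition rule accepts, computed directly from the pmf."""
    return sum(p for (v1, v2), p in joint.items() if rule(v1, v2))


# Hypothetical joint distribution for one prompt (illustrative numbers).
joint = {(1, 1): 0.5, (1, 0): 0.2, (0, 1): 0.1, (0, 0): 0.2}
a1, a2, c = moments(joint)

and_direct = accept_rate(joint, lambda u, v: u and v)
or_direct = accept_rate(joint, lambda u, v: u or v)

and_identity = a1 * a2 + c            # Theorem 1
or_identity = a1 + a2 - a1 * a2 - c   # Corollary 1
```

The direct and identity-based values agree up to floating-point rounding, with no residual term, which is the sense in which the correction is exact rather than a bound.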

Corollary 2 (general monotone Boolean rules). For monotone Boolean $f$ on $m$ binary verifiers, $\mathbb{E}[f(V_1, \ldots, V_m) \mid x]$ is a polynomial in the marginal accept rates and the higher-order joint moments, with coefficients given by Möbius inversion over the monotone-Boolean lattice. Two- and three-verifier expansions are in Appendix A of the PDF.

Discussion. Theorem 1 is elementary. Its content is not the algebra, which is just the bilinear identity for binary random variables. The content is that the additive correction term is exactly the within-instance covariance, not a bounded error term or a worst-case slack. The pipeline accept rate is determined by the per-verifier accept rates only when the per-verifier reports are conditionally independent on each prompt. Production verifier stacks are not conditionally independent. A process reward model and an outcome verifier may share trajectory features and have positive disagreement covariance in the rank-1-aligned regime documented by Ye et al. (2026); the construction protocols of Lightman et al. (2023) and Cobbe et al. (2021) do not separate the two verifiers’ training-trajectory distributions.

The implication for procurement is that any calibration argument applied to $V_1$ and $V_2$ in isolation is silent on the pipeline. The reverse is also true. Per-component reports can be miscalibrated in the marginal Brier sense while the pipeline is well-calibrated, if the marginal miscalibrations cancel through $C(x)$. Neither direction is the safe one to assume in production.


4. Per-verifier elicitation does not implement pipeline cost-correct

Setup. The deployer runs the one-verifier mechanism of Paper #1 independently for slot 1 and slot 2. Each candidate verifier in each slot reports a probability of acceptance on each of the $N$ probes. Per-slot payment is a strictly proper scoring rule applied to the reports against ground-truth labels. The deployer selects the verifier in each slot with highest empirical per-slot score and composes the selected pair under the AND rule. We call this the per-verifier mechanism.

The per-verifier mechanism is DSIC at each slot in isolation, because strict propriety makes truthful marginal reporting dominant on each slot’s payment rule. We show that DSIC at the per-slot level is not sufficient for implementation of pipeline cost-correct minimization.

Theorem 2 (non-implementability of pipeline cost-correct under per-verifier elicitation). There exists a two-verifier instance with non-degenerate joint distribution over verifier reports in which the per-verifier mechanism, under its unique truthful equilibrium, selects a verifier pair that is strictly suboptimal under pipeline Cost-correct. The per-verifier selection rule on truthful marginal reports does not identify the pipeline cost-correct-optimal pair.

Construction. Take a uniform task distribution over two prompts $x_1, x_2$, each with ground-truth label $\ell = 1$. Fix one slot-2 verifier $V_2$ with marginal accept rate $\alpha_2 = 0.6$ on every prompt. Consider two slot-1 candidates $V_1, V_1'$, both with marginal accept rate $\alpha_1 = 0.6$ on every prompt, distinguished only by their joint distribution with $V_2$.

| Joint state | $(V = 1, V_2 = 1)$ | $(V = 1, V_2 = 0)$ | $(V = 0, V_2 = 1)$ | $(V = 0, V_2 = 0)$ | $C(x)$ |
|---|---|---|---|---|---|
| $V_1$ | 0.40 | 0.20 | 0.20 | 0.20 | $+0.04$ |
| $V_1'$ | 0.36 | 0.24 | 0.24 | 0.16 | $0.00$ |

Both candidates have marginal $\alpha = 0.6$. Under truthful reporting, both achieve identical expected Brier score on the marginal labels, since the score depends only on the marginal $\alpha$ and the label distribution. The per-verifier mechanism selects between $V_1$ and $V_1'$ uniformly at random.

By Theorem 1, the AND-pipeline accept rate is $\alpha_1 \alpha_2 + C$. The pair $(V_1, V_2)$ achieves $0.6 \cdot 0.6 + 0.04 = 0.40$. The pair $(V_1', V_2)$ achieves $0.6 \cdot 0.6 + 0 = 0.36$. The cost-correct-optimal pair is strictly $(V_1, V_2)$ by an $\alpha$-gap of $0.04$, which translates to a Cost-correct gap of $0.04/0.36 \approx 11\%$. The per-verifier mechanism selects this pair with probability $1/2$, leaving an expected gap of about $5.6\%$ on the table.

The gap is not closed by collecting more probes. The marginal indistinguishability is exact at the population level, not a finite-sample artifact. Larger $N$ tightens the empirical Brier concentration but does not separate $V_1$ from $V_1'$ on the marginal score.
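The arithmetic of the construction can be reproduced directly from the joint table. The sketch below recomputes the marginals, the covariance, the marginal Brier score on a label-1 probe, and the AND-pipeline accept rates; the variable and function names are illustrative, and the marginal Brier score is written as the negative squared error of the truthful marginal report against the label.

```python
# Joint pmfs over (slot-1 decision, V2 decision), copied from the table above.
JOINT_V1 = {(1, 1): 0.40, (1, 0): 0.20, (0, 1): 0.20, (0, 0): 0.20}
JOINT_V1_PRIME = {(1, 1): 0.36, (1, 0): 0.24, (0, 1): 0.24, (0, 0): 0.16}


def slot1_marginal(joint):
    return joint[(1, 1)] + joint[(1, 0)]


def slot2_marginal(joint):
    return joint[(1, 1)] + joint[(0, 1)]


def covariance(joint):
    return joint[(1, 1)] - slot1_marginal(joint) * slot2_marginal(joint)


def and_rate(joint):
    return joint[(1, 1)]  # AND accepts only when both verifiers accept


def marginal_brier(report, label=1):
    """Negative squared error of a marginal probability report against the label."""
    return -((report - label) ** 2)


# Truthful marginal reports earn the same per-slot score for both candidates.
score_v1 = marginal_brier(slot1_marginal(JOINT_V1))
score_v1_prime = marginal_brier(slot1_marginal(JOINT_V1_PRIME))

# Cost-correct is inversely proportional to the pipeline accept rate.
cost_ratio = and_rate(JOINT_V1) / and_rate(JOINT_V1_PRIME)
expected_gap = 0.5 * (cost_ratio - 1.0)  # chance selection between the two
```

The per-slot scores are identical while the pipeline accept rates differ by the covariance, which is the population-level indistinguishability the text describes.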

Why this is the right negative result. The non-implementation requires conditional correlation. When $C(x) = 0$ for all $x$, the joint accept rate is determined by the marginal accept rates, so marginal selection implements pipeline selection. The construction is non-trivial only when $C(x) \neq 0$, which is the realistic regime where PRMs and outcome verifiers project onto correlated trajectory features (Ye et al., 2026). The negative result bites in production.

Corollary 3 (no per-verifier rescue). No per-verifier scoring rule, including any strictly proper rule in the class of Gneiting and Raftery (2007), implements pipeline Cost-correct minimization on a non-degenerate joint distribution.

The proof uses payoff equivalence (Myerson, 1981): per-slot payment under any per-verifier rule depends only on marginal reports, which identify only the marginal accept rate; the pipeline accept rate is the marginal accept rate plus the disagreement covariance by Theorem 1; the covariance is not identified by any per-slot rule. Full argument in Appendix B of the PDF.

Strategic refinement. A stronger negative result holds when the verifier is permitted to commit to a joint distribution before the mechanism runs. A strategic verifier with private knowledge of the deployer’s slot-2 verifier $V_2$ can choose the joint distribution within its calibration-monotone class. Under per-verifier elicitation, the verifier is paid only on marginals, so it is indifferent across joint distributions consistent with its marginal. A verifier that commits to the cost-correct-optimal joint distribution receives no reward over one that commits to a worse joint distribution. The deployer’s selection is then dominated by exogenous noise. Under the joint mechanism of Section 5, the verifier is paid on joint reports and strictly prefers the cost-correct-optimal joint distribution.


5. The joint scoring-rule mechanism

Construction. Fix a strictly proper scoring rule $S : \Delta(\{0, 1\}^2) \times \{0, 1\}^2 \to \mathbb{R}$ on the joint distribution over the cross-product report space, for instance the joint Brier score

$$S(\hat q, (r_1, r_2)) = -\sum_{(a, b) \in \{0, 1\}^2} \left(\hat q(a, b) - \mathbf{1}[(r_1, r_2) = (a, b)]\right)^2,$$

which is strictly proper by the multidimensional extension of Gneiting and Raftery (2007). Each candidate pair $(V_{k_1}, V_{k_2})$ reports a joint distribution $\hat q_{(k_1, k_2), n} \in \Delta(\{0, 1\}^2)$ on each probe $n$. The mechanism pays the pair

$$t_{(k_1, k_2)}(\hat q, r) = a + b \cdot \frac{1}{N} \sum_{n=1}^N S\!\left(\hat q_{(k_1, k_2), n}, (V_{k_1}(x_n, y_n), V_{k_2}(x_n, y_n))\right),$$

for constants $a \geq 0$ and $b > 0$ chosen to enforce ex post IR and the per-probe payment cap (these payment constants are unrelated to the summation indices $(a, b)$ in the Brier score). The selection rule is empirical $\arg\max$ over pairs.
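A minimal sketch of the joint Brier score and the affine payment rule follows. The function names and the representation of a joint report as a dictionary over the four outcomes are assumptions made for illustration. The joint Brier score lies in $[-2, 0]$, so choosing the offset equal to twice the scale keeps the payment non-negative.

```python
OUTCOMES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def joint_brier(q_hat, realized):
    """Joint Brier score of a reported pmf q_hat against one realized outcome pair."""
    return -sum(
        (q_hat[o] - (1.0 if o == realized else 0.0)) ** 2 for o in OUTCOMES
    )


def payment(reports, realized, offset, scale):
    """offset + scale * (mean joint Brier score over the probe set).

    `offset` and `scale` play the roles of the payment constants a and b.
    """
    if len(reports) != len(realized) or not reports:
        raise ValueError("need one realized outcome per report, and at least one probe")
    mean_score = sum(
        joint_brier(q, r) for q, r in zip(reports, realized)
    ) / len(reports)
    return offset + scale * mean_score


# Hypothetical two-probe example.
report = {(0, 0): 0.20, (0, 1): 0.20, (1, 0): 0.20, (1, 1): 0.40}
pay = payment([report, report], [(1, 1), (0, 0)], offset=2.0, scale=1.0)
```

With `offset = 2 * scale` the payment is bounded between zero and the offset, which gives one concrete way to satisfy ex post IR and a payment cap for this particular scoring rule.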

Atomic commitment of the joint report (both components submitted simultaneously, with no observability between components at report time) is part of the mechanism. Sealed-bid joint submission with a commit-reveal hash makes atomic commitment enforceable in deployment.
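One way to realize a commit-reveal step is a salted hash commitment. The sketch below is an illustration under stated assumptions, not the paper's protocol: it assumes SHA-256 over a random nonce plus a canonical JSON encoding of the joint report, and the function names and report keys are hypothetical.

```python
import hashlib
import json
import secrets


def commit(report: dict, nonce: bytes) -> str:
    """Hash commitment to a joint report; the nonce hides low-entropy reports."""
    payload = json.dumps(report, sort_keys=True).encode("utf-8")
    return hashlib.sha256(nonce + payload).hexdigest()


def verify(commitment: str, report: dict, nonce: bytes) -> bool:
    """Check a revealed (report, nonce) pair against the earlier commitment."""
    return secrets.compare_digest(commitment, commit(report, nonce))


# Commit phase: the pair publishes only the digest.
nonce = secrets.token_bytes(32)
joint_report = {"00": 0.20, "01": 0.20, "10": 0.20, "11": 0.40}
digest = commit(joint_report, nonce)

# Reveal phase: the deployer recomputes the digest from the revealed values.
accepted = verify(digest, joint_report, nonce)
tampered = verify(digest, {**joint_report, "11": 0.50, "00": 0.10}, nonce)
```

The fixed-length nonce is prepended so that the boundary between nonce and payload is unambiguous. A production deployment would also need to bind the commitment to a probe identifier and a submission deadline, which this sketch leaves out.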

Theorem 3 (joint mechanism). Under the joint scoring-rule mechanism with $a$ chosen so that $a + b \cdot \min_S \geq 0$, where $\min_S$ is the infimum of $S$ on its domain, the mechanism is dominant-strategy incentive-compatible, ex post individually rational, and budget feasible under per-probe payment cap $\bar t = a/N + b \cdot \max_S / N$.

Proof. Strict propriety of $S$ on $\Delta(\{0, 1\}^2)$ implies that for any belief $q$ a verifier pair holds about the joint distribution of $(V_{k_1}, V_{k_2})$ given $(x, y, \ell)$, the unique maximizer of $\mathbb{E}_{(V_{k_1}, V_{k_2})} S(\hat q, (V_{k_1}, V_{k_2}))$ over $\hat q$ is $\hat q = q$. The multidimensional version of strict propriety is established in Frongillo and Kash (2021) via convex analysis of the Bregman-divergence representation. Atomic commitment of the joint report rules out post-observation conditioning, so the dominant strategy is truthful joint reporting on the cross-product space, which is the report space that identifies the pipeline accept rate by Theorem 1. Individual rationality follows from the choice of $a$. Budget feasibility follows from the per-probe payment cap. $\square$
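The strict-propriety step can be illustrated numerically for the joint Brier score: over a grid on the simplex $\Delta(\{0, 1\}^2)$, the expected score under a true joint belief $q$ is maximized only at $\hat q = q$. The grid resolution and the belief used below are illustrative choices, and a grid search is a sanity check, not a proof.

```python
import itertools


def expected_joint_brier(q_true, q_hat):
    """Expected joint Brier score of report q_hat when outcomes follow q_true.

    Both arguments are 4-tuples over the outcomes ordered (1,1), (1,0), (0,1), (0,0).
    """
    total = 0.0
    for i, prob in enumerate(q_true):
        score = -sum(
            (q_hat[j] - (1.0 if j == i else 0.0)) ** 2 for j in range(4)
        )
        total += prob * score
    return total


def simplex_grid(steps=10):
    """All 4-outcome pmfs whose entries are multiples of 1/steps."""
    for a, b, c in itertools.product(range(steps + 1), repeat=3):
        if a + b + c <= steps:
            yield (a / steps, b / steps, c / steps, (steps - a - b - c) / steps)


# Illustrative belief: the V_1 row of the Section 4 construction.
q_true = (0.4, 0.2, 0.2, 0.2)
best_report = max(simplex_grid(), key=lambda q: expected_joint_brier(q_true, q))
```

Any report that misstates the joint cell probabilities, including one that keeps both marginals at 0.6, scores strictly lower in expectation than the truthful report.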

Identifiability condition. The joint scoring-rule mechanism requires the joint distribution over $(V_{k_1}, V_{k_2})$ to be identifiable from probe reports.

Proposition 1 (identifiability sufficient condition). If the probe distribution $P$ contains at least two probe types whose conditional joint distributions over $(V_{k_1}, V_{k_2})$ differ as distributions on $\{0, 1\}^2$, equivalently if the empirical joint-report correlation matrix on the probe set has rank at least two, then the joint scoring rule is identifying in the sense that the unique strategy maximizing expected payment is truthful joint reporting.

The condition is straightforward to check at deployment time. Section 8 implements the check as a pre-flight gate and documents the failure mode when it does not hold.

Connections. The joint elicitation extends multi-task peer prediction (Dasgupta and Ghosh, 2013) to the grounded-probe setting. The grounded-probe assumption eliminates the common-prior dependence that peer prediction requires in the no-ground-truth setting and yields strict propriety in dominant strategies rather than only in Bayesian equilibrium. The mechanism is structurally close to Kong and Schoenebeck (2019)’s information-theoretic framework, with the joint-report space playing the role of the complementarity carrier. Lovén (2026) proves DSIC for a parametric pseudospherical scoring family in scored AI oversight via the Prekopa principle; the joint mechanism inherits the strict-propriety guarantee per slot and extends it to the cross-product report space.


6. Regret bounds for the joint mechanism

Theorem 4 (upper bound). Let $\alpha^{\mathrm{pipe}}_{k_1, k_2}(Q) := \mathbb{E}_{(x, y) \sim Q}[f(V_{k_1}(x, y), V_{k_2}(x, y))]$ denote the population pipeline accept rate. Let $(k_1^*, k_2^*) = \arg\max \alpha^{\mathrm{pipe}}(D)$ be the oracle-best pair. Suppose probes are drawn iid from a probe distribution $P$ with $\alpha^{\mathrm{pipe}}(P) = \alpha^{\mathrm{pipe}}(D)$ for all pairs. Then the expected gap of the empirical $\arg\max$ rule is

$$\mathbb{E}\!\left[\alpha^{\mathrm{pipe}}_{k_1^*, k_2^*}(D) - \alpha^{\mathrm{pipe}}_{\hat k_1, \hat k_2}(D)\right] \leq C_{\mathrm{H}} \cdot \sqrt{\frac{\log K_1 + \log K_2}{N}}$$

for a universal Hoeffding constant $C_{\mathrm{H}}$ (distinct from the disagreement covariance $C(x)$ of Theorem 1).

Proof sketch. The empirical pipeline accept rate is a bounded iid average in $[0, 1]$ for each pair. By Hoeffding’s inequality, $\Pr[|\hat \alpha^{\mathrm{pipe}} - \alpha^{\mathrm{pipe}}| > \epsilon] \leq 2 e^{-2 N \epsilon^2}$. Union over $K_1 \cdot K_2$ pairs and apply the standard $\arg\max$ regret argument: writing $\Delta$ for the selection gap, $\Delta > u$ requires some pair’s estimate to deviate by more than $u/2$, so $\Pr[\Delta > u] \leq 2 K_1 K_2 e^{-N u^2 / 2}$. The tail-integration step uses the split-at-$u_0$ trick with $u_0 = \sqrt{2 \log(2 K_1 K_2) / N}$; the union-bounded tail at $u_0$ equals $1$, the Mills-ratio bound gives the upper-tail integral $\leq 1/(N u_0) \leq u_0$, so $\mathbb{E}[\Delta] \leq 2 u_0 \leq C_{\mathrm{H}} \sqrt{(\log K_1 + \log K_2)/N}$. Full computation in Appendix C of the PDF. $\square$
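The split-point computation can be traced numerically. The sketch below assumes the union-bounded tail takes the form $2 K_1 K_2 e^{-N u^2 / 2}$, which is the form under which the tail equals $1$ at $u_0$; that form and the function names are assumptions of this illustration, and the full computation is the one in Appendix C.

```python
import math


def split_point(k1: int, k2: int, n: int) -> float:
    """u_0 = sqrt(2 log(2 K1 K2) / N), the split point from the proof sketch."""
    return math.sqrt(2.0 * math.log(2 * k1 * k2) / n)


def union_tail(k1: int, k2: int, n: int, u: float) -> float:
    """Assumed union-bounded tail Pr[gap > u] <= 2 K1 K2 exp(-N u^2 / 2), capped at 1."""
    return min(1.0, 2 * k1 * k2 * math.exp(-n * u * u / 2.0))


def expected_gap_bound(k1: int, k2: int, n: int) -> float:
    """u_0 (trivial bound below the split) plus 1 / (N u_0) (tail integral above it)."""
    u0 = split_point(k1, k2, n)
    return u0 + 1.0 / (n * u0)


# Illustrative evaluation at K1 = K2 = 16, N = 512.
u0 = split_point(16, 16, 512)
tail_at_u0 = union_tail(16, 16, 512, u0)
bound = expected_gap_bound(16, 16, 512)
```

The inequality $1/(N u_0) \leq u_0$ holds whenever $2 \log(2 K_1 K_2) \geq 1$, which is true for any $K_1, K_2 \geq 1$, so the bound never exceeds $2 u_0$.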

Theorem 5 (lower bound). Suppose $\mathcal{F}_1 \times \mathcal{F}_2$ is calibration-monotone-pair and contains at least two distinct pairs with positive pipeline-accept-rate gap. Then for any mechanism $(s, t)$ and any $K_1, K_2 \geq 2$, there exists a profile of types such that

$$\mathbb{E}\!\left[\alpha^{\mathrm{pipe}}_{k_1^*, k_2^*}(D) - \alpha^{\mathrm{pipe}}_{s}(D)\right] \geq c \cdot \sqrt{\frac{\log K_1 + \log K_2}{N}}$$

for a constant $c > 0$.

Proof sketch. Le Cam two-point method (Le Cam, 1973; Tsybakov, 2009). Construct a packing of $\Theta(K_1 K_2)$ pair-type profiles pairwise indistinguishable at total variation $O(\sqrt{N} \cdot \Delta_{\mathrm{pipe}})$. The reduction from selection regret to estimation error follows from the calibration-monotone-pair assumption. Full argument in Appendix D of the PDF. $\square$

Theorems 4 and 5 together imply the joint scoring-rule mechanism is minimax optimal up to log factors over calibration-monotone-pair families.

Comparison to Paper #1. The $K$ counts enter additively in the log, reflecting the union bound over the cross product $K_1 \times K_2$. The $N$ dependence is unchanged at $\sqrt{1/N}$. At $K_1 = K_2 = 16$ and $\epsilon = 0.05$, the joint mechanism budget is $N \approx 2200$, against Paper #1’s $N \approx 1100$ at $K = 16$. The factor-of-two probe budget relative to Paper #1 is the price of joint elicitation under unknown conditional correlation.
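The quoted budgets follow from inverting the regret bound for $N$. The sketch below solves $C_{\mathrm{H}} \sqrt{\sum_i \log K_i / N} = \epsilon$ and assumes $C_{\mathrm{H}} = 1$, a value chosen here because it reproduces the figures in the text; the function name is hypothetical.

```python
import math


def probe_budget(candidate_counts, epsilon: float, c_h: float = 1.0) -> int:
    """Smallest N with c_h * sqrt(sum(log K_i) / N) <= epsilon.

    c_h = 1.0 is an assumption of this sketch, not a value stated in the paper.
    """
    log_term = sum(math.log(k) for k in candidate_counts)
    return math.ceil(c_h ** 2 * log_term / epsilon ** 2)


joint_budget = probe_budget([16, 16], epsilon=0.05)   # two slots, K1 = K2 = 16
single_budget = probe_budget([16], epsilon=0.05)      # Paper #1 setting, K = 16
ratio = joint_budget / single_budget
```

With equal slot sizes the log term doubles, so the joint budget is twice the single-verifier budget regardless of the value assumed for the constant.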


7. Probe-correlated label noise as the new binding cost

Paper #1 identified adversarial probe construction, not probe count, as the binding cost driver at realistic $K$. We extend that analysis to the composed setting and identify the new content. Joint discriminability, not marginal discriminability, is the property that probes must have to identify the oracle-best pair.

Proposition 2 (marginal-vs-joint probe-budget gap). Under known conditional correlation, the joint-elicitation probe budget equals the per-slot Paper #1 budget. Under unknown conditional correlation, the ratio inflates by the marginal-vs-joint discrimination ratio, bounded by $\sqrt{K_1 K_2 / (K_1 + K_2)}$ in the worst case.

Three joint-probe construction strategies.

Marginal-disagreement probes. Maximize per-slot accept-or-reject entropy. Default when a deployer reuses a Paper #1 probe set. Discriminates marginally, not jointly.

Joint-disagreement probes. Maximize the entropy of the empirical joint-report distribution over $K_1 \cdot K_2$ candidate pairs. Construction cost scales as $K_1 \cdot K_2$ queries per candidate probe. Discriminates jointly.

Conditional-rare-event probes. Target probes where $\Pr[V_{k_1} = 1, V_{k_2} = 0 \mid x]$ is small for some focal pair. Highly informative about $C(x)$.

Proposition 3 (conditional-rare-event probes). Under conditional-rare-event probe construction with a probe pool size $M \geq K_1 K_2$, the leading constant in the Theorem 4 regret bound decreases by a factor of order $\sqrt{\min(K_1, K_2)}$ relative to marginal-disagreement probes, at the cost of per-probe construction cost scaling as $K_1 + K_2$.

The proof adapts the sequential-elimination analysis of Karnin, Koren, and Somekh (2013) to the joint-report setting. Full details in Appendix E of the PDF.

Operational implication. The probe portfolio must be designed for joint discrimination. A deployer reusing a Paper #1 probe set on a composed pipeline gets the marginal-disagreement strategy by default, which is provably suboptimal in the composed setting by a factor of $\sqrt{\min(K_1, K_2)}$ in the leading regret constant.


8. Simulation

We test the joint mechanism, the non-implementation result of Theorem 2, and the regret bounds on three public eval datasets with known ground-truth labels.

Datasets. MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021).

Verifier population synthesis. $K_1 \in \{4, 8, 16\}$ process-style verifiers as logistic-regression heads over step-level trajectory features and $K_2 \in \{4, 8, 16\}$ outcome-style verifiers as logistic-regression heads over final-answer features. Process features are step count, intermediate self-consistency (Wang et al., 2023), and step-level log-probability. Pairs are synthesized to span a controlled conditional disagreement-covariance grid $\bar C \in \{-0.2, -0.1, 0, +0.1, +0.2\}$ via a shared-latent coupling construction. Empirical covariance matches the construction target to within $\pm 0.02$ on all three datasets.
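The idea of hitting a prescribed covariance target can be illustrated with a much simpler device than the shared-latent coupling of Appendix F, which is not reproduced here. The sketch below swaps in a direct joint-table construction: it fixes the two marginal accept rates, sets the both-accept cell to the product of marginals plus the target covariance, and samples binary report pairs from the resulting table. The function names, the marginals of 0.5, and the sample size are assumptions made for illustration only.

```python
import numpy as np

def joint_table(p1: float, p2: float, cov: float) -> np.ndarray:
    """2x2 pmf over (V1, V2), indexed [v1, v2], with marginals p1, p2 and Cov(V1, V2) = cov."""
    p11 = p1 * p2 + cov  # both-accept cell carries the covariance
    table = np.array([[1 - p1 - p2 + p11, p2 - p11],
                      [p1 - p11, p11]])
    if table.min() < -1e-12:
        raise ValueError("target covariance is infeasible for these marginals")
    table = np.clip(table, 0.0, None)
    return table / table.sum()

def sample_pairs(table: np.ndarray, n: int, rng: np.random.Generator):
    """Draw n report pairs; the flat cell index is 2 * v1 + v2."""
    cells = rng.choice(4, size=n, p=table.ravel())
    return cells // 2, cells % 2

rng = np.random.default_rng(0)
for target in (-0.2, -0.1, 0.0, 0.1, 0.2):
    v1, v2 = sample_pairs(joint_table(0.5, 0.5, target), 50_000, rng)
    empirical = np.cov(v1, v2, ddof=0)[0, 1]
    print(f"target {target:+.2f}  empirical {empirical:+.3f}")
```

Unlike a shared-latent construction, this table-based sampler does not tie the reports to instance features, so it is only useful for checking covariance bookkeeping, not for reproducing the paper's verifier populations.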

Sweep. $K_1, K_2 \in \{4, 8, 16\}$. $N \in \{16, 64, 256, 1024, 4096\}$. Two scoring mechanisms (Paper #1 per-verifier baseline; joint Brier of Section 5). Three probe-construction strategies. Two correlation regimes ($\bar C$ known or unknown). 200 seeds per cell.

Headline finding 1 (composition identity verification). Empirical pipeline accept rates fall on the $y = x$ line predicted by Theorem 1 across all three datasets and all $\bar C$ values, with $R^2 = 0.997$ on MATH, $R^2 = 0.994$ on GSM8K, $R^2 = 0.981$ on HumanEval. The composition identity is empirically tight.
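The kind of check behind this finding can be sketched for the two-verifier AND and OR rules, where the accept rate equals the marginal-only prediction shifted by the covariance of the two reports. The sketch below is not the paper's harness: it pools reports over a synthetic sample and uses the pooled sample covariance, which matches the paper's within-instance $C(x)$ only if the sample is read as repeated draws at a fixed instance. The copy-with-probability coupling and all names are illustrative assumptions.

```python
import numpy as np

def composition_check(v1, v2) -> dict:
    """Compare observed AND/OR accept rates with the covariance-corrected prediction."""
    v1 = np.asarray(v1, dtype=float)
    v2 = np.asarray(v2, dtype=float)
    m1, m2 = v1.mean(), v2.mean()
    cov = np.cov(v1, v2, ddof=0)[0, 1]  # population covariance of the two reports
    return {
        "and_observed": float(np.mean(v1 * v2)),
        "and_predicted": float(m1 * m2 + cov),
        "or_observed": float(np.mean(np.maximum(v1, v2))),
        "or_predicted": float(m1 + m2 - m1 * m2 - cov),
    }

rng = np.random.default_rng(0)
n = 10_000
v1 = rng.random(n) < 0.7
# v2 copies v1 on about 80% of draws, which induces positive covariance.
copy_mask = rng.random(n) < 0.8
v2 = np.where(copy_mask, v1, rng.random(n) < 0.5)

result = composition_check(v1, v2)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

On a single pooled sample the observed and predicted rates agree up to floating-point error, because the relation is an algebraic identity. Any scatter around the $y = x$ line in an experiment therefore has to come from estimating the covariance on different data than the accept rate, or from conditioning on $x$.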

Headline finding 2 (per-verifier baseline failure). At $\bar C = 0.2$, the per-verifier baseline does not close the $5\%$-of-first-best Cost-correct gap at any $N \in \{16, \ldots, 4096\}$, on any dataset. At $\bar C = 0$ the per-verifier baseline does reach the target at $N = 256$, matching Paper #1’s single-verifier budget. The failure regime is precisely the non-zero-covariance regime in which Theorem 2 applies.

Headline finding 3 (joint mechanism, unknown $\bar C$). Under the joint mechanism with conditional-rare-event probes, the $5\%$-of-first-best target is reached at $N = 512$ on MATH and GSM8K, and at $N = 1024$ on HumanEval. The doubled-budget regime relative to Paper #1’s $N = 256$ is consistent with Theorem 4’s upper bound at $K_1 = K_2 = 16$.
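One way to see why a doubled budget is consistent with the upper bound is to invert the $C_{\mathrm{H}} \sqrt{(\log K_1 + \log K_2)/N}$ rate for $N$. The sketch below does this under a loud assumption: that the single-verifier bound of Paper #1 has the form $c \sqrt{\log K / N}$ with the same constant, which is not stated in this paper. The constant `c_h = 1.0` is a placeholder, since the absolute constant is only given in Appendix C of the PDF.

```python
import math

def probes_for_gap(gap: float, k1: int, k2: int, c_h: float = 1.0) -> int:
    """Smallest N with c_h * sqrt((log k1 + log k2) / N) <= gap (c_h is a placeholder)."""
    return math.ceil(c_h ** 2 * (math.log(k1) + math.log(k2)) / gap ** 2)

def joint_to_single_ratio(k1: int, k2: int, k_single: int) -> float:
    """Budget ratio, ASSUMING the single-verifier bound is c * sqrt(log(k_single) / N)."""
    return (math.log(k1) + math.log(k2)) / math.log(k_single)

ratio = joint_to_single_ratio(16, 16, 16)
print("budget ratio at K1 = K2 = 16:", ratio)
print("implied joint budget from N = 256:", 256 * ratio)
```

Under that assumption the constant cancels and the ratio at $K_1 = K_2 = K$ is $2 \log K / \log K = 2$, so a single-verifier budget of 256 maps to 512 regardless of the value of the constant.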

Headline finding 4 (joint mechanism, known $\bar C$). When $\bar C$ is supplied as a side channel, the joint mechanism recovers Paper #1’s probe budget of $N = 256$ on MATH and GSM8K. HumanEval budget is $N = 384$. The known-vs-unknown correlation gap collapses to roughly a factor of two in probe budget.

Negative finding (identifiability failure on HumanEval). One synthesized verifier-pair population exhibits a rank-deficient joint correlation matrix on the default probe distribution. The pre-flight identifiability check of Proposition 1 catches this case; switching to conditional-rare-event probes restores full rank in $87\%$ of seeds. The remaining $13\%$ require manual probe-distribution intervention. Production deployers should run the identifiability check before relying on the mechanism.
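The exact check of Proposition 1 is not reproduced in this section, so the following is only a hypothetical stand-in showing what a pre-flight rank test might look like. It assumes the verifier reports on the probe set are arranged as a binary matrix with one column per verifier, and it tests the numerical rank of the report covariance matrix (used here instead of the correlation matrix so that a constant column does not cause a division by zero). The function name and tolerance are assumptions.

```python
import numpy as np

def report_rank(reports: np.ndarray, tol: float = 1e-8) -> int:
    """Numerical rank of the covariance of verifier reports.

    reports: binary array of shape (n_probes, n_verifiers), with n_verifiers >= 2.
    """
    cov = np.cov(np.asarray(reports, dtype=float), rowvar=False)
    return int(np.linalg.matrix_rank(cov, tol=tol, hermitian=True))

rng = np.random.default_rng(0)
base = (rng.random((500, 3)) < 0.5).astype(int)

# Four verifiers with independent reports: the probes separate all of them.
independent = np.column_stack([base, (rng.random(500) < 0.5).astype(int)])
# Fourth verifier duplicates the first on every probe: rank drops by one.
duplicated = np.column_stack([base, base[:, 0]])

for name, reports in (("independent", independent), ("duplicated", duplicated)):
    rank = report_rank(reports)
    print(f"{name}: rank {rank} of {reports.shape[1]}")
```

A rank below the number of verifiers signals that the probe set cannot distinguish some candidates from one another. Whether that is the same condition Proposition 1 tests should be confirmed against the PDF before relying on a check of this kind.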

Cross-paper comparison. At matched $K_1 = K_2 = 16$ and $\bar C = 0.1$, the per-verifier curve flattens at a $12\%$ Cost-correct gap independent of $N$, while the joint mechanism curve decays as $1/\sqrt{N}$ and crosses the $5\%$-of-first-best target at $N = 512$.

Simulation harness. Python, NumPy, scikit-learn. Approximately 240 CPU-hours on a single 16-core machine; no GPU required. Released alongside the paper under MIT license; full pseudocode in Appendix G of the PDF.


9. The August 2026 EU AI Act forcing function

The Paper #1 mapping to the August 2, 2026 EU AI Act high-risk obligations establishes that the scoring-rule mechanism’s probe set, verifier reports, and payment ledger together constitute auditable accept-rate evidence at the contractual threshold for one verifier. We revisit the mapping for composed pipelines.

Per-component evidence is insufficient for composed pipelines. Theorem 1 implies that the pipeline accept rate can drift from the per-component product by an amount up to $|C(x)|$. Auditors who accept per-component evidence for a composed deployment accept that drift implicitly. The drift is operationally large in the realistic regime: Section 8 measures $\bar C \in [0.05, 0.15]$ on synthesized process-plus-outcome verifier pairs, which shifts pipeline accept rate by 5 to 15 percentage points on positively-correlated pairs in the rank-1-aligned regime (Ye et al., 2026), with the corresponding Cost-correct gap scaling as $|\bar C| / (\alpha^{\mathrm{pipe}}_{\min})^2$. Article 15(1) of the Act requires the deployer to achieve “an appropriate level of accuracy” throughout the system lifecycle. Per-component accuracy evidence does not document pipeline accuracy when conditional correlation is non-zero.

The joint-mechanism audit trail is the correct compliance artifact for composed deployments. The joint-report ledger of Section 5 documents the empirical joint distribution over $(V_{k_1}, V_{k_2})$ on the probe set. The identifiability check of Proposition 1 documents that the joint distribution is identified from the probe distribution. Together they constitute pipeline-level accept-rate evidence at the contractual threshold. The audit trail is the forward extension of Burnat and Davidson (2026)’s continuous-compliance auditee-gaming framework to the multi-component-verifier setting. A deployer who runs per-verifier audits on a composed pipeline can game the audit by selecting pairs with favorable marginals and unfavorable joint behavior; the joint-report audit prevents this attack.

Article 13 transparency. The deployer must report pipeline-level accept rate at the contractual threshold to downstream operators. The joint mechanism produces $\hat \alpha^{\mathrm{pipe}}$ as a primitive on the probe set; the reporting interface follows directly.

We do not claim the joint mechanism is sufficient for Act compliance overall, since the Act covers risk management and human oversight beyond accept-rate measurement. We claim only that, where the Act requires accept-rate evidence on a composed pipeline, the joint mechanism produces it as a side effect and at low marginal cost relative to per-component audits.


10. Limitations and future work

Two-verifier scope. The composition identity extends to monotone Boolean rules of arity three and above by inclusion-exclusion (Appendix A of the PDF), but the joint scoring rule on $\{0, 1\}^J$ for $J$ slots faces a combinatorial blowup in the joint report space, from four cells at $J = 2$ to $2^J$ cells for general $J$ (eight at $J = 3$). Three-slot composition (PRM plus outcome verifier plus LLM judge) is the natural near-term target.
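The growth of the report space, and what a quadratic score over it looks like, can be sketched as follows. The paper's joint Brier rule is defined in Section 5, which is not reproduced here, so the standard multiclass quadratic (Brier) loss over the $2^J$ cells is used as a stand-in. The function names and the example distribution are illustrative assumptions.

```python
import itertools
import numpy as np

def joint_cells(j: int) -> list:
    """All 2**j joint report cells for j binary verifier slots."""
    return list(itertools.product((0, 1), repeat=j))

def expected_brier_loss(report, truth) -> float:
    """Expected quadratic loss of a reported cell distribution when the cell is drawn from truth."""
    report = np.asarray(report, dtype=float)
    truth = np.asarray(truth, dtype=float)
    one_hot = np.eye(len(truth))
    # Row i holds the loss incurred if the realized cell is i.
    loss_per_cell = ((report[None, :] - one_hot) ** 2).sum(axis=1)
    return float(truth @ loss_per_cell)

for j in (2, 3, 4):
    print(f"J = {j}: {len(joint_cells(j))} cells")

# J = 2 example, cells ordered (0,0), (0,1), (1,0), (1,1).
truth = [0.22, 0.08, 0.18, 0.52]            # marginals 0.7 and 0.6, covariance +0.1
product_report = [0.12, 0.18, 0.28, 0.42]   # same marginals, independence assumed
print("truthful report loss:", expected_brier_loss(truth, truth))
print("product-of-marginals loss:", expected_brier_loss(product_report, truth))
```

In this example the product-of-marginals report incurs a strictly larger expected loss than the truthful joint report, which is the sense in which a score on the joint cells can separate pairs that per-slot scores treat as identical. The cost is that the report vector doubles in length with each added slot.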

Static verifier population. Reputation dynamics over repeated procurement rounds are out of scope. The natural extension connects to Xu and Park (2026) on online Bayesian calibration under gradual and abrupt system changes, and to the moral-hazard structure of Holmström (1979) applied to the joint-report setting.

Programmatic-verifier scope. The strict-propriety argument requires bounded and known label noise on probes. Math, formal logic, and code with strict tests satisfy this. LLM-as-judge verifiers do not, since the judge’s own accept rate is endogenous and unbounded. The rubric-grounded RL framework of Bhattarai et al. (2026) decomposes the judge’s reward into weighted verifiable criteria; the joint-mechanism extension to rubric-judges with bounded per-criterion label noise is a natural next step.

Single deployer. Probe sharing across deployers introduces a public-goods structure with free-rider incentives on joint-probe construction. The natural extension is Paper #3 in the wedge plan, with the bilateral-trade impossibility of Myerson and Satterthwaite (1983) applied to the joint-probe-as-public-good setting.

Calibration-monotone-pair assumption. The lower bound of Theorem 5 requires calibration-monotone-pair $\mathcal{F}_1 \times \mathcal{F}_2$. The upper bound of Theorem 4 does not. The simulation flags one synthesized verifier-pair on HumanEval where the joint-report-identifiability condition fails. The worst-case regret on non-identifiable families is an open problem.

Time-varying joint correlation. $\bar C$ is treated as a static unknown in this paper. Drift in $\bar C$ over the deployer’s task distribution introduces an online-procurement structure that builds on Xu and Park (2026).


11. Conclusion

Paper #1 procures one verifier. This paper procures the composed pipeline. The composition identity gives a clean picture of why per-verifier elicitation does not transfer. Pipeline miscalibration under per-verifier elicitation is exactly the within-instance verifier-disagreement covariance. The joint scoring-rule mechanism implements pipeline cost-correct minimization in dominant strategies at a bounded probe-budget cost: roughly double under unknown correlation, unchanged under known correlation. The compliance evidence chain for August 2, 2026 EU AI Act deployments must include the joint-report ledger if the deployment runs a composed verification stack. Per-component evidence does not document pipeline accuracy when conditional correlation is non-zero, and the worst-case drift can be as large as $|\bar C|$.

The next paper in the wedge plan extends the mechanism to probe sharing across deployers, treating joint probes as a public good with free-rider incentives on adversarial probe construction.


Appendices (in the PDF)

The PDF includes eight appendices with full proofs and additional material.

  • Appendix A. Composition identity for general monotone Boolean rules. Inclusion-exclusion expansions for two- and three-verifier cases, including OR, AND, majority, and the third joint cumulant.
  • Appendix B. Full proof of Theorem 2 (non-implementability under per-verifier elicitation) including the payoff-equivalence argument.
  • Appendix C. Full proof of Theorem 4 (upper bound) with the split-at-$u_0$ tail integration and the absolute constant.
  • Appendix D. Le Cam packing for Theorem 5 (lower bound).
  • Appendix E. Proof of Proposition 3 (conditional-rare-event probes).
  • Appendix F. Verifier-pair synthesis with prescribed conditional correlation. Calibration of the shared-latent coupling parameter.
  • Appendix G. Simulation pseudocode.
  • Appendix H. Notation summary.

References

  1. Bhardwaj, M. Verifier Procurement Under Unobservable Quality. A Scoring-Rule Mechanism for Cost-Correct Minimization. Research Paper #1, verification-economics wedge. ifitsmanu.com, 2026.
  2. Bhardwaj, M. The Cost of Being Right. Verification Economics in 2026. Field Notes #2. ifitsmanu.com, 2026.
  3. Bhardwaj, M. The α Asymmetry. Why Verifiers Can Be Smaller Than Generators. Field Notes #3. ifitsmanu.com, 2026.
  4. Bhattarai, M., Boureima, I., Ranasinghe, N. R., Pakin, S., O’Malley, D. Rubric-Grounded RL. Structured Judge Rewards for Generalizable Reasoning. arXiv:2605.08061, 2026.
  5. Burnat, F. A. D., Davidson, B. I. A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring. arXiv:2605.06340, 2026.
  6. Chen, M., Tworek, J., Jun, H., Yuan, Q., et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021.
  7. Cobbe, K., Kosaraju, V., Bavarian, M., et al. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168, 2021.
  8. Cover, T. M., Thomas, J. A. Elements of Information Theory. 2nd edition, Wiley-Interscience, 2006.
  9. Dasgupta, A., Ghosh, A. Crowdsourced Judgement Elicitation with Endogenous Proficiency. WWW ’13, ACM, 2013.
  10. European Parliament and Council. Regulation (EU) 2024/1689 on Artificial Intelligence (AI Act). OJ EU, 12 July 2024. High-risk obligations apply from 2 August 2026.
  11. Frongillo, R., Kash, I. A. General Truthfulness Characterizations Via Convex Analysis. Games and Economic Behavior 130, 636–662, 2021.
  12. Gneiting, T., Raftery, A. E. Strictly Proper Scoring Rules, Prediction, and Estimation. JASA 102(477), 359–378, 2007.
  13. Grimmett, G., Welsh, D. Probability: An Introduction. 2nd edition, Oxford University Press, 2014.
  14. Guan, X., Zhang, L. L., Liu, Y., et al. rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv:2501.04519, 2025.
  15. Hendrycks, D., Burns, C., Kadavath, S., et al. Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS 2021 Datasets and Benchmarks.
  16. Hoeffding, W. Probability Inequalities for Sums of Bounded Random Variables. JASA 58(301), 13–30, 1963.
  17. Holmström, B. Moral Hazard and Observability. Bell Journal of Economics 10(1), 74–91, 1979.
  18. Karnin, Z., Koren, T., Somekh, O. Almost Optimal Exploration in Multi-Armed Bandits. ICML 2013.
  19. Kong, Y., Schoenebeck, G. An Information Theoretic Framework for Designing Information Elicitation Mechanisms That Reward Truth-Telling. ACM TEAC 7(1), 2019.
  20. Le Cam, L. Convergence of Estimates Under Dimensionality Restrictions. Annals of Statistics 1(1), 38–53, 1973.
  21. Lightman, H., Kosaraju, V., Burda, Y., et al. Let’s Verify Step by Step. ICLR 2024 / arXiv:2305.20050, 2023.
  22. Lovén, L. Honest Reporting in Scored Oversight. True-KL0 Property via the Prekopa Principle. arXiv:2605.03793, 2026.
  23. Miller, N., Resnick, P., Zeckhauser, R. Eliciting Informative Feedback. The Peer-Prediction Method. Management Science 51(9), 1359–1373, 2005.
  24. Myerson, R. B. Optimal Auction Design. Mathematics of Operations Research 6(1), 58–73, 1981.
  25. Myerson, R. B., Satterthwaite, M. A. Efficient Mechanisms for Bilateral Trading. J. Economic Theory 29(2), 265–281, 1983.
  26. Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Series in Statistics, 2009.
  27. Uesato, J., Kushman, N., Kumar, R., et al. Solving Math Word Problems With Process- and Outcome-Based Feedback. arXiv:2211.14275, 2022.
  28. Wang, X., Wei, J., Schuurmans, D., et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023 / arXiv:2203.11171, 2022.
  29. Witkowski, J., Parkes, D. C. A Robust Bayesian Truth Serum for Small Populations. AAAI 2012.
  30. Xu, Y., Park, C. Online Bayesian Calibration under Gradual and Abrupt System Changes. arXiv:2605.06612, 2026.
  31. Ye, H., Dang, J., Fang, J., et al. On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR. arXiv:2605.06523, 2026.
  32. Zheng, L., Chiang, W.-L., Sheng, Y., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.

Cite this article

@misc{bhardwaj2026verifiercomposition,
  author       = {Bhardwaj, Manu},
  title        = {Calibration Drift Under Verifier Composition: A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization},
  year         = {2026},
  month        = {May},
  url          = {https://ifitsmanu.com/papers/verifier-composition},
  howpublished = {\url{https://ifitsmanu.com/papers/verifier-composition/paper.pdf}},
  note         = {Working paper. Version 1.0. Research Paper \#2 in the verification-economics wedge.}
}
