Calibration Drift Under Verifier Composition. A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization.

Q: What is the composition identity?

For two binary verifiers V_1 and V_2 with marginal accept rates alpha_1(x), alpha_2(x) and within-instance disagreement covariance C(x), the AND-rule pipeline accept rate satisfies E[V_1 AND V_2 | x] = alpha_1(x) · alpha_2(x) + C(x) identically. The additive correction is exactly the covariance, not a bounded error term. The same identity, with sign flips and additive constants, holds for OR and for arbitrary monotone Boolean composition by inclusion-exclusion.

Q: Why does per-verifier elicitation fail on composed pipelines?

Per-slot payment under any strictly proper scoring rule depends only on the slot's marginal reports against ground-truth labels. The marginal reports identify only the marginal accept rate. The pipeline accept rate is the marginal accept rate plus the disagreement covariance by Theorem 1. The covariance is not identified by any per-slot rule, so two candidate pairs with matched marginals but mismatched joint distributions are paid identically and selected at chance, while their pipeline accept rates differ by exactly the disagreement covariance.

Q: What is the joint scoring-rule mechanism?

Each candidate verifier-pair reports a joint distribution over the cross-product report space {0,1}^2 on each of N probes with known ground-truth labels. The mechanism pays the pair the value of a strictly proper scoring rule (for example the joint Brier score) applied to the joint report against the observed joint outcome on the probe. Atomic commitment (sealed-bid joint submission with commit-reveal hash) prevents post-observation conditioning. The selection rule is empirical argmax over pairs.

Q: What is the regret bound?

The expected gap from first-best Cost-correct on the composed pipeline is at most C_H · sqrt((log K_1 + log K_2) / N) over K_1 · K_2 candidate pairs, by Hoeffding plus a union bound. A matching lower bound of the same order holds on a calibration-monotone-pair family by Le Cam's two-point method. The mechanism is minimax optimal up to log factors. At K_1 = K_2 = 16 and target gap 5%, probe budget is approximately N = 2200, against Paper #1's N = 1100 at K = 16.

Q: Why does this matter for EU AI Act compliance?

The Act's Article 15 obligations on high-risk deployers entering force on August 2, 2026 require accept-rate evidence at the contractual threshold throughout the lifecycle. The composition identity implies pipeline accept rate can drift from the per-component product by up to |C(x)|, which Section 8 measures at 5 to 15 percentage points on synthesized process-plus-outcome verifier pairs in the rank-1-aligned regime. Auditors who accept per-component records on composed deployments accept that drift implicitly. The joint-report ledger of the joint mechanism is the correct compliance artifact for composed pipelines.

Manu Bhardwaj

Calibration Drift Under Verifier Composition.

A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization.

Manu Bhardwaj. ifitsmanu.com. May 2026. Version 1.0. Research Paper #2 in the verification-economics wedge.

Download as PDF (full proofs, figures, simulation pseudocode, appendices A through H). LaTeX source. BibTeX of references. Cite this article. Papers index.

Companion to Verifier Procurement. Verifier Procurement Under Unobservable Quality. (Research Paper #1 in the verification-economics wedge) procures one verifier under unobservable quality. This paper procures the composed pipeline. The companion field notes develop the Cost-correct decomposition (The Cost of Being Right., Field Notes #2) and the verifier-dominance result (The α Asymmetry., Field Notes #3) that make verifier accept rate the binding lever.

Or view the full PDF inline.

Abstract

Production large language model verification is composed. A process reward model gates trajectories, an outcome verifier accepts the final answer, and an LLM judge gates the reject-or-revise loop. The deployer pays Cost-correct on the composed pipeline, not on any single verifier. The procurement mechanism of Verifier Procurement Under Unobservable Quality elicits one verifier at a time. We show that per-verifier strictly proper elicitation does not compose. Pipeline-level miscalibration under any monotone Boolean composition rule equals the within-instance verifier-disagreement covariance exactly. Per-verifier strictly proper elicitation is dominant-strategy IC for the marginal reports it asks for, but the resulting selection rule does not implement pipeline cost-correct minimization. Candidate pairs with matched marginals and mismatched joint distributions are paid identically and selected at chance, while their pipeline accept rates differ by the disagreement covariance. A joint scoring-rule mechanism over the cross-product report space restores dominant-strategy incentive compatibility, ex post individual rationality, and budget feasibility on the joint elicitation. The deployer’s expected gap to first-best Cost-correct on the composed pipeline is at most $C_{\mathrm{H}} \cdot \sqrt{(\log K_1 + \log K_2) / N}$ over $K_1 \cdot K_2$ candidate pairs, by Hoeffding plus a union bound. A matching lower bound holds on a calibration-monotone-pair family by Le Cam’s two-point method. The mechanism is therefore minimax optimal up to log factors. Simulation on MATH, GSM8K, and HumanEval with $K_1, K_2 \in \{4, 8, 16\}$ and probe budget $N \in \{16, ..., 4096\}$ shows the joint mechanism reaching Paper #1’s $5\%$ -of-first-best operational target at $N = 512$ under unknown joint correlation, roughly double Paper #1’s $N = 256$ , and at $N = 256$ when correlation is supplied as a side channel. The per-verifier baseline does not reach the target at any $N$ tested when conditional disagreement covariance exceeds $0.1$ . The compliance corollary is sharp. Per-component procurement records are not sufficient evidence under the European Union AI Act high-risk obligations entering force on August 2, 2026. The audit trail must include the joint-report ledger.

1. Introduction

The verification-economics framing of The Cost of Being Right treats the verifier accept rate $\alpha$ as the binding lever in cost-per-correct-answer for large language model deployments. The companion analysis on the α-asymmetry shows that the partial of Cost-correct with respect to $\alpha$ dominates the partials with respect to per-token price, the reasoning multiplier $R$ , and the rollout ratio $\bar\rho$ in the rStar-Math regime (Guan et al., 2025). The procurement mechanism of Verifier Procurement Under Unobservable Quality gives a dominant-strategy incentive-compatible scoring-rule mechanism that selects a single verifier with provable regret $\sqrt{\log K / N}$ versus the oracle-best in a candidate population of size $K$ on $N$ adversarially constructed probes.

A typical production verification stack is not a single verifier. The deployer runs a process reward model that scores intermediate trajectories (Lightman et al., 2023; Uesato et al., 2022), an outcome verifier that accepts the final answer (Cobbe et al., 2021), and one or more LLM judges that gate a reject-or-revise loop (Zheng et al., 2023). Each component can be procured under the one-verifier mechanism. The composed pipeline is what the deployer pays Cost-correct on. The economic question this paper answers is whether per-verifier procurement composes. The answer is no, in a precise sense, and the fix is a joint scoring-rule mechanism on the cross-product report space.

Four contributions.

Theorem 1 (composition identity). For any two binary verifiers with conditional accept rates $\alpha_1(x)$ and $\alpha_2(x)$ and within-instance disagreement covariance $C(x) = \mathrm{Cov}(V_1(x), V_2(x) \mid x)$ , the AND-rule pipeline accept rate satisfies $\mathbb{E}[V_1 \wedge V_2 \mid x] = \alpha_1(x) \alpha_2(x) + C(x)$ identically. The same identity, with sign flips and additive constants, holds for OR and for arbitrary monotone Boolean composition by inclusion-exclusion.

Theorem 2 (non-implementation of pipeline cost-correct). Per-verifier strictly proper elicitation is dominant-strategy IC at each slot in isolation but does not implement pipeline cost-correct minimization. Under any non-degenerate joint distribution over verifier reports, applying the one-verifier scoring-rule mechanism of Paper #1 independently to each slot and composing the selected verifiers under a monotone Boolean rule yields a selection rule that, under truthful marginal reporting, does not separate candidate pairs with matched marginal accept rates and mismatched joint distributions. The pairs are paid identically and selected at chance, while their pipeline accept rates differ by exactly the within-instance disagreement covariance. The non-implementation is ex ante undetectable from marginal reports.

Theorems 3 and 4 (joint mechanism with matching regret bounds). A joint scoring-rule mechanism that pays each candidate verifier-pair the value of a strictly proper scoring rule (Gneiting and Raftery, 2007; Frongillo and Kash, 2021) applied to the joint report distribution on the cross-product space $\{0, 1\}^2$ is dominant-strategy IC, ex post IR, and budget feasible under a per-probe payment cap. The deployer who selects the verifier-pair with highest empirical joint score incurs expected regret of at most $C_{\mathrm{H}} \cdot \sqrt{(\log K_1 + \log K_2) / N}$ versus the oracle-best pair, by Hoeffding’s inequality plus a union bound. A matching lower bound holds on a calibration-monotone-pair family by Le Cam’s two-point method (Le Cam, 1973; Tsybakov, 2009). The mechanism is minimax optimal up to log factors.

Simulation result. Synthesized verifier pairs on MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and HumanEval (Chen et al., 2021), with controlled disagreement covariance $C \in \{-0.2, -0.1, 0, +0.1, +0.2\}$ and $K_1, K_2 \in \{4, 8, 16\}$ . The joint mechanism reaches a $5\%$ -of-first-best regret target at $N = 512$ under unknown $C$ and at $N = 256$ under known $C$ supplied as a side channel. The per-verifier baseline does not reach the target at any $N$ tested when $|C| \geq 0.1$ .

The contribution that goes beyond Paper #1 is the move from single-verifier procurement to pipeline procurement. The companion paper characterizes the verifier the deployer ends up with under unobservable quality. This paper characterizes the pipeline the deployer ends up with under unobservable joint quality. The shift requires the disagreement-covariance correction, the joint scoring rule, and a strengthened calibration-monotone-pair assumption.

The contribution beyond classical peer prediction (Miller, Resnick, and Zeckhauser, 2005; Witkowski and Parkes, 2012; Kong and Schoenebeck, 2019; Frongillo and Kash, 2021) is the procurement framing. Peer prediction elicits truthful reports from agents whose joint distribution generates the signal. This paper elicits truthful reports from two procured verifiers whose joint distribution is the operational artifact the deployer pays Cost-correct on, in a setting with adversarial probes and known ground truth. The grounded-probe assumption inherited from Paper #1 rules in strict propriety in dominant strategies, not Nash, and rules out the common-prior assumptions that the peer-prediction tradition spent fifteen years removing.

The contribution beyond the recent process-reward-modeling literature (Lightman et al., 2023; Uesato et al., 2022; Cobbe et al., 2021) is the composition analysis. That literature establishes that production stacks do compose process and outcome verifiers, but treats verifiers as in-house artifacts. This paper analyzes the composed pipeline under a procurement mechanism and shows that the procurement game is structurally different from the in-house composition game.

The result has an external forcing function. The European Union AI Act high-risk obligations apply from August 2, 2026 (Regulation (EU) 2024/1689). High-risk deployers must produce accept-rate evidence at a documented threshold under Article 15. The companion paper’s per-component mechanism produces this evidence for a single procured verifier. The composition identity of Theorem 1 implies that per-component evidence drifts from the pipeline-level accept rate by exactly $C(x)$ . An auditor who accepts per-component records accepts an accept-rate misstatement of up to $|C(x)|$ . The joint-mechanism audit trail closes that gap.

The rest of the paper is organized as follows. Section 2 sets up the model. Section 3 proves the composition identity. Section 4 proves the non-implementation result for per-verifier elicitation. Section 5 constructs the joint scoring-rule mechanism. Section 6 proves matching regret bounds. Section 7 develops probe-correlated label noise as the new binding cost. Section 8 reports the simulation. Section 9 returns to the EU AI Act forcing function. Section 10 records limitations and future work.

2. Model

We extend the single-verifier setup of Paper #1 to a two-slot setting. Three-and-up composition follows by induction for AND and OR; the general monotone case is handled in Appendix E of the PDF.

Players. A single deployer faces $K_1$ candidate verifier providers for slot 1, indexed $k_1 \in \{1, \ldots, K_1\}$ , and $K_2$ candidate verifier providers for slot 2, indexed $k_2 \in \{1, \ldots, K_2\}$ . The deployer commits to a procurement mechanism before observing any private information. Each verifier provider knows its own type and observes the mechanism.

Task distribution. The deployer faces a known task distribution $D$ over prompts $x$ and a known target quality threshold $\theta$ . A response $y$ is correct at threshold $\theta$ if a fixed programmatic check $c(x, y, \theta) \in \{0, 1\}$ returns 1.

Verifier type. Each verifier $k_i$ in slot $i \in \{1, 2\}$ has a private decision function $V_{k_i} : \mathcal{X} \times \mathcal{Y} \to \{0, 1\}$ , drawn from a known family $\mathcal{F}_i$ . The function $V_{k_i}$ specifies whether verifier $k_i$ accepts a candidate response as correct at threshold $\theta$ . Verifier types are private. The families $\mathcal{F}_1, \mathcal{F}_2$ and the per-prompt cost-of-quality functions are common knowledge.

Joint distribution. Verifier reports from the two slots are not assumed independent. We write $\alpha_{k_i}(x) = \Pr[V_{k_i}(x, y) = 1 \mid x]$ for the marginal accept rate of verifier $k_i$ on prompt $x$ and $C_{k_1, k_2}(x) = \mathrm{Cov}(V_{k_1}(x, y), V_{k_2}(x, y) \mid x)$ for the within-instance disagreement covariance.

Composition rule. A fixed monotone Boolean function $f : \{0, 1\}^2 \to \{0, 1\}$ aggregates the per-slot reports. The default rule is AND, $f(r_1, r_2) = r_1 \wedge r_2$ . The OR rule and the generic monotone case are treated in appendices.

Pipeline accept rate. Under composition rule $f$ and verifier pair $(k_1, k_2)$ , $\alpha^{\mathrm{pipe}}_{k_1, k_2}(x) = \mathbb{E}\!\left[f(V_{k_1}(x, y), V_{k_2}(x, y)) \mid x\right].$ For the AND rule, $\alpha^{\mathrm{pipe}}_{k_1, k_2}(x) = \alpha_{k_1}(x) \alpha_{k_2}(x) + C_{k_1, k_2}(x)$ by Theorem 1 below.

Cost-correct on the pipeline. Per-task cost under pair $(k_1, k_2)$ is, extending The Cost of Being Right, $\mathrm{CostCorrect}(k_1, k_2) = \frac{\mathrm{CPM}_{1:1} \cdot R \cdot (1 + \bar\rho)}{\mathbb{E}_x[\alpha^{\mathrm{pipe}}_{k_1, k_2}(x)]},$ with $\mathrm{CPM}_{1:1}$ , $R$ , and $\bar\rho$ held fixed across pair choice. The deployer minimizes $\mathrm{CostCorrect}$ , which is equivalent to maximizing the expected pipeline accept rate.

Probe set. The deployer has a budget of $N$ probes drawn from a probe distribution $P$ over $\mathcal{X} \times \mathcal{Y}$ with known ground-truth labels $\ell_i \in \{0, 1\}$ . Probes may be adversarial with respect to $\mathcal{F}_1 \times \mathcal{F}_2$ . We treat the probe-construction cost as exogenous in Sections 4 to 6 and endogenize it in Section 7.

Mechanism. A direct mechanism is a pair $(s, t)$ where $s$ is a selection rule mapping joint reports to a chosen verifier-pair and $t$ is a payment rule. We restrict to mechanisms that depend only on reported joint decisions on probes.

Solution concept. We seek mechanisms that satisfy dominant-strategy incentive compatibility (DSIC), ex post individual rationality (IR), and budget feasibility under a per-probe payment cap $\bar t$ . We measure performance by expected regret to first-best on the composed pipeline.

Calibration-monotone-pair family. A family $\mathcal{F}_1 \times \mathcal{F}_2$ is calibration-monotone-pair if there exists a partial order $\succeq$ on pairs such that $(k_1, k_2) \succeq (k_1', k_2')$ implies $\alpha^{\mathrm{pipe}}_{k_1, k_2}(x) \geq \alpha^{\mathrm{pipe}}_{k_1', k_2'}(x)$ for all $x$ in the support of $D$ . The condition is a strict strengthening of the calibration-monotone assumption of Paper #1. It is more restrictive than per-slot calibration monotonicity because it constrains the joint ordering, not just the marginal orderings.

3. The composition identity

Theorem 1 (composition identity for AND). Let $V_1, V_2 : \mathcal{X} \times \mathcal{Y} \to \{0, 1\}$ be binary verifiers with marginal accept rates $\alpha_1(x), \alpha_2(x)$ and within-instance disagreement covariance $C(x)$ . Then $\mathbb{E}[V_1 \wedge V_2 \mid x] = \alpha_1(x) \alpha_2(x) + C(x).$

Proof. For binary random variables, $V_1 \wedge V_2 = V_1 \cdot V_2$ pointwise. Take conditional expectation given $x$ , $\mathbb{E}[V_1 V_2 \mid x] = \mathbb{E}[V_1 \mid x] \mathbb{E}[V_2 \mid x] + \mathrm{Cov}(V_1, V_2 \mid x) = \alpha_1(x) \alpha_2(x) + C(x).$ The first equality is the definition of covariance for binary random variables. The second substitutes the definitions of $\alpha_i$ and $C$ . $\square$

Corollary 1 (composition identity for OR). Under the same hypotheses, $\mathbb{E}[V_1 \vee V_2 \mid x] = \alpha_1(x) + \alpha_2(x) - \alpha_1(x) \alpha_2(x) - C(x).$

Proof. $V_1 \vee V_2 = V_1 + V_2 - V_1 V_2$ pointwise for binary $V_i$ . Apply linearity and Theorem 1. $\square$

Corollary 2 (general monotone Boolean rules). For monotone Boolean $f$ on $m$ binary verifiers, $\mathbb{E}[f(V_1, \ldots, V_m) \mid x]$ is a polynomial in the marginal accept rates and the higher-order joint moments, with coefficients given by Möbius inversion over the monotone-Boolean lattice. Two- and three-verifier expansions are in Appendix A of the PDF.

Discussion. Theorem 1 is elementary. Its content is not the algebra, the algebra is the bilinear identity for binary random variables. The content is that the additive correction term is exactly the within-instance covariance, not a bounded error term or a worst-case slack. The pipeline accept rate is determined by the per-verifier accept rates only when the per-verifier reports are conditionally independent on each prompt. Production verifier stacks are not conditionally independent. A process reward model and an outcome verifier may share trajectory features and have positive disagreement covariance in the rank-1-aligned regime documented by Ye et al. (2026); the construction protocols of Lightman et al. (2023) and Cobbe et al. (2021) do not separate the two verifiers’ training-trajectory distributions.

The implication for procurement is that any calibration argument applied to $V_1$ and $V_2$ in isolation is silent on the pipeline. The reverse is also true. Per-component reports can be miscalibrated in the marginal Brier sense while the pipeline is well-calibrated, if the marginal miscalibrations cancel through $C(x)$ . Neither direction is the safe one to assume in production.

4. Per-verifier elicitation does not implement pipeline cost-correct

Setup. The deployer runs the one-verifier mechanism of Paper #1 independently for slot 1 and slot 2. Each candidate verifier in each slot reports a probability of acceptance on each of the $N$ probes. Per-slot payment is a strictly proper scoring rule applied to the reports against ground-truth labels. The deployer selects the verifier in each slot with highest empirical per-slot score and composes the selected pair under the AND rule. We call this the per-verifier mechanism.

The per-verifier mechanism is DSIC at each slot in isolation, because strict propriety makes truthful marginal reporting dominant on each slot’s payment rule. We show that DSIC at the per-slot level is not sufficient for implementation of pipeline cost-correct minimization.

Theorem 2 (non-implementability of pipeline cost-correct under per-verifier elicitation). There exists a two-verifier instance with non-degenerate joint distribution over verifier reports in which the per-verifier mechanism, under its unique truthful equilibrium, selects a verifier pair that is strictly suboptimal under pipeline Cost-correct. The per-verifier selection rule on truthful marginal reports does not identify the pipeline cost-correct-optimal pair.

Construction. Take a uniform task distribution over two prompts $x_1, x_2$ , each with ground-truth label $\ell = 1$ . Fix one slot-2 verifier $V_2$ with marginal accept rate $\alpha_2 = 0.6$ on every prompt. Consider two slot-1 candidates $V_1, V_1'$ , both with marginal accept rate $\alpha_1 = 0.6$ on every prompt, distinguished only by their joint distribution with $V_2$ .

Joint state	$(V = 1, V_2 = 1)$	$(V = 1, V_2 = 0)$	$(V = 0, V_2 = 1)$	$(V = 0, V_2 = 0)$	$C(x)$
$V_1$	0.40	0.20	0.20	0.20	$+0.04$
$V_1'$	0.36	0.24	0.24	0.16	$\hspace*{0.7em}0.00$

Both candidates have marginal $\alpha = 0.6$ . Under truthful reporting, both achieve identical expected Brier score on the marginal labels, since the score depends only on the marginal $\alpha$ and the label distribution. The per-verifier mechanism selects between $V_1$ and $V_1'$ uniformly at random.

By Theorem 1, the AND-pipeline accept rate is $\alpha_1 \alpha_2 + C$ . The pair $(V_1, V_2)$ achieves $0.6 \cdot 0.6 + 0.04 = 0.40$ . The pair $(V_1', V_2)$ achieves $0.6 \cdot 0.6 + 0 = 0.36$ . The cost-correct-optimal pair is strictly $(V_1, V_2)$ by an $\alpha$ -gap of $0.04$ , which translates to a Cost-correct gap of $0.04/0.36 \approx 11\%$ . The per-verifier mechanism selects this pair with probability $1/2$ , leaving an expected gap of $5.5\%$ on the table.

The gap is not closed by collecting more probes. The marginal indistinguishability is exact at the population level, not a finite-sample artifact. Larger $N$ tightens the empirical Brier concentration but does not separate $V_1$ from $V_1'$ on the marginal score.

Why this is the right negative result. The non-implementation requires conditional correlation. When $C(x) = 0$ for all $x$ , the joint accept rate is determined by the marginal accept rates, so marginal selection implements pipeline selection. The construction is non-trivial only when $C(x) \neq 0$ , which is the realistic regime where PRMs and outcome verifiers project onto correlated trajectory features (Ye et al., 2026). The negative result bites in production.

Corollary 3 (no per-verifier rescue). No per-verifier scoring rule, including any strictly proper rule in the class of Gneiting and Raftery (2007), implements pipeline Cost-correct minimization on a non-degenerate joint distribution.

The proof uses payoff equivalence (Myerson, 1981): per-slot payment under any per-verifier rule depends only on marginal reports, which identify only the marginal accept rate; the pipeline accept rate is the marginal accept rate plus the disagreement covariance by Theorem 1; the covariance is not identified by any per-slot rule. Full argument in Appendix B of the PDF.

Strategic refinement. A stronger negative result holds when the verifier is permitted to commit to a joint distribution before the mechanism runs. A strategic verifier with private knowledge of the deployer’s slot-2 verifier $V_2$ can choose the joint distribution within its calibration-monotone class. Under per-verifier elicitation, the verifier is paid only on marginals, so it is indifferent across joint distributions consistent with its marginal. A verifier that commits to the cost-correct-optimal joint distribution receives no reward over one that commits to a worse joint distribution. The deployer’s selection is then dominated by exogenous noise. Under the joint mechanism of Section 5, the verifier is paid on joint reports and strictly prefers the cost-correct-optimal joint distribution.

5. The joint scoring-rule mechanism

Construction. Fix a strictly proper scoring rule $S : \Delta(\{0, 1\}^2) \times \{0, 1\}^2 \to \mathbb{R}$ on the joint distribution over the cross-product report space, for instance the joint Brier score $S(\hat q, (r_1, r_2)) = -\sum_{(a, b) \in \{0, 1\}^2} \left(\hat q(a, b) - \mathbf{1}[(r_1, r_2) = (a, b)]\right)^2,$ which is strictly proper by the multidimensional extension of Gneiting and Raftery (2007). Each candidate pair $(V_{k_1}, V_{k_2})$ reports a joint distribution $\hat q_{(k_1, k_2), n} \in \Delta(\{0, 1\}^2)$ on each probe $n$ . The mechanism pays the pair $t_{(k_1, k_2)}(\hat q, r) = a + b \cdot \frac{1}{N} \sum_{n=1}^N S\!\left(\hat q_{(k_1, k_2), n}, (V_{k_1}(x_n, y_n), V_{k_2}(x_n, y_n))\right),$ for constants $a \geq 0$ and $b > 0$ chosen to enforce ex post IR and the per-probe payment cap. The selection rule is empirical $\arg\max$ over pairs.

Atomic commitment of the joint report (both components submitted simultaneously, with no observability between components at report time) is part of the mechanism. Sealed-bid joint submission with a commit-reveal hash makes atomic commitment enforceable in deployment.

Theorem 3 (joint mechanism). Under the joint scoring-rule mechanism with $a$ chosen so that $a + b \cdot \min_S \geq 0$ , where $\min_S$ is the infimum of $S$ on its domain, the mechanism is dominant-strategy incentive-compatible, ex post individually rational, and budget feasible under per-probe payment cap $\bar t = a/N + b \cdot \max_S / N$ .

Proof. Strict propriety of $S$ on $\Delta(\{0, 1\}^2)$ implies that for any belief $q$ a verifier pair holds about the joint distribution of $(V_{k_1}, V_{k_2})$ given $(x, y, \ell)$ , the unique maximizer of $\mathbb{E}_{(V_{k_1}, V_{k_2})} S(\hat q, (V_{k_1}, V_{k_2}))$ over $\hat q$ is $\hat q = q$ . The multidimensional version of strict propriety is established in Frongillo and Kash (2021) via convex analysis of the Bregman-divergence representation. Atomic commitment of the joint report rules out post-observation conditioning, so the dominant strategy is truthful joint reporting on the cross-product space, which is the report space that identifies the pipeline accept rate by Theorem 1. Individual rationality follows from the choice of $a$ . Budget feasibility follows from the per-probe payment cap. $\square$

Identifiability condition. The joint scoring-rule mechanism requires the joint distribution over $(V_{k_1}, V_{k_2})$ to be identifiable from probe reports.

Proposition 1 (identifiability sufficient condition). If the probe distribution $P$ contains at least two probe types whose conditional joint distributions over $(V_{k_1}, V_{k_2})$ differ as distributions on $\{0, 1\}^2$ , equivalently if the empirical joint-report correlation matrix on the probe set has rank at least two, then the joint scoring rule is identifying in the sense that the unique strategy maximizing expected payment is truthful joint reporting.

The condition is straightforward to check at deployment time. Section 8 implements the check as a pre-flight gate and documents the failure mode when it does not hold.

Connections. The joint elicitation extends multi-task peer prediction (Dasgupta and Ghosh, 2013) to the grounded-probe setting. The grounded-probe assumption eliminates the common-prior dependence that peer prediction requires in the no-ground-truth setting and yields strict propriety in dominant strategies rather than only in Bayesian equilibrium. The mechanism is structurally close to Kong and Schoenebeck (2019)‘s information-theoretic framework, with the joint-report space playing the role of the complementarity carrier. Lovén (2026) proves DSIC for a parametric pseudospherical scoring family in scored AI oversight via the Prekopa principle; the joint mechanism inherits the strict-propriety guarantee per slot and extends it to the cross-product report space.

6. Regret bounds for the joint mechanism

Theorem 4 (upper bound). Let $\alpha^{\mathrm{pipe}}_{k_1, k_2}(Q) := \mathbb{E}_{(x, y) \sim Q}[f(V_{k_1}(x, y), V_{k_2}(x, y))]$ denote the population pipeline accept rate. Let $(k_1^*, k_2^*) = \arg\max \alpha^{\mathrm{pipe}}(D)$ be the oracle-best pair. Suppose probes are drawn iid from a probe distribution $P$ with $\alpha^{\mathrm{pipe}}(P) = \alpha^{\mathrm{pipe}}(D)$ for all pairs. Then the expected gap of the empirical $\arg\max$ rule is $\mathbb{E}\!\left[\alpha^{\mathrm{pipe}}_{k_1^*, k_2^*}(D) - \alpha^{\mathrm{pipe}}_{\hat k_1, \hat k_2}(D)\right] \leq C_{\mathrm{H}} \cdot \sqrt{\frac{\log K_1 + \log K_2}{N}}$ for a universal Hoeffding constant $C_{\mathrm{H}}$ (distinct from the disagreement covariance $C(x)$ of Theorem 1).

Proof sketch. The empirical pipeline accept rate is a bounded iid average in $[0, 1]$ for each pair. By Hoeffding’s inequality, $\Pr[|\hat \alpha^{\mathrm{pipe}} - \alpha^{\mathrm{pipe}}| > \epsilon] \leq 2 e^{-2 N \epsilon^2}$ . Union over $K_1 \cdot K_2$ pairs and apply the standard $\arg\max$ regret argument. The tail-integration step uses the split-at- $u_0$ trick with $u_0 = \sqrt{2 \log(2 K_1 K_2) / N}$ ; the union-bounded tail at $u_0$ equals $1$ , the Mills-ratio bound gives the upper-tail integral $\leq 1/(N u_0) \leq u_0$ , so $\mathbb{E}[\Delta] \leq 2 u_0 \leq C_{\mathrm{H}} \sqrt{(\log K_1 + \log K_2)/N}$ . Full computation in Appendix C of the PDF. $\square$

Theorem 5 (lower bound). Suppose $\mathcal{F}_1 \times \mathcal{F}_2$ is calibration-monotone-pair and contains at least two distinct pairs with positive pipeline-accept-rate gap. Then for any mechanism $(s, t)$ and any $K_1, K_2 \geq 2$ , there exists a profile of types such that $\mathbb{E}\!\left[\alpha^{\mathrm{pipe}}_{k_1^*, k_2^*}(D) - \alpha^{\mathrm{pipe}}_{s}(D)\right] \geq c \cdot \sqrt{\frac{\log K_1 + \log K_2}{N}}$ for a constant $c > 0$ .

Proof sketch. Le Cam two-point method (Le Cam, 1973; Tsybakov, 2009). Construct a packing of $\Theta(K_1 K_2)$ pair-type profiles pairwise indistinguishable at total variation $O(\sqrt{N} \cdot \Delta_{\mathrm{pipe}})$ . The reduction from selection regret to estimation error follows from the calibration-monotone-pair assumption. Full argument in Appendix D of the PDF. $\square$

Theorems 4 and 5 together imply the joint scoring-rule mechanism is minimax optimal up to log factors over calibration-monotone-pair families.

Comparison to Paper #1. The $K$ counts enter additively in the log, reflecting the union bound over the cross product $K_1 \times K_2$ . The $N$ dependence is unchanged at $\sqrt{1/N}$ . At $K_1 = K_2 = 16$ and $\epsilon = 0.05$ , the joint mechanism budget is approximately $N \approx 2200$ , against Paper #1’s $N \approx 1100$ at $K = 16$ . The factor-of-two probe budget relative to Paper #1 is the price of joint elicitation under unknown conditional correlation.

7. Probe-correlated label noise as the new binding cost

Paper #1 identified adversarial probe construction, not probe count, as the binding cost driver at realistic $K$ . We extend that analysis to the composed setting and identify the new content. Joint discriminability, not marginal discriminability, is the property that probes must have to identify the oracle-best pair.

Proposition 2 (marginal-vs-joint probe-budget gap). Under known conditional correlation, the joint-elicitation probe budget equals the per-slot Paper #1 budget. Under unknown conditional correlation, the ratio inflates by the marginal-vs-joint discrimination ratio, bounded by $\sqrt{K_1 K_2 / (K_1 + K_2)}$ in the worst case.

Three joint-probe construction strategies.

Marginal-disagreement probes. Maximize per-slot accept-or-reject entropy. Default when a deployer reuses a Paper #1 probe set. Discriminates marginally, not jointly.

Joint-disagreement probes. Maximize the entropy of the empirical joint-report distribution over $K_1 \cdot K_2$ candidate pairs. Construction cost scales as $K_1 \cdot K_2$ queries per candidate probe. Discriminates jointly.

Conditional-rare-event probes. Target probes where $\Pr[V_{k_1} = 1, V_{k_2} = 0 \mid x]$ is small for some focal pair. Highly informative about $C(x)$ .

Proposition 3 (conditional-rare-event probes). Under conditional-rare-event probe construction with a probe pool size $M \geq K_1 K_2$ , the leading constant in the Theorem 4 regret bound decreases by a factor of order $\sqrt{\min(K_1, K_2)}$ relative to marginal-disagreement probes, at the cost of per-probe construction cost scaling as $K_1 + K_2$ .

The proof adapts the sequential-elimination analysis of Karnin, Koren, and Somekh (2013) to the joint-report setting. Full details in Appendix E of the PDF.

Operational implication. The probe portfolio must be designed for joint discrimination. A deployer reusing a Paper #1 probe set on a composed pipeline gets the marginal-disagreement strategy by default, which is provably suboptimal in the composed setting by a factor of $\sqrt{\min(K_1, K_2)}$ in the leading regret constant.

8. Simulation

We test the joint mechanism, the non-implementation result of Theorem 2, and the regret bounds on three public eval datasets with known ground-truth labels.

Datasets. MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021).

Verifier population synthesis. $K_1 \in \{4, 8, 16\}$ process-style verifiers as logistic-regression heads over step-level trajectory features and $K_2 \in \{4, 8, 16\}$ outcome-style verifiers as logistic-regression heads over final-answer features. Process features are step count, intermediate self-consistency (Wang et al., 2023), and step-level log-probability. Pairs are synthesized to span a controlled conditional disagreement-covariance grid $\bar C \in \{-0.2, -0.1, 0, +0.1, +0.2\}$ via a shared-latent coupling construction. Empirical covariance matches the construction target to within $\pm 0.02$ on all three datasets.

Sweep. $K_1, K_2 \in \{4, 8, 16\}$ . $N \in \{16, 64, 256, 1024, 4096\}$ . Two scoring mechanisms (Paper #1 per-verifier baseline; joint Brier of Section 5). Three probe-construction strategies. Two correlation regimes ( $\bar C$ known or unknown). 200 seeds per cell.

Headline finding 1 (composition identity verification). Empirical pipeline accept rates fall on the $y = x$ line predicted by Theorem 1 across all three datasets and all $\bar C$ values, with $R^2 = 0.997$ on MATH, $R^2 = 0.994$ on GSM8K, $R^2 = 0.981$ on HumanEval. The composition identity is empirically tight.

Headline finding 2 (per-verifier baseline failure). At $\bar C = 0.2$ , the per-verifier baseline does not close the $5\%$ -of-first-best Cost-correct gap at any $N \in \{16, \ldots, 4096\}$ , on any dataset. At $\bar C = 0$ the per-verifier baseline does reach the target at $N = 256$ , matching Paper #1’s single-verifier budget. The failure regime is precisely the non-zero-covariance regime in which Theorem 2 applies.

Headline finding 3 (joint mechanism, unknown $\bar C$ ). Under the joint mechanism with conditional-rare-event probes, the $5\%$ -of-first-best target is reached at $N = 512$ on MATH and GSM8K, and at $N = 1024$ on HumanEval. The doubled-budget regime relative to Paper #1’s $N = 256$ is consistent with Theorem 4’s upper bound at $K_1 = K_2 = 16$ .

Headline finding 4 (joint mechanism, known $\bar C$ ). When $\bar C$ is supplied as a side channel, the joint mechanism recovers Paper #1’s probe budget of $N = 256$ on MATH and GSM8K. HumanEval budget is $N = 384$ . The known-vs-unknown correlation gap collapses to roughly a factor of two in probe budget.

Negative finding (identifiability failure on HumanEval). One synthesized verifier-pair population exhibits a rank-deficient joint correlation matrix on the default probe distribution. The pre-flight identifiability check of Proposition 1 catches this case; switching to conditional-rare-event probes restores full rank in $87\%$ of seeds. The remaining $13\%$ require manual probe-distribution intervention. Production deployers should run the identifiability check before relying on the mechanism.

Cross-paper comparison. At matched $K_1 = K_2 = 16$ and $\bar C = 0.1$ , the per-verifier curve flattens at a $12\%$ Cost-correct gap independent of $N$ , while the joint mechanism curve decays as $1/\sqrt{N}$ and crosses the $5\%$ -of-first-best target at $N = 512$ .

Simulation harness. Python, NumPy, scikit-learn. Approximately 240 CPU-hours on a single 16-core machine; no GPU required. Released alongside the paper under MIT license; full pseudocode in Appendix G of the PDF.

9. The August 2026 EU AI Act forcing function

The Paper #1 mapping to the August 2, 2026 EU AI Act high-risk obligations establishes that the scoring-rule mechanism’s probe set, verifier reports, and payment ledger together constitute auditable accept-rate evidence at the contractual threshold for one verifier. We revisit the mapping for composed pipelines.

Per-component evidence is insufficient for composed pipelines. Theorem 1 implies that the pipeline accept rate can drift from the per-component product by an amount up to $|C(x)|$ . Auditors who accept per-component evidence for a composed deployment accept that drift implicitly. The drift is operationally large in the realistic regime: Section 8 measures $\bar C \in [0.05, 0.15]$ on synthesized process-plus-outcome verifier pairs, which shifts pipeline accept rate by 5 to 15 percentage points on positively-correlated pairs in the rank-1-aligned regime (Ye et al., 2026), with the corresponding Cost-correct gap scaling as $|\bar C| / (\alpha^{\mathrm{pipe}}_{\min})^2$ . Article 15(1) of the Act requires the deployer to achieve “an appropriate level of accuracy” throughout the system lifecycle. Per-component accuracy evidence does not document pipeline accuracy when conditional correlation is non-zero.

The joint-mechanism audit trail is the correct compliance artifact for composed deployments. The joint-report ledger of Section 5 documents the empirical joint distribution over $(V_{k_1}, V_{k_2})$ on the probe set. The identifiability check of Proposition 1 documents that the joint distribution is identified from the probe distribution. Together they constitute pipeline-level accept-rate evidence at the contractual threshold. The audit trail is the forward extension of Burnat and Davidson (2026)‘s continuous-compliance auditee-gaming framework to the multi-component-verifier setting. A deployer who runs per-verifier audits on a composed pipeline can game the audit by selecting pairs with favorable marginals and unfavorable joint behavior; the joint-report audit prevents this attack.

Article 13 transparency. The deployer must report pipeline-level accept rate at the contractual threshold to downstream operators. The joint mechanism produces $\hat \alpha^{\mathrm{pipe}}$ as a primitive on the probe set; the reporting interface follows directly.

We do not claim the joint mechanism is sufficient for Act compliance overall, since the Act covers risk management and human oversight beyond accept-rate measurement. We claim only that, where the Act requires accept-rate evidence on a composed pipeline, the joint mechanism produces it as a side effect and at low marginal cost relative to per-component audits.

10. Limitations and future work

Two-verifier scope. The composition identity extends to monotone Boolean rules of arity three and above by inclusion-exclusion (Appendix A of the PDF), but the joint scoring rule on $\{0, 1\}^J$ for $J$ slots faces a combinatorial blowup in the joint report space, from four cells at $J = 2$ to $2^J$ cells at $J = 3$ and beyond. Three-slot composition (PRM plus outcome verifier plus LLM judge) is the natural near-term target.

Static verifier population. Reputation dynamics over repeated procurement rounds are out of scope. The natural extension connects to Xu and Park (2026) on online Bayesian calibration under gradual and abrupt system changes, and to the moral-hazard structure of Holmström (1979) applied to the joint-report setting.

Programmatic-verifier scope. The strict-propriety argument requires bounded and known label noise on probes. Math, formal logic, and code with strict tests satisfy this. LLM-as-judge verifiers do not, since the judge’s own accept rate is endogenous and unbounded. The rubric-grounded RL framework of Bhattarai et al. (2026) decomposes the judge’s reward into weighted verifiable criteria; the joint-mechanism extension to rubric-judges with bounded per-criterion label noise is a natural next step.

Single deployer. Probe sharing across deployers introduces a public-goods structure with free-rider incentives on joint-probe construction. The natural extension is paper #3 in the wedge plan, with the bilateral-trade impossibility of Myerson and Satterthwaite (1983) applied to the joint-probe-as-public-good setting.

Calibration-monotone-pair assumption. The lower bound of Theorem 5 requires calibration-monotone-pair $\mathcal{F}_1 \times \mathcal{F}_2$ . The upper bound of Theorem 4 does not. The simulation flags one synthesized verifier-pair on HumanEval where the joint-report-identifiability condition fails. The worst-case regret on non-identifiable families is an open problem.

Time-varying joint correlation. $\bar C$ is treated as a static unknown in this paper. Drift in $\bar C$ over the deployer’s task distribution introduces an online-procurement structure that builds on Xu and Park (2026).

11. Conclusion

Paper #1 procures one verifier. This paper procures the composed pipeline. The composition identity gives a clean picture of why per-verifier elicitation does not transfer. Pipeline miscalibration under per-verifier elicitation is exactly the within-instance verifier-disagreement covariance. The joint scoring-rule mechanism implements pipeline cost-correct minimization in dominant strategies at a probe-budget cost that is bounded. Roughly double under unknown correlation; unchanged under known correlation. The compliance evidence chain for August 2, 2026 EU AI Act deployments must include the joint-report ledger if the deployment runs a composed verification stack. Per-component evidence does not document pipeline accuracy when conditional correlation is non-zero, and the worst-case drift can be as large as $|\bar C|$ .

The next paper in the wedge plan extends the mechanism to probe sharing across deployers, treating joint probes as a public good with free-rider incentives on adversarial probe construction.

Appendices (in the PDF)

The PDF includes eight appendices with full proofs and additional material.

Appendix A. Composition identity for general monotone Boolean rules. Inclusion-exclusion expansions for two- and three-verifier cases, including OR, AND, majority, and the third joint cumulant.
Appendix B. Full proof of Theorem 2 (non-implementability under per-verifier elicitation) including the payoff-equivalence argument.
Appendix C. Full proof of Theorem 4 (upper bound) with the split-at- $u_0$ tail integration and the absolute constant.
Appendix D. Le Cam packing for Theorem 5 (lower bound).
Appendix E. Proof of Proposition 3 (conditional-rare-event probes).
Appendix F. Verifier-pair synthesis with prescribed conditional correlation. Calibration of the shared-latent coupling parameter.
Appendix G. Simulation pseudocode.
Appendix H. Notation summary.

References

Cite this article

@misc{bhardwaj2026verifiercomposition,
  author       = {Bhardwaj, Manu},
  title        = {Calibration Drift Under Verifier Composition: A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization},
  year         = {2026},
  month        = {May},
  url          = {https://ifitsmanu.com/papers/verifier-composition},
  howpublished = {\url{https://ifitsmanu.com/papers/verifier-composition/paper.pdf}},
  note         = {Working paper. Version 1.0. Research Paper #2 in the verification-economics wedge.}
}

Bhardwaj, M. (2026, May). Calibration drift under verifier composition: A joint scoring-rule mechanism for pipeline-level cost-correct minimization. ifitsmanu.com. https://ifitsmanu.com/papers/verifier-composition

Bhardwaj, Manu. "Calibration Drift Under Verifier Composition: A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization." ifitsmanu.com, May 2026. https://ifitsmanu.com/papers/verifier-composition.

M. Bhardwaj, "Calibration Drift Under Verifier Composition: A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization," ifitsmanu.com, May 2026. [Online]. Available: https://ifitsmanu.com/papers/verifier-composition

Companion. Verifier Procurement. Companion. The Cost of Being Right. Companion. The α Asymmetry. Papers index. Home.

Calibration Drift Under Verifier Composition. #

A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization. #

Abstract

1. Introduction #

2. Model #

3. The composition identity #

4. Per-verifier elicitation does not implement pipeline cost-correct #

5. The joint scoring-rule mechanism #

6. Regret bounds for the joint mechanism #

7. Probe-correlated label noise as the new binding cost #

8. Simulation #

9. The August 2026 EU AI Act forcing function #

10. Limitations and future work #

11. Conclusion #

Appendices (in the PDF) #

References #