Calibration Drift Under Verifier Composition.
A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization.
Manu Bhardwaj. ifitsmanu.com. May 2026. Version 1.0. Research Paper #2 in the verification-economics wedge.
Full proofs, figures, simulation pseudocode, and Appendices A through H are available in the PDF version.
Companion to Verifier Procurement. Verifier Procurement Under Unobservable Quality (Research Paper #1 in the verification-economics wedge) procures one verifier under unobservable quality. This paper procures the composed pipeline. The companion field notes develop the Cost-correct decomposition (The Cost of Being Right, Field Notes #2) and the verifier-dominance result (The α Asymmetry, Field Notes #3) that make verifier accept rate the binding lever.
Abstract
Production large language model verification is composed. A process reward model gates trajectories, an outcome verifier accepts the final answer, and an LLM judge gates the reject-or-revise loop. The deployer pays Cost-correct on the composed pipeline, not on any single verifier. The procurement mechanism of Verifier Procurement Under Unobservable Quality elicits one verifier at a time. We show that per-verifier strictly proper elicitation does not compose. Pipeline-level miscalibration under any monotone Boolean composition rule equals the within-instance verifier-disagreement covariance exactly. Per-verifier strictly proper elicitation is dominant-strategy IC for the marginal reports it asks for, but the resulting selection rule does not implement pipeline cost-correct minimization. Candidate pairs with matched marginals and mismatched joint distributions are paid identically and selected at chance, while their pipeline accept rates differ by the disagreement covariance. A joint scoring-rule mechanism over the cross-product report space restores dominant-strategy incentive compatibility, ex post individual rationality, and budget feasibility on the joint elicitation. The deployer’s expected gap to first-best Cost-correct on the composed pipeline is at most over candidate pairs, by Hoeffding plus a union bound. A matching lower bound holds on a calibration-monotone-pair family by Le Cam’s two-point method. The mechanism is therefore minimax optimal up to log factors. Simulation on MATH, GSM8K, and HumanEval with and probe budget shows the joint mechanism reaching Paper #1’s -of-first-best operational target at under unknown joint correlation, roughly double Paper #1’s , and at when correlation is supplied as a side channel. The per-verifier baseline does not reach the target at any tested when conditional disagreement covariance exceeds . The compliance corollary is sharp. 
Per-component procurement records are not sufficient evidence under the European Union AI Act high-risk obligations entering force on August 2, 2026. The audit trail must include the joint-report ledger.
1. Introduction
The verification-economics framing of The Cost of Being Right treats the verifier accept rate as the binding lever in cost-per-correct-answer for large language model deployments. The companion analysis on the α-asymmetry shows that the partial of Cost-correct with respect to dominates the partials with respect to per-token price, the reasoning multiplier , and the rollout ratio in the rStar-Math regime (Guan et al., 2025). The procurement mechanism of Verifier Procurement Under Unobservable Quality gives a dominant-strategy incentive-compatible scoring-rule mechanism that selects a single verifier with provable regret versus the oracle-best in a candidate population of size on adversarially constructed probes.
A typical production verification stack is not a single verifier. The deployer runs a process reward model that scores intermediate trajectories (Lightman et al., 2023; Uesato et al., 2022), an outcome verifier that accepts the final answer (Cobbe et al., 2021), and one or more LLM judges that gate a reject-or-revise loop (Zheng et al., 2023). Each component can be procured under the one-verifier mechanism. The composed pipeline is what the deployer pays Cost-correct on. The economic question this paper answers is whether per-verifier procurement composes. The answer is no, in a precise sense, and the fix is a joint scoring-rule mechanism on the cross-product report space.
Four contributions.
Theorem 1 (composition identity). For any two binary verifiers $V_1, V_2$ with conditional accept rates $\alpha_1(x)$ and $\alpha_2(x)$ and within-instance disagreement covariance $C(x)$, the AND-rule pipeline accept rate satisfies $\alpha_{\mathrm{AND}}(x) = \alpha_1(x)\,\alpha_2(x) + C(x)$ identically. The same identity, with sign flips and additive constants, holds for OR and for arbitrary monotone Boolean composition by inclusion-exclusion.
Theorem 2 (non-implementation of pipeline cost-correct). Per-verifier strictly proper elicitation is dominant-strategy IC at each slot in isolation but does not implement pipeline cost-correct minimization. Under any non-degenerate joint distribution over verifier reports, applying the one-verifier scoring-rule mechanism of Paper #1 independently to each slot and composing the selected verifiers under a monotone Boolean rule yields a selection rule that, under truthful marginal reporting, does not separate candidate pairs with matched marginal accept rates and mismatched joint distributions. The pairs are paid identically and selected at chance, while their pipeline accept rates differ by exactly the within-instance disagreement covariance. The non-implementation is ex ante undetectable from marginal reports.
Theorems 3 to 5 (joint mechanism with matching regret bounds). A joint scoring-rule mechanism that pays each candidate verifier-pair the value of a strictly proper scoring rule (Gneiting and Raftery, 2007; Frongillo and Kash, 2021) applied to the joint report distribution on the cross-product space is dominant-strategy IC, ex post IR, and budget feasible under a per-probe payment cap (Theorem 3). The deployer who selects the verifier-pair with highest empirical joint score incurs expected regret of at most $O\big(\sqrt{(\log N_1 + \log N_2)/m}\big)$ versus the oracle-best pair, where $N_1$ and $N_2$ are the candidate counts in the two slots and $m$ is the probe budget, by Hoeffding’s inequality plus a union bound (Theorem 4). A matching lower bound holds on a calibration-monotone-pair family by Le Cam’s two-point method (Le Cam, 1973; Tsybakov, 2009) (Theorem 5). The mechanism is minimax optimal up to log factors.
Simulation result. Synthesized verifier pairs on MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), and HumanEval (Chen et al., 2021), with controlled disagreement covariance and . The joint mechanism reaches a -of-first-best regret target at under unknown and at under known supplied as a side channel. The per-verifier baseline does not reach the target at any tested when .
The contribution that goes beyond Paper #1 is the move from single-verifier procurement to pipeline procurement. The companion paper characterizes the verifier the deployer ends up with under unobservable quality. This paper characterizes the pipeline the deployer ends up with under unobservable joint quality. The shift requires the disagreement-covariance correction, the joint scoring rule, and a strengthened calibration-monotone-pair assumption.
The contribution beyond classical peer prediction (Miller, Resnick, and Zeckhauser, 2005; Witkowski and Parkes, 2012; Kong and Schoenebeck, 2019; Frongillo and Kash, 2021) is the procurement framing. Peer prediction elicits truthful reports from agents whose joint distribution generates the signal. This paper elicits truthful reports from two procured verifiers whose joint distribution is the operational artifact the deployer pays Cost-correct on, in a setting with adversarial probes and known ground truth. The grounded-probe assumption inherited from Paper #1 rules in strict propriety in dominant strategies, not Nash, and rules out the common-prior assumptions that the peer-prediction tradition spent fifteen years removing.
The contribution beyond the recent process-reward-modeling literature (Lightman et al., 2023; Uesato et al., 2022; Cobbe et al., 2021) is the composition analysis. That literature establishes that production stacks do compose process and outcome verifiers, but treats verifiers as in-house artifacts. This paper analyzes the composed pipeline under a procurement mechanism and shows that the procurement game is structurally different from the in-house composition game.
The result has an external forcing function. The European Union AI Act high-risk obligations apply from August 2, 2026 (Regulation (EU) 2024/1689). High-risk deployers must produce accept-rate evidence at a documented threshold under Article 15. The companion paper’s per-component mechanism produces this evidence for a single procured verifier. The composition identity of Theorem 1 implies that per-component evidence drifts from the pipeline-level accept rate by exactly . An auditor who accepts per-component records accepts an accept-rate misstatement of up to . The joint-mechanism audit trail closes that gap.
The rest of the paper is organized as follows. Section 2 sets up the model. Section 3 proves the composition identity. Section 4 proves the non-implementation result for per-verifier elicitation. Section 5 constructs the joint scoring-rule mechanism. Section 6 proves matching regret bounds. Section 7 develops probe-correlated label noise as the new binding cost. Section 8 reports the simulation. Section 9 returns to the EU AI Act forcing function. Section 10 records limitations and future work.
2. Model
We extend the single-verifier setup of Paper #1 to a two-slot setting. Three-and-up composition follows by induction for AND and OR; the general monotone case is handled in Appendix E of the PDF.
Players. A single deployer faces $N_1$ candidate verifier providers for slot 1, indexed $i = 1, \dots, N_1$, and $N_2$ candidate verifier providers for slot 2, indexed $j = 1, \dots, N_2$. The deployer commits to a procurement mechanism before observing any private information. Each verifier provider knows its own type and observes the mechanism.
Task distribution. The deployer faces a known task distribution over prompts and a known target quality threshold . A response is correct at threshold if a fixed programmatic check returns 1.
Verifier type. Each verifier in slot has a private decision function , drawn from a known family . The function specifies whether verifier accepts a candidate response as correct at threshold . Verifier types are private. The families and the per-prompt cost-of-quality functions are common knowledge.
Joint distribution. Verifier reports from the two slots are not assumed independent. We write $\alpha_i(x) = \Pr[V_i = 1 \mid x]$ for the marginal accept rate of verifier $V_i$ on prompt $x$ and $C(x) = \mathrm{Cov}(V_1, V_2 \mid x)$ for the within-instance disagreement covariance.
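Composition rule. A fixed monotone Boolean function $g : \{0,1\}^2 \to \{0,1\}$ aggregates the per-slot reports. The default rule is AND, $g(v_1, v_2) = v_1 \wedge v_2$. The OR rule and the generic monotone case are treated in appendices.
Pipeline accept rate. Under composition rule $g$ and verifier pair $(V_1, V_2)$, $\alpha_g(x) = \Pr[g(V_1, V_2) = 1 \mid x]$. For the AND rule, $\alpha_{\mathrm{AND}}(x) = \alpha_1(x)\,\alpha_2(x) + C(x)$ by Theorem 1 below.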
Cost-correct on the pipeline. Per-task cost under pair is, extending The Cost of Being Right, with , , and held fixed across pair choice. The deployer minimizes , which is equivalent to maximizing the expected pipeline accept rate.
Probe set. The deployer has a budget of probes drawn from a probe distribution over with known ground-truth labels . Probes may be adversarial with respect to . We treat the probe-construction cost as exogenous in Sections 4 to 6 and endogenize it in Section 7.
Mechanism. A direct mechanism is a pair where is a selection rule mapping joint reports to a chosen verifier-pair and is a payment rule. We restrict to mechanisms that depend only on reported joint decisions on probes.
Solution concept. We seek mechanisms that satisfy dominant-strategy incentive compatibility (DSIC), ex post individual rationality (IR), and budget feasibility under a per-probe payment cap . We measure performance by expected regret to first-best on the composed pipeline.
Calibration-monotone-pair family. A family is calibration-monotone-pair if there exists a partial order on pairs such that implies for all in the support of . The condition is a strict strengthening of the calibration-monotone assumption of Paper #1. It is more restrictive than per-slot calibration monotonicity because it constrains the joint ordering, not just the marginal orderings.
3. The composition identity
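Theorem 1 (composition identity for AND). Let $V_1, V_2$ be binary verifiers with marginal accept rates $\alpha_1(x), \alpha_2(x)$ and within-instance disagreement covariance $C(x)$. Then $\alpha_{\mathrm{AND}}(x) = \alpha_1(x)\,\alpha_2(x) + C(x)$.
Proof. For binary random variables, $\mathbb{1}[V_1 = 1 \wedge V_2 = 1] = V_1 V_2$ pointwise. Take conditional expectation given $x$: $\mathbb{E}[V_1 V_2 \mid x] = \mathrm{Cov}(V_1, V_2 \mid x) + \mathbb{E}[V_1 \mid x]\,\mathbb{E}[V_2 \mid x] = C(x) + \alpha_1(x)\,\alpha_2(x)$. The first equality is the definition of covariance for binary random variables. The second substitutes the definitions of $C(x)$ and $\alpha_i(x)$.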
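Corollary 1 (composition identity for OR). Under the same hypotheses, $\alpha_{\mathrm{OR}}(x) = \alpha_1(x) + \alpha_2(x) - \alpha_1(x)\,\alpha_2(x) - C(x)$.
Proof. $V_1 \vee V_2 = V_1 + V_2 - V_1 V_2$ pointwise for binary $V_1, V_2$. Apply linearity and Theorem 1.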
Corollary 2 (general monotone Boolean rules). For monotone Boolean on binary verifiers, is a polynomial in the marginal accept rates and the higher-order joint moments, with coefficients given by Möbius inversion over the monotone-Boolean lattice. Two- and three-verifier expansions are in Appendix A of the PDF.
Discussion. Theorem 1 is elementary. Its content is not the algebra, which is just the bilinear identity for binary random variables. The content is that the additive correction term is exactly the within-instance covariance, not a bounded error term or a worst-case slack. The pipeline accept rate is determined by the per-verifier accept rates only when the per-verifier reports are conditionally independent on each prompt. Production verifier stacks are not conditionally independent. A process reward model and an outcome verifier may share trajectory features and have positive disagreement covariance in the rank-1-aligned regime documented by Ye et al. (2026); the construction protocols of Lightman et al. (2023) and Cobbe et al. (2021) do not separate the two verifiers’ training-trajectory distributions.
The implication for procurement is that any calibration argument applied to $V_1$ and $V_2$ in isolation is silent on the pipeline. The reverse is also true. Per-component reports can be miscalibrated in the marginal Brier sense while the pipeline is well-calibrated, if the marginal miscalibrations cancel through $C(x)$. Neither direction is the safe one to assume in production.
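The identities of Theorem 1 and Corollary 1 can be checked numerically on any 2×2 joint table. The following is a minimal sketch, not code from the paper's harness. The function name `composition_rates` and the layout convention `joint[a, b] = Pr[V1 = a, V2 = b | x]` are assumptions made for illustration.

```python
import numpy as np

def composition_rates(joint):
    """Return (alpha1, alpha2, cov, alpha_and, alpha_or) for one prompt.

    Assumed layout: joint[a, b] = Pr[V1 = a, V2 = b | x] for a, b in {0, 1}.
    """
    joint = np.asarray(joint, dtype=float)
    if joint.shape != (2, 2) or not np.isclose(joint.sum(), 1.0):
        raise ValueError("joint must be a 2x2 probability table")
    alpha1 = joint[1, :].sum()              # Pr[V1 = 1 | x]
    alpha2 = joint[:, 1].sum()              # Pr[V2 = 1 | x]
    cov = joint[1, 1] - alpha1 * alpha2     # E[V1 V2] - E[V1] E[V2]
    alpha_and = alpha1 * alpha2 + cov       # Theorem 1
    alpha_or = alpha1 + alpha2 - alpha1 * alpha2 - cov  # Corollary 1
    return alpha1, alpha2, cov, alpha_and, alpha_or

# Compare against direct evaluation on a randomly drawn joint table.
rng = np.random.default_rng(0)
table = rng.dirichlet(np.ones(4)).reshape(2, 2)
a1, a2, c, a_and, a_or = composition_rates(table)
assert np.isclose(a_and, table[1, 1])        # Pr[V1 = 1 and V2 = 1]
assert np.isclose(a_or, 1.0 - table[0, 0])   # Pr[V1 = 1 or V2 = 1]
```

The check passes for any valid table because both identities are algebraic. The sketch is mainly useful as a unit test when wiring accept-rate accounting into a pipeline.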
4. Per-verifier elicitation does not implement pipeline cost-correct
Setup. The deployer runs the one-verifier mechanism of Paper #1 independently for slot 1 and slot 2. Each candidate verifier in each slot reports a probability of acceptance on each of the probes. Per-slot payment is a strictly proper scoring rule applied to the reports against ground-truth labels. The deployer selects the verifier in each slot with highest empirical per-slot score and composes the selected pair under the AND rule. We call this the per-verifier mechanism.
The per-verifier mechanism is DSIC at each slot in isolation, because strict propriety makes truthful marginal reporting dominant on each slot’s payment rule. We show that DSIC at the per-slot level is not sufficient for implementation of pipeline cost-correct minimization.
Theorem 2 (non-implementability of pipeline cost-correct under per-verifier elicitation). There exists a two-verifier instance with non-degenerate joint distribution over verifier reports in which the per-verifier mechanism, under its unique truthful equilibrium, selects a verifier pair that is strictly suboptimal under pipeline Cost-correct. The per-verifier selection rule on truthful marginal reports does not identify the pipeline cost-correct-optimal pair.
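Construction. Take a uniform task distribution over two prompts $x_1, x_2$, each with a known ground-truth label. Fix one slot-2 verifier $V_2$ with marginal accept rate $0.6$ on every prompt. Consider two slot-1 candidates $V_1$ and $V_1'$, both with marginal accept rate $0.6$ on every prompt, distinguished only by their joint distribution with $V_2$. The table gives the joint probabilities of each slot-1 candidate $V \in \{V_1, V_1'\}$ with $V_2$.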
| Slot-1 candidate | $(V = 1, V_2 = 1)$ | $(V = 1, V_2 = 0)$ | $(V = 0, V_2 = 1)$ | $(V = 0, V_2 = 0)$ | $C(x)$ |
|---|---|---|---|---|---|
| $V_1$ | 0.40 | 0.20 | 0.20 | 0.20 | $+0.04$ |
| $V_1'$ | 0.36 | 0.24 | 0.24 | 0.16 | $0.00$ |
Both candidates have marginal $\alpha_1 = 0.6$. Under truthful reporting, both achieve identical expected Brier score on the marginal labels, since the score depends only on the marginal and the label distribution. The per-verifier mechanism selects between $V_1$ and $V_1'$ uniformly at random.
By Theorem 1, the AND-pipeline accept rate is $\alpha_1 \alpha_2 + C$. The pair $(V_1, V_2)$ achieves $0.36 + 0.04 = 0.40$. The pair $(V_1', V_2)$ achieves $0.36 + 0.00 = 0.36$. The cost-correct-optimal pair is strictly $(V_1, V_2)$ by an $\alpha$-gap of $0.04$, which translates directly into a Cost-correct gap under the cost model of Section 2. The per-verifier mechanism selects this pair with probability $1/2$, leaving an expected $\alpha$-gap of $\tfrac{1}{2} \times 0.04 = 0.02$ on the table.
The gap is not closed by collecting more probes. The marginal indistinguishability is exact at the population level, not a finite-sample artifact. A larger probe budget $m$ tightens the empirical Brier concentration but does not separate $V_1$ from $V_1'$ on the marginal score.
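The arithmetic of the construction can be reproduced directly from the table. The sketch below is illustrative only. The dictionary keys and helper names are hypothetical, and the uniform tie-break models the "selected at chance" behaviour described above.

```python
import numpy as np

# Joint tables from the construction.
# Rows index the slot-1 decision V in {0, 1}; columns index V2 in {0, 1}.
JOINT = {
    "V1":  np.array([[0.20, 0.20], [0.20, 0.40]]),
    "V1'": np.array([[0.16, 0.24], [0.24, 0.36]]),
}

def marginal_accept(joint):
    """Slot-1 marginal accept rate Pr[V = 1]."""
    return joint[1, :].sum()

def and_accept(joint):
    """AND-pipeline accept rate Pr[V = 1, V2 = 1]."""
    return joint[1, 1]

marginals = {name: marginal_accept(j) for name, j in JOINT.items()}
pipeline = {name: and_accept(j) for name, j in JOINT.items()}

# Any payment that is a function of the marginal report alone cannot
# distinguish the two candidates, because the marginals coincide.
assert np.isclose(marginals["V1"], marginals["V1'"])

alpha_gap = pipeline["V1"] - pipeline["V1'"]
expected_gap_random_pick = 0.5 * alpha_gap   # uniform tie-break between candidates
print(marginals, pipeline, alpha_gap, expected_gap_random_pick)
```

The marginals agree while the pipeline accept rates differ by the covariance, which is the entire content of the counterexample.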
Why this is the right negative result. The non-implementation requires conditional correlation. When $C(x) = 0$ for all $x$, the joint accept rate is determined by the marginal accept rates, so marginal selection implements pipeline selection. The construction is non-trivial only when $C(x) \neq 0$, which is the realistic regime where PRMs and outcome verifiers project onto correlated trajectory features (Ye et al., 2026). The negative result bites in production.
Corollary 3 (no per-verifier rescue). No per-verifier scoring rule, including any strictly proper rule in the class of Gneiting and Raftery (2007), implements pipeline Cost-correct minimization on a non-degenerate joint distribution.
The proof uses payoff equivalence (Myerson, 1981): per-slot payment under any per-verifier rule depends only on marginal reports, which identify only the marginal accept rate; the pipeline accept rate is the marginal accept rate plus the disagreement covariance by Theorem 1; the covariance is not identified by any per-slot rule. Full argument in Appendix B of the PDF.
Strategic refinement. A stronger negative result holds when the verifier is permitted to commit to a joint distribution before the mechanism runs. A strategic verifier with private knowledge of the deployer’s slot-2 verifier can choose the joint distribution within its calibration-monotone class. Under per-verifier elicitation, the verifier is paid only on marginals, so it is indifferent across joint distributions consistent with its marginal. A verifier that commits to the cost-correct-optimal joint distribution receives no reward over one that commits to a worse joint distribution. The deployer’s selection is then dominated by exogenous noise. Under the joint mechanism of Section 5, the verifier is paid on joint reports and strictly prefers the cost-correct-optimal joint distribution.
5. The joint scoring-rule mechanism
Construction. Fix a strictly proper scoring rule $S$ on the joint distribution over the cross-product report space $\{0,1\}^2$, for instance the joint Brier score $S(Q, (v_1, v_2)) = -\sum_{(a,b) \in \{0,1\}^2} \big(Q(a,b) - \mathbb{1}[(a,b) = (v_1, v_2)]\big)^2$, which is strictly proper by the multidimensional extension of Gneiting and Raftery (2007). Each candidate pair reports a joint distribution $Q(\cdot \mid x_k)$ on each probe $x_k$. The mechanism pays the pair $a + b\,S$ per probe, for constants $a$ and $b > 0$ chosen to enforce ex post IR and the per-probe payment cap. The selection rule is the empirical argmax of the joint score over pairs.
Atomic commitment of the joint report (both components submitted simultaneously, with no observability between components at report time) is part of the mechanism. Sealed-bid joint submission with a commit-reveal hash makes atomic commitment enforceable in deployment.
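A commit-reveal step of this kind can be sketched with a salted hash. This is a minimal illustration under stated assumptions, not the paper's protocol. The report layout (probe id mapped to four cell probabilities), the JSON canonicalization, and the function names `commit` and `verify` are all hypothetical.

```python
import hashlib
import json
import secrets

def _canonical(joint_report):
    """Deterministic serialization so equal reports hash identically."""
    return json.dumps(joint_report, sort_keys=True, separators=(",", ":"))

def commit(joint_report):
    """Return (commitment_hex, nonce_hex) for a joint report.

    joint_report: dict mapping probe id -> list of four cell probabilities
    (an assumed layout for illustration).
    """
    nonce = secrets.token_hex(32)  # fixed-length random salt hides the report
    digest = hashlib.sha256((nonce + _canonical(joint_report)).encode("utf-8"))
    return digest.hexdigest(), nonce

def verify(commitment_hex, nonce_hex, joint_report):
    """Check a revealed (nonce, report) against the earlier commitment."""
    digest = hashlib.sha256((nonce_hex + _canonical(joint_report)).encode("utf-8"))
    return secrets.compare_digest(digest.hexdigest(), commitment_hex)
```

Both components submit commitments first and reveal only after all commitments are received, so neither report can be conditioned on the other.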
Theorem 3 (joint mechanism). Under the joint scoring-rule mechanism with payment constants $a$ and $b > 0$ chosen so that $a + b\,S_{\min} \ge 0$, where $S_{\min}$ is the infimum of $S$ on its domain, the mechanism is dominant-strategy incentive-compatible, ex post individually rational, and budget feasible under the per-probe payment cap.
Proof. Strict propriety of $S$ on the set of distributions over $\{0,1\}^2$ implies that for any belief $P$ a verifier pair holds about the joint distribution of $(V_1, V_2)$ given $x$, the unique maximizer of $\mathbb{E}_{(v_1, v_2) \sim P}[S(Q, (v_1, v_2))]$ over reports $Q$ is $Q = P$. The multidimensional version of strict propriety is established in Frongillo and Kash (2021) via convex analysis of the Bregman-divergence representation. Atomic commitment of the joint report rules out post-observation conditioning, so the dominant strategy is truthful joint reporting on the cross-product space, which is the report space that identifies the pipeline accept rate by Theorem 1. Individual rationality follows from the choice of the payment constants. Budget feasibility follows from the per-probe payment cap.
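Strict propriety of a joint Brier score can be illustrated numerically. The sketch assumes the quadratic score over the four joint cells and an affine payment with constants `a = cap` and `b = cap / 2`. These constants, the cell ordering, and the function names are assumptions for illustration, not values taken from the paper.

```python
import numpy as np

def joint_brier(report, outcome_cell):
    """Joint Brier score in reward orientation (higher is better).

    report: length-4 probability vector over cells (0,0), (0,1), (1,0), (1,1).
    outcome_cell: index in {0, 1, 2, 3} of the realized joint decision.
    """
    report = np.asarray(report, dtype=float)
    onehot = np.zeros(4)
    onehot[outcome_cell] = 1.0
    return -np.sum((report - onehot) ** 2)

def expected_score(report, belief):
    """Expected joint Brier score of `report` when outcomes follow `belief`."""
    return sum(belief[c] * joint_brier(report, c) for c in range(4))

def payment(score, cap):
    """Affine payment a + b * score with hypothetical a = cap, b = cap / 2.

    The score lies in [-2, 0], so the payment lies in [0, cap].
    """
    return cap + 0.5 * cap * score

belief = np.array([0.20, 0.20, 0.20, 0.40])
truthful = expected_score(belief, belief)
rng = np.random.default_rng(1)
for _ in range(1000):
    other = rng.dirichlet(np.ones(4))
    # Every sampled misreport scores strictly below the truthful report.
    assert expected_score(other, belief) < truthful
```

The expected-score gap between a report `q` and the truthful report `p` equals the squared Euclidean distance between them, which is why the inequality is strict for every `q` different from `p`.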
Identifiability condition. The joint scoring-rule mechanism requires the joint distribution over $(V_1, V_2)$ to be identifiable from probe reports.
Proposition 1 (identifiability sufficient condition). If the probe distribution contains at least two probe types whose conditional joint distributions over $(V_1, V_2)$ differ as distributions on $\{0,1\}^2$, equivalently if the empirical joint-report correlation matrix on the probe set has rank at least two, then the joint scoring rule is identifying in the sense that the unique strategy maximizing expected payment is truthful joint reporting.
The condition is straightforward to check at deployment time. Section 8 implements the check as a pre-flight gate and documents the failure mode when it does not hold.
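A pre-flight gate of this kind might look like the sketch below. It assumes the check is run on a matrix that stacks one empirical joint-report distribution per probe type. That matrix layout and the name `passes_identifiability_gate` are assumptions, since the harness itself is not reproduced here.

```python
import numpy as np

def passes_identifiability_gate(joint_by_probe_type, tol=1e-9):
    """Return True when at least two probe types induce different joint distributions.

    joint_by_probe_type: array of shape (k, 4), one empirical joint-report
    distribution per probe type over cells (0,0), (0,1), (1,0), (1,1).
    """
    m = np.asarray(joint_by_probe_type, dtype=float)
    if m.ndim != 2 or m.shape[1] != 4:
        raise ValueError("expected shape (k, 4)")
    # Rows are probability vectors, so two rows are linearly dependent only
    # if they are equal. Rank >= 2 therefore means two distinct distributions.
    return bool(np.linalg.matrix_rank(m, tol=tol) >= 2)
```

A failing gate signals that the probe set cannot separate candidate joint distributions, and the probe distribution should be changed before scores are trusted.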
Connections. The joint elicitation extends multi-task peer prediction (Dasgupta and Ghosh, 2013) to the grounded-probe setting. The grounded-probe assumption eliminates the common-prior dependence that peer prediction requires in the no-ground-truth setting and yields strict propriety in dominant strategies rather than only in Bayesian equilibrium. The mechanism is structurally close to Kong and Schoenebeck’s (2019) information-theoretic framework, with the joint-report space playing the role of the complementarity carrier. Lovén (2026) proves DSIC for a parametric pseudospherical scoring family in scored AI oversight via the Prekopa principle; the joint mechanism inherits the strict-propriety guarantee per slot and extends it to the cross-product report space.
6. Regret bounds for the joint mechanism
Theorem 4 (upper bound). Let denote the population pipeline accept rate. Let be the oracle-best pair. Suppose probes are drawn iid from a probe distribution with for all pairs. Then the expected gap of the empirical rule is for a universal Hoeffding constant (distinct from the disagreement covariance of Theorem 1).
Proof sketch. The empirical pipeline accept rate is a bounded iid average in for each pair. By Hoeffding’s inequality, . Union over pairs and apply the standard regret argument. The tail-integration step uses the split-at- trick with ; the union-bounded tail at equals , the Mills-ratio bound gives the upper-tail integral , so . Full computation in Appendix C of the PDF.
Theorem 5 (lower bound). Suppose is calibration-monotone-pair and contains at least two distinct pairs with positive pipeline-accept-rate gap. Then for any mechanism and any , there exists a profile of types such that for a constant .
Proof sketch. Le Cam two-point method (Le Cam, 1973; Tsybakov, 2009). Construct a packing of pair-type profiles pairwise indistinguishable at total variation . The reduction from selection regret to estimation error follows from the calibration-monotone-pair assumption. Full argument in Appendix D of the PDF.
Theorems 4 and 5 together imply the joint scoring-rule mechanism is minimax optimal up to log factors over calibration-monotone-pair families.
Comparison to Paper #1. The counts enter additively in the log, reflecting the union bound over the cross product . The dependence is unchanged at . At and , the joint mechanism budget is approximately , against Paper #1’s at . The factor-of-two probe budget relative to Paper #1 is the price of joint elicitation under unknown conditional correlation.
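The additive dependence on the two candidate counts can be seen in a standard Hoeffding-plus-union-bound sample-size calculation. The sketch below is the textbook high-probability form, not the expectation bound of Theorem 4. The parameters `eps` (target accept-rate gap) and `delta` (failure probability) and the function name `probe_budget` are hypothetical inputs introduced for illustration.

```python
import math

def probe_budget(n1, n2, eps, delta):
    """Probes sufficient for every one of the n1 * n2 empirical pipeline accept
    rates to lie within eps / 2 of its mean with probability >= 1 - delta,
    so the empirical best pair is within eps of the oracle-best pair.

    Derivation: Hoeffding gives Pr[|est - mean| >= t] <= 2 exp(-2 m t^2);
    a union bound over n1 * n2 pairs with t = eps / 2 gives the formula.
    """
    if not (0 < eps <= 1 and 0 < delta < 1):
        raise ValueError("need 0 < eps <= 1 and 0 < delta < 1")
    if n1 < 1 or n2 < 1:
        raise ValueError("need at least one candidate per slot")
    log_term = math.log(2.0 / delta) + math.log(n1) + math.log(n2)
    return math.ceil(2.0 * log_term / eps ** 2)

# Hypothetical inputs: 10 candidates per slot versus a single-slot population.
print(probe_budget(10, 10, 0.1, 0.05), probe_budget(10, 1, 0.1, 0.05))
```

Because the counts enter through `log(n1) + log(n2)`, moving from one slot to two adds a logarithmic term to the budget and leaves the dependence on the target gap unchanged.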
7. Probe-correlated label noise as the new binding cost
Paper #1 identified adversarial probe construction, not probe count, as the binding cost driver at realistic . We extend that analysis to the composed setting and identify the new content. Joint discriminability, not marginal discriminability, is the property that probes must have to identify the oracle-best pair.
Proposition 2 (marginal-vs-joint probe-budget gap). Under known conditional correlation, the joint-elicitation probe budget equals the per-slot Paper #1 budget. Under unknown conditional correlation, the ratio inflates by the marginal-vs-joint discrimination ratio, bounded by in the worst case.
Three joint-probe construction strategies.
Marginal-disagreement probes. Maximize per-slot accept-or-reject entropy. Default when a deployer reuses a Paper #1 probe set. Discriminates marginally, not jointly.
Joint-disagreement probes. Maximize the entropy of the empirical joint-report distribution over candidate pairs. Construction cost scales as queries per candidate probe. Discriminates jointly.
Conditional-rare-event probes. Target probes where is small for some focal pair. Highly informative about .
Proposition 3 (conditional-rare-event probes). Under conditional-rare-event probe construction with a probe pool size , the leading constant in the Theorem 4 regret bound decreases by a factor of order relative to marginal-disagreement probes, at the cost of per-probe construction cost scaling as .
The proof adapts the sequential-elimination analysis of Karnin, Koren, and Somekh (2013) to the joint-report setting. Full details in Appendix E of the PDF.
Operational implication. The probe portfolio must be designed for joint discrimination. A deployer reusing a Paper #1 probe set on a composed pipeline gets the marginal-disagreement strategy by default, which is provably suboptimal in the composed setting by a factor of in the leading regret constant.
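One way to rank candidate probes for joint discrimination is by the entropy of the joint decisions they induce across candidate pairs. The sketch below is an assumption-laden illustration of the joint-disagreement strategy, not the paper's construction. The array layout and the names `joint_report_entropy` and `select_probes` are hypothetical.

```python
import numpy as np

def joint_report_entropy(decisions):
    """Entropy in bits, per probe, of the joint cell distribution across pairs.

    decisions: int array of shape (n_pairs, n_probes, 2) with entries in {0, 1};
    decisions[p, j] holds the (slot-1, slot-2) decisions of pair p on probe j.
    """
    d = np.asarray(decisions, dtype=int)
    cells = 2 * d[..., 0] + d[..., 1]            # joint cell index in {0, 1, 2, 3}
    n_pairs, n_probes = cells.shape
    entropy = np.zeros(n_probes)
    for j in range(n_probes):
        p = np.bincount(cells[:, j], minlength=4) / n_pairs
        nz = p[p > 0]                            # skip empty cells to avoid log(0)
        entropy[j] = -np.sum(nz * np.log2(nz))
    return entropy

def select_probes(decisions, k):
    """Indices of the k probes with the highest joint-report entropy."""
    return np.argsort(-joint_report_entropy(decisions), kind="stable")[:k]
```

A probe on which every candidate pair lands in the same joint cell scores zero and carries no joint information, even if it looked useful under a marginal-disagreement criterion.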
8. Simulation
We test the joint mechanism, the non-implementation result of Theorem 2, and the regret bounds on three public eval datasets with known ground-truth labels.
Datasets. MATH (Hendrycks et al., 2021), GSM8K (Cobbe et al., 2021), HumanEval (Chen et al., 2021).
Verifier population synthesis. process-style verifiers as logistic-regression heads over step-level trajectory features and outcome-style verifiers as logistic-regression heads over final-answer features. Process features are step count, intermediate self-consistency (Wang et al., 2023), and step-level log-probability. Pairs are synthesized to span a controlled conditional disagreement-covariance grid via a shared-latent coupling construction. Empirical covariance matches the construction target to within on all three datasets.
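A pair of binary verifiers with prescribed marginals and covariance can be synthesized by writing down the 2×2 joint table directly. This is a simpler stand-in for the shared-latent coupling construction described above, not a reproduction of it. The function names and the example values are assumptions for illustration.

```python
import numpy as np

def joint_table(alpha1, alpha2, cov):
    """2x2 table with joint[a, b] = Pr[V1 = a, V2 = b] for given marginals and covariance."""
    p11 = alpha1 * alpha2 + cov                  # Theorem 1 rearranged
    joint = np.array([[1.0 - alpha1 - alpha2 + p11, alpha2 - p11],
                      [alpha1 - p11, p11]])
    if np.any(joint < -1e-12):
        raise ValueError("covariance infeasible for these marginals")
    return np.clip(joint, 0.0, None)

def sample_pair(alpha1, alpha2, cov, n, rng):
    """Draw n joint decisions (v1, v2) from the table."""
    probs = joint_table(alpha1, alpha2, cov).ravel()
    cells = rng.choice(4, size=n, p=probs / probs.sum())
    return cells // 2, cells % 2                 # row-major: cell = 2 * v1 + v2

rng = np.random.default_rng(0)
v1, v2 = sample_pair(0.6, 0.6, 0.04, 200_000, rng)
empirical_cov = np.mean(v1 * v2) - v1.mean() * v2.mean()
```

Comparing `empirical_cov` with the target is the same sanity check the synthesis step reports, here on a single synthetic prompt type.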
Sweep. . . Two scoring mechanisms (Paper #1 per-verifier baseline; joint Brier of Section 5). Three probe-construction strategies. Two correlation regimes ( known or unknown). 200 seeds per cell.
Headline finding 1 (composition identity verification). Empirical pipeline accept rates fall on the line predicted by Theorem 1 across all three datasets and all values, with on MATH, on GSM8K, on HumanEval. The composition identity is empirically tight.
Headline finding 2 (per-verifier baseline failure). At , the per-verifier baseline does not close the -of-first-best Cost-correct gap at any , on any dataset. At the per-verifier baseline does reach the target at , matching Paper #1’s single-verifier budget. The failure regime is precisely the non-zero-covariance regime in which Theorem 2 applies.
Headline finding 3 (joint mechanism, unknown ). Under the joint mechanism with conditional-rare-event probes, the -of-first-best target is reached at on MATH and GSM8K, and at on HumanEval. The doubled-budget regime relative to Paper #1’s is consistent with Theorem 4’s upper bound at .
Headline finding 4 (joint mechanism, known ). When is supplied as a side channel, the joint mechanism recovers Paper #1’s probe budget of on MATH and GSM8K. HumanEval budget is . The known-vs-unknown correlation gap collapses to roughly a factor of two in probe budget.
Negative finding (identifiability failure on HumanEval). One synthesized verifier-pair population exhibits a rank-deficient joint correlation matrix on the default probe distribution. The pre-flight identifiability check of Proposition 1 catches this case; switching to conditional-rare-event probes restores full rank in of seeds. The remaining require manual probe-distribution intervention. Production deployers should run the identifiability check before relying on the mechanism.
Cross-paper comparison. At matched and , the per-verifier curve flattens at a Cost-correct gap independent of , while the joint mechanism curve decays as and crosses the -of-first-best target at .
Simulation harness. Python, NumPy, scikit-learn. Approximately 240 CPU-hours on a single 16-core machine; no GPU required. Released alongside the paper under MIT license; full pseudocode in Appendix G of the PDF.
9. The August 2026 EU AI Act forcing function
The Paper #1 mapping to the August 2, 2026 EU AI Act high-risk obligations establishes that the scoring-rule mechanism’s probe set, verifier reports, and payment ledger together constitute auditable accept-rate evidence at the contractual threshold for one verifier. We revisit the mapping for composed pipelines.
Per-component evidence is insufficient for composed pipelines. Theorem 1 implies that the pipeline accept rate can drift from the per-component product by an amount up to . Auditors who accept per-component evidence for a composed deployment accept that drift implicitly. The drift is operationally large in the realistic regime: Section 8 measures on synthesized process-plus-outcome verifier pairs, which shifts pipeline accept rate by 5 to 15 percentage points on positively-correlated pairs in the rank-1-aligned regime (Ye et al., 2026), with the corresponding Cost-correct gap scaling as . Article 15(1) of the Act requires the deployer to achieve “an appropriate level of accuracy” throughout the system lifecycle. Per-component accuracy evidence does not document pipeline accuracy when conditional correlation is non-zero.
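How far the pipeline accept rate can sit from the per-component product is limited by the marginals themselves. The sketch below computes the feasible range of the covariance from the Fréchet bounds on the joint accept probability. This is a standard fact about pairs of Bernoulli variables rather than a result stated in the paper, and the function name `covariance_bounds` is hypothetical.

```python
def covariance_bounds(alpha1, alpha2):
    """Feasible range (low, high) of Cov(V1, V2) for binary verifiers with
    marginal accept rates alpha1 and alpha2.

    Uses the Frechet bounds max(0, a1 + a2 - 1) <= Pr[V1 = 1, V2 = 1] <= min(a1, a2).
    """
    if not (0.0 <= alpha1 <= 1.0 and 0.0 <= alpha2 <= 1.0):
        raise ValueError("accept rates must lie in [0, 1]")
    p11_min = max(0.0, alpha1 + alpha2 - 1.0)
    p11_max = min(alpha1, alpha2)
    product = alpha1 * alpha2
    return p11_min - product, p11_max - product

# The range is widest at alpha1 = alpha2 = 0.5, where it is [-0.25, 0.25].
print(covariance_bounds(0.5, 0.5), covariance_bounds(0.6, 0.6))
```

An auditor holding only marginal accept rates can use this range as the envelope of pipeline accept rates consistent with the per-component records.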
The joint-mechanism audit trail is the correct compliance artifact for composed deployments. The joint-report ledger of Section 5 documents the empirical joint distribution over $(V_1, V_2)$ on the probe set. The identifiability check of Proposition 1 documents that the joint distribution is identified from the probe distribution. Together they constitute pipeline-level accept-rate evidence at the contractual threshold. The audit trail is the forward extension of Burnat and Davidson’s (2026) continuous-compliance auditee-gaming framework to the multi-component-verifier setting. A deployer who runs per-verifier audits on a composed pipeline can game the audit by selecting pairs with favorable marginals and unfavorable joint behavior; the joint-report audit prevents this attack.
Article 13 transparency. The deployer must report pipeline-level accept rate at the contractual threshold to downstream operators. The joint mechanism produces the empirical pipeline accept rate as a primitive on the probe set; the reporting interface follows directly.
We do not claim the joint mechanism is sufficient for Act compliance overall, since the Act covers risk management and human oversight beyond accept-rate measurement. We claim only that, where the Act requires accept-rate evidence on a composed pipeline, the joint mechanism produces it as a side effect and at low marginal cost relative to per-component audits.
10. Limitations and future work
Two-verifier scope. The composition identity extends to monotone Boolean rules of arity three and above by inclusion-exclusion (Appendix A of the PDF), but the joint scoring rule on $\{0,1\}^k$ for $k \ge 3$ slots faces a combinatorial blowup in the joint report space, from four cells at $k = 2$ to $2^k$ cells at $k = 3$ and beyond. Three-slot composition (PRM plus outcome verifier plus LLM judge) is the natural near-term target.
Static verifier population. Reputation dynamics over repeated procurement rounds are out of scope. The natural extension connects to Xu and Park (2026) on online Bayesian calibration under gradual and abrupt system changes, and to the moral-hazard structure of Holmström (1979) applied to the joint-report setting.
Programmatic-verifier scope. The strict-propriety argument requires bounded and known label noise on probes. Math, formal logic, and code with strict tests satisfy this. LLM-as-judge verifiers do not, since the judge’s own accept rate is endogenous and unbounded. The rubric-grounded RL framework of Bhattarai et al. (2026) decomposes the judge’s reward into weighted verifiable criteria; the joint-mechanism extension to rubric-judges with bounded per-criterion label noise is a natural next step.
Single deployer. Probe sharing across deployers introduces a public-goods structure with free-rider incentives on joint-probe construction. The natural extension is Paper #3 in the wedge plan, with the bilateral-trade impossibility of Myerson and Satterthwaite (1983) applied to the joint-probe-as-public-good setting.
Calibration-monotone-pair assumption. The lower bound of Theorem 5 requires a calibration-monotone-pair family. The upper bound of Theorem 4 does not. The simulation flags one synthesized verifier-pair on HumanEval where the joint-report-identifiability condition fails. The worst-case regret on non-identifiable families is an open problem.
Time-varying joint correlation. The joint correlation is treated as a static unknown in this paper. Drift in the correlation over the deployer’s task distribution introduces an online-procurement structure that builds on Xu and Park (2026).
11. Conclusion
Paper #1 procures one verifier. This paper procures the composed pipeline. The composition identity gives a clean picture of why per-verifier elicitation does not transfer. Pipeline miscalibration under per-verifier elicitation is exactly the within-instance verifier-disagreement covariance. The joint scoring-rule mechanism implements pipeline cost-correct minimization in dominant strategies at a bounded probe-budget cost: roughly double Paper #1’s probe budget under unknown correlation, and unchanged under known correlation. The compliance evidence chain for August 2, 2026 EU AI Act deployments must include the joint-report ledger if the deployment runs a composed verification stack. Per-component evidence does not document pipeline accuracy when conditional correlation is non-zero, and the worst-case drift can be as large as .
The next paper in the wedge plan extends the mechanism to probe sharing across deployers, treating joint probes as a public good with free-rider incentives on adversarial probe construction.
Appendices (in the PDF)
The PDF includes eight appendices with full proofs and additional material.
- Appendix A. Composition identity for general monotone Boolean rules. Inclusion-exclusion expansions for two- and three-verifier cases, including OR, AND, majority, and the third joint cumulant.
- Appendix B. Full proof of Theorem 2 (non-implementability under per-verifier elicitation) including the payoff-equivalence argument.
- Appendix C. Full proof of Theorem 4 (upper bound) with the split-at- tail integration and the absolute constant.
- Appendix D. Le Cam packing for Theorem 5 (lower bound).
- Appendix E. Proof of Proposition 3 (conditional-rare-event probes).
- Appendix F. Verifier-pair synthesis with prescribed conditional correlation. Calibration of the shared-latent coupling parameter.
- Appendix G. Simulation pseudocode.
- Appendix H. Notation summary.
References
- Bhardwaj, M. Verifier Procurement Under Unobservable Quality. A Scoring-Rule Mechanism for Cost-Correct Minimization. Research Paper #1, verification-economics wedge. ifitsmanu.com, 2026.
- Bhardwaj, M. The Cost of Being Right. Verification Economics in 2026. Field Notes #2. ifitsmanu.com, 2026.
- Bhardwaj, M. The α Asymmetry. Why Verifiers Can Be Smaller Than Generators. Field Notes #3. ifitsmanu.com, 2026.
- Bhattarai, M., Boureima, I., Ranasinghe, N. R., Pakin, S., O’Malley, D. Rubric-Grounded RL. Structured Judge Rewards for Generalizable Reasoning. arXiv:2605.08061, 2026.
- Burnat, F. A. D., Davidson, B. I. A Benchmark for Strategic Auditee Gaming Under Continuous Compliance Monitoring. arXiv:2605.06340, 2026.
- Chen, M., Tworek, J., Jun, H., Yuan, Q., et al. Evaluating Large Language Models Trained on Code. arXiv:2107.03374, 2021.
- Cobbe, K., Kosaraju, V., Bavarian, M., et al. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168, 2021.
- Cover, T. M., Thomas, J. A. Elements of Information Theory. 2nd edition, Wiley-Interscience, 2006.
- Dasgupta, A., Ghosh, A. Crowdsourced Judgement Elicitation with Endogenous Proficiency. WWW '13, ACM, 2013.
- European Parliament and Council. Regulation (EU) 2024/1689 on Artificial Intelligence (AI Act). OJ EU, 12 July 2024. High-risk obligations apply from 2 August 2026.
- Frongillo, R., Kash, I. A. General Truthfulness Characterizations Via Convex Analysis. Games and Economic Behavior 130, 636–662, 2021.
- Gneiting, T., Raftery, A. E. Strictly Proper Scoring Rules, Prediction, and Estimation. JASA 102(477), 359–378, 2007.
- Grimmett, G., Welsh, D. Probability: An Introduction. 2nd edition, Oxford University Press, 2014.
- Guan, X., Zhang, L. L., Liu, Y., et al. rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking. arXiv:2501.04519, 2025.
- Hendrycks, D., Burns, C., Kadavath, S., et al. Measuring Mathematical Problem Solving With the MATH Dataset. NeurIPS 2021 Datasets and Benchmarks.
- Hoeffding, W. Probability Inequalities for Sums of Bounded Random Variables. JASA 58(301), 13–30, 1963.
- Holmström, B. Moral Hazard and Observability. Bell Journal of Economics 10(1), 74–91, 1979.
- Karnin, Z., Koren, T., Somekh, O. Almost Optimal Exploration in Multi-Armed Bandits. ICML 2013.
- Kong, Y., Schoenebeck, G. An Information Theoretic Framework for Designing Information Elicitation Mechanisms That Reward Truth-Telling. ACM TEAC 7(1), 2019.
- Le Cam, L. Convergence of Estimates Under Dimensionality Restrictions. Annals of Statistics 1(1), 38–53, 1973.
- Lightman, H., Kosaraju, V., Burda, Y., et al. Let’s Verify Step by Step. ICLR 2024 / arXiv:2305.20050, 2023.
- Lovén, L. Honest Reporting in Scored Oversight. True-KL0 Property via the Prekopa Principle. arXiv:2605.03793, 2026.
- Miller, N., Resnick, P., Zeckhauser, R. Eliciting Informative Feedback. The Peer-Prediction Method. Management Science 51(9), 1359–1373, 2005.
- Myerson, R. B. Optimal Auction Design. Mathematics of Operations Research 6(1), 58–73, 1981.
- Myerson, R. B., Satterthwaite, M. A. Efficient Mechanisms for Bilateral Trading. J. Economic Theory 29(2), 265–281, 1983.
- Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Series in Statistics, 2009.
- Uesato, J., Kushman, N., Kumar, R., et al. Solving Math Word Problems With Process- and Outcome-Based Feedback. arXiv:2211.14275, 2022.
- Wang, X., Wei, J., Schuurmans, D., et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. ICLR 2023 / arXiv:2203.11171, 2022.
- Witkowski, J., Parkes, D. C. A Robust Bayesian Truth Serum for Small Populations. AAAI 2012.
- Xu, Y., Park, C. Online Bayesian Calibration under Gradual and Abrupt System Changes. arXiv:2605.06612, 2026.
- Ye, H., Dang, J., Fang, J., et al. On the Implicit Reward Overfitting and the Low-rank Dynamics in RLVR. arXiv:2605.06523, 2026.
- Zheng, L., Chiang, W.-L., Sheng, Y., et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.
Cite this article
@misc{bhardwaj2026verifiercomposition,
author = {Bhardwaj, Manu},
title = {Calibration Drift Under Verifier Composition: A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization},
year = {2026},
month = {May},
url = {https://ifitsmanu.com/papers/verifier-composition},
howpublished = {\url{https://ifitsmanu.com/papers/verifier-composition/paper.pdf}},
note = {Working paper. Version 1.0. Research Paper #2 in the verification-economics wedge.}
}
- APA. Bhardwaj, M. (2026, May). Calibration drift under verifier composition: A joint scoring-rule mechanism for pipeline-level cost-correct minimization. ifitsmanu.com. https://ifitsmanu.com/papers/verifier-composition
- MLA. Bhardwaj, Manu. "Calibration Drift Under Verifier Composition: A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization." ifitsmanu.com, May 2026. https://ifitsmanu.com/papers/verifier-composition.
- IEEE. M. Bhardwaj, "Calibration Drift Under Verifier Composition: A Joint Scoring-Rule Mechanism for Pipeline-Level Cost-Correct Minimization," ifitsmanu.com, May 2026. [Online]. Available: https://ifitsmanu.com/papers/verifier-composition