Newsroom

You Cannot Price What You Cannot Interrogate

Gallagher Re published a standalone report this week examining how the insurance market assesses AI systems when pricing AI-related risk. Its conclusion is direct: the methods currently in use were not built for this purpose, and the gap between what evaluation techniques measure and what insurers actually need to know is widening.

10 June 2026

The problem Gallagher Re identifies is structural.

AI models are assessed through benchmarks, standardised tests that compare performance on fixed tasks under controlled conditions. These produce scores. What they do not produce is a reliable picture of how a system behaves when exposed to uncertain, complex or unpredictable inputs in live deployment. Strong benchmark performance, as the report notes, does not eliminate hallucinations, inconsistent outputs, or subtle failures that may not surface until they have already caused harm.

Ed Pocock, Global Head of Cyber Security at Gallagher Re, puts the disconnect plainly: insurers are not concerned with what a model can do in a test environment. They are concerned with how models fail, how often they fail, and whether those failures could be correlated across a portfolio. That final point, correlated failure, is the one that should concentrate minds in reinsurance. A failure mode shared across widely deployed foundation models is not an isolated claims event. It is an accumulation event.
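The arithmetic behind that distinction can be sketched in a few lines. The example below is illustrative only and is not drawn from the Gallagher Re report: the portfolio size, failure probability, and claims threshold are hypothetical figures chosen to show the shape of the problem. It compares two portfolios with the same expected number of claims, one where AI failures are independent across insureds and one where every insured depends on the same model and fails together.

```python
from math import comb


def binomial_tail(n: int, p: float, k: int) -> float:
    """Probability of at least k failures among n independent insureds."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def compare_portfolios(n: int, p: float, threshold: int) -> dict:
    """Compare tail risk for independent vs fully shared failure modes.

    All inputs are hypothetical. In the shared case, a single flaw in a
    common foundation model hits every insured at once with probability p,
    so the portfolio produces either 0 claims or n claims.
    """
    return {
        # Expected claim counts are identical in both cases.
        "expected_claims_independent": n * p,
        "expected_claims_shared": p * n,
        # Chance that claims reach the accumulation threshold.
        "tail_independent": binomial_tail(n, p, threshold),
        "tail_shared": p if threshold <= n else 0.0,
    }


# Hypothetical portfolio: 100 insureds, 2% annual failure probability,
# and 20 simultaneous claims treated as an accumulation event.
result = compare_portfolios(n=100, p=0.02, threshold=20)
for name, value in result.items():
    print(f"{name}: {value:.3g}")
```

Under these assumptions the expected claim count is the same in both portfolios, but the chance of reaching the accumulation threshold is vanishingly small when failures are independent and equal to the full failure probability when they are shared. Real portfolios sit somewhere between the two extremes, which is why the degree of correlation, rather than the benchmark score, is the quantity a reinsurer needs to estimate.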

The report also raises the problem of benchmark contamination.

Models are increasingly optimised to perform well on the very tests used to evaluate them, producing scores that inflate apparent capability while reducing meaningful differentiation between systems. The practical consequence, as Gallagher Re warns, is that scale and brand become proxies for safety in the absence of better signals. That is precisely the condition in which concentration risk builds unseen.

The report draws specific attention to restricted-distribution frontier models, those released only to approved partners rather than openly, as a fourth category of AI distribution that creates particular assessment challenges. Where independent evaluation is not possible, Pocock's position is unambiguous: if a model cannot be independently evaluated, it cannot be meaningfully priced.

This is where the argument lands for infrastructure.

The evaluation gap Gallagher Re describes is not primarily a methodology problem, though better methodologies are needed. It is a provenance problem. A system that cannot expose its own reasoning, that cannot show what it knew, how confident it was, and what it weighted at the moment of a consequential output, cannot be assessed from the outside with any rigour. You are pricing opacity.

The firms building AI infrastructure with traceable, verifiable reasoning baked in are not just better positioned for regulatory compliance. They are building systems that the insurance market can eventually price with confidence rather than load for uncertainty. In a market where concentration risk in AI is becoming a first-order concern, that distinction has real economic value.