Calibration is a verifiable claim
A score that doesn't carry its skill is a marketing number. We treat verification as the credibility moat: every audit-logged score is joined to its observed outcome (when the operator reports one), and the daily-refreshed mv_forecast_verification_daily materialised view exposes per-cell BSS, CRPS, sample size, and reliability bins. The view is the source of truth both for our own recalibration loop (L3) and for the public skill metrics that any researcher or auditor can reproduce.
Three numbers, each answering a different question
Brier Skill Score (BSS). The Brier score is BS = mean((forecast_prob − outcome_binary)²); lower is better. BSS scales it against a climatological baseline, BSS = 1 − BS / BS_climatology: positive means we beat climatology, 0 means we match it, negative means we're worse. We publish per-cell BSS at horizons of 24, 72, and 168 h.
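The two formulas can be sketched in a few lines of Python. This is an illustration, not the engine's implementation: the function names are hypothetical, and the climatological baseline is assumed here to be the base rate of the same sample, which may differ from the baseline the verification view actually uses.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probability and 0/1 outcome."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """BSS = 1 - BS / BS_ref, with BS_ref from a constant base-rate forecast.

    Assumption: climatology is the in-sample base rate of `outcomes`.
    """
    base_rate = sum(outcomes) / len(outcomes)
    bs = brier_score(probs, outcomes)
    bs_ref = brier_score([base_rate] * len(outcomes), outcomes)
    if bs_ref == 0.0:
        # All outcomes identical: the reference is perfect and BSS is undefined.
        raise ValueError("BSS undefined when every outcome is the same")
    return 1.0 - bs / bs_ref


if __name__ == "__main__":
    probs = [0.9, 0.8, 0.2, 0.1]
    outcomes = [1, 1, 0, 0]
    print(brier_score(probs, outcomes))        # about 0.025
    print(brier_skill_score(probs, outcomes))  # about 0.9
```

Forecasting the base rate for every case gives BSS = 0 by construction, which is the "match climatology" case.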
Continuous Ranked Probability Score generalises Brier to non-binary outcomes — appropriate when you score on a continuous 0-100 verdict rather than a single threshold. Lower is better. Same per-cell stratification as BSS.
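How the forecast distribution over the 0-100 verdict is represented is not stated here. As a sketch, assume it is available as a finite ensemble of sampled verdicts; the standard ensemble estimator of CRPS is then the mean absolute error of the members minus half the mean absolute difference between members. The function name is hypothetical.

```python
from typing import Sequence


def crps_ensemble(members: Sequence[float], observed: float) -> float:
    """Ensemble estimate of CRPS: E|X - y| - 0.5 * E|X - X'|.

    Assumption: `members` are equally weighted samples of the forecast verdict.
    """
    m = len(members)
    if m == 0:
        raise ValueError("need at least one ensemble member")
    # Average distance from each member to the observation.
    accuracy = sum(abs(x - observed) for x in members) / m
    # Average distance between all ordered pairs of members (ensemble spread).
    spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return accuracy - 0.5 * spread


if __name__ == "__main__":
    print(crps_ensemble([60.0, 70.0, 80.0], 70.0))  # about 2.22
    print(crps_ensemble([65.0], 70.0))              # 5.0: a point forecast scores its absolute error
```

With a single member the spread term vanishes and CRPS reduces to the absolute error, which is why a sharp but wrong forecast is penalised more than a wider one centred on the outcome.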
Reliability. When the engine says "70% favourable", how often did the activity actually run? A perfectly calibrated forecaster sits on the 45° diagonal of the reliability diagram. Deviations from the diagonal point at the specific cells that need recalibration.
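Reliability bins of this kind can be computed by grouping forecasts into probability bins and comparing each bin with the observed frequency. The sketch below assumes ten equal-width bins labelled by their centre (0.05, 0.15, ...), which matches the shape of the sample export record but is not confirmed as the engine's binning; the function name is hypothetical.

```python
from typing import Dict, List, Sequence


def reliability_bins(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Dict[str, float]]:
    """Return non-empty bins with forecast centre, observed frequency, and count."""
    if len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must have equal length")
    counts = [0] * n_bins
    hits = [0] * n_bins
    for p, o in zip(probs, outcomes):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        # p == 1.0 would index past the end, so clamp it into the last bin.
        i = min(int(p * n_bins), n_bins - 1)
        counts[i] += 1
        hits[i] += o
    return [
        {
            "p_forecast": (i + 0.5) / n_bins,    # bin centre (assumed labelling)
            "p_observed": hits[i] / counts[i],   # how often the event happened
            "n": counts[i],
        }
        for i in range(n_bins)
        if counts[i] > 0
    ]


if __name__ == "__main__":
    for b in reliability_bins([0.72, 0.75, 0.78, 0.71, 0.12], [1, 1, 0, 1, 0]):
        print(b)
```

A bin whose p_observed sits well away from its p_forecast, with a reasonable n, is the kind of deviation the text describes; bins with very small n are noisy and should be read with caution.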
Skill is never uniform
A single platform-wide BSS is a marketing number too. Our verification view and the public export both stratify across four dimensions, so the consumer (whether that is our own calibrator or an insurance actuary defending a premium loading) can see exactly where the engine speaks confidently and where it doesn't.
Per-cell skill — the unit of bias-correction in the hierarchical Bayesian calibration layer (engine refit). A weak BSS on a single cell rebuilds just that cell's curves.
Horizon: 24 h / 72 h / 168 h. Nearer horizons should beat further ones; the gap quantifies how aggressively to widen tolerances as the forecast extends.
Tier 1 = data-rich (CMEMS-covered marine spots, instrumented alpine cells). Tier 3 = sparse. Skill should decline with tier; a deviation from that pattern flags spots whose tier is misclassified.
Season. Skill also varies within the year: Mediterranean wind regimes differ between summer and winter, and alpine snow stability differs between early and late season. Stratifying by season surfaces that bias.
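The stratification itself amounts to a group-by over the four dimensions before any score is averaged. The sketch below is illustrative only: horizon_h and spot_tier follow the sample export record, while cell, season, forecast_prob, and outcome are hypothetical field names for paired forecast and outcome rows.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Sequence, Tuple


def stratified_brier(
    rows: Iterable[Mapping[str, object]],
    keys: Sequence[str] = ("cell", "horizon_h", "spot_tier", "season"),
) -> Dict[Tuple, Dict[str, float]]:
    """Brier score and sample count per stratum, never pooled across strata."""
    # stratum -> [sum of squared errors, number of rows]
    acc = defaultdict(lambda: [0.0, 0])
    for row in rows:
        stratum = tuple(row[k] for k in keys)
        acc[stratum][0] += (row["forecast_prob"] - row["outcome"]) ** 2
        acc[stratum][1] += 1
    return {
        stratum: {"brier": total / n, "n_samples": n}
        for stratum, (total, n) in acc.items()
    }


if __name__ == "__main__":
    base = {"cell": "a", "spot_tier": 1, "season": "summer"}
    rows = [
        {**base, "horizon_h": 24, "forecast_prob": 0.9, "outcome": 1},
        {**base, "horizon_h": 24, "forecast_prob": 0.8, "outcome": 1},
        {**base, "horizon_h": 168, "forecast_prob": 0.6, "outcome": 0},
    ]
    for stratum, stats in stratified_brier(rows).items():
        print(stratum, stats)
```

Keeping n_samples next to each score matters: a stratum with a handful of rows can show an extreme value that says little about real skill.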
The public export
Every verification row that clears the privacy gate (consented tenant, k ≥ 10 distinct contributors per cell, 1 km² grid aggregation, 90-day lag) is exported as line-delimited JSON at GET /v1/research/verification/export.jsonl. The data is licensed CC BY 4.0, requires no auth, and has no rate limit for academic consumers. Recompute our skill claims, plug them into a JOSS paper, embed them in a Bayesian network: the only ask is the citation.
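Because the export is line-delimited JSON, each line can be parsed on its own without loading the whole file. The sketch below takes any iterable of lines (an open file, or a streamed HTTP response) and uses the field names of the sample export record in this section; the helper names, the min_samples filter, and the calibration-gap summary (a sample-weighted mean distance of the bins from the diagonal) are illustrative assumptions, not part of the API.

```python
import json
from typing import Dict, Iterable, Iterator


def iter_verification_rows(lines: Iterable[str], min_samples: int = 0) -> Iterator[Dict]:
    """Parse JSONL lines, skipping blanks and rows with too few samples."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        row = json.loads(line)
        if row["n_samples"] >= min_samples:
            yield row


def calibration_gap(row: Dict) -> float:
    """Sample-weighted mean |p_forecast - p_observed| over a row's bins."""
    bins = row["reliability_bins"]
    total = sum(b["n"] for b in bins)
    if total == 0:
        raise ValueError("row has no binned samples")
    return sum(b["n"] * abs(b["p_forecast"] - b["p_observed"]) for b in bins) / total


if __name__ == "__main__":
    # Hypothetical two-bin row for demonstration; real rows carry more fields.
    sample = json.dumps({
        "sub_spot_slug": "tarifa-balneario",
        "horizon_h": 24,
        "n_samples": 12,
        "reliability_bins": [
            {"p_forecast": 0.05, "p_observed": 0.07, "n": 4},
            {"p_forecast": 0.15, "p_observed": 0.13, "n": 8},
        ],
    })
    for row in iter_verification_rows([sample, ""], min_samples=10):
        print(row["sub_spot_slug"], row["horizon_h"], calibration_gap(row))
```

To run this against a downloaded copy of the export, pass the open file object as lines; the file is then read one row at a time.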
{
  "day": "2026-05-18",
  "activity": "kitesurfing",
  "sub_spot_slug": "tarifa-balneario",
  "horizon_h": 24,
  "spot_tier": 1,
  "n_samples": 47,
  "brier": 0.124,
  "bss": 0.31,
  "crps": 0.156,
  "reliability_bins": [
    { "p_forecast": 0.05, "p_observed": 0.07, "n": 4 },
    { "p_forecast": 0.15, "p_observed": 0.13, "n": 8 },
    ...
  ]
}
What the numbers don't yet say
Today the export covers the cells where at least one consented tenant has accumulated paired outcomes. That's a partial map while the platform is pre-launch. As the cohort accumulates (each new paying tenant deepens the verification surface), cells move from provisional to reviewed to calibrated. Every score response already carries a profileMaturity banner so consumers know whether they're reading a calibrated number or an expert-set one.
Working on a verification paper?
If you're using the Stream F export in research or want a custom cut, the partnerships desk is the fastest path. We'll set you up with a research API key and co-author the methods section when it helps.