Kaggle DeepMind Agentic Evals at Scale Framework

Last updated: 25 May 2026

Design and deploy community-scale agentic evaluation systems that are transparent, unsaturatable, and accessible to non-expert contributors — not just AI labs.

// TL;DR

The Kaggle DeepMind Agentic Evals at Scale Framework is a structured methodology for designing, deploying, and maintaining AI evaluation systems that are transparent, unsaturatable, and open to non-expert contributors — not just AI labs. Use it when your benchmarks are stale, opaque, or narrowly authored; when you need to evaluate agents (not just models); when domain-specific knowledge is missing from mainstream evals; or when you need to structure agent testing before production deployment. It introduces PvP Game Arena scoring, Standardized Agent Exams, and hackathon-driven community benchmarking to solve the Democratization Problem in AI evaluation.


// When should you use the Kaggle DeepMind Agentic Evals at Scale Framework?

Use this skill when you need to build, audit, or expand an AI evaluation or benchmarking program — especially when evaluations are stale, opaque, narrowly authored, or fail to cover domain-specific knowledge that mainstream AI labs ignore. Also applicable when deciding how to structure agent testing before production deployment.

// What inputs do you need to design an agentic evaluation system?

evaluation_target (required)
What is being evaluated — a model, an agent, a harness, or a combination? Be explicit.
domain_or_capability (required)
The specific capability or domain the benchmark should cover (e.g., wastewater treatment safety protocols, poker deception, coding under ambiguity).
audience_type (required)
Who will build and consume these evals — research labs, enterprise teams, consumer agent builders, or community contributors?
saturation_risk
Is there a risk the benchmark will be solved/saturated quickly? Describe the current state of model performance on similar tasks.
transparency_requirements
What level of openness is required — fully open source, reproducible configs, public leaderboard, or proprietary?
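As a rough illustration, the inputs above could be captured in a small typed record before any design work starts. The `EvalSpec` class below is a hypothetical container, not part of the framework; only the field names and the required/optional split come from the list above.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical container for the inputs listed above. The field names mirror
# the list; the class itself is an illustration, not part of the framework.
@dataclass(frozen=True)
class EvalSpec:
    evaluation_target: str        # required: model, agent, harness, or a combination
    domain_or_capability: str     # required
    audience_type: str            # required
    saturation_risk: Optional[str] = None
    transparency_requirements: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject blank values for the three required inputs.
        for name in ("evaluation_target", "domain_or_capability", "audience_type"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} is required and must be non-empty")

spec = EvalSpec(
    evaluation_target="agent",
    domain_or_capability="wastewater treatment safety protocols",
    audience_type="community contributors",
)
```

Making the record frozen keeps the declared evaluation target from drifting once a benchmark run has started.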

// What core principles drive the Agentic Evals at Scale Framework?

Evals Are Scattered, Decentralized, and Get Stale Fast

Ten or more benchmarks are published daily on arXiv. Authors move on to the next benchmark after publishing, leaving leaderboards stale and irrelevant. Any eval system must have a mechanism to stay live and current, not just snapshot a moment in time.

Transparency, Accessibility, and Verifiability

Model publisher benchmark charts are not trustworthy by default — configurations, prompt strategies, and API features (e.g., context compaction) can be tuned to favor the evaluator's own model. Every eval must expose its full configuration so results are reproducible by any third party.

The Democratization Problem (Small Circles, Big World)

Approximately 30,000 AI researchers create nearly all evals for a world of 30 million technical professionals and billions of end users. If a capability isn't benchmarked, you cannot hill-climb on it. Uneven eval coverage produces cognitive jaggedness — superhuman performance in some areas, mediocre performance in others — and that is not equitable AI.

Proprietary Novel Data Sets from Domain Experts

The most valuable benchmarks contain knowledge that doesn't exist anywhere on the web and isn't economically productive for AI labs to pursue — like a 20-year wastewater treatment engineer's safety protocols. The eval ecosystem must actively recruit and empower these domain experts, not just AI researchers.

You're Testing the Harness, Not Just the Model

For agentic benchmarks, the harness (execution environment, tool scaffolding, prompt structure) can account for a 22%+ performance difference across the same frontier models on the same task (per SWE-Bench data). Always be explicit about what is actually under test: the model, the agent, or the harness.

Unsaturatable Benchmarks via PvP (Game Arena Principle)

Static benchmarks saturate: models reach ceiling performance and the signal disappears. PvP (model vs. model) structures scored with Elo or Bradley-Terry ratings are inherently unsaturatable because every match produces one winner and one loser, so the ranking keeps discriminating as models improve. Use this design for long-running, evergreen evaluation.
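For readers unfamiliar with Elo scoring, here is a minimal sketch of the standard rating update. The K-factor of 32 and the starting rating of 1500 are conventional defaults chosen for illustration; the source does not state which values Game Arena uses.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b).

    score_a is 1.0 for a win by A, 0.5 for a draw, 0.0 for a loss.
    k is an assumed K-factor, not a value taken from the framework.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta

# Two equally rated models; A wins, so A gains exactly what B loses.
new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
print(new_a, new_b)  # 1516.0 1484.0
```

Because points are transferred rather than accumulated against a fixed maximum, there is no score ceiling for the field to hit.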

The Difficulty Spectrum

Benchmarks must be calibrated on a spectrum: too hard and no agent completes it (no signal); too easy and performance differences collapse (no signal). Useful benchmarks occupy the zone where meaningful differentiation is possible without being prohibitively expensive to run.
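A pilot check along these lines can flag tasks that sit at either end of the spectrum. This is a sketch under assumptions: the 90% ceiling matches the calibration guidance in the step-by-step section, while the `min_spread` threshold and the function name are hypothetical.

```python
from statistics import pstdev

def classify_difficulty(pilot_scores, ceiling=0.90, min_spread=0.10):
    """Classify a task from pilot results.

    pilot_scores: fraction of the task each pilot agent completed (0.0 to 1.0).
    ceiling: score above which every agent counts as "too easy" (from the text).
    min_spread: assumed minimum population std. dev. for useful differentiation.
    """
    if not pilot_scores:
        raise ValueError("need at least one pilot score")
    if all(s == 0.0 for s in pilot_scores):
        return "too_hard"
    if all(s > ceiling for s in pilot_scores):
        return "too_easy"
    if pstdev(pilot_scores) < min_spread:
        return "low_spread"
    return "usable"

print(classify_difficulty([0.0, 0.0, 0.0]))     # too_hard
print(classify_difficulty([0.95, 0.92, 0.99]))  # too_easy
print(classify_difficulty([0.2, 0.5, 0.8]))     # usable
```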

// How do you apply the Agentic Evals at Scale Framework step by step?

1. Define what is actually under test
Explicitly separate: (a) the model, (b) the agent, (c) the harness. Do not conflate them. A benchmark that mixes these produces misleading results. State upfront which of the three you are evaluating and lock the other two as controlled constants.
2. Identify the domain expertise gap
Ask: is this capability covered by existing benchmarks? If the knowledge lives only in a practitioner's head or proprietary field (like wastewater treatment, niche legal domains, rare engineering disciplines), treat it as a Proprietary Novel Data Set opportunity. Recruit that domain expert as the benchmark author, not an AI researcher.
3. Choose the eval architecture based on saturation risk
If the task space is finite and models will likely saturate it within 12-18 months, use a PvP / Game Arena architecture with Elo scoring. If the task space is open-ended or domain knowledge is proprietary, use an assertion-based benchmark with LLM-as-judge scoring. Never use a static leaderboard as your only output.
4. Design assertions and tasks with explicit verifiability
Write assertions that any third party can verify independently. Combine hard-coded checks (does the output contain X?) with LLM judging for qualitative dimensions. Group assertions into Tasks; group Tasks into Benchmarks. Expose all configuration publicly so no one can accuse you of optimizing for a specific model.
5. Calibrate difficulty on the Difficulty Spectrum
Run a small pilot with 3-5 representative agents. If none complete the task, it is too hard. If all score above 90%, it is too easy. Adjust task complexity until the pilot shows meaningful spread across agents. For agentic tasks, factor in token cost — tasks that are too long become prohibitively expensive at scale.
6. Build or select a standardized harness and document it completely
Use a consistent LLM model proxy layer so all models are called identically. Document: model API version, context window used, whether features like context compaction are enabled or disabled, temperature, and any other configs. The wastewater engineer's benchmark and a frontier lab's benchmark must run under identical harness conditions to be comparable.
7. Deploy a Standardized Agent Exam for consumer/low-resource users
For agents that need a quick baseline before deployment (e.g., an OpenAI operator agent or a custom automation), provide a one-line prompt interface that returns a score on a public leaderboard. This covers the consumer end of the spectrum — most agent builders are not running any evals before deploying. Prioritize safety-focused exam tracks.
8. Use hackathons to channel domain expertise at scale
Hackathons are the mechanism to convert the Democratization Problem into a strength. Define guardrails (focus areas, e.g., five specific cognitive faculties) but give participants creative latitude. Provide free access to: data hosting, API credits for state-of-the-art models, and writeup tools so work is understandable and reusable by others. All outputs must be open source.
9. Apply Bradley-Terry pairwise scheduling to control compute costs
Running full round-robins at statistical significance is extremely expensive (e.g., 400,000 poker hands for one game). Fit a Bradley-Terry pairwise comparison model to the results so far and use it to select which matchups to run next, prioritizing the matchups with the most information gain given current Elo uncertainty. Always track and publish the full LLM conversation logs as a dataset for community analysis.
10. Publish results with full reproducibility artifacts
Publish: the benchmark tasks, the harness config, the raw LLM conversation logs, the Elo/score methodology, and a game visualizer or qualitative examples where possible. A leaderboard without these artifacts is marketing, not science. Make everything forkable and open source.
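The assertion, task, and benchmark grouping from step 4 could be sketched as follows. All class and function names here are hypothetical and do not reflect the Kaggle Benchmarks API; `stub_judge` stands in for a real LLM-as-judge call, and the safety prompt is toy data invented for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# A check takes the agent's output text and returns True (pass) or False (fail).
Check = Callable[[str], bool]

@dataclass
class Assertion:
    name: str
    check: Check

@dataclass
class Task:
    name: str
    prompt: str
    assertions: List[Assertion] = field(default_factory=list)

    def score(self, output: str) -> float:
        """Fraction of assertions that pass for one output."""
        if not self.assertions:
            return 0.0
        passed = sum(1 for a in self.assertions if a.check(output))
        return passed / len(self.assertions)

@dataclass
class Benchmark:
    name: str
    tasks: List[Task] = field(default_factory=list)

    def score(self, run_agent: Callable[[str], str]) -> float:
        """Mean task score for an agent given as a prompt -> output callable."""
        if not self.tasks:
            return 0.0
        return sum(t.score(run_agent(t.prompt)) for t in self.tasks) / len(self.tasks)

def contains(term: str) -> Check:
    """Hard-coded check: does the output contain the term (case-insensitive)?"""
    return lambda output: term.lower() in output.lower()

def stub_judge(output: str) -> bool:
    """Placeholder for an LLM-as-judge call; it only looks for a stated reason."""
    return "because" in output.lower()

task = Task(
    name="emergency_shutdown",
    prompt="A chlorine leak alarm is active. What is the first step?",  # toy prompt
    assertions=[
        Assertion("mentions_evacuation", contains("evacuate")),
        Assertion("gives_reasoning", stub_judge),
    ],
)
bench = Benchmark("plant_safety_demo", [task])

def toy_agent(prompt: str) -> str:
    return "Evacuate the area first, because chlorine gas is toxic."

print(bench.score(toy_agent))  # 1.0
```

Keeping each assertion a named, plain function is what lets a third party rerun the same checks and confirm a published score.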

// What are real-world examples of the Agentic Evals at Scale Framework in action?

A 20-year domain specialist in a safety-critical industrial field (e.g., chemical plant operations) wants to evaluate whether frontier AI models can correctly follow safety protocols in their domain.

Treat this as a Proprietary Novel Data Set. The expert authors the benchmark tasks from lived experience — no AI lab has this data. Build assertion-based tasks (does the model recommend the correct emergency shutdown procedure?), add LLM-as-judge for nuanced safety reasoning, calibrate difficulty by checking whether models pass basic protocol questions but fail edge-case incident scenarios. Publish fully open source so the domain community can validate and extend it.

A consumer developer has built an AI agent that manages their email and calendar and wants to do a quick safety baseline before deployment.

Use the Standardized Agent Exam pattern: pass a one-line system prompt describing the agent to the exam endpoint, receive a score on a public leaderboard. Run safety-focused exam tracks first (does the agent refuse to forward sensitive data? does it respect scope boundaries?). Compare against 500+ already-evaluated agents on the leaderboard to contextualize performance without building a custom eval harness.

An AI team wants to benchmark models on a capability (e.g., multi-step deception in negotiation) without the benchmark saturating within a year.

Use the Game Arena / PvP architecture. Design a game that isolates the deception capability (e.g., a negotiation game with hidden information). Implement Elo scoring with Bradley-Terry pairing to minimize games needed. Observe emergent model personalities (e.g., risk-seeking vs. risk-averse behaviors) as a secondary signal. The benchmark is permanently unsaturatable because one model always loses.

// What mistakes should you avoid when building agentic evaluations at scale?

Benchmarking the model when you are actually benchmarking the harness — the same six frontier models can differ by 22%+ on identical tasks depending on harness configuration.
Optimizing benchmark configuration for your own model (e.g., enabling proprietary API features only for your model) and publishing results as neutral. Always use identical harness settings across all evaluated models.
Publishing a static leaderboard and then letting it go stale as authors move to the next paper. Either assign ongoing maintenance ownership or architect an unsaturatable (PvP) format from the start.
Making agent exams too hard (no agent finishes, no signal) or too easy (all agents score >90%, no differentiation). Calibrate on the Difficulty Spectrum before public launch.
Assuming AI researchers are sufficient to cover all important capabilities. The world's most important domain knowledge lives in practitioners who are not AI researchers — wastewater engineers, medical specialists, tradespeople. Failing to recruit them produces cognitively jagged AI.
Ignoring compute costs in benchmark design. Running PvP games at statistical significance can require hundreds of thousands of game instances. Design with Bradley-Terry pairwise scheduling and cost ceilings before starting.
Conflating model benchmarks with agent benchmarks. Answering a riddle such as "What gets wetter as it dries?" (a towel) is a model task. An agent completing a multi-step workflow involving tools, memory, and external APIs requires a completely different assertion and harness design.
Relying on human expert judgment without planning for inter-expert alignment — even domain experts disagree, and AI cannot reliably judge innovation or creativity. Build explicit alignment workflows for human review stages.

// What key terms and concepts does the Agentic Evals at Scale Framework use?

Game Arena: A PvP benchmarking platform where AI models play games against each other with Elo/Bradley-Terry scoring. Inherently unsaturatable because there is always a winner and a loser, unlike static benchmarks that models eventually max out.
Standardized Agent Exams: A one-line prompt interface that allows any agent developer to submit their agent, have it take a standardized exam, and receive a score on a public leaderboard — democratizing eval access for consumer and low-resource agent builders.
Benchmark (Kaggle Benchmarks product): A platform enabling any community member to build, run, and share evaluations openly. Structured as: Assertions → Tasks → Benchmarks, with LLM-as-judge and hard-coded checks combined.
Proprietary Novel Data Set: A benchmark dataset created from deep domain expertise that does not exist anywhere on the web and is not covered by AI lab research because it lacks immediate economic productivity — e.g., a wastewater treatment safety protocol benchmark built by a 20-year industry veteran.
Hill-climbing: The process of iteratively improving model performance by measuring against a benchmark. If something is not benchmarked, you cannot hill-climb on it — the capability will not improve systematically.
Cognitive Jaggedness: The uneven capability profile of AI models — superhuman in benchmarked areas, mediocre or untested in areas that lack evaluations. Caused by the Democratization Problem where a small number of researchers create all evals.
Democratization Problem: The structural imbalance where ~30,000 AI researchers create nearly all benchmarks for a world of 30M+ technical professionals and billions of end users, leaving vast swathes of human knowledge unevaluated.
Harness: The execution environment, scaffolding, tooling, and prompt structure surrounding a model during evaluation. The harness can account for 22%+ performance differences on identical tasks — making it critical to specify and control when comparing models.
Difficulty Spectrum: The calibration axis for benchmarks: too hard means no agent completes the task (no signal); too easy means no differentiation between agents (no signal). Useful benchmarks occupy the meaningful middle zone.
Bradley-Terry Pairing: A matchup-scheduling method used in Game Arena, built on the Bradley-Terry pairwise comparison model, that maximizes the statistical significance of Elo rankings while minimizing the number of games (and thus API cost) required.
LLM Model Proxy: A consistent interface layer that routes requests to multiple different model APIs in a standardized way, ensuring all models in a benchmark run are called under identical conditions.
Saturation: The state where models reach ceiling performance on a benchmark, eliminating its usefulness as a signal. Static benchmarks inevitably saturate; PvP Game Arena architectures are designed to be permanently unsaturatable.
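To make the Harness and LLM Model Proxy entries concrete, the sketch below records the settings the framework asks you to document and derives a fingerprint that can be published alongside results. The `HarnessConfig` class, its field names, and the example values are assumptions for illustration only.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

# Hypothetical record of the settings the text says must be documented.
@dataclass(frozen=True)
class HarnessConfig:
    model_api_version: str
    context_window_tokens: int
    context_compaction: bool
    temperature: float

    def fingerprint(self) -> str:
        """Stable SHA-256 over the sorted JSON form of the config."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

def same_harness(a: HarnessConfig, b: HarnessConfig) -> bool:
    """True when every setting except the model version is identical."""
    settings_a = {k: v for k, v in asdict(a).items() if k != "model_api_version"}
    settings_b = {k: v for k, v in asdict(b).items() if k != "model_api_version"}
    return settings_a == settings_b

# Example values are invented for the demonstration.
run_a = HarnessConfig("model-a-2026-01", 128_000, False, 0.0)
run_b = HarnessConfig("model-b-2026-01", 128_000, False, 0.0)
run_c = HarnessConfig("model-b-2026-01", 128_000, True, 0.0)

print(same_harness(run_a, run_b))  # True
print(same_harness(run_a, run_c))  # False
```

Publishing the fingerprint with each leaderboard row gives third parties a quick way to confirm they reran the benchmark under the same configuration.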

// FREQUENTLY ASKED QUESTIONS

What is the Kaggle DeepMind Agentic Evals at Scale Framework?

It is a structured methodology from Google DeepMind and Kaggle for building AI evaluation systems that are transparent, unsaturatable, and accessible to domain experts beyond AI research labs. The framework distinguishes between model, agent, and harness evaluation; uses PvP Game Arena architectures with Elo scoring to prevent benchmark saturation; recruits domain experts to create proprietary novel datasets; and provides Standardized Agent Exams so any developer can baseline their agent before deployment.

What is cognitive jaggedness in AI and why does it matter?

Cognitive jaggedness is the uneven capability profile of AI models — superhuman in heavily benchmarked areas but mediocre in domains that lack evaluations. It matters because if a capability isn't benchmarked, teams cannot hill-climb on it, so it never improves systematically. The root cause is the Democratization Problem: roughly 30,000 AI researchers create nearly all benchmarks for billions of end users, leaving vast domains of human knowledge untested.

How do I build an unsaturatable AI benchmark?

Use a PvP (model-vs-model) Game Arena architecture with Elo or Bradley-Terry scoring. Unlike static benchmarks where models eventually reach ceiling performance, PvP structures always produce a winner and a loser, so the signal never disappears. Design a game that isolates the target capability (e.g., negotiation with hidden information), implement Bradley-Terry pairwise scheduling to control compute costs, and publish full conversation logs for community analysis.

How do I evaluate an AI agent versus evaluating a model?

Explicitly separate three components: the model, the agent, and the harness (execution environment, tool scaffolding, prompt structure). Lock two as controlled constants and vary only the one under test. SWE-Bench data shows the same frontier models can differ by 22%+ on identical tasks depending on harness configuration alone. Conflating these three produces misleading benchmark results that cannot be meaningfully interpreted.
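One way to enforce the "lock two, vary one" rule is a small guard that inspects the planned runs before anything executes. This is a hypothetical helper; the run descriptors and identifiers are placeholders.

```python
from typing import Dict, List

COMPONENTS = ("model", "agent", "harness")

def component_under_test(runs: List[Dict[str, str]]) -> str:
    """Return the single component that varies across runs.

    Raises ValueError when zero or several components vary, since the
    comparison would then be empty or confounded.
    """
    varying = [c for c in COMPONENTS if len({run[c] for run in runs}) > 1]
    if len(varying) != 1:
        raise ValueError(f"expected exactly one varying component, found {varying}")
    return varying[0]

# Placeholder identifiers: same model and agent, two harness variants.
runs = [
    {"model": "model-a", "agent": "agent-v1", "harness": "harness-x"},
    {"model": "model-a", "agent": "agent-v1", "harness": "harness-y"},
]
print(component_under_test(runs))  # harness
```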

How does the Kaggle DeepMind framework compare to traditional AI benchmarking?

Traditional benchmarking relies on static leaderboards authored by AI researchers that saturate quickly and go stale. The Kaggle DeepMind framework differs in four key ways: it uses PvP architectures to prevent saturation, it recruits domain experts (not just researchers) to create proprietary novel datasets, it requires full reproducibility artifacts including harness configs and conversation logs, and it provides one-line Standardized Agent Exams for developers who lack resources to build custom eval harnesses.

When should I use the Agentic Evals at Scale Framework?

Use it when your existing AI evaluations are stale, opaque, or narrowly authored by a small group; when you need to benchmark domain-specific knowledge that mainstream AI labs ignore (e.g., industrial safety protocols, niche legal domains); when deploying agents to production without any baseline safety testing; or when you need an evaluation architecture that won't saturate within 12-18 months. It is especially valuable for community-scale eval programs.

What results can I expect from implementing this framework?

You can expect evaluations that stay meaningful over time instead of saturating, benchmark coverage that extends into domain-specific areas previously untested, reproducible results that withstand third-party scrutiny, and a scalable contributor pipeline through hackathons. Teams using Standardized Agent Exams can baseline agents against 500+ already-evaluated agents before deployment, and PvP Game Arenas reveal emergent model behaviors like risk-seeking or risk-averse personality patterns.

What is a Standardized Agent Exam?

A Standardized Agent Exam is a one-line prompt interface that lets any agent developer submit their agent's system prompt, have it take a standardized exam, and receive a score on a public leaderboard. It democratizes evaluation access for consumer and low-resource agent builders who cannot afford to build custom eval harnesses. Safety-focused exam tracks are prioritized so developers can baseline critical safety behaviors before deploying agents to production.

What is Bradley-Terry pairing and why does it matter for AI evals?

Bradley-Terry pairing is a matchup-scheduling method, built on the Bradley-Terry pairwise comparison model, that maximizes the statistical significance of Elo rankings while minimizing the number of games (and API cost) required. Running full round-robins at statistical significance can require hundreds of thousands of game instances, for example 400,000 poker hands for a single game. The scheduler selects the matchups with the most information gain given current Elo uncertainty, making large-scale PvP benchmarks computationally feasible.
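The sketch below shows one plausible shape for such a scheduler; it is not Game Arena's actual implementation. It fits Bradley-Terry strengths with the standard minorization-maximization update, then picks the next pair using an assumed heuristic: outcome variance p(1 - p) divided by games already played, as a simple stand-in for information gain. The win counts are toy data.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Wins = Dict[Tuple[str, str], int]  # wins[(a, b)] = number of times a beat b

def fit_bradley_terry(wins: Wins, players: List[str], iters: int = 200) -> Dict[str, float]:
    """Estimate Bradley-Terry strengths with the MM update."""
    strength = {p: 1.0 for p in players}
    for _ in range(iters):
        updated = {}
        for i in players:
            total_wins = sum(wins.get((i, j), 0) for j in players if j != i)
            denom = sum(
                (wins.get((i, j), 0) + wins.get((j, i), 0)) / (strength[i] + strength[j])
                for j in players if j != i
            )
            updated[i] = total_wins / denom if denom > 0 else strength[i]
        norm = sum(updated.values())
        if norm == 0:  # nobody has won yet; keep the previous estimate
            break
        # Rescale so strengths sum to the number of players.
        strength = {p: v * len(players) / norm for p, v in updated.items()}
    return strength

def next_matchup(wins: Wins, players: List[str], strength: Dict[str, float]) -> Tuple[str, str]:
    """Assumed heuristic: prefer uncertain outcomes between rarely matched pairs."""
    def priority(pair: Tuple[str, str]) -> float:
        a, b = pair
        p = strength[a] / (strength[a] + strength[b])
        games = wins.get((a, b), 0) + wins.get((b, a), 0)
        return p * (1 - p) / (games + 1)
    return max(combinations(players, 2), key=priority)

players = ["a", "b", "c"]
wins = {("a", "b"): 6, ("b", "a"): 4, ("b", "c"): 7, ("c", "b"): 3, ("a", "c"): 1}
strength = fit_bradley_terry(wins, players)
print(next_matchup(wins, players, strength))  # ('a', 'c')
```

In this toy data the pair a and c has met only once, so it is scheduled next even though the a versus b result is closer to a coin flip.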

What is a proprietary novel dataset in AI evaluation?

A proprietary novel dataset is a benchmark created from deep domain expertise that does not exist anywhere on the web and is not economically productive for AI labs to pursue independently. Examples include a 20-year wastewater treatment engineer's safety protocols or niche industrial compliance procedures. These datasets are the most valuable benchmarks because they test knowledge that AI models cannot have memorized from training data, producing genuine capability assessments.

How do I calibrate benchmark difficulty for AI agents?

Run a pilot with 3-5 representative agents. If none complete the task, it is too hard and produces no signal. If all score above 90%, it is too easy and produces no differentiation. Adjust task complexity until the pilot shows meaningful spread across agents. For agentic tasks, also factor in token cost — tasks that are too long become prohibitively expensive at scale. This calibration step is called the Difficulty Spectrum.
