Kaggle DeepMind Agentic Evals at Scale Framework
Design and deploy community-scale agentic evaluation systems that are transparent, unsaturatable, and accessible to non-expert contributors — not just AI labs.
// TL;DR
The Kaggle DeepMind Agentic Evals at Scale Framework is a systematic approach for designing, deploying, and maintaining AI evaluation systems that are transparent, unsaturatable, and open to non-expert contributors — not just AI labs. Use it when you need to build or audit AI benchmarks, especially when existing evaluations are stale, opaque, or fail to cover domain-specific knowledge. It introduces PvP Game Arena architectures for evergreen benchmarking, standardized agent exams for consumer developers, and hackathon-driven community pipelines to recruit domain experts whose knowledge doesn't exist in any training set.
// When should I use the Kaggle DeepMind Agentic Evals at Scale Framework?
Use this skill when you need to build, audit, or expand an AI evaluation or benchmarking program — especially when evaluations are stale, opaque, narrowly authored, or fail to cover domain-specific knowledge that mainstream AI labs ignore. Also applicable when deciding how to structure agent testing before production deployment.
// What inputs do I need to get started with agentic evals at scale?
- evaluation_target (required)
What is being evaluated: a model, an agent, a harness, or a combination? Be explicit.
- domain_or_capability (required)
The specific capability or domain the benchmark should cover (e.g., wastewater treatment safety protocols, poker deception, coding under ambiguity).
- audience_type (required)
Who will build and consume these evals: research labs, enterprise teams, consumer agent builders, or community contributors?
- saturation_risk
Is there a risk the benchmark will be solved/saturated quickly? Describe the current state of model performance on similar tasks.
- transparency_requirements
What level of openness is required: fully open source, reproducible configs, public leaderboard, or proprietary?
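The inputs above could be captured as a small typed record so that a missing required field fails early. This is only a sketch: the `EvalSpec` class, the `EVALUATION_TARGETS` set, and the validation rules are illustrative assumptions, not part of the framework itself.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed vocabulary, taken from the wording of the evaluation_target input.
EVALUATION_TARGETS = {"model", "agent", "harness", "combination"}


@dataclass(frozen=True)
class EvalSpec:
    """Hypothetical container for the five inputs; the first three are required."""
    evaluation_target: str
    domain_or_capability: str
    audience_type: str
    saturation_risk: Optional[str] = None
    transparency_requirements: Optional[str] = None

    def __post_init__(self) -> None:
        if self.evaluation_target not in EVALUATION_TARGETS:
            raise ValueError(
                f"evaluation_target must be one of {sorted(EVALUATION_TARGETS)}"
            )
        for name in ("domain_or_capability", "audience_type"):
            if not getattr(self, name).strip():
                raise ValueError(f"{name} is required and must not be empty")


spec = EvalSpec(
    evaluation_target="agent",
    domain_or_capability="wastewater treatment safety protocols",
    audience_type="community contributors",
    transparency_requirements="fully open source",
)
```

Optional inputs default to `None`, so a spec can be drafted before saturation risk has been assessed.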
// What core principles guide the design of scalable agentic evaluations?
Evals Are Scattered, Decentralized, and Get Stale Fast
Ten or more benchmarks are published daily on arXiv. Authors move on to the next benchmark after publishing, leaving leaderboards stale and irrelevant. Any eval system must have a mechanism to stay live and current, not just snapshot a moment in time.
Transparency, Accessibility, and Verifiability
Benchmark charts from model publishers are not trustworthy by default: configurations, prompt strategies, and API features (e.g., context compaction) can be tuned to favor the publisher's own model. Every eval must expose its full configuration so that any third party can reproduce the results.
The Democratization Problem (Small Circles, Big World)
Approximately 30,000 AI researchers create nearly all evals for a world of 30 million technical professionals and billions of end users. If a capability isn't benchmarked, you cannot hill-climb on it. Uneven eval coverage produces cognitive jaggedness — superhuman performance in some areas, mediocre performance in others — and that is not equitable AI.
Proprietary Novel Data Sets from Domain Experts
The most valuable benchmarks contain knowledge that doesn't exist anywhere on the web and isn't economically productive for AI labs to pursue — like a 20-year wastewater treatment engineer's safety protocols. The eval ecosystem must actively recruit and empower these domain experts, not just AI researchers.
You're Testing the Harness, Not Just the Model
For agentic benchmarks, the harness (execution environment, tool scaffolding, prompt structure) can account for a 22%+ performance difference across the same frontier models on the same task (per SWE-Bench data). Always be explicit about what is actually under test: the model, the agent, or the harness.
Unsaturatable Benchmarks via PvP (Game Arena Principle)
Static benchmarks saturate: models reach ceiling performance and the signal disappears. PvP (model vs. model) structures scored with Elo or Bradley-Terry ratings are inherently unsaturatable because ratings are relative. There is always one winner and one loser, so there is no maximum score to hit. Use this design for long-running, evergreen evaluation.
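To make the scoring idea concrete, here is a minimal sketch of the textbook Elo update for a single game. The K-factor of 32 and the 400-point scale are conventional defaults chosen for illustration. The source does not state which parameters Game Arena uses, and the draw case (`score_a = 0.5`) is included only for completeness.

```python
from typing import Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score of A against B under the Elo model (between 0 and 1)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(
    rating_a: float, rating_b: float, score_a: float, k: float = 32.0
) -> Tuple[float, float]:
    """Return the new (rating_a, rating_b) after one game.

    score_a is 1.0 if A won, 0.0 if A lost, and 0.5 for a draw.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    # Whatever A gains, B loses, so the rating pool stays constant.
    return rating_a + delta, rating_b - delta


new_a, new_b = update_elo(1500.0, 1500.0, score_a=1.0)
```

Because points only move between players, a model's rating can only rise by beating opponents, which is why there is no fixed ceiling to saturate.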
The Difficulty Spectrum
Benchmarks must be calibrated on a spectrum: too hard and no agent completes it (no signal); too easy and performance differences collapse (no signal). Useful benchmarks occupy the zone where meaningful differentiation is possible without being prohibitively expensive to run.
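A pilot check along these lines can be sketched in a few lines. The 90% ceiling comes from the calibration step later in this document. The `min_spread` threshold, the label names, and the use of a zero score to mean "did not complete" are assumptions made for illustration.

```python
from typing import Mapping


def classify_pilot(
    scores: Mapping[str, float], ceiling: float = 0.90, min_spread: float = 0.10
) -> str:
    """Classify a pilot run from per-agent scores in the range 0.0 to 1.0."""
    if not scores:
        raise ValueError("pilot needs at least one agent score")
    values = list(scores.values())
    if max(values) == 0.0:
        return "too_hard"    # no agent completed anything: no signal
    if min(values) > ceiling:
        return "too_easy"    # every agent is above the ceiling: no signal
    if max(values) - min(values) < min_spread:
        return "low_spread"  # assumed extra check: agents are barely separated
    return "usable"


verdict = classify_pilot({"agent_a": 0.20, "agent_b": 0.60, "agent_c": 0.85})
```

Here `verdict` is `"usable"` because the three hypothetical agents are neither all failing nor all near the ceiling.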
// How do you apply the Kaggle DeepMind Agentic Evals Framework step by step?
- 1
Define what is actually under test
Explicitly separate: (a) the model, (b) the agent, (c) the harness. Do not conflate them. A benchmark that mixes these produces misleading results. State upfront which of the three you are evaluating and lock the other two as controlled constants.
- 2
Identify the domain expertise gap
Ask: is this capability covered by existing benchmarks? If the knowledge lives only in a practitioner's head or proprietary field (like wastewater treatment, niche legal domains, rare engineering disciplines), treat it as a Proprietary Novel Data Set opportunity. Recruit that domain expert as the benchmark author, not an AI researcher.
- 3
Choose the eval architecture based on saturation risk
If the task space is finite and models will likely saturate it within 12-18 months, use a PvP / Game Arena architecture with Elo scoring. If the task space is open-ended or domain knowledge is proprietary, use an assertion-based benchmark with LLM-as-judge scoring. Never use a static leaderboard as your only output.
- 4
Design assertions and tasks with explicit verifiability
Write assertions that any third party can verify independently. Combine hard-coded checks (does the output contain X?) with LLM judging for qualitative dimensions. Group assertions into Tasks; group Tasks into Benchmarks. Expose all configuration publicly so no one can accuse you of optimizing for a specific model.
- 5
Calibrate difficulty on the Difficulty Spectrum
Run a small pilot with 3-5 representative agents. If none complete the task, it is too hard. If all score above 90%, it is too easy. Adjust task complexity until the pilot shows meaningful spread across agents. For agentic tasks, factor in token cost — tasks that are too long become prohibitively expensive at scale.
- 6
Build or select a standardized harness and document it completely
Use a consistent LLM model proxy layer so all models are called identically. Document: model API version, context window used, whether features like context compaction are enabled or disabled, temperature, and any other configs. The wastewater engineer's benchmark and a frontier lab's benchmark must run under identical harness conditions to be comparable.
- 7
Deploy a Standardized Agent Exam for consumer/low-resource users
For agents that need a quick baseline before deployment (e.g., an OpenAI operator agent or a custom automation), provide a one-line prompt interface that returns a score on a public leaderboard. This covers the consumer end of the spectrum — most agent builders are not running any evals before deploying. Prioritize safety-focused exam tracks.
- 8
Use hackathons to channel domain expertise at scale
Hackathons are the mechanism to convert the Democratization Problem into a strength. Define guardrails (focus areas, e.g., five specific cognitive faculties) but give participants creative latitude. Provide free access to: data hosting, API credits for state-of-the-art models, and writeup tools so work is understandable and reusable by others. All outputs must be open source.
- 9
Apply Bradley-Terry pairwise scheduling to control compute costs
Running full round-robins to statistical significance is extremely expensive (e.g., 400,000 poker hands for one game). Use Bradley-Terry pairwise comparison to choose which matchups to run next, prioritizing those with the most expected information gain given the current uncertainty in Elo ratings. Always track and publish the full LLM conversation logs as a dataset for community analysis.
- 10
Publish results with full reproducibility artifacts
Publish: the benchmark tasks, the harness config, the raw LLM conversation logs, the Elo/score methodology, and a game visualizer or qualitative examples where possible. A leaderboard without these artifacts is marketing, not science. Make everything forkable and open source.
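The harness documentation from step 6 and the artifact list from step 10 can be tied together with a published manifest. The sketch below is one possible shape, not a format defined by the framework: the field names, file paths, and the `example-2025-01` version string are placeholders. The SHA-256 fingerprint lets a third party confirm that two runs used the same harness settings.

```python
import hashlib
import json


def config_fingerprint(config: dict) -> str:
    """SHA-256 of a canonical JSON encoding, independent of key order."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder values; the keys mirror the settings step 6 asks you to document.
harness_config = {
    "model_api_version": "example-2025-01",
    "context_window_tokens": 128000,
    "context_compaction": False,
    "temperature": 0.0,
}

# Hypothetical manifest listing the artifacts step 10 asks you to publish.
manifest = {
    "benchmark_tasks": "tasks.jsonl",
    "harness_config": harness_config,
    "harness_config_sha256": config_fingerprint(harness_config),
    "conversation_logs": "logs/",
    "scoring_methodology": "METHODOLOGY.md",
}
```

Any change to a setting, such as enabling context compaction for one model, produces a different fingerprint and is therefore visible in the published results.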
// What are real-world examples of this agentic eval framework in action?
A 20-year domain specialist in a safety-critical industrial field (e.g., chemical plant operations) wants to evaluate whether frontier AI models can correctly follow safety protocols in their domain.
Treat this as a Proprietary Novel Data Set. The expert authors the benchmark tasks from lived experience — no AI lab has this data. Build assertion-based tasks (does the model recommend the correct emergency shutdown procedure?), add LLM-as-judge for nuanced safety reasoning, calibrate difficulty by checking whether models pass basic protocol questions but fail edge-case incident scenarios. Publish fully open source so the domain community can validate and extend it.
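The Assertions, Tasks, Benchmarks grouping used in this example might look like the sketch below. It is not the Kaggle Benchmarks API: the class names, the `contains` helper, and `stub_judge` (a stand-in for a real LLM-as-judge call) are all hypothetical, and the task prompt is a placeholder rather than real safety guidance.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Assertion:
    name: str
    check: Callable[[str], bool]  # returns True when the output passes


@dataclass
class Task:
    prompt: str
    assertions: List[Assertion]

    def score(self, output: str) -> float:
        """Fraction of this task's assertions that the output passes."""
        passed = sum(1 for a in self.assertions if a.check(output))
        return passed / len(self.assertions)


@dataclass
class Benchmark:
    name: str
    tasks: List[Task] = field(default_factory=list)

    def run(self, agent: Callable[[str], str]) -> float:
        """Mean task score for an agent, where an agent maps prompt to output."""
        return sum(t.score(agent(t.prompt)) for t in self.tasks) / len(self.tasks)


def contains(phrase: str) -> Callable[[str], bool]:
    """Hard-coded check: does the output contain the phrase?"""
    return lambda output: phrase.lower() in output.lower()


def stub_judge(output: str) -> bool:
    """Placeholder for an LLM judge; only checks that a reason is given."""
    return "because" in output.lower()


benchmark = Benchmark(
    name="safety-protocols-demo",
    tasks=[
        Task(
            prompt="Placeholder incident scenario: what should the operator do?",
            assertions=[
                Assertion("mentions_shutdown", contains("emergency shutdown")),
                Assertion("gives_reason", stub_judge),
            ],
        )
    ],
)


def careful_agent(prompt: str) -> str:
    return "Begin the emergency shutdown because pressure is rising."


def careless_agent(prompt: str) -> str:
    return "Keep the line running."
```

In practice the domain expert would write the prompts and the expected content, and `stub_judge` would be replaced by a call to a judging model with a published rubric.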
A consumer developer has built an AI agent that manages their email and calendar and wants to do a quick safety baseline before deployment.
Use the Standardized Agent Exam pattern: pass a one-line system prompt describing the agent to the exam endpoint, receive a score on a public leaderboard. Run safety-focused exam tracks first (does the agent refuse to forward sensitive data? does it respect scope boundaries?). Compare against 500+ already-evaluated agents on the leaderboard to contextualize performance without building a custom eval harness.
An AI team wants to benchmark models on a capability (e.g., multi-step deception in negotiation) without the benchmark saturating within a year.
Use the Game Arena / PvP architecture. Design a game that isolates the deception capability (e.g., a negotiation game with hidden information). Implement Elo scoring with Bradley-Terry pairing to minimize games needed. Observe emergent model personalities (e.g., risk-seeking vs. risk-averse behaviors) as a secondary signal. The benchmark is permanently unsaturatable because one model always loses.
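For readers unfamiliar with Bradley-Terry, the model assigns each player a positive strength and predicts that A beats B with probability strength_A / (strength_A + strength_B). The sketch below fits those strengths from win counts using the standard minorization-maximization iteration. The function names and the match data are made up for illustration and do not come from Game Arena.

```python
from typing import Dict, Tuple


def fit_bradley_terry(
    wins: Dict[Tuple[str, str], int], iterations: int = 200
) -> Dict[str, float]:
    """wins[(a, b)] is how often a beat b. Returns strengths that sum to 1."""
    players = sorted({p for pair in wins for p in pair})
    strength = {p: 1.0 for p in players}
    for _ in range(iterations):
        updated = {}
        for i in players:
            total_wins = sum(w for (winner, _), w in wins.items() if winner == i)
            denom = 0.0
            for j in players:
                if j == i:
                    continue
                games = wins.get((i, j), 0) + wins.get((j, i), 0)
                if games:
                    denom += games / (strength[i] + strength[j])
            updated[i] = total_wins / denom if denom else strength[i]
        norm = sum(updated.values())
        strength = {p: s / norm for p, s in updated.items()}
    return strength


def win_probability(strength: Dict[str, float], a: str, b: str) -> float:
    """Model-predicted probability that a beats b."""
    return strength[a] / (strength[a] + strength[b])


# Made-up results for three hypothetical models.
results = {
    ("model_a", "model_b"): 7, ("model_b", "model_a"): 3,
    ("model_b", "model_c"): 6, ("model_c", "model_b"): 4,
    ("model_a", "model_c"): 8, ("model_c", "model_a"): 2,
}
strengths = fit_bradley_terry(results)
```

A player with zero wins is driven to strength 0 by this iteration, so a real pipeline would usually add a prior or a minimum number of games before reporting rankings.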
// What mistakes should I avoid when building agentic evaluations at scale?
- Benchmarking the model when you are actually benchmarking the harness — the same six frontier models can differ by 22%+ on identical tasks depending on harness configuration.
- Optimizing benchmark configuration for your own model (e.g., enabling proprietary API features only for your model) and publishing results as neutral. Always use identical harness settings across all evaluated models.
- Publishing a static leaderboard and then letting it go stale as authors move to the next paper. Either assign ongoing maintenance ownership or architect an unsaturatable (PvP) format from the start.
- Making agent exams too hard (no agent finishes, no signal) or too easy (all agents score >90%, no differentiation). Calibrate on the Difficulty Spectrum before public launch.
- Assuming AI researchers are sufficient to cover all important capabilities. The world's most important domain knowledge lives in practitioners who are not AI researchers — wastewater engineers, medical specialists, tradespeople. Failing to recruit them produces cognitively jagged AI.
- Ignoring compute costs in benchmark design. Running PvP games at statistical significance can require hundreds of thousands of game instances. Design with Bradley-Terry pairwise scheduling and cost ceilings before starting.
- Conflating model benchmarks with agent benchmarks. Answering a riddle such as "What gets wetter as it dries?" (a towel) is a model task. An agent completing a multi-step workflow involving tools, memory, and external APIs requires a completely different assertion and harness design.
- Relying on human expert judgment without planning for inter-expert alignment — even domain experts disagree, and AI cannot reliably judge innovation or creativity. Build explicit alignment workflows for human review stages.
// What are the key terms and concepts in the Kaggle DeepMind Agentic Evals Framework?
- Game Arena
- A PvP benchmarking platform where AI models play games against each other with Elo/Bradley-Terry scoring. Inherently unsaturatable because there is always a winner and a loser, unlike static benchmarks that models eventually max out.
- Standardized Agent Exams
- A one-line prompt interface that allows any agent developer to submit their agent, have it take a standardized exam, and receive a score on a public leaderboard — democratizing eval access for consumer and low-resource agent builders.
- Benchmark (Kaggle Benchmarks product)
- A platform enabling any community member to build, run, and share evaluations openly. Structured as: Assertions → Tasks → Benchmarks, with LLM-as-judge and hard-coded checks combined.
- Proprietary Novel Data Set
- A benchmark dataset created from deep domain expertise that does not exist anywhere on the web and is not covered by AI lab research because it lacks immediate economic productivity — e.g., a wastewater treatment safety protocol benchmark built by a 20-year industry veteran.
- Hill-climbing
- The process of iteratively improving model performance by measuring against a benchmark. If something is not benchmarked, you cannot hill-climb on it — the capability will not improve systematically.
- Cognitive Jaggedness
- The uneven capability profile of AI models — superhuman in benchmarked areas, mediocre or untested in areas that lack evaluations. Caused by the Democratization Problem where a small number of researchers create all evals.
- Democratization Problem
- The structural imbalance where ~30,000 AI researchers create nearly all benchmarks for a world of 30M+ technical professionals and billions of end users, leaving vast swathes of human knowledge unevaluated.
- Harness
- The execution environment, scaffolding, tooling, and prompt structure surrounding a model during evaluation. The harness can account for 22%+ performance differences on identical tasks — making it critical to specify and control when comparing models.
- Difficulty Spectrum
- The calibration axis for benchmarks: too hard means no agent completes the task (no signal); too easy means no differentiation between agents (no signal). Useful benchmarks occupy the meaningful middle zone.
- Bradley-Terry Pairing
- A pairwise-comparison scheduling approach used in Game Arena, built on the Bradley-Terry model of match outcomes, that maximizes the statistical significance of Elo rankings while minimizing the number of games (and thus API cost) required.
- LLM Model Proxy
- A consistent interface layer that routes requests to multiple different model APIs in a standardized way, ensuring all models in a benchmark run are called under identical conditions.
- Saturation
- The state where models reach ceiling performance on a benchmark, eliminating its usefulness as a signal. Static benchmarks inevitably saturate; PvP Game Arena architectures are designed to be permanently unsaturatable.
// FREQUENTLY ASKED QUESTIONS
What is the Kaggle DeepMind Agentic Evals at Scale Framework?
It is a structured methodology for building AI evaluation systems that are transparent, resistant to saturation, and accessible to domain experts beyond the AI research community. Developed from practices at Google DeepMind and Kaggle, it separates model, agent, and harness testing; uses PvP Game Arena architectures with Elo scoring for evergreen benchmarks; and creates pathways for non-AI practitioners, such as safety engineers or medical specialists, to author benchmarks from proprietary domain knowledge that doesn't exist on the web.
What is a PvP Game Arena benchmark and why is it unsaturatable?
A PvP Game Arena is a benchmarking platform where AI models compete head-to-head with Elo/Bradley-Terry scoring. It is inherently unsaturatable because every match produces a winner and a loser, unlike static benchmarks where models eventually hit ceiling performance and the leaderboard loses signal. This design keeps the benchmark useful indefinitely, making it ideal for capabilities like negotiation, strategy, or deception where you need long-running evaluation.
How do I build an AI evaluation benchmark from scratch using this framework?
Start by explicitly separating what you're testing: the model, the agent, or the harness. Then identify domain expertise gaps — knowledge that only practitioners have. Choose your eval architecture based on saturation risk: PvP Arena for finite task spaces, assertion-based with LLM-as-judge for open-ended domains. Design verifiable assertions, calibrate difficulty by piloting with 3-5 agents, standardize your harness configuration, and publish all artifacts — tasks, configs, logs, and methodology — for full reproducibility.
How do I tell if I'm benchmarking the model or the harness?
You must explicitly define and control three components: the model, the agent, and the harness (execution environment, tooling, prompt structure). SWE-Bench data shows the same frontier models can differ by 22%+ on identical tasks depending on harness configuration. If you change the harness between model runs, you are benchmarking the harness, not the model. Lock two of the three components as controlled constants and only vary the one under test.
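One way to enforce the "lock two, vary one" rule is to check it mechanically before comparing results. The sketch below is an assumption about how that check could be written. `RunConfig` and both function names are hypothetical, and each component is reduced to a single identifier string such as a version tag or config hash.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class RunConfig:
    model: str
    agent: str
    harness: str


def varied_components(runs: List[RunConfig]) -> Set[str]:
    """Names of the components that take more than one value across runs."""
    return {
        name
        for name in ("model", "agent", "harness")
        if len({getattr(run, name) for run in runs}) > 1
    }


def assert_single_variable(runs: List[RunConfig], under_test: str) -> None:
    """Raise if anything other than the component under test changed."""
    extra = varied_components(runs) - {under_test}
    if extra:
        raise ValueError(
            f"also varied: {sorted(extra)}; results would not isolate {under_test}"
        )


clean_runs = [
    RunConfig(model="model_a", agent="agent_v1", harness="harness_v1"),
    RunConfig(model="model_b", agent="agent_v1", harness="harness_v1"),
]
assert_single_variable(clean_runs, under_test="model")
```

If a second harness identifier appears among the runs, the check raises, which flags that the comparison is measuring the harness as well as the model.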
How does this framework compare to just using standard benchmarks like MMLU or SWE-Bench?
Standard benchmarks like MMLU saturate quickly and are authored by a tiny pool of AI researchers, leaving vast domains unevaluated. This framework addresses both problems: PvP architectures prevent saturation, and community hackathon pipelines recruit domain experts to cover knowledge gaps. It also enforces harness transparency — standard benchmarks often let publishers tune configurations to favor their own models. The framework requires identical harness conditions across all evaluated models and full reproducibility artifacts.
When should I use this framework instead of running my own ad-hoc AI evaluation?
Use this framework when your evaluations are stale, opaque, narrowly authored, or fail to cover domain-specific knowledge. It's especially valuable when you need benchmarks that won't saturate within 12-18 months, when you want community contributors beyond AI researchers to author evals, when you need transparent and reproducible results that third parties can verify, or when you're deploying agents to production and need a standardized safety baseline before launch.
What is cognitive jaggedness in AI and how do evals cause it?
Cognitive jaggedness is the uneven capability profile of AI models — superhuman in areas that have benchmarks, mediocre or untested in areas that lack them. It's caused by the Democratization Problem: roughly 30,000 AI researchers create nearly all evaluations for a world of 30+ million technical professionals. If a capability isn't benchmarked, teams can't hill-climb on it, so it doesn't improve systematically. This framework combats jaggedness by recruiting domain experts to author evals in underserved areas.
How do I run a community hackathon to generate AI benchmarks?
Define focus areas — for example, five specific cognitive faculties you want evaluated — and set guardrails, but give participants creative latitude in task design. Provide free access to data hosting, API credits for frontier models, and writeup tools so outputs are understandable and reusable. Require all hackathon outputs to be open source. This converts the Democratization Problem into a strength by channeling diverse domain expertise into structured, publishable benchmarks at scale.
What is a Standardized Agent Exam?
A Standardized Agent Exam is a one-line prompt interface that lets any agent developer submit their agent's system prompt, have it take a standardized exam, and receive a score on a public leaderboard. It democratizes evaluation access for consumer and low-resource agent builders who are not running any evals before deployment. Safety-focused exam tracks are prioritized, and scores are contextualized against 500+ already-evaluated agents on the leaderboard.
What results can I expect after implementing this framework?
You can expect benchmarks that remain useful over time instead of saturating, transparent results that withstand third-party scrutiny, and evaluation coverage in domains previously ignored by AI labs. Concretely: PvP arenas produce evergreen Elo rankings, standardized agent exams give consumer developers a safety baseline in minutes, and community hackathons generate proprietary novel datasets from domain experts. You'll also have full reproducibility artifacts (configs, logs, and methodology) that establish credibility with stakeholders.
What is Bradley-Terry pairing and how does it reduce compute costs?
Bradley-Terry pairing is a pairwise comparison scheduling method used in Game Arena to maximize the statistical significance of Elo rankings while minimizing the number of games required. Instead of running a full round-robin, which can require hundreds of thousands of game instances, it selects matchups with the highest information gain given current Elo uncertainty. This lets you reach statistically significant rankings at a fraction of the compute cost of exhaustive tournament play.
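The source does not spell out the selection rule, so the following is only a plausible heuristic, not Game Arena's actual scheduler. It scores each pair by p(1 - p), the per-game information about the rating gap in a logistic model (largest when the outcome is closest to a coin flip), and discounts pairs that have already played many games. The function names and the `played + 1` discount are assumptions.

```python
from itertools import combinations
from typing import Dict, Tuple


def win_probability(rating_a: float, rating_b: float) -> float:
    """Predicted probability that A beats B on the Elo scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def next_matchup(
    ratings: Dict[str, float], games_played: Dict[Tuple[str, str], int]
) -> Tuple[str, str]:
    """Pick the pair whose next game is expected to be most informative.

    games_played is keyed by alphabetically sorted (a, b) name pairs.
    """
    def priority(pair: Tuple[str, str]) -> float:
        a, b = pair
        p = win_probability(ratings[a], ratings[b])
        played = games_played.get((a, b), 0)
        return p * (1.0 - p) / (played + 1)  # heuristic discount, an assumption

    return max(combinations(sorted(ratings), 2), key=priority)


ratings = {"m1": 1500.0, "m2": 1510.0, "m3": 1900.0}
first_pick = next_matchup(ratings, games_played={})
later_pick = next_matchup(ratings, games_played={("m1", "m2"): 10})
```

With no games played, the closely rated pair (`m1`, `m2`) is chosen first. Once that pair has ten games on record, the scheduler moves on to a less explored matchup.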
What is a Proprietary Novel Data Set in this framework?
A Proprietary Novel Data Set is a benchmark created from deep domain expertise that doesn't exist anywhere on the web and isn't economically productive for AI labs to pursue. Examples include a 20-year wastewater treatment engineer's safety protocols or a rare engineering discipline's failure mode catalog. These datasets are the most valuable benchmarks because they test knowledge no model has been trained on, producing genuine signal about a model's reasoning capabilities rather than memorization.