How Can Domain Experts Build AI Benchmarks From Their Expertise?

For domain experts in safety-critical industries (engineering, healthcare, compliance) · Based on the Kaggle DeepMind Agentic Evals at Scale Framework

// TL;DR

Domain experts in safety-critical fields — wastewater treatment, chemical plant operations, medical specialties, regulatory compliance — hold the most valuable knowledge for AI benchmarking: knowledge that doesn't exist on the web and that AI labs have no economic incentive to pursue. The Kaggle DeepMind Agentic Evals at Scale Framework provides a pathway for these experts to author Proprietary Novel Datasets as benchmarks, combining assertion-based tasks with LLM-as-judge scoring. Published openly, these benchmarks let the entire AI community hill-climb on real-world safety capabilities.

Why Is Domain Expert Knowledge the Most Valuable AI Benchmark Data?

The most valuable AI benchmarks contain knowledge that doesn't exist anywhere on the web. A 20-year wastewater treatment engineer's safety protocols, a chemical plant operator's emergency shutdown procedures, a medical specialist's diagnostic edge cases — this knowledge lives in practitioners' heads, not in training data.

AI labs have no economic incentive to benchmark these domains. The result is what the Kaggle DeepMind framework calls cognitive jaggedness: AI models are superhuman in heavily benchmarked areas like coding and math, but mediocre or untested in the safety-critical domains where mistakes cause real harm. If a capability isn't benchmarked, teams cannot hill-climb on it — meaning AI performance in your domain will not improve systematically.

You are the solution to the Democratization Problem. Approximately 30,000 AI researchers create nearly all benchmarks for billions of end users. Your specialized knowledge is exactly what the eval ecosystem is missing.

How Do You Turn Your Domain Expertise Into an AI Benchmark?

You don't need an AI research background. The framework structures your knowledge into three layers:

1. Assertions: Specific, verifiable checks. "Does the model recommend the correct emergency shutdown procedure for a chlorine gas leak?" "Does the model identify that mixing these two chemicals produces a toxic reaction?" Write these from your lived experience — the incidents you've seen, the protocols you've memorized, the edge cases that trip up even experienced colleagues.

2. Tasks: Groups of related assertions that test a coherent capability. A task might be "Handle a chemical spill incident" with 5-8 assertions covering detection, containment, notification, and cleanup.

3. Benchmarks: Collections of tasks that cover a domain area. Your benchmark might cover "Chemical Plant Emergency Response" with 10-15 tasks spanning different incident types.
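If it helps to see the three layers as a data structure, here is a minimal sketch in Python. The class names (Assertion, Task, Benchmark) and the assertion_count helper are illustrative assumptions, not the framework's or Kaggle's actual API; the example questions are taken from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Assertion:
    """One specific, verifiable check written by a domain expert."""
    question: str


@dataclass
class Task:
    """A group of related assertions that test one coherent capability."""
    name: str
    assertions: list[Assertion] = field(default_factory=list)


@dataclass
class Benchmark:
    """A collection of tasks that covers one domain area."""
    name: str
    tasks: list[Task] = field(default_factory=list)

    def assertion_count(self) -> int:
        # Total number of checks across every task in the benchmark.
        return sum(len(task.assertions) for task in self.tasks)


# Hypothetical example built from the questions quoted in the text.
spill = Task(
    name="Handle a chemical spill incident",
    assertions=[
        Assertion("Does the model recommend the correct emergency shutdown "
                  "procedure for a chlorine gas leak?"),
        Assertion("Does the model identify that mixing these two chemicals "
                  "produces a toxic reaction?"),
    ],
)
benchmark = Benchmark(name="Chemical Plant Emergency Response", tasks=[spill])
print(benchmark.assertion_count())
```

A real task would carry the 5-8 assertions described above; two are shown here only to keep the sketch short.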

Combine hard-coded checks (did the model name the correct chemical?) with LLM-as-judge scoring for nuanced reasoning (is the model's reasoning about containment priorities sound?).
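One way to picture the two kinds of scoring side by side is sketched below. Everything here is an assumption for illustration: hard_coded_check is a simple term match, and placeholder_judge is a stand-in that a real setup would replace with a call to an actual LLM judge. The sample answer is invented text, not safety guidance.

```python
from typing import Callable

# A judge takes (rubric, model_answer) and returns True if the answer passes.
Judge = Callable[[str, str], bool]


def hard_coded_check(answer: str, required_terms: list[str]) -> bool:
    """Pass only if every required term appears in the answer (case-insensitive)."""
    lowered = answer.lower()
    return all(term.lower() in lowered for term in required_terms)


def placeholder_judge(rubric: str, answer: str) -> bool:
    """Stand-in for an LLM-as-judge call (assumption, not a real judge).

    It only checks that the answer states a reason, so the sketch runs offline.
    """
    return "because" in answer.lower()


def score_answer(answer: str, required_terms: list[str],
                 rubric: str, judge: Judge) -> dict[str, bool]:
    """Run the hard-coded check and the judge on the same answer."""
    return {
        "hard_coded": hard_coded_check(answer, required_terms),
        "judge": judge(rubric, answer),
    }


# Invented sample answer, used only to exercise the two checks.
sample_answer = ("The leaking gas is chlorine. Contain the source before "
                 "cleanup because exposure grows with time.")
result = score_answer(
    sample_answer,
    required_terms=["chlorine"],
    rubric="Is the model's reasoning about containment priorities sound?",
    judge=placeholder_judge,
)
print(result)
```

Keeping the two results separate, rather than merging them into one number, makes it easier to see whether a model failed on a fact or on its reasoning.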

How Do You Calibrate Whether Your Benchmark Is Too Hard or Too Easy?

Test your benchmark against 3-5 current AI models before publishing. If no model gets any assertions right, your benchmark is too specialized for current AI — but bank those tasks for future use as models improve. If all models score above 90%, your questions are too basic — add the edge cases that would trip up a junior employee in your field.

The sweet spot is the Difficulty Spectrum's middle zone: models pass basic protocol questions but fail on the incident edge cases that require years of field experience. This produces meaningful signal about which models are actually safer.
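The calibration rule above can be written as a small check. This is a sketch under stated assumptions: the function name calibrate, the model labels, and the pass rates are hypothetical, and the only thresholds used are the two named in the text (no assertions passed, and every model above 90%).

```python
def calibrate(pass_rates: dict[str, float], ceiling: float = 0.90) -> str:
    """Classify a benchmark from per-model pass rates (fraction of assertions passed)."""
    if not pass_rates:
        raise ValueError("test against at least one model before calibrating")
    rates = list(pass_rates.values())
    if all(rate == 0.0 for rate in rates):
        # No model passed anything: keep the tasks for future, stronger models.
        return "too hard: bank these tasks for future use"
    if all(rate > ceiling for rate in rates):
        # Every model clears the ceiling: the questions are too basic.
        return "too easy: add harder edge cases"
    return "middle zone: meaningful signal"


# Hypothetical pass rates for three unnamed models.
print(calibrate({"model_a": 0.0, "model_b": 0.0, "model_c": 0.0}))
print(calibrate({"model_a": 0.95, "model_b": 0.97, "model_c": 0.92}))
print(calibrate({"model_a": 0.40, "model_b": 0.65, "model_c": 0.55}))
```

The text recommends 3-5 models; the sketch accepts any number so you can start small and add models as you go.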

What Happens After You Publish Your Benchmark?

Publish your benchmark fully open source so the domain community can validate and extend it. Other experts in your field will find assertions they disagree with — that's valuable. Build explicit alignment workflows: have multiple experts review contested tasks and document where disagreement persists.

Your benchmark becomes a hill-climbing target for AI labs. Each model improvement on it is measurable evidence that AI is getting safer in your domain. The conversation logs from model runs become a dataset that researchers and other practitioners can analyze. You've converted knowledge that existed only in your head into a permanent, scalable force for AI safety improvement.

Hackathons organized through platforms like Kaggle provide free access to data hosting, API credits, and writeup tools so your work is accessible and reusable.

Next step: Write 5 assertion-based questions from your domain expertise that you believe no current AI model can answer correctly. Test them against GPT-4 and Claude. If you find the expected failures, you have the seed of a Proprietary Novel Dataset that the entire AI community needs.

// FREQUENTLY ASKED QUESTIONS

Do I need to know how to code to build an AI benchmark from my domain expertise?

No coding background is required to author benchmark tasks. You write assertions in natural language — specific, verifiable statements about what a correct answer must contain. Platforms like Kaggle Benchmarks provide tools to structure your assertions into Tasks and Benchmarks. Hard-coded checks and LLM-as-judge scoring handle the technical evaluation. Your value is your domain knowledge, not your programming ability.

Why would AI labs care about a benchmark built by a wastewater engineer or a nurse?

AI labs care because they cannot build these benchmarks themselves: the knowledge exists neither in their training data nor on their research teams. Your benchmark becomes a hill-climbing target: once it exists, labs can measure and improve model performance on your domain. Without your benchmark, AI performance in your domain stagnates. The Kaggle DeepMind framework specifically identifies Proprietary Novel Datasets from domain experts as the highest-value benchmarks in the ecosystem.

What if other experts in my field disagree with my benchmark answers?

Disagreement among domain experts is expected and valuable. Build explicit inter-expert alignment workflows: have multiple experts independently review the same tasks, measure agreement rates, and discuss contested items. Where disagreement persists, document it and use LLM-as-judge for those tasks rather than hard-coded assertions. Multiple perspectives strengthen the benchmark — a single-author benchmark is more likely to contain blind spots.
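Measuring agreement can be as simple as counting how often pairs of experts gave the same verdict on an item. The sketch below assumes each expert marks a reference answer as acceptable (True) or not (False); the function names, item labels, and the choice to require full agreement before using a hard-coded assertion are illustrative assumptions, not part of the framework.

```python
from itertools import combinations


def agreement_rate(verdicts: list[bool]) -> float:
    """Fraction of expert pairs that gave the same verdict on one item."""
    pairs = list(combinations(verdicts, 2))
    if not pairs:
        raise ValueError("need verdicts from at least two experts")
    return sum(a == b for a, b in pairs) / len(pairs)


def route_items(reviews: dict[str, list[bool]],
                threshold: float = 1.0) -> dict[str, str]:
    """Send contested items to LLM-as-judge and agreed items to hard-coded checks."""
    routing = {}
    for item, verdicts in reviews.items():
        rate = agreement_rate(verdicts)
        routing[item] = "hard_coded" if rate >= threshold else "llm_judge"
    return routing


# Hypothetical independent reviews from three experts.
reviews = {
    "shutdown_order": [True, True, True],
    "containment_priority": [True, False, True],
}
print(agreement_rate(reviews["containment_priority"]))
print(route_items(reviews))
```

Simple pairwise agreement ignores agreement that happens by chance; a chance-corrected statistic may be a better fit once you have many items and reviewers.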