Eval Maturity Phases vs GTM Engineering: Which Should You Use?

// TL;DR

These two skills solve completely different problems and do not compete. If you are building or improving an AI agent and need to ensure it works reliably before and during production, use the Hetzel Eval Maturity Phases Framework. If you are a marketer or founder trying to automate go-to-market execution — SEO, ads, content, publishing — using AI agents, use Cody Schneider's GTM Engineering with Claude Code. Pick based on whether your bottleneck is agent quality assurance or marketing execution throughput.

// HOW DO THEY COMPARE?

| Dimension | Hetzel Eval Maturity Phases Framework | Cody Schneider GTM Engineering with Claude Code |
| --- | --- | --- |
| Best For | AI/ML engineers building or hardening LLM-powered agents for production | Growth marketers, founders, and GTM teams automating marketing execution |
| Core Problem Solved | How to systematically evaluate and improve AI agent quality across maturity stages | How to delegate all hands-on-keyboard GTM tasks to Claude Code agents |
| Complexity | High: requires understanding of scoring functions, LLM-as-judge, trace instrumentation, and production observability | Low to moderate: requires API keys, a project folder, and prompt-level orchestration skills |
| Time to Apply | Days to weeks: maturity phases are progressive and compound over time | Hours: a single research-to-publish loop can run in one session |
| Prerequisites | An existing AI agent or prompt, subject matter experts, ideally production traces | Claude Code access, API keys for marketing tools, a campaign brief |
| Output Type | Eval datasets, scoring functions, quality dashboards, continuous improvement flywheels | Published blog posts, ad campaigns, keyword research, performance reports |
| Creator Background | Phil Hetzel (Braintrust), AI engineering and LLM evaluation platform expert | Cody Schneider, growth marketer and GTM automation practitioner |
| Feedback Loop | Production traces → failure identification → offline evals → agent improvement (The Flywheel) | Publish → track via Google Search Console → feed data back to Claude → optimize (Continuous Improvement Loop) |
| Technical Depth Required | Deep: involves deterministic scoring, LLM-as-judge validation, trace-level instrumentation, mock APIs | Shallow: primarily prompt engineering, API key management, and parallel terminal orchestration |
| Scales By | Adding automated failure-mode discovery (topic modeling) and CI/CD-style eval pipelines | Looping the same agent workflow across every keyword or campaign target in a list |

What does the Hetzel Eval Maturity Phases Framework do?

The Hetzel Eval Maturity Phases Framework, created by Phil Hetzel of Braintrust, gives AI engineers a structured, stage-by-stage methodology for building and maturing an evaluation system for LLM-powered agents. It addresses a specific and painful bottleneck: most teams get stuck between proof-of-concept and production because they lack a defensible way to measure whether their agent actually works.

The framework defines four maturity phases: (1) Just Getting Started / Vibe Checking, (2) Measuring to Manage, (3) Accounting for Complexity, and (4) Advanced Techniques. At each stage, you perform specific actions — from structured human annotation with written justifications, to building LLM-as-judge scoring functions, to capturing full production traces and replaying them as eval datasets.

A core principle is that evals are not unit tests. You do not try to exhaustively cover every possible failure. Instead, you identify known failure modes with a subject matter expert and build evals specifically targeting those. Another critical practice is 'eval the eval' — validating your LLM-as-judge outputs against a human-labelled ground truth dataset before trusting them at scale.
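To make 'eval the eval' concrete, here is a minimal sketch of how you might compare an LLM judge against human labels. The function names, the pass/fail label scheme, and the sample data are all illustrative assumptions, not part of Hetzel's framework or any Braintrust API. It reports raw agreement plus Cohen's kappa, which discounts agreement that would happen by chance.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of examples where the judge label equals the human label."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between the two label lists, corrected for chance agreement."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(human_counts[label] * judge_counts[label] for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ground truth from a subject matter expert vs. LLM-judge output.
human_labels = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
judge_labels = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]

print(f"agreement: {agreement_rate(human_labels, judge_labels):.2f}")
print(f"kappa:     {cohens_kappa(human_labels, judge_labels):.2f}")
```

What counts as acceptable alignment is a judgment call for you and your subject matter expert. The sketch only produces the numbers you would need to have that conversation.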

The ultimate goal is activating what Hetzel calls The Flywheel: a continuous loop where production traces feed into offline eval datasets, eval results guide agent improvements, and those improvements are validated in the next eval cycle. This shifts your posture from defensive (catching regressions) to offensive (proactively improving quality with every iteration).
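One turn of the loop described above can be sketched in a few lines. This is an illustration under stated assumptions: the `Trace` fields, the `harvest_failures` helper, and the dataset shape are hypothetical and do not reflect any particular tracing or eval platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trace:
    """A hypothetical production trace, already reviewed by a human or tool."""
    trace_id: str
    user_input: str
    agent_output: str
    failed: bool
    failure_mode: str = ""


def harvest_failures(traces: list[Trace], dataset: list[dict]) -> int:
    """Append failed traces to the offline eval dataset, skipping known inputs.

    Returns the number of new eval cases added.
    """
    seen = {case["input"] for case in dataset}
    added = 0
    for trace in traces:
        if trace.failed and trace.user_input not in seen:
            dataset.append({
                "input": trace.user_input,
                "bad_output": trace.agent_output,
                "failure_mode": trace.failure_mode,
                "source_trace": trace.trace_id,
            })
            seen.add(trace.user_input)
            added += 1
    return added


eval_dataset = [{"input": "Cancel my order", "bad_output": "", "failure_mode": "", "source_trace": "seed"}]
production = [
    Trace("t1", "Cancel my order", "Order shipped!", failed=True, failure_mode="wrong_action"),
    Trace("t2", "Refund status?", "I cannot help.", failed=True, failure_mode="refusal"),
    Trace("t3", "Track my package", "It arrives Friday.", failed=False),
]
print(harvest_failures(production, eval_dataset), "new eval case(s)")
```

After harvesting, you would rerun your evals against the larger dataset, change the agent, and repeat. The deduplication by input is a simplification; a real dataset would likely need a more careful notion of "same case".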

What does Cody Schneider's GTM Engineering with Claude Code do?

Cody Schneider's GTM Engineering skill turns Claude Code into a fully autonomous execution layer for go-to-market tasks. The promise is simple: you have the idea, you provide the polish at the end, and Claude Code does everything in between — keyword research, content creation, publishing, ad management, performance analysis.

The infrastructure is deliberately minimal. You create a single project folder containing a `.env` file (all API keys) and a `CLAUDE.md` file (standing instructions). Every Claude Code session launched from that folder inherits the full tool stack automatically. Schneider calls this the Stack-in-a-Folder pattern.
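If you want to script the setup, a minimal sketch might look like the following. The variable names in `.env` and the instructions in `CLAUDE.md` are placeholders invented for illustration, not Schneider's actual file contents, and `scaffold_stack` is a hypothetical helper. The demo writes into a temporary directory so it leaves nothing behind.

```python
from pathlib import Path
import tempfile

# Placeholder contents; the key names and instructions are illustrative only.
ENV_TEMPLATE = """\
# API keys for the tools your agents will call (names are placeholders)
CMS_API_KEY=
SEARCH_CONSOLE_CREDENTIALS=
ADS_API_TOKEN=
"""

CLAUDE_MD_TEMPLATE = """\
# Standing instructions

- Read API keys from the .env file in this folder; never print them.
- Follow the style guide before drafting any content.
- Ask for human review before publishing.
"""


def scaffold_stack(root: Path) -> list[Path]:
    """Create the project folder with .env and CLAUDE.md.

    Existing files are left untouched so real keys are never overwritten.
    """
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name, body in ((".env", ENV_TEMPLATE), ("CLAUDE.md", CLAUDE_MD_TEMPLATE)):
        path = root / name
        if not path.exists():
            path.write_text(body, encoding="utf-8")
            created.append(path)
    return created


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp) / "gtm-project"
    first_run = [p.name for p in scaffold_stack(project)]
    second_run = [p.name for p in scaffold_stack(project)]
    print("created on first run:", first_run)
    print("created on second run:", second_run)
```

Because the `.env` file holds credentials, keep it out of version control.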

The force-multiplication effect comes from running multiple terminal windows simultaneously, each with an independent Claude Code agent working on a different sub-task. While one agent researches keywords, another drafts content and a third publishes to your CMS. Schneider calls this role the Conductor: you orchestrate parallel workstreams rather than doing the work manually, one step at a time.

Content quality is governed by source material quality, not by the model's inherent capability. Schneider is explicit: AI-generated content that underperforms is a skill issue, not a tool issue. The framework prescribes scraping top-ranking Google results as structural source material, layering in a style guide, and optionally recording a 30-minute AI interview to inject your authentic perspective.

The workflow ends in a Continuous Improvement Loop: connect Google Search Console via Graph MCP, feed live performance data back into Claude Code, and let the agent generate specific optimization recommendations for underperforming pages.
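As a rough illustration of the "underperforming pages" step, the sketch below filters page-level rows and turns them into a prompt. It assumes you have already exported rows with the standard Search Console metrics (clicks, impressions, average position); the thresholds, the sample rows, and both helper functions are hypothetical, and the sketch does not call any Search Console or MCP API.

```python
def find_underperformers(rows: list[dict], min_impressions: int = 500, max_ctr: float = 0.02) -> list[dict]:
    """Return pages that are seen often but clicked rarely, busiest first."""
    flagged = []
    for row in rows:
        # Skip low-traffic pages (and avoid dividing by zero impressions).
        if row["impressions"] < max(min_impressions, 1):
            continue
        ctr = row["clicks"] / row["impressions"]
        if ctr <= max_ctr:
            flagged.append({**row, "ctr": round(ctr, 4)})
    return sorted(flagged, key=lambda r: r["impressions"], reverse=True)


def build_prompt(flagged: list[dict]) -> str:
    """Format the flagged pages as a request you could hand to the agent."""
    lines = ["These pages get impressions but few clicks. Suggest specific title and content improvements:"]
    for r in flagged:
        lines.append(
            f"- {r['page']}: {r['impressions']} impressions, "
            f"CTR {r['ctr']:.2%}, average position {r['position']}"
        )
    return "\n".join(lines)


# Hypothetical export; real numbers would come from your own property.
rows = [
    {"page": "/blog/a", "clicks": 40, "impressions": 5000, "position": 8.2},
    {"page": "/blog/b", "clicks": 90, "impressions": 1200, "position": 3.1},
    {"page": "/blog/c", "clicks": 1, "impressions": 300, "position": 15.0},
    {"page": "/blog/d", "clicks": 30, "impressions": 2500, "position": 6.0},
]

flagged = find_underperformers(rows)
print(build_prompt(flagged))
```

The thresholds are the part to tune: what counts as a low click-through rate depends on the query type and the page's average position.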

How do they compare?

These two skills operate in entirely different domains and solve fundamentally different problems. The Hetzel Eval Maturity Phases Framework is an AI engineering methodology — it ensures your agent works correctly and improves over time. Schneider's GTM Engineering is a marketing execution methodology — it ensures your go-to-market tasks get done faster by delegating them to AI agents.

The Eval Maturity framework is technically deep. It requires understanding of scoring functions (both deterministic and LLM-as-judge), production trace instrumentation, mock API configuration for CRUD-based tool calls, and topic modeling for automated failure-mode discovery. It is a weeks-long investment that compounds as your agent matures. Schneider's GTM Engineering is operationally oriented. The technical barrier is low — you need API keys, a folder, and the ability to write clear prompts. A single end-to-end loop can be completed in hours.
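To give a sense of what "deterministic scoring" and "mock API configuration" involve, here is a minimal sketch. The `MockCrmApi` class, the `fake_agent` stand-in, and the record shape are all invented for illustration. The idea is that the agent's tool calls hit an in-memory fake instead of a real system, and a deterministic scorer compares the resulting state with what an expert expected.

```python
class MockCrmApi:
    """In-memory stand-in for a CRUD API so evals never touch real records."""

    def __init__(self) -> None:
        self.records: dict[str, dict] = {}
        self.calls: list[tuple[str, str]] = []  # (operation, record_id) log

    def create(self, record_id: str, data: dict) -> None:
        self.calls.append(("create", record_id))
        self.records[record_id] = dict(data)

    def update(self, record_id: str, data: dict) -> None:
        self.calls.append(("update", record_id))
        if record_id not in self.records:
            raise KeyError(record_id)
        self.records[record_id].update(data)

    def delete(self, record_id: str) -> None:
        self.calls.append(("delete", record_id))
        self.records.pop(record_id, None)


def score_final_state(api: MockCrmApi, expected: dict[str, dict]) -> float:
    """Deterministic score: 1.0 if the final state matches exactly, else 0.0."""
    return 1.0 if api.records == expected else 0.0


def fake_agent(api: MockCrmApi) -> None:
    """Hypothetical stand-in for the agent under test and its tool calls."""
    api.create("lead-1", {"status": "new"})
    api.update("lead-1", {"status": "contacted"})


api = MockCrmApi()
fake_agent(api)
score = score_final_state(api, {"lead-1": {"status": "contacted"}})
print("score:", score, "calls:", api.calls)
```

Unlike an LLM judge, this scorer returns the same result on every run for the same state, which is why it needs no separate validation step.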

Both frameworks feature feedback loops, but they measure different things. The Flywheel measures agent quality against domain-expert standards. The Continuous Improvement Loop measures marketing performance against search engine and ad platform metrics.

One important intersection: if you are building an AI agent that performs GTM tasks autonomously, you would use Schneider's framework to design the workflow and Hetzel's framework to evaluate whether the agent executes it correctly. They are complementary, not competitive.

Which should you choose?

If your problem is agent quality — your LLM agent produces unreliable outputs, you cannot confidently ship it to production, or you have no structured eval process — use the Hetzel Eval Maturity Phases Framework. It is the right tool for AI engineers and teams responsible for the correctness and safety of an AI system.

If your problem is marketing throughput — you are manually doing keyword research, writing blog posts, managing ads, or copying data between tools — use Cody Schneider's GTM Engineering with Claude Code. It is the right tool for growth marketers, founders, and small teams who need to produce more GTM output without hiring.

If you are building an AI agent that automates GTM work for others, you likely need both: Schneider's approach to design the agentic workflow and Hetzel's approach to ensure the agent performs reliably at scale. Start with whichever matches your immediate bottleneck.

// FREQUENTLY ASKED QUESTIONS

Can I use the Eval Maturity Framework and GTM Engineering together?

Yes, and you should if you are building an AI agent that performs go-to-market tasks. Use Schneider's GTM Engineering to design the agentic workflow — research, create, publish, optimize — and Hetzel's Eval Maturity Framework to evaluate whether that agent executes those tasks correctly. They are complementary: one designs the work, the other validates the quality.

Which skill is better for someone with no technical background?

Cody Schneider's GTM Engineering with Claude Code is significantly more accessible. It requires only API keys, a project folder, and the ability to write clear natural-language prompts. The Hetzel Eval Maturity Framework requires an understanding of scoring functions, trace instrumentation, and LLM-as-judge validation, all skills that assume an AI engineering background.

How long does it take to see results from the Eval Maturity Phases Framework?

You can start vibe checking with documented human annotation in a single day. However, progressing through all four maturity phases — building scoring functions, capturing production traces, activating the flywheel, and scaling with topic modeling — takes weeks to months. The framework is designed to compound value over time as your eval system matures alongside your agent.

Does GTM Engineering with Claude Code work for paid ads, not just SEO?

Yes. Schneider explicitly covers paid ad workflows. You can use Claude Code to research ad angles from competitor data, draft multiple ad copy variations, publish them via the Facebook Ads API, pull performance data after a test period, identify winners and losers, and generate revised copy for scaling — all through parallel agent sessions.

What is the Flywheel in the Eval Maturity Framework and why does it matter?

The Flywheel is a continuous improvement loop: capture agent traces in production, identify failures via human review or automated tooling, pull those examples into an offline eval dataset, rerun evals, and use results to improve the agent. It matters because it shifts your approach from reactive regression-catching to proactively using eval data to drive every agent improvement.

What is Stack-in-a-Folder and do I need it for GTM Engineering?

Stack-in-a-Folder is Schneider's infrastructure pattern: a single project folder with a `.env` file holding all API keys and a `CLAUDE.md` file with standing agent instructions. Yes, you need it: it is the foundation that lets every Claude Code session start with credentials and context already in place. Set it up once per project and reuse it for every session.

Is LLM-as-judge reliable enough to replace human reviewers?

Not without validation. The Hetzel framework explicitly warns against trusting LLM-as-judge scores blindly. You must build a human-labelled ground truth dataset and measure whether the LLM judge's scores align with expert decisions. Only after demonstrating acceptable alignment should you scale the judge to replace manual review. This practice is called 'eval the eval.'

Which framework helps me ship an AI agent to production faster?

The Hetzel Eval Maturity Phases Framework directly addresses the proof-of-concept-to-production gap. It gives you a structured methodology to build measurable, defensible quality evidence that justifies shipping. GTM Engineering assumes your tools already work and focuses on using them for marketing execution — it does not address agent reliability or production readiness.