Schmid Agent-Ready Engineering Framework

Last updated: 30 May 2026

Diagnose and fix the five specific mindset and architecture gaps that cause experienced engineers to build unreliable AI agents, then redesign your agent system so it is production-ready.

// TL;DR

The Schmid Agent-Ready Engineering Framework is a diagnostic and redesign system that identifies five specific mindset and architecture gaps causing experienced engineers to build unreliable AI agents. It covers treating text as state, handing control to the LLM, handling errors as inputs, replacing unit tests with evals, and making tools self-documenting. Use it whenever you're designing, debugging, or reviewing an AI agent system—especially when your agent feels flaky, hard to test, or you've over-constrained its workflow with rigid step-by-step logic inherited from traditional software engineering.

Framework

// When should I use the Schmid Agent-Ready Engineering Framework?

Use this skill whenever you are designing, debugging, or reviewing an AI agent system and want to audit it against the five structural differences between traditional software engineering and agent engineering. Especially valuable when an agent feels flaky, hard to test, or over-controlled.

// What inputs do I need to apply the Schmid Agent-Ready Engineering Framework?

Agent descriptionrequired
What the agent is supposed to do — its goal, not its steps.
Current architecture or designrequired
How the agent is currently built: tools, prompts, workflows, error handling, and test strategy.
Observed failure modes
What is going wrong — flakiness, wrong outputs, broken flows, poor reliability.
Existing APIs or tools the agent uses
Names and signatures of any functions, endpoints, or tool definitions exposed to the agent.

// What are the core principles of agent-ready engineering?

Text Is Our New State

Agents no longer operate in clear structured data concepts — Booleans, flags, or rigid schemas. Everything is context now: text, images, audio, user preferences. Design your data model around semantic meaning, not typed fields.

Hand Over Control

Stop forcing the agent into a predefined deterministic workflow where step one does this and step two does that. Define the goal on what you want the agent to do, but do not define the exact steps the agent needs to take to achieve that goal. Trust the LLM to navigate; you are a dispatcher, not a traffic controller.

Errors Are Just Inputs

When something in your agent flow fails, treat it as a normal input — very similar to a user input — not a crash requiring a full restart. Provide errors back to the model, design workarounds, and keep moving forward in the flow rather than starting over from the beginning.

Move From Unit Tests to Evals

Agents are non-deterministic; you cannot guarantee that the same input will always produce the same steps and the same result. Stop asserting exact outputs and start measuring how often something works. Reliability is the success metric — use LLM-as-a-judge or human expert review, and always trace what the agent is doing.

Agents Evolve and APIs Don't

Agents only see function schemas, doc strings, and tool definitions — they do not have the years of developer context you have. Every tool exposed to an agent must be agent-ready: self-documenting with semantic interfaces that assume zero background knowledge from the caller.

// How do you apply the Schmid Agent-Ready Engineering Framework step by step?

1
Audit State Management for Semantic Readiness
Review every place where the agent stores or reads state. Ask: is this a Boolean flag or rigid data structure that could instead be expressed as natural-language context? Identify any user preferences, approval flows, or conditional branches that collapse nuance into a binary. Replace or augment them with context-carrying text so the agent can understand semantic meaning — e.g., 'focus on US market, ignore California' rather than a dropdown selection.
2
Redesign Workflows as Goal Definitions, Not Step Sequences
Locate any workflow where you have hard-coded step one, step two, step three logic into the prompt or orchestration layer. Rewrite it as a goal statement plus constraints, then let the agent decide the path. Apply the dispatcher metaphor: tell the agent the destination and the available transport options; do not specify the route. Verify the agent still achieves the outcome even when it takes unexpected intermediate steps — that is acceptable and expected behavior.
3
Implement Error-as-Input Handling Throughout the Agent Flow
Map every point in the agent flow where a tool call, API call, or sub-task can fail. For each failure point, design a path that feeds the error back to the model as a structured input rather than throwing an exception or restarting the whole process. For long-running agents (5+ minutes), this is critical — a full restart wastes compute and loses existing context. Add checkpointing or partial-state preservation where flows are expensive to re-run.
4
Replace or Supplement Unit Tests with Evals
Identify your current test suite. Any test that asserts a single exact output from a single input needs to be converted or supplemented with an eval that measures pass rate across multiple runs. Define your reliability threshold — e.g., 'this prompt must succeed 8 out of 10 times before it goes to production.' Choose a judgment method: LLM-as-a-judge for scalable qualitative scoring, or human expert review for high-stakes outputs. Instrument tracing so you can observe what the agent actually does on each run, not just the final output.
5
Rewrite All Agent-Facing Tools to Be Self-Documenting with Semantic Interfaces
Pull up every function schema, tool definition, or API endpoint the agent can call. For each one, ask: if someone with zero codebase context read only the doc string and parameter names, would they know exactly what this does, what the parameters mean, and what happens on failure? If not, rewrite the doc string and parameter descriptions to be fully self-explaining. Do not assume long-year developer expertise. The delete_item(id) example: add what 'id' refers to, what deletion implies downstream, and what error states exist.
6
Apply the 'Build to Delete' Principle to All Agent Components
Review the architecture with the assumption that it will be rebuilt — possibly soon — as models improve. Avoid over-investing in bespoke scaffolding that cannot survive a model swap. Treat software as disposable. Flag any components that are tightly coupled to a specific model's quirks rather than to durable principles, and document them as candidates for replacement.
7
Run the Iterative Observe-Adjust Loop Explicitly
Define instructions, run the agent, observe what it does, adjust prompts or tools, then run again. This loop — not a one-shot deploy — is the core development process for agents. Schedule deliberate observation sessions rather than assuming correctness after initial deployment. Track changes to prompts and tool definitions as you would track code changes.

// What does the Schmid framework look like in real-world agent scenarios?

A customer support agent for a subscription product is handling cancellation requests. It classifies intent and routes to a fixed cancel-or-retain workflow, but it frequently misroutes and cannot handle users who change their mind mid-conversation.

Apply Step 2 (Hand Over Control): replace the classification-plus-fixed-workflow with a goal statement — 'resolve the customer's underlying need while preserving the relationship if possible' — and let the agent choose the path. Apply Step 1 (Text Is New State): instead of a churn-flag triggering a predefined branch, pass the full conversation context so the agent can detect mid-conversation intent shifts. Apply Step 4 (Evals): measure what percentage of conversations reach a satisfying resolution rather than asserting that a specific cancel endpoint was called.

A deep research agent takes 15 minutes to run, calls multiple search tools, and occasionally fails mid-run when one search API times out, forcing a full restart and wasting the prior context.

Apply Step 3 (Errors Are Just Inputs): instrument each search tool call to catch failures and return them to the model as informational inputs ('search for X failed with timeout; consider alternative sources or proceed without this data'). Add checkpointing after major milestones so a failure at minute 12 does not restart from minute 0. Apply Step 5 (Agent-Ready Tools): ensure each search tool's schema describes what query types it handles and what its failure modes are, so the model can reason about alternatives autonomously.

An engineering team has a suite of unit tests for their coding agent that check exact line-by-line code output, but the agent keeps failing CI despite producing functionally correct code.

Apply Step 4 (Move From Unit Tests to Evals): retire exact-output assertions and replace them with eval criteria — 'does the output compile, does it pass functional tests, does it match the stated requirements?' — scored by an LLM-as-a-judge or automated test harness. Set a reliability threshold (e.g., 9/10 runs must pass functional criteria) as the production gate. Add tracing to understand which steps vary across runs and whether that variance is acceptable.

// What are the most common mistakes when building AI agents?

Fighting the model — trying to force the agent into a rigid step-by-step workflow instead of defining the goal and trusting the LLM to navigate to it.
Treating agent failures as crashes requiring full restarts rather than feeding errors back to the model as inputs and continuing forward in the flow.
Restarting long-running agent flows from scratch when a single tool call fails, wasting compute and losing all accumulated context.
Writing unit tests that assert exact deterministic outputs for non-deterministic agents — the agent will always appear flaky even when it is working correctly.
Building APIs and tools with minimal documentation and assuming the agent has the same contextual background as a developer who built the system.
Shipping an agent to production that only succeeds a fraction of the time — agents are only valuable if they are reliable enough to be trusted.
Over-investing in brittle agent scaffolding tied to a specific model's behavior — software is disposable and must be built to delete and rebuild as models improve.
Collapsing rich user context into Boolean flags or typed fields, discarding semantic meaning the model could otherwise use to respond dynamically.

// What do the key terms in agent-ready engineering mean?

Text Is Our New State: The principle that agents operate on semantic context — text, images, audio, natural-language preferences — rather than Boolean flags and rigid typed data structures. All state should be treated as context-carrying information the model can reason over.
Hand Over Control: The practice of defining the goal for an agent without specifying the exact steps it must take to achieve that goal. The engineer acts as a dispatcher (destination + options), not a traffic controller (precise route and speed).
Dispatcher vs. Traffic Controller: A metaphor for the shift in the engineer's role: a traffic controller dictates every movement deterministically; a dispatcher states the destination and available modes, then trusts the agent to find the path.
Errors Are Just Inputs: The design principle that failures within an agent flow should be treated as a normal model input — similar to a user message — rather than exceptions that halt or restart the process. The model is given the error and expected to reason around it.
Evals: Probabilistic evaluation methods that measure how often an agent succeeds across multiple runs, replacing deterministic unit tests. Evals use qualitative judgment (LLM-as-a-judge or human expert) and reliability thresholds rather than exact output assertions.
LLM-as-a-Judge: An evaluation technique where a language model scores or qualifies another agent's output, enabling scalable qualitative assessment when exact outputs cannot be asserted.
Agent-Ready: Describes tools, APIs, and function schemas specifically designed for agent consumption: fully self-documenting with semantic interfaces, explicit parameter descriptions, and failure-mode documentation — assuming zero prior developer context.
Semantic Interface: A tool or function definition where every parameter, return value, and failure mode is described in natural language precise enough that an agent with no codebase context can use it correctly.
Build to Delete: The 'bitter lesson' principle: agent software is disposable and will be rebuilt — possibly soon — as models improve. Avoid over-coupling architecture to current model quirks; design for replaceability.
Observe-Adjust Loop: The core iterative development cycle for agents: define instructions → run → observe behavior → adjust prompts or tools → run again. Replaces the traditional write-code → test → deploy linear flow.

// FREQUENTLY ASKED QUESTIONS

What is the Schmid Agent-Ready Engineering Framework?

It is a diagnostic and redesign framework that identifies five structural differences between traditional software engineering and AI agent engineering. Created from insights by Philipp Schmid of Google DeepMind, it addresses text-as-state, goal-based control, error handling, probabilistic evaluation, and agent-ready tool design. Engineers use it to audit and fix the root causes of unreliable agent behavior rather than applying surface-level patches.

What are the five principles of the Schmid Agent-Ready Engineering Framework?

The five principles are: (1) Text Is Our New State—use semantic context instead of Boolean flags; (2) Hand Over Control—define goals, not step-by-step workflows; (3) Errors Are Just Inputs—feed failures back to the model instead of crashing; (4) Move From Unit Tests to Evals—measure reliability rates, not exact outputs; (5) Agents Evolve and APIs Don't—make every tool fully self-documenting with semantic interfaces.

How do I apply the Schmid Agent-Ready Engineering Framework step by step?

Start by auditing your state management for semantic readiness—replace rigid flags with natural-language context. Then redesign workflows as goal definitions instead of step sequences. Implement error-as-input handling at every failure point. Replace unit tests with evals that measure pass rates. Rewrite all agent-facing tool schemas to be self-documenting. Apply the build-to-delete principle to avoid brittle scaffolding. Finally, run iterative observe-adjust loops as your core development process.

How does the Schmid framework compare to just using LangChain or other agent frameworks?

LangChain and similar libraries are orchestration tools—they provide the plumbing for chaining LLM calls, tools, and memory. The Schmid framework is an architecture and mindset audit that sits above any specific library. It diagnoses why your agent is unreliable regardless of what framework you use. You can apply its five principles whether you're using LangChain, LangGraph, CrewAI, or custom orchestration code.

When should I use the Schmid Agent-Ready Engineering Framework?

Use it whenever you're designing a new AI agent, debugging a flaky one, or reviewing an agent architecture for production readiness. It's especially valuable when your agent inconsistently follows workflows, fails catastrophically on tool errors, passes tests intermittently despite producing correct results, or when experienced engineers on your team are unconsciously applying traditional software patterns that don't work for non-deterministic systems.

What results can I expect after applying the Schmid framework to my agent?

You should see measurably higher reliability rates, fewer catastrophic restarts on tool failures, more graceful handling of edge cases the agent hasn't seen before, and a testing strategy that accurately reflects real-world performance. Long-running agents benefit most—error-as-input handling and checkpointing prevent costly full restarts. Your development velocity should also increase because iterative observe-adjust loops replace frustrating guess-and-deploy cycles.

What does 'errors are just inputs' mean for AI agents?

It means that when a tool call or API request fails inside an agent flow, you should feed the error back to the LLM as a structured input—similar to a user message—rather than throwing an exception or restarting the entire process. The model can then reason about the failure and choose an alternative path. This is critical for long-running agents where a restart at minute 12 wastes all prior context and compute.

How do I test AI agents if unit tests don't work?

Replace exact-output unit tests with evals—probabilistic evaluations that measure how often your agent succeeds across multiple runs. Define a reliability threshold like '8 out of 10 runs must pass.' Use LLM-as-a-judge for scalable qualitative scoring or human expert review for high-stakes outputs. Instrument tracing so you can observe the agent's reasoning and tool calls on every run, not just the final answer.

What does 'agent-ready' mean for APIs and tools?

An agent-ready tool is fully self-documenting with a semantic interface—its function schema, doc string, and parameter names explain everything the agent needs to know, assuming zero codebase context. Every parameter describes what it represents, every return value is explained, and every failure mode is documented. If a developer with no prior knowledge of your system can't understand the tool from its definition alone, the agent can't either.

Why do senior engineers struggle more with building AI agents?

Senior engineers have deep-rooted habits from years of deterministic software engineering—strict typing, exact test assertions, controlled step-by-step workflows, and minimal documentation for well-known internal APIs. These habits directly conflict with agent engineering, where state is semantic, control must be delegated, outputs are non-deterministic, and tools must be self-documenting. The Schmid framework specifically targets these five habitual gaps.

// GET THIS SKILL — FREE