Frequently Asked Questions About the Schmid Agent-Ready Engineering Framework
23 answers covering everything from basics to advanced usage.
// Basics
What is the difference between a dispatcher and a traffic controller in agent engineering?
A traffic controller dictates every movement deterministically—step one does X, step two does Y, step three does Z. A dispatcher states the destination and available transport options, then trusts the agent to find its own path. In the Schmid framework, engineers should act as dispatchers: define the goal and constraints, expose the available tools, and let the LLM decide the route. This is the core of the Hand Over Control principle.
What is the build-to-delete principle in AI agent development?
Build-to-delete means treating your agent software as disposable. Models improve rapidly, and architecture tightly coupled to a specific model's quirks will need to be rebuilt. Avoid over-investing in bespoke scaffolding. Design components for replaceability—document which parts depend on current model behavior and flag them as candidates for future replacement. This prevents sunk-cost attachment to brittle infrastructure.
What is a semantic interface for an AI agent tool?
A semantic interface is a tool or function definition where every parameter, return value, and failure mode is described in natural language precise enough that an agent with zero codebase context can use it correctly. Unlike traditional API docs written for developers who know the system, semantic interfaces assume the caller has never seen the codebase and must understand everything from the schema alone.
Does the Schmid framework apply to simple single-turn LLM applications or only to agents?
The framework is specifically designed for AI agents—systems that make decisions, call tools, and operate over multiple steps. Single-turn LLM applications (e.g., text summarization, translation) don't need most of these principles because they don't have state management, tool calls, or multi-step workflows. However, the eval principle applies universally: any LLM output benefits from probabilistic evaluation rather than exact-match testing.
What's the biggest mistake experienced engineers make when building AI agents?
Fighting the model by forcing it into deterministic, step-by-step workflows. Senior engineers have years of muscle memory around controlled execution—if-then-else branches, exact assertions, and predictable outputs. This instinct causes them to over-constrain the agent, which paradoxically makes it more brittle. The Schmid framework's central insight is that you must shift from traffic controller to dispatcher: define the goal and constraints, then trust the LLM to navigate.
// How To
How do I convert my existing unit tests to evals for AI agents?
Start by identifying every test that asserts a single exact output. Replace the assertion with eval criteria that define what 'success' means qualitatively—does the output compile, does it satisfy the requirements, is the information correct? Run the agent multiple times per test case and measure the pass rate. Set a reliability threshold (e.g., 9/10 must pass). Use LLM-as-a-judge for automated scoring or human review for critical paths. Add tracing to understand variance across runs.
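The conversion above can be sketched as a small harness. This is a minimal illustration, not a definitive implementation: `run_eval`, `fake_agent`, and the two criteria are hypothetical names, and the simple string checks stand in for an LLM-as-a-judge or human review.

```python
import random
from typing import Callable

def run_eval(agent: Callable[[str], str],
             prompt: str,
             criteria: list[Callable[[str], bool]],
             runs: int = 10) -> float:
    """Run the agent `runs` times; return the fraction of runs meeting every criterion."""
    passes = 0
    for _ in range(runs):
        output = agent(prompt)
        if all(check(output) for check in criteria):
            passes += 1
    return passes / runs

# Hypothetical stand-in for a real agent: the wording varies, the content does not.
def fake_agent(prompt: str) -> str:
    return random.choice([
        "Your refund of $20 has been issued.",
        "We have issued a $20 refund to your account.",
    ])

# Criteria describe what success means instead of pinning one exact string.
criteria = [
    lambda out: "$20" in out,
    lambda out: "refund" in out.lower(),
]

THRESHOLD = 0.9  # e.g. 9/10 must pass
pass_rate = run_eval(fake_agent, "Refund order 1234", criteria, runs=10)
print(f"pass rate {pass_rate:.0%}, meets threshold: {pass_rate >= THRESHOLD}")
```

An exact-match assertion on either sentence would fail whenever the agent picks the other one. The criteria pass for both.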
How do I rewrite my agent's workflow from step sequences to goal definitions?
Locate any place where you've hard-coded step-by-step logic into prompts or orchestration code. Replace it with a goal statement describing the desired outcome plus constraints (e.g., 'Resolve the customer's issue while preserving the relationship; you may offer refunds up to $50'). List available tools without prescribing their order. Verify the agent still achieves the goal even when it takes unexpected intermediate steps—this variance is expected, not a bug.
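One way to picture a goal definition is as a small data structure rendered into the prompt. The sketch below assumes a hypothetical `AgentTask` class and made-up tool names (`lookup_order`, `issue_refund`, `send_email`). The goal and constraint text come from the example above.

```python
from dataclasses import dataclass, field

@dataclass
class AgentTask:
    goal: str
    constraints: list[str] = field(default_factory=list)
    tools: list[str] = field(default_factory=list)

    def to_prompt(self) -> str:
        """Render the goal, constraints, and tools without prescribing any step order."""
        lines = [f"Goal: {self.goal}", "Constraints:"]
        lines += [f"- {c}" for c in self.constraints]
        lines.append("Available tools (use in any order, or not at all):")
        # Sorted alphabetically so the listing does not imply a sequence.
        lines += [f"- {t}" for t in sorted(self.tools)]
        return "\n".join(lines)

task = AgentTask(
    goal="Resolve the customer's issue while preserving the relationship.",
    constraints=["You may offer refunds up to $50."],
    tools=["lookup_order", "issue_refund", "send_email"],  # hypothetical tool names
)
print(task.to_prompt())
```

Nothing in the rendered prompt says which tool to call first. That choice is left to the model.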
How do I make my existing APIs agent-ready?
Pull up every function schema and tool definition the agent can call. For each one, ask: could someone with zero context understand exactly what this does from the docstring alone? Rewrite parameter names to be descriptive. Add explicit descriptions of what each parameter represents, what the function returns, what side effects occur, and what error states exist. For example, extend the description of delete_item(id) to state that 'id' is a UUID from the inventory table, that deletion is permanent, and that the call returns an error if the item has pending orders.
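A rough sketch of what the delete_item example might look like as a tool definition follows. It uses a JSON-Schema-style dictionary of the general shape many function-calling APIs accept, though exact field names vary by provider. The `errors` field, the `ITEM_HAS_PENDING_ORDERS` code, and the `undocumented_parameters` helper are assumptions made for illustration.

```python
delete_item_tool = {
    "name": "delete_item",
    "description": (
        "Permanently delete one item from the inventory table. "
        "This cannot be undone. Fails if the item has pending orders."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "id": {
                "type": "string",
                "format": "uuid",
                "description": "UUID of the item, taken from the inventory table.",
            },
        },
        "required": ["id"],
    },
    # Hypothetical field, not part of any standard schema: documents the error state.
    "errors": {
        "ITEM_HAS_PENDING_ORDERS": "The item has pending orders; nothing was deleted.",
    },
}

def undocumented_parameters(tool: dict) -> list[str]:
    """Return the names of parameters that have no description."""
    props = tool["parameters"]["properties"]
    return [name for name, spec in props.items() if not spec.get("description")]

print(undocumented_parameters(delete_item_tool))
```

A check like `undocumented_parameters` can run in CI so that a new tool without parameter descriptions is caught before an agent ever sees it.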
How do I implement error-as-input handling in a long-running agent?
Wrap each tool call or API request in error handling that catches failures and formats them as structured messages for the model—e.g., 'Search for X failed with timeout; consider alternative sources or proceed without this data.' Add checkpointing after major milestones so a failure at minute 12 doesn't restart from minute 0. Design the system prompt to instruct the agent that errors are expected and it should reason about workarounds rather than stopping.
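The wrapping step might look like the sketch below. `call_tool` and the always-failing `search` tool are hypothetical, and the dictionary layout is one possible structure for the message handed back to the model.

```python
from typing import Any, Callable

def call_tool(name: str, fn: Callable[..., Any], **kwargs: Any) -> dict:
    """Run a tool and always return a message the model can read; never raise."""
    try:
        return {"tool": name, "status": "ok", "result": fn(**kwargs)}
    except Exception as exc:  # broad on purpose: every failure becomes model input
        return {
            "tool": name,
            "status": "error",
            "error_type": type(exc).__name__,
            "message": f"{name} failed: {exc}",
            "hint": "Consider an alternative source or proceed without this data.",
        }

# Hypothetical tool that always times out, to exercise the error path.
def search(query: str) -> list[str]:
    raise TimeoutError("timed out after 30s")

observation = call_tool("search", search, query="X")
print(observation["message"])
```

The returned dictionary is appended to the conversation like any other tool result, so the model can decide whether to retry, switch sources, or continue without the data.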
How do I set a reliability threshold for my AI agent?
Define the minimum pass rate your agent must achieve before it's production-ready, typically between 80% and 95% depending on the stakes. Run each critical prompt or workflow 10 to 20 times or more. Score each run using LLM-as-a-judge or human review against your eval criteria. If the agent meets your threshold consistently, it's ready. If not, iterate on prompts, tool definitions, or system instructions until it does. Track the pass rate over time as a regression indicator.
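The threshold check and the regression tracking could be combined into a simple release gate. This is a sketch under assumptions: `EvalReport`, `release_gate`, and the three status strings are hypothetical, and the run results are hard-coded rather than produced by a real judge.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class EvalReport:
    workflow: str
    results: list[bool]  # one entry per scored run

    @property
    def pass_rate(self) -> float:
        return sum(self.results) / len(self.results)

def release_gate(report: EvalReport,
                 threshold: float,
                 baseline: Optional[float] = None) -> str:
    """Return 'iterate', 'regression', or 'ready' for one workflow."""
    if report.pass_rate < threshold:
        return "iterate"      # below the minimum: keep adjusting prompts and tools
    if baseline is not None and report.pass_rate < baseline:
        return "regression"   # above the minimum but worse than the last release
    return "ready"

# 20 scored runs, 18 of which passed.
report = EvalReport("refund_flow", [True] * 18 + [False] * 2)
print(release_gate(report, threshold=0.85))
print(release_gate(report, threshold=0.85, baseline=0.95))
```

Storing the pass rate of each release as the next baseline turns the threshold into a regression indicator as well as a launch criterion.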
How do I handle user preferences in an agent without using Boolean flags?
Instead of collapsing user preferences into typed fields (is_premium: true, region: 'US'), express them as natural-language context the model can reason over: 'This user is on the premium plan, focuses on the US market but wants to exclude California, and prefers concise responses.' This preserves semantic nuance that Boolean flags discard. The model can then weigh these preferences dynamically rather than following rigid if-then branches based on flag values.
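As a small illustration, preferences can be kept as free-form notes and rendered straight into the model's context. `build_user_context` is a hypothetical helper. The notes reuse the example above.

```python
def build_user_context(notes: list[str]) -> str:
    """Join free-form preference notes into a context block for the system prompt."""
    bullet_list = "\n".join(f"- {note}" for note in notes)
    return f"What we know about this user:\n{bullet_list}"

# A flag set such as {"is_premium": True, "region": "US"} has no slot for
# "US, but not California"; a sentence does.
notes = [
    "Is on the premium plan.",
    "Focuses on the US market but wants to exclude California.",
    "Prefers concise responses.",
]
print(build_user_context(notes))
```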
// Troubleshooting
My AI agent works sometimes and fails other times with the same input—is it broken?
Not necessarily. Agents are non-deterministic by nature—the same input can produce different steps and different outputs across runs. The question isn't whether it always produces identical results, but whether it reliably produces correct results. Switch from exact-output assertions to eval-based measurement. If the agent succeeds 9 out of 10 times with functionally correct output, it's working. If it's 3 out of 10, iterate on prompts and tool definitions.
My agent keeps failing on tool calls and restarting the entire flow—how do I fix this?
You're treating errors as crashes instead of inputs. Map every failure point in your agent flow—tool calls, API requests, sub-tasks. For each one, design a handler that catches the failure and returns a structured error message to the model rather than throwing an exception. Add checkpointing after expensive operations so failures don't require full restarts. Instruct the agent in its system prompt that tool failures are expected and it should reason about alternatives.
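Checkpointing after expensive operations can be sketched with a small file-backed store. The `Checkpoints` class and the `expensive_research` function are assumptions for illustration, and the sketch only handles results that are JSON-serializable.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Callable

class Checkpoints:
    """Persist the results of finished operations so a restart can skip them."""

    def __init__(self, path: Path) -> None:
        self.path = path
        self.saved = json.loads(path.read_text()) if path.exists() else {}

    def get_or_run(self, key: str, operation: Callable[[], Any]) -> Any:
        if key in self.saved:
            return self.saved[key]  # finished in an earlier attempt
        result = operation()        # may raise; nothing is saved in that case
        self.saved[key] = result
        self.path.write_text(json.dumps(self.saved))
        return result

calls = {"count": 0}

def expensive_research() -> list[str]:
    calls["count"] += 1
    return ["source A", "source B"]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "checkpoints.json"
    Checkpoints(path).get_or_run("research", expensive_research)
    # Simulate a restart after a later failure: a fresh object reloads the file.
    resumed = Checkpoints(path).get_or_run("research", expensive_research)

print(resumed, calls["count"])
```

The second call returns the saved result without running the expensive operation again, so a later tool failure costs only the work done since the last checkpoint.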
My agent ignores some tools or uses the wrong tool—what's going wrong?
The agent likely can't distinguish between tools because their schemas are too vague. Agents see only function names, parameter descriptions, and docstrings; they don't have your developer context. Rewrite every tool's schema to be fully self-documenting: describe what the tool does, when to use it versus alternatives, what each parameter means, and what it returns. If two tools sound similar, add explicit guidance about when to choose each one.
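One crude way to spot tools that "sound similar" is to compare their descriptions with the standard library's `difflib.SequenceMatcher`. The sketch below is a heuristic under stated assumptions: the tool names and descriptions are invented, the 0.75 cutoff is arbitrary, and text similarity is only a proxy for how a model perceives the tools.

```python
from difflib import SequenceMatcher
from itertools import combinations

def confusable_pairs(tools: dict[str, str], cutoff: float = 0.75) -> list[tuple[str, str]]:
    """Return pairs of tool names whose descriptions are nearly the same text."""
    pairs = []
    for (name_a, desc_a), (name_b, desc_b) in combinations(tools.items(), 2):
        ratio = SequenceMatcher(None, desc_a.lower(), desc_b.lower()).ratio()
        if ratio >= cutoff:
            pairs.append((name_a, name_b))
    return pairs

# Hypothetical tools: vague descriptions first, then rewritten ones.
vague = {
    "search_orders": "Search for orders.",
    "search_order_history": "Search for order history.",
}
specific = {
    "search_orders": "Find open orders by customer email. Use for orders not yet shipped.",
    "search_order_history": "List shipped or cancelled orders for a customer ID. "
                            "Use for past purchases and returns.",
}

print(confusable_pairs(vague))
print(confusable_pairs(specific))
```

The vague pair is flagged and the rewritten pair is not. The rewritten descriptions also say when to choose each tool, which is the part the agent actually needs.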
My agent handles happy paths fine but breaks on edge cases—how do I improve it?
Edge case brittleness usually comes from over-controlling the workflow. If you've hard-coded step sequences, the agent can't adapt when reality deviates from your expected path. Redesign the workflow as a goal statement with constraints and let the agent reason about unexpected situations. Also ensure error-as-input handling is in place so tool failures become reasoning opportunities rather than crashes. Finally, add edge cases to your eval suite and measure pass rates specifically for those scenarios.
// Comparisons
How does the Schmid framework compare to standard prompt engineering?
Prompt engineering focuses on optimizing the text you send to the LLM—word choice, examples, chain-of-thought instructions. The Schmid framework operates at a higher architectural level: it addresses how you structure state, delegate control, handle failures, test reliability, and design tool interfaces. Good prompt engineering is necessary but insufficient—if your architecture fights the model with rigid workflows and exact-output tests, better prompts alone won't make the agent reliable.
How is the Schmid framework different from the ReAct agent pattern?
ReAct (Reason + Act) is a specific prompting pattern where the agent alternates between reasoning steps and tool actions. The Schmid framework is architecture-level guidance that applies regardless of whether you use ReAct, function calling, or any other pattern. You could implement a ReAct agent and still violate all five Schmid principles—e.g., by hard-coding the reasoning steps, crashing on tool errors, and testing with exact assertions. The two are complementary, not competing.
Should I use the Schmid framework with LangChain or build agents from scratch?
The framework is tool-agnostic. It diagnoses architectural and mindset problems, not implementation choices. You can apply all five principles whether you use LangChain, LangGraph, AutoGen, CrewAI, or raw API calls. The key question is whether your chosen framework lets you implement them, for example goal-based workflows, error-as-input handling, eval-based testing, and self-documenting tools. If your framework forces rigid step sequences or hides errors, it may be working against you.
// Advanced
Can I apply the Schmid framework to multi-agent systems?
Yes. In multi-agent systems, each agent's tools include the other agents, so the Agents Evolve and APIs Don't principle becomes even more critical—every agent's interface must be self-documenting. Error-as-input handling matters more because failures cascade across agents. The Hand Over Control principle applies at the orchestration level: define the overall goal and let agents coordinate, rather than scripting exact interaction sequences between them.
How do I apply the Schmid framework when my agent uses RAG?
RAG (Retrieval-Augmented Generation) is essentially a tool the agent uses. Apply the agent-ready tool principle: ensure the retrieval function's schema clearly describes what it searches, what it returns, how results are ranked, and what happens when no results are found. Apply error-as-input handling: if retrieval fails or returns nothing, feed that to the model as context rather than crashing. Apply evals: measure whether the agent's final answers are correct given the retrieved context, not whether specific documents were retrieved.
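A retrieval tool that treats "no results" as information for the model might look like this sketch. The keyword scoring is a toy stand-in for a real vector store, and `retrieve`, the status values, and the sample index are all hypothetical.

```python
def retrieve(query: str, index: dict[str, str], top_k: int = 3) -> dict:
    """Search `index` for documents containing the query terms.

    Results are ranked by how often the terms occur, highest first.
    When nothing matches, returns status 'no_results' with a note for the model.
    """
    terms = query.lower().split()
    scored = []
    for doc_id, text in index.items():
        score = sum(text.lower().count(term) for term in terms)
        if score > 0:
            scored.append((score, doc_id, text))
    scored.sort(key=lambda item: (-item[0], item[1]))
    if not scored:
        return {
            "status": "no_results",
            "documents": [],
            "note": f"No documents matched '{query}'. Rephrase the query or "
                    "tell the user the information is unavailable.",
        }
    return {
        "status": "ok",
        "documents": [{"id": d, "text": t, "score": s} for s, d, t in scored[:top_k]],
    }

index = {
    "a": "Refund policy: refunds within 30 days.",
    "b": "Shipping takes 5 days.",
}
print(retrieve("refund", index)["documents"][0]["id"])
print(retrieve("warranty", index)["status"])
```

The docstring doubles as the tool description: it says what is searched, how results are ranked, and what an empty result looks like.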
What does the observe-adjust loop look like in practice for agent development?
Define your agent's instructions and tools. Run the agent against a set of test inputs. Observe the full trace—not just the final output, but every reasoning step and tool call. Identify where behavior diverges from expectations. Adjust prompts, tool descriptions, or system instructions. Run again and compare. This loop replaces the traditional write-test-deploy cycle. Schedule dedicated observation sessions; don't assume correctness after initial deployment. Track prompt and tool changes like code commits.
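The observe and compare steps depend on keeping a trace of each run. The sketch below assumes a hypothetical `Trace` class, a `first_difference` helper, and invented tool names. A real tracing tool would capture much more detail per event.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Trace:
    prompt_version: str  # tracked like a commit id
    events: list[dict] = field(default_factory=list)

    def record(self, kind: str, detail: str) -> None:
        self.events.append({"step": len(self.events) + 1, "kind": kind, "detail": detail})

    def tool_sequence(self) -> list[str]:
        return [e["detail"] for e in self.events if e["kind"] == "tool_call"]

def first_difference(before: list[str], after: list[str]) -> Optional[int]:
    """Index of the first position where two tool sequences differ, or None."""
    for i, (a, b) in enumerate(zip(before, after)):
        if a != b:
            return i
    if len(before) != len(after):
        return min(len(before), len(after))
    return None

v1 = Trace("prompt-v1")
v1.record("reasoning", "Customer wants a refund.")
v1.record("tool_call", "lookup_order")
v1.record("tool_call", "issue_refund")

v2 = Trace("prompt-v2")  # same input after adjusting the instructions
v2.record("tool_call", "lookup_order")
v2.record("tool_call", "check_policy")
v2.record("tool_call", "issue_refund")

print(first_difference(v1.tool_sequence(), v2.tool_sequence()))
```

A difference between runs is not automatically a defect. It marks the step worth reading in the full trace before deciding whether the adjustment helped.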
How do I convince my team to adopt evals instead of unit tests for agents?
Show them a concrete example: run the same agent input 10 times and demonstrate that it produces functionally correct but textually different outputs each time. Point out that an exact-match unit test pinned to one of those outputs would fail on the other 9 runs even though all 10 are correct. Then propose a reliability threshold (e.g., 8/10 must pass functional criteria) and show how evals measure real-world performance more accurately. Frame it as upgrading your testing strategy for non-deterministic systems, not abandoning rigor.
How often should I run the observe-adjust loop for my AI agent?
At minimum, run it after every significant change to prompts, tools, or system instructions. During initial development, expect to run it dozens of times per day. After deployment, schedule periodic observation sessions—weekly or after any model update—to catch drift. Treat prompt and tool changes with the same discipline as code changes: version them, review them, and validate them through evals before promoting to production.