Hablich Agent Interface Engineering Framework
Design MCP tools and agent interfaces that are fuel-efficient, self-healing, discoverable, and trustworthy — so agents complete user journeys without flying blind, burning excess tokens, or creating security backdoors.
// TL;DR
The Hablich Agent Interface Engineering Framework is a systematic method for designing MCP servers, CLI tools, and agent-facing APIs that are fuel-efficient, self-healing, discoverable, and trustworthy. Use it whenever you're building or auditing any interface that AI agents consume — especially when agents fail tasks, burn excessive tokens, pick the wrong tools, choke on raw data dumps, or when you need to make trust and permission decisions across different deployment tiers. It provides a step-by-step workflow covering trust boundaries, semantic summaries, tool categorisation, description auditing, error recovery playbooks, and measurement of tokens per successful outcome.
// When should I apply the Hablich Agent Interface Engineering Framework?
Use this skill whenever you are building, auditing, or improving an MCP server, CLI tool, or any interface that agents will consume. Trigger it when you notice agents failing tasks, burning too many tokens, getting stuck on errors, not calling the right tools, or when you are making trust/permission decisions in an agentic system.
// What inputs do I need before applying the Hablich framework?
- Agent interface description (required)
What MCP server, CLI tool, or agent-facing API you are designing or auditing: its purpose, current tools, and target agent harness (e.g. Claude Code, Gemini CLI, Codex).
- Target user journeys (required)
The specific task classes or workflows agents will perform through this interface. List each distinct journey separately because fuel efficiency metrics must not be compared across journey types.
- Current tool inventory (required)
List of tools currently exposed, with their names and descriptions (or intended descriptions). Can be a draft or existing schema.
- Deployment tier (required)
Which tier the agent operates in: Tier 1 (local dev, human in loop), Tier 2 (CI/controlled environment), or Tier 3 (full internet access / browsing agent fleet).
- Observed failure modes
Known or suspected ways the interface currently fails, such as context window blowout, agents getting stuck, wrong tool selection, or security concerns.
// What are the core principles behind the Hablich Agent Interface Engineering Framework?
Agents Are a Different User Class
Agents and humans share the same intent and goal, but have fundamentally different cognitive bottlenecks. Humans need visual complexity — layout, colour, signal. Agents have no such need; their bottleneck is token cost and reasoning load. Treat agents as a separate user segment with their own non-functional requirements: efficiency, discoverability, security, stability.
Don't Force the Agent to Read the Entire Book
Throwing raw data (e.g. a 50,000-line JSON trace file) at an agent pushes it into the dump zone — context window overflow and degraded reasoning. Instead, point the agent at the right sentence: return semantic summaries and structured markdown that surface only the signal needed for the task.
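A minimal sketch of this idea in Python, assuming a hypothetical trace format in which each event is a dict with `name` and `duration_ms` keys. Neither the format nor the function name comes from the framework itself. It collapses raw events into a few lines of markdown that carry only the signal:

```python
from collections import Counter


def summarise_trace(events, top_n=3):
    """Reduce raw trace events to a short markdown summary.

    Assumption: each event is a dict with hypothetical keys
    'name' (str) and 'duration_ms' (float).
    """
    total_ms = sum(e["duration_ms"] for e in events)
    # Aggregate time per event name so repeated events collapse into one row.
    per_name = Counter()
    for e in events:
        per_name[e["name"]] += e["duration_ms"]
    lines = [
        "## Trace summary",
        f"- events: {len(events)}",
        f"- total duration: {total_ms:.1f} ms",
        f"### Top {top_n} by total duration",
    ]
    for name, ms in per_name.most_common(top_n):
        lines.append(f"- {name}: {ms:.1f} ms")
    return "\n".join(lines)


events = [
    {"name": "layout", "duration_ms": 12.0},
    {"name": "script", "duration_ms": 80.5},
    {"name": "layout", "duration_ms": 30.0},
    {"name": "paint", "duration_ms": 4.5},
]
print(summarise_trace(events, top_n=2))
```

The raw `events` list can still be kept for post-processing pipelines; only the default agent-facing response changes.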
Fuel Efficiency of the Interface
The core metric for agent interface quality is tokens per successful outcome — the fuel efficiency of the interface. Fuel efficiency is worthless if you can't reach the destination, so always measure effectiveness (did the agent complete the full user journey?) alongside efficiency (token cost, tool calls, duration). Never compare this metric globally across different journey types; measure it within each specific user journey.
The Schema Is the UI for the Agent
Tool descriptions are the user interface agents navigate. Just as a bad UI loses human users, quality smells in MCP tool descriptions cause agents to call wrong tools, skip correct ones, or fail entirely. Auditing and improving descriptions is a first-class engineering task, not documentation hygiene.
Every Trade-off Shifts, It Doesn't Disappear
Every optimisation in agent interface design introduces a new trade-off. Adding more tool descriptions grows context window size. Reducing tools (Slim Mode) saves tokens but removes capability and forces extra turns. Adding more skills risks agents calling them inappropriately. Always name the trade-off explicitly and decide consciously rather than assuming a fix is a free lunch.
Never Compromise Trust for Convenience
In traditional UX, removing friction is always a win. In agentic systems, some friction is by design. Convenience features that reduce human consent checkpoints — like auto-remembering permissions — can open backdoors exploitable via prompt injection. Tier your security model by deployment context and do not share security assumptions between a local Tier 1 agent and a Tier 3 browsing agent fleet.
// How do you apply the Hablich framework step by step?
1. Identify deployment tier and establish trust boundaries before any other design decision
Classify the agent's operating environment. Tier 1: local dev with a human in the loop, using the default Chrome profile and local data, with time-bound consent. Tier 2: CI or another controlled environment, with data separation (containers, separate profiles, remote debugging port). Tier 3: full internet access, with domain allow lists, prompt injection mitigations, and maximum isolation. A tool can be shared across tiers; the security model must not be. Apply the Lethal Trifecta lens (Simon Willison) to Tier 1 specifically: confirm a human must consent at each sensitive action, and treat convenience requests that remove that consent as security risks, not UX wins.
2. Map each target user journey and diagnose whether agents are currently flying blind
List each distinct task class the agent must complete. For each, ask: what data does the agent currently receive, and is it raw/voluminous or semantically summarised? If the agent is receiving multi-megabyte raw files or deeply nested JSON, flag this as a dump zone risk. Identify which journeys are lightweight (e.g. web scraping) vs. intricate (e.g. debugging responsive layout) — fuel efficiency targets will differ and must not be compared across these classes.
3. Replace raw data outputs with semantic summaries for data-heavy tool responses
For any tool that currently returns large raw payloads (traces, logs, trees, JSON blobs), engineer an alternative return format: structured markdown and semantic summaries that surface only the actionable signal. Do not delete the raw output capability — it may be needed for post-processing pipelines — but make the semantic summary the default agent-facing response. The goal is pointing the agent at the right sentence, not handing it the entire book.
4. Categorise tools to control default context exposure
Review your full tool inventory. Identify niche tools that only apply to specific sub-populations of users or use cases. Hide these behind command line parameters or opt-in flags — do not add them to the default context. Consider offering a Slim Mode (minimal tool set — e.g. select, navigate, evaluate) for cost-sensitive or context-constrained deployments, but explicitly document the capability trade-offs: slim mode reduces token cost but may force extra turns or block certain tasks entirely.
5. Audit every tool description for intent clarity using the 'schema as UI' standard
For each tool, evaluate the description against two criteria: (1) Purpose — does it clearly explain the tool's core function? (2) Usage guidelines — does it provide clear activation criteria, i.e. when should an agent call this tool? Add concrete trigger signals: domain terms, task types, metric names that an agent can pattern-match to. Example pattern: 'Use to find front-end performance issues and core web vitals: LCP, INP, CLS' — this lets the agent infer the tool is relevant when asked to improve page load. Be aware that longer descriptions grow context window size and can bias smaller models toward wrong tool selection; aim for minimum viable description per tool.
6. Build error recovery playbooks: turn every error state into a self-healing opportunity
For each tool, enumerate its failure modes. Then: (a) Useful error messages — rewrite vague errors to include actionable recovery information so the agent can self-heal without human intervention. (b) Proactive detours — where model training data might cause the agent to reach for the wrong tool, add explicit redirection in the tool description or schema pointing to the correct tool. (c) Diagnostic playbooks / troubleshooting skills — for recurring setup or configuration failures, build a dedicated skill that walks the agent (and human) through resolution. Every unhandled error costs tokens on retry; every self-healed error saves them.
7. Add skills for intricate multi-step workflows, but budget their context cost consciously
Skills (pre-built agentic workflows) are powerful for complex, repeatable journeys. Add them when a user journey has too many steps to be reliably assembled by the agent from individual tools alone. However, skills are not a free lunch: piling in too many skills inflates context window size and causes agents to invoke skills inappropriately, so the same discoverability problem re-emerges at the skill level. Apply the same minimum viable description discipline to skill descriptions as to individual tools.
8. Instrument and measure tokens per successful outcome per user journey
Implement measurement even if imperfect — data-informed decisions beat gut-driven decisions. Track: token cost, tool calls, duration, and task completion (binary: did the agent complete the full user journey?). Visualise per-journey fuel efficiency (e.g. bar chart where bar length = effectiveness). Prioritise engineering effort on the journeys with the worst tokens per successful outcome. Do not aggregate across journey types — a debugging session will always use more tokens than a scraping session, and that is appropriate, not a problem.
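The measurement in the last step can be sketched in a few lines of Python. The run-record shape (`journey`, `tokens`, `success`) is a hypothetical logging format, and counting the tokens of failed runs in the numerator is an assumption about how "normalised against successful completions" should be read:

```python
from collections import defaultdict


def tokens_per_successful_outcome(runs):
    """Return {journey: tokens per successful outcome, or None if no success}.

    Assumption: each run is a dict with hypothetical keys
    'journey' (str), 'tokens' (int) and 'success' (bool).
    Tokens spent on failed runs are kept in the numerator, so
    failures make a journey look less fuel-efficient.
    """
    tokens = defaultdict(int)
    successes = defaultdict(int)
    for run in runs:
        tokens[run["journey"]] += run["tokens"]
        if run["success"]:
            successes[run["journey"]] += 1
    # Metrics are reported per journey and never aggregated across journeys.
    return {
        journey: (tokens[journey] / successes[journey]) if successes[journey] else None
        for journey in tokens
    }


runs = [
    {"journey": "scrape", "tokens": 1000, "success": True},
    {"journey": "scrape", "tokens": 1200, "success": True},
    {"journey": "scrape", "tokens": 800, "success": False},
    {"journey": "debug", "tokens": 9000, "success": True},
    {"journey": "debug", "tokens": 7000, "success": False},
]
print(tokens_per_successful_outcome(runs))
```

A journey with no successful run reports `None` instead of a number, which keeps "cheap but never finishes" from looking efficient.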
// What does the Hablich framework look like in practice?
A developer has built an MCP server for a code analysis platform. It exposes one monolithic tool called 'analyse_codebase' that returns a full AST dump and raw linting output. Agents frequently hit context limits and fail to complete tasks.
Apply Step 3: replace the raw AST and lint dump with a semantic summary returning only the top issues, file locations, and suggested fix types in structured markdown. Apply Step 4: decompose 'analyse_codebase' into targeted tools (e.g. 'find_security_issues', 'check_type_errors', 'audit_dependencies') and hide rarely-used tools like 'export_full_ast' behind an opt-in flag. Apply Step 5: give each tool a description with explicit activation criteria — 'Use find_security_issues when asked to audit for SQL injection, XSS, or exposed secrets.' Measure tokens per successful outcome per task class (security audit vs. dependency check vs. full review) separately.
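As a rough illustration of the decomposition and opt-in hiding described in this scenario, the sketch below keeps a small tool registry in plain Python. The `Tool` class, the `opt_in` field, and all descriptions except the one quoted for `find_security_issues` are hypothetical; a real MCP server would register tools through its own SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description: str
    opt_in: bool = False  # hidden from the default context unless enabled


TOOLS = [
    Tool(
        "find_security_issues",
        "Use find_security_issues when asked to audit for SQL injection, "
        "XSS, or exposed secrets.",
    ),
    # The next three descriptions are illustrative placeholders.
    Tool("check_type_errors", "Use when asked to find or fix type errors."),
    Tool("audit_dependencies", "Use when asked to check dependencies for known problems."),
    Tool("export_full_ast", "Use only when a full raw AST is explicitly requested.", opt_in=True),
]


def default_context(tools, enabled=frozenset()):
    """Return the tools exposed to the agent: all default tools plus
    any opt-in tools whose names were explicitly enabled."""
    return [t for t in tools if not t.opt_in or t.name in enabled]


print([t.name for t in default_context(TOOLS)])
print([t.name for t in default_context(TOOLS, enabled={"export_full_ast"})])
```

The trade-off stays visible in the code: an agent that needs `export_full_ast` cannot reach it until someone opts in.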
A team is building a browser automation MCP server for an internal QA fleet running in CI, and separately a local debugging tool for individual engineers. They are considering sharing the same permission model across both.
Apply Step 1 immediately: classify the CI fleet as Tier 2 and the local tool as Tier 1. They can share the same tools but must not share the same security model. Tier 2 requires data separation (containers, isolated browser profiles, remote debugging port connectivity). Tier 1 requires explicit human consent at each sensitive action — do not implement 'remember my choice' / autoconnect as a default, because this removes the by-design friction that protects against prompt injection. Document this split explicitly in the interface architecture.
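One way to make the split explicit is to give each tier its own security model object while the tool list stays shared. The sketch below is a hypothetical shape, not an API of any MCP server; in particular, the choice that the Tier 2 fleet runs without per-action human consent is an assumption, since the scenario only requires data separation there.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    LOCAL_DEV = 1  # Tier 1: local debugging tool, human in the loop
    CI = 2         # Tier 2: internal QA fleet in CI


@dataclass(frozen=True)
class SecurityModel:
    consent_per_sensitive_action: bool  # prompt the human every time
    remember_choice: bool               # 'remember my choice' / autoconnect
    isolated_browser_profile: bool      # data separation
    runs_in_container: bool


# One model per tier: the tools are shared, these objects are not.
SECURITY_MODELS = {
    Tier.LOCAL_DEV: SecurityModel(
        consent_per_sensitive_action=True,
        remember_choice=False,  # keep the by-design friction
        isolated_browser_profile=False,
        runs_in_container=False,
    ),
    Tier.CI: SecurityModel(
        consent_per_sensitive_action=False,  # assumption: no human in CI
        remember_choice=False,
        isolated_browser_profile=True,
        runs_in_container=True,
    ),
}


def needs_human_consent(tier: Tier, action_is_sensitive: bool) -> bool:
    """True when the tier's model requires a consent prompt for this action."""
    model = SECURITY_MODELS[tier]
    return action_is_sensitive and model.consent_per_sensitive_action


print(needs_human_consent(Tier.LOCAL_DEV, action_is_sensitive=True))
print(needs_human_consent(Tier.CI, action_is_sensitive=True))
```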
An agent interface for a data pipeline tool has 30 tools exposed by default. Agents repeatedly call the wrong tools and require multiple correction turns, inflating token costs.
Apply Steps 4 and 5 together. First, audit which tools are niche (used in <20% of journeys) and move them behind opt-in parameters. Consider offering a Slim Mode exposing only the 3-5 highest-utility tools for the most common journey. Second, audit every remaining tool description: does it define purpose and provide activation criteria? Rewrite descriptions to include the specific task triggers an agent would encounter — use the exact domain vocabulary the agent will see in user prompts. Monitor whether smaller models in the harness become biased toward over-using newly described tools, and trim descriptions if so.
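The niche-tool audit in this scenario can be approximated from usage logs. The sketch below assumes a hypothetical log format (one set of tool names per completed journey) and made-up tool names; the 20% threshold is the one named above.

```python
from collections import Counter


def split_by_usage(journey_logs, all_tools, niche_threshold=0.20):
    """Split tools into (default, niche, share) by the share of journeys using them.

    Assumption: journey_logs is a non-empty list in which each entry is the
    set of tool names called during one journey.
    """
    if not journey_logs:
        raise ValueError("need at least one journey to compute usage shares")
    used_in = Counter()
    for tools_used in journey_logs:
        used_in.update(set(tools_used))  # count each tool once per journey
    share = {tool: used_in[tool] / len(journey_logs) for tool in all_tools}
    default = sorted(t for t in all_tools if share[t] >= niche_threshold)
    niche = sorted(t for t in all_tools if share[t] < niche_threshold)
    return default, niche, share


# Hypothetical data pipeline tools and ten logged journeys.
all_tools = ["query", "load", "repartition", "vacuum"]
journey_logs = (
    [{"query", "load"}] * 6
    + [{"query"}] * 3
    + [{"query", "repartition"}]
)
default, niche, share = split_by_usage(journey_logs, all_tools)
print(default)
print(niche)
```

Tools in `niche` are candidates for opt-in parameters; whether to move them remains a judgment call about the capability that is lost.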
// What mistakes should I avoid when designing agent interfaces?
- Throwing raw, voluminous data (multi-megabyte logs, full trace files, complete AST dumps) at agents — this pushes them into the dump zone and blows through the context window.
- Building one monolithic tool and assuming the agent will figure out sub-tasks — this was the 'debug_webpage' mistake; agents need decomposed, targeted tools with clear activation criteria.
- Decomposing into many tools but leaving descriptions with quality smells — the schema is the UI; 97% of MCP tool descriptions have quality smells that cause wrong tool selection.
- Treating skills and rich descriptions as a free lunch: every addition to context costs tokens and can cause agents to invoke tools or skills they shouldn't. The trade-off shifts; it never disappears.
- Comparing tokens per successful outcome globally across different user journey types — a debugging journey will always cost more than a scraping journey; aggregate comparison masks real signal.
- Removing human consent friction for convenience (e.g. auto-remembering permissions / autoconnect by default) — in agentic systems, some friction is by design. Convenience that eliminates consent checkpoints creates backdoors exploitable via prompt injection.
- Applying the same security model to a local Tier 1 agent and a Tier 3 internet-browsing agent fleet — these tiers may share tools but must never share security assumptions.
- Writing error messages that are vague or tool-internal — every unhelpful error forces the agent to burn tokens on retry or require human intervention; good error messages enable self-healing.
- Skipping measurement because it is hard to instrument perfectly — even an imperfect measurement of tokens per successful outcome is better than gut-driven decisions.
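To illustrate the point about vague error messages in the list above, here is a minimal Python sketch of an error payload that tells the agent what went wrong, how to recover, and which tool to call next. The field names, the error code, and the `navigate` / `select_page` handler names are hypothetical, chosen only for illustration.

```python
def recoverable_error(code, message, recovery, suggested_tool=None):
    """Build an error payload an agent can act on without human help.

    All field names are illustrative assumptions, not a standard.
    """
    payload = {"error": code, "message": message, "recovery": recovery}
    if suggested_tool:
        payload["suggested_tool"] = suggested_tool  # proactive detour
    return payload


def navigate(url, open_pages):
    """Toy tool handler: navigation needs a selected page."""
    if not open_pages:
        # Instead of a bare 'navigation failed', explain the fix.
        return recoverable_error(
            "NO_PAGE_SELECTED",
            "navigate needs an open page, but none is selected.",
            "Open or select a page first, then retry navigate with the same url.",
            suggested_tool="select_page",
        )
    return {"ok": True, "url": url}


print(navigate("https://example.com", open_pages=[]))
```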
// What are the key terms in the Hablich Agent Interface Engineering Framework?
- Flying Blind
- The state of a coding agent that can generate code but cannot validate what it has actually done — it has no sensory interface back to the environment it is affecting. The motivating problem this framework solves.
- Dump Zone
- The degraded reasoning state an agent enters when its context window is overwhelmed with too much data (e.g. a 50,000-line JSON trace file). Named in reference to Matt's talk on context window management.
- Tokens Per Successful Outcome
- The primary fuel efficiency metric for an agent interface. Measures token cost (plus tool calls and duration) normalised only against successful task completions — not all attempts. 'Fuel efficiency is worthless if you can't reach your destination.'
- Fuel Efficiency of the Interface
- The aggregate quality of an agent-facing interface as measured by tokens per successful outcome. High fuel efficiency means the agent completes user journeys using the minimum token expenditure.
- Effectiveness
- Whether the agent completes the entire user journey and fulfils the functional intent. Binary measure: yes or no. Must be measured alongside efficiency — an optimised but ineffective interface is worthless.
- Semantic Summary
- A structured, human-readable (markdown) distillation of a large raw data payload that surfaces only the actionable signal relevant to the agent's task. The alternative to throwing the entire book at the agent.
- Schema Is the UI for the Agent
- The principle that MCP tool descriptions (names, parameter schemas, docstrings) function as the user interface the agent navigates. Poor descriptions = bad UI = wrong tool selection and task failure.
- Tool Categorisation
- The practice of hiding niche or rarely-needed tools behind opt-in command line parameters rather than exposing them in the default context, to reduce context window bloat and wrong tool selection.
- Slim Mode
- An extreme application of tool categorisation that exposes only the minimum viable tool set (e.g. 3 tools: select page, navigate page, evaluate script) for maximum token efficiency. Explicit trade-off: reduced capability and potential for extra agent turns.
- Proactive Detours
- Explicit redirections built into tool descriptions or schemas that counteract an agent's training-data biases — steering the agent toward the correct tool before it reaches for the wrong one.
- Diagnostic Playbooks
- Pre-built troubleshooting skills that activate when an agent (or human) encounters a known recurring failure mode, enabling self-healing without human intervention.
- Self-Healing
- The ability of an agent to recover from errors without requiring human intervention, enabled by useful error messages, proactive detours, and diagnostic playbooks.
- Minimum Viable Description
- The shortest tool description that still provides sufficient purpose definition and activation criteria for the agent to select and use the tool correctly. The target end-state of iterative description optimisation — never fully finished because models and harnesses keep changing.
- Lethal Trifecta
- Simon Willison's term (referenced here only in brief) for the three converging risk factors that make prompt injection dangerous in agentic systems: access to private data, exposure to untrusted content, and the ability to communicate externally. Drives the by-design friction principle in Tier 1 trust design.
- Tier 1 / Tier 2 / Tier 3 (Trust Tiers)
- A three-tier deployment classification for browser-using agents. Tier 1: local dev environment, human in loop, time-bound consent required. Tier 2: CI/controlled environment, data separation (containers, isolated profiles) required. Tier 3: full internet access (YOLO mode) — domain allow lists, prompt injection mitigations, and full Tier 2 controls required.
- Autoconnect
- A convenience feature that lets a human share their screen/session with an agent without repeated consent prompts. Flagged as a trust boundary risk: convenience that removes consent friction is rejected by design in Tier 1 deployments.
- Agents Are a Different User Class
- The foundational insight that agents are a separate user segment from humans, sharing the same goals and intents but having fundamentally different cognitive bottlenecks — token cost and reasoning load rather than visual complexity.
// FREQUENTLY ASKED QUESTIONS
What is the Hablich Agent Interface Engineering Framework?
The Hablich Agent Interface Engineering Framework is a structured approach for designing and auditing MCP servers, CLI tools, and agent-facing APIs so that AI agents can complete user journeys without wasting tokens, getting stuck on errors, picking wrong tools, or creating security backdoors. It was derived from lessons building Chrome DevTools MCP at Google and introduces key concepts like fuel efficiency (tokens per successful outcome), trust tiers, semantic summaries, tool categorisation, and the principle that the schema is the UI for agents.
What is fuel efficiency in the context of agent interfaces?
Fuel efficiency measures the token cost required for an agent to successfully complete a user journey through your interface. The core metric is tokens per successful outcome — the total tokens consumed (plus tool calls and duration) divided by successful completions only. An interface with high fuel efficiency means the agent accomplishes tasks with minimal token expenditure. Critically, you must measure this per journey type, never globally — a debugging session naturally costs more tokens than a scraping task.
How do I design an MCP server that agents can actually use well?
Start by classifying your deployment tier and setting trust boundaries. Then map every user journey the agent must complete and check whether it currently receives raw data dumps or semantic summaries. Replace large raw payloads with structured markdown summaries. Decompose monolithic tools into targeted ones with clear activation criteria. Audit every tool description as if it were a UI element. Build error recovery playbooks for self-healing. Finally, instrument tokens per successful outcome per journey type to guide ongoing improvements.
How do I audit MCP tool descriptions for quality?
Evaluate each tool description against two criteria: purpose (does it clearly explain the tool's core function?) and activation criteria (does it tell the agent when to call this tool?). Add concrete trigger signals — domain terms, task types, metric names the agent can pattern-match. For example: 'Use to find front-end performance issues and core web vitals: LCP, INP, CLS.' Aim for minimum viable descriptions — long descriptions grow context windows and can bias smaller models toward wrong tool selection.
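A first-pass audit along these lines can be automated as a simple lint. The sketch below is a heuristic under stated assumptions: the cue phrases, the four-word minimum, and the 40-word budget are illustrative choices, not thresholds from the framework, and a passing result does not prove a description is good.

```python
import re

# Phrases that usually signal activation criteria (an illustrative list).
ACTIVATION_CUES = re.compile(
    r"\b(use (this )?(to|when|for)|call (this )?when)\b", re.IGNORECASE
)


def audit_description(description, max_words=40):
    """Return a list of findings; an empty list means no smell was detected."""
    findings = []
    words = description.split()
    if len(words) < 4:
        findings.append("purpose unclear: description is too short")
    if not ACTIVATION_CUES.search(description):
        findings.append("no activation criteria: say when to call the tool")
    if len(words) > max_words:
        findings.append(f"too long: {len(words)} words (budget {max_words})")
    return findings


good = "Use to find front-end performance issues and core web vitals: LCP, INP, CLS"
print(audit_description(good))
print(audit_description("Runs analysis."))
```

The word budget is the knob for the minimum viable description trade-off: raising it allows richer trigger signals at the cost of a larger default context.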
How does the Hablich framework compare to just writing good API documentation?
Traditional API documentation targets human developers who can browse, search, and reason across pages. The Hablich framework treats agents as a fundamentally different user class whose bottleneck is token cost and reasoning load, not visual navigation. It goes beyond documentation by engineering semantic summaries, tool categorisation with slim modes, proactive detours to counter training-data biases, tiered security models, self-healing error playbooks, and fuel-efficiency measurement — none of which are standard API documentation practices.
When should I use the Hablich Agent Interface Engineering Framework?
Use it whenever you are building, auditing, or improving any interface that AI agents consume — MCP servers, CLI tools, or agent-facing APIs. Trigger it specifically when you notice agents failing tasks, burning too many tokens, hitting context window limits, getting stuck on errors, selecting the wrong tools, or when you need to make trust and permission decisions across deployment tiers (local dev, CI, or internet-browsing agent fleets).
What are trust tiers in agent interface design?
Trust tiers classify agent deployment environments into three levels. Tier 1 is local development with a human in the loop and time-bound consent required at each sensitive action. Tier 2 is CI or controlled environments requiring data separation via containers and isolated profiles. Tier 3 is full internet access ('YOLO mode') requiring domain allow lists, prompt injection mitigations, and maximum isolation. Tools can be shared across tiers, but the security model must never be shared between them.
What results can I expect from applying the Hablich framework to my MCP server?
You can expect measurably lower token costs per successful task completion, higher task completion rates, fewer wrong-tool-selection errors, and agents that self-heal from errors instead of requiring human intervention. Teams applying the framework typically see reduced context window blowouts by replacing raw data dumps with semantic summaries, better tool discoverability through audited descriptions, and a clear security posture via trust tier separation. The key outcome is a data-informed improvement loop driven by tokens per successful outcome metrics.
What is the dump zone in agent interfaces?
The dump zone is the degraded reasoning state an agent enters when its context window is overwhelmed with too much data — like a 50,000-line JSON trace file or full AST dump. When an agent hits the dump zone, its reasoning quality collapses and it either fails the task or produces unreliable output. The fix is replacing raw data payloads with semantic summaries that surface only the actionable signal, pointing the agent at the right sentence instead of handing it the entire book.
What is Slim Mode for MCP tools?
Slim Mode is an extreme application of tool categorisation where you expose only the minimum viable tool set — typically 3-5 core tools like select, navigate, and evaluate — for maximum token efficiency. It's designed for cost-sensitive or context-constrained deployments. The explicit trade-off is reduced capability: the agent may need extra turns to accomplish tasks or may be blocked from certain tasks entirely. Slim Mode should be documented with these capability trade-offs clearly stated.