How Do I Build MCP Tools That Agents Use Correctly?

For MCP server developers building tools for coding agents like Claude Code, Gemini CLI, or Codex · Based on the Hablich Agent Interface Engineering Framework

// TL;DR

If you're building an MCP server for coding agents like Claude Code or Gemini CLI, the Hablich Agent Interface Engineering Framework helps you design tools that agents discover correctly and use efficiently, with errors they can recover from gracefully. Apply it to replace raw data dumps with semantic summaries, audit tool descriptions as agent-facing UI, categorise tools to prevent context bloat, build self-healing error responses, and measure tokens per successful outcome per journey type. The result is agents that complete tasks reliably at lower token cost.

Why Do Agents Keep Calling the Wrong Tools on My MCP Server?

Agents select tools by reading their descriptions, so the schema is effectively the agent's UI. If your tool descriptions lack clear purpose definitions and activation criteria, agents will guess, and they'll guess wrong. Research from the Chrome DevTools MCP project found that 97% of MCP tool descriptions have quality smells that cause incorrect tool selection.

The Hablich framework treats description auditing as a first-class engineering task. For each tool, write a minimum viable description that answers two questions: What does this tool do? When should an agent call it? Include domain-specific trigger terms the agent will encounter in user prompts. For example: 'Finds unused CSS rules in the current page. Use when asked to reduce page weight, improve loading performance, or clean up stylesheets.'
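
As a rough illustration, an audit like this can be partly automated. The sketch below checks a description for a purpose sentence, an activation clause, and expected trigger terms. The `audit_description` helper and the 'Use when' convention are assumptions made for this example, not something the framework prescribes.

```python
import re

# Assumed convention: activation criteria are introduced with "Use when".
ACTIVATION_PATTERN = re.compile(r"\buse (this )?when\b", re.IGNORECASE)


def audit_description(description: str, trigger_terms: list[str]) -> list[str]:
    """Return the problems found in a tool description (empty list means OK)."""
    problems = []
    text = description.strip()
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

    # Purpose: expect a sentence that comes before the activation clause.
    if not sentences or ACTIVATION_PATTERN.search(sentences[0]):
        problems.append("missing purpose sentence")
    # Activation criteria: the description should say when to call the tool.
    if not ACTIVATION_PATTERN.search(text):
        problems.append("missing activation criteria")
    # Trigger terms: phrases the agent is likely to meet in user prompts.
    lowered = text.lower()
    missing = [t for t in trigger_terms if t.lower() not in lowered]
    if missing:
        problems.append("missing trigger terms: " + ", ".join(missing))
    return problems


good = ("Finds unused CSS rules in the current page. Use when asked to reduce "
        "page weight, improve loading performance, or clean up stylesheets.")
print(audit_description(good, ["page weight", "loading performance"]))
print(audit_description("CSS helper.", ["page weight"]))
```

A check like this only catches structural gaps. Whether a description actually steers agents correctly still has to be measured against real journeys.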

Also add proactive detours: explicit redirections that counteract training-data biases. If agents frequently call your `profile_cpu` tool when they should call `detect_memory_leaks`, add a note to the `profile_cpu` description: 'For memory analysis, use detect_memory_leaks instead.'
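
One way to keep detours consistent is to generate them rather than hand-write them in each description. The following is a minimal sketch: `with_detours` is a hypothetical helper, the purpose text for `profile_cpu` is invented for illustration, and the dictionary uses the `name`, `description`, and `inputSchema` fields of an MCP tool definition.

```python
def with_detours(description: str, detours: dict[str, str]) -> str:
    """Append one redirect sentence per commonly confused task."""
    notes = [f"For {task}, use {tool} instead." for task, tool in detours.items()]
    return " ".join([description.strip(), *notes])


# Illustrative tool definition; the purpose sentence is an assumption.
profile_cpu = {
    "name": "profile_cpu",
    "description": with_detours(
        "Records a CPU profile of the current page. "
        "Use when asked why a page is slow or unresponsive.",
        {"memory analysis": "detect_memory_leaks"},
    ),
    "inputSchema": {"type": "object", "properties": {}},
}

print(profile_cpu["description"])
```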

How Do I Stop My MCP Server From Blowing Up the Agent's Context Window?

If any of your tools return large raw payloads — full trace files, AST dumps, log files, deeply nested JSON — you're pushing agents into the dump zone. The dump zone is the degraded reasoning state that occurs when an agent's context window is overwhelmed with too much data.

The fix is semantic summaries: structured markdown responses that surface only the actionable signal. Don't delete the raw output — other pipelines may need it — but make the semantic summary the default response for agents.

Beyond response size, audit your tool inventory. If you expose 30 tools by default, every tool's description consumes context tokens before the agent even starts working. Apply tool categorisation: hide niche tools behind opt-in flags. Consider offering a Slim Mode with only 3-5 core tools for cost-sensitive deployments. Document the trade-off: slim mode saves tokens but may force extra turns.
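
A sketch of how categorisation and Slim Mode might be wired up follows. The `ToolSpec` type, the category names, and the registry contents are hypothetical. How the flags reach the server (command-line parameters, environment variables, config) is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSpec:
    name: str
    category: str          # "core" is always on; other categories are opt-in
    in_slim: bool = False  # part of the minimal Slim Mode set


# Hypothetical registry; tool names and categories are illustrative only.
REGISTRY = [
    ToolSpec("navigate_page", "core", in_slim=True),
    ToolSpec("take_screenshot", "core", in_slim=True),
    ToolSpec("list_console_messages", "core"),
    ToolSpec("profile_cpu", "performance"),
    ToolSpec("detect_memory_leaks", "memory"),
]


def exposed_tools(registry, enabled=frozenset(), slim_mode=False):
    """Return the tool names to advertise for the given server flags."""
    if slim_mode:
        return [t.name for t in registry if t.in_slim]
    return [t.name for t in registry
            if t.category == "core" or t.category in enabled]


print(exposed_tools(REGISTRY))                      # default: core tools only
print(exposed_tools(REGISTRY, enabled={"memory"}))  # core plus one opt-in category
print(exposed_tools(REGISTRY, slim_mode=True))      # minimal set
```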

How Do I Make My MCP Server Self-Healing When Agents Hit Errors?

Every vague error message forces the agent to burn tokens on retries or to fall back on human intervention. The Hablich framework requires you to enumerate failure modes for every tool and engineer three self-healing mechanisms:

1. Useful error messages — rewrite generic errors to include actionable recovery steps. Instead of 'Connection failed,' return 'Connection to Chrome debugging port 9222 failed. Verify Chrome is running with --remote-debugging-port=9222.'

2. Proactive detours — redirect agents away from common wrong-path choices before they happen.

3. Diagnostic playbooks — pre-built troubleshooting skills for recurring setup failures.
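
The first mechanism can be sketched as a lookup from failure type to recovery hint. The `RECOVERY_HINTS` table and `to_tool_error` helper are assumptions for illustration, and the timeout hint is invented. The returned dictionary follows the MCP tool-result shape, with `isError` and a text `content` item.

```python
# Hypothetical mapping from failure type to an actionable recovery hint.
RECOVERY_HINTS = {
    ConnectionRefusedError: (
        "Connection to Chrome debugging port {port} failed. "
        "Verify Chrome is running with --remote-debugging-port={port}."
    ),
    TimeoutError: (
        "Chrome on port {port} did not respond in time. "
        "Retry once, then check whether the page is still loading."
    ),
}


def to_tool_error(exc: Exception, port: int = 9222) -> dict:
    """Turn an exception into a tool result the agent can act on."""
    for exc_type, template in RECOVERY_HINTS.items():
        if isinstance(exc, exc_type):
            text = template.format(port=port)
            break
    else:
        # Unknown failure: say so explicitly instead of returning a bare message.
        text = (f"Unexpected {type(exc).__name__}: {exc}. "
                "No recovery hint is known for this failure.")
    return {"isError": True, "content": [{"type": "text", "text": text}]}


print(to_tool_error(ConnectionRefusedError())["content"][0]["text"])
```

Enumerating failure modes then amounts to growing this table until the fallback branch is rarely hit.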

How Do I Measure Whether My MCP Server Is Actually Getting Better?

Track tokens per successful outcome per journey type. For each run, record token cost, tool call count, duration, and task completion (a binary success flag). Don't compare across journey types: a debugging session naturally costs more than a scraping task. Visualise this per-journey 'fuel efficiency' (tokens per successful outcome) and prioritise engineering effort on the journeys with the worst ratios.
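
A minimal sketch of the metric is shown below. It assumes each run is logged as a record with `journey`, `tokens`, and `success` fields, and that tokens spent on failed runs count towards the numerator. Both are choices made for this example rather than a definition taken from the framework.

```python
from collections import defaultdict


def tokens_per_success(runs: list[dict]) -> dict[str, float]:
    """Total tokens spent per journey divided by that journey's successful runs."""
    totals = defaultdict(lambda: {"tokens": 0, "successes": 0})
    for run in runs:
        bucket = totals[run["journey"]]
        bucket["tokens"] += run["tokens"]  # failed runs still cost tokens
        bucket["successes"] += 1 if run["success"] else 0
    return {
        journey: (b["tokens"] / b["successes"] if b["successes"] else float("inf"))
        for journey, b in totals.items()
    }


runs = [
    {"journey": "debugging", "tokens": 1000, "success": True},
    {"journey": "debugging", "tokens": 3000, "success": False},
    {"journey": "debugging", "tokens": 2000, "success": True},
    {"journey": "scraping", "tokens": 500, "success": True},
]
print(tokens_per_success(runs))
```

A journey with no successes reports infinity, which puts it at the top of any worst-ratio ranking.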

Start measuring now, even if your instrumentation is imperfect. Even rough data gives you a better basis for prioritisation than intuition alone.

Next step: Take your highest-traffic MCP tool, audit its description against the purpose + activation criteria standard, and replace any raw data response with a semantic summary. Measure tokens per successful outcome before and after.

// FREQUENTLY ASKED QUESTIONS

How many tools should my MCP server expose by default?

Expose only the tools relevant to the most common user journeys, typically 5-10 core tools. Hide niche tools behind opt-in flags or command-line parameters. For extremely cost-sensitive deployments, offer a Slim Mode with just 3-5 tools. The right number depends on your journey analysis: measure whether agents select tools correctly at each exposure level and adjust based on wrong-tool-selection rates.

Should I write different tool descriptions for different agent harnesses?

The Hablich framework recommends minimum viable descriptions that work across harnesses, but acknowledges that different models and harnesses may respond differently to description length and vocabulary. Monitor whether smaller models over-select tools with longer descriptions and trim accordingly. The key practice is iterative: write the shortest description that provides purpose and activation criteria, then adjust based on measured performance per harness.

How do I test whether my MCP tool descriptions are good enough?

Run your target agent against each user journey and track two metrics: correct tool selection rate and tokens per successful outcome. If the agent picks the wrong tool for a journey, the description for either the selected or intended tool needs improvement. A/B test description variants and compare both metrics across them. Also check for over-selection: if adding a rich description causes the agent to call that tool in unrelated contexts, the description is too broad.
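
For the A/B comparison, a small helper like the sketch below can compute the correct tool selection rate per description variant. The trial record fields (`variant`, `expected_tool`, `selected_tool`) are assumed for illustration. How trials are collected depends on your harness.

```python
from collections import Counter


def selection_rate(trials: list[dict]) -> dict[str, float]:
    """Fraction of trials per variant where the agent picked the expected tool."""
    totals, correct = Counter(), Counter()
    for t in trials:
        totals[t["variant"]] += 1
        if t["selected_tool"] == t["expected_tool"]:
            correct[t["variant"]] += 1
    return {variant: correct[variant] / totals[variant] for variant in totals}


trials = [
    {"variant": "A", "expected_tool": "profile_cpu", "selected_tool": "profile_cpu"},
    {"variant": "A", "expected_tool": "profile_cpu", "selected_tool": "detect_memory_leaks"},
    {"variant": "B", "expected_tool": "profile_cpu", "selected_tool": "profile_cpu"},
    {"variant": "B", "expected_tool": "profile_cpu", "selected_tool": "profile_cpu"},
]
print(selection_rate(trials))
```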