How Do I Measure and Improve Agent Feature Performance?

For AI product managers evaluating and improving agent-powered features · Based on Hablich Agent Interface Engineering Framework

// TL;DR

For AI product managers overseeing agent-powered features, the Hablich Agent Interface Engineering Framework provides a measurement and prioritisation system for agent interface quality. The key metric is tokens per successful outcome per user journey — measuring both whether agents complete tasks (effectiveness) and how efficiently they do it (fuel efficiency). Use this framework to identify which agent journeys are failing or wasteful, prioritise engineering investment, and avoid common mistakes like comparing metrics across different journey types or optimising for efficiency at the expense of task completion.

How Do I Know If My Agent-Powered Feature Is Actually Working Well?

Measure two things for every user journey: effectiveness (did the agent complete the task?) and fuel efficiency (how many tokens did it cost?). The Hablich framework's core metric is tokens per successful outcome — token cost normalised only against successful completions.

Effectiveness is binary: the agent either completed the full user journey or it didn't. An interface that's cheap but doesn't finish tasks is worthless. An interface that finishes tasks but burns 10x more tokens than necessary is a cost problem. You need both metrics, measured per journey type.
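As a minimal sketch of how both metrics could be computed from run logs, the example below assumes each run records a journey type, a token count, and a binary success flag. The `Run` record and `journey_metrics` function are hypothetical names for illustration. It also assumes that tokens spent on failed runs count towards the cost of the successes, which is one reasonable reading of "normalised only against successful completions".

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Run:
    journey: str   # journey type, e.g. "debugging" (hypothetical label)
    tokens: int    # total tokens consumed by this run
    success: bool  # True only if the full user journey was completed


def journey_metrics(runs):
    """Return {journey: (effectiveness, tokens_per_successful_outcome)}."""
    totals = defaultdict(lambda: {"runs": 0, "successes": 0, "tokens": 0})
    for run in runs:
        t = totals[run.journey]
        t["runs"] += 1
        t["successes"] += int(run.success)
        t["tokens"] += run.tokens  # assumption: failed runs still count as cost

    metrics = {}
    for journey, t in totals.items():
        effectiveness = t["successes"] / t["runs"]
        # Undefined when nothing succeeded: report None instead of dividing by zero.
        tokens_per_success = t["tokens"] / t["successes"] if t["successes"] else None
        metrics[journey] = (effectiveness, tokens_per_success)
    return metrics


# Illustrative data only, not measurements.
runs = [
    Run("debugging", 40_000, True),
    Run("debugging", 20_000, False),
    Run("scraping", 5_000, True),
    Run("scraping", 5_000, True),
]
metrics = journey_metrics(runs)
```

With this data, the debugging journey is 50% effective at 60,000 tokens per successful outcome, because the failed run's 20,000 tokens are charged to the single success. The scraping journey is 100% effective at 5,000 tokens.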

Visualise this as a per-journey bar chart where bar length represents effectiveness and bar position (or colour) represents token cost. The chart immediately reveals which journeys need attention: low effectiveness means the agent can't complete the task, while high token cost with high effectiveness points to an efficiency opportunity.

Why Shouldn't I Compare Agent Performance Across Different Task Types?

Because different user journeys have inherently different complexity. A multi-step debugging session involving performance profiling, DOM inspection, and CSS modification will always consume more tokens than a simple page-scraping task. Comparing them globally would suggest debugging is "worse" when it's simply more complex.

The Hablich framework requires measuring tokens per successful outcome within each journey type. Track trends over time within each category. A debugging journey that cost 50,000 tokens last month and costs 35,000 this month after interface improvements shows clear progress. Comparing that 35,000 to a 5,000-token scraping journey tells you nothing actionable.
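The within-journey trend can be sketched as a percentage change per journey type. The debugging figures below come from the example above. The `history` structure, the `percent_change` helper, and the flat scraping figure for last month are illustrative assumptions.

```python
def percent_change(previous, current):
    """Relative change from previous to current, as a percentage."""
    return (current - previous) * 100 / previous


# Tokens per successful outcome by journey type and month (illustrative values).
history = {
    "debugging": {"last_month": 50_000, "this_month": 35_000},
    "scraping": {"last_month": 5_000, "this_month": 5_000},
}

# Each journey is compared only against its own earlier baseline.
trend = {
    journey: percent_change(months["last_month"], months["this_month"])
    for journey, months in history.items()
}
```

Here the debugging journey shows a 30% reduction against its own baseline, which is the actionable signal. The code never compares the debugging figure with the scraping figure.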

This per-journey measurement also helps you prioritise: focus engineering effort on the journey types where tokens per successful outcome is highest relative to the value the journey delivers to users.

What Trade-offs Should I Expect When the Engineering Team Optimises Agent Interfaces?

Every optimisation shifts a trade-off — it doesn't eliminate it. The Hablich framework's fifth principle states this explicitly: there is no free lunch in agent interface design.

Common trade-offs you'll encounter as a PM:

- Adding richer tool descriptions improves discoverability but increases context window consumption and can cause smaller models to over-select tools.

- Slim Mode (reducing tools to a minimal set) cuts token costs dramatically but may block certain tasks or force extra agent turns.

- Adding skills (pre-built multi-step workflows) improves reliability for complex journeys but inflates context and can cause inappropriate skill invocation.

- Semantic summaries replace raw data and reduce context load, but may occasionally omit details the agent needs for edge-case tasks.

Your role is to ensure the engineering team names each trade-off explicitly, decides consciously rather than assuming a fix is free, and measures the impact on both effectiveness and efficiency.
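One way to make "measure the impact on both" concrete is a simple before-and-after check that refuses to count a token saving as a win if effectiveness dropped. This is a sketch under assumptions: the `evaluate_change` function, its verdict strings, and the `max_effectiveness_drop` tolerance are hypothetical and not part of the framework.

```python
def evaluate_change(before, after, max_effectiveness_drop=0.0):
    """Compare (effectiveness, tokens_per_success) pairs for one journey type.

    `max_effectiveness_drop` is a hypothetical tolerance the team agrees on
    when it consciously accepts a trade-off.
    """
    eff_before, tokens_before = before
    eff_after, tokens_after = after
    if eff_after < eff_before - max_effectiveness_drop:
        return "reject: effectiveness regressed"
    if tokens_after < tokens_before:
        return "accept: cheaper without unacceptable effectiveness loss"
    return "review: no efficiency gain"


# Illustrative values: a change that saves tokens but breaks tasks.
slim_verdict = evaluate_change(before=(0.90, 50_000), after=(0.70, 20_000))

# Illustrative values: a change that saves tokens and keeps completion rate.
summary_verdict = evaluate_change(before=(0.90, 50_000), after=(0.90, 35_000))
```

The first change is rejected despite a large token saving. The second is accepted because effectiveness held steady while cost fell.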

How Do I Prioritise Which Agent Journeys to Fix First?

Rank journeys by expected failure impact: usage frequency multiplied by failure rate, weighted by what each failure costs the user or the business. A journey used 1,000 times per day with a 40% failure rate produces about 400 failures per day, so it is a higher priority than a journey used 10 times per day with a 90% failure rate, which produces about 9. Layer in token cost: a frequently-used journey that succeeds but costs 5x more tokens than it should is a cost optimisation opportunity, not a reliability emergency.
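The ranking can be sketched with the two journeys from the example. The journey names and the `cost_per_failure` weight are hypothetical. The weight defaults to 1.0 so that the score reduces to failures per day.

```python
def failure_impact(uses_per_day, failure_rate, cost_per_failure=1.0):
    """Expected daily failure impact: frequency x failure rate x cost weight."""
    return uses_per_day * failure_rate * cost_per_failure


journeys = [
    # (name, uses per day, failure rate)
    ("low-traffic journey", 10, 0.90),
    ("high-traffic journey", 1_000, 0.40),
]

# Highest expected impact first.
ranked = sorted(
    journeys,
    key=lambda j: failure_impact(j[1], j[2]),
    reverse=True,
)
```

The high-traffic journey ranks first with roughly 400 failures per day against roughly 9, even though its failure rate is lower.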

Use the Hablich framework's diagnostic approach: Is the journey failing because agents get the wrong tool (description quality issue)? Because they're overwhelmed with data (dump zone issue)? Because errors aren't self-healing? Each root cause maps to a specific framework step and engineering action.

Next step: Instrument tokens per successful outcome for your top 5 agent user journeys this sprint. Identify the journey with the worst ratio and apply the Hablich framework's diagnostic questions to determine whether the root cause is description quality, data volume, tool selection, or error handling.

// FREQUENTLY ASKED QUESTIONS

What's a good benchmark for tokens per successful outcome?

There is no universal benchmark because token cost varies enormously by journey complexity, model, and interface design. The right approach is to establish your own baselines per journey type, then improve against them. A 20-30% reduction in tokens per successful outcome after applying semantic summaries and tool description auditing is a common initial improvement. Track trends within journey types over time rather than targeting an absolute number.

How do I justify engineering investment in agent interface improvements to stakeholders?

Frame it in two metrics: task completion rate (effectiveness) and cost per completed task (tokens per successful outcome). Show that current interface issues cause measurable task failures and token waste. Calculate the monthly token cost savings from efficiency improvements and the revenue or user-satisfaction impact of higher task completion rates. The Hablich framework's per-journey measurement gives you concrete, defensible data for each improvement.
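A back-of-the-envelope savings calculation might look like the sketch below. Every input is a placeholder: the monthly volume, the before and after token figures, and the price per million tokens are assumptions to be replaced with your own measurements and your provider's actual pricing.

```python
def monthly_token_savings(successes_per_month, tokens_before, tokens_after,
                          price_per_million_tokens):
    """Estimated monthly saving, in the same currency as the price."""
    tokens_saved = successes_per_month * (tokens_before - tokens_after)
    return tokens_saved * price_per_million_tokens / 1_000_000


# Placeholder inputs: 30,000 successful outcomes per month, tokens per
# successful outcome falling from 50,000 to 35,000, at a hypothetical
# price of 3.00 per million tokens.
savings = monthly_token_savings(30_000, 50_000, 35_000, 3.00)
```

With these placeholder inputs the estimate is 1,350 per month. The point of the sketch is the structure of the argument (volume times tokens saved times price), not the specific figure.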

Should I build a Slim Mode for my agent feature?

Build Slim Mode if you have cost-sensitive deployments, context-constrained models, or user journeys where only a small subset of tools is needed. Slim Mode typically exposes 3-5 core tools and can reduce token costs by 40-60%. However, explicitly document and accept the capability trade-offs: some tasks may require extra turns or become impossible in Slim Mode. Let users or deployment configs choose between full and slim mode based on their priorities.
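A configuration-driven mode switch could be sketched as follows. The tool names, the `tools_for` function, and the `unavailable_in_slim` helper are all hypothetical. A real feature would use its own tool registry and configuration system.

```python
# Hypothetical tool names for illustration only.
FULL_TOOLS = [
    "navigate", "click", "read_page", "inspect_dom",
    "edit_css", "profile_performance", "take_screenshot", "run_script",
]
SLIM_TOOLS = ["navigate", "click", "read_page"]  # a minimal core set


def tools_for(mode):
    """Return the tool list to expose to the agent for a deployment mode."""
    if mode == "slim":
        return list(SLIM_TOOLS)
    if mode == "full":
        return list(FULL_TOOLS)
    raise ValueError(f"unknown mode: {mode!r}")


def unavailable_in_slim():
    """List the capabilities given up in Slim Mode, for the trade-off doc."""
    return [tool for tool in FULL_TOOLS if tool not in SLIM_TOOLS]


exposed = tools_for("slim")
dropped = unavailable_in_slim()
```

Generating the list of dropped tools from the same source as the mode definitions keeps the documented trade-off in step with what Slim Mode actually exposes.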