Frequently Asked Questions About Tejas Agent Harness Engineering Framework

22 answers covering everything from basics to advanced usage.

// Basics

What is the difference between an agent harness and an agent loop?

The agent loop is just the inner while-true cycle that sends prompts to the model and collects responses. The agent harness is everything surrounding that loop — guardrails, verify steps, deterministic handlers, context compression, the retry wrapper, and the tool registry. Treating the agent loop as the harness is a common mistake. The loop is one component inside the harness; the harness is the full deterministic environment anchoring the non-deterministic model.

What does 'don't prompt it harder' mean in agent harness engineering?

It means that when an AI agent fails, the instinct to rewrite or refine the system prompt is almost always wrong. In Tejas's original demonstration, the prompt was never changed — only harness components (guardrails, verify steps, deterministic handlers) were added, and those alone made the agent succeed. Prompt changes address the model's probabilistic interpretation; harness changes enforce deterministic behavior that the model cannot override.

Why should I never put API keys or passwords in the agent's prompt?

Credentials placed in the prompt or model context can be leaked through model outputs, logged in traces visible to third parties, or inadvertently included in generated text. The harness should own all secrets in environment variables or secure stores and inject them deterministically during handler execution. The model never sees the actual credentials — it only receives a notification that the harness completed the authenticated action on its behalf.

What is the 'token billionaires' concept and why does it matter for harness engineering?

Token billionaires is Tejas's term for engineers at companies like Anthropic or Google who have essentially unlimited model access and can brute-force problems with expensive frontier models. Harness engineering is designed for everyone else — teams that pay for compute and need to do more with less. By investing in the harness rather than the model, you achieve production reliability without frontier-model costs. The harness democratizes reliable AI agents for budget-constrained teams.

Does the harness approach work for non-browser agents like API-calling or data-processing agents?

Yes, the framework applies to any agentic workflow. For API-calling agents, deterministic handlers manage authentication headers and rate limiting. For data-processing agents, the verify step checks output format and content validity. The tool registry wraps API endpoints or data transforms. The principles are universal: guardrails prevent runaway behavior, handlers manage critical deterministic operations, and the verify step ensures correct outcomes. Browser use is just one application.

// How To

How do I set the right max_iterations guardrail for my agent?

Start with a number slightly above the minimum tool calls needed to complete the task. For a simple form submission, 5-8 iterations is reasonable. For complex multi-step workflows, you may need 15-25. Monitor your traces: if the agent consistently completes in 4 steps, set max_iterations to 6-8 to allow for edge cases without enabling runaway loops. Adjust based on real trace data, not guesswork. Too low causes premature kills; too high wastes tokens.

How do I write a good verify step for my agent harness?

Write it as a deterministic function that reads the trace and returns pass or fail. Check for positive evidence (the expected tool call was made, the expected page state was reached) and negative evidence (no login failures, no error redirects, no repeated failed actions). Use early returns for each known failure pattern. Never ask the model if it succeeded. Start simple — check for one or two conditions — and add cases as you discover new failure modes in your traces.

How do I build a context compressor for my agent harness?

Start with the naive approach: always preserve the system prompt, user prompt, and last two messages; discard everything in the middle. This ships fast and handles most context overflow issues. Later iterations can use summarization (ask the model to compress middle messages into a summary), semantic filtering (keep only messages relevant to the current sub-task), or sliding windows. Don't wait for a perfect compressor — the naive version is good enough to prevent context blowouts.

How do I add a login handler to my agent harness?

Create a function that runs every agent loop iteration before the trace is updated. It checks the current state (e.g., browser URL) against known login page patterns. When detected, it reads credentials from environment variables, fills the login form programmatically, and submits — all in deterministic code with no model involvement. After login succeeds, inject a message into the agent's context: 'Harness: Login completed. Proceed with your task.' The model never sees the actual credentials.

Should I build my own tool registry or use an existing SDK?

Use an existing SDK like the OpenAI tool-calling SDK rather than inventing the interface. The tool registry pattern is well-established: each tool has a name, description, parameters, and an execute function. Building your own adds complexity without value. The harness's strength comes from what surrounds the tool registry — guardrails, handlers, and the verify step — not from the registry implementation itself. Focus your engineering effort on the harness logic, not plumbing.

What's the minimum viable harness I should build before shipping?

At minimum, implement: a tool registry with execute functions, a max_iterations guardrail, a max_messages guardrail with naive context compression (keep system prompt + user prompt + last two messages), a verify step that checks the trace for at least one success condition and one failure pattern, and the run_harness retry wrapper. This takes a few hours to build and immediately transforms an unreliable agent into a predictable system. Add handlers and verify cases incrementally after shipping.

// Troubleshooting

My agent keeps looping on the same action — how do I fix this with a harness?

This is a classic failure the harness solves with two mechanisms. First, the max_iterations guardrail kills the run after N steps, preventing infinite loops. Second, inspect the trace to identify the loop pattern — the agent is likely stuck because it doesn't realize an action failed or a state changed. Add a deterministic handler that detects the repeated action in the trace and either resolves the underlying issue or injects a corrective message into the agent's context.

My agent passes the verify step but the task isn't actually complete — what went wrong?

Your verify step isn't checking enough conditions. Review the trace manually to find what the verify step missed — perhaps it checked for a button click but not for a confirmation page, or verified a form submission but not the response status. Add the missing condition as a new check in your verify function. The verify step should be iteratively strengthened every time you find a gap. Think of it as a test suite: each discovered failure mode becomes a new assertion.

The agent's context window fills up and performance degrades — how does the harness fix this?

The harness prevents this with the max_messages guardrail and context compressor. When the message count exceeds your threshold, the compressor activates and trims the history. The naive approach keeps the system prompt, user prompt, and last two messages while discarding the middle. This preserves the agent's instructions and recent state while preventing context overflow. Set max_messages based on your model's context window, leaving headroom for tool call responses.

How do I debug an agent harness that isn't working?

Read the trace. The trace is the accumulated history of all tool calls, messages, and events from the agent loop run. Walk through it step by step to find where the agent diverged from the expected path. Common issues: a handler not firing because its condition is too narrow, a verify step missing a failure pattern, or context compression removing critical information. Add logging at each harness component boundary. The trace is your primary debugging tool — invest in making it comprehensive.

// Comparisons

How does the Tejas Harness Framework compare to LangChain or CrewAI for building agents?

LangChain and CrewAI are agent orchestration frameworks that provide abstractions for chaining LLM calls and tool use. The Tejas Harness Framework is a design methodology that can be applied within or alongside those frameworks. It specifically focuses on wrapping any agent — built with any framework — in deterministic guardrails, verify steps, and handlers. You could use LangChain to build your agent loop and still apply harness engineering principles around it for reliability.

How is a harness different from just adding error handling to my agent code?

Error handling catches exceptions after they occur. A harness is a proactive, architectural approach that prevents entire categories of failure. It includes preemptive handlers that intercept known obstacles before the model encounters them, a verify step that validates outcomes deterministically, guardrails that enforce resource limits, and context management that prevents degradation. Error handling is one small part of what a harness does; the harness is a complete reliability layer.

Can I use the harness approach with open-source models like Llama or Mistral?

Absolutely — the framework is explicitly model-agnostic and designed to make cheap or small models reliable. Open-source models like Llama or Mistral are ideal candidates because the harness compensates for their weaknesses. You only need the model to support basic tool calling or structured output. The harness handles everything else: authentication, secret management, context limits, and result verification. This lets you run production agents without paying for frontier API access.

// Advanced

What is a 'dynamic on-the-fly harness' and is it possible today?

A dynamic on-the-fly harness is Tejas's predicted next evolution: an agent that autonomously generates its own harness before executing a task. It would identify where it might hallucinate or fail, create appropriate guardrails and verify steps, then execute the harnessed plan. This is not fully realized today but represents the direction of agentic AI — moving from manually coded harnesses to self-generated reliability layers. Current implementations still require human-authored harness components.

How do I make my agent harness reusable across different tasks?

The key is the run_harness / run_harness_attempt abstraction from Step 4 of the workflow. Extract your agent loop, guardrails, and retry logic into generic functions. Make the tool registry, verify step, and handlers pluggable — pass them as parameters or configuration. The core harness structure (loop → guardrails → verify → retry) stays the same; only the task-specific tools, handlers, and verify logic change. Your entry point should be ~20 lines: define the prompt, configure handlers, call run_harness.

How do I handle multiple failure modes in the same agent harness?

Stack multiple deterministic handlers, each targeting a specific failure pattern. Order them by priority — authentication handlers should fire before navigation handlers. In your verify step, check for each failure mode as a separate early-return condition. The harness iteratively grows as you discover new failure modes: run the agent, inspect the trace, identify the new failure, add a handler or verify case. This additive approach means the harness gets more robust with each iteration.

Can I nest harnesses or compose them for multi-step workflows?

Yes, the run_harness abstraction is designed to be composable. For multi-step workflows, each step can have its own harness with task-specific handlers and verify steps. An outer orchestrator calls each step's run_harness in sequence, passing context between them. If step 2 fails, the orchestrator can retry it independently without rerunning step 1. This composability is why extracting the harness into a clean abstraction in Step 4 of the workflow is critical.