Klingen Coding Agent Skill Architecture Method

Design, build, and iteratively improve a reusable coding-agent skill that reliably onboards users to a complex technical tool — replacing documentation overload with expert-guided, up-to-date, context-aware automation.

// TL;DR

The Klingen Coding Agent Skill Architecture Method is a structured framework for designing, building, and iteratively improving reusable coding-agent skills (like CLAUDE.md or .clinerules) that reliably guide AI coding agents through complex technical product integrations. Use it when your product has deep documentation, multiple integration patterns, or frequent API changes, and users struggle to get coding agents to set things up correctly. It replaces documentation overload with progressive, context-aware automation — combining agent sitemaps, style rules, search endpoints, LLM-as-judge evals, and auto-research loops to keep skills accurate and up to date.

// When should you use the Klingen Coding Agent Skill Architecture Method?

Use this skill when you need to create or improve a Claude/Cursor/Codex-style agent skill (CLAUDE.md, .clinerules, etc.) for a technical product with deep documentation, multiple integration patterns, or frequent interface changes. Especially relevant when your product is infrastructure-level or 'unopinionated' and users struggle to know which setup path is right for their application.

// What inputs do you need to build a coding agent skill with the Klingen method?

  • product_description (required)
    What the tool/SDK/platform does and its core integration surface (e.g., tracing, evals, prompt management).
  • documentation_scope (required)
    Approximate size and structure of existing docs (page count, feature areas, flexibility level).
  • user_entry_scenario (required)
    The typical user request that triggers the skill — e.g., 'add observability to my agent'.
  • known_failure_modes
    Ways the agent currently gets it wrong without the skill: hallucinated APIs, stale context, wrong setup path, etc.
  • target_use_case_for_auto_research
    A specific, bounded workflow you want to use auto-research to optimise (e.g., 'migrate prompts from local repo to managed prompt system').

// What core principles guide the Klingen Coding Agent Skill Architecture Method?

Skills as Rubik's Cube Manual

A skill is not intelligence — it is a manual. The agent already has the bash tool and can do anything; the skill gives it the step-by-step system so it solves the problem correctly instead of turning the cube randomly. Without the manual, capability is wasted.

Progressive Disclosure of Context

The skill should not front-load all documentation. Instead, it should reference product modules progressively, surfacing only the hints needed at each decision point. This mirrors how an expert guides a conversation rather than handing over a manual.

Reference Over Duplication

Dynamic content — docs, API references, changelogs — must be pointed to, not copied into the skill. Embedding content creates a local cache that goes out of date, producing the same pre-training-context staleness problem the skill was designed to solve.

Traces Get You 80% of the Way

Before engineering elaborate evaluations, instrument the agent and manually walk through traces. Reading what the agent actually did at runtime reveals unexpected use cases and broken paths more directly than any automated eval suite at this stage.

Production Signals Over Assumptions

Use runtime signals (search endpoint queries, trace data, execution logs) to discover what users are actually trying to do and where the skill fails — not what you assumed they would do during design.

Target Function Defines Everything

In auto-research loops, the target function is the ceiling of quality. If it omits a desired behaviour (e.g., linking prompt versions to traces), the optimiser will remove anything that nudges toward it as 'noise'. Define the target function with the same care you would define a product requirement.

Basic Eval Setup Is Better Than None

A minimal evaluation — natural-language assertions checked by an LLM-as-judge against filesystem or state diffs — unblocks iteration immediately. Waiting for a perfect eval framework stalls progress on a product with many valid use cases.

// How do you apply the Klingen method step by step?

  1. Audit the pre-skill failure state

    Run the user's natural-language request (e.g., 'add tracing to my agent') through the coding agent without any skill. Instrument the agent and capture the full execution trace. Document: (a) what it got wrong, (b) where it used stale pre-training context, (c) how many extra turns it needed to self-correct, (d) what it never discovered at all. This is your baseline.

  2. Identify the skill's two jobs

    Every skill must do exactly two things: (1) surface new use cases users didn't know they needed — discoverable only from production trace data; (2) keep existing skill paths accurate and efficient as the product evolves. Design the skill file to address both, not just the happy path.

  3. Build the Skill MD with style rules and an agent sitemap

    The skill file (CLAUDE.md or equivalent) has two parts: (a) style rules — how the agent should behave, e.g., 'ask clarifying questions before making decisions', 'fetch the help flag before assuming CLI parameters'; (b) an agent sitemap — a structured index of available documentation URLs so the agent goes there first instead of Googling and landing on stale third-party content. Do NOT embed the documentation itself.
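
As a rough illustration of the two-part layout, the sketch below renders a skill file from a list of style rules and a list of documentation references. The heading text, the `DocLink` type, and the `docs.example.com` URLs are hypothetical placeholders, not taken from any real product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DocLink:
    """One agent-sitemap entry: a topic and the URL where its docs live."""
    topic: str
    url: str  # hypothetical URL; the skill points at docs, it never copies them


def render_skill(style_rules: list[str], sitemap: list[DocLink]) -> str:
    """Render a skill file containing style rules and a sitemap of references."""
    lines = ["# Skill: example-product integration", "", "## Style rules"]
    lines += [f"- {rule}" for rule in style_rules]
    lines += ["", "## Agent sitemap (fetch these pages; do not rely on memory)"]
    lines += [f"- {link.topic}: {link.url}" for link in sitemap]
    return "\n".join(lines) + "\n"


STYLE_RULES = [
    "Ask clarifying questions before making decisions.",
    "Fetch the help flag before assuming CLI parameters.",
]
SITEMAP = [
    DocLink("Tracing setup", "https://docs.example.com/tracing"),
    DocLink("Evals", "https://docs.example.com/evals"),
]

skill_text = render_skill(STYLE_RULES, SITEMAP)
```

Writing `skill_text` to CLAUDE.md (or the equivalent file for your agent) gives a file that holds only behaviour rules and pointers, so nothing in it goes stale when the documentation changes.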

  4. Expose a search endpoint for documentation, not just static pages

    Replace or supplement page-by-page doc fetching with a natural-language search endpoint that returns relevant documentation chunks. This (a) reduces the turns needed to find the right information and (b) lets you track what problems users are actually running into at query time, which feeds skill improvement. Also advertise a markdown version of each page (e.g., a /md URL suffix, or content negotiation via an Accept: text/markdown request header) so agents do not parse HTML and waste tokens.
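
From the agent's side, the two access paths might look like the sketch below, which uses only the Python standard library. The search URL, the `q` query parameter, and the `{"chunks": [...]}` response shape are assumptions for illustration; substitute whatever your docs site actually exposes.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical endpoint; replace with your own docs search URL.
SEARCH_URL = "https://docs.example.com/api/search"


def build_search_request(question: str) -> urllib.request.Request:
    """Build a GET request that asks the docs search endpoint a question."""
    query = urllib.parse.urlencode({"q": question})
    return urllib.request.Request(
        f"{SEARCH_URL}?{query}",
        headers={"Accept": "application/json"},
    )


def build_markdown_request(page_url: str) -> urllib.request.Request:
    """Build a request for a docs page as markdown instead of HTML."""
    return urllib.request.Request(page_url, headers={"Accept": "text/markdown"})


def fetch_chunks(question: str) -> list[str]:
    """Send the search request; assumes a JSON body like {"chunks": [...]}."""
    request = build_search_request(question)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["chunks"]
```

Logging the `question` strings on the server side is what turns this endpoint into a production signal: each query records what a user's agent was stuck on.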

  5. Eliminate agent-unfriendly UX assumptions inherited from human-facing design

    Review every place your product made UX simplifications for humans (e.g., defaulting a data region to reduce friction, omitting environment variables). Agents don't experience friction the same way — adding an extra environment variable costs zero effort. Audit and reverse these assumptions in the skill: prompt the agent to ask clarifying questions humans would have found annoying.

  6. Set up a basic eval suite using LLM-as-judge on state diffs

    Create a sample repository folder representing a realistic user application (e.g., an OpenAI custom-function RAG app). Write 3–7 natural-language assertions that describe the expected post-execution state (e.g., 'OpenAI instrumentation was added', 'retrieval spans appear in trace'). Have an LLM-as-judge check these assertions against the filesystem and trace state before and after skill execution. Imperfect coverage is acceptable: a basic eval setup is better than none.
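
One way to wire this up is sketched below: snapshot the sample repository before and after the skill runs, diff the two snapshots, and hand the diff plus each assertion to a judge. The judge is passed in as a plain callable so the sketch stays independent of any LLM provider; the prompt wording and the PASS/FAIL convention are assumptions, not a prescribed format.

```python
import hashlib
from pathlib import Path
from typing import Callable

# A judge takes a prompt and returns the model's reply. Connect it to any
# LLM client; it is injected here so the sketch stays provider-agnostic.
Judge = Callable[[str], str]


def snapshot(repo: Path) -> dict[str, str]:
    """Map each file's relative path to a hash of its contents."""
    return {
        p.relative_to(repo).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(repo.rglob("*"))
        if p.is_file()
    }


def diff_state(before: dict[str, str], after: dict[str, str]) -> dict[str, list[str]]:
    """Summarise which files were added, removed, or modified."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "modified": sorted(k for k in shared if before[k] != after[k]),
    }


def judge_assertions(
    diff: dict[str, list[str]], assertions: list[str], judge: Judge
) -> dict[str, bool]:
    """Ask the judge, one assertion at a time, whether the diff supports it."""
    results = {}
    for assertion in assertions:
        prompt = (
            f"State diff after running the skill: {diff}\n"
            f"Assertion: {assertion}\n"
            "Answer PASS or FAIL."
        )
        results[assertion] = judge(prompt).strip().upper().startswith("PASS")
    return results
```

This version shows the judge only file names. In practice you would likely also include the changed file contents and exported trace data in the prompt, since assertions such as 'retrieval spans appear in trace' cannot be decided from file names alone.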

  7. Walk traces manually before automating

    Before running auto-research, sit with the traces yourself several times. Look for: agent wandering toward the goal instead of shooting straight; hallucinated method names or CLI parameters; missing retrieval spans; absence of domain-specific evals. Each observation becomes a concrete rule or reference addition in the skill file.

  8. Define the target function for auto-research with extreme precision

    Choose one bounded workflow for auto-research optimisation. Write the target function to include every behaviour you want to preserve, not just the primary success metric. Anti-pattern: optimising on 'number of turns' will cause the agent to remove documentation-fetching instructions — which destroys up-to-date context. Include checks for: correct instrumentation, presence of desired spans, no hallucinated APIs, appropriate clarifying questions asked.
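
The difference between a narrow and a complete target function can be made concrete with a small sketch. The `RunResult` fields below are hypothetical summaries you would fill in from traces and eval results; `fetched_current_docs` is an added assumption that encodes the documentation-fetching behaviour the anti-pattern would otherwise optimise away.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    """Hypothetical summary of one skill run, filled in from traces and evals."""
    turns: int
    instrumentation_correct: bool
    desired_spans_present: bool
    hallucinated_api_calls: int
    asked_clarifying_questions: bool
    fetched_current_docs: bool  # assumption: tracked so it cannot be optimised away


def narrow_target(run: RunResult) -> float:
    """Anti-pattern: rewards fewer turns and nothing else."""
    return 1.0 / max(run.turns, 1)


def full_target(run: RunResult) -> float:
    """Score 0 unless every required behaviour holds; then prefer fewer turns."""
    required = (
        run.instrumentation_correct,
        run.desired_spans_present,
        run.hallucinated_api_calls == 0,
        run.asked_clarifying_questions,
        run.fetched_current_docs,
    )
    if not all(required):
        return 0.0
    return 1.0 / max(run.turns, 1)


# A candidate that skips doc fetching finishes sooner but works from stale context.
careful = RunResult(8, True, True, 0, True, True)
shortcut = RunResult(5, True, True, 0, True, False)
```

Under `narrow_target` the shortcut run scores higher than the careful one, so an optimiser would keep it. Under `full_target` the shortcut scores zero, which is the outcome the step above asks for.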

  9. Run auto-research loop and human-review all suggestions

    Let the agent generate skill improvement candidates autonomously. Treat its suggestions as ideas, not decisions. Human-review every suggestion against the full intended behaviour — especially things the target function didn't capture. Expect to accept roughly 50% of suggestions; accepting all means the target function was too narrow. Keep an approval gate for any action that moves user data outside their local environment.
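
The review gate can be sketched as a filter in which the target-function score only nominates candidates and a human callback makes the final decision. The `Candidate` type and the `human_approves` callback are hypothetical names for illustration; how candidates are generated and scored is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    """A proposed skill edit plus the score the target function gave it."""
    description: str
    score: float


def review_candidates(
    candidates: Iterable[Candidate],
    baseline_score: float,
    human_approves: Callable[[Candidate], bool],
) -> list[Candidate]:
    """Keep only candidates that beat the baseline AND pass human review."""
    accepted = []
    for cand in sorted(candidates, key=lambda c: c.score, reverse=True):
        if cand.score <= baseline_score:
            continue  # no measured improvement; skip before asking a human
        if human_approves(cand):
            accepted.append(cand)
    return accepted
```

A high score never bypasses the reviewer here: a candidate such as "remove the doc-fetching step" may top the ranking and still be rejected because it drops behaviour the target function failed to capture.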

  10. Timestamp the skill and design for staleness detection

    Embed the fetch/creation date in the skill file. Instruct the agent: 'If this skill is older than N days, alert the user and suggest fetching a fresh version.' This is currently better than attempting auto-update, as skill distribution and auto-upgrade pipelines are immature across coding agent environments.
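
The instruction itself lives in the skill file as prose, but the same check can be expressed in code, for example in an installer or wrapper script. The sketch below assumes a `skill-fetched: YYYY-MM-DD` line in the file and a 30-day limit; both the line format and the value of N are illustrative choices.

```python
import re
from datetime import date
from typing import Optional

MAX_AGE_DAYS = 30  # hypothetical value for N

# Assumed stamp format: a line such as "skill-fetched: 2025-01-01".
_STAMP = re.compile(r"^skill-fetched:\s*(\d{4}-\d{2}-\d{2})\s*$", re.MULTILINE)


def skill_age_days(skill_text: str, today: date) -> Optional[int]:
    """Return the skill's age in days, or None if no date stamp is present."""
    match = _STAMP.search(skill_text)
    if match is None:
        return None
    return (today - date.fromisoformat(match.group(1))).days


def staleness_warning(skill_text: str, today: date) -> Optional[str]:
    """Return a message for the user if the skill is stale or undated."""
    age = skill_age_days(skill_text, today)
    if age is None:
        return "No fetch date found; consider fetching a fresh version of this skill."
    if age > MAX_AGE_DAYS:
        return f"This skill is {age} days old; consider fetching a fresh version."
    return None
```

Passing `today` in explicitly keeps the check easy to test; a caller would normally supply `date.today()`.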

// What are real-world examples of the Klingen method in action?

A developer instrumentation SDK with 400+ documentation pages across five feature areas, flexible enough to support chat, voice, batch processing, and RAG — but users consistently set it up wrong when asking a coding agent to 'add observability to my project'.

Build a skill that: (1) opens with clarifying questions about the user's application type; (2) provides an agent sitemap pointing to the relevant feature-area docs; (3) exposes a search endpoint so the agent asks questions in natural language rather than fetching five separate pages; (4) includes style rules like 'fetch the CLI help flag before assuming parameters exist'; (5) runs an LLM-as-judge eval checking that the correct instrumentation spans appear in traces post-setup.

A platform team wants to automate migration of hardcoded prompts from engineers' local git repositories into a centralised prompt management system, but the migration workflow has many variants depending on the codebase structure.

Define a tight target function that includes: prompts successfully moved, prompt versions linked to production traces, and an approval gate before any data leaves the user's machine. Run auto-research to generate skill improvement candidates. Reject any suggestions that remove documentation-fetching steps (even if they reduce turn count) because they trade short-term efficiency for long-term staleness. Accept candidates that improve clarifying-question quality or add missing span assertions to the eval suite.

// What mistakes should you avoid when building coding agent skills?

  • Embedding documentation content directly into the skill file — this creates a duplicate that goes out of date, reproducing the same pre-training-context staleness the skill was meant to fix.
  • Optimising the auto-research target function on a proxy metric like 'number of turns' — the agent will strip out documentation-fetching instructions, destroying the up-to-date context guarantee.
  • Inheriting human-UX simplifications in the skill (e.g., defaulting environment variables, skipping clarifying questions) — agents don't experience these shortcuts as helpful; they cause incorrect setups.
  • Skipping manual trace review before building evals — automated metrics miss the qualitative 'wandering vs. straight-shooting' signal that trace reading gives you cheaply.
  • Waiting for a perfect evaluation setup before iterating — a basic LLM-as-judge on state diffs unblocks improvement immediately; complexity can be added later.
  • Letting the agent self-correct from stale pre-training context without a skill — it will first implement incorrectly, discover the error, then fetch documentation, adding unnecessary turns and potentially shipping wrong instrumentation.
  • Assuming all users have the same application type — a skill without clarifying questions will recommend evals and instrumentation patterns misaligned with the user's actual use case (chat vs. batch vs. voice vs. RAG).
  • Relying on a plugin marketplace or proprietary integration layer for skill distribution — this creates maintenance overhead across multiple agent environments and is a distraction for small teams.

// What are the key terms in the Klingen Coding Agent Skill Architecture Method?

Skill
A formalised, reusable instruction set installed into a coding agent environment that gives the agent a reliable methodology for a specific task — analogous to a Rubik's Cube manual: the agent already has all the moves, the skill tells it in what order to apply them.
Agent Sitemap
A structured index of available documentation URLs exposed to the coding agent via the skill file, so the agent navigates to the right docs first rather than searching the open web and landing on stale or irrelevant content.
Skill MD
The skill file itself (e.g., CLAUDE.md or .clinerules), containing two components: style rules governing how the agent should behave, and references (not copies) to the documentation modules it should consult.
Target Function
The precise definition of success used during auto-research to evaluate and select skill improvement candidates. Everything not in the target function will be optimised away, so it must include all desired behaviours, not just primary metrics.
Auto-Research
A loop in which an agent autonomously generates skill improvement candidates against a target function, which a human then reviews and selectively accepts — enabling faster exploration of the skill design space than manual iteration alone.
LLM-as-Judge
An evaluation pattern where a language model assesses natural-language assertions about the state of the filesystem, traces, or application after a skill execution — used to build a basic eval suite without a formal testing framework.
Progressive Disclosure
The skill design principle of revealing only the documentation references and hints the agent needs at each decision point, rather than front-loading all possible context.
Traces
Runtime execution records of an agent's actions — LLM calls, tool uses, span data — that reveal what the agent actually did, including unexpected paths, hallucinations, and missing instrumentation.
Production Signals
Runtime data — especially search endpoint queries and trace logs — that reveal what users are actually trying to do with the skill in production, as opposed to what was assumed during design.
Unopinionated Infrastructure
A product philosophy where the tool provides reliable, flexible primitives (e.g., tracing that works at billions of events) rather than prescribing end-to-end workflows — making agent-driven customisation via skills the natural completion layer.

// FREQUENTLY ASKED QUESTIONS

What is the Klingen Coding Agent Skill Architecture Method?

It is a structured framework for designing reusable skill files (like CLAUDE.md) that guide AI coding agents through complex technical product integrations reliably. Instead of dumping documentation on the agent, you create progressive instruction sets with agent sitemaps, style rules, and search endpoints — then iteratively improve them using trace analysis, LLM-as-judge evals, and auto-research loops. It was introduced by Marc Klingen based on experience building coding agent skills for Langfuse.

What is a coding agent skill file and how is it different from documentation?

A coding agent skill file is a formalised instruction set installed into a coding agent's environment — like a CLAUDE.md or .clinerules file — that tells the agent how to solve a specific task step by step. Unlike documentation, it references docs progressively rather than embedding them, includes style rules for agent behaviour, and provides an agent sitemap so the agent navigates to the right sources instead of hallucinating or using stale pre-training context.

How do I build a coding agent skill from scratch?

Start by running your target user request through the agent without any skill and capturing the full execution trace to document failures. Then build a Skill MD with two parts: style rules (e.g., 'ask clarifying questions before choosing a setup path') and an agent sitemap pointing to relevant documentation URLs. Expose a search endpoint for docs, set up a basic LLM-as-judge eval suite, walk traces manually, and iterate using auto-research loops with human review.

How do I improve an existing coding agent skill?

Review production traces to find where the agent wanders, hallucinates APIs, or misses instrumentation steps. Add rules or sitemap references to address each failure. Then define a precise target function for auto-research — including all desired behaviours, not just primary success — and run an auto-research loop to generate improvement candidates. Human-review every suggestion; expect to accept roughly 50%. Update the skill's timestamp and staleness detection accordingly.

How does the Klingen method compare to just giving the agent all the docs?

Giving the agent all the docs front-loads context, wastes tokens, and embeds content that goes stale. The Klingen method uses progressive disclosure — surfacing only the hints needed at each decision point — and references documentation via URLs and search endpoints rather than copying it. This keeps the agent's context current, reduces hallucination, and mirrors how an expert would guide a conversation rather than handing over a 400-page manual.

When should I use the Klingen Coding Agent Skill Architecture Method?

Use it when you need to create or improve a coding-agent skill for a technical product with deep documentation, multiple integration patterns, or frequent interface changes. It is especially relevant for infrastructure-level or 'unopinionated' products where users struggle to choose the right setup path. If your coding agents currently hallucinate APIs, use stale context, or take excessive turns to self-correct, this method directly addresses those failures.

What results can I expect after applying the Klingen method to my agent skill?

You can expect coding agents to reach correct integrations in fewer turns, ask appropriate clarifying questions instead of guessing, use up-to-date documentation instead of stale pre-training context, and avoid hallucinated APIs or CLI parameters. Production signals from search endpoints will reveal real user needs, enabling continuous improvement. Compare traces against your pre-skill baseline after the first iteration cycle: reduced agent wandering and fewer incorrect instrumentation setups are the changes to look for.

What is an agent sitemap in the Klingen method?

An agent sitemap is a structured index of available documentation URLs embedded in the skill file. It ensures the coding agent navigates to the correct, current docs first instead of searching the open web and landing on stale or irrelevant third-party content. It does not contain the documentation itself — only references — so it stays lightweight and avoids the staleness problem that comes from duplicating content.

What is a target function in auto-research for agent skills?

The target function is the precise definition of success used to evaluate skill improvement candidates during auto-research. It must include every behaviour you want to preserve — not just the primary success metric. If the target function omits a desired behaviour like linking prompt versions to traces, the optimiser will remove anything supporting it. Defining it carelessly leads to agents stripping out documentation-fetching steps, destroying the up-to-date context guarantee.

How does LLM-as-judge work for evaluating coding agent skills?

LLM-as-judge is an evaluation pattern where a language model assesses natural-language assertions about the state of the filesystem, traces, or application after a skill execution. You write 3–7 assertion statements like 'OpenAI instrumentation was added' or 'retrieval spans appear in trace,' then an LLM compares pre- and post-execution state to judge whether each assertion holds. This unblocks iteration immediately without needing a formal testing framework.
