Most of what I know about skill architecture I learned by getting it wrong first. Here's what lives in the gap between simple patterns and autonomous multi-hour workflows.
I've been building a Claude Code skill called auto-dev for the past couple weeks. You give it a Linear ticket, and it analyzes the codebase, writes an implementation plan, codes the feature in isolated subagents, runs verification at four different levels, and opens a pull request. It handles batch processing with parallel git worktrees. It has circuit breakers and a self-correction system that tracks every failed approach so it doesn't try the same thing twice.

What I want to talk about is the process of building it, because most of what I know about skill architecture I learned by getting it wrong first, and I think the lessons transfer to anyone building skills that are more ambitious than a commit message formatter.
Anthropic recently published The Complete Guide to Building Skills for Claude, and it's a solid resource for getting started. It covers the fundamentals, five common patterns, testing strategies, distribution. But there's a gap between the patterns in that guide and the reality of building something that runs autonomously for hours across multiple phases. This post is about what lives in that gap.
When you build a skill, it's just a folder with a SKILL.md file inside it. Markdown with YAML frontmatter at the top. There's optionally a references/ directory for supplementary docs, a scripts/ folder for code, and assets/ for templates. That's the whole structure.
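A typical layout looks something like this. The directory names come straight from the structure just described; the skill name and comments are my own illustration:

```
my-skill/
├── SKILL.md        # YAML frontmatter + markdown instructions
├── references/     # supplementary docs, read on demand
├── scripts/        # executable helpers
└── assets/         # templates
```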
What took me way too long to internalize is how the loading actually works. It's a three-level progressive disclosure system, and understanding it changed how I designed everything.
[Diagram: the three loading levels. Each level loads more context; wider boxes mean more tokens in the context window.]
The first level is the YAML frontmatter, the name and description fields at the top of your SKILL.md. This gets loaded into Claude's system prompt for every single conversation, whether your skill is relevant or not. It's how Claude decides whether to load the skill right now. The second level is the body of SKILL.md itself, which only gets loaded when Claude thinks the skill matches the current task. The third level is the reference files, which Claude discovers and reads on demand as it works through the instructions.
This matters because your frontmatter is competing for attention with every other skill the user has installed. The Anthropic guide recommends structuring it as what the skill does, when to use it, and key capabilities. The when-to-use-it part is doing about 80% of the work. My early versions had a description that triggered on any mention of "Linear" or "ticket", which meant the full skill loaded when someone just wanted to check a ticket status. Adding specific trigger phrases fixed it, but I should have been more deliberate about this from the start.
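To make that concrete, here is a sketch of frontmatter written that way. The structure (the name and description fields) is real; the wording and trigger phrases are illustrative, not auto-dev's actual frontmatter:

```yaml
name: auto-dev
description: >
  Implements a Linear ticket end to end: codebase analysis, implementation
  plan, coding in isolated subagents, four-level verification, and a pull
  request. Use when the user asks to "implement ticket TEAM-123", "run
  auto-dev on this ticket", or "auto-dev the backlog". Do not load for
  checking ticket status or other read-only Linear questions.
```

Note the explicit negative trigger at the end; telling Claude when not to load is what stopped the skill from firing on casual ticket mentions.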
Auto-dev's main SKILL.md is around 5,000 words, which is already pushing against what Claude can reliably follow. But the actual system is closer to 25,000 words of specification. Phase-by-phase execution commands, verification gate specs, self-correction logic, state file schemas, PR templates, safety limits, concurrent session management. There's a lot.
If I loaded all of that into context at once, Claude would start ignoring things. Instructions that are too verbose get skipped. Bullet points and numbered lists get followed more reliably than prose paragraphs. Critical instructions need to be at the top, not buried. And detailed reference material needs to live in separate files.
The main SKILL.md tells Claude what to do at each phase. The reference files tell it how. When Claude enters Phase 4 (Implementation), it reads the Phase 4 section from workflow-phases.md. When a verification gate fails, it reads self-correction.md. This keeps the working context small while giving Claude access to the full spec when it needs it.
I think this is the single most important architectural decision for any skill over about 2,000 words. If your instructions are getting long, don't try to make them shorter by cutting detail. Move the detail to reference files and point to them from the main SKILL.md.
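That per-phase lookup is easy to sketch. Assuming workflow-phases.md uses one "## Phase N" heading per section (the real file's layout may differ), pulling out just the current phase's instructions looks like this:

```python
import re

def load_phase_section(markdown_text: str, phase: int) -> str:
    """Return only the requested phase's section from a workflow-phases.md-style file.

    Assumes sections start with headings like '## Phase 4: Implementation';
    the actual file layout may differ.
    """
    pattern = rf"(^## Phase {phase}\b.*?)(?=^## Phase \d|\Z)"
    match = re.search(pattern, markdown_text, re.MULTILINE | re.DOTALL)
    if match is None:
        raise ValueError(f"Phase {phase} not found")
    return match.group(1).strip()

phases_md = """## Phase 3: Planning
Write the plan.
## Phase 4: Implementation
Code each unit in a subagent.
"""
print(load_phase_section(phases_md, 4))  # prints only the Phase 4 section
```

The point is not the regex; it's that only one phase's worth of detail ever enters the working context at a time.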
Here's the thing that caught me completely off guard. I had the progressive disclosure working. The reference files were clean. The instructions were clear and well-structured. And the skill still broke in later phases.
The issue is context compaction. When Claude's context window fills up during a long session, it compresses older messages to make room for new ones. That compression is lossy. Instructions that were loaded at the start of the session can be partially or completely gone by the time Claude reaches Phase 5 of an eight-phase workflow.
[Chart: how much of your original instructions Claude remembers at each phase.]
So Claude would execute Phases 0 through 3 perfectly, then start skipping steps in Phase 5. Not because the instructions were bad, but because they'd been compressed out of the context window. The skill was literally forgetting its own rules.
I solved this with what I think of as anti-compaction patterns, and if you're building anything that runs for more than about fifteen minutes, you probably need something similar.
In short:

1. Re-read workflow-phases.md at the start of every phase.
2. Write a markdown summary to disk at the end of every phase.
3. Keep everything in .dev-state/: disk is the source of truth.
The first pattern is phase instruction re-reads. At the start of every phase, auto-dev uses Glob to find the workflow-phases.md file and reads the section for the current phase. Every time. Even if it just read it two phases ago. This is explicitly a defense against compaction. It reloads the detailed instructions that might have been compressed away. This single change eliminated most of the problems I was seeing with skipped steps in later phases.
The second pattern is phase-boundary checkpoints. At the end of every phase, auto-dev writes a markdown summary to disk capturing what was accomplished, key decisions made, files affected, verification results, and what the next phase should focus on. At the start of each phase, it reads the previous phase's summary. This creates a paper trail that survives compaction. Even if Claude has lost all memory of Phase 2 by the time it reaches Phase 5, the summary file restores the context.
The third pattern is externalizing state entirely. Everything lives in a .dev-state/ directory. The workflow state, implementation plans, unit results, verification outcomes, attempt histories. Claude's working memory is treated as volatile. The disk is the source of truth.
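A minimal sketch of the checkpoint mechanics. The .dev-state/ directory name matches the post; auto-dev writes markdown summaries, but this sketch uses JSON with invented field names for brevity:

```python
import json
from pathlib import Path

# Illustrative: the real skill writes markdown summaries with more fields.
STATE_DIR = Path(".dev-state")

def write_checkpoint(phase: int, summary: dict) -> None:
    """Persist a phase-boundary summary so later phases can recover it."""
    STATE_DIR.mkdir(exist_ok=True)
    path = STATE_DIR / f"phase-{phase}-summary.json"
    path.write_text(json.dumps(summary, indent=2))

def read_checkpoint(phase: int) -> dict:
    """Reload an earlier phase's summary; disk, not memory, is the source of truth."""
    return json.loads((STATE_DIR / f"phase-{phase}-summary.json").read_text())

write_checkpoint(2, {"accomplished": "analysis brief", "nextFocus": "plan the units"})
print(read_checkpoint(2)["nextFocus"])  # prints "plan the units"
```

Because every phase starts by reading the previous checkpoint from disk, the workflow can resume correctly even after the conversation history has been compressed away.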
If I could condense everything I learned into one sentence, it would be: design for context loss from the beginning. Assume that whatever Claude knows at the start of your workflow will be gone halfway through. Write important things to disk. Re-read your own instructions before executing them. Your skill should be able to reconstruct its complete state from files alone, at any point, without relying on what Claude remembers from earlier in the conversation.
Auto-dev has four verification levels:

- L0, after every file edit: format, lint, typecheck.
- L1, after each implementation unit: typecheck plus unit tests.
- L2, after all implementation: full test suite.
- L3, before opening the PR: scope check plus a seventeen-point code review.
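As a sketch, the gates can be expressed as an ordered command list per level. The level semantics match the levels above; the specific npm commands and script names are hypothetical, assuming a TypeScript project:

```python
# Hypothetical gate definitions; auto-dev's real commands depend on the
# repository's tooling.
GATES: dict[str, list[str]] = {
    "L0": ["npm run format", "npm run lint", "npm run typecheck"],
    "L1": ["npm run typecheck", "npm test -- --changed"],
    "L2": ["npm test"],
    "L3": ["scripts/scope-check.sh", "scripts/code-review.sh"],
}

def commands_for(level: str) -> list[str]:
    """Return the commands a verification level must pass, in order."""
    if level not in GATES:
        raise KeyError(f"unknown verification level: {level}")
    return GATES[level]

print(commands_for("L0"))  # the three L0 commands, in order
```

Keeping the gates declarative like this makes it cheap to tell Claude "run the L1 gate" without restating the commands in every phase.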
The verification levels themselves were straightforward to implement. The hard part was teaching Claude when to give up.
Without explicit limits, Claude will retry a failing test in a loop. It'll try the same fix with slightly different syntax. It'll start modifying things that weren't broken, introducing new problems while chasing the original one. I watched this happen over and over during development. Claude would get stuck on a type error, try the same approach from three different angles, then start modifying unrelated files trying to make the types line up. Twenty minutes later the codebase was worse than when it started.
The escalation ladder, in order:

1. Direct targeted fix.
2. Fundamentally different approach.
3. Stop, re-read requirements, reconsider design.
4. Mark unit as stuck.
5. Circuit breaker: notify human.
The self-correction protocol I ended up with works on a simple escalation model. First failure: direct targeted fix. Second failure on the same issue: fundamentally different approach. Third failure: stop, re-read the requirements, reconsider the design. If the same error signature (a combination of file path, line number, and error code) appears three times, stop entirely and mark the unit as stuck. If Claude starts oscillating, trying approach A, then B, then A again, stop immediately. After ten total failures on a single ticket, trigger a circuit breaker and notify the human.
The error signature tracking is what makes this work in practice. Every failure gets catalogued with a signature like src/login.tsx:42:TS2345. Before attempting a fix, Claude checks whether it's seen that signature before and what it tried last time. This prevents the loop where it tries the same fix with different variable names and wastes twenty minutes going nowhere.
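Here is roughly what that bookkeeping looks like as code. The thresholds come from the protocol above; the class and method names are mine, and the oscillation check is simplified to recurring signatures rather than recurring approaches:

```python
from collections import Counter

class AttemptTracker:
    """Track failure signatures (path:line:code) across fix attempts.

    Illustrative sketch: thresholds mirror the post's protocol, but the
    names and the simplified oscillation check are my own.
    """

    def __init__(self, repeat_limit: int = 3, total_limit: int = 10):
        self.signatures = Counter()   # how often each signature has appeared
        self.history: list[str] = []  # signatures in the order they occurred
        self.repeat_limit = repeat_limit
        self.total_limit = total_limit

    def record(self, signature: str) -> str:
        """Log a failure and decide what to do next."""
        self.signatures[signature] += 1
        self.history.append(signature)
        if self.signatures[signature] >= self.repeat_limit:
            return "stuck"            # same error three times: mark the unit stuck
        if sum(self.signatures.values()) >= self.total_limit:
            return "circuit-breaker"  # ten total failures: notify the human
        if len(self.history) >= 3 and self.history[-1] == self.history[-3]:
            return "oscillating"      # A, B, A: we're cycling, stop immediately
        return "retry"

tracker = AttemptTracker()
tracker.record("src/login.tsx:42:TS2345")         # first failure: retry
tracker.record("src/login.tsx:42:TS2345")         # second: retry
print(tracker.record("src/login.tsx:42:TS2345"))  # third: prints "stuck"
```

Before each fix attempt, Claude consults this history instead of its own (possibly compacted) memory, which is what breaks the retry-the-same-thing loop.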
You have to explicitly tell Claude to prioritize quality over speed. Adding instructions like "take your time to do this thoroughly" and "do not skip validation steps" made a real difference. Without those nudges, Claude will cut corners on verification to get to the PR faster.
Full autonomy is a nice idea right up until Claude decides it needs to npm install a new dependency at 2am. Or it's refactoring thirty files when the ticket asked for a three-file change. Or it's hit a merge conflict in batch mode and decided to resolve it by picking a side at random.
Auto-dev has eight situations where it stops and asks a human before proceeding.
The pattern I landed on is: automate the predictable stuff, checkpoint the uncertain stuff. A small ticket that touches three files and has a clear implementation path? Let it run. A large refactor with a new npm package and database migrations? Stop and verify.
Each checkpoint sends a macOS notification and presents an interactive picker so the human gets a clear set of options instead of a wall of text. This UX detail matters more than I expected. If your checkpoints are annoying to respond to, people will just start approving everything without reading, which defeats the purpose.
The Anthropic guide makes a distinction between problem-first and tool-first skills. Problem-first is when the user describes an outcome and the skill orchestrates the right tools. Tool-first is when the user already has a tool and the skill teaches Claude best practices for using it.
Auto-dev is aggressively problem-first at the top level. The user says "implement this ticket" and the skill handles everything. Which tools to use, what order to do things in, when to spawn subagents, how to verify the work. The user never thinks about the underlying mechanics.
But the reference files are tool-first. verification-levels.md is essentially a guide on how to use the testing tools effectively. self-correction.md teaches Claude how to read error output and decide what to try next. The main SKILL.md provides the problem-first orchestration, and the reference files provide tool-first expertise for each step.
If you're building something complex, I think this layered approach is the way to go. The main instructions worry about what needs to happen. The reference files worry about how to do each step well.
Auto-dev can only be as good as the tickets it receives. If the ticket is vague or missing acceptance criteria, the analysis phase produces a bad brief, the plan is wrong, and the implementation is worse. I learned this the hard way watching auto-dev confidently build the wrong thing from an underspecified ticket.
So we built another skill first: enhance-linear-issues. You point it at a set of Linear tickets and it reviews each one, fills in missing acceptance criteria, decomposes tickets that are too large, adds technical context from the codebase, and flags ambiguities that need human input before implementation starts. It's the prep work that makes auto-dev's job possible. This one's open source:
```shell
npx skills add super-mega-lab/toolkit
```

The workflow is: enhance-linear-issues cleans up and structures the tickets, then auto-dev takes them and runs. The quality of auto-dev's output went up dramatically once we stopped feeding it raw tickets and started feeding it well-structured ones. It turns out that the skill you build to prepare the input matters as much as the skill that does the work.
I got a lot wrong and I'm still fixing things.
I should have started with one phase and gotten it solid before building the rest. I built the entire eight-phase system end-to-end, then spent days debugging cascading failures where a bad analysis in Phase 1 produced a bad plan in Phase 2 that produced bad code in Phase 4. If I'd built and stabilized Phase 4 (implementation with verification) first, then layered on the surrounding phases, I would have saved a lot of time.
I should have designed my state schema upfront. I evolved workflow-state.json organically as I added features, which led to inconsistencies. Some fields camelCase, others snake_case, some objects with redundant data. It's the kind of tech debt that's annoying to fix after the fact and trivial to prevent with thirty minutes of design work.
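Thirty minutes of that design work might have produced something like this: a consistently camelCase workflow-state.json skeleton. Every field name here is hypothetical, not auto-dev's actual schema:

```json
{
  "ticketId": "TEAM-123",
  "phase": 4,
  "units": [
    { "id": "unit-1", "status": "done", "failures": 0 },
    { "id": "unit-2", "status": "in-progress", "failures": 1 }
  ],
  "verification": { "lastLevelPassed": "L1" },
  "attemptHistory": ["src/login.tsx:42:TS2345"]
}
```

Picking one casing convention and one place for each fact (no redundant copies of ticket data inside units) is the entire trick.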
I should have tested my frontmatter more systematically, running ten to twenty test queries to check trigger accuracy: does the skill load when it should, and does it stay quiet when it shouldn't? I did this informally but not rigorously, and it showed.
And I still haven't fully solved the logic error problem. Auto-dev is great at catching and fixing type errors and lint violations; those have clear error messages with specific file and line information. But the situation where the test passes and the feature doesn't actually work is much harder to catch autonomously. The requirements retrospective in Phase 5 helps (it re-reads the original ticket and checks each requirement against the diff), but it's not as reliable as I want it to be.
Skill building is becoming its own thing. It's not prompt engineering; the instructions are too structured and long-lived for that label. It's not traditional programming; there's no compiler telling you when you're wrong. It's somewhere in between. You're writing a detailed playbook for a colleague who's extremely capable but takes everything literally and will forget what you told them three hours ago if you don't write it down.
The infrastructure for this is getting built right now. The Anthropic guide, the skill-creator tool, the Skills API, the open standard for portability across platforms. What's missing is the accumulated knowledge of what actually works when you push past the simple patterns. What happens when your skill needs to run for two hours. How you handle context compaction. Where to put human checkpoints. How to teach self-correction without infinite loops.
That's what I spent a couple weeks learning. Most of it the hard way. Hopefully some of it saves you a few days if you're building something similar.