Most of what I know about skill architecture I learned by getting it wrong first. Here's what lives in the gap between simple patterns and autonomous multi-hour workflows.
I've been building a Claude Code skill called auto-dev for the past couple weeks. You give it a Linear ticket, and it analyzes the codebase, writes an implementation plan, codes the feature in isolated subagents, runs verification at four different levels, and opens a pull request. It handles batch processing with parallel git worktrees. It has circuit breakers and a self-correction system that tracks every failed approach so it doesn't try the same thing twice.

What I want to talk about is the process of building it, because most of what I know about skill architecture I learned by getting it wrong first, and I think the lessons transfer to anyone building skills that are more ambitious than a commit message formatter.
Anthropic recently published The Complete Guide to Building Skills for Claude, and it's a solid resource for getting started. It covers the fundamentals, five common patterns, testing strategies, distribution. But there's a gap between the patterns in that guide and the reality of building something that runs autonomously for hours across multiple phases. This post is about what lives in that gap.
When you build a skill, it's just a folder with a SKILL.md file inside it. Markdown with YAML frontmatter at the top. There's optionally a references/ directory for supplementary docs, a scripts/ folder for code, and assets/ for templates. That's the whole structure.
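A typical layout looks something like this. The directory names come straight from the structure just described; the skill name and comments are my own illustration:

```
my-skill/
├── SKILL.md        # YAML frontmatter + markdown instructions
├── references/     # supplementary docs, read on demand
├── scripts/        # executable helpers
└── assets/         # templates
```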
What took me way too long to internalize is how the loading actually works. It's a three-level progressive disclosure system, and understanding it changed how I designed everything.
[Diagram: the three loading levels. Each level loads more context; wider boxes mean more tokens in the context window.]
The first level is the YAML frontmatter, the name and description fields at the top of your SKILL.md. This gets loaded into Claude's system prompt for every single conversation, whether your skill is relevant or not. It's how Claude decides whether to load the skill right now. The second level is the body of SKILL.md itself, which only gets loaded when Claude thinks the skill matches the current task. The third level is the reference files, which Claude discovers and reads on demand as it works through the instructions.
This matters because your frontmatter is competing for attention with every other skill the user has installed. The Anthropic guide recommends structuring it as what the skill does, when to use it, and key capabilities. The when-to-use-it part is doing about 80% of the work. My early versions had a description that triggered on any mention of "Linear" or "ticket", which meant the full skill loaded when someone just wanted to check a ticket status. Adding specific trigger phrases fixed it, but I should have been more deliberate about this from the start.
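To make that concrete, here is a sketch of frontmatter written that way. The structure (the name and description fields) is real; the wording and trigger phrases are illustrative, not auto-dev's actual frontmatter:

```yaml
name: auto-dev
description: >
  Implements a Linear ticket end to end: codebase analysis, implementation
  plan, coding in isolated subagents, four-level verification, and a pull
  request. Use when the user asks to "implement ticket TEAM-123", "run
  auto-dev on this ticket", or "auto-dev the backlog". Do not load for
  checking ticket status or other read-only Linear questions.
```

Note the explicit negative trigger at the end; telling Claude when not to load is what stopped the skill from firing on casual ticket mentions.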
Auto-dev's main SKILL.md is around 5,000 words, which is already pushing against what Claude can reliably follow. But the actual system is closer to 25,000 words of specification. Phase-by-phase execution commands, verification gate specs, self-correction logic, state file schemas, PR templates, safety limits, concurrent session management. There's a lot.
If I loaded all of that into context at once, Claude would start ignoring things. Instructions that are too verbose get skipped. Bullet points and numbered lists get followed more reliably than prose paragraphs. Critical instructions need to be at the top, not buried. And detailed reference material needs to live in separate files.
The main SKILL.md tells Claude what to do at each phase. The reference files tell it how. When Claude enters Phase 4 (Implementation), it reads the Phase 4 section from workflow-phases.md. When a verification gate fails, it reads self-correction.md. This keeps the working context small while giving Claude access to the full spec when it needs it.
I think this is the single most important architectural decision for any skill over about 2,000 words. If your instructions are getting long, don't try to make them shorter by cutting detail. Move the detail to reference files and point to them from the main SKILL.md.
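That per-phase lookup is easy to sketch. Assuming workflow-phases.md uses one "## Phase N" heading per section (the real file's layout may differ), pulling out just the current phase's instructions looks like this:

```python
import re

def load_phase_section(markdown_text: str, phase: int) -> str:
    """Return only the requested phase's section from a workflow-phases.md-style file.

    Assumes sections start with headings like '## Phase 4: Implementation';
    the actual file layout may differ.
    """
    pattern = rf"(^## Phase {phase}\b.*?)(?=^## Phase \d|\Z)"
    match = re.search(pattern, markdown_text, re.MULTILINE | re.DOTALL)
    if match is None:
        raise ValueError(f"Phase {phase} not found")
    return match.group(1).strip()

phases_md = """## Phase 3: Planning
Write the plan.
## Phase 4: Implementation
Code each unit in a subagent.
"""
print(load_phase_section(phases_md, 4))  # prints only the Phase 4 section
```

The point is not the regex; it's that only one phase's worth of detail ever enters the working context at a time.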
Here's the thing that caught me completely off guard. I had the progressive disclosure working. The reference files were clean. The instructions were clear and well-structured. And the skill still broke in later phases.
The issue is context compaction. When Claude's context window fills up during a long session, it compresses older messages to make room for new ones. That compression is lossy. Instructions that were loaded at the start of the session can be partially or completely gone by the time Claude reaches Phase 5 of an eight-phase workflow.
[Chart: how much of your original instructions Claude remembers at each phase.]
So Claude would execute Phases 0 through 3 perfectly, then start skipping steps in Phase 5. Not because the instructions were bad, but because they'd been compressed out of the context window. The skill was literally forgetting its own rules.
I solved this with what I think of as anti-compaction patterns, and if you're building anything that runs for more than about fifteen minutes, you probably need something similar.
In short:

1. Re-read workflow-phases.md at the start of every phase.
2. Write a markdown summary to disk at the end of every phase.
3. Keep everything in .dev-state/: disk is the source of truth.
The first pattern is phase instruction re-reads. At the start of every phase, auto-dev uses Glob to find the workflow-phases.md file and reads the section for the current phase. Every time. Even if it just read it two phases ago. This is explicitly a defense against compaction. It reloads the detailed instructions that might have been compressed away. This single change eliminated most of the problems I was seeing with skipped steps in later phases.
The second pattern is phase-boundary checkpoints. At the end of every phase, auto-dev writes a markdown summary to disk capturing what was accomplished, key decisions made, files affected, verification results, and what the next phase should focus on. At the start of each phase, it reads the previous phase's summary. This creates a paper trail that survives compaction. Even if Claude has lost all memory of Phase 2 by the time it reaches Phase 5, the summary file restores the context.
The third pattern is externalizing state entirely. Everything lives in a .dev-state/ directory. The workflow state, implementation plans, unit results, verification outcomes, attempt histories. Claude's working memory is treated as volatile. The disk is the source of truth.
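A minimal sketch of the checkpoint mechanics. The .dev-state/ directory name matches the post; auto-dev writes markdown summaries, but this sketch uses JSON with invented field names for brevity:

```python
import json
from pathlib import Path

# Illustrative: the real skill writes markdown summaries with more fields.
STATE_DIR = Path(".dev-state")

def write_checkpoint(phase: int, summary: dict) -> None:
    """Persist a phase-boundary summary so later phases can recover it."""
    STATE_DIR.mkdir(exist_ok=True)
    path = STATE_DIR / f"phase-{phase}-summary.json"
    path.write_text(json.dumps(summary, indent=2))

def read_checkpoint(phase: int) -> dict:
    """Reload an earlier phase's summary; disk, not memory, is the source of truth."""
    return json.loads((STATE_DIR / f"phase-{phase}-summary.json").read_text())

write_checkpoint(2, {"accomplished": "analysis brief", "nextFocus": "plan the units"})
print(read_checkpoint(2)["nextFocus"])  # prints "plan the units"
```

Because every phase starts by reading the previous checkpoint from disk, the workflow can resume correctly even after the conversation history has been compressed away.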
If I could condense everything I learned into one sentence, it would be: design for context loss from the beginning. Assume that whatever Claude knows at the start of your workflow will be gone halfway through. Write important things to disk. Re-read your own instructions before executing them. Your skill should be able to reconstruct its complete state from files alone, at any point, without relying on what Claude remembers from earlier in the conversation.
Auto-dev has four verification levels:

- L0, after every file edit: format, lint, typecheck.
- L1, after each implementation unit: typecheck plus unit tests.
- L2, after all implementation: full test suite.
- L3, before opening the PR: scope check plus a seventeen-point code review.
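As a sketch, the gates can be expressed as an ordered command list per level. The level semantics match the levels above; the specific npm commands and script names are hypothetical, assuming a TypeScript project:

```python
# Hypothetical gate definitions; auto-dev's real commands depend on the
# repository's tooling.
GATES: dict[str, list[str]] = {
    "L0": ["npm run format", "npm run lint", "npm run typecheck"],
    "L1": ["npm run typecheck", "npm test -- --changed"],
    "L2": ["npm test"],
    "L3": ["scripts/scope-check.sh", "scripts/code-review.sh"],
}

def commands_for(level: str) -> list[str]:
    """Return the commands a verification level must pass, in order."""
    if level not in GATES:
        raise KeyError(f"unknown verification level: {level}")
    return GATES[level]

print(commands_for("L0"))  # the three L0 commands, in order
```

Keeping the gates declarative like this makes it cheap to tell Claude "run the L1 gate" without restating the commands in every phase.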
The verification levels themselves were straightforward to implement. The hard part was teaching Claude when to give up.
Without explicit limits, Claude will retry a failing test in a loop. It'll try the same fix with slightly different syntax. It'll start modifying things that weren't broken, introducing new problems while chasing the original one. I watched this happen over and over during development. Claude would get stuck on a type error, try the same approach from three different angles, then start modifying unrelated files trying to make the types line up. Twenty minutes later the codebase was worse than when it started.
The escalation ladder, in order:

1. Direct targeted fix.
2. Fundamentally different approach.
3. Stop, re-read requirements, reconsider design.
4. Mark unit as stuck.
5. Circuit breaker: notify human.
The self-correction protocol I ended up with works on a simple escalation model. First failure: direct targeted fix. Second failure on the same issue: fundamentally different approach. Third failure: stop, re-read the requirements, reconsider the design. If the same error signature (a combination of file path, line number, and error code) appears three times, stop entirely and mark the unit as stuck. If Claude starts oscillating, trying approach A, then B, then A again, stop immediately. After ten total failures on a single ticket, trigger a circuit breaker and notify the human.
The error signature tracking is what makes this work in practice. Every failure gets catalogued with a signature like src/login.tsx:42:TS2345. Before attempting a fix, Claude checks whether it's seen that signature before and what it tried last time. This prevents the loop where it tries the same fix with different variable names and wastes twenty minutes going nowhere.
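Here is roughly what that bookkeeping looks like as code. The thresholds come from the protocol above; the class and method names are mine, and the oscillation check is simplified to recurring signatures rather than recurring approaches:

```python
from collections import Counter

class AttemptTracker:
    """Track failure signatures (path:line:code) across fix attempts.

    Illustrative sketch: thresholds mirror the post's protocol, but the
    names and the simplified oscillation check are my own.
    """

    def __init__(self, repeat_limit: int = 3, total_limit: int = 10):
        self.signatures = Counter()   # how often each signature has appeared
        self.history: list[str] = []  # signatures in the order they occurred
        self.repeat_limit = repeat_limit
        self.total_limit = total_limit

    def record(self, signature: str) -> str:
        """Log a failure and decide what to do next."""
        self.signatures[signature] += 1
        self.history.append(signature)
        if self.signatures[signature] >= self.repeat_limit:
            return "stuck"            # same error three times: mark the unit stuck
        if sum(self.signatures.values()) >= self.total_limit:
            return "circuit-breaker"  # ten total failures: notify the human
        if len(self.history) >= 3 and self.history[-1] == self.history[-3]:
            return "oscillating"      # A, B, A: we're cycling, stop immediately
        return "retry"

tracker = AttemptTracker()
tracker.record("src/login.tsx:42:TS2345")         # first failure: retry
tracker.record("src/login.tsx:42:TS2345")         # second: retry
print(tracker.record("src/login.tsx:42:TS2345"))  # third: prints "stuck"
```

Before each fix attempt, Claude consults this history instead of its own (possibly compacted) memory, which is what breaks the retry-the-same-thing loop.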
You have to explicitly tell Claude to prioritize quality over speed. Adding instructions like "take your time to do this thoroughly" and "do not skip validation steps" made a real difference. Without those nudges, Claude will cut corners on verification to get to the PR faster.
Full autonomy is a nice idea right up until Claude decides it needs to npm install a new dependency at 2am. Or it's refactoring thirty files when the ticket asked for a three-file change. Or it's hit a merge conflict in batch mode and decided to resolve it by picking a side at random.
Auto-dev has eight situations where it stops and asks a human before proceeding.
The pattern I landed on is: automate the predictable stuff, checkpoint the uncertain stuff. A small ticket that touches three files and has a clear implementation path? Let it run. A large refactor with a new npm package and database migrations? Stop and verify.
Each checkpoint sends a macOS notification and presents an interactive picker so the human gets a clear set of options instead of a wall of text. This UX detail matters more than I expected. If your checkpoints are annoying to respond to, people will just start approving everything without reading, which defeats the purpose.
The Anthropic guide makes a distinction between problem-first and tool-first skills. Problem-first is when the user describes an outcome and the skill orchestrates the right tools. Tool-first is when the user already has a tool and the skill teaches Claude best practices for using it.
Auto-dev is aggressively problem-first at the top level. The user says "implement this ticket" and the skill handles everything. Which tools to use, what order to do things in, when to spawn subagents, how to verify the work. The user never thinks about the underlying mechanics.
But the reference files are tool-first. verification-levels.md is essentially a guide on how to use the testing tools effectively. self-correction.md teaches Claude how to read error output and decide what to try next. The main SKILL.md provides the problem-first orchestration, and the reference files provide tool-first expertise for each step.
If you're building something complex, I think this layered approach is the way to go. The main instructions worry about what needs to happen. The reference files worry about how to do each step well.
Auto-dev can only be as good as the tickets it receives. If the ticket is vague or missing acceptance criteria, the analysis phase produces a bad brief, the plan is wrong, and the implementation is worse. I learned this the hard way watching auto-dev confidently build the wrong thing from an underspecified ticket.
So we built another skill first: enhance-linear-issues. You point it at a set of Linear tickets and it reviews each one, fills in missing acceptance criteria, decomposes tickets that are too large, adds technical context from the codebase, and flags ambiguities that need human input before implementation starts. It's the prep work that makes auto-dev's job possible. This one's open source:
```shell
npx skills add super-mega-lab/toolkit
```

The workflow is: enhance-linear-issues cleans up and structures the tickets, then auto-dev takes them and runs. The quality of auto-dev's output went up dramatically once we stopped feeding it raw tickets and started feeding it well-structured ones. It turns out that the skill you build to prepare the input matters as much as the skill that does the work.
I got a lot wrong and I'm still fixing things.
I should have started with one phase and gotten it solid before building the rest. I built the entire eight-phase system end-to-end, then spent days debugging cascading failures where a bad analysis in Phase 1 produced a bad plan in Phase 2 that produced bad code in Phase 4. If I'd built and stabilized Phase 4 (implementation with verification) first, then layered on the surrounding phases, I would have saved a lot of time.
I should have designed my state schema upfront. I evolved workflow-state.json organically as I added features, which led to inconsistencies. Some fields camelCase, others snake_case, some objects with redundant data. It's the kind of tech debt that's annoying to fix after the fact and trivial to prevent with thirty minutes of design work.
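Thirty minutes of that design work might have produced something like this: a consistently camelCase workflow-state.json skeleton. Every field name here is hypothetical, not auto-dev's actual schema:

```json
{
  "ticketId": "TEAM-123",
  "phase": 4,
  "units": [
    { "id": "unit-1", "status": "done", "failures": 0 },
    { "id": "unit-2", "status": "in-progress", "failures": 1 }
  ],
  "verification": { "lastLevelPassed": "L1" },
  "attemptHistory": ["src/login.tsx:42:TS2345"]
}
```

Picking one casing convention and one place for each fact (no redundant copies of ticket data inside units) is the entire trick.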
I should have tested my frontmatter more systematically, running ten to twenty test queries to check trigger accuracy: does the skill load when it should, and does it stay quiet when it shouldn't? I did this informally but not rigorously, and it showed.
And I still haven't fully solved the logic error problem. Auto-dev is great at catching and fixing type errors and lint violations; those have clear error messages with specific file and line information. But the situation where the test passes and the feature doesn't actually work is much harder to catch autonomously. The requirements retrospective in Phase 5 helps (it re-reads the original ticket and checks each requirement against the diff), but it's not as reliable as I want it to be.
Skill building is becoming its own thing. It's not prompt engineering; the instructions are too structured and long-lived for that label. It's not traditional programming; there's no compiler telling you when you're wrong. It's somewhere in between. You're writing a detailed playbook for a colleague who's extremely capable but takes everything literally and will forget what you told them three hours ago if you don't write it down.
The infrastructure for this is getting built right now. The Anthropic guide, the skill-creator tool, the Skills API, the open standard for portability across platforms. What's missing is the accumulated knowledge of what actually works when you push past the simple patterns. What happens when your skill needs to run for two hours. How you handle context compaction. Where to put human checkpoints. How to teach self-correction without infinite loops.
That's what I spent a couple weeks learning. Most of it the hard way. Hopefully some of it saves you a few days if you're building something similar.