Why skills became a thing
A large language model is not a “smart assistant.” It’s a probabilistic generator wearing a very convincing costume of one. If you’ve shipped anything on top of an LLM for more than a couple of weeks, you’ve seen this firsthand: the same prompt on the same model, run on different days, produces different outputs. Sometimes the difference is cosmetic. Sometimes it quietly breaks your pipeline.
A quick word on why it does this
It’s worth pausing on the mechanics for one paragraph, because everything else in this article follows from it.
An LLM doesn’t “understand” your instruction and then execute it. It reads your tokens, runs them through a massive statistical model, and predicts the next most likely token, then the next, then the next. The output looks like reasoning because it was trained on terabytes of human reasoning. But there is no internal checklist. There is no agent that goes “okay, step 1 done, moving to step 2.” There is only: given everything I’ve seen so far, what comes next. If your instruction says “always call the calculator tool before computing,” the model treats that as context — a strong hint, not a hard rule. If its training pulls it toward computing inline, it might compute inline. No error. No warning. Just a different path through probability space.
This is why skills matter. Skills are how you compensate for the fact that your instructions are suggestions, not commands. You can’t make an LLM obey. You can only make the obedient path more probable than the alternatives.
What a skill actually is
A skill is a folder with a SKILL.md file inside. That file contains instructions, and sometimes the folder also contains scripts and reference materials. An orchestrator loads the skill dynamically when the task matches its description. The format originated at Anthropic, but it's since been published as an open standard and adopted across Claude Code, Cursor, Opencode, Codex, the Claude Agent SDK, and others. Skills are no longer a one-company feature — they're a shared way of packaging expertise for any orchestrator.
The point of a skill, in one line: it shifts LLM behavior from “guessing” toward “predictable.” You stop pasting the same 40-line instruction into every chat and package it once. When the model sees a matching task, your instructions load.
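Concretely, a minimal SKILL.md is a block of YAML frontmatter plus instructions. The name and description fields are what the orchestrator matches against; the body is yours. A made-up example:

---
name: changelog-to-release-notes
description: Use when turning a changelog into user-facing release notes. Not for marketing copy or tweets.
---

When drafting release notes:
1. Group changes by user impact, not by commit order
2. Lead with breaking changes
3. Keep each item to one sentence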
That sounds boring. It stays boring right up until you realize skills are the exact point where your product either stabilizes — or starts producing plausible nonsense at scale.
Who should pay extra attention
Skills look simpler than they are. “It’s a folder with markdown, what could go wrong.” A lot.
The risk concentrates in a few groups:
- No-code and low-code builders assembling an agent on a platform and pulling skills from a marketplace without reading what’s inside.
- Indie builders and small teams without a dedicated AI engineer, where nobody is going to review a 300-line prompt.
- Product managers who “plugged in GPT” and are now confused about why the output is great half the time and junk the other half.
For all of them, the risk is identical: the less technical depth you have, the higher the odds a skill silently breaks your system. Silently is the operative word. An LLM doesn’t throw an exception. It doesn’t crash. It just starts producing output that looks right but isn’t.
I’ll come back to this. It’s the most important thing to understand about skills.
How to think about skills: not types, but levels of control
Before getting into the kinds of skills, it helps to change the framing. Don’t think of skills as a collection of formats. Think of them as a ladder of how much control you’re taking over the model’s behavior:
No skill → chaos
Instructions → soft control
Procedural → strict control
Tools → external control (source of truth outside the model)
Executable → determinism (code the model runs instead of writing)
Multi-step → orchestration (control at every step)

This matters because when you’re picking a “skill type,” you’re actually picking how much control you want and what you’re willing to pay for it. More control = less flexibility + more maintenance. Less control = faster to ship + more ways to fail quietly.
Now the types.
Skill types
For each: what it is, when to use, when it’s a mistake, a short example.
1. Text-based (instruction / procedural)
What it is. Pure markdown with instructions and a step-by-step procedure. No code, no tools, no external resources.
When to use. Anything text-first: reasoning, analysis, writing, structural review.
When it’s a mistake. When the task needs fresh data or deterministic math. If your skill asks the LLM to do arithmetic “in its head,” you’ve already lost.
Example.
When writing an article:
1. Identify reader intent
2. Build an outline with a clear narrative arc
3. Write each section in the style matching the outline
4. Validate structure against the original intent

2. Reference-augmented
What it is. A skill with separate reference files — documentation, rules, guidelines. The main SKILL.md acts as a map, pointing to which reference to load when.
When to use. When the knowledge base is too big to sit in one file without hurting clarity. A large API, company brand guidelines, legal rules.
When it’s a mistake. When the content is small and fits comfortably in one file. Adding structure for its own sake slows down both the model and you.
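The “map” part is usually nothing fancier than routing text inside SKILL.md — something like this, using the same file names as the structure below:

For endpoint signatures and parameters, read references/api.md
For naming and tone rules, read references/rules.md
For anything touching refunds or chargebacks, read references/edge-cases.md first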
Example structure.
skill-folder/
SKILL.md
references/
api.md
rules.md
edge-cases.md

3. Tool-oriented
What it is. A skill whose main job is teaching the model to call an external tool correctly. APIs, databases, internal services.
When to use. When you need an external source of truth or dynamic data. Prices. Order statuses. User state. Anything that should not come from the model’s memory.
When it’s a mistake. When you add a tool to a task the model can already handle. A well-known case: a financial agent with a calculator tool registered kept doing arithmetic in its head and returning 107.9 instead of 107.88 because the instruction didn’t force the tool call. The tool was there. The decision to use it was the model’s. It didn’t.
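The durable fix lives outside the prompt: have the orchestrator check whether the tool was actually called, and refuse answers that skipped it. A minimal sketch — call_model stands in for whatever LLM client your stack uses, and the .tool_calls shape is an assumption, not a real API:

from typing import Any, Callable

def run_with_mandatory_tool(
    call_model: Callable[[str], Any],  # your LLM client; response assumed to expose .tool_calls
    prompt: str,
    tool_name: str,
    max_retries: int = 2,
) -> Any:
    for _ in range(max_retries + 1):
        response = call_model(prompt)
        called = [c.name for c in (response.tool_calls or [])]
        if tool_name in called:
            return response
        # Nudge harder and retry instead of silently accepting inline math
        prompt += f"\n\nYou MUST call the {tool_name} tool. Do not compute the answer yourself."
    raise RuntimeError(f"Model skipped {tool_name} after {max_retries + 1} attempts")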
Example.
To get current exchange rates:
1. Always call the rates_api tool — do NOT compute from memory
2. Parse the response (see references/api.md)
3. Validate the timestamp is within the last 5 minutes

4. Executable
What it is. A skill bundled with a script the LLM runs instead of rewriting. Python, bash, whatever gives you determinism.
When to use. When the result has to be exact and repeatable. PDF parsing, structured extraction, heavy processing. This is exactly how Anthropic’s PDF skill works under the hood: the model doesn’t “read” the PDF, it runs a pre-written script that pulls form fields without loading the PDF into context.
When it’s a mistake. When the task is soft by nature. Forcing “write a friendly customer reply” into a script isn’t determinism, it’s product sabotage.
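The script side is ordinary code. A minimal sketch of what an extract_form_fields.py could contain, using pypdf — illustrative, not Anthropic’s actual script:

import json
import sys

from pypdf import PdfReader  # pip install pypdf

def extract_form_fields(path: str) -> dict:
    reader = PdfReader(path)
    fields = reader.get_fields() or {}  # None when the PDF has no form
    # Same input, same output, every run — no tokens spent "reading" the PDF
    return {name: field.get("/V") for name, field in fields.items()}

if __name__ == "__main__":
    print(json.dumps(extract_form_fields(sys.argv[1]), indent=2, default=str))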
Example.
skill-folder/
SKILL.md
scripts/
extract_form_fields.py

5. Composite (production-grade)
What it is. A skill with everything — instructions, references, scripts, templates. The real production shape.
When to use. Complex pipelines. Real products with accountability. Corporate workflows that need rules and code and examples.
When it’s a mistake. When the task doesn’t need it. A composite skill for “write me a tagline” isn’t thoroughness, it’s overhead.
Example structure.
skill-folder/
SKILL.md
scripts/
validate.py
references/
brand-voice.md
templates/
email-template.md

6. Multi-step execution
Important distinction: this is not a structural type, it’s an execution pattern. Multi-step is about how a skill breaks a task into controlled steps, validating each one before moving on.
When to use. Complex tasks with a high cost of failure. Financial operations, document generation with legal consequences, any chain where a mistake at step 2 contaminates every later step.
When it’s a mistake. Simple tasks. Wrapping “respond to this email” in multi-step costs tokens and latency without buying you anything.
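At the orchestrator level, the pattern is just a loop with a validator between steps, failing loudly instead of letting a bad step 2 flow downstream. A minimal sketch; the steps and validators are whatever your pipeline needs:

from typing import Any, Callable

# (name, run, is_valid) — is_valid gates whether the chain may continue
Step = tuple[str, Callable[[Any], Any], Callable[[Any], bool]]

def run_pipeline(steps: list[Step], payload: Any) -> Any:
    for name, run, is_valid in steps:
        payload = run(payload)
        if not is_valid(payload):
            # Stop here: a contaminated intermediate result must not reach later steps
            raise ValueError(f"Step {name!r} produced invalid output; aborting chain")
    return payload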
Example.
Step 1 → Gather all input; confirm it's complete
Step 2 → Extract structured data; pause if anything is ambiguous
Step 3 → Execute the core logic
Step 4 → Verify output matches expected schema
Step 5 → Return result; log all intermediate states

Guardrails: where the skill ends and the problem begins
Now the most important section. Maybe the most important in the article.
A skill needs guardrails. Explicit statements about where it applies, where it doesn’t, what inputs are valid, what environments are supported. This sounds obvious. In practice, 80% of public skills either skip this entirely or do it badly.
Bad skill
Use this skill for frontend development.

That’s it. That’s the actual description quality you’ll find in a lot of open-source skills. An orchestrator reading that will apply it to a Vue project, because “frontend.” It’ll apply it to plain HTML. It’ll apply it to React Native, even though the skill was written for React on the web. And you won’t notice until you see the broken output.
Good skill
Use ONLY for React projects with:
- functional components (no class components)
- hooks-based state management
- TypeScript (not plain JS)

Do NOT use in:
- Vue, Angular, or Svelte projects
- React Native codebases
- Server-side only (Node.js without React)

Required environment:
- React 18+
- Node.js 20+

The gap between these two is enormous. The first is an invitation to fail. The second is a contract.
Guardrails for an executable skill
Requires:
- Python 3.10+
- Filesystem write access to /tmp
- Permission to run subprocess calls

Will fail silently if:
- Running in a sandbox without subprocess access
- Input file > 100MB (use the streaming variant instead)

Without guardrails, a skill behaves like a virus: it applies anywhere the orchestrator spots a loose keyword match. With guardrails, it refuses to run where it shouldn’t. That’s what you want.
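An executable skill can enforce part of its own contract: a preflight check at the top of the script turns the “fails silently” cases above into loud, immediate failures. A minimal sketch matching those requirements (POSIX assumed):

import os
import subprocess
import sys

MAX_INPUT_BYTES = 100 * 1024 * 1024  # the 100MB limit from the guardrail above

def preflight(input_path: str) -> None:
    if sys.version_info < (3, 10):
        raise RuntimeError("Requires Python 3.10+")
    if not os.access("/tmp", os.W_OK):
        raise RuntimeError("Requires write access to /tmp")
    try:
        subprocess.run(["true"], check=True)  # probe: does the sandbox allow subprocess calls?
    except (OSError, subprocess.SubprocessError) as exc:
        raise RuntimeError("Sandbox blocks subprocess calls") from exc
    if os.path.getsize(input_path) > MAX_INPUT_BYTES:
        raise RuntimeError("Input over 100MB — use the streaming variant instead")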
Why you can’t just “install a skill from GitHub”
A common mistake: grab a popular skill from some repo and bolt it onto your agent. “It works for everyone else.”
It doesn’t. Here’s why.
Instruction conflicts. You already have a system prompt with its own rules. The skill has its own. They can contradict each other. The LLM won’t flag the conflict — it’ll just pick something, and often not what you wanted.
Bad triggers. Description too broad → the skill fires on tasks it wasn’t designed for. Description too narrow → it doesn’t fire where it should. Both hurt.
Environment mismatch. The skill expects Python 3.11, you’re on 3.9. It expects filesystem access, you’re in a sandbox. It expects one tool-response format, you have another. If the skill doesn’t document its requirements, you’ll discover the mismatch through silent failures.
Tool misuse. The most painful one. The LLM “knows” about the tool from the skill but calls it with wrong parameters — or doesn’t call it at all because it decided mid-reasoning that it could handle this one itself. In production this shows up as a non-trivial rate of skipped tool calls and silent task-sequencing errors. Research on multi-agent systems puts this rate in the tens of percent for some models, not fractions of a percent.
The dangerous one: failures without a failure. I promised I’d come back to this.
The LLM doesn’t crash. It doesn’t throw. It returns plausible nonsense.
When a skill applies where it shouldn’t, you don’t get a red alert. You get:
- An answer that looks fine
- Metrics that don’t move
- A user who writes in three days later saying “the numbers in my report don’t add up”
By then you’ve lost trust, not just one request.
The thesis of this whole section:
A skill is not a plugin. It’s a change to how your model behaves.
Treat installing one with the same seriousness you’d treat deploying code to production. Functionally, that’s what it is.
Universal vs. per-model: which kind of skill are you writing?
One decision you’ll face early and probably won’t think about enough: should your skill be universal, or should it be tuned to a specific model or orchestrator?
A universal skill is written to work across models and platforms. It leans on the open standard, keeps instructions framework-agnostic, and doesn’t assume anything about the runtime beyond “there is an LLM and it will read markdown.” This is what most public skill repositories aim for. It’s portable. It’s reusable. It’s weaker.
A model-specific skill (or orchestrator-specific skill) is tuned to one target. You know which model will run it. You know its quirks — that it tends to skip step 3, or ignores bulleted lists but respects numbered ones, or forgets tool schemas after 8k tokens. You write the skill against those specific behaviors. This is what most internal production skills look like when you peek behind the curtain at companies running LLMs seriously.
The tradeoff is the obvious one. Universal skills are great for open-source, for sharing, for “install and go.” They degrade the moment someone runs them on a model they weren’t tested against. Per-model skills are stronger inside your product and useless outside it — switch the model provider and you’re rewriting.
Most people default to universal without thinking. It feels safe. It’s how things are “supposed” to be done. But if you’re building something serious, you usually end up with a mix: a universal scaffold for portability, plus per-model overrides for the parts where behavior diverges. The mistake is picking one side on autopilot.
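What the mix can look like on disk — a hypothetical layout, since the open standard doesn’t define per-model overrides:

skill-folder/
SKILL.md
overrides/
claude.md
gpt.md

SKILL.md holds the universal scaffold; each override file patches the quirks of one model, and the glue that decides which override to load is your orchestrator code, not the standard.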
Why skills need benchmarks
Which brings me to the last part.
The problem today
- There’s no standard for evaluating skills
- Evaluation happens by eye — someone tries it a few times, it “seems to work,” it ships
- There are no metrics for: did the trigger fire correctly, did execution complete, did it break anything else, what’s the latency, what’s the cost
With three skills, you can run them manually. With thirty skills combining in multi-agent flows, you physically can’t cover all the combinations.
What should be measured
The minimum set, per skill:
- Trigger accuracy — does it fire where it should, and not fire where it shouldn’t
- Execution correctness — is the output right
- Output consistency — is the output stable across runs
- Safety — does the skill stay inside its stated scope
- Latency and cost — time and money per call
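Even a crude harness beats eyeballing. A minimal sketch for the first metric, trigger accuracy — skill_fired is whatever hook your orchestrator exposes for “which skills loaded on this request,” hypothetical here:

from typing import Callable

def trigger_accuracy(
    skill_fired: Callable[[str], bool],  # runs one prompt, reports whether the skill loaded
    should_fire: list[str],              # prompts where the skill must trigger
    should_not_fire: list[str],          # prompts where it must stay quiet
) -> dict:
    hits = sum(skill_fired(p) for p in should_fire)
    false_fires = sum(skill_fired(p) for p in should_not_fire)
    return {
        "recall": hits / len(should_fire),
        "false_fire_rate": false_fires / len(should_not_fire),
    }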
And separately: benchmarking models for skills
One thing that rarely gets discussed: different LLMs handle skills differently. The same SKILL.md folder, plugged into two different models, produces different quality. One model skips instructions. The other follows them. Public model benchmarks don't capture this because they measure abstract ability, not "how well does this model follow your specific instructions inside your specific flow."
Public benchmarks don’t reflect reality
This is a thesis I arrived at the hard way:
Models that “crush the leaderboards” often underperform in practice — the moment you narrow down to tool-calling or chart generation inside an orchestrator.
The reason is structural. Public benchmarks cover broad surface area: general knowledge, reasoning on academic tasks, code generation out of context. Useful for a wide-angle view. Useless for a specific step in your orchestrator. You’re not running your agent on MMLU. You’re running it on “call exactly this tool, pass exactly these parameters, handle exactly this response.”
On that kind of specificity, the picture inverts.
What’s coming
I’m working on a narrow benchmark that simulates real use cases inside an orchestration step — not “model X is 4.2% better than model Y on average,” but “model X calls the right tool 94% of the time on tasks of your complexity, model Y does it 71% of the time, and X costs 10x more.”
This will ship as a feature of an existing open-source project: Bencher. It’s already live for a slightly different class of problems, and a skills-focused module is coming next. Details, setup, and the full scope are in the repo — drop the README into your LLM of choice if you’d rather skim.
One honest caveat: no benchmark should be the final word on a model choice. I use this kind of tooling as a sieve — it filters out models that are obviously wrong for the job and compares options where the price gap is 10x. The finalists still get tested on live traffic. Nothing replaces watching a model behave on your actual users.
One line to take away:
A bad skill is worse than no skill.
No skill is honest uncertainty. A bad skill is false confidence. And in production, false confidence always costs more than an honest “I don’t know.”
If you’re building something serious on top of an LLM, stop treating skills as convenient markdown folders. Treat them as behavioral changes to your system. Write guardrails. Test before you install. Measure in production. And don’t trust general-purpose benchmarks when your question is specific.
More on how to measure this next time.
The only question left: was this article written by a self-designed skill?

