03
BUILD⭐⭐

Test & Improve Your Skill

Make sure it actually works well

You’ve built a skill — now let’s find out if it actually works. Try it out, see what happens, and make it better.

The test → fix → retest loop

Try the skill
Note what’s off
Tweak the skill
Try again

Repeat until you’d send the output without editing.

01

Does it show up when it should?

Try these with your skill installed:

1. Ask for the task directly — e.g., "Write me a status report" → Should activate
2. Rephrase it — e.g., "Give me a weekly update" → Should activate
3. Ask for something related but different — e.g., "Write a project proposal" → Should NOT activate
4. Ask something completely unrelated — e.g., "What's the weather?" → Should NOT activate
5. Ask the AI: "What skills do you have for reports?" → Should mention your skill

If it never fires: Add more trigger phrases to your description. If it fires for everything: Add “Do NOT use for…” lines.

Be pushy in your description. Instead of just “Writes status reports,” try “Writes status reports. Use when the user mentions status updates, weekly reports, team updates, check-ins, or standup summaries, even if they don’t say ‘status report’ explicitly.” AI tools tend to under-trigger skills, so a more aggressive description helps.
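As a sketch, here's what a pushy description could look like in your skill's frontmatter (the skill name is made up, and the exact field names may vary by tool — adapt to yours):

```yaml
---
name: status-report-writer
description: >
  Writes status reports in our team's format. Use when the user mentions
  status updates, weekly reports, team updates, check-ins, or standup
  summaries, even if they don't say "status report" explicitly.
  Do NOT use for project proposals, retrospectives, or meeting agendas.
---
```

Note the explicit trigger phrases and the "Do NOT use" line — both address the under- and over-triggering failure modes above.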

02

Is the output actually good?

Run your skill 3 times with the same request. The key question:

“Would I send this without editing?”

More specifically:

  • Did it follow your format? (right sections, right order)
  • Did it follow your rules? (tone, length, constraints)
  • Does it look like your example?
  • Is it consistent across all 3 runs?

Output inconsistent? Add more examples and mark critical rules with “IMPORTANT:”. Skips steps? Move important rules to the top. Wrong tone? Add a concrete “write like this” example.
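You can make the "did it follow my format?" check mechanical. A minimal sketch, assuming a hypothetical status-report skill with four required sections and a length cap (swap in your own rules):

```python
# Hypothetical format rules for a status-report skill: required sections,
# in this order, plus a length cap. Adjust to match your own SKILL.md.
REQUIRED_SECTIONS = ["Summary", "Progress", "Blockers", "Next Steps"]
MAX_WORDS = 300

def check_output(text: str) -> list[str]:
    """Return a list of rule violations (empty list = pass)."""
    problems = []
    positions = [text.find(s) for s in REQUIRED_SECTIONS]
    # Every section must appear...
    for section, pos in zip(REQUIRED_SECTIONS, positions):
        if pos == -1:
            problems.append(f"missing section: {section}")
    # ...and in the right order.
    found = [p for p in positions if p != -1]
    if found != sorted(found):
        problems.append("sections out of order")
    if len(text.split()) > MAX_WORDS:
        problems.append(f"over {MAX_WORDS} words")
    return problems

sample = "Summary: shipped v2. Progress: auth done. Blockers: none. Next Steps: QA."
print(check_output(sample) or "OK")  # prints "OK"
```

Run it against all 3 outputs; "consistent" means all three come back clean.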

03

Can it handle the messy stuff?

Real inputs are rarely perfect. Try these:

Give it incomplete info (leave out a key detail)
Give it way too much info (wall of text)
Ask for something slightly out of scope
Give it contradictory instructions

What good looks like: It asks clarifying questions for missing info, ignores the noise, and gracefully says no to out-of-scope requests.

The simple loop

1. Test it — does the output look right?
2. Fix the part that's off (description, instructions, or example)
3. Test the same thing again
4. When it works, try 2 more variations to be sure

That’s it. No fancy framework needed. Test, fix, test again.

When something’s off, tell the AI

The fastest way to improve your skill is to tell your AI exactly what went wrong. Paste this prompt, fill in the blanks, and let it suggest fixes to your SKILL.md:

Debug your skill — paste into your AI
I have a skill (SKILL.md) that isn't working the way I want. Here's what happened:

**What I asked:** [paste or describe the request you gave]
**What I expected:** [describe what good output looks like]
**What I got instead:** [paste or describe the actual output]

Here's my current SKILL.md:
[paste your SKILL.md content here]

Please suggest specific changes to my SKILL.md that would fix this. Show me the updated version.

This works in any AI tool. The more specific you are about what went wrong, the better the fix.

Takeaway: A skill is a living document. Version 1 is never the final version. The best skills get better because you keep feeding them what you learn.

The testing approach above is what the industry calls a “vibe eval.” You run the skill, look at the output, decide if it feels right. That works for personal skills. When a skill is shared across a team, or when you need it to survive model updates, you need more rigor. Here are five techniques worth knowing.

1. From Vibe Evals to Structured Evals

The manual test loop from the main lesson is a vibe eval. To go beyond it, write structured evals. A structured eval has three parts:

  • A test prompt — a realistic request someone would actually make
  • Expected output — what good looks like (not exact text, but criteria)
  • Assertions — specific pass/fail checks (“output is a PDF,” “contains the client name,” “has exactly 3 sections”)

These get saved in a standard format (JSON) so you can re-run them whenever you update the skill or when a new model version drops.
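There's no single standard schema for these files, but a minimal sketch could look like this — the prompt, assertion types, and field names below are illustrative, not a spec:

```python
import json

# One structured eval: a realistic prompt, criteria, and pass/fail checks.
# (The schema here is illustrative -- tools differ.)
EVAL = json.loads("""
{
  "prompt": "Write a status report for the Q4 migration project",
  "expected": "Markdown report, three sections, mentions the project, under 300 words",
  "assertions": [
    {"type": "contains", "value": "Q4 migration"},
    {"type": "section_count", "value": 3},
    {"type": "max_words", "value": 300}
  ]
}
""")

def check(output: str, assertion: dict) -> bool:
    """Evaluate one assertion against a skill's output."""
    if assertion["type"] == "contains":
        return assertion["value"] in output
    if assertion["type"] == "section_count":
        return output.count("##") == assertion["value"]  # markdown headings
    if assertion["type"] == "max_words":
        return len(output.split()) <= assertion["value"]
    raise ValueError(f"unknown assertion type: {assertion['type']}")

def run_eval(output: str) -> dict:
    """Re-run whenever the skill or the underlying model changes."""
    return {a["type"]: check(output, a) for a in EVAL["assertions"]}
```

Saved as JSON on disk, the same eval can be replayed after every edit to the skill.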

The mental model: vibe evals tell you “this seems fine.” Structured evals tell you “this specific thing broke.”

2. Baseline Comparison: Skill vs. No Skill

The most important question most skill authors never answer: is my skill actually better than no skill at all?

Modern AI tools are getting smarter with every model update. A skill you wrote 3 months ago might now be worse than just asking the AI directly, because the base model improved past what your skill encodes.

The technique: run the same test prompt twice. Once with the skill active, once without. Compare the outputs. If the no-skill version is just as good (or better), your skill needs updating or retiring.
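Once you have pass/fail checks, the with/without comparison can be scored mechanically. A sketch — `ask_model` is a placeholder for however your tool runs a prompt, not a real API:

```python
def ask_model(prompt: str, skill: bool) -> str:
    """Placeholder: call your AI tool with or without the skill loaded."""
    raise NotImplementedError

def pass_rate(output: str, checks: list) -> float:
    """Fraction of pass/fail checks the output satisfies."""
    results = [check(output) for check in checks]
    return sum(results) / len(results)

def baseline_compare(prompt: str, checks: list) -> str:
    """Same prompt, twice: once with the skill, once without."""
    with_skill = pass_rate(ask_model(prompt, skill=True), checks)
    without = pass_rate(ask_model(prompt, skill=False), checks)
    if with_skill > without:
        return f"skill helps ({with_skill:.0%} vs {without:.0%})"
    return f"no lift ({with_skill:.0%} vs {without:.0%}) -- update or retire the skill"
```

The checks are plain predicates, e.g. `lambda o: "Summary" in o`, so the same list works for both runs.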

This is where two concepts matter:

  • Capability uplift — a skill that teaches the AI something it genuinely can’t do well on its own. Example: a skill that bundles a Python validation script to check PDF form fields. The AI can’t do that without the script. These skills have durable value because they add real capability.
  • Encoded preference — a skill that teaches the AI your preferred way of doing something it could already do. Example: a skill that formats status reports in your company’s template. The AI could write a status report without it, but not in your format. These skills are valuable but fragile. As models get better at following instructions, you may need less and less scaffolding. They need regular re-testing.

The best skills combine both: they encode your preferences AND add capabilities the model doesn’t have natively (scripts, reference data, validated templates).

This is also how you think about building skills on top of skills. Your first skill handles the core task. Your second skill adds a validation layer. Your third skill connects to external data. Each layer adds capability that the base model can’t replicate on its own.

3. Automated Testing Agents

Claude’s skill-creator (updated March 2026) now includes specialized agents that automate the eval process:

  • A grader that checks your assertions against outputs, extracts implicit claims, and tells you when your tests are too easy (“this assertion would pass even with bad output”)
  • A blind comparator that takes two outputs without knowing which skill produced which, scores them on a rubric, and picks a winner. This removes the “I wrote this so it must be good” bias.
  • An analyzer that reads all your test results and surfaces patterns the summary stats hide (“assertion X always passes regardless of skill. It’s not testing anything useful”)

OpenAI’s Codex CLI has parallel eval tooling using JSONL trace capture and model-assisted grading. Hugging Face Upskill tracks pass rate, token savings, and skill lift as open-source metrics.

The tooling is platform-specific, but the concepts are universal: define what good looks like, automate the checking, compare against a baseline, iterate.

For most people, manual testing is all you need. If you’re deploying skills across a team, these tools are worth exploring.

4. Description Optimization

Your skill’s description is the single biggest factor in whether it actually gets used. It’s how the AI decides “should I load this skill for this request?”

The manual approach (from the main lesson): try 5 different phrasings and see if the skill activates.

The automated approach: generate 20 realistic test queries. 10 that should trigger the skill, 10 that shouldn’t. Run them. See where the description fails. Rewrite. Repeat.

What makes a good test query: specificity. Don’t test with “Format this data.” Test with “ok so my boss just sent me this xlsx file called Q4 sales final FINAL v2.xlsx and she wants me to add a column that shows profit margin as a percentage.” Real users talk like that.

The near-misses matter most for should-NOT-trigger queries. Don’t test with “What’s the weather?” Test with a query that shares keywords with your skill but actually needs something different.

Claude’s skill-creator automates this with a train/test split and iterative refinement. But you can do it manually with any AI tool: write 20 queries, test them, improve your description based on what failed.
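Scoring those 20 queries is just precision and recall on the trigger decision. A sketch, with made-up stand-in queries (record your own should/did results by hand or from logs):

```python
# Each row: (query, should the skill trigger?, did it trigger in your test?)
# The queries here are illustrative stand-ins for your own 20.
results = [
    ("write my weekly status report",              True,  True),
    ("summarize what the team shipped this week",  True,  True),
    ("draft a project proposal for the client",    False, False),
    ("turn these standup notes into an update",    True,  False),  # missed trigger
    ("format this spreadsheet for me",             False, True),   # false trigger
]

def trigger_score(results):
    should = [r for r in results if r[1]]
    fired = [r for r in results if r[2]]
    hits = [r for r in results if r[1] and r[2]]
    recall = len(hits) / len(should)    # of should-trigger queries, how many fired
    precision = len(hits) / len(fired)  # of firings, how many were wanted
    return precision, recall

precision, recall = trigger_score(results)
print(f"precision {precision:.0%}, recall {recall:.0%}")
```

Low recall means add trigger phrases to the description; low precision means add "Do NOT use for…" lines. Rewrite, re-run, repeat.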

5. When to Re-test (Model Updates and Skill Decay)

Skills can decay. A model update might make your skill’s instructions conflict with the model’s improved capabilities. Or the model might get so good at the base task that your skill’s overhead (loading time, token usage) no longer justifies the quality improvement.

Re-test your skills when:

  • A major model update ships (new Claude version, new GPT version, etc.)
  • You hear from users that output quality has changed
  • Your skill uses workarounds for known model weaknesses (those weaknesses may be fixed)

This is the real long-term value of structured evals: you run the same tests after a model update and instantly see if something broke. Without them, you find out only when a teammate says "hey, this skill isn't working right anymore."

You don’t need to build this yourself. Several AI tools now include skill-creator features that handle evals, baseline comparisons, and description optimization for you. Claude’s skill-creator skill runs structured interviews and sets up test suites automatically. Codex CLI has similar eval scaffolding. If your tool offers a skill-creator, lean on it. The concepts above are worth understanding so you know what the tool is doing, but let the tool do the heavy lifting.

Where the ecosystem is today:

Skill evaluation tooling is still early but moving quickly. There's no single dominant tool yet, and most skill authors rely on manual testing, but dedicated tools are starting to appear.


Need help bringing this to your team?

We work alongside your team to build AI-native workflows — from one-week sprints to full engineering acceleration. No handoffs, no slide decks.

Talk to us