Eval-Driven-Dev
Eval-Driven-Dev是一款code方向的AI技能,核心价值是>,可用于解决开发者在code领域的实际问题,帮助用户提升效率、自动化重复任务或优化工作流。
>
mkdir -p ./skills/eval-driven-dev && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/eval-driven-dev/SKILL.md -o ./skills/eval-driven-dev/SKILL.md Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).
Skill Content
# Eval-Driven Development for Python LLM Applications
You're building an **automated evaluation pipeline** that tests a Python-based AI application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via `pixie test`.
**What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM.
During evaluation, the app's own code runs for real — routing, prompt assembly, LLM calls, response formatting — nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.
**Rule: The app's LLM calls must go to a real LLM.** Do not replace, mock, stub, or intercept the LLM with a fake implementation. The LLM is the core value-generating component — replacing it makes the eval tautological (you control both inputs and outputs, so scores are meaningless). If the project's test suite contains LLM mocking patterns, those are for the project's own unit tests — do NOT adopt them for the eval Runnable.
**The deliverable is a working `pixie test` run with real scores** — not a plan, not just instrumentation, not just a dataset.
This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.
---
Before you start
**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, run the setup.sh included in the skill's resources.
The script updates the `eval-driven-dev` skill and `pixie-qa` python package to latest version, initialize the pixie working directory if it's not already initialized, and start a web server in the background to show user updates.
**Setup error handling — what you can skip vs. what must succeed:**
- **Skill update fails** → OK to continue. The existing skill version is sufficient.
- **pixie-qa upgrade fails but was already installed** → OK to continue with the existing version.
- **pixie-qa is NOT installed and installation fails** → **STOP.** Ask the user for help. The workflow cannot proceed without the `pixie` package.
- **`pixie init` fails** → **STOP.** Ask the user for help.
- **`pixie start` (web server) fails** → **STOP.** Ask the user for help. Check `server.log` in the pixie root directory for diagnostics. Common causes: port conflict, missing dependency, slow environment. Do NOT proceed without the web server — the user needs it to see eval results.
---
The workflow
Follow Steps 1–6 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.
**How to work — read this before doing anything else:**
- **One step at a time.** Read only the current step's instructions. Do NOT read Steps 2–6 while working on Step 1.
- **Read references only when a step tells you to.** Each step names a specific reference file. Read it when you reach that step — not before.
- **Create artifacts immediately.** After reading code for a sub-step, write the output file for that sub-step before moving on. Don't accumulate understanding across multiple sub-steps before writing anything.
- **Verify, then move on.** Each step has a checkpoint. Verify it, then proceed to the next step. Don't plan future steps while verifying the current one.
**When to stop and ask for help:**
Some blockers cannot and should not be worked around. When you encou
🎯 Best For
- Claude users
- GitHub Copilot users
- Software engineers
- Development teams
- Tech leads
💡 Use Cases
- Code quality improvement
- Best practice enforcement
📖 How to Use This Skill
- 1
Install the Skill
Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.
- 2
Load into Your AI Assistant
Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.
- 3
Apply Eval-Driven-Dev to Your Work
Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.
- 4
Review and Refine
Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.
❓ Frequently Asked Questions
Is Eval-Driven-Dev compatible with Cursor and VS Code?
Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.
Do I need specific dependencies for Eval-Driven-Dev?
Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.
How do I install Eval-Driven-Dev?
Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/eval-driven-dev/SKILL.md, ready to use.
Can I customize this skill for my team?
Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.
⚠️ Common Mistakes to Avoid
Skipping validation
Always test AI-generated code changes, even for simple refactors.
Missing dependency updates
Check if the skill requires updated dependencies or new packages.