Mayur Rathi

⭐ 34.1k GitHub stars

Eval-Driven-Dev

Eval-Driven-Dev is an code AI skill with a core value of >. It helps developers solve real-world problems in the code domain, boosting efficiency, automating repetitive tasks, and optimizing workflows.

>

Last verified on: 2026-07-14

Quick Facts

Category code

Works With Claude, GitHub Copilot

Source github/awesome-copilot

Stars ⭐ 34.1k

Last Verified 2026-07-14

Risk Level Low

mkdir -p ./skills/eval-driven-dev && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/eval-driven-dev/SKILL.md -o ./skills/eval-driven-dev/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Eval-Driven Development for Python LLM Applications

You're building an **automated evaluation pipeline** that tests a Python-based AI application end-to-end — running it the same way a real user would, with real inputs — then scoring the outputs using evaluators and producing pass/fail results via `pixie test`.

**What you're testing is the app itself** — its request handling, context assembly (how it gathers data, builds prompts, manages conversation state), routing, and response formatting. The app uses an LLM, which makes outputs non-deterministic — that's why you use evaluators (LLM-as-judge, similarity scores) instead of `assertEqual` — but the thing under test is the app's code, not the LLM.

During evaluation, the app's own code runs for real — routing, prompt assembly, LLM calls, response formatting — nothing is mocked or stubbed. But the data the app reads from external sources (databases, caches, third-party APIs, voice streams) is replaced with test-specified values via instrumentations. This means each test case controls exactly what data the app sees, while still exercising the full application code path.

**Rule: The app's LLM calls must go to a real LLM.** Do not replace, mock, stub, or intercept the LLM with a fake implementation. The LLM is the core value-generating component — replacing it makes the eval tautological (you control both inputs and outputs, so scores are meaningless). If the project's test suite contains LLM mocking patterns, those are for the project's own unit tests — do NOT adopt them for the eval Runnable.

**The deliverable is a working `pixie test` run with real scores** — not a plan, not just instrumentation, not just a dataset.

This skill is about doing the work, not describing it. Read code, edit files, run commands, produce a working pipeline.

---

Before you start

**First, activate the virtual environment**. Identify the correct virtual environment for the project and activate it. After the virtual environment is active, run the setup.sh included in the skill's resources.

The script updates the `eval-driven-dev` skill and `pixie-qa` python package to latest version, initialize the pixie working directory if it's not already initialized, and start a web server in the background to show user updates.

**Setup error handling — what you can skip vs. what must succeed:**

- **Skill update fails** → OK to continue. The existing skill version is sufficient.

- **pixie-qa upgrade fails but was already installed** → OK to continue with the existing version.

- **pixie-qa is NOT installed and installation fails** → **STOP.** Ask the user for help. The workflow cannot proceed without the `pixie` package.

- **`pixie init` fails** → **STOP.** Ask the user for help.

- **`pixie start` (web server) fails** → **STOP.** Ask the user for help. Check `server.log` in the pixie root directory for diagnostics. Common causes: port conflict, missing dependency, slow environment. Do NOT proceed without the web server — the user needs it to see eval results.

---

The workflow

Follow Steps 1–6 straight through without stopping. Do not ask the user for confirmation at intermediate steps — verify each step yourself and continue.

**How to work — read this before doing anything else:**

- **One step at a time.** Read only the current step's instructions. Do NOT read Steps 2–6 while working on Step 1.

- **Read references only when a step tells you to.** Each step names a specific reference file. Read it when you reach that step — not before.

- **Create artifacts immediately.** After reading code for a sub-step, write the output file for that sub-step before moving on. Don't accumulate understanding across multiple sub-steps before writing anything.

- **Verify, then move on.** Each step has a checkpoint. Verify it, then proceed to the next step. Don't plan future steps while verifying the current one.

**When to stop and ask for help:**

Some blockers cannot and should not be worked around. When you encou

🎯 Best For

Claude users
GitHub Copilot users
Software engineers
Development teams
Tech leads

💡 Use Cases

Code quality improvement
Best practice enforcement

📖 How to Use This Skill

1
Install the Skill

Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.
2
Load into Your AI Assistant

Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.
3
Apply Eval-Driven-Dev to Your Work

Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.
4
Review and Refine

Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Is Eval-Driven-Dev compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for Eval-Driven-Dev?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install Eval-Driven-Dev?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/eval-driven-dev/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.

🔗 Related Skills

a11y A11Y accessibility Accessibility Expert accessibility-runtime-tester Accessibility Runtime Tester acquire-codebase-knowledge Acquire-Codebase-Knowledge acreadiness-generate-instructions Acreadiness-Generate-Instructions add-educational-comments Add-Educational-Comments