Mayur Rathi

⭐ 34.1k GitHub stars

Phoenix-Evals

Phoenix-Evals is a data-category AI skill for building and running evaluators for AI/LLM applications using Phoenix. It covers code-based and LLM-based evaluators, datasets and experiments, validation of evaluators against human labels, and production monitoring.

Build and run evaluators for AI/LLM applications using Phoenix.

Last verified on: 2026-07-14

Quick Facts

Category: data
Works With: Claude, GitHub Copilot
Source: github/awesome-copilot
Stars: ⭐ 34.1k
Last Verified: 2026-07-14
Risk Level: Low

mkdir -p ./skills/phoenix-evals && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/phoenix-evals/SKILL.md -o ./skills/phoenix-evals/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Phoenix Evals

Build evaluators for AI/LLM applications. Code first, LLM for nuance, validate against humans.
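The "code first" idea above can be illustrated with a deterministic check that needs no model call. This is a minimal sketch in plain Python, not Phoenix's API: `EvalResult` and `evaluate_json_output` are hypothetical names introduced here, and the label/score/explanation shape is an assumption about what an evaluator result might carry.

```python
import json
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical result container: a label, a numeric score, and a reason."""
    label: str
    score: float
    explanation: str


def evaluate_json_output(output: str, required_keys: set[str]) -> EvalResult:
    """Deterministic check: is the output a JSON object with the required keys?"""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError as exc:
        return EvalResult("fail", 0.0, f"invalid JSON: {exc.msg}")
    if not isinstance(parsed, dict):
        return EvalResult("fail", 0.0, "top-level value is not an object")
    # Report missing keys in sorted order so the explanation is reproducible.
    missing = sorted(required_keys - set(parsed))
    if missing:
        return EvalResult("fail", 0.0, f"missing keys: {', '.join(missing)}")
    return EvalResult("pass", 1.0, "valid JSON with all required keys")
```

A check like this is cheap and repeatable, which is why it might come before an LLM judge; the judge can then be reserved for qualities that code cannot test directly.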

## Quick Reference

| Task | Files |
| ---- | ----- |
| Setup | [setup-python](references/setup-python.md), [setup-typescript](references/setup-typescript.md) |
| Decide what to evaluate | [evaluators-overview](references/evaluators-overview.md) |
| Choose a judge model | [fundamentals-model-selection](references/fundamentals-model-selection.md) |
| Use pre-built evaluators | [evaluators-pre-built](references/evaluators-pre-built.md) |
| Build code evaluator | [evaluators-code-python](references/evaluators-code-python.md), [evaluators-code-typescript](references/evaluators-code-typescript.md) |
| Build LLM evaluator | [evaluators-llm-python](references/evaluators-llm-python.md), [evaluators-llm-typescript](references/evaluators-llm-typescript.md), [evaluators-custom-templates](references/evaluators-custom-templates.md) |
| Batch evaluate DataFrame | [evaluate-dataframe-python](references/evaluate-dataframe-python.md) |
| Run experiment | [experiments-running-python](references/experiments-running-python.md), [experiments-running-typescript](references/experiments-running-typescript.md) |
| Create dataset | [experiments-datasets-python](references/experiments-datasets-python.md), [experiments-datasets-typescript](references/experiments-datasets-typescript.md) |
| Generate synthetic data | [experiments-synthetic-python](references/experiments-synthetic-python.md), [experiments-synthetic-typescript](references/experiments-synthetic-typescript.md) |
| Validate evaluator accuracy | [validation](references/validation.md), [validation-evaluators-python](references/validation-evaluators-python.md), [validation-evaluators-typescript](references/validation-evaluators-typescript.md) |
| Sample traces for review | [observe-sampling-python](references/observe-sampling-python.md), [observe-sampling-typescript](references/observe-sampling-typescript.md) |
| Analyze errors | [error-analysis](references/error-analysis.md), [error-analysis-multi-turn](references/error-analysis-multi-turn.md), [axial-coding](references/axial-coding.md) |
| RAG evals | [evaluators-rag](references/evaluators-rag.md) |
| Avoid common mistakes | [common-mistakes-python](references/common-mistakes-python.md), [fundamentals-anti-patterns](references/fundamentals-anti-patterns.md) |
| Production | [production-overview](references/production-overview.md), [production-guardrails](references/production-guardrails.md), [production-continuous](references/production-continuous.md) |

## Workflows

**Starting Fresh:**

[observe-tracing-setup](references/observe-tracing-setup.md) → [error-analysis](references/error-analysis.md) → [axial-coding](references/axial-coding.md) → [evaluators-overview](references/evaluators-overview.md)

**Building Evaluator:**

[fundamentals](references/fundamentals.md) → [common-mistakes-python](references/common-mistakes-python.md) → evaluators-{code|llm}-{python|typescript} → validation-evaluators-{python|typescript}

**RAG Systems:**

[evaluators-rag](references/evaluators-rag.md) → evaluators-code-* (retrieval) → evaluators-llm-* (faithfulness)
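The retrieval step of the RAG workflow above is the part that code can score directly. The sketch below shows three common retrieval metrics in plain Python. It does not use Phoenix's API; the function names and the document-ID representation are assumptions made for illustration.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k slots that hold a relevant document."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for doc_id in retrieved_ids[:k] if doc_id in relevant_ids)
    # Standard definition: divide by k even if fewer than k documents came back.
    return hits / k


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all relevant documents that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        return 0.0
    # A set avoids counting a document twice if the retriever repeats it.
    found = set(retrieved_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)


def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```

Faithfulness, by contrast, depends on whether the answer is supported by the retrieved text, which is why the workflow hands that step to an LLM evaluator.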

**Production:**

[production-overview](references/production-overview.md) → [production-guardrails](references/production-guardrails.md) → [production-continuous](references/production-continuous.md)

## Reference Categories

| Prefix | Description |
| ------ | ----------- |
| `fundamentals-*` | Types, scores, anti-patterns |
| `observe-*` | Tracing, sampling |
| `error-analysis-*` | Finding failures |
| `axial-coding-*` | Categorizing failures |
| `evaluators-*` | Code, LLM, RAG evaluators |
| `experiments-*` | Datasets, running experiments |
| `validation-*` | Validating evaluator accuracy against human labels |
| `production-*` | CI/CD, monitoring |
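For the `validation-*` category, "accuracy against human labels" can be made concrete by comparing an evaluator's pass/fail verdicts with human verdicts on the same examples. The following is a minimal sketch in plain Python under that assumption; `validation_report` is a hypothetical helper, not part of Phoenix.

```python
import math


def validation_report(evaluator_labels: list[bool], human_labels: list[bool]) -> dict[str, float]:
    """Compare evaluator verdicts with human verdicts (True means "pass")."""
    if len(evaluator_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not human_labels:
        raise ValueError("at least one labeled example is required")

    pairs = list(zip(evaluator_labels, human_labels))
    tp = sum(1 for e, h in pairs if e and h)          # both say pass
    tn = sum(1 for e, h in pairs if not e and not h)  # both say fail
    fp = sum(1 for e, h in pairs if e and not h)      # evaluator too lenient
    fn = sum(1 for e, h in pairs if not e and h)      # evaluator too strict

    return {
        "accuracy": (tp + tn) / len(pairs),
        # Rates are undefined (NaN) when humans gave no labels of that class.
        "true_positive_rate": tp / (tp + fn) if (tp + fn) else math.nan,
        "true_negative_rate": tn / (tn + fp) if (tn + fp) else math.nan,
    }
```

Reporting the two rates separately, rather than accuracy alone, shows whether the evaluator errs toward leniency or strictness, which a single number can hide when one class dominates.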

## Key Principles

| Principle | Action |
| --------- | ------ |
| Error analysis first | Can't automate what you haven't observed |
| Custom > g

🎯 Best For

Developers building AI/LLM applications
Claude users
GitHub Copilot users
Data professionals

💡 Use Cases

Building code-based and LLM-based evaluators
Validating evaluator accuracy against human labels
Evaluating RAG systems
Running experiments on datasets

📖 How to Use This Skill

1. Install the Skill

Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

2. Load into Your AI Assistant

Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.

3. Apply Phoenix-Evals to Your Work

Provide context for your task: describe the application you want to evaluate, paste sample outputs or traces, or share existing evaluators to guide the AI.

4. Review and Refine

Check the AI output for accuracy and completeness. Add human insight where the AI lacks context.

❓ Frequently Asked Questions

Does this work with Figma?

Figma is not listed for this skill. The Works With section lists Claude and GitHub Copilot as the supported tools.

How do I install Phoenix-Evals?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/phoenix-evals/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation against human labels

An evaluator's verdicts should be checked against human labels before its scores are trusted.

Ignoring data quality

AI analysis inherits all data quality issues — profile your data first.

🔗 Related Skills

Acreadiness-Assess (acreadiness-assess)
Acreadiness-Policy (acreadiness-policy)
ADR Generator (adr-generator)
Ai-Prompt-Engineering-Safety-Best-Practices (ai-prompt-engineering-safety-best-practices)
Ai-Readiness-Reporter (ai-readiness-reporter)
Ai-Ready (ai-ready)