Mayur Rathi

⭐ 34.1k GitHub stars

Agentic-Eval

Agentic-Eval is an code AI skill with a core value of |. It helps developers solve real-world problems in the code domain, boosting efficiency, automating repetitive tasks, and optimizing workflows.

|

Last verified on: 2026-07-14

Quick Facts

Category code

Works With Claude, GitHub Copilot

Source github/awesome-copilot

Stars ⭐ 34.1k

Last Verified 2026-07-14

Risk Level Low

mkdir -p ./skills/agentic-eval && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/agentic-eval/SKILL.md -o ./skills/agentic-eval/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Agentic Evaluation Patterns

Patterns for self-improvement through iterative evaluation and refinement.

Overview

Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.

text

Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use

- **Quality-critical generation**: Code, reports, analysis requiring high accuracy

- **Tasks with clear evaluation criteria**: Defined success metrics exist

- **Content requiring specific standards**: Style guides, compliance, formatting

---

Pattern 1: Basic Reflection

Agent evaluates and improves its own output through self-critique.

python

def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")
    
    for i in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)
        
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output
        
        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")
    
    return output

**Key insight**: Use structured JSON output for reliable parsing of critique results.

---

Pattern 2: Evaluator-Optimizer

Separate generation and evaluation into distinct components for clearer responsibilities.

python

class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

---

Pattern 3: Code-Specific Reflection

Test-driven refinement loop for code generation.

python

class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        
        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code

---

Evaluation Strategies

Outcome-Based

Evaluate whether output achieves the expected result.

python

def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.

python

def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?")

Rubric-Based

Score outputs against weighted dimensions.

python

RUB

🎯 Best For

Claude users
GitHub Copilot users
Software engineers
Development teams
Tech leads

💡 Use Cases

Code quality improvement
Best practice enforcement

📖 How to Use This Skill

1
Install the Skill

Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.
2
Load into Your AI Assistant

Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.
3
Apply Agentic-Eval to Your Work

Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.
4
Review and Refine

Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Is Agentic-Eval compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for Agentic-Eval?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install Agentic-Eval?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/agentic-eval/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.

🔗 Related Skills

a11y A11Y accessibility Accessibility Expert accessibility-runtime-tester Accessibility Runtime Tester acquire-codebase-knowledge Acquire-Codebase-Knowledge acreadiness-generate-instructions Acreadiness-Generate-Instructions add-educational-comments Add-Educational-Comments

Agentic-Eval

Quick Facts

Skill Content

Overview

When to Use

Pattern 1: Basic Reflection

Pattern 2: Evaluator-Optimizer

Pattern 3: Code-Specific Reflection

Evaluation Strategies

Outcome-Based

LLM-as-Judge

Rubric-Based

🎯 Best For

💡 Use Cases

📖 How to Use This Skill

Install the Skill

Load into Your AI Assistant

Apply Agentic-Eval to Your Work

Review and Refine

❓ Frequently Asked Questions

Is Agentic-Eval compatible with Cursor and VS Code?

Do I need specific dependencies for Agentic-Eval?

How do I install Agentic-Eval?

Can I customize this skill for my team?

⚠️ Common Mistakes to Avoid

Skipping validation

Missing dependency updates

🔗 Related Skills