MR
Mayur Rathi
@github
⭐ 34.1k GitHub stars

Agentic-Eval

Agentic-Eval是一款code方向的AI技能,核心价值是|,可用于解决开发者在code领域的实际问题,帮助用户提升效率、自动化重复任务或优化工作流。

|

Last verified on: 2026-05-30
mkdir -p ./skills/agentic-eval && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/agentic-eval/SKILL.md -o ./skills/agentic-eval/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Agentic Evaluation Patterns


Patterns for self-improvement through iterative evaluation and refinement.


Overview


Evaluation patterns enable agents to assess and improve their own outputs, moving beyond single-shot generation to iterative refinement loops.


text
Generate → Evaluate → Critique → Refine → Output
    ↑                              │
    └──────────────────────────────┘

When to Use


- **Quality-critical generation**: Code, reports, analysis requiring high accuracy

- **Tasks with clear evaluation criteria**: Defined success metrics exist

- **Content requiring specific standards**: Style guides, compliance, formatting


---


Pattern 1: Basic Reflection


Agent evaluates and improves its own output through self-critique.


python
def reflect_and_refine(task: str, criteria: list[str], max_iterations: int = 3) -> str:
    """Generate with reflection loop."""
    output = llm(f"Complete this task:\n{task}")
    
    for i in range(max_iterations):
        # Self-critique
        critique = llm(f"""
        Evaluate this output against criteria: {criteria}
        Output: {output}
        Rate each: PASS/FAIL with feedback as JSON.
        """)
        
        critique_data = json.loads(critique)
        all_pass = all(c["status"] == "PASS" for c in critique_data.values())
        if all_pass:
            return output
        
        # Refine based on critique
        failed = {k: v["feedback"] for k, v in critique_data.items() if v["status"] == "FAIL"}
        output = llm(f"Improve to address: {failed}\nOriginal: {output}")
    
    return output

**Key insight**: Use structured JSON output for reliable parsing of critique results.


---


Pattern 2: Evaluator-Optimizer


Separate generation and evaluation into distinct components for clearer responsibilities.


python
class EvaluatorOptimizer:
    def __init__(self, score_threshold: float = 0.8):
        self.score_threshold = score_threshold
    
    def generate(self, task: str) -> str:
        return llm(f"Complete: {task}")
    
    def evaluate(self, output: str, task: str) -> dict:
        return json.loads(llm(f"""
        Evaluate output for task: {task}
        Output: {output}
        Return JSON: {{"overall_score": 0-1, "dimensions": {{"accuracy": ..., "clarity": ...}}}}
        """))
    
    def optimize(self, output: str, feedback: dict) -> str:
        return llm(f"Improve based on feedback: {feedback}\nOutput: {output}")
    
    def run(self, task: str, max_iterations: int = 3) -> str:
        output = self.generate(task)
        for _ in range(max_iterations):
            evaluation = self.evaluate(output, task)
            if evaluation["overall_score"] >= self.score_threshold:
                break
            output = self.optimize(output, evaluation)
        return output

---


Pattern 3: Code-Specific Reflection


Test-driven refinement loop for code generation.


python
class CodeReflector:
    def reflect_and_fix(self, spec: str, max_iterations: int = 3) -> str:
        code = llm(f"Write Python code for: {spec}")
        tests = llm(f"Generate pytest tests for: {spec}\nCode: {code}")
        
        for _ in range(max_iterations):
            result = run_tests(code, tests)
            if result["success"]:
                return code
            code = llm(f"Fix error: {result['error']}\nCode: {code}")
        return code

---


Evaluation Strategies


Outcome-Based

Evaluate whether output achieves the expected result.


python
def evaluate_outcome(task: str, output: str, expected: str) -> str:
    return llm(f"Does output achieve expected outcome? Task: {task}, Expected: {expected}, Output: {output}")

LLM-as-Judge

Use LLM to compare and rank outputs.


python
def llm_judge(output_a: str, output_b: str, criteria: str) -> str:
    return llm(f"Compare outputs A and B for {criteria}. Which is better and why?")

Rubric-Based

Score outputs against weighted dimensions.


python
RUB

🎯 Best For

  • Claude users
  • GitHub Copilot users
  • Software engineers
  • Development teams
  • Tech leads

💡 Use Cases

  • Code quality improvement
  • Best practice enforcement

📖 How to Use This Skill

  1. 1

    Install the Skill

    Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

  2. 2

    Load into Your AI Assistant

    Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.

  3. 3

    Apply Agentic-Eval to Your Work

    Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.

  4. 4

    Review and Refine

    Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Is Agentic-Eval compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for Agentic-Eval?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install Agentic-Eval?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/agentic-eval/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.

🔗 Related Skills