Mayur Rathi

⭐ 34.1k GitHub stars

Arize-Evaluator

Arize-Evaluator is a code AI skill that handles LLM-as-judge evaluation workflows on Arize, including creating and updating evaluators, running evaluations on spans or experiments, managing tasks, trigger-run operations, column mapping, and cont… It helps developers automate repetitive evaluation tasks and streamline their workflows.

Last verified on: 2026-07-14

Quick Facts

Category: code
Works With: Claude, GitHub Copilot
Source: github/awesome-copilot
Stars: ⭐ 34.1k
Last Verified: 2026-07-14
Risk Level: Low

mkdir -p ./skills/arize-evaluator && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/arize-evaluator/SKILL.md -o ./skills/arize-evaluator/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Arize Evaluator Skill

> **`SPACE`** — All `--space` flags and the `ARIZE_SPACE` env var accept a space **name** (e.g., `my-workspace`) or a base64 space **ID** (e.g., `U3BhY2U6...`). Find yours with `ax spaces list`.

This skill covers designing, creating, and running **LLM-as-judge evaluators** on Arize. An evaluator defines the judge; a **task** is how you run it against real data.

---

## Prerequisites

Proceed directly with the task — run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront.

If an `ax` command fails, troubleshoot based on the error:

- `command not found` or version error → see references/ax-setup.md

- `401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong, follow references/ax-profiles.md to create/update it. If the user doesn't have their key, direct them to https://app.arize.com/admin > API Keys

- Space unknown → run `ax spaces list` to pick by name, or ask the user

- LLM provider call fails (missing OPENAI_API_KEY / ANTHROPIC_API_KEY) → run `ax ai-integrations list --space SPACE` to check for platform-managed credentials. If none exist, ask the user to provide the key or create an integration via the **arize-ai-provider-integration** skill

- **Security:** Never read `.env` files or search the filesystem for credentials. Use `ax profiles` for Arize credentials and `ax ai-integrations` for LLM provider keys. If credentials are not available through these channels, ask the user.

- **CRITICAL — Never fabricate evaluation results:** If an evaluation task fails, is cancelled, or produces no scores, report the failure clearly and explain what went wrong. Do NOT perform a "manual evaluation," invent quality scores, estimate percentages, or present any agent-generated analysis as if it came from the Arize evaluation system. Instead suggest: (1) fix the identified issue and retry, (2) try running from the Arize UI, (3) verify integration credentials with `ax ai-integrations list`, (4) contact support at https://arize.com/support
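
The troubleshooting rules above amount to a lookup from error symptom to next step. The Python sketch below is one way to picture that lookup. The helper name `next_step` and the substring patterns are illustrative assumptions, not part of the `ax` CLI, and real `ax` error messages may be worded differently.

```python
# Hypothetical helper: maps an `ax` error message to the next step listed above.
# The substring patterns are assumptions; real `ax` output may differ.
RULES = [
    (("command not found", "version"), "see references/ax-setup.md"),
    (("401", "unauthorized", "api key"),
     "run `ax profiles show`, then follow references/ax-profiles.md"),
    (("openai_api_key", "anthropic_api_key"),
     "run `ax ai-integrations list --space SPACE`"),
    (("space",), "run `ax spaces list` or ask the user"),
]

# Fallback mirrors the "never fabricate evaluation results" rule.
FALLBACK = "report the failure to the user; do not invent results"


def next_step(error_text):
    """Return the first matching troubleshooting step, or the fallback."""
    lowered = error_text.lower()
    for patterns, step in RULES:
        if any(pattern in lowered for pattern in patterns):
            return step
    return FALLBACK


print(next_step("ax: command not found"))
```

The fallback branch matters most: when nothing matches, the sketch reports the failure instead of guessing at a result.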

---

## Concepts

### What is an Evaluator?

An **evaluator** is an LLM-as-judge definition. It contains:

| Field | Description |
|-------|-------------|
| **Template** | The judge prompt. Uses `{variable}` placeholders (e.g. `{input}`, `{output}`, `{context}`) that get filled in at run time via a task's column mappings. |
| **Classification choices** | The set of allowed output labels (e.g. `factual` / `hallucinated`). Binary is the default and most common. Each choice can optionally carry a numeric score. |
| **AI Integration** | Stored LLM provider credentials (OpenAI, Anthropic, Bedrock, etc.) the evaluator uses to call the judge model. |
| **Model** | The specific judge model (e.g. `gpt-4o`, `claude-sonnet-4-5`). |
| **Invocation params** | Optional JSON of model settings like `{"temperature": 0}`. Low temperature is recommended for reproducibility. |
| **Optimization direction** | Whether higher scores are better (`maximize`) or worse (`minimize`). Sets how the UI renders trends. |
| **Data granularity** | Whether the evaluator runs at the **span**, **trace**, or **session** level. Most evaluators run at the span level. |
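
To make the template and classification-choice fields concrete, here is a minimal Python sketch of filling `{variable}` placeholders and converting a judge label into a score. It is not Arize's implementation: the template wording, the score values, and the helper names `render` and `score` are assumptions chosen for illustration.

```python
# Illustrative only: names mirror the table above, not a real Arize API.
TEMPLATE = (
    "Question: {input}\n"
    "Answer: {output}\n"
    "Context: {context}\n"
    "Is the answer factual or hallucinated?"
)

# Each allowed label optionally carries a numeric score (values assumed here).
CHOICES = {"factual": 1.0, "hallucinated": 0.0}


def render(template, values):
    """Fill {variable} placeholders; raises KeyError if a variable is missing."""
    return template.format(**values)


def score(label):
    """Reject any judge output that is not one of the allowed labels."""
    if label not in CHOICES:
        raise ValueError(f"unexpected label: {label!r}")
    return CHOICES[label]


prompt = render(TEMPLATE, {"input": "Q", "output": "A", "context": "C"})
print(prompt)
print(score("factual"))
```

Note that `str.format` treats every brace as a placeholder, so a template containing literal JSON would need escaped braces in this sketch.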

Evaluators are **versioned** — every prompt or model change creates a new immutable version. The most recent version is active.
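
The versioning rule can be pictured as an append-only list of frozen records. The sketch below assumes hypothetical `Evaluator` and `EvaluatorVersion` classes; it models the behavior described above and does not reflect how Arize stores versions.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)  # frozen: a version cannot be edited after creation
class EvaluatorVersion:
    number: int
    template: str
    model: str


@dataclass
class Evaluator:
    name: str
    versions: list = field(default_factory=list)

    def update(self, template, model):
        """Any prompt or model change appends a new version; old ones stay."""
        version = EvaluatorVersion(len(self.versions) + 1, template, model)
        self.versions.append(version)
        return version

    @property
    def active(self):
        """The most recent version is the active one."""
        return self.versions[-1]


evaluator = Evaluator("hallucination")
evaluator.update("Is {output} factual?", "gpt-4o")
evaluator.update("Is {output} factual given {context}?", "gpt-4o")
print(evaluator.active.number)
```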

### What is a Task?

A **task** is how you run one or more evaluators against real data. Tasks are attached to a **project** (live traces/spans) or a **dataset** (experiment runs). A task contains:

| Field | Description |
|-------|-------------|
| **Evaluators** | List of evaluators to run. You can run multiple in one task. |
| **Column mappings** | Maps each evaluator's template variables to actual field paths on spans or experiment runs (e.g. `"input" → "attributes.input.value"`). This is what makes evaluators portable across projects and experiments. |
| **Query filter** | SQL-style expression to select which spa
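
As a rough illustration of column mappings, the Python sketch below resolves dotted field paths on a span-like record to produce the values a template needs. The nested-dict span layout, the `attributes.output.value` path, the sample text, and the helper names are all assumptions; only `attributes.input.value` comes from the description above.

```python
def resolve(record, path):
    """Walk a dotted path such as 'attributes.input.value' through nested dicts."""
    value = record
    for key in path.split("."):
        value = value[key]  # raises KeyError if the path does not exist
    return value


def build_template_values(record, column_mappings):
    """Map each template variable to the value found at its field path."""
    return {var: resolve(record, path) for var, path in column_mappings.items()}


# Hypothetical span; real spans may store attributes differently.
span = {
    "attributes": {
        "input": {"value": "example question"},
        "output": {"value": "example answer"},
    }
}

# Template variable -> field path ("output" path is assumed by analogy).
mappings = {
    "input": "attributes.input.value",
    "output": "attributes.output.value",
}

values = build_template_values(span, mappings)
print(values)
```

Because the template only refers to variable names, pointing the same evaluator at a different project or experiment means changing the mapping, not the prompt.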

🎯 Best For

Claude users
GitHub Copilot users
Software engineers
Development teams
Tech leads

💡 Use Cases

Code quality improvement
Best practice enforcement

📖 How to Use This Skill

1. Install the Skill
   Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

2. Load into Your AI Assistant
   Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.

3. Apply Arize-Evaluator to Your Work
   Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.

4. Review and Refine
   Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Is Arize-Evaluator compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for Arize-Evaluator?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install Arize-Evaluator?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/arize-evaluator/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.

🔗 Related Skills

A11Y (a11y)
Accessibility Expert (accessibility)
Accessibility Runtime Tester (accessibility-runtime-tester)
Acquire-Codebase-Knowledge (acquire-codebase-knowledge)
Acreadiness-Generate-Instructions (acreadiness-generate-instructions)
Add-Educational-Comments (add-educational-comments)