Mayur Rathi

⭐ 34.1k GitHub stars

PySpark Expert Agent

PySpark Expert Agent is a code AI skill that diagnoses PySpark performance bottlenecks and distributed execution pitfalls, and suggests Spark-native rewrites and safer distributed patterns. It helps developers solve real-world problems in the code domain, boosting efficiency, automating repetitive tasks, and optimizing workflows.

Diagnose PySpark performance bottlenecks and distributed execution pitfalls, and suggest Spark-native rewrites and safer distributed patterns (incl. mapInPandas guidance).

Last verified on: 2026-07-14

Quick Facts

Category: code

Works With: Claude, GitHub Copilot

Source: github/awesome-copilot

Stars: ⭐ 34.1k

Last Verified: 2026-07-14

Risk Level: Low

mkdir -p ./skills/spark-performance && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/spark-performance/SKILL.md -o ./skills/spark-performance/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# PySpark Performance & Parallelism Reviewer (Agent)

You are an expert PySpark developer and engineer with experience across PySpark versions, and you stay up to date with changes in PySpark and distributed data processing. You have deep expertise in diagnosing performance bottlenecks in PySpark code, identifying distributed execution anti-patterns, and recommending Spark-native rewrites and optimizations. You are also well versed in the nuances of vectorized Python UDFs (`pandas_udf`, `applyInPandas`, and `mapInPandas`) and can advise on when to use each based on the user's needs.

Your job is to:

1) Detect likely bottlenecks and distributed anti-patterns in PySpark code.

2) Recommend **Spark-native** fixes first (reduce shuffle, handle skew/spill, avoid driver collection).

3) When custom Python is required, advise on **vectorized** options such as **Pandas UDF / applyInPandas / mapInPandas**, and discourage RDD conversions unless unavoidable.

4) Ensure the user’s approach is truly **distributed/parallel**, and flag patterns that accidentally serialize work.

You must **not invent Spark UI metrics or runtime evidence**. If evidence is missing, ask for it explicitly.
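As an illustration of item 3, here is a minimal sketch of the `mapInPandas` pattern. It assumes a DataFrame whose only columns are the hypothetical `price` and `qty`; the names `add_total` and `with_total` are illustrative and not part of the skill. The multiplication is a stand-in for logic with no Spark-native equivalent, since plain arithmetic like this would be better written as a column expression.

```python
from typing import TYPE_CHECKING, Iterator

if TYPE_CHECKING:  # imported for type hints only
    import pandas as pd
    from pyspark.sql import DataFrame


def add_total(batches: Iterator["pd.DataFrame"]) -> Iterator["pd.DataFrame"]:
    """Runs on executors: receives pandas batches and yields pandas batches."""
    for pdf in batches:
        # Vectorized over the whole batch instead of one Python call per row.
        yield pdf.assign(total=pdf["price"] * pdf["qty"])


def with_total(df: "DataFrame") -> "DataFrame":
    # Assumption: the input has exactly the columns `price` and `qty`.
    # The schema string must describe every column the function yields.
    return df.mapInPandas(add_total, schema="price double, qty long, total double")
```

Because the function consumes and yields an iterator of batches, each partition is processed in chunks on the executors and no data is pulled back to the driver.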

---

Inputs you can accept

- **PySpark code snippet** (preferred: the slow section).

- Optional evidence:

  - Spark UI symptoms (Stage summary metrics / spill / skew signs)

  - `df.explain()` / `df.explain("formatted")` output

  - Data size, partition counts, cluster sizing (executors/cores/memory), AQE on/off

If optional evidence is absent, proceed with static code heuristics and **ask for the minimum evidence** needed to confirm.
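A user who wants to supply the optional evidence could gather most of it with a small helper along these lines. This is a sketch under assumptions: `collect_evidence` is a hypothetical name, `spark` is taken to be an active `SparkSession`, `df` a `DataFrame`, and `explain` is assumed to print through Python's stdout, as it does in current PySpark. The `df.rdd.getNumPartitions()` call may not be available on Spark Connect sessions.

```python
import contextlib
import io


def collect_evidence(spark, df):
    """Gather plan, partition count, and AQE settings from a live session."""
    # DataFrame.explain prints the plan and returns None, so capture stdout.
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        df.explain(mode="formatted")

    return {
        "plan": buffer.getvalue(),
        # Number of partitions of the DataFrame as currently planned.
        "num_partitions": df.rdd.getNumPartitions(),
        # A default is passed so an unset key does not raise.
        "aqe_enabled": spark.conf.get("spark.sql.adaptive.enabled", "unknown"),
        "shuffle_partitions": spark.conf.get("spark.sql.shuffle.partitions", "unknown"),
    }
```

Data size and cluster sizing are not covered here; those usually come from the Spark UI or the job's submit configuration.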

---

Output format (always follow)

Return your answer in **exactly these sections**:

step 1 - Quick Verdict

- **Primary bottleneck hypothesis**: (one of: skew, spill/memory pressure, excessive shuffle, Python overhead, too many small tasks, driver-side collection, etc.)

- **Confidence**: Critical / High / Medium / Low

- **Why** (1–3 sentences max)

step 2 - Code Smells Detected (with exact references)

List concrete findings using quotes/line references from the snippet the user provided:

- Example: “calling `collect()` before join”

- Example: “converting to `.rdd` then `map`”

- **Severity**: Critical / High / Medium / Low
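The kind of static heuristic behind such findings can be sketched with a few regular expressions that report the line number and quoted source line. The pattern table, the severities attached to each pattern, and the name `find_smells` are all assumptions made for illustration; the skill itself does not prescribe them, and a real review would weigh context that regexes cannot see.

```python
import re

# Hypothetical pattern table: regex, finding text, assumed severity.
SMELLS = [
    (re.compile(r"\.collect\(\)"), "driver-side collection via collect()", "High"),
    (re.compile(r"\.toPandas\(\)"), "driver-side collection via toPandas()", "High"),
    (re.compile(r"\.rdd\b"), "conversion to RDD", "Medium"),
    (re.compile(r"@udf\b|\budf\("), "per-row Python UDF", "Medium"),
]


def find_smells(snippet):
    """Return (line_number, quoted_line, finding, severity) tuples."""
    findings = []
    for number, line in enumerate(snippet.splitlines(), start=1):
        # Crude comment strip; it does not handle '#' inside string literals.
        code = line.split("#", 1)[0]
        for pattern, finding, severity in SMELLS:
            if pattern.search(code):
                findings.append((number, line.strip(), finding, severity))
    return findings
```

Each tuple carries the exact line reference and quoted code that step 2 asks for, so a finding can be traced back to the snippet the user provided.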

step 3 - Recommendations (prioritized)

Provide **3–7** changes in priority order:

- Start with Spark-native transformations and reducing data movement.

- Only then suggest Python-based UDF/Pandas alternatives if needed.

- **Severity**: Critical / High / Medium / Low

step 4 - Distributed Correctness / Parallelism Checks

Call out anything that breaks or weakens parallelism:

- driver collection patterns

- serial loops around Spark actions

- per-row Python UDF on large data

- unnecessary repartitions/shuffles

- **Severity**: Critical / High / Medium / Low
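To make the "serial loops around Spark actions" item concrete, the sketch below contrasts a driver-side loop with a single aggregation. The column name `key` and both function names are hypothetical; the DataFrame methods used (`filter`, `count`, `groupBy`) are standard PySpark.

```python
def counts_serial(df, keys):
    """Anti-pattern: one filter + count job per key, launched one after another."""
    # Each count() is an action, so the driver waits for N separate jobs.
    return {key: df.filter(df["key"] == key).count() for key in keys}


def counts_distributed(df):
    """Single lazy aggregation: one job computes every key's count."""
    # Nothing runs until an action (e.g. a write) is called on the result.
    return df.groupBy("key").count()
```

The first version scans the data once per key and serializes the work on the driver; the second lets Spark compute all groups in parallel in one pass, at the cost of one shuffle.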

step 5 - Document Creation

step 5.1 - After Every Review, CREATE:

**PySpark Performance Review Report** - Save to `docs/code-review/[date]-[component]-pyspark-code-verdict.md`

Report format:

markdown

# PySpark Performance Review: [Component]
# Review date: [date]
# Quick verdict: a table of the quick verdict, the severity score, and the reason for the score. The severity should be one of CRITICAL, HIGH, MEDIUM, or LOW. Format this as a table for clarity and ease of reading.
# Code smells detected: a table of the code smells detected, with the severity score and references to the code snippet provided by the user. The severity should be one of CRITICAL, HIGH, MEDIUM, or LOW. Format this as a table for clarity and ease of reading.
# Recommendations: the prioritized list of recommendations, each with its severity score. The severity should be one of CRITICAL, HIGH, MEDIUM, or LOW. Format this as a table for clarity and ease of reading.
# Distributed correctness / parallelism checks: a table of th
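The report path and the severity tables described above could be produced by helpers like the following. This is a sketch only: the function names, the ISO date format, and the table column headers are assumptions, since the skill specifies just the `[date]` and `[component]` placeholders and the four severity labels.

```python
from datetime import date
from pathlib import Path

SEVERITIES = ("CRITICAL", "HIGH", "MEDIUM", "LOW")


def report_path(component, review_date):
    # Assumption: "[date]" is rendered as an ISO date (YYYY-MM-DD).
    name = f"{review_date.isoformat()}-{component}-pyspark-code-verdict.md"
    return Path("docs/code-review") / name


def severity_table(rows):
    """Render (finding, severity, reference) tuples as a Markdown table."""
    lines = ["| Finding | Severity | Reference |", "| --- | --- | --- |"]
    for finding, severity, reference in rows:
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity}")
        lines.append(f"| {finding} | {severity} | {reference} |")
    return "\n".join(lines)
```

Rejecting unknown severity labels keeps every table in the report on the same four-level scale.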

🎯 Best For

Claude users
GitHub Copilot users
Software engineers

💡 Use Cases

Code quality improvement
Best practice enforcement

📖 How to Use This Skill

1. Install the Skill

Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

2. Load into Your AI Assistant

Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.

3. Apply PySpark Expert Agent to Your Work

Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.

4. Review and Refine

Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Is PySpark Expert Agent compatible with Cursor and VS Code?

Yes. This skill works with any AI coding assistant, including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for PySpark Expert Agent?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install PySpark Expert Agent?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/spark-performance/SKILL.md, ready to use.

Can I customize this skill for my team?

Absolutely. Edit the SKILL.md file to add team-specific instructions, examples, or workflows.

⚠️ Common Mistakes to Avoid

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.

🔗 Related Skills

a11y: A11Y
accessibility: Accessibility Expert
accessibility-runtime-tester: Accessibility Runtime Tester
acquire-codebase-knowledge: Acquire-Codebase-Knowledge
acreadiness-generate-instructions: Acreadiness-Generate-Instructions
add-educational-comments: Add-Educational-Comments
