Mayur Rathi

⭐ 34.1k GitHub stars

Scala-Spark

Scala-Spark is a code AI skill that collects best practices for building Apache Spark applications in Scala, covering DataFrames, Datasets, SparkSQL, performance tuning, testing, and production deployment patterns.

Last verified on: 2026-08-01

Quick Facts

Category: code
Works With: Claude, GitHub Copilot
Source: github/awesome-copilot
Stars: ⭐ 34.1k
Last Verified: 2026-08-01
Risk Level: Low

mkdir -p ./skills/scala-spark && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/scala-spark/SKILL.md -o ./skills/scala-spark/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Scala + Apache Spark Best Practices

Guidelines for writing efficient, maintainable, and production-ready Apache Spark applications in Scala.

Dependencies

SBT

scala

val sparkVersion = "3.5.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"    % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
)

Maven

xml

<properties>
    <spark.version>3.5.1</spark.version>
    <scala.binary.version>2.13</scala.binary.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

Mark Spark dependencies as `"provided"` since the cluster supplies them at runtime. Only bundle application-specific libraries in the fat JAR.
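
One way to produce such a fat JAR is the sbt-assembly plugin, which leaves `provided` dependencies out of the assembled artifact by default. The fragment below is a sketch for `build.sbt`, assuming sbt 1.x with sbt-assembly already added in `project/plugins.sbt`. The JAR name and the merge rules are illustrative assumptions, not part of the skill, and should be adapted to the libraries you actually bundle.

```scala
// build.sbt (sketch): assumes the sbt-assembly plugin is enabled for this build.

// Hypothetical artifact name for the fat JAR.
assembly / assemblyJarName := "my-application-assembly.jar"

// Resolve duplicate files that appear when several dependencies are merged.
assembly / assemblyMergeStrategy := {
  // Keep service-loader registrations from all JARs.
  case PathList("META-INF", "services", _*) => MergeStrategy.concat
  // Drop remaining metadata such as manifests and signature files.
  case PathList("META-INF", _*)             => MergeStrategy.discard
  // For any other conflict, keep the first copy found on the classpath.
  case _                                    => MergeStrategy.first
}
```

Running `sbt assembly` with this configuration should yield a single JAR that contains application code and non-Spark libraries only.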

SparkSession Setup

Always use `SparkSession` as the single entry point:

scala

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .appName("MyApplication")
  .config("spark.sql.shuffle.partitions", "200")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

import spark.implicits._

- Do **not** create multiple `SparkSession` instances in the same JVM.

- Avoid hardcoding `master` in application code; set it at submit time via `--master`.
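
The two rules above can be sketched as a small factory that never sets a master on its own unless explicitly asked to. `SessionFactory` and its `localFallback` flag are hypothetical names introduced for illustration. The sketch relies on `spark-submit --master ...` arriving as the `spark.master` property, which `new SparkConf()` picks up from system properties.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SessionFactory {

  /** Builds (or reuses) the JVM-wide SparkSession.
    *
    * @param appName       name shown in the Spark UI
    * @param localFallback hypothetical flag: when true and no master was
    *                      supplied at submit time, fall back to local mode
    *                      (intended for tests and local experiments only)
    */
  def create(appName: String, localFallback: Boolean = false): SparkSession = {
    // SparkConf loads spark.* system properties, including spark.master.
    val conf = new SparkConf().setAppName(appName)

    if (localFallback && !conf.contains("spark.master")) {
      conf.setMaster("local[*]")
    }

    // getOrCreate returns the existing session if one is already active.
    SparkSession.builder().config(conf).getOrCreate()
  }
}
```

Production entry points would call `SessionFactory.create("MyApplication")` and leave the master to `spark-submit`, while a test suite could pass `localFallback = true`.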

DataFrames vs Datasets vs RDDs

Prefer the **DataFrame API** (untyped `Dataset[Row]`) for most workloads. Use **Datasets** (typed) when compile-time type safety justifies the serialization overhead. Avoid raw **RDDs** unless you need low-level control.

scala

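import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.count
// Requires `import spark.implicits._` (see SparkSession Setup) for $"..." and .as[Event]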

// Preferred — DataFrame API
val df: DataFrame = spark.read.parquet("data/events")
val result = df
  .filter($"status" === "active")
  .groupBy($"region")
  .agg(count("*").as("total"))

// Typed Dataset — use when schema safety matters
case class Event(id: Long, status: String, region: String)
val ds: Dataset[Event] = df.as[Event]
val active = ds.filter(_.status == "active")
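
To make the trade-off concrete, here is a minimal sketch that runs the same kind of query both ways. The `DataFrameVsDataset` object and its method names are assumptions made for illustration. The column-based version stays fully visible to the Catalyst optimizer, whereas the lambda-based version has to deserialize each row into an `Event` before the predicate can run.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.count

final case class Event(id: Long, status: String, region: String)

object DataFrameVsDataset {

  /** Untyped path: column expressions that the optimizer can inspect. */
  def activeCountsByRegion(events: Dataset[Event]): Map[String, Long] = {
    import events.sparkSession.implicits._
    events.toDF()
      .filter($"status" === "active")
      .groupBy($"region")
      .agg(count("*").as("total"))
      .as[(String, Long)]
      .collect()
      .toMap
  }

  /** Typed path: compile-time checked field access inside lambdas. */
  def activeIds(events: Dataset[Event]): Seq[Long] = {
    import events.sparkSession.implicits._
    events
      .filter(_.status == "active") // opaque to the optimizer
      .map(_.id)
      .collect()
      .toSeq
      .sorted
  }
}
```

A typo such as `_.statsu` fails at compile time in the typed path, while a misspelled column name in the untyped path only surfaces as an `AnalysisException` at runtime.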

Schema Management

Always define schemas explicitly when reading semi-structured data instead of relying on schema inference:

scala

import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("timestamp", TimestampType, nullable = false),
  StructField("amount", DecimalType(18, 2), nullable = true),
  StructField("tags", ArrayType(StringType), nullable = true)
))

val df = spark.read
  .schema(schema)
  .json("data/events/*.json")

- Schema inference (`inferSchema=true` for CSV, or the default behavior for JSON when no schema is supplied) needs an extra pass over the input and is expensive for large files.

- For Parquet and Delta, the schema is embedded — explicit definition is unnecessary.
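
Hand-writing a `StructType` is not the only way to get an explicit schema. As a sketch, the schema can also be derived from a case class or parsed from a DDL string. The `Payment` case class and the `Schemas` object below are hypothetical names used only for this example.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical record type used only for this illustration.
final case class Payment(id: Long, name: String, tags: Seq[String])

object Schemas {

  // Derived from the case class: primitive fields such as Long become
  // non-nullable, reference types such as String become nullable.
  val fromCaseClass: StructType = Encoders.product[Payment].schema

  // Parsed from a DDL string: compact, and easy to keep in configuration.
  val fromDdl: StructType =
    StructType.fromDDL("id BIGINT, name STRING, tags ARRAY<STRING>")
}
```

Either value can be passed to `spark.read.schema(...)` in place of the hand-built `StructType`, which keeps the schema and the typed model from drifting apart.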

Column Expressions

Prefer `col()` or `$""` over string column names in transformations for early error detection:

scala

import org.apache.spark.sql.functions._

// Good: explicit Column references that compose into expressions
df.select(col("name"), ($"amount" * 1.1).as("adjusted_amount"))
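
Because `col()` and `$""` produce `Column` values, expressions can be factored into small reusable functions. The following is a sketch under that assumption. `ColumnExprs`, `adjusted`, `withTier`, and the 1000 threshold are hypothetical and not part of the skill.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, when}

object ColumnExprs {

  /** Reusable expression: scales any numeric column by a rate. */
  def adjusted(amount: Column, rate: Double): Column =
    amount * rate

  /** Adds a "tier" column derived from "amount" (threshold is illustrative). */
  def withTier(df: DataFrame): DataFrame =
    df.withColumn(
      "tier",
      when(col("amount") >= 1000, "high").otherwise("standard")
    )
}
```

A call such as `df.select(col("name"), ColumnExprs.adjusted(col("amount"), 1.1).as("adjusted_amount"))` then reads much like the inline version while keeping the expression in one place.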

🎯 Best For

QA engineers
Developers writing unit tests
UI designers
Product designers
Claude users

💡 Use Cases

Generating test cases for edge conditions
Writing integration test suites
Generating component mockups
Creating design system tokens

📖 How to Use This Skill

1. Install the Skill

Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

2. Load into Your AI Assistant

Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.

3. Apply Scala-Spark to Your Work

Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.

4. Review and Refine

Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Does this generate test mocks?

Many testing skills include mock generation. Check the install command and skill content for details.

Does this work with Figma?

Some design skills integrate with Figma plugins. Check the Works With section for supported tools.

Is Scala-Spark compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for Scala-Spark?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install Scala-Spark?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/scala-spark/SKILL.md, ready to use.

⚠️ Common Mistakes to Avoid

Not testing edge cases

AI tends to generate happy-path tests. Manually review for boundary conditions.

Skipping usability testing

AI-generated designs should be validated with real users before development.

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.

🔗 Related Skills

accessibility-runtime-tester: Accessibility Runtime Tester
amplitude-experiment-implementation: Amplitude Experiment Implementation
apify-integration-expert: Apify-Integration-Expert
dataverse-python-testing-debugging: Dataverse-Python-Testing-Debugging
declarative-agents-architect: Declarative Agents Architect
vuejs-expert: Expert Vue.js Frontend Engineer
