Mayur Rathi
@github
⭐ 34.1k GitHub stars

Scala-Spark

Scala-Spark is a coding AI skill that packages best practices for building Apache Spark applications in Scala, covering DataFrames, Datasets, Spark SQL, performance tuning, testing, and production deployment patterns. It is meant to be loaded into an AI coding assistant so that the Spark code it suggests follows these practices.


Last verified on: 2026-06-17
mkdir -p ./skills/scala-spark && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/scala-spark/SKILL.md -o ./skills/scala-spark/SKILL.md

Run in terminal / PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).

Skill Content

# Scala + Apache Spark Best Practices


Guidelines for writing efficient, maintainable, and production-ready Apache Spark applications in Scala.


Dependencies


SBT


scala
val sparkVersion = "3.5.1"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"   % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql"    % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib"  % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
)

Maven


xml
<properties>
    <spark.version>3.5.1</spark.version>
    <scala.binary.version>2.13</scala.binary.version>
</properties>

<dependencies>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-core_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-mllib_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-streaming_${scala.binary.version}</artifactId>
        <version>${spark.version}</version>
        <scope>provided</scope>
    </dependency>
</dependencies>

Mark Spark dependencies as `"provided"` since the cluster supplies them at runtime. Only bundle application-specific libraries in the fat JAR.
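The fat JAR mentioned above is commonly built with the sbt-assembly plugin. The fragment below is a minimal sketch under that assumption: the plugin version, the JAR name, and the merge strategy are illustrative choices, not part of the original guidance. Dependencies scoped as `"provided"` are left out of the assembled JAR by sbt-assembly.

```scala
// project/plugins.sbt
// Assumption: sbt-assembly is the packaging plugin; pick the release that matches your sbt version.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt
// Hypothetical JAR name, used here only for illustration.
assembly / assemblyJarName := "my-application-assembly.jar"

// Resolve duplicate files contributed by different dependencies.
assembly / assemblyMergeStrategy := {
  case PathList("META-INF", "services", _*) => MergeStrategy.concat  // keep service loader entries
  case PathList("META-INF", _*)             => MergeStrategy.discard // drop manifests and signatures
  case _                                    => MergeStrategy.first   // blunt default; refine per project
}
```

Running `sbt assembly` then produces a single JAR containing the application and its non-provided libraries.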


SparkSession Setup


Always use `SparkSession` as the single entry point:


scala
import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder()
  .appName("MyApplication")
  .config("spark.sql.shuffle.partitions", "200")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()

import spark.implicits._

- Do **not** create multiple `SparkSession` instances in the same JVM.

- Avoid hardcoding `master` in application code; set it at submit time via `--master`.
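The two rules above can be sketched as a single entry point that never calls `.master(...)`. The object name, the `buildSession` helper, and the JAR path in the comment are hypothetical; the sketch relies on Spark picking up `spark.master` from the configuration that `spark-submit` supplies.

```scala
import org.apache.spark.sql.SparkSession

object MyApplication {

  // No .master(...) call: the master URL is expected to come from spark-submit
  // (or from the spark.master system property in local runs and tests).
  def buildSession(appName: String): SparkSession =
    SparkSession.builder()
      .appName(appName)
      .getOrCreate() // returns the existing session if one is already active

  def main(args: Array[String]): Unit = {
    val spark = buildSession("MyApplication")
    try {
      // Application logic goes here.
      println(s"Running against master: ${spark.sparkContext.master}")
    } finally {
      spark.stop()
    }
  }
}

// Hypothetical submit command (class name and JAR path are placeholders):
//   spark-submit --master yarn --deploy-mode cluster \
//     --class MyApplication target/scala-2.13/my-application-assembly.jar
```

Keeping the master out of the code lets the same JAR run unchanged on a laptop (`--master "local[*]"`) and on a cluster.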


DataFrames vs Datasets vs RDDs


Prefer the **DataFrame API** (untyped `Dataset[Row]`) for most workloads. Use **Datasets** (typed) when compile-time type safety justifies the serialization overhead. Avoid raw **RDDs** unless you need low-level control.


scala
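import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.count
// Assumes `spark` is in scope and `import spark.implicits._` has run (needed for $"..." and .as[Event])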

// Preferred — DataFrame API
val df: DataFrame = spark.read.parquet("data/events")
val result = df
  .filter($"status" === "active")
  .groupBy($"region")
  .agg(count("*").as("total"))

// Typed Dataset — use when schema safety matters
case class Event(id: Long, status: String, region: String)
val ds: Dataset[Event] = df.as[Event]
val active = ds.filter(_.status == "active")
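One place the typed-API overhead shows up is in lambda filters. The sketch below contrasts the two styles on the same `Dataset[Event]`; the object and method names are hypothetical, and `Event` is redeclared only to keep the block self-contained.

```scala
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.functions.col

final case class Event(id: Long, status: String, region: String)

object TypedVsUntypedFilter {

  // Lambda filter: each row is deserialized into an Event, and the predicate
  // is an opaque function that Catalyst cannot inspect or push down.
  def activeTyped(ds: Dataset[Event]): Dataset[Event] =
    ds.filter(_.status == "active")

  // Column filter: stays a Catalyst expression, yet the result is still a Dataset[Event].
  def activeUntyped(ds: Dataset[Event]): Dataset[Event] =
    ds.filter(col("status") === "active")
}
```

Both methods return the same rows; comparing `explain()` output for the two is a quick way to see the difference in the physical plan.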

Schema Management


Always define schemas explicitly when reading semi-structured data instead of relying on schema inference:


scala
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true),
  StructField("timestamp", TimestampType, nullable = false),
  StructField("amount", DecimalType(18, 2), nullable = true),
  StructField("tags", ArrayType(StringType), nullable = true)
))

val df = spark.read
  .schema(schema)
  .json("data/events/*.json")

- Schema inference (the default for JSON, `inferSchema=true` for CSV) requires an extra pass over the input and is expensive for large files.

- For Parquet and Delta, the schema is embedded — explicit definition is unnecessary.
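The same rule applies to CSV input. As a minimal sketch, the schema can also be written as a DDL string and parsed with `StructType.fromDDL`; the object name, column layout, and reader options below are illustrative assumptions rather than part of the original guidance.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.StructType

object CsvSchemaExample {

  // Hypothetical column layout, mirroring part of the schema above.
  // Fields parsed from DDL are nullable by default.
  val ordersSchema: StructType =
    StructType.fromDDL("id BIGINT, name STRING, amount DECIMAL(18,2)")

  def readOrders(spark: SparkSession, path: String): DataFrame =
    spark.read
      .schema(ordersSchema)        // explicit schema, so no inference pass is needed
      .option("header", "true")    // first line holds column names
      .option("mode", "FAILFAST")  // fail on malformed rows instead of silently nulling fields
      .csv(path)
}
```

A call such as `CsvSchemaExample.readOrders(spark, "data/orders/*.csv")` (path hypothetical) returns a DataFrame with exactly these three columns.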


Column Expressions


Prefer `col()` or `$""` over string column names in transformations for early error detection:


scala
import org.apache.spark.sql.functions._

// Good: Column references compose into expressions
df.select(col("name"), ($"amount" * 1.1).as("adjusted_amount"))
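Because `Column` values are ordinary Scala objects, expressions like the one above can be named once and reused. The following is a minimal sketch of that idea; the object name, the `size_bucket` column, and the 1000 threshold are hypothetical.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.{col, lit, when}

object ColumnExpressions {

  // Reusable Column expressions; the 10% uplift and the threshold are illustrative values.
  val adjustedAmount: Column =
    (col("amount") * 1.1).as("adjusted_amount")

  val sizeBucket: Column =
    when(col("amount") >= 1000, lit("large"))
      .otherwise(lit("small"))
      .as("size_bucket")

  // Projects the name column plus the two derived columns.
  def enrich(df: DataFrame): DataFrame =
    df.select(col("name"), adjustedAmount, sizeBucket)
}
```

Calling `ColumnExpressions.enrich(df)` on a DataFrame without an `amount` column fails with an `AnalysisException` when the plan is analyzed, before any job runs.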

🎯 Best For

  • QA engineers
  • Developers writing unit tests
  • UI designers
  • Product designers
  • Claude users

💡 Use Cases

  • Generating test cases for edge conditions
  • Writing integration test suites
  • Generating component mockups
  • Creating design system tokens

📖 How to Use This Skill

  1. Install the Skill

    Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.

  2. Load into Your AI Assistant

    Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.

  3. Apply Scala-Spark to Your Work

    Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.

  4. Review and Refine

    Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.

❓ Frequently Asked Questions

Does this generate test mocks?

Many testing skills include mock generation. Check the install command and skill content for details.

Does this work with Figma?

Some design skills integrate with Figma plugins. Check the Works With section for supported tools.

Is Scala-Spark compatible with Cursor and VS Code?

Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.

Do I need specific dependencies for Scala-Spark?

Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.

How do I install Scala-Spark?

Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/scala-spark/SKILL.md, ready to use.

⚠️ Common Mistakes to Avoid

Not testing edge cases

AI tends to generate happy-path tests. Manually review for boundary conditions.

Skipping usability testing

AI-generated designs should be validated with real users before development.

Skipping validation

Always test AI-generated code changes, even for simple refactors.

Missing dependency updates

Check if the skill requires updated dependencies or new packages.
