Scala-Spark
Scala-Spark is an AI coding skill that captures best practices for building Apache Spark applications in Scala, covering DataFrames, Datasets, Spark SQL, performance tuning, testing, and production deployment patterns. It
is meant to help developers working on Spark codebases automate repetitive tasks
and streamline their workflows.
mkdir -p ./skills/scala-spark && curl -sfL https://raw.githubusercontent.com/github/awesome-copilot/main/skills/scala-spark/SKILL.md -o ./skills/scala-spark/SKILL.md
Run in a terminal or PowerShell. Requires curl (Unix) or PowerShell 5+ (Windows).
Skill Content
# Scala + Apache Spark Best Practices
Guidelines for writing efficient, maintainable, and production-ready Apache Spark applications in Scala.
Dependencies
SBT
val sparkVersion = "3.5.1"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-sql" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided",
  "org.apache.spark" %% "spark-streaming" % sparkVersion % "provided"
)
Maven
<properties>
<spark.version>3.5.1</spark.version>
<scala.binary.version>2.13</scala.binary.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-mllib_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_${scala.binary.version}</artifactId>
<version>${spark.version}</version>
<scope>provided</scope>
</dependency>
</dependencies>
Mark Spark dependencies as `"provided"` since the cluster supplies them at runtime. Only bundle application-specific libraries in the fat JAR.
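One possible way to build the fat JAR with SBT is the sbt-assembly plugin. The fragment below is a minimal sketch, not part of the original skill: the plugin version, the JAR name, and the merge rules are assumptions to adapt to your build. Dependencies scoped as `"provided"` should be left out of the assembled JAR.

```scala
// project/plugins.sbt
// Plugin version is an assumption; check the current sbt-assembly release.
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.5")

// build.sbt
// Hypothetical output name for the fat JAR.
assembly / assemblyJarName := "my-application-assembly.jar"

// Resolve duplicate files that appear when application libraries are merged.
assembly / assemblyMergeStrategy := {
  // Keep service-loader registrations from every library.
  case PathList("META-INF", "services", xs @ _*) => MergeStrategy.concat
  // Drop the remaining manifests and signature files.
  case PathList("META-INF", xs @ _*) => MergeStrategy.discard
  // For anything else, take the first copy found on the classpath.
  case _ => MergeStrategy.first
}
```

Running `sbt assembly` would then produce a single JAR to pass to `spark-submit`. The `MergeStrategy.first` fallback is a blunt default; tighten it if two libraries ship conflicting resources.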
SparkSession Setup
Always use `SparkSession` as the single entry point:
import org.apache.spark.sql.SparkSession
val spark: SparkSession = SparkSession.builder()
  .appName("MyApplication")
  .config("spark.sql.shuffle.partitions", "200")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()
import spark.implicits._

- Do **not** create multiple `SparkSession` instances in the same JVM.
- Avoid hardcoding `master` in application code; set it at submit time via `--master`.
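
As an illustration of the last point, here is a minimal sketch of an entry point that never calls `.master(...)`. The object name, JAR name, and `spark-submit` arguments are hypothetical; the master URL is expected to come from `spark-submit --master` (or from the `spark.master` property).

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical application entry point.
object MyApplication {

  // Builds (or reuses) the session without naming a master in code.
  def buildSession(): SparkSession =
    SparkSession.builder()
      .appName("MyApplication")
      .getOrCreate()

  def main(args: Array[String]): Unit = {
    // The master is supplied at submit time, for example:
    //   spark-submit --master yarn --class MyApplication my-application-assembly.jar
    val spark = buildSession()
    try {
      println(s"Running against master: ${spark.sparkContext.master}")
    } finally {
      spark.stop() // release cluster resources when the job ends
    }
  }
}
```

The same JAR can then be pointed at `local[*]` on a laptop or at a cluster manager in production without a rebuild.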
DataFrames vs Datasets vs RDDs
Prefer the **DataFrame API** (untyped `Dataset[Row]`) for most workloads. Use **Datasets** (typed) when compile-time type safety justifies the serialization overhead. Avoid raw **RDDs** unless you need low-level control.
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.functions.count
// Preferred: DataFrame API (the $"..." syntax needs spark.implicits._ in scope)
val df: DataFrame = spark.read.parquet("data/events")
val result = df
  .filter($"status" === "active")
  .groupBy($"region")
  .agg(count("*").as("total"))
// Typed Dataset: use when schema safety matters
case class Event(id: Long, status: String, region: String)
val ds: Dataset[Event] = df.as[Event]
val active = ds.filter(_.status == "active")

Schema Management
Always define schemas explicitly when reading semi-structured data instead of relying on schema inference:
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("id", LongType, nullable = false),
StructField("name", StringType, nullable = true),
StructField("timestamp", TimestampType, nullable = false),
StructField("amount", DecimalType(18, 2), nullable = true),
StructField("tags", ArrayType(StringType), nullable = true)
))
val df = spark.read
  .schema(schema)
  .json("data/events/*.json")

- Schema inference (automatic for JSON, `inferSchema=true` for CSV) requires an extra pass over the data and is expensive for large files.
- For Parquet and Delta, the schema is embedded — explicit definition is unnecessary.
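
When a case class already describes the records, one alternative to writing the `StructType` by hand is to derive it from the class with `Encoders.product`. This is a sketch under assumptions: the `Event` fields and the `EventSchema` object are hypothetical, and the derived nullability (primitives non-nullable, reference types nullable) may not match what your data guarantees.

```scala
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Hypothetical record type; the field names must match the source data.
case class Event(id: Long, status: String, region: String)

object EventSchema {
  // Derives a StructType from the case class fields.
  // No SparkSession is needed to compute it.
  val schema: StructType = Encoders.product[Event].schema
}
```

The result can be passed to a reader in the same way as a hand-written schema, for example `spark.read.schema(EventSchema.schema).json(...)`, which keeps the schema and the typed `Dataset[Event]` definition from drifting apart.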
Column Expressions
Prefer `col()` or `$""` over string column names in transformations for early error detection:
import org.apache.spark.sql.functions._
// Good: explicit Column references, resolved when the query is analyzed
df.select(col("name"), ($"amount" * 1.1).as("adjusted_amount"))

🎯 Best For
- QA engineers
- Developers writing unit tests
- UI designers
- Product designers
- Claude users
💡 Use Cases
- Generating test cases for edge conditions
- Writing integration test suites
- Generating component mockups
- Creating design system tokens
📖 How to Use This Skill
1. Install the Skill
   Copy the install command from the Terminal tab and run it. The SKILL.md file downloads to your local skills directory.
2. Load into Your AI Assistant
   Open Claude or GitHub Copilot and reference the skill. Paste the SKILL.md content or use the system prompt tab.
3. Apply Scala-Spark to Your Work
   Open your project in the AI assistant and ask it to apply the skill. Start with a small module to verify the output quality.
4. Review and Refine
   Review AI suggestions before committing. Run tests, check for regressions, and iterate on the skill output.
❓ Frequently Asked Questions
Does this generate test mocks?
Many testing skills include mock generation. Check the install command and skill content for details.
Does this work with Figma?
Some design skills integrate with Figma plugins. Check the Works With section for supported tools.
Is Scala-Spark compatible with Cursor and VS Code?
Yes — this skill works with any AI coding assistant including Cursor, VS Code with Copilot, and JetBrains IDEs.
Do I need specific dependencies for Scala-Spark?
Check the install command and Works With section. Most code skills only require the AI assistant and your codebase.
How do I install Scala-Spark?
Copy the install command from the Terminal tab and run it. The skill downloads to ./skills/scala-spark/SKILL.md, ready to use.
⚠️ Common Mistakes to Avoid
Not testing edge cases
AI tends to generate happy-path tests. Manually review for boundary conditions.
Skipping usability testing
AI-generated designs should be validated with real users before development.
Skipping validation
Always test AI-generated code changes, even for simple refactors.
Missing dependency updates
Check if the skill requires updated dependencies or new packages.