Handling Spark Dependencies in Ilum

Ilum provides three methods to handle dependencies for Spark on Kubernetes, each suited for different use cases ranging from rapid prototyping to stable production environments.

Comparison of Dependency Management Methods

Method	Best For	Persistence	Startup Speed
Custom Docker Image	Production, Large dependencies, Security	High (Immutable)	Fast (Pre-built)
Runtime Injection	Testing, PoCs, Small/Transient libs	Medium (Cached)	Slower (Downloads at startup)
Notebook `pip install`	Ad-hoc Experiments, Exploration	None (Session only)	Slowest (Repeated installs)

1. Dedicated Docker Image (Production Best Practice)

This method involves creating a custom Docker image that includes all required dependencies. It ensures consistency across environments and is the best approach for production workloads.

Steps to Create a Custom Spark Image

Start with the official Ilum Spark base image.
Add necessary JARs for any Java-based dependencies.
Install required Python packages.
Build and push the image to a private or public registry.
Configure Ilum to use this new image.

Example: Adding Apache Iceberg Support

Below is an example Dockerfile that builds on the Ilum Spark base image and adds support for Apache Iceberg:

Dockerfile
FROM ilum/spark:3.5.8

USER root

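# Add JARs for Iceberg support (the trailing slash downloads the file into the jars directory)
ADD --chmod=644 https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-spark-runtime-3.5_2.12/1.8.0/iceberg-spark-runtime-3.5_2.12-1.8.0.jar $SPARK_HOME/jars/

# Install Python dependencies (the extras are quoted so the shell does not treat [] as a glob)
RUN python3 -m pip install --no-cache-dir pandas "pyiceberg[hive,s3fs,pandas,snappy,gcsfs,adlfs]"

# Switch back to the non-root Spark user. Build args from the base image do not
# carry over into this Dockerfile, so spark_uid is declared here. The value 185 is
# an assumption taken from the upstream Apache Spark images; confirm it for your base image.
ARG spark_uid=185
USER ${spark_uid}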

Build and Push the Image

After writing the Dockerfile (for example, saved as Dockerfile in the current directory), build and push the image:

Build and Push Image
docker build -t myPrivateRepo/spark:3.5.8-iceberg .
docker push myPrivateRepo/spark:3.5.8-iceberg

Configuring Ilum to Use the Custom Image

Once the image is available in a container registry, update Ilum to use this custom Spark image:

The image can be set in four places: per job or service in the UI, at install time with Helm, through the REST API, or as a cluster-wide default.

Per-Job/Service Setting: When submitting a Spark job or service, specify the image by setting this parameter:

Configuration
spark.kubernetes.container.image: myPrivateRepo/spark:3.5.8-iceberg

During the installation process: Add this flag to your helm install command:

helm install ... \
  --set ilum-core.kubernetes.defaultCluster.config.spark\\.kubernetes\\.container\\.image="myPrivateRepo/spark:3.5.8-iceberg"

When submitting a job programmatically, set the image parameter in the request:

Submit Job
curl -X POST "http://ilum-core/api/v1/job/submit" \
  -F "name=my-custom-job" \
  -F "image=registry.example.com/my-team/spark-custom:v1" \
  ...

You can set the default image for the entire cluster via the UI using one of two methods.

Option A: General Tab (Spark Version)

Navigate to the General tab of your cluster settings. Locate the Spark Version field and enter your custom image tag (e.g., myPrivateRepo/spark:3.5.8-iceberg).

[Screenshot: setting the Spark version in the cluster settings]

Option B: Configuration Tab

Navigate to the Configuration tab. Add a new parameter spark.kubernetes.container.image and set its value to your custom image.

[Screenshot: setting the Spark image in the cluster properties section]

Best Practices

Keep all dependency versions aligned with the Spark version used.
Regularly update the custom image to include security patches and the latest dependency versions.
Store images in a reliable and accessible container registry.
Use a versioning scheme for your images (e.g., include Spark and feature versions in the tag).
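
The tagging convention above can be sketched as a small helper. This is only an illustration: the function name `image_reference` and the `<spark version>-<feature>` layout are assumptions modeled on the `3.5.8-iceberg` tag used in this guide, not an Ilum requirement.

```python
import re

# Docker tags may hold up to 128 characters from letters, digits, "_", "." and "-",
# and must not start with "." or "-".
_VALID_TAG = re.compile(r"^[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}$")


def image_reference(repository: str, spark_version: str, feature: str) -> str:
    """Return "<repository>:<spark_version>-<feature>" after validating the tag."""
    tag = f"{spark_version}-{feature}"
    if not _VALID_TAG.match(tag):
        raise ValueError(f"{tag!r} is not a valid image tag")
    return f"{repository}:{tag}"


print(image_reference("myPrivateRepo/spark", "3.5.8", "iceberg"))
# prints: myPrivateRepo/spark:3.5.8-iceberg
```

Generating the tag in one place keeps the build command, the Helm flag, and the cluster configuration pointing at the same image.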

Troubleshooting

Common Image Issues

Issue	Solution
Dependency mismatch	Ensure all JARs and Python packages are compatible with the Spark version in use.
Image not found	Verify the image name and that it was pushed to the correct registry (and that Ilum has access to that registry).
Job fails due to missing dependencies	Double-check that the Spark job is using the intended custom image (check the image configuration in Ilum or the spark-submit command).
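
A dependency mismatch can often be caught before the image is built. The sketch below checks that an Iceberg runtime artifact name targets the same Spark minor line and Scala binary version as the cluster. The function name is hypothetical, and the Scala version is something you supply yourself; this guide does not state which Scala version a given Ilum image ships.

```python
import re

# Matches runtime artifact names such as "iceberg-spark-runtime-3.5_2.12":
# group 1 is the Spark minor line, group 2 the Scala binary version.
_ARTIFACT = re.compile(r"^iceberg-spark-runtime-(\d+\.\d+)_(\d+\.\d+)$")


def matches_spark(artifact_id: str, spark_version: str, scala_version: str) -> bool:
    """True when the artifact targets the given Spark minor line and Scala binary version."""
    m = _ARTIFACT.match(artifact_id)
    if m is None:
        raise ValueError(f"unrecognized artifact name: {artifact_id!r}")
    spark_minor = ".".join(spark_version.split(".")[:2])   # "3.5.8" becomes "3.5"
    scala_binary = ".".join(scala_version.split(".")[:2])  # "2.12.18" becomes "2.12"
    return m.group(1) == spark_minor and m.group(2) == scala_binary


print(matches_spark("iceberg-spark-runtime-3.5_2.12", "3.5.8", "2.12.18"))
# prints: True
```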

2. Runtime Injection (Spark Packages & PyPI)

For rapid development and testing, you can add dependencies dynamically using Spark’s configuration. This approach fetches JARs and installs Python packages at startup time.

Adding Java JARs

Specify Maven coordinates for Java dependencies using the spark.jars.packages configuration.

Packages can be set in three places: per job or service in the UI, at install time with Helm, or as a cluster-wide default.

For individual Jobs or Services, you can add packages directly in the Resources tab.

[Screenshot: adding Spark packages in the UI]

Navigate to New Job or New Service.
Go to the Resources tab.
Scroll to Spark Packages.
Click Add Parameter and enter the Maven coordinate (e.g., org.apache.hadoop:hadoop-aws:3.3.4).

During the installation process: Add this flag to your helm install command:

helm install ... \
  --set ilum-core.kubernetes.defaultCluster.config.spark\\.jars\\.packages="org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0\,org.apache.hadoop:hadoop-aws:3.3.4"
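
When the Helm command is assembled by a script instead of typed into a shell, the escaping changes slightly. A shell reduces `\\.` to `\.` before Helm sees it, while an argument list passed without a shell needs the single-backslash form directly. Helm also treats an unescaped comma in a `--set` value as the start of another key=value pair. The sketch below applies both rules; the function name `helm_set_value` is hypothetical.

```python
def helm_set_value(values_path: str, spark_property: str, value: str) -> str:
    """Build the key=value string that follows `--set` for one Spark property."""
    # Dots in the property name are escaped so Helm keeps it as a single key
    # instead of splitting it into nested maps.
    escaped_key = spark_property.replace(".", "\\.")
    # Backslashes and commas in the value are escaped so Helm reads them literally.
    escaped_value = value.replace("\\", "\\\\").replace(",", "\\,")
    return f"{values_path}.{escaped_key}={escaped_value}"


arg = helm_set_value(
    "ilum-core.kubernetes.defaultCluster.config",
    "spark.jars.packages",
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0,org.apache.hadoop:hadoop-aws:3.3.4",
)
print(arg)
```

The returned string is meant to be passed as its own element after `--set` in an argument list (for example to `subprocess.run`), so no extra shell quoting is involved.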

To define default packages for all jobs on a cluster, set the property in the Cluster Configuration form at runtime:

Cluster Configuration
  spark.jars.packages: org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0,org.apache.hadoop:hadoop-aws:3.3.4

Spark will automatically download the specified package (and its dependencies) from Maven Central or the configured repository when the job starts.
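
Because a malformed coordinate only surfaces when the job starts, it can help to assemble the property value programmatically. The following is a minimal sketch, with a hypothetical function name, that checks each entry has the `groupId:artifactId:version` shape and joins the entries with commas as `spark.jars.packages` expects. It does not verify that the artifact exists in a repository.

```python
def jars_packages(coordinates: list[str]) -> str:
    """Join Maven coordinates into a spark.jars.packages value after a basic shape check."""
    cleaned = []
    for coord in coordinates:
        parts = coord.strip().split(":")
        # Each entry needs exactly three non-empty parts: groupId, artifactId, version.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {coord!r}")
        cleaned.append(":".join(parts))
    return ",".join(cleaned)


print(jars_packages([
    "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0",
    "org.apache.hadoop:hadoop-aws:3.3.4",
]))
# prints: org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.0,org.apache.hadoop:hadoop-aws:3.3.4
```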

Installing Python Dependencies in Ilum

Ilum provides multiple ways to install Python dependencies for Spark jobs and Jupyter sessions. Depending on your use case, you can choose between:

UI (Job & Service)
Jupyter Session
Global Default (Cluster Config)

Ilum makes it easy to add Python dependencies when creating Spark Jobs or Interactive Services directly from the UI. The process is identical for both.

[Screenshot: adding Python dependencies to a Spark job in the Ilum UI]

Navigate to New Job or New Service in the Ilum UI (see Running Spark Jobs).
Locate the Requirements field under the Resources tab.
Enter the required Python dependencies.

Ilum will install these dependencies at runtime before executing the application.

There are two ways to configure dependencies for Jupyter: per-session or globally for all sessions.

Option A: Per-Session (Session Creation Form)

When creating a Jupyter Notebook session, you can specify required Python packages directly in the session creation form.

[Screenshot: new sparkmagic Spark session form]

Open the Create Session form (e.g., via the %manage_spark magic command).
Locate the Extra Packages field.
Enter the required packages as a semicolon-separated list:
Extra Packages
```
pandas;numpy;openai
```
When the session starts, Ilum will automatically install these libraries.

Option B: Global Default for Jupyter (Helm/ConfigMap)

To define default packages for all Jupyter Spark sessions (but not standard Spark jobs):

Install Time:

helm install ... \
  --set ilum-jupyter.sparkmagic.config.sessionConfigs.conf='{"pyRequirements":"pandas;numpy;openai"}'

Post-Install (ConfigMap): Modify the ilum-jupyter-config ConfigMap:

config.json
data:
  config.json: |
    ...
    {
      "session_configs": {
        "conf": { "pyRequirements": "pandas;numpy;openai", ... }
      }
    }

To define default Python packages for all Spark applications running on a specific cluster (including jobs and interactive sessions started by external modules like Jupyter or Airflow), add environment variables in the Cluster Configuration.

You need to add two properties with the same list of semicolon-separated packages:

spark.executorEnv.PIP_MODULES
spark.kubernetes.driverEnv.PIP_MODULES

[Screenshot: setting PIP_MODULES in the cluster configuration]

Steps:

Go to Clusters and edit your target cluster (or set this while creating the cluster).
Navigate to the Configuration tab.

Add the parameters:

Cluster Parameters
spark.executorEnv.PIP_MODULES=pyspark;pandas;openai
spark.kubernetes.driverEnv.PIP_MODULES=pyspark;pandas;openai
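
Since both properties must carry the same list, generating them from a single source avoids drift between driver and executors. This is a small sketch with a hypothetical helper name; it only builds the two key/value pairs shown above and does not talk to Ilum.

```python
def pip_modules_conf(packages: list[str]) -> dict[str, str]:
    """Return both PIP_MODULES properties with an identical semicolon-separated list."""
    names = [p.strip() for p in packages if p.strip()]
    for name in names:
        if ";" in name:
            # A semicolon inside an entry would silently split it into two packages.
            raise ValueError(f"package entry must not contain ';': {name!r}")
    value = ";".join(names)
    return {
        "spark.executorEnv.PIP_MODULES": value,
        "spark.kubernetes.driverEnv.PIP_MODULES": value,
    }


for key, value in pip_modules_conf(["pyspark", "pandas", "openai"]).items():
    print(f"{key}={value}")
# prints:
# spark.executorEnv.PIP_MODULES=pyspark;pandas;openai
# spark.kubernetes.driverEnv.PIP_MODULES=pyspark;pandas;openai
```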

Each approach ensures your Spark jobs and Jupyter sessions have the necessary dependencies installed, so you can focus on data engineering and analysis instead of managing environments.

Best Practices

Use this method for testing or proof-of-concept jobs; avoid it for production due to the overhead of downloading dependencies on each run.
Specify exact versions for packages to ensure reproducibility.
Combine this approach with custom Docker images for better consistency (e.g., use Docker for core dependencies and spark.jars.packages for a few transient ones if needed).
Be mindful of network access and performance, as downloading packages can slow down startup times.
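
The advice to pin exact versions can be checked mechanically. The sketch below flags entries in a semicolon-separated list that lack an `==` pin. It assumes the Requirements field and `PIP_MODULES` pass each entry to pip unchanged, so that pip-style specifiers such as `pandas==2.2.2` are accepted; this guide does not state that explicitly, and the function name is hypothetical.

```python
import re

# A pinned entry looks like "name==version", optionally with extras: "name[extra]==version".
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?==[^=\s]+$")


def unpinned(requirements: str, separator: str = ";") -> list[str]:
    """Return the entries that do not pin an exact version."""
    entries = [e.strip() for e in requirements.split(separator) if e.strip()]
    return [e for e in entries if not _PINNED.match(e)]


print(unpinned("pandas==2.2.2;numpy;openai>=1.0"))
# prints: ['numpy', 'openai>=1.0']
```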

Troubleshooting

Common Dependency Issues

Issue	Solution
JAR not found	Ensure the Maven coordinates (groupId, artifactId, version) are correct.
Startup Performance	If startup is slow or OOMs occur, consider baking dependencies into a Docker image.

3. Installing Libraries in Jupyter Notebooks with pip install

For quick interactive experiments, you can install libraries within a Jupyter notebook using pip. This is a fast way to test something in an ad-hoc manner, but it is not recommended for anything beyond temporary exploration.

Example

If you are running a Spark session in an Ilum Jupyter notebook and need a new Python package, you can install it like so:

notebook.ipynb
%%spark

import subprocess
import sys

# Install the package with the same interpreter that runs this session
result = subprocess.check_output([sys.executable, "-m", "pip", "install", "geopandas"])
print(result.decode())

# Verify installation
result = subprocess.check_output([sys.executable, "-m", "pip", "list"])
print(result.decode())

This installs the package into the environment of the running Spark session, so you can import it immediately.

Why It’s Not Recommended

Packages installed this way are only available in the current Spark session.
The environment does not persist across session restarts or new sessions.
It can lead to inconsistencies between your development environment and the production Spark runtime.

Best Practices

Use this approach only for quick, throwaway prototyping.
If you find yourself relying on a pip-installed library, add it to a requirements file or Docker image for permanence.
Document any packages you had to install in the notebook so you can update your environment properly later.
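
One way to document what was installed is to record the exact versions before the session ends. The sketch below uses the standard library's `importlib.metadata` to turn package names into requirements-style lines; the function name is hypothetical. It would need to run in the same environment where the packages were installed (for example inside the Spark session cell) to report the right versions.

```python
from importlib import metadata


def pinned_lines(names: list[str]) -> tuple[list[str], list[str]]:
    """Return ("name==version" lines, names that are not installed)."""
    pinned, missing = [], []
    for name in names:
        try:
            pinned.append(f"{name}=={metadata.version(name)}")
        except metadata.PackageNotFoundError:
            missing.append(name)
    return pinned, missing


pinned, missing = pinned_lines(["geopandas"])
print("\n".join(pinned))            # lines to copy into a requirements file or Dockerfile
print("not installed:", missing)
```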

Troubleshooting

Pip Install Issues

Issue	Solution
Package not found	Check spelling and availability on PyPI.
Module not found	Try restarting the notebook kernel to reload the environment.

Frequently Asked Questions (FAQ)

How do I install private Python packages in Spark?

You can install private packages by building a Custom Docker Image (Method 1). During the docker build process, you can pass credentials or use a pip configuration file to authenticate with your private PyPI repository. Alternatively, for runtime injection, you may need to configure a custom pip index URL in your environment, but Docker is more secure for handling credentials.

Should I use Docker or runtime requirements for Spark on Kubernetes?

For production, always use a Docker image. It guarantees that every node (driver and executors) has the exact same environment without the latency and failure risk of installing packages at runtime. Use runtime requirements only for development, testing, or very small, non-critical libraries.

How do I add JDBC drivers to Ilum Spark jobs?

JDBC drivers (like PostgreSQL, MySQL, or Snowflake) are best added as JARs. You can either:

Add the JAR to your Docker image (e.g., in $SPARK_HOME/jars).
Use spark.jars.packages (Method 2) to fetch them from Maven Central at runtime (e.g., org.postgresql:postgresql:42.6.0).

Final Recommendations

Production workloads: Use a custom Docker image with all dependencies pre-installed. This yields a stable and reproducible environment with faster startup times.
Testing or prototyping: Use spark.jars.packages for JARs and runtime Python requirements (the Requirements field, pyRequirements, or PIP_MODULES) for flexibility. This allows you to experiment quickly without building a new image, though it incurs startup overhead.
Interactive experiments: Installing via Jupyter notebooks is convenient for short-lived experiments, but always transition to a more robust solution (Docker image or requirements file) for anything that needs to be saved or run again.

By following these practices, you can efficiently manage Spark dependencies in Ilum while minimizing compatibility issues and runtime errors.
