
Monitoring Apache Spark on Kubernetes

Effective observability is critical for maintaining the reliability and performance of distributed data processing systems. Ilum provides a multi-layered monitoring architecture for Apache Spark on Kubernetes, integrating native real-time insights with industry-standard open-source tools.

This guide covers the three pillars of Ilum's observability stack:

  1. Application Metrics: Real-time job statistics, JVM performance, and executor health via Ilum UI and Prometheus.
  2. Infrastructure Monitoring: Cluster resource utilization (CPU/Memory) via Kubernetes metrics.
  3. Distributed Logging: Centralized log aggregation and query capabilities using Loki and Promtail.

Ilum Native Monitoring

Ilum's built-in monitoring interface provides immediate, high-level visibility into job execution without requiring external tool configuration. It acts as the first line of defense for detecting job failures and performance regressions.

Real-time Job Metrics

For every Spark application managed by Ilum, the UI exposes granular telemetry covering resource allocation, task throughput, and memory consumption.

Key Monitoring Dimensions:

  • Resource Allocation: Tracks the count of active executors, total cores assigned, and aggregated memory usage across the cluster. This helps verify if a job is receiving the requested Kubernetes resources.
  • Task Throughput: Displays the lifecycle of Spark tasks—Running, Completed, Failed, and Interrupted. A high failure rate often indicates data quality issues or transient network faults.
  • Memory & Data I/O: Monitors peak Heap/Off-Heap memory usage and Shuffle I/O (input/output bytes). Sudden spikes in Shuffle Write can indicate expensive groupBy or join operations.
  • Executor Health: Detailed breakdown of memory pools (Execution vs. Storage) per executor, essential for tuning spark.memory.fraction.
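
The memory-pool breakdown above maps directly onto Spark's unified memory settings. A minimal, illustrative spark-defaults.conf excerpt (the values are placeholders to tune per workload, not recommendations):

```properties
# Illustrative values only -- tune per workload
spark.executor.memory           4g     # JVM heap per executor
spark.memory.fraction           0.6    # share of heap for execution + storage
spark.memory.storageFraction    0.5    # portion of the above protected for caching
```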

Ilum Job Overview

Ilum Executor Statistics

Execution Timeline Analysis

Understanding the temporal structure of a Spark application is vital for performance optimization. Ilum's Timeline module visualizes the execution flow, distinguishing between high-level Spark Jobs and their constituent Stages.

  • Spark Jobs: Triggered by actions (e.g., count(), collect(), save()).
  • Stages: Boundaries created by Shuffle operations (data redistribution).

The Timeline view allows engineers to identify:

  • Long-tail Stages: Stages that take disproportionately long to complete.
  • Straggler Tasks: Individual tasks that delay the completion of a stage, often caused by data skew.

Ilum Job Timeline

Cluster Resource Utilization

Ilum aggregates metrics from the underlying Kubernetes cluster to show total resource consumption. This view is crucial for capacity planning and ensuring that the cluster has sufficient headroom for pending workloads.

Cluster Resource Overview


Apache Spark UI & History Server

While Ilum provides a consolidated view, the native Spark UI remains the standard for deep-dive DAG (Directed Acyclic Graph) analysis and SQL query planning.

Accessing the Spark UI

For running jobs, the Spark UI is proxied directly through the Ilum interface. For completed applications, Ilum bundles a Spark History Server.

Persistent Event Logging

The History Server relies on Spark Event Logs stored on a persistent volume. This integration is configured via Helm parameters:

ilum-core:
  historyServer:
    enabled: true
    # Ensures logs persist across restarts
    volume:
      enabled: true
      size: 10Gi

Note: To disable the History Server (e.g., in resource-constrained environments), run:

helm upgrade --set ilum-core.historyServer.enabled=false --reuse-values ilum ilum/ilum

You can access the History Server directly via port-forwarding:

kubectl port-forward svc/ilum-history-server 9666:18080

Advanced Monitoring with Prometheus

For production environments, Ilum integrates with the Prometheus ecosystem to scrape, store, and alert on time-series metrics. This integration uses the PrometheusServlet sink in Spark to expose metrics in a format compatible with Prometheus scraping.

Architecture & Configuration

Ilum automatically configures the PodMonitor custom resource (part of the Prometheus Operator) to discover Spark driver and executor pods.

Enabling Prometheus Integration:

The Prometheus stack is optional. Enable it via Helm:

helm upgrade \
  --set kube-prometheus-stack.enabled=true \
  --set ilum-core.job.prometheus.enabled=true \
  --reuse-values ilum ilum/ilum

This configuration injects the following Spark properties into every job:

spark.ui.prometheus.enabled=true
spark.metrics.conf.*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
spark.metrics.conf.*.sink.prometheusServlet.path=/metrics/prometheus/
spark.executor.processTreeMetrics.enabled=true

Critical Spark Metrics to Monitor

When monitoring Spark on Kubernetes, focus on these high-signal metrics:

| Metric Category    | Metric Name (PromQL Pattern)                   | Why it matters |
| ------------------ | ---------------------------------------------- | -------------- |
| Memory             | metrics_executor_heapMemoryUsed_bytes          | High heap usage (>90%) correlates with frequent Full GC and OOM risks. |
| Garbage Collection | metrics_executor_jvm_G1_Young_Generation_count | Frequent GC pauses freeze execution threads, reducing throughput. |
| Shuffle I/O        | metrics_shuffle_read_bytes_total               | Massive shuffle reads indicate network-heavy operations (joins/grouping). |
| CPU/Tasks          | metrics_executor_threadpool_activeTasks        | Should match the number of allocated cores. Low utilization implies resource waste. |
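
These metrics become actionable when wired into alerting. Below is a sketch of a PrometheusRule (a Prometheus Operator custom resource) that fires on sustained high heap usage; the resource name, threshold, and duration are assumptions to adapt to your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: spark-executor-alerts   # hypothetical name
  namespace: ilum
spec:
  groups:
    - name: spark-memory
      rules:
        - alert: SparkExecutorHeapHigh
          # fires when heap usage stays above 90% of max for 10 minutes
          expr: metrics_executor_heapMemoryUsed_bytes / metrics_executor_maxMemory_bytes > 0.9
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Spark executor heap above 90%"
```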

Querying with PromQL

Access the Prometheus UI via port-forward:

kubectl port-forward svc/prometheus-operated 9090:9090

Example PromQL Queries:

  • Max Memory Usage per Job:

    max(metrics_executor_maxMemory_bytes{namespace="ilum"}) by (pod)
  • Total Active Tasks in Cluster:

    sum(metrics_executor_threadpool_activeTasks)

Prometheus Metrics

Prometheus Chart

Zabbix Integration

Organizations using Zabbix for infrastructure monitoring can consume Ilum's Prometheus metrics directly. Zabbix 4.2+ includes native Prometheus data preprocessing and HTTP agent items that can scrape Prometheus endpoints without additional exporters.

Configure a Zabbix HTTP agent item pointing to the Prometheus endpoint exposed by Ilum's PodMonitor, and use Prometheus pattern preprocessing to extract specific metrics.
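
As an illustrative sketch, an HTTP agent item could scrape the driver's metrics endpoint directly, with a "Prometheus pattern" preprocessing step selecting a single series (the URL and labels below are assumptions, not fixed values):

```
HTTP agent item URL:   http://<driver-pod>:4040/metrics/prometheus/
Preprocessing step:    Prometheus pattern
Pattern parameter:     metrics_executor_heapMemoryUsed_bytes{namespace="ilum"}
```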


Visualization with Grafana

Ilum leverages Grafana for dashboarding, providing pre-built visualizations that correlate Spark metrics with Kubernetes infrastructure metrics.

Default Dashboards

Ilum ships with a suite of dashboards located in the Ilum folder:

  1. Ilum Spark Job Overview: High-level health check of all running jobs (Success/Failure rates, Total Cluster Memory).
  2. Ilum Spark Job Detail: Deep dive into a single application id. Correlates Driver vs. Executor memory usage.
  3. Ilum Pod Monitor: Infrastructure metrics for the Ilum control plane (CPU throttling, Memory limits).

Grafana Job Overview

Grafana Charts

Accessing Grafana

kubectl port-forward svc/ilum-grafana 8080:80
# Default Credentials: admin / prom-operator

Distributed Logging with Loki

Logs are the source of truth for debugging failures. Ilum integrates Loki (log aggregation) and Promtail (log shipping) to centralize logs from ephemeral Spark executors.

Log Architecture

  • Promtail: Runs as a DaemonSet on every Kubernetes node, tailing stdout/stderr from containers and shipping them to Loki.
  • Loki: Stores compressed logs in object storage (MinIO) and indexes metadata (labels).
  • LogQL: A query language inspired by PromQL for filtering logs.

OpenTelemetry Compatibility

Loki natively supports OpenTelemetry-compatible log ingestion via the OTLP endpoint. If your organization standardizes on OpenTelemetry, you can replace Promtail with an OpenTelemetry Collector configured to export logs to Loki. This provides a unified telemetry pipeline for logs, metrics, and traces without changing the storage backend.

Configure the OTel Collector's Loki exporter:

exporters:
  loki:
    endpoint: http://ilum-loki:3100/loki/api/v1/push

Enable Logging Stack:

helm upgrade ilum ilum/ilum \
  --set global.logAggregation.enabled=true \
  --set global.logAggregation.loki.enabled=true \
  --set global.logAggregation.promtail.enabled=true \
  --reuse-values

Effective LogQL for Spark

Effective debugging requires filtering noise. Use these LogQL patterns to find root causes:

  • Find specific Exception traces:

    {namespace="ilum"} |= "Exception" |!= "at org.apache.spark"
  • Isolate logs for a specific Spark Application:

    {app="job-20241107-1313-driver"} 
  • Detect Container OOM Kills (System Logs):

    {namespace="ilum"} |= "OOMKilled"
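
Beyond line filters, LogQL can also aggregate log streams into metrics. For example, to chart the rate of error lines per application (a sketch; the label names depend on your Promtail relabeling configuration):

```logql
sum(rate({namespace="ilum"} |= "ERROR" [5m])) by (app)
```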

Loki Log Query

Integrating with Elasticsearch / OpenSearch

While Ilum ships with Loki as the default log aggregation backend, organizations standardized on the Elastic Stack (ELK) or OpenSearch can integrate Ilum's logs and metrics into their existing infrastructure.

Log Shipping with Fluent Bit

Replace or supplement Promtail with Fluent Bit to ship container logs to Elasticsearch:

# Fluent Bit output configuration for Elasticsearch
[OUTPUT]
    Name               es
    Match              kube.*
    Host               elasticsearch.logging.svc
    Port               9200
    Index              ilum-logs
    Type               _doc
    Logstash_Format    On
    Logstash_Prefix    ilum
    Suppress_Type_Name On

Alternatively, use Fluentd with the fluent-plugin-elasticsearch output plugin for more complex parsing and routing requirements.

Index Templates for Spark Metrics

Create an Elasticsearch index template optimized for Spark metrics:

{
  "index_patterns": ["ilum-spark-*"],
  "template": {
    "settings": {
      "number_of_shards": 2,
      "number_of_replicas": 1
    },
    "mappings": {
      "properties": {
        "timestamp": { "type": "date" },
        "namespace": { "type": "keyword" },
        "application_id": { "type": "keyword" },
        "executor_id": { "type": "keyword" },
        "heap_memory_used": { "type": "long" },
        "shuffle_read_bytes": { "type": "long" },
        "active_tasks": { "type": "integer" }
      }
    }
  }
}

Prometheus Remote Write to Elasticsearch

Export Prometheus metrics to Elasticsearch using the Prometheus remote write feature or a sidecar exporter, enabling correlation of time-series metrics with log data in Kibana/OpenSearch Dashboards.

Kibana Dashboard Patterns

Build Kibana dashboards for Ilum observability:

  • Log Explorer: Filter by kubernetes.namespace, kubernetes.pod_name, and log level to isolate Spark driver/executor logs
  • Error Rate Dashboard: Visualize exception frequency across jobs using aggregations on log patterns
  • Resource Heatmap: Correlate CPU/memory metrics with job execution timelines

Note

When using Elasticsearch/OpenSearch alongside Loki, Ilum's native log viewer in the UI continues to use Loki. The Elasticsearch integration provides an additional, external observability path for teams with existing ELK infrastructure.


Cost & Usage Tracking

Effective cost management in multi-tenant environments requires visibility into resource consumption at the team and project level. Ilum's integration with Prometheus and Grafana provides the foundation for tracking compute costs and attributing them to specific teams or workloads.

Namespace-Level Resource Consumption

Since Ilum isolates workloads via Kubernetes namespaces, resource consumption can be tracked per namespace using built-in Kubernetes metrics:

CPU Hours per Namespace:

sum(rate(container_cpu_usage_seconds_total{namespace=~"ilum-.*"}[1h])) by (namespace)

Memory Hours per Namespace (GB-hours):

sum(avg_over_time(container_memory_working_set_bytes{namespace=~"ilum-.*"}[1h])) by (namespace) / 1024^3

Spark Job Cost Estimation

Individual Spark job costs can be estimated from executor metrics. The following PromQL query calculates total executor-seconds per application:

sum(
  count_over_time(metrics_executor_threadpool_activeTasks{namespace="ilum"}[1h])
) by (app)

Combine this with your cloud provider's per-vCPU-hour pricing to derive cost estimates.
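
As an illustration, aggregated executor-seconds can be converted into an approximate dollar figure with simple arithmetic; the executor-seconds total and the $0.04 per-vCPU-hour price below are hypothetical placeholders:

```shell
# Sketch: convert aggregated executor-seconds into an approximate cost.
# Both input values are hypothetical -- substitute your query result and pricing.
EXECUTOR_SECONDS=72000
PRICE_PER_VCPU_HOUR=0.04
awk -v s="$EXECUTOR_SECONDS" -v p="$PRICE_PER_VCPU_HOUR" \
  'BEGIN { printf "Estimated cost: $%.2f\n", s / 3600 * p }'
# → Estimated cost: $0.80
```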

Building a Cost Attribution Dashboard

Create a Grafana dashboard with the following panels for chargeback reporting:

| Panel                      | PromQL                                                                                   | Purpose |
| -------------------------- | ---------------------------------------------------------------------------------------- | ------- |
| CPU Hours by Team          | sum(rate(container_cpu_usage_seconds_total[24h])) by (namespace) * 24                    | Daily CPU consumption per namespace |
| Memory GB-Hours by Team    | sum(avg_over_time(container_memory_working_set_bytes[24h])) by (namespace) / 1024^3 * 24 | Daily memory consumption per namespace |
| Active Executors Over Time | sum(metrics_executor_threadpool_activeTasks) by (namespace)                              | Concurrent compute usage trend |
| Storage Usage              | sum(kubelet_volume_stats_used_bytes{namespace=~"ilum-.*"}) by (namespace)                | Persistent volume usage |

Cloud Billing Integration

For precise cost attribution, integrate Prometheus metrics with cloud billing APIs:

  • AWS: Use Cost Explorer API with resource tags matching Kubernetes namespace labels
  • GCP: Enable GKE usage metering to export per-namespace resource consumption to BigQuery
  • Azure: Use Azure Cost Management with AKS cost analysis by namespace

Tip

For a turnkey solution, consider deploying OpenCost alongside Ilum. OpenCost provides real-time Kubernetes cost monitoring and integrates with Prometheus and Grafana, offering per-pod and per-namespace cost breakdowns using actual cloud billing data.


Troubleshooting Guide

1. Detecting OutOfMemory (OOM) Errors

OOM errors in Spark on Kubernetes usually manifest in two ways:

  1. Java OOM (java.lang.OutOfMemoryError): The JVM runs out of Heap space.
    • Diagnosis: Search logs for java.lang.OutOfMemoryError: Java heap space.
    • Fix: Increase spark.executor.memory.
  2. Container OOM (Exit Code 137): The OS kills the container because it exceeded the Kubernetes memory limit (Off-heap usage + Overhead).
    • Diagnosis: Sudden disappearance of metrics in Grafana coupled with Reason: OOMKilled in Kubernetes events.
    • Fix: Increase spark.kubernetes.memoryOverheadFactor or spark.executor.memoryOverhead.

2. Identifying Data Skew

Data skew occurs when one partition contains significantly more data than others, causing a single task to run much longer than the rest ("Straggler").

  • Diagnosis: In the Ilum Timeline, look for a stage where the Max Task Duration is 10x+ higher than the Median Task Duration.
  • Fix: Use repartition() to redistribute data or enable Adaptive Query Execution (spark.sql.adaptive.enabled=true).
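
Since Spark 3.0, Adaptive Query Execution can also split skewed shuffle partitions automatically at join time. An illustrative spark-defaults.conf excerpt enabling both features:

```properties
spark.sql.adaptive.enabled            true
spark.sql.adaptive.skewJoin.enabled   true
```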

Legacy Support: Graphite

For environments already standardized on Graphite, Ilum supports pushing metrics via the GraphiteSink.

Enable Graphite Exporter:

helm upgrade ilum ilum/ilum \
  --set ilum-core.job.graphite.enabled=true \
  --set graphite-exporter.graphite.enabled=true \
  --reuse-values

Graphite Configuration