
Streamlit on Ilum - Spark Integration & Kubernetes Deployment


Streamlit is an open-source Python framework designed for building interactive data applications and dashboards with minimal effort. It enables data scientists and engineers to transform data scripts into shareable web applications in minutes, without requiring extensive knowledge of frontend frameworks like React or Vue.

In the context of Ilum, Streamlit serves as a powerful frontend layer for your Big Data infrastructure. It allows you to expose complex Apache Spark computations, real-time data streams, and machine learning models through a user-friendly interface, bridging the gap between raw data processing and business intelligence.


Architecture: Streamlit in a Cloud-Native Spark Ecosystem

When deployed within Ilum, Streamlit operates as a microservice inside your Kubernetes cluster. This architecture provides several technical advantages for data engineering workflows:

  • Proximity to Data: Running within the same cluster as your Spark Executors minimizes latency when transferring large datasets for visualization.
  • Unified Security: The application is governed by the same internal network policies and RBAC controls as the rest of the cluster.
  • Scalability: Kubernetes handles the orchestration, ensuring your dashboard remains available even under load.

Integration Patterns

Since Streamlit is Python-based, it offers flexible integration points with the Ilum ecosystem. Below are the primary architectural patterns for connecting your dashboard to your data infrastructure.

1. Spark Connect Integration

This is the most powerful integration method. By using Spark Connect, your Streamlit app acts as a client that submits DataFrame operations directly to the Ilum Spark Cluster. This decouples the client (Streamlit) from the heavy processing (Spark Executors).

import streamlit as st
from pyspark.sql import SparkSession

@st.cache_resource
def get_spark_session():
    return SparkSession.builder \
        .remote("sc://spark-connect-service:15002") \
        .getOrCreate()

spark = get_spark_session()

st.subheader("Big Data Query")
# This operation runs on the Spark Cluster, not the Streamlit pod
df = spark.sql("SELECT category, count(*) as count FROM sales GROUP BY category")
st.bar_chart(df.toPandas())

2. Ilum Public API Control Plane

You can use Streamlit to build a custom "Job Launcher" or "Control Plane". Instead of processing data, the app sends HTTP requests to Ilum's API to trigger asynchronous jobs, manage schedules, or retrieve execution logs.

import streamlit as st
import requests

ILUM_API = "http://ilum-api-service:8080"

def trigger_job(job_name, params):
    response = requests.post(
        f"{ILUM_API}/api/v1/jobs/{job_name}/launch",
        json=params
    )
    return response.status_code == 200

if st.button("Start ETL Pipeline"):
    success = trigger_job("nightly-batch", {"date": "2023-10-27"})
    if success:
        st.success("Pipeline triggered successfully!")

3. JDBC/ODBC for Data Lake Analytics

For low-latency queries on Delta Lake, Iceberg, or Hudi tables, you can connect Streamlit to Ilum SQL using standard JDBC drivers. This allows you to treat your data lake as a standard relational database.
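As a sketch of this pattern: the service name and port below (ilum-sql-service:10009) are hypothetical placeholders for your actual Ilum SQL endpoint, and the helper only assembles a Hive-compatible connection URL; the driver call itself is left as a comment since it depends on which JDBC/ODBC bridge (e.g., PyHive) you install.

```python
# Hypothetical in-cluster endpoint -- adjust to your Ilum SQL deployment.
SQL_HOST = "ilum-sql-service"
SQL_PORT = 10009

def build_jdbc_url(host: str, port: int, database: str = "default") -> str:
    """Assemble a HiveServer2-style JDBC URL for the data lake endpoint."""
    return f"jdbc:hive2://{host}:{port}/{database}"

# With a Python driver such as PyHive installed, the query feeds
# straight into pandas and Streamlit (not executed in this sketch):
# from pyhive import hive
# conn = hive.connect(host=SQL_HOST, port=SQL_PORT)
# df = pd.read_sql("SELECT * FROM sales LIMIT 100", conn)
# st.dataframe(df)

print(build_jdbc_url(SQL_HOST, SQL_PORT))
```

Because the result arrives as a plain DataFrame, the rest of the app can treat Delta Lake, Iceberg, or Hudi tables exactly like any relational source.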

4. Livy Proxy Execution

For scenarios requiring dynamic code submission, Streamlit can post code snippets to the Livy Proxy. This is useful for building "Notebook-like" interfaces where users can submit custom logic to be executed on the cluster.
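A minimal sketch of that flow, assuming the Livy proxy is reachable at a hypothetical in-cluster address (ilum-livy-proxy:8998) and a session already exists; the /sessions/{id}/statements endpoint and the {"code", "kind"} payload follow Livy's standard REST API:

```python
import requests

# Hypothetical in-cluster address of the Livy proxy -- adjust to your deployment.
LIVY_URL = "http://ilum-livy-proxy:8998"

def build_statement_payload(code: str, kind: str = "pyspark") -> dict:
    """Request body for Livy's POST /sessions/{id}/statements endpoint."""
    return {"code": code, "kind": kind}

def submit_snippet(session_id: int, code: str) -> dict:
    """Submit a code snippet to an existing Livy session; returns Livy's JSON."""
    resp = requests.post(
        f"{LIVY_URL}/sessions/{session_id}/statements",
        json=build_statement_payload(code),
    )
    resp.raise_for_status()
    return resp.json()

# In the app, user input from st.text_area could be shipped to the cluster:
# result = submit_snippet(0, "spark.range(100).count()")
```

Polling the returned statement ID for its state and output is what gives the interface its "notebook-like" feel.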


Developing High-Performance Data Apps

Streamlit is favored for quick prototyping and exploratory data analysis (EDA) because of its declarative API. Below are the core concepts and best practices for building scalable apps in Ilum.

1. Basic Application Structure

You can install the framework using Python’s package manager: pip install streamlit.

A minimal valid application requires just a few lines of code. Save this as app.py:

import streamlit as st

st.title("Hello Ilum!")
st.write("This is a Streamlit app running on a Kubernetes cluster.")

Run it locally to test:

streamlit run app.py


2. Visualizing Dataframes

Streamlit’s strength comes from its deep integration with the PyData stack (Pandas, NumPy, Arrow). It natively renders interactive charts.

import streamlit as st
import pandas as pd
import numpy as np

# Generate sample data
chart_data = pd.DataFrame(
    np.random.randn(20, 3),
    columns=['a', 'b', 'c']
)

st.header("Interactive Data Visualization")
st.line_chart(chart_data)

The displayed chart is fully interactive. Users can zoom, pan, and inspect individual data points.

3. Advanced Component Usage

For enterprise dashboards, you will likely need input widgets to filter data or trigger jobs.

import streamlit as st
import pandas as pd
import numpy as np
import time

# Title and layout configuration
st.set_page_config(page_title="Ilum Dashboard", layout="wide")

st.title("🎨 Ilum Control Panel")
st.markdown("### Monitor and control your Spark workloads")

# Input widgets in a sidebar
with st.sidebar:
    st.header("Configuration")
    environment = st.selectbox(
        "Select Environment", ["Production", "Staging", "Development"]
    )
    job_timeout = st.slider("Job Timeout (seconds)", 60, 3600, 600)
    enable_logs = st.checkbox("Show Verbose Logs", value=True)

# Main layout columns
col1, col2 = st.columns(2)

with col1:
    st.subheader("📝 Job Parameters")
    job_name = st.text_input("Job Name", "daily_etl_process")
    upload_config = st.file_uploader("Upload Configuration (JSON/YAML)")

with col2:
    st.subheader("📊 Resource Usage")
    # Simulating data for display
    metrics = pd.DataFrame({
        "CPU Usage": np.random.uniform(20, 80, 10),
        "Memory Usage": np.random.uniform(40, 90, 10)
    })
    st.area_chart(metrics)

# Action Buttons and Feedback
st.divider()
if st.button("🚀 Trigger Spark Job", type="primary"):
    with st.spinner("Submitting job to Ilum Scheduler..."):
        time.sleep(1.5)  # Simulate API call latency

    st.success(f"Job **{job_name}** started successfully in {environment}!")
    st.info("Job ID: `spark-job-12345-xyz`")

The resulting UI provides a professional control interface.

4. Performance Optimization: Caching and State

When working with Big Data, efficiency is critical. Streamlit provides mechanisms to manage computational resources effectively.

Caching Strategies

Use @st.cache_data for serializable data objects (like Pandas DataFrames) and @st.cache_resource for connections (like Spark sessions or database connections).

@st.cache_data(ttl=3600)  # Cache data for 1 hour
def get_large_dataset():
    # This expensive operation runs only once per hour,
    # regardless of how many users view the app
    return pd.read_parquet("s3://my-data-lake/heavy-table/")

df = get_large_dataset()
st.dataframe(df)

Session State

For multi-step workflows (e.g., a wizard for submitting a Spark job), use st.session_state to persist variables across re-runs.

if 'job_id' not in st.session_state:
    st.session_state.job_id = None

if st.button("Submit"):
    st.session_state.job_id = submit_spark_job()

if st.session_state.job_id:
    st.write(f"Tracking job: {st.session_state.job_id}")

Production Deployment in Ilum

Transitioning from a local script to a production service involves containerizing the application and configuring the Kubernetes deployment manifest.

1. Docker Containerization

Create an optimized Dockerfile that includes your application code and dependencies.

# Use a slim python image to reduce attack surface and image size
FROM python:3.11-slim

# Set working directory
WORKDIR /app

# Install system dependencies if needed (e.g., for pyodbc or specialized libraries)
# RUN apt-get update && apt-get install -y gcc

# Copy requirements first to leverage Docker layer caching
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY streamlit_app.py .
COPY .streamlit/ .streamlit/

# Expose the default Streamlit port
EXPOSE 8501

# Healthcheck to ensure the container is responsive
HEALTHCHECK CMD curl --fail http://localhost:8501/_stcore/health

# Run the application
ENTRYPOINT ["streamlit", "run", "streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0"]

2. Helm Configuration

To deploy your container within the Ilum stack, update your Helm values. This configuration allows you to define resource quotas, ensuring your dashboard doesn't consume excessive cluster resources.

streamlit:
  enabled: true

  image:
    repository: my-registry/ilum-dashboard
    tag: v1.0.0
    pullPolicy: Always

  # Define resource limits to guarantee stability
  resources:
    requests:
      cpu: 500m
      memory: 512Mi
    limits:
      cpu: 2000m
      memory: 2Gi

  # Environment variables for application configuration
  env:
    - name: SPARK_MASTER_URL
      value: "k8s://https://kubernetes.default.svc"
    - name: ILUM_API_URL
      value: "http://ilum-api-service:8080"

3. Troubleshooting & Observability

If your Streamlit application fails to connect to Spark or Ilum services, check the following:

  • Network Policies: Ensure your Kubernetes NetworkPolicies allow traffic from the Streamlit namespace to the Spark namespace.
  • Service DNS: Use the fully qualified domain name (FQDN) for internal services, e.g., spark-connect-service.ilum.svc.cluster.local.
  • Logs: Retrieve application logs using kubectl logs -l app=streamlit -n ilum to debug Python exceptions or connection timeouts.
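The FQDN convention from the second bullet can be captured in a small helper when wiring endpoints into your app's configuration. This is only a sketch; the namespace and port below are illustrative values:

```python
def service_fqdn(service: str, namespace: str, port: int) -> str:
    """Build the in-cluster DNS name Kubernetes assigns to a Service."""
    return f"{service}.{namespace}.svc.cluster.local:{port}"

# e.g. the Spark Connect endpoint referenced earlier, assuming it
# runs in the "ilum" namespace:
print(service_fqdn("spark-connect-service", "ilum", 15002))
# -> spark-connect-service.ilum.svc.cluster.local:15002
```

Using FQDNs rather than short service names avoids resolution failures when the Streamlit pod and its upstream services live in different namespaces.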

Once deployed, Ilum manages the lifecycle of the Streamlit pod, ensuring your interactive data applications are always accessible to your team.