AI Data Analyst Agent

Ilum includes an intelligent AI Agent assistant powered by Large Language Models (LLMs) with advanced retrieval capabilities over the platform's catalog, lineage, and documentation. This AI Data Analyst understands natural-language questions, generates optimized SQL queries, and provides clear explanations of results, making data analysis accessible to both technical and non-technical users.

Overview

The AI Data Analyst Agent is a dedicated, deployable service within Ilum that combines artificial intelligence with your data infrastructure to provide an interactive and intelligent data analysis experience. It serves as a bridge between natural language queries and complex SQL operations, enabling users to explore data through conversational interfaces.

Key Capabilities

  • Natural Language Processing: Understands complex questions in plain English and translates them into appropriate SQL queries
  • Contextual Awareness: Leverages workspace metadata, user roles, and active datasets to provide relevant and accurate responses
  • SQL Generation: Automatically generates optimized SQL queries based on user requests
  • Pipeline Integration: Creates pipeline stubs and chart specifications based on query context
  • Interactive Execution: Allows users to edit, modify, and re-run generated SQL queries
  • Learning System: Captures user feedback through thumbs up/down and edits for continuous improvement

Why Choose Ilum's Data Analyst Agent?

Ilum's Data Analyst Agent revolutionizes how organizations interact with their data by combining cutting-edge artificial intelligence with enterprise-grade data infrastructure. This powerful AI-powered SQL generator eliminates the complexity of traditional data analysis workflows, making advanced analytics accessible to both technical and non-technical users.

Transform Natural Language into SQL Queries

Our natural language to SQL converter understands complex business questions and automatically generates optimized database queries. Whether you're asking "What were our top-selling products last quarter?" or "Show me customer churn rates by region," the AI Data Analyst instantly translates your natural language queries into efficient SQL code, saving hours of manual query writing.

Enterprise-Ready AI for Data Teams

Unlike generic AI tools, Ilum's AI Data Analyst is specifically designed for enterprise data analytics:

  • Context-Aware Intelligence: Understands your data schema, relationships, and business logic
  • Role-Based Security: Respects user permissions and data access policies automatically
  • Scalable Architecture: Handles large-scale data environments with millions of records
  • Multi-Engine Support: Works with Spark SQL, Trino, and other popular query engines

Automated Business Intelligence and Reporting

The AI Data Analyst serves as an intelligent BI assistant that can:

  • Generate complex analytical reports with simple conversational prompts
  • Create data visualizations and chart specifications automatically
  • Build reusable query templates and macros for common business analyses
  • Integrate with existing dashboards and reporting workflows seamlessly

Key Competitive Advantages

Faster Time-to-Insight: Reduce query development time from hours to minutes with AI-generated SQL that's production-ready from the start.

Democratized Data Access: Enable non-technical stakeholders to explore data independently without requiring SQL expertise or constant support from data teams.

Enhanced Data Governance: Built-in security controls ensure users only access data they're authorized to see, maintaining compliance with enterprise data policies.

Continuous Learning: The system improves over time by learning from user feedback, query patterns, and organizational data usage, becoming more accurate and contextually aware.

Perfect for Modern Data Teams

Whether you're a data analyst looking to accelerate query development, a business user needing self-service analytics, or a data engineer building scalable data infrastructure, Ilum's AI Data Analyst adapts to your workflow and expertise level.

Architecture

The AI Data Analyst operates as an independent service within the Ilum ecosystem. It implements a Retrieval-Augmented Generation (RAG) architecture to ground Large Language Model (LLM) responses in the specific context of the organization's data catalog, ensuring high relevance and reducing hallucinations.

RAG Pipeline Implementation

The RAG pipeline consists of the following technical stages:

  1. Metadata Ingestion: The system continuously syncs with the Hive Metastore to ingest table schemas, column types, and comment annotations.
  2. Vectorization: Schema information and documentation are tokenized and converted into vector embeddings using the configured embedding model (e.g., text-embedding-ada-002).
  3. Semantic Indexing: Vectors are stored in a specialized store (pgvector, ChromaDB) with HNSW (Hierarchical Navigable Small World) indexing for low-latency approximate nearest neighbor (ANN) search.
  4. Context Retrieval: Incoming natural language queries are vectorized to retrieve the most relevant table schemas and documentation chunks based on cosine similarity.
  5. Context Injection: The retrieved schema definitions are injected into the LLM system prompt within the token window limits, dynamically prioritizing the most relevant tables.
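The five stages above can be sketched in miniature. The Python sketch below uses toy three-dimensional vectors in place of a real embedding model and hypothetical schema snippets; it illustrates stages 4 and 5 (cosine-similarity retrieval and token-budgeted context injection) only, not Ilum's actual implementation.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_context(query_vec, indexed_schemas, top_k=2):
    # Stage 4: rank indexed schema chunks by similarity to the query vector.
    ranked = sorted(indexed_schemas,
                    key=lambda s: cosine_similarity(query_vec, s["vector"]),
                    reverse=True)
    return ranked[:top_k]

def build_prompt(question, chunks, token_budget=200):
    # Stage 5: inject the most relevant schemas until the token budget is spent.
    context, used = [], 0
    for chunk in chunks:
        cost = len(chunk["ddl"].split())  # crude word-count token proxy
        if used + cost > token_budget:
            break
        context.append(chunk["ddl"])
        used += cost
    return "Schemas:\n" + "\n".join(context) + f"\n\nQuestion: {question}"

# Toy "embeddings" standing in for a real embedding model.
schemas = [
    {"ddl": "CREATE TABLE sales.orders (order_id INT, total DOUBLE)", "vector": [0.9, 0.1, 0.0]},
    {"ddl": "CREATE TABLE hr.employees (emp_id INT, name STRING)",    "vector": [0.0, 0.1, 0.9]},
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "total order value"
best = retrieve_context(query_vec, schemas, top_k=1)
prompt = build_prompt("What is the total order value?", best)
```

In production the ranking is delegated to the ANN index (HNSW) rather than a full sort, but the selection and injection logic is the same shape.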

Component Interaction

  • UI Component: WebSocket-based interactive chat interface utilizing React for state management.
  • API Gateway: RESTful endpoints handling request validation, rate limiting, and routing.
  • Orchestration Engine: Manages the agent loop, tool selection (MCP), and memory context.
  • OAuth2 / OIDC: Validates JWT tokens and extracts user claims (roles, groups) for permission filtering.

AI Data Analyst Interface

How the AI Agent Works

The AI Data Analyst operates as an intelligent AI Agent that autonomously reasons, plans, and executes data analysis tasks. Unlike simple chatbots or query generators, this AI Agent can understand context, make decisions, and take actions across the Ilum platform.

AI Agent Architecture

The AI Agent follows a sophisticated Perception → Reasoning → Planning → Action cycle:

  1. Perception Phase: The agent analyzes user input, current data context, and available resources
  2. Reasoning Phase: It applies domain knowledge to understand the user's intent and requirements
  3. Planning Phase: The agent creates an execution strategy, selecting appropriate tools and methods
  4. Action Phase: It executes the plan, generates SQL, runs queries, and provides explanations
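The four-phase cycle composes naturally as a pipeline of functions. The sketch below is deliberately trivial — keyword matching stands in for LLM reasoning, and the tool names are illustrative — and shows only how the phases chain together, not how Ilum implements them.

```python
def perceive(user_input, catalog):
    # Perception: gather the raw question plus available context.
    return {"question": user_input, "tables": catalog}

def reason(state):
    # Reasoning: a trivial intent classifier standing in for the LLM.
    q = state["question"].lower()
    state["intent"] = "visualize" if "chart" in q else "query"
    return state

def plan(state):
    # Planning: choose a tool for the detected intent.
    state["tool"] = "create_chart" if state["intent"] == "visualize" else "run_sql"
    return state

def act(state):
    # Action: execute the chosen tool (stubbed here).
    return f"{state['tool']} over {len(state['tables'])} known tables"

def agent_cycle(user_input, catalog):
    return act(plan(reason(perceive(user_input, catalog))))

result = agent_cycle("Show me a chart of revenue", ["orders", "products"])
```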

Agent Decision-Making Process

Context-Aware Analysis

The AI Agent maintains awareness of:

  • Current User Session: Previous queries, results, and conversation history
  • Data Environment: Available databases, tables, schemas, and relationships
  • User Permissions: Role-based access controls and data security boundaries
  • System State: Active clusters, resource availability, and performance metrics

Intelligent Query Planning

When processing requests, the Agent:

  1. Analyzes Intent: Determines whether the user wants data exploration, reporting, analysis, or system operations
  2. Evaluates Context: Considers relevant tables, joins, filters, and aggregations needed
  3. Optimizes Approach: Selects the most efficient query pattern and execution strategy
  4. Validates Security: Ensures generated queries respect user permissions and data policies

Agent Learning and Memory

Short-Term Memory

  • Conversation Context: Maintains context across multi-turn conversations
  • Query History: Remembers previous queries and results within the session
  • User Preferences: Adapts to user's preferred output formats and styles

Long-Term Learning

  • Pattern Recognition: Learns from successful query patterns and user feedback
  • Domain Knowledge: Builds understanding of organization-specific data patterns
  • Performance Optimization: Remembers which queries work best for different scenarios

Autonomous Action Capabilities

The AI Agent can independently:

Data Discovery

  • Schema Exploration: Automatically discover and understand table structures
  • Relationship Mapping: Identify foreign keys and data relationships
  • Data Profiling: Analyze data quality, distributions, and patterns

Query Optimization

  • Performance Analysis: Evaluate query execution plans and suggest improvements
  • Index Recommendations: Suggest database optimizations based on query patterns
  • Resource Management: Choose appropriate cluster resources for different workloads

Error Handling and Recovery

  • Self-Correction: Detect and fix common SQL syntax or logic errors
  • Alternative Approaches: Try different query strategies if initial attempts fail
  • Graceful Degradation: Provide partial results or explanations when full execution isn't possible

Integration with Ilum MCP Server

The AI Agent leverages the Model Context Protocol (MCP) to interact with Ilum services:

Available Actions

  • create_chart: Generate visualizations from query results
  • run_pipeline: Execute data processing workflows
  • schedule_job: Set up recurring analysis tasks
  • update_dashboard: Modify existing dashboards with new insights
  • set_alert: Configure monitoring and notification rules

Action Planning

The Agent decides when and how to use these actions based on:

  • User Intent: Explicit requests vs. implicit needs
  • Data Context: Results that would benefit from visualization or further processing
  • Workflow Integration: Opportunities to automate repetitive tasks

Feedback Loop and Continuous Improvement

Real-Time Adaptation

  • User Corrections: Learns from manual edits to generated queries
  • Thumbs Up/Down: Adjusts confidence scoring and query selection
  • Conversation Flow: Improves dialogue management and response timing

Administrative Learning

  • Query Success Rates: Tracks which types of queries work best
  • Performance Metrics: Monitors response times and resource usage
  • Error Patterns: Identifies common failure modes for improvement

Features

Natural Language Querying

The AI Data Analyst excels at interpreting natural language questions and converting them into executable SQL queries. Users can ask questions like:

  • "Show me the top 10 customers by revenue last quarter"
  • "What's the average processing time for jobs in the production cluster?"
  • "Find all tables that contain customer information"
  • "Which datasets haven't been updated in the last 30 days?"

Contextual Intelligence

The assistant leverages multiple sources of context to provide accurate and relevant responses:

  • Workspace Metadata: Understanding of available tables, columns, and data relationships
  • User Roles: Respect for access permissions and data security boundaries
  • Active Datasets: Awareness of currently loaded and accessible data
  • Historical Context: Learning from previous queries and interactions

SQL Generation and Optimization

The core function of the AI Data Analyst is translating natural language intent into dialect-specific SQL execution plans. The system ensures generated queries are syntactically correct, optimally structured, and security-compliant.

Multi-Dialect Support

The agent is aware of the specific SQL dialect of the underlying engine and adjusts syntax accordingly:

  • Spark SQL: Utilizes Spark-specific functions such as LATERAL VIEW EXPLODE, array_contains, and specific window function syntax.
  • Trino / Presto: Adapts to Trino's specific data types, UNNEST operations, and lambda functions.
  • PostgreSQL: Generates standard SQL compliant with PostgreSQL syntax for internal metadata queries or connected JDBC sources.

Complex Query Patterns

The system supports generation of advanced SQL constructs:

  • Common Table Expressions (CTEs): Decomposing complex logic into readable WITH clauses.
  • Window Functions: Generating analytic queries using RANK(), DENSE_RANK(), LEAD(), and LAG() over partitions.
  • Nested Data Processing: Handling STRUCT, ARRAY, and MAP types common in Big Data formats like Parquet and Avro.
  • Temporal Analysis: Generating accurate date math and interval calculations specific to the engine (e.g., date_add vs INTERVAL).

Interactive Workflow

AI Data Analyst Workflow

The AI Data Analyst provides a seamless workflow:

  1. Question Input: Users submit natural language queries
  2. Context Analysis: System analyzes available data and user permissions
  3. SQL Generation: Creates optimized SQL queries
  4. Result Explanation: Provides clear interpretation of results
  5. Iterative Refinement: Allows editing and re-execution of queries

MCP Server Integration

The AI Data Analyst integrates with the Ilum MCP server to enable actionable insights:

  • Chart Creation: "Create a bar chart showing monthly sales trends"
  • Pipeline Execution: "Run the data processing pipeline for this dataset"
  • Dashboard Updates: "Add this analysis to the executive dashboard"
  • Alert Configuration: "Set up monitoring for unusual data patterns"

Deployment and Configuration

Prerequisites

Before deploying the AI Data Analyst, ensure you have:

  • Ilum core services running
  • Hive Metastore configured for catalog access
  • Appropriate LLM provider access (OpenAI, Azure OpenAI, or local models)
  • Storage configured for embedding vectors and feedback data

Enabling AI Data Analyst (Enterprise License)

To enable the AI Data Analyst in your Ilum deployment:

helm upgrade ilum ilum/ilum \
--set ilum-ai.enabled=true \
--set ilum-ai.llm.provider=openai \
--set ilum-ai.llm.apiKey=<your-api-key> \
--set ilum-ai.embeddings.enabled=true \
--reuse-values

Configuration Options

Advanced LLM Tuning

Beyond basic provider settings, you can tune the LLM's behavior to balance creativity with determinism:

ilum-ai:
  enabled: true
  llm:
    provider: openai # openai, azure, local
    apiKey: "<your-api-key>"
    model: "gpt-4"
    # Controls randomness: 0.0 is deterministic, higher values are more creative
    temperature: 0.1
    # Penalizes new tokens based on their existing frequency in the text
    frequencyPenalty: 0.0
    # Controls the maximum number of tokens to generate
    maxTokens: 2048
    # System prompt override for organizational coding standards
    systemPrompt: |
      You are an expert Data Analyst. Always use explicit column names in SELECT statements.
      Never use SELECT *. Prefer Common Table Expressions (CTEs) for readability.

Vector Store Configuration

Configure the embedding model and vector database connection for the RAG pipeline:

Embedding Configuration

ilum-ai:
  embeddings:
    enabled: true
    provider: openai
    model: "text-embedding-ada-002"
    vectorStore: "pgvector" # pgvector, chromadb, weaviate

Security Configuration

ilum-ai:
  security:
    oauth2:
      enabled: true
      provider: "keycloak"
    roleBasedAccess: true
    dataFiltering: true

Usage Examples

Basic Data Exploration

User Query: "What tables do we have in the sales database?"

AI Response:

SELECT table_name, table_comment
FROM information_schema.tables
WHERE table_schema = 'sales'
ORDER BY table_name;

The AI Data Analyst found 5 tables in the sales database: customers, orders, products, order_items, and sales_summary.

Advanced Analytics

User Query: "Show me the trend of order values over the last 6 months, broken down by product category"

AI Response:

SELECT
  DATE_FORMAT(o.order_date, '%Y-%m') AS month,
  p.category,
  SUM(oi.quantity * oi.unit_price) AS total_value
FROM orders o
JOIN order_items oi ON o.order_id = oi.order_id
JOIN products p ON oi.product_id = p.product_id
WHERE o.order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH)
GROUP BY DATE_FORMAT(o.order_date, '%Y-%m'), p.category
ORDER BY month, category;

This query analyzes order trends by calculating monthly totals for each product category over the past 6 months.

Performance Analysis

User Query: "Which Spark jobs took longer than 30 minutes to complete yesterday?"

AI Response:

SELECT
  job_id,
  job_name,
  cluster_name,
  start_time,
  end_time,
  TIMESTAMPDIFF(MINUTE, start_time, end_time) AS duration_minutes
FROM ilum_jobs
WHERE DATE(start_time) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  AND TIMESTAMPDIFF(MINUTE, start_time, end_time) > 30
ORDER BY duration_minutes DESC;

Found 3 jobs that exceeded 30 minutes runtime. The longest was the daily ETL job at 45 minutes.

Administrative Features

Feedback Collection

The AI Data Analyst automatically collects user feedback to improve performance:

  • Thumbs Up/Down: Simple rating system for query quality
  • Query Edits: Tracks manual modifications to generated SQL
  • Usage Patterns: Monitors frequently asked questions and topics
  • Error Tracking: Logs failed queries and their contexts

Operational Intelligence

Administrators gain deep visibility into the AI service operations through comprehensive metrics exposed via Prometheus:

Observability Metrics

  • Latency Breakdown:
    • ilum_ai_ttft_seconds: Time-to-First-Token latency (user perception).
    • ilum_ai_retrieval_duration_seconds: Time taken for vector search and metadata retrieval.
    • ilum_ai_generation_duration_seconds: LLM processing time.
  • Token Economics:
    • ilum_ai_tokens_prompt_total: Counter of input tokens sent to the LLM.
    • ilum_ai_tokens_completion_total: Counter of generated output tokens.
    • ilum_ai_context_window_usage: Percentage of context window utilized per request.

Semantic Caching

To reduce costs and latency, the system implements a Semantic Cache layer (Redis):

  1. Embedding Comparison: Incoming queries are vectorized and compared against cached query vectors.
  2. Similarity Threshold: If cosine similarity > 0.95, the cached SQL response is served immediately.
  3. TTL Management: Cache entries expire based on configured TTL (default 1h) to ensure data freshness.
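The three cache steps above can be sketched with a small in-memory class. This is a toy stand-in for the Redis-backed layer — toy vectors instead of real embeddings, and a `now` parameter to make TTL expiry explicit — illustrating the threshold-and-TTL logic only.

```python
import math
import time

class SemanticCache:
    # Minimal in-memory stand-in for the Redis-backed semantic cache.
    def __init__(self, threshold=0.95, ttl=3600):
        self.threshold, self.ttl, self.entries = threshold, ttl, []

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query_vec, now=None):
        now = time.time() if now is None else now
        # Step 3: drop expired entries so stale SQL is never served.
        self.entries = [e for e in self.entries if now - e["ts"] < self.ttl]
        # Steps 1-2: best match wins only if it clears the similarity threshold.
        best = max(self.entries,
                   key=lambda e: self._cosine(query_vec, e["vec"]), default=None)
        if best and self._cosine(query_vec, best["vec"]) > self.threshold:
            return best["sql"]
        return None

    def put(self, query_vec, sql, now=None):
        self.entries.append({"vec": query_vec, "sql": sql,
                             "ts": time.time() if now is None else now})

cache = SemanticCache()
cache.put([1.0, 0.0], "SELECT 1", now=0)
hit = cache.get([0.99, 0.01], now=10)   # near-identical phrasing -> cache hit
miss = cache.get([0.0, 1.0], now=10)    # unrelated question -> miss
```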

Model Optimization

The system provides tools for continuous improvement:

  • Prompt Engineering: Refine system prompts based on feedback
  • Embedding Updates: Refresh knowledge base with new metadata
  • Model Fine-tuning: Customize models for specific use cases
  • Context Expansion: Add new data sources and documentation

Saved Queries and Macros

Users can save frequently used queries and create reusable macros:

Saving Queries

-- Save this query as "monthly_revenue_report"
SELECT
  DATE_FORMAT(order_date, '%Y-%m') AS month,
  SUM(total_amount) AS revenue
FROM orders
WHERE order_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 12 MONTH)
GROUP BY month
ORDER BY month;

Creating Macros

Users can create parameterized macros for common analyses:

-- Macro: customer_analysis(customer_id, start_date, end_date)
SELECT
  c.customer_name,
  COUNT(o.order_id) AS order_count,
  SUM(o.total_amount) AS total_spent,
  AVG(o.total_amount) AS avg_order_value
FROM customers c
JOIN orders o ON c.customer_id = o.customer_id
WHERE c.customer_id = ${customer_id}
  AND o.order_date BETWEEN '${start_date}' AND '${end_date}'
GROUP BY c.customer_id, c.customer_name;
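One way such ${parameter} placeholders can be expanded is with Python's string.Template, whose syntax happens to match the macro notation above. This is a hypothetical sketch of parameter substitution, not Ilum's actual macro engine; the macro body is a simplified version of the example.

```python
from string import Template

# Hypothetical macro body using the ${param} placeholder syntax shown above.
MACRO = Template(
    "SELECT c.customer_name, SUM(o.total_amount) AS total_spent\n"
    "FROM customers c JOIN orders o ON c.customer_id = o.customer_id\n"
    "WHERE c.customer_id = ${customer_id}\n"
    "  AND o.order_date BETWEEN '${start_date}' AND '${end_date}'\n"
    "GROUP BY c.customer_name"
)

def expand_macro(template, **params):
    # substitute() raises KeyError if a required parameter is missing,
    # surfacing incomplete macro invocations before any SQL is run.
    return template.substitute(params)

sql = expand_macro(MACRO, customer_id=42,
                   start_date="2024-01-01", end_date="2024-06-30")
```

Note that naive string substitution is vulnerable to SQL injection; a real macro engine would bind parameters through the query engine instead.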

Integration with Ilum Ecosystem

The AI Data Analyst seamlessly integrates with other Ilum components:

Table Explorer Integration

  • Automatic Discovery: Discovers tables and relationships from Hive Metastore
  • Schema Awareness: Understands table structures and data types
  • Lineage Context: Uses data lineage information for better query suggestions

SQL Viewer Integration

  • Query Handoff: Generated queries can be opened in SQL Viewer for editing
  • Execution Context: Shares execution engines and cluster configurations
  • Result Sharing: Enables collaborative analysis workflows

Monitoring Integration

  • Job Tracking: Monitors AI-generated query executions
  • Performance Analysis: Provides insights into query performance
  • Resource Usage: Tracks computational resources for AI operations

Troubleshooting

Log Analysis & Troubleshooting

Common Log Patterns

ContextLengthExceededError

ERROR [AiService] Token limit exceeded: 8500 tokens used (limit: 8192).
Action: Reduce 'retrieval_limit' in configuration or switch to a model with a larger context window (e.g., gpt-4-32k).

MetadataSyncFailure

WARN [MetadataSync] Failed to vectorize table 'sales.orders': column 'metadata' contains binary data.
Action: Check Hive Metastore for unsupported column types or corrupt table statistics.

HallucinationDetected (Self-Correction)

INFO [QueryValidator] Generated SQL referenced column 'total_rev' which does not exist in 'sales_daily'. Correcting to 'total_revenue'.
Action: System auto-corrected the query using schema validation.
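The schema-validated self-correction shown in this log can be approximated with fuzzy string matching. The following sketch uses Python's difflib for illustration only; the actual validator presumably works against the parsed query and full catalog.

```python
import difflib

def correct_columns(referenced, schema_columns, cutoff=0.6):
    # Map each invalid referenced column to its closest real schema column,
    # mirroring the 'total_rev' -> 'total_revenue' correction in the log.
    fixes = {}
    for col in referenced:
        if col in schema_columns:
            continue  # column exists; nothing to fix
        match = difflib.get_close_matches(col, schema_columns, n=1, cutoff=cutoff)
        if match:
            fixes[col] = match[0]
    return fixes

schema = ["order_date", "total_revenue", "region"]
fixes = correct_columns(["total_rev", "region"], schema)
```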

Troubleshooting Steps

  1. Slow RAG Retrieval: If ilum_ai_retrieval_duration_seconds is high (>2s), check the vector store index. Ensure HNSW indexes are built:
    তৈরি  INDEX অন  embeddings ব্যবহার  hnsw ( vector vector_cosine_ops) ; 
  2. Authentication Failures: Verify the OIDC provider's JWKS endpoint reachability from the Ilum AI pod.
  3. Low Quality Responses: Inspect the system prompt actually injected into the LLM context via debug logs (enable ilum.ai.debug=true).

Authentication Issues

For OAuth2 authentication problems:

  1. Verify Provider Configuration: Check OAuth2 settings and endpoints
  2. Review Token Expiration: Ensure tokens are being refreshed properly
  3. Check Role Permissions: Verify user roles have appropriate access
  4. Validate SSL Certificates: Ensure secure connections are working

Performance Optimization

Vector Store Optimization

-- Optimize vector store indexes
CREATE INDEX idx_embeddings_similarity ON embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

-- Regular cleanup of old embeddings
DELETE FROM embeddings
WHERE created_at < NOW() - INTERVAL '30 days';

Cache Configuration

Enable response caching for better performance:

ilum-ai:
  cache:
    enabled: true
    ttl: 3600 # 1 hour
    maxSize: 1000
    type: "redis"
    redis:
      host: "ilum-redis"
      port: 6379

Security Considerations

Query Interception & Validation Layer

The AI Data Analyst does not simply pass generated SQL to the execution engine. It employs a rigorous interception layer:

  1. AST Analysis: Generated SQL is parsed into an Abstract Syntax Tree (AST).
  2. Permission Validation: The AST is traversed to verify that the requesting user has SELECT privileges on all referenced tables and columns.
  3. Command Restriction: Operations are strictly limited to read-only statements (SELECT, SHOW, DESCRIBE). DDL/DML operations (DROP, DELETE, UPDATE) are rejected at the parser level.
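A real interception layer parses the SQL into an AST before validating it. Purely as a simplified illustration of the command-restriction idea, a naive token-level guard might look like the hypothetical helper below (this is not Ilum's validator, and token matching is weaker than AST analysis):

```python
READ_ONLY_COMMANDS = {"SELECT", "SHOW", "DESCRIBE", "WITH", "EXPLAIN"}
FORBIDDEN = {"DROP", "DELETE", "UPDATE", "INSERT", "ALTER", "TRUNCATE", "CREATE"}

def is_read_only(sql):
    # Naive token check: a production validator traverses the full AST to
    # catch forbidden statements nested anywhere (e.g., after a semicolon).
    tokens = sql.strip().rstrip(";").upper().split()
    if not tokens or tokens[0] not in READ_ONLY_COMMANDS:
        return False
    return not any(t in FORBIDDEN for t in tokens)
```

The AST approach additionally lets the validator enumerate every referenced table and column for the permission check in step 2, which a token scan cannot do reliably.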

PII & Sensitive Data Handling

The system integrates a privacy preservation layer for handling sensitive data:

  • PII Detection: Regex and NLP-based classifiers identify potential Personally Identifiable Information (PII) in query results (e.g., emails, SSNs).
  • Dynamic Masking: Detected sensitive fields can be automatically masked (e.g., j***@example.com) before being returned to the UI.
  • Audit Logging: Original queries and masked results are logged with a cryptographic hash of the user identity for immutable audit trails.
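The email-masking behaviour described above (j***@example.com) can be sketched with a single regular expression. A minimal, illustrative Python version (real PII detection combines many such classifiers with NLP models):

```python
import re

# Keep the first character and the domain; mask everything in between.
EMAIL_RE = re.compile(
    r"\b([A-Za-z0-9._%+-])[A-Za-z0-9._%+-]*@([A-Za-z0-9.-]+\.[A-Za-z]{2,})\b")

def mask_emails(text):
    # j***@example.com style masking of detected email addresses.
    return EMAIL_RE.sub(lambda m: f"{m.group(1)}***@{m.group(2)}", text)

row = "Contact: john.doe@example.com, status: active"
masked = mask_emails(row)
```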

Compliance

For regulatory compliance:

  • GDPR Compliance: Supports data deletion and privacy requests
  • SOX Compliance: Maintains audit trails for financial data access
  • HIPAA Compliance: Implements healthcare data protection measures
  • SOC2 Compliance: Follows security and availability standards

Best Practices

Effective Query Formulation

To get the best results from the AI Data Analyst:

  1. Be Specific: Provide clear context and desired outcomes
  2. Use Domain Terms: Include relevant business terminology
  3. Specify Time Ranges: Always include date constraints when relevant
  4. Define Metrics: Clearly specify how calculations should be performed

Data Preparation

Optimize your data for AI analysis:

  1. Consistent Naming: Use clear, consistent column and table names
  2. Proper Documentation: Maintain up-to-date table and column comments
  3. Data Quality: Ensure data is clean and well-structured
  4. Metadata Enrichment: Add business context to technical metadata

Collaboration Workflows

Establish effective team workflows:

  1. Shared Queries: Use saved queries for common analyses
  2. Documentation: Document complex queries and their business purpose
  3. Review Process: Implement peer review for critical analyses
  4. Version Control: Track changes to important queries and macros

API Reference

The AI Data Analyst provides REST API endpoints for programmatic access:

Query Generation

POST /api/v1/ai/generate-query
Content-Type: application/json
Authorization: Bearer <token>

{
  "question": "Show me top customers by revenue this quarter",
  "context": {
    "database": "analytics",
    "user_role": "analyst"
  }
}

Query Execution

POST /api/v1/ai/execute-query
Content-Type: application/json
Authorization: Bearer <token>

{
  "sql": "SELECT customer_name, SUM(revenue) FROM...",
  "cluster_id": "prod-cluster-1"
}

Feedback Submission

POST /api/v1/ai/feedback
Content-Type: application/json
Authorization: Bearer <token>

{
  "query_id": "12345",
  "rating": "positive",
  "comment": "Query was accurate and well-optimized"
}