
Data Lifecycle Management

Overview

As data volumes grow, managing the lifecycle of stored data becomes critical for controlling storage costs, maintaining query performance, and meeting compliance requirements. Without active lifecycle management, data lakes accumulate stale snapshots, orphan files, and small fragmented data files that degrade both performance and cost efficiency.

Ilum provides data lifecycle management through the native maintenance operations of its supported table formats -- Apache Iceberg, Delta Lake, and Apache Hudi -- combined with automation through orchestration tools. Rather than introducing a proprietary abstraction, Ilum lets you execute standard SQL maintenance commands via the SQL Viewer and schedule them using the built-in Scheduler, Airflow, n8n, or Kestra.

This page covers:

  • Table compaction to merge small files and improve read performance
  • Orphan file cleanup to reclaim storage from failed or interrupted operations
  • Snapshot and metadata management to control metadata growth
  • Retention policies to automate expiration of old data and snapshots
  • Storage tiering to optimize cost by placing data on appropriate storage backends
  • Automation patterns to schedule maintenance as recurring workflows

Table Compaction

Frequent writes, updates, and deletes can produce many small files, which increases the overhead of file listing and metadata management, and reduces query throughput. Compaction merges these small files into larger, optimally sized files.

Iceberg

Use the rewrite_data_files procedure to compact an Iceberg table:

CALL spark_catalog.system.rewrite_data_files(
  table => 'db.my_table'
)

You can also filter which partitions to compact:

CALL spark_catalog.system.rewrite_data_files(
  table => 'db.my_table',
  where => 'event_date >= "2025-01-01"'
)
Tip

Compaction is a rewrite operation and will produce new snapshots. Plan to run snapshot expiration after compaction to avoid metadata bloat.

Delta Lake

Delta Lake provides the OPTIMIZE command for compaction:

OPTIMIZE db.my_table

For tables with high-cardinality filter columns, use ZORDER BY to co-locate related data within the same files, improving query performance on those columns:

OPTIMIZE db.my_table ZORDER BY (customer_id, order_date)

Hudi

Apache Hudi Merge on Read (MoR) tables use compaction to merge log files with base files, producing new file slices. Compaction can run inline (during writes) or asynchronously.

For inline compaction, set these table properties at creation time:

CREATE TABLE db.my_hudi_table (...) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'id',
  'hoodie.compact.inline' = 'true',
  'hoodie.compact.inline.max.delta.commits' = '5'
)

For asynchronous compaction, set hoodie.compact.inline to false and schedule a separate compaction job. Refer to the Apache Hudi documentation for details on async compaction strategies.

Trino

When using Trino as your SQL engine, you can optimize both Iceberg and Delta Lake tables with the optimize procedure:

ALTER TABLE db.my_table EXECUTE optimize

For Iceberg tables, you can also specify a file size threshold:

ALTER TABLE db.my_table EXECUTE optimize(file_size_threshold => '128MB')

When to Compact

  • After bulk ingestion that produces many small files
  • After a series of updates or deletes that fragment existing data files
  • On a recurring schedule (e.g., nightly or weekly) for tables with continuous writes
Warning

Compaction is resource-intensive. Schedule it during off-peak hours and allocate sufficient executor memory to avoid out-of-memory failures. For large tables, compact partition by partition rather than the entire table at once.
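The partition-by-partition approach can be sketched as a small driver that issues one rewrite_data_files call per partition. This is an illustrative sketch, not an Ilum API: the table name and partition values are placeholders, and in a real job the partition list would come from the table's partitions metadata and each statement would be passed to spark.sql.

```python
def partition_compaction_sql(table, date_column, dates):
    """Build one Iceberg compaction CALL per partition value, so each
    rewrite stays small instead of touching the whole table at once."""
    return [
        "CALL spark_catalog.system.rewrite_data_files("
        f"table => '{table}', "
        f"where => '{date_column} = \"{d}\"')"
        for d in dates
    ]

# Hypothetical partition values; a real job would list them from metadata.
statements = partition_compaction_sql("db.my_table", "event_date",
                                      ["2025-01-01", "2025-01-02"])
for sql in statements:
    print(sql)
    # In the actual maintenance job: spark.sql(sql)
```

Running the statements sequentially keeps peak executor memory bounded by the size of a single partition rather than the whole table.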


Orphan File Cleanup

Orphan files are data files that exist in storage but are no longer referenced by any table metadata. They can result from:

  • Failed or interrupted write operations
  • Schema evolution operations that create new files before dropping old references
  • Concurrent writes where one transaction is aborted

These files consume storage without serving any purpose.

Iceberg

Use the remove_orphan_files procedure to clean up unreferenced files:

CALL spark_catalog.system.remove_orphan_files(
  table => 'db.my_table',
  older_than => TIMESTAMP '2025-12-01 00:00:00'
)
Warning

Always set older_than to a timestamp at least 3 days in the past. Files created by in-progress writes may not yet be referenced in metadata. Removing them prematurely can corrupt active operations.

To preview which files would be removed without actually deleting them, use the dry_run option:

CALL spark_catalog.system.remove_orphan_files(
  table => 'db.my_table',
  older_than => TIMESTAMP '2025-12-01 00:00:00',
  dry_run => true
)
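To apply the three-day safety margin consistently rather than hard-coding timestamps, the cutoff can be computed at job time. A minimal sketch in plain Python -- the helper name and table are hypothetical, not part of Ilum:

```python
from datetime import datetime, timedelta, timezone

def safe_orphan_cutoff(days=3):
    """Return an older_than timestamp at least `days` in the past,
    formatted for use in a SQL TIMESTAMP literal."""
    if days < 3:
        # Guard the recommended margin: files from in-progress writes may
        # not yet be referenced in metadata.
        raise ValueError("keep at least a 3-day margin to protect in-progress writes")
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    return cutoff.strftime("%Y-%m-%d %H:%M:%S")

# Splice the computed cutoff into the procedure call.
sql = ("CALL spark_catalog.system.remove_orphan_files("
       f"table => 'db.my_table', older_than => TIMESTAMP '{safe_orphan_cutoff()}')")
```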

Delta Lake

Delta Lake handles orphan file removal through the VACUUM command (covered in Snapshot & Metadata Management). Files not referenced by the transaction log and older than the retention threshold are removed during vacuum.

Scheduling Cleanup

Orphan file cleanup should run on a regular schedule. Use the Ilum Scheduler to set up a recurring Spark job that executes cleanup procedures, or integrate it into an Airflow DAG alongside other maintenance tasks.


Snapshot & Metadata Management

Table formats maintain snapshots and metadata for features like time travel and rollback. Over time, accumulating snapshots increases metadata size, slows down query planning, and consumes additional storage for data files that are only retained by old snapshots.

Iceberg Snapshot Expiration

Expire old snapshots to remove metadata entries and allow referenced data files to be cleaned up:

CALL spark_catalog.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2025-12-01 00:00:00'
)

You can also limit the number of snapshots to retain:

CALL spark_catalog.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2025-12-01 00:00:00',
  retain_last => 10
)

Iceberg also supports automatic metadata cleanup via table properties:

ALTER TABLE db.my_table SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '10'
)

Delta Lake VACUUM

The VACUUM command removes data files that are no longer referenced by the Delta transaction log and are older than the retention threshold:

VACUUM db.my_table RETAIN 168 HOURS

Use DRY RUN to preview the files that would be deleted:

VACUUM db.my_table RETAIN 168 HOURS DRY RUN
Note

The default Delta Lake retention period is 7 days (168 hours). Setting it below this threshold requires disabling a safety check, which is not recommended as it can break time travel queries and concurrent reads.

For more details on time travel, snapshots, and tagging in each table format, see the SQL Viewer documentation.


Retention Policies

Retention policies define how long data, snapshots, and log files are kept. Configuring retention at the table level ensures that maintenance operations automatically respect these boundaries.

Iceberg Retention Properties

Set table properties to control automatic snapshot expiration:

ALTER TABLE db.my_table SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '604800000', -- 7 days
  'history.expire.min-snapshots-to-keep' = '5'
)

These properties are respected by the expire_snapshots procedure when it is called without explicit older_than or retain_last arguments.
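Because max-snapshot-age-ms is specified in milliseconds, it is easy to mistype by hand. A one-line helper (hypothetical, not part of Ilum) makes the conversion explicit:

```python
def days_to_ms(days):
    """Convert a retention period in days to the millisecond string
    expected by Iceberg's history.expire.* table properties."""
    return str(days * 24 * 60 * 60 * 1000)

# The property values used in the ALTER TABLE example above.
props = {
    "history.expire.max-snapshot-age-ms": days_to_ms(7),  # '604800000'
    "history.expire.min-snapshots-to-keep": "5",
}
```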

Delta Lake Retention Properties

Configure retention for deleted files and transaction log entries:

ALTER TABLE db.my_table SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 30 days',
  'delta.logRetentionDuration' = 'interval 90 days'
)
  • deletedFileRetentionDuration controls how long deleted data files are kept before VACUUM can remove them.
  • logRetentionDuration controls how long transaction log entries are retained, which determines the time travel window.

Building a Retention Schedule

A practical retention strategy often combines multiple tiers based on data age:

| Tier | Age | Storage | Action |
| --- | --- | --- | --- |
| Hot | 0--30 days | Fast SSD (MinIO local) | Full snapshot retention, no cleanup |
| Warm | 30--90 days | S3 / GCS | Expire snapshots older than 30 days, compact weekly |
| Cold | 90 days -- 1 year | S3 Infrequent Access / Archive | Retain only latest snapshot, vacuum aggressively |
| Purge | > 1 year | -- | Drop tables or partitions per compliance policy |

Implement this schedule as a recurring Airflow DAG that runs different maintenance procedures based on the age of each partition. See Automation Patterns for examples.
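Inside such a DAG, the tier decision itself is a simple age lookup. A sketch of the mapping from the table above -- the boundaries are the example values and should be adjusted to your own retention policy:

```python
def tier_for_age(age_days):
    """Map a partition's age in days to the retention tier from the
    example schedule: hot, warm, cold, or purge."""
    if age_days <= 30:
        return "hot"      # full snapshot retention, no cleanup
    if age_days <= 90:
        return "warm"     # expire snapshots older than 30 days, compact weekly
    if age_days <= 365:
        return "cold"     # retain only latest snapshot, vacuum aggressively
    return "purge"        # drop per compliance policy
```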


Storage Tiering

Ilum supports multiple storage backends -- MinIO (default), S3, GCS, WASBS, and HDFS -- which can be configured per cluster. By leveraging multiple storage backends, you can place frequently accessed data on fast local storage and move older or infrequently accessed data to cheaper object storage.

For instructions on adding storage backends to your clusters, see Clusters and Storage.

Iceberg Storage Location

Iceberg tables support configuring the write location per table, allowing you to direct new data to a specific storage tier:

ALTER TABLE db.my_table SET TBLPROPERTIES (
  'write.data.path' = 's3a://cold-storage-bucket/warehouse/my_table/data'
)

For partitioned tables, you can implement tier-based placement by writing different partitions to different storage locations. For example, use a Spark job that reads recent partitions from warm storage and rewrites them to cold storage:

spark.sql("""
INSERT OVERWRITE db.my_table PARTITION (event_date)
SELECT * FROM db.my_table
WHERE event_date < '2025-01-01'
""")

After rewriting, update the table's data path for new writes to target the appropriate tier.

Partition-Based Tiering Strategy

A common approach is to use date partitions to separate data across tiers:

  1. Recent partitions (last 30 days) remain on fast local MinIO storage for low-latency queries
  2. Older partitions (30--365 days) are moved to S3 or GCS for cost-effective storage
  3. Archived partitions (> 1 year) are moved to infrequent-access storage classes or deleted

Automate this movement with scheduled Spark jobs that:

  • Read data from the source tier
  • Write data to the destination tier
  • Update table metadata to reflect the new file locations

Delta Lake Storage Management

Delta Lake tables use the location specified at table creation. To implement tiering, create separate tables or use CREATE TABLE ... LOCATION to control placement:

CREATE TABLE db.cold_archive
USING delta
LOCATION 's3a://cold-storage-bucket/warehouse/cold_archive'
AS SELECT * FROM db.my_table WHERE event_date < '2025-01-01'

Automation Patterns

Manual execution of lifecycle operations is error-prone and unsustainable. Ilum provides several automation options for scheduling recurring maintenance.

Airflow DAG

Build an Airflow DAG that runs weekly maintenance across your tables. Submit maintenance SQL via the LivyOperator to execute on a Spark cluster managed by Ilum.

from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator
from datetime import datetime

with DAG(
    dag_id='weekly_table_maintenance',
    schedule='0 2 * * 0',  # Every Sunday at 2 AM
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:

    compact = LivyOperator(
        task_id='compact_tables',
        file='s3a://ilum-files/maintenance/compact.py',
        livy_conn_id='ilum-livy-proxy',
        polling_interval=30,
    )

    expire_snapshots = LivyOperator(
        task_id='expire_snapshots',
        file='s3a://ilum-files/maintenance/expire_snapshots.py',
        livy_conn_id='ilum-livy-proxy',
        polling_interval=30,
    )

    cleanup = LivyOperator(
        task_id='orphan_cleanup',
        file='s3a://ilum-files/maintenance/cleanup.py',
        livy_conn_id='ilum-livy-proxy',
        polling_interval=30,
    )

    compact >> expire_snapshots >> cleanup

The maintenance scripts (e.g., compact.py) contain the SQL calls shown in earlier sections, executed via spark.sql(...). For complete details on configuring Airflow with Ilum, see the Airflow documentation.

Ilum Built-in Scheduler

For simpler recurring tasks, use the Ilum Scheduler to create a scheduled Spark job that runs maintenance SQL. This avoids the overhead of deploying a full orchestration platform.

  1. Write a PySpark script that executes your maintenance procedures
  2. Upload the script to your storage (e.g., MinIO)
  3. Create a schedule in the Ilum UI with an appropriate CRON expression (e.g., 0 3 * * 0 to run weekly on Sunday at 3 AM)
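The script from step 1 can be as small as a loop over the maintenance statements shown earlier. A sketch, assuming an Iceberg table named db.my_table; the run function receives the job's SparkSession, and the cutoff timestamp is a placeholder:

```python
# Step 1 script (e.g., maintenance.py): compaction first, then snapshot
# expiration, then orphan cleanup, per the recommended ordering.
MAINTENANCE_SQL = [
    "CALL spark_catalog.system.rewrite_data_files(table => 'db.my_table')",
    "CALL spark_catalog.system.expire_snapshots("
    "table => 'db.my_table', retain_last => 10)",
    "CALL spark_catalog.system.remove_orphan_files("
    "table => 'db.my_table', older_than => TIMESTAMP '2025-12-01 00:00:00')",
]

def run(spark):
    """Execute each maintenance procedure in order on the given SparkSession."""
    for stmt in MAINTENANCE_SQL:
        spark.sql(stmt)
```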

n8n Workflow

With n8n, you can build a visual workflow that triggers Spark maintenance jobs through Ilum's API. This is useful when maintenance should be triggered by external events (e.g., after a data pipeline completes) rather than on a fixed schedule.

Kestra Pipeline

Kestra provides a declarative YAML-based approach to defining maintenance pipelines:

id: iceberg-maintenance
namespace: ilum.maintenance
tasks:
  - id: compact
    type: io.kestra.plugin.spark.SparkCLI
    master: "k8s://https://ilum-core:9888"
    args:
      - "--class"
      - "org.apache.spark.sql.IcebergMaintenance"
      - "s3a://ilum-files/maintenance/iceberg-maintenance.jar"
triggers:
  - id: weekly
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 4 * * 0"
Recommended Maintenance Cadence

| Operation | Frequency | Notes |
| --- | --- | --- |
| Table compaction | Weekly or after large ingestions | Partition-level for large tables |
| Snapshot expiration | Weekly | Run after compaction |
| Orphan file cleanup | Weekly or bi-weekly | Set older_than to at least 3 days ago |
| VACUUM (Delta) | Weekly | Keep retention at 7+ days |
| Storage tier migration | Monthly | Move cold partitions to cheaper storage |
Tip

Combine related operations into a single scheduled job or DAG to reduce cluster overhead. Run compaction first, then snapshot expiration, then orphan file cleanup -- this order ensures that files produced by compaction are properly tracked before cleanup runs.