
Data Lifecycle Management

Overview

As data volumes grow, managing the lifecycle of stored data becomes critical for controlling storage costs, maintaining query performance, and meeting compliance requirements. Without active lifecycle management, data lakes accumulate stale snapshots, orphan files, and small fragmented data files that degrade both performance and cost efficiency.

Ilum provides data lifecycle management through the native maintenance operations of its supported table formats -- Apache Iceberg, Delta Lake, and Apache Hudi -- combined with automation through orchestration tools. Rather than introducing a proprietary abstraction, Ilum lets you execute standard SQL maintenance commands via the SQL Viewer and schedule them using the built-in Scheduler, Airflow, n8n, or Kestra.

This page covers:

  • Table compaction to merge small files and improve read performance
  • Orphan file cleanup to reclaim storage from failed or interrupted operations
  • Snapshot and metadata management to control metadata growth
  • Retention policies to automate expiration of old data and snapshots
  • Storage tiering to optimize cost by placing data on appropriate storage backends
  • Automation patterns to schedule maintenance as recurring workflows

Table Compaction

Frequent writes, updates, and deletes can produce many small files, which increases the overhead of file listing and metadata management, and reduces query throughput. Compaction merges these small files into larger, optimally sized files.

Iceberg

Use the rewrite_data_files procedure to compact an Iceberg table:

CALL spark_catalog.system.rewrite_data_files(
  table => 'db.my_table'
)

You can also filter which partitions to compact:

CALL spark_catalog.system.rewrite_data_files(
  table => 'db.my_table',
  where => 'event_date >= "2025-01-01"'
)
Tip

Compaction is a rewrite operation and will produce new snapshots. Plan to run snapshot expiration after compaction to avoid metadata bloat.

Delta Lake

Delta Lake provides the OPTIMIZE command for compaction:

OPTIMIZE db.my_table

For tables with high-cardinality filter columns, use ZORDER BY to co-locate related data within the same files, improving query performance on those columns:

OPTIMIZE db.my_table ZORDER BY (customer_id, order_date)

Hudi

Apache Hudi Merge on Read (MoR) tables use compaction to merge log files with base files, producing new file slices. Compaction can run inline (during writes) or asynchronously.

For inline compaction, set these table properties at creation time:

CREATE TABLE db.my_hudi_table (...) USING hudi
TBLPROPERTIES (
  type = 'mor',
  primaryKey = 'id',
  'hoodie.compact.inline' = 'true',
  'hoodie.compact.inline.max.delta.commits' = '5'
)

For asynchronous compaction, set hoodie.compact.inline to false and schedule a separate compaction job. Refer to the Apache Hudi documentation for details on async compaction strategies.

Trino

When using Trino as your SQL engine, you can optimize both Iceberg and Delta Lake tables with the optimize procedure:

ALTER TABLE db.my_table EXECUTE optimize

For Iceberg tables, you can also specify a file size threshold:

ALTER TABLE db.my_table EXECUTE optimize(file_size_threshold => '128MB')

When to Compact

  • After bulk ingestion that produces many small files
  • After a series of updates or deletes that fragment existing data files
  • On a recurring schedule (e.g., nightly or weekly) for tables with continuous writes
Warning

Compaction is resource-intensive. Schedule it during off-peak hours and allocate sufficient executor memory to avoid out-of-memory failures. For large tables, compact partition by partition rather than the entire table at once.
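The partition-by-partition approach can be sketched as a small driver that issues one rewrite_data_files call per partition. This is an illustrative sketch, not an Ilum API: the table name and partition values are placeholders, and in a real job the partition list would come from the table's partitions metadata and each statement would be passed to spark.sql.

```python
def partition_compaction_sql(table, date_column, dates):
    """Build one Iceberg compaction CALL per partition value, so each
    rewrite stays small instead of touching the whole table at once."""
    return [
        "CALL spark_catalog.system.rewrite_data_files("
        f"table => '{table}', "
        f"where => '{date_column} = \"{d}\"')"
        for d in dates
    ]

# Hypothetical partition values; a real job would list them from metadata.
statements = partition_compaction_sql("db.my_table", "event_date",
                                      ["2025-01-01", "2025-01-02"])
for sql in statements:
    print(sql)
    # In the actual maintenance job: spark.sql(sql)
```

Running the statements sequentially keeps peak executor memory bounded by the size of a single partition rather than the whole table.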


Orphan File Cleanup

Orphan files are data files that exist in storage but are no longer referenced by any table metadata. They can result from:

  • Failed or interrupted write operations
  • Schema evolution operations that create new files before dropping old references
  • Concurrent writes where one transaction is aborted

These files consume storage without serving any purpose.

Iceberg

Use the remove_orphan_files procedure to clean up unreferenced files:

CALL spark_catalog.system.remove_orphan_files(
  table => 'db.my_table',
  older_than => TIMESTAMP '2025-12-01 00:00:00'
)
Warning

Always set older_than to a timestamp at least 3 days in the past. Files created by in-progress writes may not yet be referenced in metadata. Removing them prematurely can corrupt active operations.

To preview which files would be removed without actually deleting them, use the dry_run option:

CALL spark_catalog.system.remove_orphan_files(
  table => 'db.my_table',
  older_than => TIMESTAMP '2025-12-01 00:00:00',
  dry_run => true
)
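To apply the three-day safety margin consistently rather than hard-coding timestamps, the cutoff can be computed at job time. A minimal sketch in plain Python -- the helper name and table are hypothetical, not part of Ilum:

```python
from datetime import datetime, timedelta, timezone

def safe_orphan_cutoff(days=3):
    """Return an older_than timestamp at least `days` in the past,
    formatted for use in a SQL TIMESTAMP literal."""
    if days < 3:
        # Guard the recommended margin: files from in-progress writes may
        # not yet be referenced in metadata.
        raise ValueError("keep at least a 3-day margin to protect in-progress writes")
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    return cutoff.strftime("%Y-%m-%d %H:%M:%S")

# Splice the computed cutoff into the procedure call.
sql = ("CALL spark_catalog.system.remove_orphan_files("
       f"table => 'db.my_table', older_than => TIMESTAMP '{safe_orphan_cutoff()}')")
```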

Delta Lake

Delta Lake handles orphan file removal through the VACUUM command (covered in Snapshot & Metadata Management). Files not referenced by the transaction log and older than the retention threshold are removed during vacuum.

Scheduling Cleanup

Orphan file cleanup should run on a regular schedule. Use the Ilum Scheduler to set up a recurring Spark job that executes cleanup procedures, or integrate it into an Airflow DAG alongside other maintenance tasks.


Snapshot & Metadata Management

Table formats maintain snapshots and metadata for features like time travel and rollback. Over time, accumulating snapshots increases metadata size, slows down query planning, and consumes additional storage for data files that are only retained by old snapshots.

Iceberg Snapshot Expiration

Expire old snapshots to remove metadata entries and allow referenced data files to be cleaned up:

CALL spark_catalog.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2025-12-01 00:00:00'
)

You can also limit the number of snapshots to retain:

CALL spark_catalog.system.expire_snapshots(
  table => 'db.my_table',
  older_than => TIMESTAMP '2025-12-01 00:00:00',
  retain_last => 10
)

Iceberg also supports automatic metadata cleanup via table properties:

ALTER TABLE db.my_table SET TBLPROPERTIES (
  'write.metadata.delete-after-commit.enabled' = 'true',
  'write.metadata.previous-versions-max' = '10'
)

Delta Lake VACUUM

The VACUUM command removes data files that are no longer referenced by the Delta transaction log and are older than the retention threshold:

VACUUM db.my_table RETAIN 168 HOURS

Use DRY RUN to preview the files that would be deleted:

VACUUM db.my_table RETAIN 168 HOURS DRY RUN
Note

The default Delta Lake retention period is 7 days (168 hours). Setting it below this threshold requires disabling a safety check, which is not recommended as it can break time travel queries and concurrent reads.

For more details on time travel, snapshots, and tagging in each table format, see the SQL Viewer documentation.


Retention Policies

Retention policies define how long data, snapshots, and log files are kept. Configuring retention at the table level ensures that maintenance operations automatically respect these boundaries.

Iceberg Retention Properties

Set table properties to control automatic snapshot expiration:

ALTER TABLE db.my_table SET TBLPROPERTIES (
  'history.expire.max-snapshot-age-ms' = '604800000', -- 7 days
  'history.expire.min-snapshots-to-keep' = '5'
)

These properties are respected by the expire_snapshots procedure when it is called without explicit older_than or retain_last arguments.
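Because max-snapshot-age-ms is specified in milliseconds, it is easy to mistype by hand. A one-line helper (hypothetical, not part of Ilum) makes the conversion explicit:

```python
def days_to_ms(days):
    """Convert a retention period in days to the millisecond string
    expected by Iceberg's history.expire.* table properties."""
    return str(days * 24 * 60 * 60 * 1000)

# The property values used in the ALTER TABLE example above.
props = {
    "history.expire.max-snapshot-age-ms": days_to_ms(7),  # '604800000'
    "history.expire.min-snapshots-to-keep": "5",
}
```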

Delta Lake Retention Properties

Configure retention for deleted files and transaction log entries:

ALTER TABLE db.my_table SET TBLPROPERTIES (
  'delta.deletedFileRetentionDuration' = 'interval 30 days',
  'delta.logRetentionDuration' = 'interval 90 days'
)
  • deletedFileRetentionDuration controls how long deleted data files are kept before VACUUM can remove them.
  • logRetentionDuration controls how long transaction log entries are retained, which determines the time travel window.

Building a Retention Schedule

A practical retention strategy often combines multiple tiers based on data age:

| Tier | Age | Storage | Action |
| --- | --- | --- | --- |
| Hot | 0--30 days | Fast SSD (MinIO local) | Full snapshot retention, no cleanup |
| Warm | 30--90 days | S3 / GCS | Expire snapshots older than 30 days, compact weekly |
| Cold | 90 days -- 1 year | S3 Infrequent Access / Archive | Retain only latest snapshot, vacuum aggressively |
| Purge | > 1 year | -- | Drop tables or partitions per compliance policy |

Implement this schedule as a recurring Airflow DAG that runs different maintenance procedures based on the age of each partition. See Automation Patterns for examples.
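Inside such a DAG, the tier decision itself is a simple age lookup. A sketch of the mapping from the table above -- the boundaries are the example values and should be adjusted to your own retention policy:

```python
def tier_for_age(age_days):
    """Map a partition's age in days to the retention tier from the
    example schedule: hot, warm, cold, or purge."""
    if age_days <= 30:
        return "hot"      # full snapshot retention, no cleanup
    if age_days <= 90:
        return "warm"     # expire snapshots older than 30 days, compact weekly
    if age_days <= 365:
        return "cold"     # retain only latest snapshot, vacuum aggressively
    return "purge"        # drop per compliance policy
```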


Storage Tiering

Ilum supports multiple storage backends -- MinIO (default), S3, GCS, WASBS, and HDFS -- which can be configured per cluster. By leveraging multiple storage backends, you can place frequently accessed data on fast local storage and move older or infrequently accessed data to cheaper object storage.

For instructions on adding storage backends to your clusters, see Clusters and Storage.

Iceberg Storage Location

Iceberg tables support configuring the write location per table, allowing you to direct new data to a specific storage tier:

ALTER TABLE db.my_table SET TBLPROPERTIES (
  'write.data.path' = 's3a://cold-storage-bucket/warehouse/my_table/data'
)

For partitioned tables, you can implement tier-based placement by writing different partitions to different storage locations. For example, use a Spark job that reads recent partitions from warm storage and rewrites them to cold storage:

spark.sql("""
INSERT OVERWRITE db.my_table PARTITION (event_date)
SELECT * FROM db.my_table
WHERE event_date < '2025-01-01'
""")

After rewriting, update the table's data path for new writes to target the appropriate tier.

Partition-Based Tiering Strategy

A common approach is to use date partitions to separate data across tiers:

  1. Recent partitions (last 30 days) remain on fast local MinIO storage for low-latency queries
  2. Older partitions (30--365 days) are moved to S3 or GCS for cost-effective storage
  3. Archived partitions (> 1 year) are moved to infrequent-access storage classes or deleted

Automate this movement with scheduled Spark jobs that:

  • Read data from the source tier
  • Write data to the destination tier
  • Update table metadata to reflect the new file locations

Delta Lake Storage Management

Delta Lake tables use the location specified at table creation. To implement tiering, create separate tables or use CREATE TABLE ... LOCATION to control placement:

CREATE TABLE db.cold_archive
USING delta
LOCATION 's3a://cold-storage-bucket/warehouse/cold_archive'
AS SELECT * FROM db.my_table WHERE event_date < '2025-01-01'

Automation Patterns

Manual execution of lifecycle operations is error-prone and unsustainable. Ilum provides several automation options for scheduling recurring maintenance.

Airflow DAG

Build an Airflow DAG that runs weekly maintenance across your tables. Submit maintenance SQL via the LivyOperator to execute on a Spark cluster managed by Ilum.

from airflow import DAG
from airflow.providers.apache.livy.operators.livy import LivyOperator
from datetime import datetime

with DAG(
    dag_id='weekly_table_maintenance',
    schedule='0 2 * * 0',  # Every Sunday at 2 AM
    start_date=datetime(2025, 1, 1),
    catchup=False,
) as dag:

    compact = LivyOperator(
        task_id='compact_tables',
        file='s3a://ilum-files/maintenance/compact.py',
        livy_conn_id='ilum-livy-proxy',
        polling_interval=30,
    )

    expire_snapshots = LivyOperator(
        task_id='expire_snapshots',
        file='s3a://ilum-files/maintenance/expire_snapshots.py',
        livy_conn_id='ilum-livy-proxy',
        polling_interval=30,
    )

    cleanup = LivyOperator(
        task_id='orphan_cleanup',
        file='s3a://ilum-files/maintenance/cleanup.py',
        livy_conn_id='ilum-livy-proxy',
        polling_interval=30,
    )

    compact >> expire_snapshots >> cleanup

The maintenance scripts (e.g., compact.py) contain the SQL calls shown in earlier sections, executed via spark.sql(...). For complete details on configuring Airflow with Ilum, see the Airflow documentation.

Ilum Built-in Scheduler

For simpler recurring tasks, use the Ilum Scheduler to create a scheduled Spark job that runs maintenance SQL. This avoids the overhead of deploying a full orchestration platform.

  1. Write a PySpark script that executes your maintenance procedures
  2. Upload the script to your storage (e.g., MinIO)
  3. Create a schedule in the Ilum UI with an appropriate CRON expression (e.g., 0 3 * * 0 to run weekly on Sunday at 3 AM)
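The script from step 1 can be as small as a loop over the maintenance statements shown earlier. A sketch, assuming an Iceberg table named db.my_table; the run function receives the job's SparkSession, and the cutoff timestamp is a placeholder:

```python
# Step 1 script (e.g., maintenance.py): compaction first, then snapshot
# expiration, then orphan cleanup, per the recommended ordering.
MAINTENANCE_SQL = [
    "CALL spark_catalog.system.rewrite_data_files(table => 'db.my_table')",
    "CALL spark_catalog.system.expire_snapshots("
    "table => 'db.my_table', retain_last => 10)",
    "CALL spark_catalog.system.remove_orphan_files("
    "table => 'db.my_table', older_than => TIMESTAMP '2025-12-01 00:00:00')",
]

def run(spark):
    """Execute each maintenance procedure in order on the given SparkSession."""
    for stmt in MAINTENANCE_SQL:
        spark.sql(stmt)
```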

n8n Workflow

With n8n, you can build a visual workflow that triggers Spark maintenance jobs through Ilum's API. This is useful when maintenance should be triggered by external events (e.g., after a data pipeline completes) rather than on a fixed schedule.

Kestra Pipeline

Kestra provides a declarative YAML-based approach to defining maintenance pipelines:

id: iceberg-maintenance
namespace: ilum.maintenance
tasks:
  - id: compact
    type: io.kestra.plugin.spark.SparkCLI
    master: "k8s://https://ilum-core:9888"
    args:
      - "--class"
      - "org.apache.spark.sql.IcebergMaintenance"
      - "s3a://ilum-files/maintenance/iceberg-maintenance.jar"
triggers:
  - id: weekly
    type: io.kestra.plugin.core.trigger.Schedule
    cron: "0 4 * * 0"
Recommended Maintenance Cadence

| Operation | Frequency | Notes |
| --- | --- | --- |
| Table compaction | Weekly or after large ingestions | Partition-level for large tables |
| Snapshot expiration | Weekly | Run after compaction |
| Orphan file cleanup | Weekly or bi-weekly | Set older_than to at least 3 days ago |
| VACUUM (Delta) | Weekly | Keep retention at 7+ days |
| Storage tier migration | Monthly | Move cold partitions to cheaper storage |
Tip

Combine related operations into a single scheduled job or DAG to reduce cluster overhead. Run compaction first, then snapshot expiration, then orphan file cleanup -- this order ensures that files produced by compaction are properly tracked before cleanup runs.