মূল বিষয়বস্তুতে যান

নেসি ক্যাটালগ

সংক্ষিপ্ত বিবরণ

Project Nessie is an open-source transactional data catalog that brings Git-like version control to data lakes. It enables you to manage multiple versions of your data using branches, tags, and commits—similar to how Git manages source code.

With Ilum’s integration, you can leverage Nessie’s version control features directly in your Spark environment. This allows you to branch, tag, and merge data changes safely and efficiently.

Unlike traditional Hive or Glue catalogs, which only track the latest state of each table, Nessie records all changes as commits in a timeline. Each commit represents a consistent snapshot of your data lake. Changes are isolated until committed, ensuring incomplete or in-progress updates are never visible to other users or jobs. Once finalized, changes become atomically visible, guaranteeing consistency.

Key Features: Nessie vs. Traditional Catalogs

বৈশিষ্ট্য Traditional Catalogsনেসি ক্যাটালগ
Branchingনা Yes (Git-like)
Isolated EnvironmentsManual/ComplexSimple, via branches
Commit History & Time TravelLimited/Per-tableFull catalog history
মাল্টি-টেবিল লেনদেন না Yes (atomic commits)
Collaboration & Governanceন্যূনতম Built-in, audit log

Highlights

  • Branching: Create multiple isolated branches (e.g., মূল , দেব , staging) without duplicating data. Branches are lightweight pointers to metadata snapshots.
  • Isolated Environments: Use the same data lake for dev, staging, and prod by isolating changes in branches. No need for separate catalogs or data copies.
  • Commit History & Time Travel: Nessie maintains a unified commit log. Inspect, audit, or time-travel to any previous state by commit hash or timestamp.
  • Atomic Multi-table Transactions: Commit changes across multiple tables as a single atomic operation. All succeed or none do.
  • Collaboration & Governance: Work on separate branches, merge changes, and track who changed what and when. Enables safe experimentation and robust auditability.

Core Concepts

Branches

একটি শাখা is an independent line of development for your data catalog. Branches start as copies of existing branches and track changes separately. They are lightweight, referencing the same data files but different metadata. The default branch is usually মূল .

Tags

একটি ট্যাগ is a read-only label pointing to a specific commit. Use tags to mark stable versions or important milestones (e.g., v1.0, 2025-06-release). Tags are immutable bookmarks.

Commits

একটি কমিট করুন is a set of changes recorded as a single atomic unit. Each commit has a unique ID, timestamp, author, and optional message. The commit log provides full catalog versioning.

Using Nessie in Ilum

সতর্কীকরণ

Nessie is not enabled by default in Ilum. To enable it, see the productionপৃষ্ঠা।

Ilum supports প্রজেক্ট নেসি as a catalog for version-controlled data management. When using Ilum notebooks or Spark jobs with Apache Iceberg, you can perform Git-like operations—branching, merging, tagging—directly via SQL.

Nessie requires special configuration with Spark. If used as the main Ilum catalog, the nessie_catalog should be pre-configured in your Spark session.

Additionally, make sure the Spark image used in your cluster has the Nessie client installed. In particular, you will need:

  • Iceberg Spark Runtime (org.apache.iceberg:iceberg-spark-runtime-<SPARK_VERSION>_<SCALA_VERSION>) - Required for Nessie support.
  • Iceberg AWS Bundle (org.apache.iceberg:iceberg-aws-bundle) - Required for S3 support.
  • Nessie SQL Extensions (org.projectnessie.nessie-integrations:nessie-spark-extensions-<SPARK_VERSION>_<SCALA_VERSION>) - Required for Nessie-specific SQL operations.

You can also use Ilum’s custom Spark image for Nessie: ilum/spark:<SPARK_VERION>-nessie, which includes all the required dependencies.

সতর্কীকরণ

Ilum’s spark-nessie image does not include any Delta table dependencies, so you will need to remove the default cluster configuration for Delta tables if you use this image. In particular:

নাম মান
spark.sql.ক্যাটালগ.স্পার্ক_ক্যাটালগ org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.sql.এক্সটেনশন io.delta.sql.DeltaSparkSessionExtension

Additionally, in order for Nessie’s AWS S3 support to work, you need to pass the credentials to the Spark instance differently than with the Hive&Delta solution. To pass the credentials, you can use the spark.driver.extraJavaOptionsএবং spark.executor.extraJavaOptions Spark configuration options. In particular, the following options should be appended:

  • -Daws.region=<REGION> - The default Minio region is ইউএস-ইস্ট-১ .
  • -Daws.accessKeyId=<ACCESS_KEY_ID> - The default Minio access key is মিনিওঅ্যাডমিন .
  • -Daws.secretAccessKey=<SECRET_ACCESS_KEY> - The default Minio secret key is মিনিওঅ্যাডমিন . For a more secure solution, the credentials should be built into the Spark image.

Nessie Walkthrough

In the beginning, it is recommended to create anything inside the main branch so that you avoid problems with merging into an empty branch:

তৈরি  সারণী  nessie_catalog. ব্যবহারকারী ( 
user_id আইএনটি ,
user_name ভার্চার ( 20 )
) ;

Create a Branch

তৈরি  BRANCH dev IN nessie_catalog থেকে মূল ; 

And to verify everything, list all branches and tags:

LIST REFERENCES IN nessie_catalog; 

Work on a Branch

Create a table and insert data in the দেব branch with the fully qualified name (<TableName>@<BranchName>বা <TableName>@<CommitHash>):

তৈরি  সারণী  nessie_catalog. ` sales@dev` ( 
sale_timestamp CHAR( 10 ) ,
sale_amount আইএনটি ,
payment_method ভার্চার ( 20 )
) ;

সন্নিবেশ এর মধ্যে nessie_catalog. ` sales@dev` মানসমূহ
( '2025-06-01', 1000, 'Online') ,
( '2025-06-02', 1500, 'InStore') ,
( '2025-06-03', 800, 'Online') ,
( '2025-06-04', 1200, 'Mobile') ,
( '2025-06-05', 950, 'InStore') ;
নির্বাচন গণনা ( * ) থেকে nessie_catalog. ` sales@dev` ;

Or use the USE statement to switch a context to a specific branch:

USE BRANCH dev IN nessie_catalog; 
নির্বাচন গণনা ( * ) থেকে nessie_catalog. sales;
নোট

Because Ilum’s SQL executor treats each query as a stateless entity, using the USE statement requires executing all related statements together.
To do this, select the entire query block in the editor and then press কার্যকর .

And show the log of all commits done:

দেখান  LOG অন দেব IN nessie_catalog; 

Merge Branches

MERGE BRANCH dev এর মধ্যে মূল IN nessie_catalog; 
দেখান টেবিল IN nessie_catalog;
নোট

If you see an error of No common ancestor in parents of <X> and <Y>, this can mean that the branch you are trying to merge into is empty. This will cause the merge to fail, even if the branch you are trying to merge was correctly created from the parent branch.

সর্বোত্তম অনুশীলন

  • Develop in Isolation: Use branches for development or experiments. Promote changes through a hierarchy (e.g., dev → staging → main).
  • Merge Frequently: Merge changes regularly to minimize conflicts.
  • Keep Branches Short-Lived: Remove feature branches after merging.
  • Avoid Conflicts: Sync your branch with the latest target branch before merging.
  • Tag Milestones: Use tags for stable releases or important checkpoints.
  • Document Changes: Add commit messages for traceability.

আরও শেখো

For advanced SQL operations and the full Nessie Spark SQL reference, see:
👉 Nessie Spark SQL Reference

Nessie with Ilum combines Spark’s power with Git-like data management, enabling robust “data as code” workflows for your lakehouse.