CIP-0165

Canonical Ledger State

Abstract

This proposal defines the Standard Canonical Ledger State (SCLS), a stable, versioned, and verifiable file format for representing the Cardano ledger state. It specifies a segmented binary container with deterministic CBOR encodings, per-chunk commitments, and a manifest that enables identical snapshots across implementations, supports external tools (e.g., Mithril), and future-proofs distribution and verification of state.

This CIP specifies a canonical interchange format for the ledger state. It does not define, prescribe, or constrain the internal storage or representation of the ledger state within any node implementation. Internal formats remain an implementation detail; the canonical format applies only to the export, interchange, and verification of the ledger state consistency.

Motivation: why is this CIP necessary?

Ledger state serialisations are currently implementation details that may change over time. This makes them unsuitable as stable artifacts for distribution, signing, fast sync, or external tooling (e.g., db-sync, conformance testing, and Mithril checkpoints). Without a canonical format, two nodes at the same chain point can legitimately produce different byte streams for the same state, complicating verification and opening room for error in multi-implementation ecosystems.

SCLS addresses these problems by:

  • specifying a canonical, language-agnostic container and encoding rules;
  • enabling streaming builds and data consistency validation (per-namespace roots);
  • being extensible (e.g., optional indexes/Bloom filters) without breaking compatibility;
  • supporting solutions that store updates in incremental (delta) formats (e.g. UTxO-HD/LSM).

Versioning and Upgrade Complexity: the proposed format defines a solution that will be able to support future protocol extensions with new eras without changing this CIP. The chosen approach allows implementers to define a client that will be interested only in particular parts of the state, but will not prevent it from storing, loading and verifying the interesting parts of the state.

The concrete use-case scenarios for this CIP are:

  • allow building a dump of the Cardano Ledger node state in a canonical format, so any two nodes would generate the same file. This would allow persistence, faster bootstrap and verification.
  • such a state can be verified by another node against its own local state and independently signed. This enables full utilization of Mithril, as each node can sign the same state snapshot independently. To make this possible,
  • full conformance testing. Any implementation would be able to reuse the test-suite of the Haskell node by importing data applying the test transaction and exporting data back.

Specification

The Standard Canonical Ledger State (SCLS) is a segmented file format for Cardano ledger states, designed to support streaming and verifiability. Records are sequential, each tagged by type and independently verifiable by hash. Multi-byte values use network byte order (big-endian).

File Structure

  1. Sequence of records (S, D)* where S is a 32-bit size and D is the payload stored in typed record.
  2. Each payload begins with a one-byte type tag, defining its type.

Unsupported record types are skipped; core data remains accessible.

Namespaces and Entries

In order to provide types of the values and be able to store and verify only partial state, a notion of namespaces is introduced. Each SCLS file may store values from one or more namespaces.

Supported Namespaces

Each logical table/type is a namespace identified by a canonical string (e.g., "utxo", "gov"). Namespace identifiers are UTF-8 encoded; all comparisons and ordering use bytewise (lexicographic by Unicode codepoint) ordering on the UTF-8 byte representation.

ShortnameContent
blocks/v0Blocks created
gov/committee/v0Governance action state
gov/constitution/v0Constitution
gov/pparams/v0Protocol parameters
gov/proposals/v0Update proposals
pool_stake/v0Stake delegation
nonce/v0Nonces
snapshots/v0snapshots
utxo/v0UTXOs

New namespaces may and will be introduced in the future. With new eras and features, new types of the data will be introduced and stored. In order to define what data is stored in the SCLS file, tools fill the MANIFEST record and define namespaces. The order of the namespaces does not change the signatures and other integrity data.

Record Types

CodeNamePurpose
0x00HDRFile header: magic, version, network, namespaces
0x01MANIFESTGlobal commitments, chunk table, summary
0x10CHUNKOrdered entries with per-chunk footer + hash
0x11DELTAIncremental updates overlay (reserved for future)
0x20BLOOMPer-chunk Bloom filter (reserved for future)
0x21INDEXOptional key→offset or value-hash indexes (reserved for future)
0x30DIRDirectory footer with offsets to metadata/index regions (reserved for future)
0x31METAOpaque metadata entries (e.g., signatures, notes)

Proposed file layout:

HDR,
(CHUNK[, BLOOM])*,
MANIFEST,
[INDEX]*,
[META]*
[DIR],
[ (DELTA[, BLOOM])* , MANIFEST, [INDEX]*, [DIR] ]*

At the first steps of implementation it would be enough to have the simpler structure:

HDR,
(CHUNK)*,
MANIFEST

All the other record types allow the introduction of additional features, like delta-states, querying data and may be omitted in case the user does not want those functionalities.

For the additional record types (all except HDR, CHUNK, MANIFEST) it's possible to keep such records outside of the file and build them iteratively, outside of the main dump process. This is especially useful for indices.

HDR Record

Purpose: identify file, version, and global properties.

Structure:

REC_HDR (once at start of file)

  • magic : b"SCLS"
  • version : u16 (start with 1)

Policy:

  • appears once in the file header;
  • must be read and verified first;
  • supports magic bytes for file recognition.

CHUNK Record

Purpose: group entries for streaming and integrity; maintain global canonical order, see namespace and entries for more details.

Structure:

  • chunk_seq : u64 — sequence number of the record
  • chunk_format : u8 - format of the chunks, indicating compression scheme (see data compression table below)
  • namespace : bstr — namespace of the values stored in the CHUNK
  • entries : DataEntry — list of length-prefixed data entries
  • footer {entries_count: u32, chunk_hash: blake28} — hash value of the chunk of data, is used to keep integrity of the file.

DataEntry is a blob of a key-valued data. The structure of the DataEntry is the following:

  • size : u32 - size of the data
  • key : fixed size - key is a fixed size blob where size depends on the namespace
  • value : bstr — cbor data entry

While the format requires each entry to have a key, it's still possible to support hierarchical structures, either by normalizing them and keeping a path or hash as a key or by introducing an artificial key and keeping the entire hierarchy in a single key. The choice depends on each namespace. If there are ways to express and support updating of a part of the tree, it is worth normalizing the tree. If the data is kept and updated as a whole, a single artificial key can be used.

Policy:

  • chunk size ~8–16MiB; footer required;
  • data is stored in deterministically defined global order; in the lexical order of the keys;
  • all keys in the record must be unique;
  • all key-values in the record must refer to the same namespace;
  • CHUNK records are ordered by namespace using bytewise (UTF-8 lexicographic) ordering; all CHUNK records for a given namespace appear consecutively and before any CHUNK records for a namespace that sorts later;
  • within a namespace, all keys in CHUNK n must be lexicographically lower than all keys in CHUNK n+1;
  • readers should verify footer before relying on the data;
  • chunk_hash = H(concat [ digest(e) | e in entries ]);

The format proposes support of data compression. For future-compatibility the format is described by the chunk_format field, and following variants are introduced:

CodeNameDescription
0x00RAWRaw CBOR Entries
0x01ZSTDAll entries are compressed with seekable zstd
0x02ZSTDECompress each value independently

When calculating and verifying hashes, its built over the uncompressed data.

MANIFEST Record

Purpose: index of chunks and information for file integrity check.

Structure:

  • slot_nou64 identifier of the blockchain point (slot number).
  • total_entries: u64 — number of data entries in the file (integrity purpose only)
  • total_chunks: u64 — number of chunks in the file (integrity purpose only)
  • root_hash: Digest - global Merkle root, computed over the per-namespace Merkle roots in lexicographic namespace order (see Verification section for details)
  • namespace_info: a list of { entries_count, chunks_count, name, digest } for each namespace in the file, where:
    • entries_count: u64 — number of entries in the namespace
    • chunks_count: u64 — number of chunks for the namespace
    • name: bstr — namespace name
    • digest: Digest - Merkle root of all entry_e in the namespace
  • prev_manifest: u64 — offset of the previous manifest (used with delta files), zero if there is no previous manifest entry
  • summary: { created_at, tool, comment? }

Digest is defined as a fixed-size 224-bit (28-byte) Blake2b hash.

Policy: used to verify all the chunks.

DELTA Record

Delta records are reserved for future use, where the node wishes to incrementally update an existing CLS file rather than writing a new one.

Purpose: Delta records are used to build iterative updates when a base format has been created and we want to store additional transactions efficiently. Delta records are designed to be compatible with UTxO-HD, LSM-Tree, or other storage types where it is possible to stream a list of updates. This means that instead of serializing the entire ledger state, a node may only need to store a list of entry changes.

Updating the file in place is unsafe, so instead we store a list of updates, i.e. removed and inserted items.

All updates are written in the following way:

  • to update a value, a new entry with the same key should be stored;
  • to remove a value, a special tombstone entry for the key should be stored.

Structure:

  • namespace: bstr — namespace name
  • changes: CBOR — array of the entries, either tombstone entry or value entry
  • footer: {entries_count, chunk_hash}

Policy:

  • chunk size ~8–16MiB; footer required;
  • reader should verify hash before relying on data;
  • dead entries are marked by the special tombstone entry;
  • there must be only one element for the given key in the delta record.

Each set of DELTA records are followed by the MANIFEST slot, that keeps the information about the current slot and state.

BLOOM Record

Bloom record is reserved for the future use, in case if search in the file will be required by the downstream.

Purpose: additional information for allowing fast search and negative search.

Structure:

  • chunk_seq: u64 sequence number of the record.
  • m: u32 - total number of bits in the Bloom filter’s bitset (the length of the bit array).
  • k: u8 - number of independent hash functions used to map a key into bit positions in that array.
  • bitset: bytes[ceil(m/8)] — actual bitset.

DIR Record

Directory record is reserved for the future use, in case if index records or delta records will be implemented.

Purpose: If a file has index records then they will be stored after the records with actual data, and directory record allow a fast way to find them. Directory record is intended to be the last record of the file and has a fixed size footer.

Structure:

  • metadata_offset: u64 offset of the previous metadata record, zero if there is no record
  • index_offset: u64 offset of the last index record, zero if there is no record

META Record

Purpose: record with extra metadata that can be used to store 3rd party data, like signatures for Mithril, metadata information. This is an additional record that may be required for additional scenarios.

Structure:

  • entries: Entry[] list of metadata entries, stored in lexicographical order
  • footer: {entries_count: u64, entries_hash}

Entry:

  • subject: URI — subject stored in the URI format
  • value: cbor — data stored by the metadata entry owner.

Policy:

  • Meta chunks are completely optional and do not change the hash of the entries in data.
  • entries_hash = H(concat (digest e for e in entries))

Entries

Data is stored in the list of Entries, each entry consist of the namespace and its data:

  • size : u32 — length of the entry, stored in big endian;
  • key : bstr - CBOR-encoded string key;
  • dom : bstr – CBOR-encoded data (canonical form).

size is used in a fast search scenario, this way its possible to skip values without interpretation.

The exact definition of the domain data is left out in this CIP. We propose that the ledger team would propose a canonical representation for the types in each new era. For the types they must be in a canonical CBOR format with restrictions from deterministic cbor. Values must not be derivable, that is, if some part of the state can be computed based on another part, then only the base one should be in the state."

All concrete formats should be stored in an attachment to this CIP and stored in namespaces/{namespace}.cddl for each namespace. All the changes should be introduced using current CIP update process.

Canonicalization Rules

  • CBOR maps must be deterministic with sorted keys and no duplicates.
  • Numbers use minimal encoding.
  • Arrays follow a fixed order.

Verification

Manifest stores an overall global root and per-namespace commitments. Verification uses a two-level Merkle structure: each namespace has its own Merkle tree built over its entries, and the global Merkle tree is then built over the per-namespace roots, not directly over entries.

Per-namespace Merkle trees

For each namespace, a Merkle tree is constructed over all live entry digests in canonical key order; tombstones are excluded, last-writer-wins for overlays.

The digest of each entry (the leaf hash), defined as a (key,value) pair for a namespace ns_str, is H(0x01 || ns_str || key || value), where H is Blake2b-224 and ns_str is the UTF-8 encoded namespace name. Internal nodes are hashed as H(0x00 || left || right). The 0x00 and 0x01 domain separators make leaf and internal-node hashes structurally distinct, preventing second pre-image attacks.

The Merkle tree is unbalanced. During construction, entries are inserted in canonical order; whenever two subtrees at the same depth exist they are immediately merged into a node one depth up. On finalization, any remaining trailing subtrees are processed from shallowest to deepest: a trailing subtree whose depth is lower than its successor's is promoted -- its depth counter is incremented with its hash left unchanged -- until the depths match, at which point they are merged with H(0x00 || left || right). This repeats until a single root digest remains.

An empty namespace (zero entries) produces H("") as its root.

Basic chunks store all the values in canonical key order; the namespace Merkle tree is built over all values in this order.

The rule of thumb is that when we calculate a hash of the data we take into account only the live (non deleted) values in canonical order. In the case when there is a single dump without delta records, this is exactly the order of how values are stored. But when delta—records appear we need to take into account that in the following records there may be values that are smaller than the ones in the base and some values may be deleted or updated. As a result writer should calculate a live-set of values, which can be done by running a streaming multi-merge algorithm (when we search for a minimal value from multiple records). In the case a value exists in multiple records we use a last—writer—wins rule. If there is a tombstone, we consider a value deleted and do not include it in a live-set.

Global Merkle tree

The global Merkle tree is constructed from the per-namespace Merkle roots, not from entries directly. Namespace roots are taken in lexicographic (bytewise UTF-8) order of namespace name. Each namespace root is a leaf with digest H(0x01 || namespace_root), where namespace_root is the 28-byte per-namespace Merkle root. (Note that the namespace name is not included in the global tree leaf hash, unlike the entry hashes.) Internal nodes are formed identically to the per-namespace trees: H(0x00 || left || right) with the same unbalanced finalization procedure.

An empty file (no namespaces) produces H("") as the global root.

Extensibility

  • Unknown fields in HDR or unknown chunk types can be skipped by readers.
  • Allows future extension (e.g., index chunks, metadata) without breaking compatibility.

Rationale: How does this CIP achieve its goals?

This CIP achieves its goals by:

  1. Define a canonical format that has the following properties: the format is very simple, it would be easy to write an implementation in any language.
  2. The format is extensible and agnostic, it provides a versioned format that simplifies ledger evolution.
  3. Removes ambiguity, allows signed checkpoints, and improves auditability

Format defines canonical format and ordering for the stored data, thus allowing reproducible and verifiable hashing of the data. It supports Mithril and fast node bootstrap use cases.

Prior Work and Alternatives

Global alternatives:

  • CIP PR #9 by Jean-Philippe Raynaud: the CIP that discusses the state and integration with Mithril a lot. Without much details CIP discusses immutable db and indices. Current CIP discussing adding indices as well, we believe that we can combine the approaches from the work and related work with our own and use the best of both words.

  • CIP draft by Paul Clark: this was an early work of the CIP of the canonical ledger state. The work was more targeted towards what is stored in the files. Proposal also uses deterministic CBOR (canonical CBOR in this CIP). The proposal opens a discussion and rules about how and when snapshots should be created by the nodes, that is deliberately not discussed in the current CIP, as we do not want to impose restrictions on the nodes, and the format allow the nodes not to have any agreement on those rules. As a solution for extensibility and partiality the CIP proposes using a file per "namespace" (in the terminology of the current CIP), in our work we proposed to have a single chunked file that is more friendly for the producer. Currently we are considering at least to have an option for extracting multi-files version. See discussion in open questions.

  • Do Nothing: rejected due to interoperability and Mithril requirements.

Implementation alternatives

Container format

More common container file formats: many container formats were evaluated, but most implementations do not meet all the required properties. The container format closest to the required properties are CARv2 and gitpack-files, but they are more complex and would require additional tooling and language support for all node implementations. For simplicity and ease of adoption, a straightforward binary format was chosen. It's still possible to express the current approach in both CARv2 and gitpack files, if it will be shown that the approach is superior.

Data encoding
  • gRPC: current Haskell node is using CBOR-based ecosystem, so nodes have to support CBOR anyway. In contrast to self-describing CBOR, gRPC requires schema to read the document, and that may be a problem for future compatibility.
  • Plain CBOR stream: easier decoding, but prevents skipping unwanted chunks needed for filtering and additional properties, like querying the data in the file.

JSON vs CBOR for Canonical Ledger State

There are strong reasons to prefer CBOR over JSON for representing the canonical ledger state. The ledger state is large and contains binary data; a binary format is therefore much more compact and efficient than JSON.

While JSON libraries are widely available in nearly every language, JSON lacks a notion of canonical form. Two JSON serializations of the same object are not guaranteed to be byte-identical, so additional tooling and specification would be required to achieve determinism.

By contrast, CBOR has a defined deterministic encoding (see RFC 8949 and restrictions), making it suitable for a canonical format. CBOR also has mature implementations across many programming languages (list here).

Importantly, RFC 8949 also defines a mapping between CBOR and JSON. This allows us to specify a JSON view of the format so that downstream applications can consume the data using standard JSON tooling, while the canonical form remains CBOR.

Multi-file or single file?

We considered using single file because it's more friendly to the producer, because it's possible to ensure required atomicity and durability properties, together with footers-in records, it's possible to validate that the data was actually written and is correct. In case of failure it's possible to find out exactly the place where the failure happened.

However, we agree that for the consumer who wants to get partial states in will be much simpler to use multiple files.

The proposed SCLS format does not contradict having multiple files, on the contrary for things like additional indices we suggest using additional files, it will work as the records have sequential numbers and we can reconstruct full file and have an order. In this proposal we would file to set an additional constraint of the tooling that will come with the libraries: that the tool should be able to generate multi-files on a request and convert formats between those

Should files be byte-identical?

Current approach does not provide byte-identical files, only the domain data that is stored and it's hashes are canonical. It means that tools like Mithril will have to use additional tooling or recalculate hash on their own. It's done for the purpose, this way software may add additional metadata entries, e.g. Mithril can add its own signatures to the file without violating validation properties. Other implementation may add records that are required for them to operate or bootstrap. It's true that other approaches, does solve that issue by creating multiple files, each of them will be byte-identical.

There are few solutions that we propose:

  1. allow tooling to export (or even generate) raw cbor files, that will have required byte-identical property
  2. set additional restrictions on the policies for the records, and instead of defining variable size records require all the records to have exact number of entries inside. It will harm some properties of the hardware but the files will be byte-identical in case if they have similar metadata. Or the metadata we can place it in separate file, then everything will be byte-identical.

Open Questions

What is the exact implementation for data compression, especially for indexing and search?

There are many ways to write indices, hashtables or btrees, each of them may have interesting properties. It's an open question which of them do we want to support.

Do we want the file be optimized for querying with external tools? If so how to achieve that?

We are proposing adding additional records types:

  • bloom records — they would allow faster search of the values by the key, still require file traversal;
  • index records - it would allow faster search by key without full file traversal.

Both changes will not change the structure of the file.

Path to Active

Acceptance Criteria

  • Expert review and consensus from Ledger Committee, IOG, and Node teams.
  • Reference implementation in Cardano node with CLI tool for export/import.
  • Verified test vectors showing identical output across implementations, including Mithril compatibility.
  • Full documentation and CDDL schemas.

Implementation Plan

  1. Prototype SCLS writer/reader.
  2. Refine specification and finalise CDDL.
  3. Integrate into Cardano node CLI.
  4. Validate with Mithril.
  5. Rollout and ecosystem tooling.

References

  1. CARv2 format documentation
  2. Draft Canonical ledger state snapshot and immutable data formats CIP
  3. Mithril
  4. Canonical ledger state CIP draft by Paul Clark
  5. Deterministically Encoded CBOR in CBOR RFC
  6. A Deterministic CBOR Application Profile

This CIP is licensed under CC-BY-4.0.