CIP 42 - New Plutus built-in serialiseData

Abstract

This document describes the addition of a new Plutus built-in for serialising BuiltinData to BuiltinByteString.

Motivation

As part of developing on-chain script validators for the Hydra Head protocol, we stumbled upon a peculiar need for on-chain scripts: we need to verify and compare digests obtained by hashing elements of the script's surrounding transaction.

In this particular context, those elements are transaction outputs (a.k.a. TxOut). While Plutus already provides built-ins for hashing (e.g. sha2_256 :: BuiltinByteString -> BuiltinByteString), it does not provide a generic way of serialising a data type to BuiltinByteString.

In an attempt to pursue our work, we have implemented an on-chain library (plutus-cbor) for encoding data types as structured CBOR (RFC 8949) in a relatively efficient way (although still quadratic, it is as efficient as it can be with Plutus' available built-ins), and measured the memory and CPU cost of encoding a TxOut in an on-chain script validator.

The above graph shows the memory and CPU costs, relative to a baseline, of encoding a TxOut using plutus-cbor, as a function of the number of assets present in that TxOut. The costs on the y-axis are relative to the maximum execution budgets allowed for a single script execution (as per mainnet's parameters, December 2021). As can be seen, the cost is of linear complexity, i.e. O(n) in the number of assets. These results can be reproduced using the encoding-cost executable in our repository.

Note that we have also calculated similar costs for ada-only TxOut values, as a function of the number of outputs; these are roughly twice as expensive but of a similar linear shape.

As we can see on the graph, the cost is manageable for a small number of assets (or, equivalently, a small number of outputs) but rapidly becomes limiting. Ideally, we would prefer the transaction size to be the limiting factor when it comes to the number of outputs we can handle in a single validation.

Besides, in our discussions with the Marlowe team, we also discovered that they shared a similar problem when it came to serialising merkleized ASTs.

Underneath it all, it seems that it would be beneficial to have a new built-in at our disposal for serialising any Plutus BuiltinData to BuiltinByteString, such that validators could leverage, via built-ins, implementations and bytestring builders more optimised than what is achievable on-chain, hopefully reducing the overall memory and CPU costs.

Specification

Function definition

We define a new Plutus built-in function with the following type signature:

    serialiseData :: BuiltinData -> BuiltinByteString
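
For illustration, here is a minimal sketch of how a validator could combine this built-in with the existing hashing built-ins to check a digest against a serialised datum. The checkDigest helper is an assumption made for this example, as is the premise that the new built-in would be exposed through PlutusTx.Builtins; neither is part of the proposal:

    import PlutusTx.Builtins (BuiltinByteString, BuiltinData, equalsByteString,
                              serialiseData, sha2_256)

    -- Hypothetical helper: does the SHA-256 digest of the serialised datum
    -- match the expected digest?
    checkDigest :: BuiltinByteString -> BuiltinData -> Bool
    checkDigest expected datum =
        equalsByteString expected (sha2_256 (serialiseData datum))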

Binary data format

Behind the scenes, we expect this function to use a well-known encoding format, to ease the construction of such serialised values off-chain (in particular, in non-Haskell off-chain contract code). A natural choice of binary data format in this case is CBOR, which is:

  1. Efficient;
  2. Relatively simple;
  3. Used pervasively across the Cardano ecosystem.

Furthermore, the Plutus ecosystem already provides a quite opinionated implementation of a CBOR encoder for built-in Data. For the sake of documenting it as part of this proposal, we provide below the CDDL specification of that existing implementation:

    plutus_data =
        constr&lt;plutus_data&gt;
      / { * plutus_data => plutus_data }
      / [ * plutus_data ]
      / big_int
      / bounded_bytes

    constr&lt;a&gt; =
        #6.121([* a])
      / #6.122([* a])
      / #6.123([* a])
      / #6.124([* a])
      / #6.125([* a])
      / #6.126([* a])
      / #6.127([* a])
      ; similarly for tag range: #6.1280 .. #6.1400 inclusive
      / #6.102([uint, [* a]])

    big_int = int / big_uint / big_nint
    big_uint = #6.2(bounded_bytes)
    big_nint = #6.3(bounded_bytes)

    bounded_bytes = bytes .size (0..64)

NOTE: The CDDL specification above is extracted from the wider alonzo.cddl specification of the Cardano ledger.
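
As a concrete, hand-checked illustration of this encoding: using the constructors of the built-in Data type (Constr, I, B), a hypothetical value Constr 0 [I 42, B "abc"] would serialise to the following CBOR bytes:

    Constr 0 [I 42, B "abc"]
      => d8 79          ; tag 121, i.e. constructor index 0
         82             ; array of length 2
         18 2a          ; unsigned integer 42
         43 61 62 63    ; 3-byte string "abc"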

Cost Model

The Data type is recursive, so costing it properly is a little tricky. The Plutus source code defines an instance of ExMemoryUsage for Data, with the following interesting note:

This accounts for the number of nodes in a Data object, and also the sizes of the contents of the nodes. This is not ideal, but it seems to be the best we can do. At present this only comes into play for 'equalsData', which is implemented using the derived implementation of '==' […].
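
To make that accounting concrete, here is a minimal sketch of a node-counting size measure. This is only an illustration of the idea, not the actual ExMemoryUsage instance; the per-leaf sizes are simplified assumptions:

    import qualified Data.ByteString as BS
    import PlutusCore.Data (Data (..))

    -- Illustrative size measure: one unit per node in the Data object, plus
    -- the contents of leaf nodes measured in 8-byte words (a simplification
    -- of the real instance in the Plutus source code).
    dataSize :: Data -> Integer
    dataSize d = case d of
        Constr _ fields -> 1 + sum (map dataSize fields)
        Map keyValues   -> 1 + sum [dataSize k + dataSize v | (k, v) <- keyValues]
        List elements   -> 1 + sum (map dataSize elements)
        I _             -> 2  -- assume the integer fits in a single word
        B bytes         -> 1 + fromIntegral ((BS.length bytes + 7) `div` 8)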

We propose to re-use this instance to define a cost model that is linear in the size of data as defined by this instance. What remains is to find a proper coefficient and offset for that linear model. To do so, we can benchmark the execution costs of encoding arbitrarily generated Data of various sizes, and fit the results to a linear model (provided that the results still support that type of model).
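
For illustration, such a linear costing function could take the following shape (the LinearModel type and its coefficient names are placeholders invented for this sketch; the fitted values come out of the benchmarks discussed below):

    -- Linear cost model: cost(size) = intercept + slope * size.
    data LinearModel = LinearModel
        { intercept :: Integer  -- fixed cost per call
        , slope     :: Integer  -- marginal cost per unit of Data size
        }

    serialiseDataCost :: LinearModel -> Integer -> Integer
    serialiseDataCost (LinearModel c0 c1) size = c0 + c1 * size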

Benchmarking and costing of serialiseData was done in this PR according to this strategy. Because the benchmark results are not entirely uniform (different shapes of Data take different amounts of CPU time to process), the linear model is used as an upper bound, conservatively overestimating the actual costs.

Rationale

Path To Active

Alternatives

Backward Compatibility

This CIP is licensed under Apache-2.0.