Plebeia is a functional implementation of a Merkle Patricia tree. The library allows the manipulation of in-memory functional views of the tree and perform atomic commits of those views to the disk.
Plebeia supports sub-trees. That is, a non internal node can be either a leaf with a value or the root of a sub-tree called "Bud". Trees and sub-trees can be navigated like directories and sub-directories using a cursor implemented with a zipper.
The Merkle Patricia trees are persisted to a file on disk by appending nodes. By default, the underlying storage system uses 256 bits (32 bytes) fixed length cell layout. Each plebeia node is arranged to fit within one cell as possible: a pointer called index has 32 bits. Hashes are 224 bit long.
Leaf data are persisted in the same data file as the tree.
Plebeia is not only a generic Merkle Patricia tree implementation but also specialized for version controlled file systems. It is aimed to be used as a storage subsystem for Tezos blockchain.
Here we give logical definitions of Plebeia trees.
Their implementations and limitations are detailed later.
A segment is a label of Plebeia tree. It is a non empty list of L (left or 0) and R (right or 1).
The longest possible segment in Plebeia is 2039, which is about 254 bytes. In reality, segments are not so long. Currently the longest segments possible to store Tezos blockchain state is around 430.
A bud is a node to store a root or sub tree to give the multi-depth directory structure to Plebeia. It corresponds with the directory seprator character / in the Unix file system.
Empty bud [/] Non empty Bud [/]
|
. <- the subnode, either an internal or an extenderA leaf is a node which stores a data and has no subnode. Plebeia does not assume any structure of the contents: just a binary blob:
Leaf <data>An internal is a node which always has 2 subnodes. The 2 edges to these subnodes are labeled with L (Left) and R (Right):
Internal o
L / \ R
. . <- the subnodesAn extender is a node which always has 1 subnode. The edge to the subnode is labeled with a segment.
For the uniqueness of the tree representation, an extender cannot have another extender as its subnode:
Extender o
/
\ LRL... <- segment
/
. <- the subnode, which is not an Extender [/]
|
o
L/ \R
o o
/ L/ \R
RL\ [/] <3>
/ |
<1> o
L/ \R
<2> [/]The above Plebeia tree has the following terminal items:
1 at /LRL.2 at /RL/L./RL/R3 at /RR.Segment encoding SE(seg) of a binary representation of a segment seg in the following manner:
101110 for RLRRRL.10{i} where 0 <= i < 8 to make the length of the entire bits a multply of 8. |<--------- 8n bits -------->|
|<- segment bits --->|100..00|Examples
SE(RRRLLL) = 11100010SE(RLRLRLRL) = 1010101010000000SE(RRRLLLRLRLRLRL) = 1110001010101010Each Plebeia node n has its Merkle hash or just hash h(n).
The size of hash is fixed to 224 bits (28 bytes) for Buds, Leafs, and Internals.
Extenders have hashes with variable length from 232 bits (29 bytes) to 2264 bits (283 bytes).
The goal is to make the root hash independent of the Patricia implementation. The approach is inspired from the one described by Vitalik Buterin.
hash222_2(x,bb) be the 28 byte BLAKE2B hash of x whose last 2 bits are replaced by bb. bb is a tag to avoid preimage attack.|| is the concatenation of binary chars and strings.take(n,x) is the prefix of n bits of binary string x.0 and 1 denote zero and one bit, respectively.b{n} denotes the n sequence of bit b.SE(seg): segment encoding, a string which encodes seg. Defined above.len(s) : length of bytes of string schr(n) : character of code nThe hash of a leaf is:
h(Leaf(v)) = hash222_2(v,0b10)
|<- h(Leaf(v)) ->|
Leaf |............................10|Example : A Leaf of value "hello world"(0x68656c6c6f20776f) is: hash222_2(0x0068656c6c6f20776f, 0b10) = 0x42d1854b7d69e3b57c64fcc7b4f64171b47dff43fba6ac0499ff437e
The hash of the empty bud is all (224 bits) zeros:
h(Bud None) = 0{224}
|<-------- 224bits--------->|
Empty bud |000000000..........00000000|The hash of a bud with a subnode of hash h(n) is:
h(Bud (Some n)) = hash222_2(n,0b11)
|<- h(Bud (Some n)) ->|
Bud |............................11|Example : The hash of a Bud with an Internal which has 2 empty Buds is 0x79eb24d7ef79749e5031c2791625956546aeb53ac7f344cde79d5783:
* The hash of the empty bud is `0x00000000000000000000000000000000000000000000000000000000`.
* The hash of the internal is `hash222_2(0x0000..0 || 0x0000..0 || 0x00, 0b00) = 0x21e2540637fdb988202f3cb196c896e9e472c779f22f2f3e98a46e08`.
* The hash of the top bud is `hash222_2(0x21e2540637fdb988202f3cb196c896e9e472c779f22f2f3e98a46e08, 0b11) = 0x79eb24d7ef79749e5031c2791625956546aeb53ac7f344cde79d5783`The hash of an internal node with children hashes l and r is:
h(Internal(n1,n2)) = hash222_2(h(n1) || h(n2) || chr(len(h(n2))-28), 0b00)
|<- h(Internal(n1,n2)) ->|
Internal |.............................00|chr(len(h(n2)) - 28) is appended to make the hash safer; s1 || s2 is not injective by itself where s1 and s2 have variable lengths.0 <= len(h(n2))-28 < 256.Example : The hash of a Bud with an Internal which has 2 empty Buds is 0x21e2540637fdb988202f3cb196c896e9e472c779f22f2f3e98a46e08: * The hash of the empty bud is 0x00000000000000000000000000000000000000000000000000000000, whose length is 28 bytes. * The hash of the internal is hash222_2(0x0000..0 || 0x0000..0 || 0x00, 0b00). It is 0x21e2540637fdb988202f3cb196c896e9e472c779f22f2f3e98a46e08.
Extenders have variable length hashes. The hash of an Extender(seg, n) is the hash of subnode h(n) followed by the segment encoding SE(seg):
h(Extender(seg,n)) = h(n) || SE(seg)
|<- hash of the subnode ->|<- segment encoding of seg ->|
Extender | h(n) |<--- segment bits --->|100..0|From the hard limit of the longest node hash (283 bytes), we have:
0{0,7}1.Example : The hash of an Extender which has an empty Bud with a segment R is 0x00000000000000000000000000000000000000000000000000000000c0. * The hash of the empty Bud is 0x00000000000000000000000000000000000000000000000000000000 * The segment encoding of segment R is 11000000 in binary, which is 0xc0
This section explains how the current Plebeia implementation at https://gitlab.com/dailambda/plebeia stores its nodes in a persisted file storage. Other implementations need not to follow the same storage format, but this section should still give some information to explain the characteristics and limits of its logical representation and hash caclulation algorithm.
This implementation can change in future for optimizations and bug fixes, but it should avoid any change of the hash calculation of the logical nodes. If an update changes the hash calculation, then entire Plebeia data files must be converted from the original hash to the new one, which takes very long time.
The first 256 bytes of the storage is the header. The header is followed by a list of cells. The cells have a fixed size bytes_per_cell, 32 bytes by default.
Cell locations are aligned to the beginning of the file for performance. Because of the 256 bytes header at the head, some of the first cells are not usable. The first usable cell, Cell #n, locates at bytes_per_cell * n bytes of the file, where n = (256 + bytes_per_cell - 1) / bytes_per_cell.
|<- bytes_per_cell * n ->|
|<- 256 bytes ->| | |<- bytes_per_cell ->||<- bytes_per_cell ->||<- bytes_per_cell ->|...
|<- Header ->|........| |<----- Cell #n ---->||<---- Cell #n+1 --->||<---- Cell #n+2 --->|...Integers are stored using little endian.
The header of Plebeia data file occupies the first 256 bytes:
File identifier
+0
|0 19|20 23|24 27|28 31|
|HEADER STRING |<bytes/cell->|<-max idx ->|<-version ->|
Header for disk sync #1
+32
|0 11|12 23|24 27|28 31|
|<- hash ->|<- 0 ->|<- i root ->|<- i next ->|
Header for disk sync #2
+64
|0 11|12 23|24 27|28 31|
|<- hash ->|<- 0 ->|<- i root ->|<- i next ->|
Header for process sync #1
+96
|0 11|12 23|24 27|28 31|
|<- hash ->|<- hash w/ key ->|<- i root ->|<- i next ->|
Header for process sync #2
+128
|0 11|12 23|24 27|28 31|
|<- hash ->|<- hash w/ key ->|<- i root ->|<- i next ->|
+160 to +255 : reserved +0
|0 19|20 23|24 27|28 31|
|HEADER STRING chH|<bytes/cell->|<-max idx ->|<-version ->|Header #1 at B32-63 and header #2 at B64-95 store the same contents: the current state of Plebeia tree storage. They are modified during the operation. The state is written together with its hash:
|0 19|20 23|24 27|28 31|
--------------|---------------------------------------------------------------
Header #1/#2 |<- hash of the right -------------->|<- i-root ->|<- i-next ->|The first 24 bytes are the Blake2B24 hash of the rest of the cell. The rest of the cell consists of the 2 indices:
The system writes the same contents to these headers for crash recovery. If the system restarts from a crash, it checks the headers and if:
For the performance, the header should not be updated for each DB write. If the system crashes, then the updates to the DB after the last valid header write are lost, even though they are properly written to the file.
B96-159 carries another copies of headers. They provide the same information as the header for disk sync, but used for faster communcation between the writer and reader processes.
When the system recovers from a crash, it checks the values of the disk sync and process sync. If the process sync header points a later cell than the disk sync header does, Plebeia ignore the cells between the 2 pointers since they may not be flushed completely and be corrupted. Plebeia append new data after the cell pointed by the process sync header.
Plebeia data file keeps the commit information together with the tree data. The information of commits are called "root records". A root record consists of 2 contiguous cells:
Node pointed by i-root or prev:
|0 15|16 19|20 23|24 27|28 31|
|<- zero ->|<- info ->|<- prev ->|<- parent ->|<- idx ->|
Previous cell:
|0 31|
|<-------------------- hash ----------------------------->|Info for more details.The index of the last root hash record is recorded in the i-root field of the header.
:::warning Do not confuse with prev and parent fields. The commit found at prev may not be the parent of the current commit but a commit of an unrelated branch. parent and prev's parent can be different. :::
Every node is stored in one or more cells, an array with bytes_per_cell bytes, 32 bytes by default. The constant size makes it easy for nodes to refer to each other using an index in the array.
Index is a 32 bit integer to identify the cells. The cell of index i occupies from i * bytes_per_cell bytes to (i * bytes_per_cell + 1) - 1 bytes of the file.
Since the head of the file is occuiped by the header of 256 bytes, the first usable cell index is not 0 but (256 + bytes_per_cell - 1) / bytes_per_cell.
The indices from 2^32 - 256 to 2^32 - 1 are reserved for tags of Leaf, Link, and empty Bud. Therefore the usable indices are between (256 + bytes_per_cell - 1) / bytes_per_cell and 2^32 - 257.
The maximum size of the storage file is (2^32 - 256) * bytes_per_cell bytes, about 128GB when bytes_per_cell is 32bytes. It is possible to store more cells than this limit by splitting entire data into several independent data files, giving up sharing cells between files.
The last 32 bits of a node cell are called the index part of the node, which often refers to a node linked from the node.
The index part value from 2^32 - 256 to 2^32 - 1 are used for tags:
This cell layout is for the default configuration without any cell extensions with bytes_per_cell = 32.
|<-- hash_size -->| |<------- 32 bits ---->|
------------------------------------------------------
internal |<-- hash --->|D|0| |<- idx of A child --->| (also refers to the previous cell)
extender -- seg -->|len|0|1| |<- idx of the child ->| (may use some of previous cells)
bud |<--- hash -->|1|1| |<- idx of the child ->|
empty bud |<-- 11111111 --->| |<- -256 ------------->|
leaf (S) |<---- hash ----->| |<- -1 to -128 ------->| (use previous cells)
leaf (L) |<---- hash ----->| |<- -253 ------------->| (use previous cells)
leaf (XL) |<---- hash ----->| |<- -255 ------------->| (use previous cells)
link |<0>|<-child idx->| |<- -254 ------------->|D, which denotes which child's index is referred in the index part. 0: left, 1: right.DThe other child which is not referred in the index part always stored in the previous cell of the internal node.
|<-- hash_size -->| |<------- 32 bits ---->|
------------------------------------------------------
internal |<-- hash --->|D|0| |<- idx of A child --->| (also refers to the previous cell)In almost of all the cases, the cell of an Internal is written just after the cell of the one of its subndoes. Some exceptional cases when nodes are shared by hashconsing, the both subnodes cannot be adjacent to the Internal node. Link is introduced to work around them.
01. |<-- hash_size -->| |<------- 32 bits ---->|
------------------------------------------------------
extender -- seg -->|len|0|1| |<- idx of the child ->| (may use some of previous cells)Example:
Suppose we have an Extender whose subnode is stored at index 0x12345678 with a segment RRRRR...R, 1000 R's. Then,
01111...1110000000, 1001 1's followed by 7 0's. This has 126 bytes.4 which is 000100 in binary:-4 |111111111111111111.........111111111111111111111111111|
-3 |111111111111111111.........111111111111111111111111111|
-2 |111111111111111111.........111111111111111111111111111|
-1 |111111111111111111.........1111111111111111111000..000| <- 233 1's
|< ---- 224 bits -------->| |<------- 32 bits ------>|
------------------------------------------------------------------
extender |00000...00000000|000100|0|1| | 0x12345678 |The first 224bits are used to store the hash of the leaf.
The index 0 is a special index that is always used to refer to the zero leaf, the leaf with the empty data. The zero leaf is not recorded in the disk. The actual 0th cell in a data file is used for the header to store some meta data.
The tag of a small leaf is between 2^32-128 to 2^32-1 inclusive. The data is stored in previous cells of the leaf from its head with right padding with 0's.
The size of the data is calculated:
size = 2^32 - tag |<-- hash_size -->| |<----- 32 bits ------>|
------------------------------------------------------
leaf (S) |<---- hash ----->| |<- -1 to -128-------->| (use the previous cell)Previous cells:
|<------ (length_in_bytes + 31) / 32 cells ----------->|
|<------ size bytes ----------------------->| |
------------------------------------------------------------------
data |<- value contents ------------------------>|000......0| |The tag is 2^32-253. It is for values between 129 bytes and 4GB-1.
|<-- hash_size -->| |<----- 32 bits ------>|
------------------------------------------------------
leaf (L) |<---- hash ----->| |<- -253 ------------->| (use the previous cell)Previous cells:
|<--------------------- n cells ------------------------->|
|<-------- size bytes --------------->| |<- 32bits ->|
-------------------------------------------------------|-------------
data |<- value contents ------------------>|<0000>|<- size --->|The contents of the value of size bytes are stored in previous n cells, where n = (size + 4 + bytes_per_cell - 1) / bytes_per_cell. size is recorded at the last 4 bytes of the previous cell of the leaf node cell.
The tag is 2^32-255. It is for values over 4GB.
|<-- hash_size -->| |<----- 32 bits ------>|
------------------------------------------------------
leaf (XL) |<---- hash ----->| |<- -255 ------------->| (use previous cells)data |<- the last cell of the last cell chunk for the value ->|
The contents of the value is stored in previous cells, forming a linked list of *cell chunks* whose format is defined below. The sole size limit of values storable in the large leaf is the maximum number of cells allowed in a data file.
### Cell chunk
(I use the word 'chunk' since 'block' means a different thing in the blockchain technology.)
A cell chunk is a contiguous cells. There is a 6 byte length *footer fields* in the last bytes of each cell chunk:
* The last 4 bytes is the *cdr index* which points to the last cell of the next cell chunk in the chunk list. If the cdr index is 0, the cell chunk is the last chunk in the list.
* The 2 bytes prior to the cdr index is the *content length* in `uint16`. It is the number of bytes the cell chunk carries for the value contents.
* The data for the value in a cell chunk is stored from the head of the first cell of the cell chunk. Number of the cells in the cell chunk is computable from the expression `(content_length + 6 + 31) / 32`.
* The remaining space between the content and the footer is filled with 0's.
* A cell chunk can carry up to 65535 bytes of the value, which consist of 2049 cells.
Cell chunk layout:cell #1 | cell #2 | .. | the last cell in the chunk (#n) | ||
.. | footer fields | ||||
<-16bits-> | <-32bits-> |
<- contents (size bytes) -------------------> | 000...0 | size | cdr index |
The contents of a value are stored from the last cell chunk in the list whose cdr index is 0. The head of cell chunk list carries the *last* part of the contents.
Example: TODO
## Bud
### Empty bud
224bits of ones are prepended in front of the 32bits of 2^32 - 256: |<-- hash_size -->| |<----- 32 bits ------>|empty bud |<-1111111111111->| |<- -256 ------------->|
### Non-empty bud
* 224 bit hash of the node, ending with 0b11
* followed by 32 bits of the index of the subnode. |<-- hash_size -->| |<------- 32 bits ---->|bud |<--- hash -->|1|1| |<- idx of the child ->|
## Link
Hash-consing may introduce Internal node whose either subnodes cannot be placed prior to the Internal node in the storage. To work around this, a special Link node with tag -254 is introduced. One of the sub-nodes is pointed indirectly via this link node which is placed in the previous cell of the internal node:
* Tag is 2^32 - 254
* The index of the subnode is placed from 192nd bit.
* The first 192 bits are 0 filled. |<-- hash_size -->| |<----- 32 bits ------>|link |<0>|<-child idx->| |<- -254 ------------->|
Exapmle: TODO
## Reserved
The other tags between 2^32-256 and 2^32-1 are reserved for future extension. |<-- hash_size -->| |<----- 32 bits ------------>|reserved | | |<-unused btwn -256 and -1 ->|
## Extensions
Extensions enlarge the cell size from 32 bytes to larger one to store more data.
The size of the header part is fixed to 256 bytes even with extended cells.
### Root hash records under extensions
The format of the root hash records does not change even with the extended cells.
They use only the first 32 bytes of each cell. The rest is filled with 0s.
### Larger hash sizes
`bytes_per_hash` can be extended to a larger size, such as 256 bits.
The hash calculation algorithms are unchanged, except using hash functions
returning longer bits.
### Independent flags
This extends the cell with an independent space of `flag_size` bytes for the flags
between the hash and the index area.
Hashes
When the flags are not combined and the pagination is disabled: |<-- hash_size -->| |<flag_size>| |<------- 32 bits ---->|internal |<---- hash ----->| | |D|0| |<- idx of A child --->| (also refers to the previous cell) extender -- seg ---------------->|len|0|1| |<- idx of the child ->| (may use some of previous cells) bud |<---- hash ----->| | |1|1| |<- idx of the child ->| empty bud |<--- 11111111 -->| | | |<- -256 ------------->| leaf (S) |<---- hash ----->| | | |<- -1 to -128 ------->| (use previous cells) leaf (L) |<---- hash ----->| | | |<- -253 ------------->| (use previous cells) leaf (XL) |<---- hash ----->| | | |<- -255 ------------->| (use previous cells) link |<0>|<-child idx->| | | |<- -254 ------------->|
When the flags are combined and the pagination is enabled: |<- 32 bits ->| |<-- hash_size -->| |<------- 32 bits ---->|internal |<-- count -->| |<-- hash --->|D|0| |<- idx of A child --->| (also refers to the previous cell) extender ---------- seg ---------->|len|0|1| |<- idx of the child ->| (may use previous cells) bud |<-- count -->| |<--- hash -->|1|1| |<- idx of the child ->| empty bud |<-- 0 -->| |<-- 11111111 --->| |<- -256 ------------->| leaf (S) ---- value -->| |<---- hash ----->| |<- -1 to -132 ------->| (may use previous cells) leaf (L) |<-leaf_size->| |<---- hash ----->| |<- -255 ------------->| (use previous cells) leaf (XL) | unused | |<---- hash ----->| |<- -255 ------------->| (use previous cells) link | unused | |<0>|<-child idx->| |<- -254 ------------->|
When the flags are not combined and the pagination is enabled: |<- 32 bits ->| |<-- hash_size -->| |<flag_size>| |<------- 32 bits ---->|internal |<-- count -->| |<---- hash ----->| | |D|0| |<- idx of A child --->| (also refers to the previous cell) extender -------------- seg -------------------->|len|0|1| |<- idx of the child ->| (may use previous cells) bud |<-- count -->| |<---- hash ----->| | |1|1| |<- idx of the child ->| empty bud |<-- 0 -->| |<-- 11111111 --->| | | |<- -256 ------------->| leaf (S) ---- value -->| |<---- hash ----->| | | |<- -1 to -128 ------->| (may use previous cells) leaf (L) |<-leaf_size->| |<---- hash ----->| | | |<- -1 to -128 ------->| (use previous cells) leaf (XL) | unused | |<---- hash ----->| | | |<- -255 ------------->| (use previous cells) link | unused | |<0>|<-child idx->| | | |<- -254 ------------->|
This layout restricts the maximum index number to 2^32-257. Which is about 4G cells, 128GB file size.