Ethereum's Recursive Length Prefix encoding in Elixir


eth_rlp

Introduction

This week we, at Mana Project, rewrote our implementation of Ethereum’s Recursive Length Prefix (RLP) encoding and now encoding is completely done recursively using Elixir’s protocols. I think it’s a good opportunity to describe RLP and its implementation details.

Background

Recursive Length Prefix is Ethereum’s homebrew binary encoding that extensively used from Merkle Paticia Tree (keys are encoded with RLP) to peer-to-peer communications between nodes with RLPx (message data are encoded with RLP).

Why was RLP chosen as common encoding algorithm in Ethereum?

From Ethereum’s design rationale:

RLP is intended to be a highly minimalistic serialization format; its sole purpose is to store nested arrays of bytes. Unlike protobuf, BSON and other existing solutions, RLP does not attempt to define any specific data types such as booleans, floats, doubles or even integers; instead, it simply exists to store structure, in the form of nested arrays, and leaves it up to the protocol to determine the meaning of the arrays

Ethereum's design rationale

Advantages of RLP:

  • simplicity of implementation
  • guaranteed absolute byte-perfect consistency

Implementation

First I’ll provide a formal definition of RLP from Ethereum’s Yellow Paper, then I’ll show how it corresponds to our implementation, and finally, I’ll describe how to encode custom types.

Definition from the Yellow Paper

We define the RLP function as RLP through two sub-functions, the first handling the instance when the value is a byte array, the second when it is a sequence of further values.

If the value to be serialized is a byte-array, the RLP serialization takes one of three forms:

  • If the byte-array contains a single byte solely and that single byte is less than 128, then the input is exactly equal to the output.
  • If the byte-array contains fewer than 56 bytes, then the output is equal to the input prefixed by the byte equal to the length of the byte array plus 128.
  • Otherwise, the output is equal to the input prefixed by the minimal-length byte-array which when interpreted as a big-endian integer is equal to the length of the input byte array, which is itself prefixed by the number of bytes required to faithfully encode this length value plus 183.

If RLP is used to encode a scalar, defined only as a positive integer, it must be specified as the shortest byte array such that the big-endian interpretation of it is equal.

If instead, the value to be serialized is a sequence of other items then the RLP serialization takes one of two forms:

  • If the concatenated serializations of each contained item are less than 56 bytes in length, then the output is equal to that concatenation prefixed by the byte equal to the length of this byte array plus 192.
  • Otherwise, the output is equal to the concatenated serializations prefixed by the minimal-length byte-array which when interpreted as a big-endian integer is equal to the length of the concatenated serializations byte array, which is itself prefixed by the number of bytes required to faithfully encode this length value plus 247.
Protocols

From formal definition, we see that there are three different variants. They are:

  • Encoding byte-arrays.
  • Encoding positive integers.
  • Encoding sequences of other items.

In our implementation for each of these variants, we define a separate protocol.

Integers

So let’s start with the simplest one, the protocol for integers:

defimpl ExRLP.Encode, for: Integer do
  alias ExRLP.Encode

  @spec encode(ExRLP.t()) :: binary()
  def encode(value) when value >= 0 do
    value
    |> to_binary()
    |> Encode.encode()
  end

  @spec to_binary(integer()) :: binary()
  defp to_binary(object) when is_integer(object) and object == 0 do
    ""
  end

  defp to_binary(object) when is_integer(object) and object > 0 do
    object |> :binary.encode_unsigned()
  end
end

The protocol for integers is a subset of protocol for binaries. It converts a number to binary and encodes it with the protocol for binaries.

Examples:

iex> 5 |> ExRLP.Encode.encode
<<5>>
iex> 1_000_000 |> ExRLP.Encode.encode
<<131, 15, 66, 64>>
Binaries

As we can see in the formal definition there are three situations encoding binaries and thanks to Elixir’s syntax they can be concisely described in programming code using guard clauses and pattern matching:

defimpl ExRLP.Encode, for: BitString do
  @spec encode(ExRLP.t()) :: binary()
  def encode(value) do
    value |> encode_item
  end

  @spec encode_item(binary()) :: binary()
  defp encode_item(<<byte>> = item) when byte_size(item) == 1 and byte < 128 do
    item
  end

  defp encode_item(item) when is_binary(item) and byte_size(item) < 56 do
    prefix = 128 + byte_size(item)

    <<prefix>> <> item
  end

  defp encode_item(item) when is_binary(item) do
    be_size = item |> Utils.big_endian_size()
    byte_size = be_size |> byte_size

    <<183 + byte_size>> <> be_size <> item
  end
end

All three cases have separate encode_item/1 method. The first method is for a single byte less than 128, the second method is for byte-arrays containing fewer than 56 bytes and finally the third method for all other binaries.

Examples:

iex> "a" |> ExRLP.Encode.encode
"a"

iex> "abc" |> ExRLP.Encode.encode
<<131, 97, 98, 99>>

iex> "abcdefghi" |> ExRLP.Encode.encode
<<137, 97, 98, 99, 100, 101, 102, 103, 104, 105>>
Lists

Here’s protocol for lists:


defimpl ExRLP.Encode, for: List do
  alias ExRLP.{Utils, Encode}

  @spec encode([ExRLP.t()]) :: binary()
  def encode(values) do
    values
    |> encode_items("")
  end

  @spec encode_items([ExRLP.t()], binary()) :: binary()
  defp encode_items([], acc) do
    acc |> prefix_list
  end

  defp encode_items([item | tail], acc) do
    encoded_item = item |> Encode.encode()

    tail |> encode_items(acc <> encoded_item)
  end

  @spec prefix_list(binary()) :: binary()
  defp prefix_list(encoded_concat) when byte_size(encoded_concat) < 56 do
    size = encoded_concat |> byte_size

    <<192 + size>> <> encoded_concat
  end

  defp prefix_list(encoded_concat) do
    be_size = encoded_concat |> Utils.big_endian_size()
    byte_size = be_size |> byte_size

    <<247 + byte_size>> <> be_size <> encoded_concat
  end
end

It traverses list items recursively using tail recursion and encodes each item using already defined protocols. By the way, a list can contain not only lists, binaries and positive integers but also types for which ExRLP.Encode protocol is defined.

Examples:

iex> [[[]], []] |> ExRLP.Encode.encode
<<195, 193, 192, 192>>

iex> [42, "eth"] |> ExRLP.Encode.encode
<<197, 42, 131, 101, 116, 104>>

iex> [42, ["sun", "moon", 5]] |> ExRLP.Encode.encode
<<204, 42, 202, 131, 115, 117, 110, 132, 109, 111, 111, 110, 5>>

Custom protocols

As an example, let’s define protocol for LogEntry struct that is used in our EVM implementation:

defmodule ExRLP.LogEntry do
  defstruct address: nil, topics: [], data: nil

  @type t :: %__MODULE__{
          address: binary(),
          topics: [integer()],
          data: binary()
        }

  @spec new(binary, [integer()], binary()) :: t()
  def new(address, topics, data) do
    %__MODULE__{
      address: address,
      topics: topics,
      data: data
    }
end

As we can see the types of its fields are binaries, integers, and lists and we already defined protocols for these types. The only thing left to do is present LogEntry struct in a form that we can encode:

defmodule ExRLP.LogEntry do

  ...

  def to_list(log) do
    [log.address, log.topics, log.data]
  end
end

defimpl ExRLP.Encode, for: ExRLP.LogEntry do
  alias ExRLP.{Encode, LogEntry}

  @spec encode(LogEntry.t()) :: binary()
  def encode(log) do
    log
    |> LogEntry.to_list()
    |> Encode.encode()
  end
end

Examples:

iex> log_entry = ExRLP.LogEntry.new(
...>                <<15, 87, 46, 82, 149, 197, 127, 21, 136, 111, 155, 38, 62, 47, 109, 45, 108, 123,
...>                  94, 198>>,
...>                [0, 0, 0],
...>                <<255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
...>                  255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255>>
...> )
%ExRLP.LogEntry{
  address: <<15, 87, 46, 82, 149, 197, 127, 21, 136, 111, 155, 38, 62, 47, 109,
    45, 108, 123, 94, 198>>,
  data: <<255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
    255, 255, 255>>,
  topics: [0, 0, 0]
}

iex> log_entry |> ExRLP.Encode.encode
 <<248, 58, 148, 15, 87, 46, 82, 149, 197, 127, 21, 136, 111, 155, 38, 62, 47, 109, 45,
 108, 123, 94, 198, 195, 128, 128, 128, 160, 255, 255, 255, 255, 255, 255, 255, 255, 255,
 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255,
 255, 255, 255, 255, 255, 255>>

Notes on decoding

We do not need protocols for decoding because the result of encoding is always a binary. Moreover, we don’t have information about types of an original object so, the result of decoding always consists of binaries and lists. One way to decode complex objects is by providing resulting type structure. Here’s a code that is used in tests:

  def normalize_decoded_data(input, output, acc \\ [])

  def normalize_decoded_data(input, output, _) when is_number(output) do
    input |> :binary.decode_unsigned()
  end

  def normalize_decoded_data([], [], acc), do: acc

  def normalize_decoded_data([in_head | in_tail], [out_head | out_tail], acc) do
    normalized_item = normalize_decoded_data(in_head, out_head)

    normalize_decoded_data(in_tail, out_tail, acc ++ [normalized_item])
  end

  def normalize_decoded_data(input, _output, _acc), do: input

Usage example:

input
|> Decode.decode()
|> normalize_decoded_data(expected_result)
See also