In this blog post series, we are exploring idiomatic Noir code. In case you missed the first part, check it out here.
Today, we will focus on how to deal with data parsing. Sound interesting? Let’s dive in.
A common routine in a circuit is to parse data in a particular format (e.g., Base64, RLP, JSON) coming from an external source (e.g., email, blockchain, OAuth response, web), and then verify its content against cryptographic primitives such as hashes, signatures, or Merkle proofs.
To do this, we usually need to do two things:
1. Decode (or encode) the data between its raw, serialized form and a structured one.
2. Verify the resulting content against the relevant cryptographic primitive.
The majority of ZK projects do all of the above daily. This includes ZK Email, ZK Login, ZK Pass, and many others.
A fun fact is that encoding and decoding are interchangeable in ZK circuits: the circuit ultimately just compares the encoded data against the decoded data and returns a boolean, so it does not matter in which direction the conversion runs.
Therefore, we only need to implement one of the two routines, and we can pick whichever is cheaper or easier to implement. In practice, decoding is usually the preferred option, so that is what this post focuses on.
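To make this interchangeability concrete, here is a minimal sketch. Base64 is used as the example format, and base64_encode/base64_decode are hypothetical helpers with the obvious signatures (they are not from any specific library); both variants boil down to the same equality assertion:

// Hypothetical helpers, assumed for this sketch:
// fn base64_encode<PLAIN_LEN, ENC_LEN>(plain: [u8; PLAIN_LEN]) -> [u8; ENC_LEN]
// fn base64_decode<ENC_LEN, PLAIN_LEN>(encoded: [u8; ENC_LEN]) -> [u8; PLAIN_LEN]

// Option A: encode the claimed plaintext and compare it to the encoded input.
fn check_by_encoding<PLAIN_LEN, ENC_LEN>(
    encoded: [u8; ENC_LEN], claimed_plain: [u8; PLAIN_LEN]
) {
    assert(base64_encode(claimed_plain) == encoded);
}

// Option B: decode the encoded input and compare it to the claimed plaintext.
fn check_by_decoding<PLAIN_LEN, ENC_LEN>(
    encoded: [u8; ENC_LEN], claimed_plain: [u8; PLAIN_LEN]
) {
    assert(base64_decode(encoded) == claimed_plain);
}

Both functions accept exactly the same input/witness pairs, so we are free to implement whichever direction is cheaper.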
Let’s go back to the JSON parser example from part 1 of this blog post:
fn parse_json<MAX_INPUT_LEN, MAX_NODES>(
    input: Iterable<u8, MAX_INPUT_LEN>
) -> Iterable<JsonNode, MAX_NODES> {
    ...
}
We will now focus on what the type of JsonNode should look like. For simplicity, we will start with a function that only parses an array of strings, so we will narrow the inputs down to the following: ["zk", "is", "awesome", ..., "hard"].
With these assumptions, we could rewrite our function to look something like this:
global MAX_STRING_LEN = ...;

type JSONString = [u8; MAX_STRING_LEN];

fn parse_json_string_array<MAX_INPUT_LEN, MAX_NODES>(
    input: Iterable<u8, MAX_INPUT_LEN>
) -> Iterable<JSONString, MAX_NODES> {
    ...
}
However, this introduces a few new problems:
- Every parsed string is copied out of the input into its own array, which costs constraints for each of the MAX_NODES entries.
- MAX_STRING_LEN has to be chosen up front, large enough for the longest possible string, so most entries carry useless padding.
We can improve our example code by introducing a Fragment type that only stores a reference to the original string and the offset and length of the parsed input.
struct JSONFragment<MAX_LEN> {
    offset: u64,
    length: u64,
    data: [u8; MAX_LEN],
}

fn parse_json_string_array<MAX_INPUT_LEN, MAX_NODES>(
    input: Iterable<u8, MAX_INPUT_LEN>
) -> Iterable<JSONFragment<MAX_INPUT_LEN>, MAX_NODES> {
    ...
}
Two things to notice here:
- The parser no longer copies any bytes out of the input: each fragment just records where in the original input it lives, which is why the output type is parameterized by MAX_INPUT_LEN rather than by a separate MAX_STRING_LEN.
- Nothing about the fragment is specific to JSON, or even to u8, which suggests a more general type.
A generalized implementation could look like this:
struct Fragment<T, MAX_LEN> {
    offset: u64,
    length: u64,
    data: [T; MAX_LEN],
}

fn parse_json_string_array<MAX_INPUT_LEN, MAX_NODES>(
    input: Iterable<u8, MAX_INPUT_LEN>
) -> Iterable<Fragment<u8, MAX_INPUT_LEN>, MAX_NODES> {
    ...
}
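To show what "recording an offset and a length instead of copying" could look like in a parser body, here is a simplified sketch of a hypothetical string finder. It uses plain arrays and the standard library BoundedVec instead of the Iterable trait, and it only handles unescaped quotes; everything here beyond the Fragment type above is an illustrative assumption:

fn find_strings<MAX_INPUT_LEN, MAX_NODES>(
    input: [u8; MAX_INPUT_LEN]
) -> BoundedVec<Fragment<u8, MAX_INPUT_LEN>, MAX_NODES> {
    let mut fragments: BoundedVec<Fragment<u8, MAX_INPUT_LEN>, MAX_NODES> =
        BoundedVec::new();
    let mut start: u64 = 0;
    let mut in_string = false;
    for i in 0..MAX_INPUT_LEN {
        if input[i] == 0x22 { // a double quote
            if in_string {
                // Closing quote: record where the string starts and how
                // long it is. Storing `input` in the fragment duplicates
                // wires, not constraints, so no bytes are really copied out.
                fragments.push(Fragment {
                    offset: start,
                    length: (i as u64) - start,
                    data: input,
                });
            } else {
                start = (i as u64) + 1;
            }
            in_string = !in_string;
        }
    }
    fragments
}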
A careful reader might notice that a performance problem has crept in. Consider the following code, which uses the Iterable trait from the previous part:

let input: [u8; 10000] = "...";
let json_nodes: [Fragment<u8, 10000>; 20] = parse_json_string_array(input);
json_nodes[0].each(|c: u8| {
    ...
});

The loop above will now iterate over 10,000 elements, even though the parsed string is usually much shorter than the entire input. Can we do something about this? We can introduce a method that narrows a fragment down to a smaller maximum length, provided that length is known at compile time:
impl<T, MAX_LEN> Fragment<T, MAX_LEN> {
    fn focus<NEW_MAX_LEN>(self) -> Fragment<T, NEW_MAX_LEN> {
        ...
    }
}
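One possible implementation of focus, as a sketch: copy the fragment’s window into a smaller backing array and assert that nothing gets cut off. The bounds guard and the filler element are details of this sketch, not a prescribed design:

impl<T, MAX_LEN> Fragment<T, MAX_LEN> {
    fn focus<NEW_MAX_LEN>(self) -> Fragment<T, NEW_MAX_LEN> {
        // The parsed content must fit into the new, smaller bound.
        assert(self.length <= NEW_MAX_LEN as u64);
        // Fill with an arbitrary existing element, since T has no default.
        let mut data: [T; NEW_MAX_LEN] = [self.data[0]; NEW_MAX_LEN];
        for i in 0..NEW_MAX_LEN {
            let src = self.offset + (i as u64);
            // Guard against reading past the original backing array.
            if src < MAX_LEN as u64 {
                data[i] = self.data[src];
            }
        }
        Fragment { offset: 0, length: self.length, data }
    }
}

The copy loop runs NEW_MAX_LEN times rather than MAX_LEN times, which is the whole point of the narrowing.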
Now, let’s make sure that the inner loop is optimal:

let input: [u8; 10000] = "...";
let json_nodes: [Fragment<u8, 10000>; 20] = parse_json_string_array(input);
let short_str: Fragment<u8, 10> = json_nodes[0].focus();
short_str.each(|c: u8| {
    ...
});

The revised loop above will now iterate over just 10 elements.
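Why does the iteration count follow the type? A sketch of how the each method from part 1 could look for Fragment makes it visible (a non-capturing callback is assumed for simplicity): loops in Noir are unrolled at compile time, so the loop bound has to be the type-level MAX_LEN, and the runtime length can only gate which iterations do real work:

impl<T, MAX_LEN> Fragment<T, MAX_LEN> {
    fn each(self, f: fn(T) -> ()) {
        // Unrolled MAX_LEN times at compile time; `length` only decides
        // which of those iterations actually invoke the callback.
        for i in 0..MAX_LEN {
            if (i as u64) < self.length {
                f(self.data[self.offset + (i as u64)]);
            }
        }
    }
}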
The natural name for a structure that holds a reference to an array plus an offset and a length is a slice. Rust already uses slice for a type with very similar properties. However, the Noir Slice is actually closer to a list: it has cheap operations for adding elements at the beginning, at the end, and in the middle. So we would suggest renaming it to List, freeing up the name Slice for what we call Fragment here. Alternatively, Slice could be extended to match the functionality of Fragment.
Now, we can model a generalized JSON parser function:
global JSON_TYPE_STRING = ...;
global JSON_TYPE_INTEGER = ...;
global JSON_TYPE_ARRAY = ...;
...

struct JSONNode<T, MAX_LEN> {
    fragment: Fragment<T, MAX_LEN>,
    // `type` is a reserved word, so we call the field node_type.
    node_type: u64,
}

fn parse_json<MAX_INPUT_LEN, MAX_NODES>(
    input: Iterable<u8, MAX_INPUT_LEN>
) -> Iterable<JSONNode<u8, MAX_INPUT_LEN>, MAX_NODES> {
    ...
}
Which could be used in the following way:
let nodes = parse_json(...);
nodes.each(|node| {
    if node.node_type == JSON_TYPE_STRING {
        let small_str: Fragment<u8, 10> = node.fragment.focus();
        do_stringy_stuff(small_str);
    } else if node.node_type == JSON_TYPE_INTEGER {
        do_integery_stuff(node.fragment.into());
    } else if node.node_type == JSON_TYPE_ARRAY {
        let inner_nodes = parse_json(node.fragment);
        ...
    }
});
Again, a few things to notice here:
- Strings are narrowed with focus() before further processing, integers are converted with into() (sketched below), and arrays are handled by feeding their fragment back into parse_json recursively.
- The node type is only known at runtime, so the circuit pays for every branch, regardless of which one is actually taken.
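The into() conversion used for integer nodes is doing real work too. Here is a hypothetical sketch for ASCII decimal integers; real JSON numbers (signs, fractions, exponents) would need more cases:

impl<MAX_LEN> Fragment<u8, MAX_LEN> {
    fn into(self) -> u64 {
        let mut value: u64 = 0;
        for i in 0..MAX_LEN {
            if (i as u64) < self.length {
                let byte = self.data[self.offset + (i as u64)];
                // Integer nodes may contain only the digits '0'..'9'.
                assert((byte >= 0x30) & (byte <= 0x39));
                value = value * 10 + ((byte - 0x30) as u64);
            }
        }
        value
    }
}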
A typical circuit design might go something like this:
1. Take the raw encoded input and the expected decoded values as circuit inputs.
2. Parse the raw input into fragments.
3. Verify the cryptographic commitment (e.g., a hash) over the raw input.
4. Assert that each parsed field, after conversion, equals the expected value.
Let’s examine the pseudo-code below:
struct Decoded {
    field_1: u64,
    field_2: [u8; 32],
    ...
    field_n: bool,
    hash: [u8; HASH_LEN],
}

fn main(input: [u8; LEN], expected: Decoded) {
    // One fragment per field of Decoded, in declaration order;
    // N_FIELDS is the number of fields.
    let actual: BoundedVec<Fragment<u8, LEN>, N_FIELDS> = parse_encoded(input);
    assert(hash(input) == expected.hash);
    assert(actual.get(0).into() == expected.field_1);
    assert(actual.get(1).into() == expected.field_2);
    ...
    assert(actual.get(n).into() == expected.field_n);
}
Two things to point out here:
- The hash is computed over the raw input, so every parsed fragment, and therefore every compared field, is bound to the same commitment.
- The circuit never outputs the decoded data; it only asserts equality against the expected values, which is exactly the boolean-style check described at the beginning of this post.
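As a concrete end-to-end sketch, assume a toy format whose whole payload is a single ASCII integer, a hypothetical parse_integer helper returning a Fragment, and the sha256 digest from Noir’s standard library:

global LEN = 8;

// Hypothetical: returns a Fragment covering the ASCII digits in `input`.
// fn parse_integer(input: [u8; LEN]) -> Fragment<u8, LEN>

fn main(input: [u8; LEN], expected_value: u64, expected_hash: pub [u8; 32]) {
    // Bind the raw input to a public commitment.
    assert(std::sha256::digest(input) == expected_hash);
    // Parse, convert, and compare against the expected decoded value.
    let fragment = parse_integer(input);
    assert(fragment.into() == expected_value);
}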
There is still a lot to improve. In this example, we make many decisions during execution; the parser is generic and does not benefit from the specific structure of JSON; and recursive parsing requires multiple passes over the input. This can be mitigated by writing custom parsers for specific use cases, which can still preserve the principles proposed in this post.
Our ultimate goal is to develop serialization and deserialization that would allow parsing directly into Noir structures. This requires a more robust macro system in Noir, which we hope to see in the future.
Let us know your take on that on X.