Lesson 4

Parsers & AST

Transforming flat tokens into a tree that represents the structure of your code.

30 min read · Intermediate

From Tokens to Structure

In the last lesson, we learned how the lexer breaks code into tokens. But tokens are just a flat list — they don't capture the structure of the code.

Consider this expression:

2 + 3 * 4

The lexer produces these tokens:

Number(2) Plus Number(3) Star Number(4)

But this flat list doesn't tell us how the operations group: is it (2 + 3) * 4 or 2 + (3 * 4)?

We know multiplication has higher precedence, so the answer should be 14. But how does the compiler know this? That's where the parser comes in.
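As a sanity check in plain Rust: the two possible groupings of these tokens produce different answers, and only the grouping where * binds tighter gives 14.

```rust
fn main() {
    // Grouped as 2 + (3 * 4): multiplication binds tighter, giving 14
    assert_eq!(2 + 3 * 4, 14);
    // Grouped as (2 + 3) * 4: a different tree, a different answer
    assert_eq!((2 + 3) * 4, 20);
}
```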

Think of It Like Diagramming Sentences

In school, you might have diagrammed sentences: "The quick brown fox jumps" becomes a tree with subject, verb, adjectives. A parser does the same thing for code — it figures out what modifies what, and builds a tree structure.

What Is an AST?

The parser builds an Abstract Syntax Tree (AST). It's "abstract" because it captures the essential structure, ignoring unimportant details like whitespace or parentheses (which only existed to guide the parsing).

Here's our expression as an AST:

BinaryExpr
├── op: Add
├── left: Literal(2)
└── right: BinaryExpr
    ├── op: Mul
    ├── left: Literal(3)
    └── right: Literal(4)

This tree structure explicitly shows that 3 * 4 is evaluated first (it's deeper in the tree), then added to 2.
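To see why "deeper means first," here is a minimal, hypothetical Rust evaluator for exactly this tree shape (a stripped-down Expr, not Nova's full AST). Evaluation recurses to the leaves before combining results, so the nested Mul node is computed before the Add node sees its operands.

```rust
// A minimal, hypothetical evaluator for the tree above
// (a stripped-down Expr, not Nova's full AST)
enum Expr {
    Literal(i64),
    Binary { op: Op, left: Box<Expr>, right: Box<Expr> },
}

enum Op { Add, Mul }

// Recursion reaches the leaves first, so deeper nodes are computed earlier
fn eval(e: &Expr) -> i64 {
    match e {
        Expr::Literal(n) => *n,
        Expr::Binary { op, left, right } => {
            let (l, r) = (eval(left), eval(right));
            match op {
                Op::Add => l + r,
                Op::Mul => l * r,
            }
        }
    }
}

fn main() {
    // The AST from the diagram: Add(2, Mul(3, 4))
    let tree = Expr::Binary {
        op: Op::Add,
        left: Box::new(Expr::Literal(2)),
        right: Box::new(Expr::Binary {
            op: Op::Mul,
            left: Box::new(Expr::Literal(3)),
            right: Box::new(Expr::Literal(4)),
        }),
    };
    // Mul is evaluated on the way back up, before Add combines the results
    assert_eq!(eval(&tree), 14);
}
```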

A Bigger Example

Let's see how a function definition becomes an AST:

fn add(a: Int, b: Int) -> Int {
    a + b
}
Token Stream:

fn add ( a : Int , b : Int ) -> Int { a + b }

AST Node: a FunctionDef with name, params, return_type, and body.

FunctionDef
├── name: "add"
├── params: [
│   ├── Param { name: "a", type: Int }
│   └── Param { name: "b", type: Int }
│   ]
├── return_type: Int
└── body: BinaryExpr
    ├── op: Add
    ├── left: Var("a")
    └── right: Var("b")

Notice how the AST captures everything meaningful:

- the function's name ("add")
- each parameter's name and type
- the return type
- the body, as its own nested expression tree

But it doesn't capture:

- the parentheses, commas, colons, and braces
- the fn keyword and -> arrow that only guided the parse
- whitespace and line breaks

This is the "abstract" part — we keep the meaning, discard the syntax noise.

Grammars: The Rules of Structure

How does the parser know how to build the tree? It follows rules called a grammar. A grammar defines what valid programs look like.

Nova Grammar (Simplified)
Program ::= Item*
Item ::= FunctionDef | LetBinding
FunctionDef ::= "fn" IDENT "(" Params ")" "->" Type Block
Params ::= (Param ("," Param)*)?
Param ::= IDENT ":" Type
Expr ::= Term (("+" | "-") Term)*
Term ::= Factor (("*" | "/") Factor)*
Factor ::= NUMBER | IDENT | "(" Expr ")"

Read this as:

- A Program is zero or more Items (* means "zero or more").
- An Item is either a FunctionDef or a LetBinding (| means "or").
- A FunctionDef is the keyword fn, a name, a parenthesized parameter list, ->, a return type, and a block.
- Params is an optional comma-separated list of Params (? means "optional").
- An Expr is a Term, followed by any number of + or - operators each with another Term.
- A Term is a Factor, followed by any number of * or / operators each with another Factor.
- A Factor is a number, an identifier, or a parenthesized Expr.

Precedence in Grammar

Notice how the grammar handles 2 + 3 * 4. An Expr is made of Terms connected by + or -. A Term is made of Factors connected by * or /. This structure naturally gives * and / higher precedence than + and -!

How Parsing Works

The parser reads tokens one at a time and tries to match them against the grammar rules. There are several strategies:

Recursive Descent

The most intuitive approach: write a function for each grammar rule. Here's what parsing expressions looks like:

impl Parser {
    // Parse: Expr ::= Term (('+' | '-') Term)*
    fn parse_expr(&mut self) -> Expr {
        // First, parse a Term
        let mut left = self.parse_term();

        // Then, while we see + or -, keep parsing more Terms
        while self.current_is(Plus) || self.current_is(Minus) {
            let op = self.advance();  // consume + or -
            let right = self.parse_term();
            left = Expr::Binary { left: Box::new(left), op, right: Box::new(right) };
        }

        left
    }

    // Parse: Term ::= Factor (('*' | '/') Factor)*
    fn parse_term(&mut self) -> Expr {
        let mut left = self.parse_factor();

        while self.current_is(Star) || self.current_is(Slash) {
            let op = self.advance();
            let right = self.parse_factor();
            left = Expr::Binary { left: Box::new(left), op, right: Box::new(right) };
        }

        left
    }

    // Parse: Factor ::= NUMBER | IDENT | '(' Expr ')'
    fn parse_factor(&mut self) -> Expr {
        match self.current() {
            Number(n) => {
                self.advance();
                Expr::Literal(n)
            }
            Ident(name) => {
                self.advance();
                Expr::Var(name)
            }
            LParen => {
                self.advance();  // consume '('
                let expr = self.parse_expr();  // recurse!
                self.expect(RParen);  // consume ')'
                expr
            }
            _ => self.error("Expected expression")
        }
    }
}

Let's trace through 2 + 3 * 4:

1. parse_expr starts: calls parse_term() to get the left side.

2. parse_term starts: calls parse_factor(), sees 2, returns Literal(2). No * or /, so it returns.

3. parse_expr sees +: consumes it, then calls parse_term() for the right side.

4. parse_term for 3 * 4: gets Literal(3) from a factor, sees *, gets Literal(4). Returns Binary(3, Mul, 4).

5. parse_expr completes: returns Binary(2, Add, Binary(3, Mul, 4)), exactly the tree we wanted.
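The whole trace can be run for real. Below is a self-contained, illustrative version of the recursive-descent parser above, simplified to numbers and parentheses; purely to keep the sketch short, this Expr stores the raw operator Token instead of a separate BinaryOp enum.

```rust
// A self-contained, illustrative recursive-descent parser, simplified
// to numbers and parentheses. For brevity, Expr::Binary stores the raw
// operator Token rather than a separate BinaryOp enum.
#[derive(Clone, Copy, PartialEq, Debug)]
enum Token { Number(i64), Plus, Minus, Star, Slash, LParen, RParen }

#[derive(Debug, PartialEq)]
enum Expr {
    Literal(i64),
    Binary { op: Token, left: Box<Expr>, right: Box<Expr> },
}

struct Parser { tokens: Vec<Token>, pos: usize }

impl Parser {
    fn current(&self) -> Option<Token> { self.tokens.get(self.pos).copied() }
    // Panics on truncated input; a real parser would report an error instead
    fn advance(&mut self) -> Token { let t = self.tokens[self.pos]; self.pos += 1; t }

    // Expr ::= Term (('+' | '-') Term)*
    fn parse_expr(&mut self) -> Expr {
        let mut left = self.parse_term();
        while matches!(self.current(), Some(Token::Plus | Token::Minus)) {
            let op = self.advance();
            let right = self.parse_term();
            left = Expr::Binary { op, left: Box::new(left), right: Box::new(right) };
        }
        left
    }

    // Term ::= Factor (('*' | '/') Factor)*
    fn parse_term(&mut self) -> Expr {
        let mut left = self.parse_factor();
        while matches!(self.current(), Some(Token::Star | Token::Slash)) {
            let op = self.advance();
            let right = self.parse_factor();
            left = Expr::Binary { op, left: Box::new(left), right: Box::new(right) };
        }
        left
    }

    // Factor ::= NUMBER | '(' Expr ')'
    fn parse_factor(&mut self) -> Expr {
        match self.advance() {
            Token::Number(n) => Expr::Literal(n),
            Token::LParen => {
                let e = self.parse_expr();
                assert_eq!(self.advance(), Token::RParen, "expected ')'");
                e
            }
            t => panic!("expected expression, found {:?}", t),
        }
    }
}

fn main() {
    // Tokens for: 2 + 3 * 4
    let mut p = Parser {
        tokens: vec![Token::Number(2), Token::Plus, Token::Number(3),
                     Token::Star, Token::Number(4)],
        pos: 0,
    };
    // The * subtree ends up nested under the + node, just as in the trace
    assert_eq!(
        p.parse_expr(),
        Expr::Binary {
            op: Token::Plus,
            left: Box::new(Expr::Literal(2)),
            right: Box::new(Expr::Binary {
                op: Token::Star,
                left: Box::new(Expr::Literal(3)),
                right: Box::new(Expr::Literal(4)),
            }),
        }
    );
}
```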

Defining AST Types in Rust

The AST is defined as a collection of Rust types, usually enums:

// An expression can be many things
pub enum Expr {
    // Literal values
    Literal(Literal),

    // Variable reference
    Var(String),

    // Binary operation: left op right
    Binary {
        left: Box<Expr>,
        op: BinaryOp,
        right: Box<Expr>,
    },

    // Function call: name(args)
    Call {
        name: String,
        args: Vec<Expr>,
    },

    // If expression: if cond { then } else { else }
    If {
        condition: Box<Expr>,
        then_branch: Box<Expr>,
        else_branch: Option<Box<Expr>>,
    },
}

// Binary operators
pub enum BinaryOp {
    Add, Sub, Mul, Div,
    Eq, NotEq, Lt, Gt,
}

// Literal values
pub enum Literal {
    Int(i64),
    Float(f64),
    String(String),
    Bool(bool),
}

Why Box?

In Rust, we use Box<Expr> because Expr is recursive — a Binary contains other Exprs. Without Box, the type would have infinite size! Box puts the nested expressions on the heap with a fixed-size pointer.
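A compact illustration of the point: with Box, the recursive type compiles and arbitrarily deep trees can be built. The depth helper is hypothetical, added only to exercise the nesting.

```rust
// Without Box, this enum would be rejected by the compiler:
// "recursive type `Expr` has infinite size". Box stores each child on
// the heap, so the variant itself holds only fixed-size pointers.
enum Expr {
    Literal(i64),
    Binary { left: Box<Expr>, right: Box<Expr> },
}

// A hypothetical helper, just to walk the nested structure
fn depth(e: &Expr) -> usize {
    match e {
        Expr::Literal(_) => 1,
        Expr::Binary { left, right } => 1 + depth(left).max(depth(right)),
    }
}

fn main() {
    // Shaped like 1 + (2 * 3); operators omitted for brevity
    let e = Expr::Binary {
        left: Box::new(Expr::Literal(1)),
        right: Box::new(Expr::Binary {
            left: Box::new(Expr::Literal(2)),
            right: Box::new(Expr::Literal(3)),
        }),
    };
    assert_eq!(depth(&e), 3);
}
```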

Parser Combinators with Chumsky

Writing parsers by hand is educational but tedious. Nova uses Chumsky, a parser combinator library for Rust. Instead of writing parsing functions, you compose small parsers into bigger ones.

use chumsky::prelude::*;

// A parser for expressions
fn expr_parser() -> impl Parser<Token, Expr, Error = Simple<Token>> {
    // First, define what a "factor" is
    let factor = select! {
        Token::Number(n) => Expr::Literal(Literal::Int(n)),
        Token::Ident(name) => Expr::Var(name),
    };

    // Define a term: factor, optionally followed by * or / and more factors
    let term = factor.clone()
        .then(
            just(Token::Star).or(just(Token::Slash))
                .then(factor)
                .repeated()
        )
        .foldl(|left, (op, right)| Expr::Binary {
            left: Box::new(left),
            op: to_binary_op(op),
            right: Box::new(right),
        });

    // Define an expression: term, optionally followed by + or -
    term.clone()
        .then(
            just(Token::Plus).or(just(Token::Minus))
                .then(term)
                .repeated()
        )
        .foldl(|left, (op, right)| Expr::Binary {
            left: Box::new(left),
            op: to_binary_op(op),
            right: Box::new(right),
        })
}

Key Chumsky combinators:

- select! matches a token and maps it to a value
- just(tok) matches exactly that one token
- .or() tries an alternative if the first parser fails
- .then() runs two parsers in sequence, producing a pair
- .repeated() applies a parser zero or more times, collecting the results
- .foldl() folds the repeated (operator, operand) pairs into a left-nested tree

Error Recovery

A good parser doesn't just fail on the first error — it tries to recover and find more errors. This is crucial for IDE support and helpful compiler messages.

// Bad error experience:
error: unexpected token
  --> main.nova:1:5

// Good error experience:
error: expected `)` to close function parameters
  --> main.nova:1:15
   |
1  | fn add(a: Int, b: Int -> Int {
   |               ^ expected `)` here
   |
   = help: you may have forgotten to close the parameter list

error: expected `{` to start function body
  --> main.nova:1:27
   |
   ...

Nova's parser uses synchronization points — when an error occurs, it skips tokens until it finds something recognizable (like a fn keyword or }), then continues parsing.
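Here is one way such a synchronization step might look. This is an illustrative sketch, not Nova's actual recovery code, and the token names (Fn, RBrace) are assumptions.

```rust
// An illustrative sketch of a synchronization step, not Nova's actual
// recovery code; the token names (Fn, RBrace) are assumptions.
#[derive(Debug, PartialEq)]
enum Token { Fn, RBrace, Ident(String), Garbage }

// After an error, skip forward to a token that can safely start or end
// an item, and resume parsing from there.
fn synchronize(tokens: &[Token], mut pos: usize) -> usize {
    while pos < tokens.len() {
        match tokens[pos] {
            Token::Fn | Token::RBrace => return pos,
            _ => pos += 1,
        }
    }
    pos
}

fn main() {
    let toks = vec![
        Token::Garbage,
        Token::Ident("x".into()),
        Token::Fn,
        Token::Ident("next".into()),
    ];
    // An error at position 0 skips ahead to the `fn` at position 2,
    // so the parser can keep going and report further errors
    assert_eq!(synchronize(&toks, 0), 2);
}
```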

Spans in the AST

Remember spans from the lexer? They're even more important in the AST. Every node carries its source location:

pub struct Spanned<T> {
    pub node: T,
    pub span: Span,
}

// Every expression knows where it came from
type SpannedExpr = Spanned<Expr>;

When the type checker finds an error like "can't add String to Int", it uses spans to show exactly which expression is wrong.
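A minimal sketch of how spans tie nodes back to source text, assuming Span is a simple byte range (the real type may carry more, such as a file id):

```rust
// Assumes Span is a simple byte range into the source; the real type
// may carry more (a file id, for instance).
#[derive(Debug, Clone, Copy)]
struct Span { start: usize, end: usize }

struct Spanned<T> {
    node: T,
    span: Span,
}

enum Expr { Var(String) }

fn main() {
    let src = "a + b";
    // The variable `a` occupies bytes 0..1 of the source
    let a = Spanned { node: Expr::Var("a".into()), span: Span { start: 0, end: 1 } };

    // An error message can slice the original text back out via the span
    let Expr::Var(name) = &a.node;
    assert_eq!(name, "a");
    assert_eq!(&src[a.span.start..a.span.end], "a");
}
```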

Key Takeaway

The parser transforms a flat list of tokens into a tree structure (AST) that captures the meaning and structure of your code. The grammar defines what valid programs look like, and the parser builds the tree by following those rules. This structured representation is what the rest of the compiler works with.

Why This Matters for Nova

Nova's parser does something most parsers don't — it understands contracts as first-class syntax.

The Problem

Most languages treat assertions and contracts as runtime checks buried in function bodies. They're invisible to the compiler, can't be used for optimization, and only fail when code is already running.

How Nova Solves It

Nova's grammar includes requires and ensures clauses as part of function definitions. The AST captures them, and the verifier can reason about them before the code ever runs.

Here's what a Nova function definition looks like in the AST:

FunctionDef
├── name: "divide"
├── params: [a: Int, b: Int]
├── return_type: Int
├── requires: [BinaryExpr(b != 0)]  ← Contract!
├── ensures: [...]                  ← Guarantees!
└── body: BinaryExpr(a / b)
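One plausible Rust shape for such a contract-carrying node; the Expr variants here (NotEq in particular) are invented for the illustration, not taken from Nova's source.

```rust
// A hypothetical contract-carrying FunctionDef node; the Expr variants
// (NotEq in particular) are invented for this illustration.
enum Expr {
    Var(String),
    Literal(i64),
    NotEq(Box<Expr>, Box<Expr>),
}

struct FunctionDef {
    name: String,
    requires: Vec<Expr>, // preconditions the caller must satisfy
    ensures: Vec<Expr>,  // postconditions the body guarantees
}

fn main() {
    // divide(a, b) requires b != 0
    let f = FunctionDef {
        name: "divide".into(),
        requires: vec![Expr::NotEq(
            Box::new(Expr::Var("b".into())),
            Box::new(Expr::Literal(0)),
        )],
        ensures: vec![],
    };
    // The verifier can inspect the precondition without running the body
    assert_eq!(f.name, "divide");
    assert_eq!(f.requires.len(), 1);
}
```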

Because contracts are in the AST, the compiler can:

- verify requires and ensures clauses before the program ever runs
- use contract facts during optimization (a proven b != 0 means no runtime check is needed)
- point error messages at the exact clause that was violated

What's Next?

We've built our tree, but we still don't know if the code makes sense. Is add("hello", 5) valid? That's the job of the type checker, which we'll explore in the next lesson.