Lesson 3

Lexers & Tokens

How compilers break your code into bite-sized pieces they can understand.

25 min read · Beginner friendly

The First Step: Breaking Things Apart

When you look at code, you see words, numbers, and symbols organized in a way that makes sense to you. But to a computer, your source code is just one long string of characters:

fn add(a: Int, b: Int) -> Int { a + b }

That's 39 characters. The computer sees: f, n, a space, a, d, d... one character at a time. It has no idea that fn is a keyword or that add is a function name.

The lexer (also called a tokenizer or scanner) is the first stage of the compiler. Its job is simple but crucial: break the raw text into meaningful chunks called tokens.

Think of It Like Reading

When you read this sentence, you don't process it letter by letter. Your brain automatically groups letters into words, recognizes punctuation, and understands spacing. A lexer does the same thing for code — it groups characters into "words" (tokens) that the rest of the compiler can work with.

What Tokens Look Like

Let's see what happens when the lexer processes our example:

Source Code
fn add(a: Int, b: Int) -> Int { a + b }
Token Stream
FN, IDENT("add"), LPAREN, IDENT("a"), COLON, IDENT("Int"), COMMA, ...

Here's the full token stream, visualized:

fn add ( a : Int , b : Int ) -> Int { a + b }

Each of those is a separate token. Notice how the whitespace is gone, and how multi-character sequences like fn and -> have been grouped into single units.

Token Types in Nova

Every token has a type (what kind of thing it is) and sometimes a value (the actual content). Here are Nova's main token types:

| Token Type  | Examples                  | What It Represents                     |
| ----------- | ------------------------- | -------------------------------------- |
| Keyword     | fn, let, if, else, return | Reserved words with special meaning    |
| Identifier  | add, myVariable, User     | Names for functions, variables, types  |
| Integer     | 42, 0, 1_000_000          | Whole numbers                          |
| Float       | 3.14, 0.5, 1e10           | Decimal numbers                        |
| String      | "hello", "world"          | Text enclosed in quotes                |
| Operator    | +, -, *, /, ==, !=        | Mathematical and comparison operations |
| Punctuation | ( ) { } : ,               | Structure and grouping symbols         |

How the Lexer Works

The lexer reads the source code character by character, using patterns to recognize tokens. Here's the algorithm in plain English:

Step 1: Skip Whitespace

Spaces, tabs, and newlines don't mean anything in most contexts. Skip them. (But remember their positions for error messages!)

Step 2: Look at the Current Character

What is it? A letter? A digit? A symbol? This tells us what kind of token we're starting.

Step 3: Read the Full Token

If it's a letter, keep reading until we hit something that's not a letter or digit — that's an identifier. If it's a digit, read the whole number. If it's a quote, read until the closing quote.

Step 4: Emit the Token

Package up what we found (type + value + position) and move to the next character. Repeat until we hit the end of the file.

Real Rust Code

Nova's lexer is written in Rust. Here's a simplified version showing how we define tokens:

// Token types in Nova
pub enum TokenKind {
    // Keywords
    Fn,           // fn
    Let,          // let
    If,           // if
    Else,         // else
    Return,       // return
    Where,        // where
    Requires,     // requires
    Ensures,      // ensures

    // Literals
    Integer(i64),        // 42, 1_000
    Float(f64),          // 3.14
    String(String),      // "hello"
    Identifier(String), // myVar, add

    // Operators
    Plus,     // +
    Minus,    // -
    Star,     // *
    Slash,    // /
    Eq,       // ==
    NotEq,    // !=
    Lt,       // <
    Gt,       // >
    Assign,   // =

    // Punctuation
    LParen,   // (
    RParen,   // )
    LBrace,   // {
    RBrace,   // }
    Colon,     // :
    Comma,     // ,
    Semicolon, // ;
    Arrow,    // ->
}

And here's how the lexer matches characters to tokens:

impl Lexer {
    fn next_token(&mut self) -> Token {
        // Bring the TokenKind variants (LParen, Plus, ...) into scope
        use TokenKind::*;
        // Skip whitespace
        self.skip_whitespace();

        // Check what character we're looking at
        match self.current_char() {
            // Single character tokens
            '(' => self.make_token(LParen),
            ')' => self.make_token(RParen),
            '{' => self.make_token(LBrace),
            '}' => self.make_token(RBrace),
            ':' => self.make_token(Colon),
            ',' => self.make_token(Comma),
            '+' => self.make_token(Plus),
            '*' => self.make_token(Star),
            '/' => self.make_token(Slash),

            // Two-character tokens (need lookahead)
            '-' => {
                if self.peek() == '>' {
                    self.advance();  // consume the '>'
                    self.make_token(Arrow)
                } else {
                    self.make_token(Minus)
                }
            }

            // Strings
            '"' => self.read_string(),

            // Numbers
            c if c.is_ascii_digit() => self.read_number(),

            // Identifiers and keywords
            c if c.is_alphabetic() || c == '_' => self.read_identifier(),

            // Unknown character = error!
            c => self.error("Unexpected character", c),
        }
    }
}

Key Insight: Lookahead

Notice how - needs to check the next character. Is it -> (arrow) or just - (minus)? This is called lookahead — sometimes you need to peek at the next character to decide what token you're building.
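The read_identifier call in the match above does double duty: after reading the word, it must decide whether that word is a reserved keyword or an ordinary name. Here is a minimal standalone sketch of that logic (a free function with a trimmed-down TokenKind for illustration; the real version would be a method on Lexer):

```rust
#[derive(Debug, PartialEq)]
enum TokenKind {
    Fn,
    Let,
    If,
    Identifier(String),
}

/// Read an identifier starting at byte `start`, then check whether
/// the word is a keyword. Returns the token kind and the byte offset
/// just past the word (ASCII-only, for simplicity).
fn read_identifier(src: &str, start: usize) -> (TokenKind, usize) {
    let bytes = src.as_bytes();
    let mut end = start;
    while end < bytes.len()
        && (bytes[end].is_ascii_alphanumeric() || bytes[end] == b'_')
    {
        end += 1;
    }
    // Keyword check: an identifier that matches a reserved word
    // becomes a keyword token instead.
    let kind = match &src[start..end] {
        "fn" => TokenKind::Fn,
        "let" => TokenKind::Let,
        "if" => TokenKind::If,
        word => TokenKind::Identifier(word.to_string()),
    };
    (kind, end)
}
```

This "read first, classify after" approach is simpler than trying to match keywords character by character, and it's how most hand-written lexers handle the keyword/identifier overlap.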

Tracking Position: Spans

Tokens aren't just about what you found — they're also about where you found it. When something goes wrong, we need to tell the user exactly where the error is.

Every token includes a span: its position in the source code.

pub struct Span {
    pub start: usize,  // byte offset where token starts
    pub end: usize,    // byte offset where token ends
}

pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}

When we report an error like "undefined variable foo at line 5, column 12", we use the span to:

  1. Calculate line/column from the byte offset
  2. Highlight the exact location in the source
  3. Show context around the error

// Error reporting with spans
error[E0001]: undefined variable
  --> main.nova:5:12
   |
5  |     let x = foo + 1;
   |             ^^^ not found in this scope
   |

Without spans, we couldn't show those pretty error messages!
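The line/column calculation in step 1 can be sketched as a single scan over the source up to the byte offset (assuming the usual 1-based convention for lines and columns):

```rust
/// Sketch: derive a (line, column) pair, both 1-based, from a byte
/// offset by counting newlines in the source up to that offset.
fn line_col(src: &str, offset: usize) -> (usize, usize) {
    let mut line = 1;
    let mut col = 1;
    for (i, ch) in src.char_indices() {
        if i >= offset {
            break;
        }
        if ch == '\n' {
            line += 1;
            col = 1; // reset column at each newline
        } else {
            col += 1;
        }
    }
    (line, col)
}
```

A real compiler usually precomputes a table of line-start offsets once and binary-searches it, rather than rescanning the file for every diagnostic, but the idea is the same.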

Edge Cases and Gotchas

Lexing seems simple, but there are tricky cases:

Numbers with Underscores

Nova allows 1_000_000 for readability. The lexer must recognize this as a single number, not multiple tokens.

fn read_number(&mut self) -> Token {
    let mut value = String::new();

    while self.current_char().is_ascii_digit()
          || self.current_char() == '_'
    {
        if self.current_char() != '_' {
            value.push(self.current_char());
        }
        self.advance();
    }

    // value now holds e.g. "1000000" (underscores stripped)
    self.make_token(TokenKind::Integer(value.parse().unwrap()))
}

String Escape Sequences

Inside strings, \n means newline, \" means a literal quote. The lexer must handle these.

"Hello\nWorld"   // Contains a newline character
"Say \"hi\""     // Contains literal quotes

Comments

Comments are stripped out by the lexer. They never become tokens — they're just ignored.

fn add(a: Int, b: Int) -> Int {
    // This is a comment - the lexer skips it entirely
    a + b
}

Common Mistake

Beginners often try to handle comments in the parser. Don't! Comments should be stripped by the lexer so the parser never sees them. This keeps the parser simpler.

Using Logos for Lexing

Writing a lexer by hand is educational, but production compilers often use libraries. Nova uses Logos, a fast Rust lexer generator:

use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\r]+")] // Skip whitespace!
#[logos(skip r"//[^\n]*")]   // Skip comments!
pub enum Token {
    #[token("fn")]
    Fn,

    #[token("let")]
    Let,

    #[token("(")]
    LParen,

    #[token("->")]
    Arrow,

    #[regex(r"[a-zA-Z_][a-zA-Z0-9_]*")]
    Identifier,

    #[regex(r"[0-9][0-9_]*")]
    Integer,

    // Note: this simple pattern doesn't handle \" escapes
    #[regex(r#""[^"]*""#)]
    String,
}

With Logos, you define each token declaratively with an attribute, and Logos generates a fast, optimized lexer at compile time. No runtime overhead!

Putting It All Together

Let's trace through a complete example:

let x = 42;

The lexer processes this character by character:

| Position | Characters | Action                                        | Token           |
| -------- | ---------- | --------------------------------------------- | --------------- |
| 0-2      | let        | Identifier starting with 'l', matches keyword | Let             |
| 3        | (space)    | Whitespace, skip                              |                 |
| 4        | x          | Identifier (not a keyword)                    | Identifier("x") |
| 5        | (space)    | Whitespace, skip                              |                 |
| 6        | =          | Single-character operator                     | Assign          |
| 7        | (space)    | Whitespace, skip                              |                 |
| 8-9      | 42         | Digit, read number                            | Integer(42)     |
| 10       | ;          | Semicolon punctuation                         | Semicolon       |

Final token stream:

let x = 42 ;

This token stream is now ready for the parser!
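The whole trace can be reproduced with a tiny self-contained lexer. This is an illustrative sketch, not Nova's actual lexer: the Tok enum is trimmed to just what this statement needs, and error handling is a plain panic.

```rust
#[derive(Debug, PartialEq)]
enum Tok {
    Let,
    Ident(String),
    Assign,
    Int(i64),
    Semicolon,
}

/// Minimal sketch: tokenize a statement like `let x = 42;`.
fn lex(src: &str) -> Vec<Tok> {
    let bytes = src.as_bytes();
    let mut toks = Vec::new();
    let mut i = 0;
    while i < bytes.len() {
        let c = bytes[i] as char;
        if c.is_whitespace() {
            i += 1; // Step 1: skip whitespace
        } else if c.is_ascii_alphabetic() || c == '_' {
            // Step 3: read the full word, then classify it
            let start = i;
            while i < bytes.len()
                && ((bytes[i] as char).is_ascii_alphanumeric() || bytes[i] == b'_')
            {
                i += 1;
            }
            let word = &src[start..i];
            toks.push(if word == "let" {
                Tok::Let
            } else {
                Tok::Ident(word.to_string())
            });
        } else if c.is_ascii_digit() {
            let start = i;
            while i < bytes.len() && (bytes[i] as char).is_ascii_digit() {
                i += 1;
            }
            toks.push(Tok::Int(src[start..i].parse().unwrap()));
        } else if c == '=' {
            toks.push(Tok::Assign);
            i += 1;
        } else if c == ';' {
            toks.push(Tok::Semicolon);
            i += 1;
        } else {
            panic!("unexpected character {c:?}");
        }
    }
    toks
}
```

Running lex("let x = 42;") yields the five tokens from the table above, in order.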

Key Takeaway

The lexer transforms a stream of characters into a stream of tokens. Each token has a type (what it is), optionally a value (its content), and a span (where it is). This structured data is much easier for the next stage — the parser — to work with.

Why This Matters for Nova

Lexing might seem like a solved problem, but how a language handles this first stage affects everything downstream — especially error messages.

The Problem

Many languages lose precise position information during lexing. When an error occurs later, the compiler can only give vague locations like "somewhere on line 42." Debugging becomes guesswork.

How Nova Solves It

Nova's lexer preserves byte-accurate spans for every token. When verification fails, we can point to the exact character that caused the problem — not just the line, but the precise column and length.

Nova also tokenizes special keywords that other languages don't have, such as requires and ensures (you saw them in the TokenKind enum earlier).

These keywords are the foundation of Nova's verification system — they let you write contracts that the compiler can mathematically prove.

What's Next?

Now you understand how code becomes tokens. But tokens are still a flat list — we don't yet know that fn add(...) { ... } is a function definition.

In the next lesson, we'll learn about parsers and ASTs — how we build a tree structure that represents the meaning of the code.