Architecture

Lexer Architecture

How Nova tokenizes source code into a stream of tokens, and the libraries we evaluated.

Library Comparison

We evaluated three main approaches for lexing in Rust:

Library Type Speed Ergonomics Best For
Logos Derive macro Fastest Excellent Most projects
Hand-rolled Manual ~20% faster* Low-level Maximum control
nom Combinator Good Flexible Binary formats

*When aggressively inlined and unrolled. See Beating the fastest lexer generator.

Option 1: Logos (Recommended)

Logos is a lexer generator that compiles token definitions into a single deterministic state machine at compile time. It's ridiculously fast and easy to use.

Pros

Compile-time codegen, zero-copy, excellent error messages, derive macro simplicity

Cons

Less control over exact behavior, some edge cases require callbacks

Example: Nova Token Definition with Logos

use logos::Logos;

#[derive(Logos, Debug, Clone, PartialEq)]
#[logos(skip r"[ \t\r\n]+")]  // Skip whitespace
pub enum Token<'a> {
    // Keywords
    #[token("fn")]
    Fn,
    #[token("let")]
    Let,
    #[token("if")]
    If,
    #[token("else")]
    Else,
    #[token("while")]
    While,
    #[token("return")]
    Return,
    #[token("true")]
    True,
    #[token("false")]
    False,

    // Literals
    #[regex(r"[0-9]+", |lex| lex.slice().parse().ok())]
    IntLit(i64),

    #[regex(r"[0-9]+\.[0-9]+", |lex| lex.slice().parse().ok())]
    FloatLit(f64),

    #[regex(r#""([^"\\]|\\.)*""#, |lex| {
        let s = lex.slice();
        Some(&s[1..s.len()-1])  // Strip quotes
    })]
    StringLit(&'a str),

    // Identifiers
    #[regex(r"[a-zA-Z_][a-zA-Z0-9_]*")]
    Ident(&'a str),

    // Operators
    #[token("+")]
    Plus,
    #[token("-")]
    Minus,
    #[token("*")]
    Star,
    #[token("/")]
    Slash,
    #[token("==")]
    EqEq,
    #[token("!=")]
    NotEq,
    #[token("=")]
    Eq,
    #[token("->")]
    Arrow,

    // Delimiters
    #[token("(")]
    LParen,
    #[token(")")]
    RParen,
    #[token("{")]
    LBrace,
    #[token("}")]
    RBrace,
    #[token(";")]
    Semi,
    #[token(":")]
    Colon,
    #[token(",")]
    Comma,
}

// Usage
fn lex(source: &str) -> Vec<Token> {
    Token::lexer(source)
        .filter_map(|r| r.ok())
        .collect()
}

Cargo.toml

[dependencies]
logos = "0.14"

Option 2: Hand-Rolled Lexer

For maximum performance and control, a hand-written lexer can outperform Logos by ~20%. This is what Nova currently uses.

Pros

Maximum performance, full control, better for learning, custom error recovery

Cons

More code to write and maintain, easier to introduce bugs

Example: Hand-Rolled Lexer Core

pub struct Lexer<'a> {
    source: &'a str,
    chars: std::iter::Peekable<std::str::CharIndices<'a>>,
    start: usize,
    current: usize,
}

impl<'a> Lexer<'a> {
    pub fn new(source: &'a str) -> Self {
        Self {
            source,
            chars: source.char_indices().peekable(),
            start: 0,
            current: 0,
        }
    }

    fn advance(&mut self) -> Option<char> {
        self.chars.next().map(|(i, c)| {
            self.current = i + c.len_utf8();
            c
        })
    }

    fn peek(&mut self) -> Option<char> {
        self.chars.peek().map(|(_, c)| *c)
    }

    fn span(&self) -> Span {
        Span::new(self.start as u32, self.current as u32)
    }

    fn lex_token(&mut self) -> Option<Result<Token, LexError>> {
        self.skip_whitespace();
        self.start = self.current;

        let c = self.advance()?;

        let kind = match c {
            // Single-char tokens
            '(' => TokenKind::LParen,
            ')' => TokenKind::RParen,
            '{' => TokenKind::LBrace,
            '}' => TokenKind::RBrace,
            ';' => TokenKind::Semi,
            ',' => TokenKind::Comma,
            ':' => TokenKind::Colon,
            '+' => TokenKind::Plus,
            '*' => TokenKind::Star,

            // Two-char tokens
            '-' => {
                if self.peek() == Some('>') {
                    self.advance();
                    TokenKind::Arrow
                } else {
                    TokenKind::Minus
                }
            }
            '=' => {
                if self.peek() == Some('=') {
                    self.advance();
                    TokenKind::EqEq
                } else {
                    TokenKind::Eq
                }
            }

            // Numbers
            '0'..='9' => return Some(self.lex_number()),

            // Strings
            '"' => return Some(self.lex_string()),

            // Identifiers and keywords
            'a'..='z' | 'A'..='Z' | '_' => return Some(Ok(self.lex_ident())),

            _ => return Some(Err(LexError::InvalidChar(c, self.span()))),
        };

        Some(Ok(Token::new(kind, self.span())))
    }

    fn lex_ident(&mut self) -> Token {
        while let Some(c) = self.peek() {
            if c.is_alphanumeric() || c == '_' {
                self.advance();
            } else {
                break;
            }
        }

        let text = &self.source[self.start..self.current];
        let kind = match text {
            "fn" => TokenKind::Fn,
            "let" => TokenKind::Let,
            "if" => TokenKind::If,
            "else" => TokenKind::Else,
            "while" => TokenKind::While,
            "return" => TokenKind::Return,
            "true" => TokenKind::True,
            "false" => TokenKind::False,
            _ => TokenKind::Ident,
        };

        Token::new(kind, self.span())
    }
}

Nova's Decision: Hand-Rolled

Nova uses a hand-rolled lexer for several reasons:

Security Considerations

Nova's lexer includes security limits to prevent attacks:

// Security constants
const MAX_SOURCE_SIZE: usize = 4 * 1024 * 1024 * 1024;  // 4GB
const MAX_NESTING_DEPTH: usize = 256;  // Block comments

pub fn lex(source: &str) -> Result<Vec<Token>, NovaError> {
    // Prevent DoS via huge files
    if source.len() > MAX_SOURCE_SIZE {
        return Err(NovaError::SourceTooLarge { ... });
    }
    // ... lexing logic
}

Performance Benchmarks

Approach 10KB file 1MB file Notes
Logos ~0.1ms ~8ms Excellent baseline
Hand-rolled (Nova) ~0.08ms ~6.5ms ~20% faster with inlining
nom ~0.15ms ~12ms More flexible, less optimized

References