How Nova tokenizes source code into a stream of tokens, and the libraries we evaluated.
We evaluated three main approaches for lexing in Rust:
| Library | Type | Speed | Ergonomics | Best For |
|---|---|---|---|---|
| Logos | Derive macro | Fastest | Excellent | Most projects |
| Hand-rolled | Manual | ~20% faster* | Low-level | Maximum control |
| nom | Combinator | Good | Flexible | Binary formats |
\*When aggressively inlined and unrolled. See "Beating the fastest lexer generator".
Logos is a lexer generator whose derive macro compiles your token definitions into a single deterministic state machine at compile time. It's ridiculously fast and easy to use.
Pros: compile-time codegen, zero-copy tokens, excellent error messages, derive-macro simplicity.
Cons: less control over exact behavior; some edge cases require callbacks.
```rust
use logos::Logos;

#[derive(Logos, Debug, Clone, PartialEq)]
#[logos(skip r"[ \t\r\n]+")] // Skip whitespace
pub enum Token<'a> {
    // Keywords
    #[token("fn")]
    Fn,
    #[token("let")]
    Let,
    #[token("if")]
    If,
    #[token("else")]
    Else,
    #[token("while")]
    While,
    #[token("return")]
    Return,
    #[token("true")]
    True,
    #[token("false")]
    False,

    // Literals
    #[regex(r"[0-9]+", |lex| lex.slice().parse().ok())]
    IntLit(i64),
    #[regex(r"[0-9]+\.[0-9]+", |lex| lex.slice().parse().ok())]
    FloatLit(f64),
    #[regex(r#""([^"\\]|\\.)*""#, |lex| {
        let s = lex.slice();
        Some(&s[1..s.len() - 1]) // Strip quotes
    })]
    StringLit(&'a str),

    // Identifiers
    #[regex(r"[a-zA-Z_][a-zA-Z0-9_]*")]
    Ident(&'a str),

    // Operators
    #[token("+")]
    Plus,
    #[token("-")]
    Minus,
    #[token("*")]
    Star,
    #[token("/")]
    Slash,
    #[token("==")]
    EqEq,
    #[token("!=")]
    NotEq,
    #[token("=")]
    Eq,
    #[token("->")]
    Arrow,

    // Delimiters
    #[token("(")]
    LParen,
    #[token(")")]
    RParen,
    #[token("{")]
    LBrace,
    #[token("}")]
    RBrace,
    #[token(";")]
    Semi,
    #[token(":")]
    Colon,
    #[token(",")]
    Comma,
}

// Usage
fn lex(source: &str) -> Vec<Token<'_>> {
    Token::lexer(source)
        .filter_map(|r| r.ok())
        .collect()
}
```
Add the dependency to `Cargo.toml`:

```toml
[dependencies]
logos = "0.14"
```
For maximum performance and control, a hand-written lexer can outperform Logos by ~20%. This is what Nova currently uses.
Pros: maximum performance, full control, better for learning, custom error recovery.
Cons: more code to write and maintain; easier to introduce bugs.
```rust
pub struct Lexer<'a> {
    source: &'a str,
    chars: std::iter::Peekable<std::str::CharIndices<'a>>,
    start: usize,
    current: usize,
}

impl<'a> Lexer<'a> {
    pub fn new(source: &'a str) -> Self {
        Self {
            source,
            chars: source.char_indices().peekable(),
            start: 0,
            current: 0,
        }
    }

    fn advance(&mut self) -> Option<char> {
        self.chars.next().map(|(i, c)| {
            self.current = i + c.len_utf8();
            c
        })
    }

    fn peek(&mut self) -> Option<char> {
        self.chars.peek().map(|(_, c)| *c)
    }

    fn span(&self) -> Span {
        Span::new(self.start as u32, self.current as u32)
    }

    fn lex_token(&mut self) -> Option<Result<Token, LexError>> {
        self.skip_whitespace();
        self.start = self.current;
        let c = self.advance()?;
        let kind = match c {
            // Single-char tokens
            '(' => TokenKind::LParen,
            ')' => TokenKind::RParen,
            '{' => TokenKind::LBrace,
            '}' => TokenKind::RBrace,
            ';' => TokenKind::Semi,
            ',' => TokenKind::Comma,
            ':' => TokenKind::Colon,
            '+' => TokenKind::Plus,
            '*' => TokenKind::Star,
            '/' => TokenKind::Slash, // comment handling elided here
            // Two-char tokens
            '-' => {
                if self.peek() == Some('>') {
                    self.advance();
                    TokenKind::Arrow
                } else {
                    TokenKind::Minus
                }
            }
            '=' => {
                if self.peek() == Some('=') {
                    self.advance();
                    TokenKind::EqEq
                } else {
                    TokenKind::Eq
                }
            }
            '!' => {
                if self.peek() == Some('=') {
                    self.advance();
                    TokenKind::NotEq
                } else {
                    return Some(Err(LexError::InvalidChar(c, self.span())));
                }
            }
            // Numbers
            '0'..='9' => return Some(self.lex_number()),
            // Strings
            '"' => return Some(self.lex_string()),
            // Identifiers and keywords
            'a'..='z' | 'A'..='Z' | '_' => return Some(Ok(self.lex_ident())),
            _ => return Some(Err(LexError::InvalidChar(c, self.span()))),
        };
        Some(Ok(Token::new(kind, self.span())))
    }

    fn lex_ident(&mut self) -> Token {
        while let Some(c) = self.peek() {
            if c.is_alphanumeric() || c == '_' {
                self.advance();
            } else {
                break;
            }
        }
        let text = &self.source[self.start..self.current];
        let kind = match text {
            "fn" => TokenKind::Fn,
            "let" => TokenKind::Let,
            "if" => TokenKind::If,
            "else" => TokenKind::Else,
            "while" => TokenKind::While,
            "return" => TokenKind::Return,
            "true" => TokenKind::True,
            "false" => TokenKind::False,
            _ => TokenKind::Ident,
        };
        Token::new(kind, self.span())
    }
}
```
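The listing above elides the remaining helpers (`skip_whitespace`, `lex_number`, `lex_string`) along with the `Token`, `TokenKind`, `Span`, and `LexError` types. As a rough illustration of what the first two do, here is a minimal standalone sketch: these free functions take the iterator and cursor explicitly instead of `&mut self`, and are not Nova's actual implementation.

```rust
use std::iter::Peekable;
use std::str::CharIndices;

/// Consume whitespace, advancing `current` past each skipped char.
fn skip_whitespace(chars: &mut Peekable<CharIndices>, current: &mut usize) {
    while let Some(&(i, c)) = chars.peek() {
        if c.is_ascii_whitespace() {
            chars.next();
            *current = i + c.len_utf8();
        } else {
            break;
        }
    }
}

/// Consume the remaining digits of an integer literal whose first digit
/// has already been consumed (as `lex_token` does via `advance`), then
/// parse the full `start..current` slice.
fn lex_number(
    source: &str,
    chars: &mut Peekable<CharIndices>,
    start: usize,
    current: &mut usize,
) -> i64 {
    while let Some(&(i, c)) = chars.peek() {
        if c.is_ascii_digit() {
            chars.next();
            *current = i + c.len_utf8();
        } else {
            break;
        }
    }
    source[start..*current].parse().unwrap()
}

fn main() {
    let source = "  42+";
    let mut chars = source.char_indices().peekable();
    let mut current = 0;
    skip_whitespace(&mut chars, &mut current);
    let start = current; // first digit of the literal
    let (i, c) = chars.next().unwrap(); // consume the first digit
    current = i + c.len_utf8();
    assert_eq!(lex_number(source, &mut chars, start, &mut current), 42);
    assert_eq!(&source[current..], "+"); // cursor sits on the next token
}
```

Tracking `current` as a byte offset (via `len_utf8`) is what makes the `&self.source[self.start..self.current]` slices above valid on multi-byte UTF-8 input.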
Nova uses a hand-rolled lexer for the reasons above: maximum performance, full control over error recovery, and its value as a learning exercise.
Nova's lexer includes security limits to prevent attacks:
```rust
// Security constants
const MAX_SOURCE_SIZE: usize = 4 * 1024 * 1024 * 1024; // 4 GiB
const MAX_NESTING_DEPTH: usize = 256; // Block comments

pub fn lex(source: &str) -> Result<Vec<Token>, NovaError> {
    // Prevent DoS via huge files
    if source.len() > MAX_SOURCE_SIZE {
        return Err(NovaError::SourceTooLarge { ... });
    }
    // ... lexing logic
}
```
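`MAX_NESTING_DEPTH` caps nested block comments, so a pathological input like `/*/*/*…` can't drive the depth counter (or a recursive skip function) arbitrarily deep. A minimal sketch of how such a cap might be enforced; the error representation and comment syntax are assumptions, not Nova's actual code:

```rust
const MAX_NESTING_DEPTH: usize = 256;

/// Skip a nestable `/* ... */` block comment. `pos` points just past
/// the opening `/*`; returns the byte offset just past the closing
/// `*/`, or an error if the cap is hit or the comment never closes.
fn skip_block_comment(source: &str, mut pos: usize) -> Result<usize, &'static str> {
    let bytes = source.as_bytes();
    let mut depth = 1usize; // the opening `/*` is already consumed
    while pos < bytes.len() {
        if bytes[pos..].starts_with(b"/*") {
            depth += 1;
            if depth > MAX_NESTING_DEPTH {
                return Err("block comments nested too deeply");
            }
            pos += 2;
        } else if bytes[pos..].starts_with(b"*/") {
            depth -= 1;
            pos += 2;
            if depth == 0 {
                return Ok(pos);
            }
        } else {
            pos += 1; // byte-wise scan is safe: we only match ASCII pairs
        }
    }
    Err("unterminated block comment")
}

fn main() {
    let src = "/* a /* b */ */x";
    assert_eq!(skip_block_comment(src, 2), Ok(15));
    assert_eq!(&src[15..], "x");
    assert!(skip_block_comment("/* never closed", 2).is_err());
}
```

An iterative depth counter like this keeps the worst case at O(n) time and O(1) space, which is the whole point of the limit: the alternative recursive formulation would let input depth translate into stack depth.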
| Approach | 10KB file | 1MB file | Notes |
|---|---|---|---|
| Logos | ~0.1ms | ~8ms | Excellent baseline |
| Hand-rolled (Nova) | ~0.08ms | ~6.5ms | ~20% faster with inlining |
| nom | ~0.15ms | ~12ms | More flexible, less optimized |
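Numbers like these come from micro-benchmarks and are easy to sanity-check with a crude harness. A minimal sketch using only `std::time::Instant`; the `lex` function here is a stand-in token counter, not any of the lexers above, and serious measurements should use a framework such as criterion:

```rust
use std::time::{Duration, Instant};

/// Stand-in for whichever lexer is under test: counts
/// whitespace-separated tokens.
fn lex(source: &str) -> usize {
    source.split_whitespace().count()
}

/// Run `lex` `iters` times and report the best (minimum) duration,
/// which is less noisy than the mean for micro-benchmarks.
fn bench(source: &str, iters: u32) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..iters {
        let t = Instant::now();
        let n = lex(source);
        let dt = t.elapsed();
        std::hint::black_box(n); // keep the result from being optimized away
        if dt < best {
            best = dt;
        }
    }
    best
}

fn main() {
    let source = "let x = 1 ; ".repeat(1000);
    assert_eq!(lex(&source), 5000);
    let best = bench(&source, 100);
    println!("best of 100 runs: {:?}", best);
}
```

Taking the minimum over many runs filters out scheduler and cache noise; `black_box` stops the compiler from deleting the unobserved `lex` call entirely.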