Lexers & Tokens
How compilers break your code into bite-sized pieces they can understand.
The First Step: Breaking Things Apart
When you look at code, you see words, numbers, and symbols organized in a way that makes sense to you. But to a computer, your source code is just one long string of characters:
fn add(a: Int, b: Int) -> Int { a + b }
That's 39 characters. The computer sees: f, n, (space), a, d, d... one at a time. It has no idea that fn is a keyword or that add is a function name.
The lexer (also called a tokenizer or scanner) is the first stage of the compiler. Its job is simple but crucial: break the raw text into meaningful chunks called tokens.
Think of It Like Reading
When you read this sentence, you don't process it letter by letter. Your brain automatically groups letters into words, recognizes punctuation, and understands spacing. A lexer does the same thing for code — it groups characters into "words" (tokens) that the rest of the compiler can work with.
What Tokens Look Like
Let's see what happens when the lexer processes our example:
Source Code
fn add(a: Int, b: Int) -> Int { a + b }
Token Stream
Here's the full token stream:

`fn` `add` `(` `a` `:` `Int` `,` `b` `:` `Int` `)` `->` `Int` `{` `a` `+` `b` `}`

Each item is a token. Notice how:
- fn is recognized as a keyword (reserved word in the language)
- add, a, b, Int are identifiers (names for things)
- ( ) { } are punctuation
- -> is an arrow (two characters that form one token!)
- + is an operator
Token Types in Nova
Every token has a type (what kind of thing it is) and sometimes a value (the actual content). Here are Nova's main token types:
| Token Type | Examples | What It Represents |
|---|---|---|
| Keyword | `fn`, `let`, `if`, `else`, `return` | Reserved words with special meaning |
| Identifier | `add`, `myVariable`, `User` | Names for functions, variables, types |
| Integer | `42`, `0`, `1_000_000` | Whole numbers |
| Float | `3.14`, `0.5`, `1e10` | Decimal numbers |
| String | `"hello"`, `"world"` | Text enclosed in quotes |
| Operator | `+`, `-`, `*`, `/`, `==`, `!=` | Mathematical and comparison operations |
| Punctuation | `(`, `)`, `{`, `}`, `:`, `,` | Structure and grouping symbols |
How the Lexer Works
The lexer reads the source code character by character, using patterns to recognize tokens. Here's the algorithm in plain English:
Skip Whitespace
Spaces, tabs, and newlines don't mean anything in most contexts. Skip them. (But remember their positions for error messages!)
Look at the Current Character
What is it? A letter? A digit? A symbol? This tells us what kind of token we're starting.
Read the Full Token
If it's a letter, keep reading until we hit something that's not a letter or digit — that's an identifier. If it's a digit, read the whole number. If it's a quote, read until the closing quote.
Emit the Token
Package up what we found (type + value + position) and move to the next character. Repeat until we hit the end of the file.
Real Rust Code
Nova's lexer is written in Rust. Here's a simplified version showing how we define tokens:
// Token types in Nova
pub enum TokenKind {
    // Keywords
    Fn,       // fn
    Let,      // let
    If,       // if
    Else,     // else
    Return,   // return
    Where,    // where
    Requires, // requires
    Ensures,  // ensures

    // Literals
    Integer(i64),       // 42, 1_000
    Float(f64),         // 3.14
    String(String),     // "hello"
    Identifier(String), // myVar, add

    // Operators
    Plus,   // +
    Minus,  // -
    Star,   // *
    Slash,  // /
    Assign, // =
    Eq,     // ==
    NotEq,  // !=
    Lt,     // <
    Gt,     // >

    // Punctuation
    LParen,    // (
    RParen,    // )
    LBrace,    // {
    RBrace,    // }
    Colon,     // :
    Comma,     // ,
    Semicolon, // ;
    Arrow,     // ->
}
And here's how the lexer matches characters to tokens:
impl Lexer {
    fn next_token(&mut self) -> Token {
        // Skip whitespace
        self.skip_whitespace();

        // Check what character we're looking at
        match self.current_char() {
            // Single-character tokens
            '(' => self.make_token(LParen),
            ')' => self.make_token(RParen),
            '{' => self.make_token(LBrace),
            '}' => self.make_token(RBrace),
            ':' => self.make_token(Colon),
            ',' => self.make_token(Comma),
            '+' => self.make_token(Plus),
            '*' => self.make_token(Star),
            '/' => self.make_token(Slash),

            // Two-character tokens (need lookahead)
            '-' => {
                if self.peek() == '>' {
                    self.advance(); // consume the '>'
                    self.make_token(Arrow)
                } else {
                    self.make_token(Minus)
                }
            }

            // Strings
            '"' => self.read_string(),

            // Numbers
            c if c.is_ascii_digit() => self.read_number(),

            // Identifiers and keywords
            c if c.is_alphabetic() || c == '_' => self.read_identifier(),

            // Unknown character = error!
            c => self.error("Unexpected character", c),
        }
    }
}
Key Insight: Lookahead
Notice how - needs to check the next character. Is it -> (arrow) or just - (minus)? This is called lookahead — sometimes you need to peek at the next character to decide what token you're building.
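The same idea applies to operators like `!=` and `==`, which the simplified `next_token` above omits. Here's a minimal, self-contained illustration of one-character lookahead (a hypothetical `two_char_lookahead` helper for demonstration, not Nova's actual API):

```rust
// Classify the token starting at byte `i` of `src`, using one character
// of lookahead to distinguish two-character tokens from their one-character
// prefixes. Returns the token name and how many bytes it consumed.
fn two_char_lookahead(src: &str, i: usize) -> (&'static str, usize) {
    let bytes = src.as_bytes();
    match bytes[i] {
        b'!' if bytes.get(i + 1) == Some(&b'=') => ("NotEq", 2),
        b'=' if bytes.get(i + 1) == Some(&b'=') => ("Eq", 2),
        b'=' => ("Assign", 1),
        b'-' if bytes.get(i + 1) == Some(&b'>') => ("Arrow", 2),
        b'-' => ("Minus", 1),
        _ => ("Unknown", 1),
    }
}

fn main() {
    assert_eq!(two_char_lookahead("a != b", 2), ("NotEq", 2));
    assert_eq!(two_char_lookahead("-> Int", 0), ("Arrow", 2));
    assert_eq!(two_char_lookahead("x = 1", 2), ("Assign", 1));
    println!("lookahead examples ok");
}
```

The guard order matters: the two-character cases must be tried before their one-character fallbacks.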
Tracking Position: Spans
Tokens aren't just about what you found — they're also about where you found it. When something goes wrong, we need to tell the user exactly where the error is.
Every token includes a span: its position in the source code.
pub struct Span {
    pub start: usize, // byte offset where token starts
    pub end: usize,   // byte offset where token ends
}

pub struct Token {
    pub kind: TokenKind,
    pub span: Span,
}
When we report an error like "undefined variable foo at line 5, column 12", we use the span to:
- Calculate line/column from the byte offset
- Highlight the exact location in the source
- Show context around the error
// Error reporting with spans
error[E0001]: undefined variable
  --> main.nova:5:12
   |
 5 |    let x = foo + 1;
   |            ^^^ not found in this scope
   |
Without spans, we couldn't show those pretty error messages!
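The first of those steps, recovering a line and column from a byte offset, can be sketched in a few lines (a hypothetical `line_col` helper for illustration, not Nova's actual API):

```rust
// Convert a byte offset into a 1-based (line, column) pair by counting
// newlines before the offset. Assumes ASCII source for simplicity; a real
// compiler would also handle multi-byte characters and tab widths.
fn line_col(src: &str, offset: usize) -> (usize, usize) {
    let before = &src[..offset];
    let line = before.matches('\n').count() + 1;
    // Column = distance from the start of the current line, plus one.
    let line_start = before.rfind('\n').map_or(0, |i| i + 1);
    let col = offset - line_start + 1;
    (line, col)
}

fn main() {
    let src = "let a = 1;\nlet x = foo + 1;\n";
    // "foo" starts at byte 19, which is line 2, column 9.
    assert_eq!(line_col(src, 19), (2, 9));
    println!("{:?}", line_col(src, 19));
}
```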
Edge Cases and Gotchas
Lexing seems simple, but there are tricky cases:
Numbers with Underscores
Nova allows 1_000_000 for readability. The lexer must recognize this as a single number, not multiple tokens.
fn read_number(&mut self) -> Token {
    let mut value = String::new();
    while self.current_char().is_ascii_digit()
        || self.current_char() == '_'
    {
        if self.current_char() != '_' {
            value.push(self.current_char()); // keep digits, drop underscores
        }
        self.advance();
    }
    // Parse "1000000" (underscores stripped)
    self.make_token(TokenKind::Integer(value.parse().unwrap()))
}
String Escape Sequences
Inside strings, \n means newline, \" means a literal quote. The lexer must handle these.
"Hello\nWorld" // Contains a newline character
"Say \"hi\"" // Contains literal quotes
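A sketch of how such escapes might be decoded while scanning a string literal (a standalone `read_string` function, simplified from what Nova's real lexer would need — no spans, minimal error handling):

```rust
// Scan a string literal starting at the opening quote, decoding \n, \",
// and \\ escapes. Returns None for unterminated strings or unknown escapes.
fn read_string(src: &str) -> Option<String> {
    let mut chars = src.chars();
    if chars.next() != Some('"') {
        return None; // must start at an opening quote
    }
    let mut out = String::new();
    while let Some(c) = chars.next() {
        match c {
            '"' => return Some(out), // closing quote: done
            '\\' => match chars.next()? {
                'n' => out.push('\n'),
                '"' => out.push('"'),
                '\\' => out.push('\\'),
                _ => return None, // unknown escape sequence
            },
            other => out.push(other),
        }
    }
    None // ran out of input: unterminated string
}

fn main() {
    assert_eq!(read_string(r#""Hello\nWorld""#).unwrap(), "Hello\nWorld");
    assert_eq!(read_string(r#""Say \"hi\"""#).unwrap(), "Say \"hi\"");
    assert_eq!(read_string(r#""oops"#), None);
    println!("escape handling ok");
}
```

Note that the *source* contains two characters, backslash and `n`, but the token's value contains a single newline character: the lexer does the translation.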
Comments
Comments are stripped out by the lexer. They never become tokens — they're just ignored.
fn add(a: Int, b: Int) -> Int {
    // This is a comment - the lexer skips it entirely
    a + b
}
Common Mistake
Beginners often try to handle comments in the parser. Don't! Comments should be stripped by the lexer so the parser never sees them. This keeps the parser simpler.
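One way to do this is to fold comment-skipping into the same "skip trivia" step that discards whitespace. A minimal sketch (hypothetical `skip_trivia` helper, not Nova's code):

```rust
// Advance past whitespace and // line comments, returning the byte index
// of the next real token. Loops because comments and whitespace can
// alternate: "  // a\n  // b\n  x".
fn skip_trivia(src: &str, mut i: usize) -> usize {
    let b = src.as_bytes();
    loop {
        while i < b.len() && (b[i] as char).is_whitespace() {
            i += 1; // skip spaces, tabs, newlines
        }
        if i + 1 < b.len() && b[i] == b'/' && b[i + 1] == b'/' {
            while i < b.len() && b[i] != b'\n' {
                i += 1; // consume the comment up to end of line
            }
        } else {
            return i;
        }
    }
}

fn main() {
    let src = "  // comment\n  a + b";
    let i = skip_trivia(src, 0);
    assert_eq!(&src[i..i + 1], "a");
    println!("next token starts at byte {}", i);
}
```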
Using Logos for Lexing
Writing a lexer by hand is educational, but production compilers often use libraries. Nova uses Logos, a fast Rust lexer generator:
use logos::Logos;

#[derive(Logos, Debug, PartialEq)]
#[logos(skip r"[ \t\n\r]+")] // Skip whitespace! (enum-level skip, logos 0.13+)
#[logos(skip r"//[^\n]*")]   // Skip comments!
pub enum Token {
    #[token("fn")]
    Fn,

    #[token("let")]
    Let,

    #[token("(")]
    LParen,

    #[token("->")]
    Arrow,

    #[regex(r"[a-zA-Z_][a-zA-Z0-9_]*")]
    Identifier,

    #[regex(r"[0-9][0-9_]*")]
    Integer,

    #[regex(r#""[^"]*""#)]
    String,
}
With Logos, you define tokens using attributes:
- `#[token("fn")]` — matches exactly "fn"
- `#[regex(r"[0-9]+")]` — matches any sequence of digits
- skip patterns (`logos::skip` or `#[logos(skip ...)]`) — matched and discarded (whitespace, comments)
Logos generates a fast, optimized lexer at compile time. No runtime overhead!
Putting It All Together
Let's trace through a complete example:
let x = 42;
The lexer processes this character by character:
| Position | Characters | Action | Token |
|---|---|---|---|
| 0-3 | `let` | Identifier starting with 'l', matches keyword | `Let` |
| 3 | (space) | Whitespace, skip | — |
| 4 | `x` | Identifier (not a keyword) | `Identifier("x")` |
| 5 | (space) | Whitespace, skip | — |
| 6 | `=` | Single-character operator | `Assign` |
| 7 | (space) | Whitespace, skip | — |
| 8-10 | `42` | Digit, read number | `Integer(42)` |
| 10 | `;` | Semicolon punctuation | `Semicolon` |
Final token stream:

`Let` `Identifier("x")` `Assign` `Integer(42)` `Semicolon`

This token stream is now ready for the parser!
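To make the trace concrete, here is a tiny self-contained lexer that handles just enough of the language to tokenize this one line (a teaching sketch that emits token names as strings, not Nova's real lexer):

```rust
// Tokenize a minimal subset: identifiers/keywords, integers, '=' and ';'.
fn lex(src: &str) -> Vec<String> {
    let b = src.as_bytes();
    let mut tokens = Vec::new();
    let mut i = 0;
    while i < b.len() {
        let c = b[i] as char;
        if c.is_whitespace() {
            i += 1; // step 1: skip whitespace
        } else if c.is_ascii_digit() {
            let start = i; // step 3: read the full number
            while i < b.len() && (b[i] as char).is_ascii_digit() {
                i += 1;
            }
            tokens.push(format!("Integer({})", &src[start..i]));
        } else if c.is_alphabetic() || c == '_' {
            let start = i; // read letters/digits, then check for keywords
            while i < b.len() && ((b[i] as char).is_alphanumeric() || b[i] == b'_') {
                i += 1;
            }
            tokens.push(match &src[start..i] {
                "let" => "Let".to_string(),
                "fn" => "Fn".to_string(),
                word => format!("Identifier(\"{}\")", word),
            });
        } else {
            tokens.push(match c {
                '=' => "Assign".to_string(),
                ';' => "Semicolon".to_string(),
                other => format!("Unknown({})", other),
            });
            i += 1;
        }
    }
    tokens
}

fn main() {
    let tokens = lex("let x = 42;");
    assert_eq!(
        tokens,
        ["Let", "Identifier(\"x\")", "Assign", "Integer(42)", "Semicolon"]
    );
    println!("{:?}", tokens);
}
```

Every branch corresponds to a row in the trace table above: skip, read, emit, repeat.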
Key Takeaway
The lexer transforms a stream of characters into a stream of tokens. Each token has a type (what it is), optionally a value (its content), and a span (where it is). This structured data is much easier for the next stage — the parser — to work with.
Why This Matters for Nova
Lexing might seem like a solved problem, but how a language handles this first stage affects everything downstream — especially error messages.
The Problem
Many languages lose precise position information during lexing. When an error occurs later, the compiler can only give vague locations like "somewhere on line 42." Debugging becomes guesswork.
How Nova Solves It
Nova's lexer preserves byte-accurate spans for every token. When verification fails, we can point to the exact character that caused the problem — not just the line, but the precise column and length.
Nova also tokenizes special keywords that other languages don't have:
- `requires` — preconditions for function contracts
- `ensures` — postconditions the function guarantees
- `invariant` — conditions that must always hold
- `where` — type constraints and refinements
These keywords are the foundation of Nova's verification system — they let you write contracts that the compiler can mathematically prove.
What's Next?
Now you understand how code becomes tokens. But tokens are still a flat list — we don't yet know that fn add(...) { ... } is a function definition.
In the next lesson, we'll learn about parsers and ASTs — how we build a tree structure that represents the meaning of the code.