Introduction to Oxylisp
Why build a Lisp interpreter? Because Lisp is syntax in its purest form. No keywords, no precedence rules, just lists and symbols. It’s the perfect language for learning how interpreters work.
This series will teach you interpreter basics by building one. No theory, just code that works.
First, let’s define the syntax of our Lisp:
Besides numeric and string literals, we have the following data structures:
Lists are defined by [] and elements are separated by whitespace:

[1 2 3]

Records are defined by key-value pairs:

{:a [3 5 7] :b "foo"}

Variables are defined by def, functions by defn, and anonymous functions by fn:
(def x 42)
(defn double [n] (* 2 n))
(def plus-five (map (fn [n] (+ n 5)) (range 0 10)))
Lexing
Now that we know the syntax of our language, we can start writing our lexer. Lexing is the process of converting a sequence of characters into a sequence of tokens. We need a well-defined list of tokens that our lexer will recognize so we can later parse them.
I’ve come up with the following main tokens:
pub enum Tokens {
    Bounds(TokenBounds),
    Literal(Literal),
    Symbol(String),
    Key(String),
}
That can be broken down into:
- Bounds: define the start and end of a list, record, or function call.

pub enum TokenBounds {
    LeftParen,
    RightParen,
    LeftBracket,
    RightBracket,
    LeftCurlyBraces,
    RightCurlyBraces,
}

- Literals: numeric, string, list, record, and primitive (true, false, nil) literals.

pub enum Literal {
    Nil,
    Symbol(String),
    String(String),
    Integer(i32),
    Bool(bool),
    List(Vec<Literal>),
    Record(HashMap<String, Literal>),
}

- Symbols: variables and function names.
- Keys: record keys.
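One small practical note before the lexer: Literal::Record needs HashMap in scope, and it is convenient if tokens can be printed and compared in tests. The derives below are an assumption I'm making about the surrounding code rather than something the lexer requires:

use std::collections::HashMap;

// Assumed derives so tokens can be debug-printed and compared with assert_eq!.
#[derive(Debug, Clone, PartialEq)]
pub enum Tokens {
    Bounds(TokenBounds),
    Literal(Literal),
    Symbol(String),
    Key(String),
}

// TokenBounds and Literal carry the same derives.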
With the types defined, we can now write the lexer.
We will define some regular expressions to read the more complex tokens.
use anyhow::{anyhow, Result};
use lazy_static::lazy_static;
use regex::Regex;

// The '-' sits at the end of each character class so it is read as a literal
// hyphen rather than as a range.
lazy_static! {
    static ref IS_SYMBOL: Regex = Regex::new(r"^[A-Za-z+*/=<>!-][A-Za-z0-9+*/=<>!-]*$")
        .expect("Invalid symbol regex pattern");
    static ref IS_KEY: Regex = Regex::new(r"^:[A-Za-z+*/=<>!-][A-Za-z0-9+*/=<>!-]*$")
        .expect("Invalid key regex pattern");
    static ref IS_STRING: Regex = Regex::new(r#"^"([^"\\]|\\.)*"$"#)
        .expect("Invalid string regex pattern");
    static ref IS_INTEGER: Regex = Regex::new(r"^-?\d+$")
        .expect("Invalid integer regex pattern");
    static ref IS_FLOAT: Regex = Regex::new(r"^-?\d+\.\d+$")
        .expect("Invalid float regex pattern");
}
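Before wiring these into the lexer, a quick sanity check confirms the patterns classify raw token text as intended. This is a minimal test sketch (module and test names are mine) that would live next to the lexer:

#[cfg(test)]
mod lexer_tests {
    use super::*;

    #[test]
    fn classifies_raw_tokens() {
        assert!(IS_SYMBOL.is_match("double"));
        assert!(IS_SYMBOL.is_match("+"));
        assert!(IS_KEY.is_match(":name"));
        assert!(IS_STRING.is_match("\"foo\""));
        assert!(IS_INTEGER.is_match("-42"));
        assert!(IS_FLOAT.is_match("3.14"));
        assert!(!IS_INTEGER.is_match("3.14"));
    }
}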
The trick is to pad every delimiter with spaces. We can then split the whole expression on whitespace and match each substring to its token. For example, (+ 1 2) becomes " ( + 1 2 ) ", which splits into (, +, 1, 2, and ). (For why we chose this approach, see Design Decisions.)
pub fn tokenize(expression: &str) -> Result<Vec<Tokens>> {
    expression
        .replace('(', " ( ")
        .replace(')', " ) ")
        .replace('[', " [ ")
        .replace(']', " ] ")
        .replace('{', " { ")
        .replace('}', " } ")
        .split_whitespace()
        .map(tokenize_single)
        .collect()
}
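As a rough sketch of the padding at work on the record syntax from the introduction, this test (which would sit in the same test module and assumes the derives mentioned earlier) checks that brackets and curly braces come out as separate tokens:

#[test]
fn tokenizes_records_and_lists() {
    assert_eq!(
        tokenize("{:a [1 2]}").unwrap(),
        vec![
            Tokens::Bounds(TokenBounds::LeftCurlyBraces),
            Tokens::Key("a".to_string()),
            Tokens::Bounds(TokenBounds::LeftBracket),
            Tokens::Literal(Literal::Integer(1)),
            Tokens::Literal(Literal::Integer(2)),
            Tokens::Bounds(TokenBounds::RightBracket),
            Tokens::Bounds(TokenBounds::RightCurlyBraces),
        ]
    );
}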
fn tokenize_single(token: &str) -> Result<Tokens> {
    match token {
        "(" => Ok(Tokens::Bounds(TokenBounds::LeftParen)),
        ")" => Ok(Tokens::Bounds(TokenBounds::RightParen)),
        "[" => Ok(Tokens::Bounds(TokenBounds::LeftBracket)),
        "]" => Ok(Tokens::Bounds(TokenBounds::RightBracket)),
        "{" => Ok(Tokens::Bounds(TokenBounds::LeftCurlyBraces)),
        "}" => Ok(Tokens::Bounds(TokenBounds::RightCurlyBraces)),
        "true" => Ok(Tokens::Literal(Literal::Bool(true))),
        "false" => Ok(Tokens::Literal(Literal::Bool(false))),
        "nil" => Ok(Tokens::Literal(Literal::Nil)),
        "+" | "-" | "*" | "/" | "=" | "<" | ">" | "or" => Ok(Tokens::Symbol(token.to_string())),
        token if IS_KEY.is_match(token) => {
            if token.len() <= 1 {
                return Err(anyhow!("Invalid key token: {}", token));
            }
            // Drop the leading ':' so keys are stored by name.
            Ok(Tokens::Key(token[1..].to_string()))
        },
        token if IS_STRING.is_match(token) => {
            if token.len() < 2 {
                return Err(anyhow!("Invalid string token: {}", token));
            }
            // Strip the surrounding quotes, then resolve escape sequences.
            let content = &token[1..token.len() - 1];
            let unescaped = unescape_string(content)?;
            Ok(Tokens::Literal(Literal::String(unescaped)))
        },
        token if IS_FLOAT.is_match(token) => {
            // Literal has no Float variant yet, so floats are truncated to integers.
            let float_val: f64 = token
                .parse()
                .map_err(|_| anyhow!("Invalid float format: {}", token))?;
            let int_val = float_val as i32;
            Ok(Tokens::Literal(Literal::Integer(int_val)))
        },
        token if IS_INTEGER.is_match(token) => {
            let int_val: i32 = token
                .parse()
                .map_err(|_| anyhow!("Integer overflow or invalid format: {}", token))?;
            Ok(Tokens::Literal(Literal::Integer(int_val)))
        },
        token if IS_SYMBOL.is_match(token) => Ok(Tokens::Symbol(token.to_string())),
        _ => Err(anyhow!("Unrecognized token: '{}'", token)),
    }
}
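A few single-token cases worth spot-checking, since they exercise branches the earlier examples don't (again a sketch, assuming the derives from before):

#[test]
fn tokenize_single_edge_cases() {
    assert_eq!(tokenize_single("nil").unwrap(), Tokens::Literal(Literal::Nil));
    assert_eq!(tokenize_single("-7").unwrap(), Tokens::Literal(Literal::Integer(-7)));
    assert_eq!(
        tokenize_single("\"hi\"").unwrap(),
        Tokens::Literal(Literal::String("hi".to_string()))
    );
    // Anything that matches no rule is reported as an error.
    assert!(tokenize_single("@foo").is_err());
}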
fn unescape_string(s: &str) -> Result<String> {
    let mut result = String::new();
    let mut chars = s.chars();
    while let Some(ch) = chars.next() {
        if ch == '\\' {
            match chars.next() {
                Some('n') => result.push('\n'),
                Some('t') => result.push('\t'),
                Some('r') => result.push('\r'),
                Some('\\') => result.push('\\'),
                Some('"') => result.push('"'),
                Some(c) => return Err(anyhow!("Invalid escape sequence: \\{}", c)),
                None => return Err(anyhow!("Incomplete escape sequence at end of string")),
            }
        } else {
            result.push(ch);
        }
    }
    Ok(result)
}
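For example, an escaped \n in the source text becomes a real newline, and unknown escapes are rejected. A quick sketch, placed in the same test module as before:

#[test]
fn unescapes_common_sequences() {
    // The raw input below contains a literal backslash followed by 'n'.
    assert_eq!(unescape_string(r"line one\nline two").unwrap(), "line one\nline two");
    // Unknown escapes produce an error instead of passing through silently.
    assert!(unescape_string(r"bad \q escape").is_err());
}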
An expression like (+ 40 2) will be tokenized into an array of:
[
    Tokens::Bounds(TokenBounds::LeftParen),
    Tokens::Symbol("+".to_string()),
    Tokens::Literal(Literal::Integer(40)),
    Tokens::Literal(Literal::Integer(2)),
    Tokens::Bounds(TokenBounds::RightParen),
]
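That example translates directly into a unit test, sketched here under the same derive assumptions so the token vectors can be compared:

#[test]
fn tokenizes_a_simple_call() {
    assert_eq!(
        tokenize("(+ 40 2)").unwrap(),
        vec![
            Tokens::Bounds(TokenBounds::LeftParen),
            Tokens::Symbol("+".to_string()),
            Tokens::Literal(Literal::Integer(40)),
            Tokens::Literal(Literal::Integer(2)),
            Tokens::Bounds(TokenBounds::RightParen),
        ]
    );
}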
The lexer is the foundation. It turns text into structured data our parser can understand.
Next: Part 2 - Parsing