Writing a Go Lexer in Go

Introduction
Learning about lexical analysis from a high-level perspective might seem straightforward. However, when it comes to actually implementing it, things can quickly get confusing (at least, that’s been my experience!). This article is an attempt to delve deeper into lexing — understanding its inner workings, nuances, and practical implementation in Go. Hopefully, you’ll find something helpful here too.
What is Lexical Analysis?
Lexical analysis (also called lexing or tokenization) is the process of breaking down a large block of text — in this case, a program — into smaller, recognizable chunks called tokens. Think of it as identifying individual “words” in a sentence. Each token can represent keywords, identifiers, symbols, or other elements that form the structure of a programming language.
Lexical analysis doesn’t aim to understand the meaning of the code. Instead, it’s an initial step to organize the code into tokens that can be further analyzed by a parser or evaluator. Here, we’ll walk through building a simple lexer in Go.
Writing a lexer in Go
In this article, we’ll create a lexer specifically to tokenize Go’s struct syntax. We’ll start with a small example to get an idea of what we want to lex.
Example Input
Consider the following Go snippet.
package main

type users struct {
    ID       int
    Name     string
    IsMember bool
}

We want our lexer to tokenize this code. But before we can write our lexer, let’s define what “tokens” we’re dealing with.
Defining Tokens
Tokens represent the smallest meaningful elements of the code, such as keywords, data types, identifiers, and symbols. We’ll define our tokens in a Go file called token.go.
package parser

type TokenType string

const (
    // Special tokens
    ILLEGAL = "ILLEGAL"
    EOF     = "EOF"

    // Literals
    IDENT = "IDENT"

    // Misc characters
    ASTERISK = "*"
    COMMA    = ","
    LBRACE   = "{"
    RBRACE   = "}"
    LPAREN   = "("
    RPAREN   = ")"

    // Keywords
    TYPE    = "TYPE"
    STRUCT  = "STRUCT"
    PACKAGE = "PACKAGE"

    // Data Types
    INT    = "INT"
    STRING = "STRING"
    BOOL   = "BOOL"
)

type Token struct {
    Type    TokenType
    Literal string
}

var keywords = map[string]TokenType{
    "type":    TYPE,
    "struct":  STRUCT,
    "package": PACKAGE,
    "int":     INT,
    "string":  STRING,
    "bool":    BOOL,
}

Here, we define constants for various tokens. The Token struct contains a token type and a literal value. The keywords map helps in identifying Go-specific keywords and data types.
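The scanner we build next calls a LookupIdent function to decide whether a word is a keyword or a plain identifier. Its definition isn’t shown in the listing above, but a minimal version built on the keywords map might look like this:

// LookupIdent returns the keyword token type if ident is a known
// keyword or data type, and IDENT otherwise.
func LookupIdent(ident string) TokenType {
    if tok, ok := keywords[ident]; ok {
        return tok
    }
    return IDENT
}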
Implementing the Scanner
The purpose of scanner.go is to read through the input code one character at a time and identify small pieces of it, called tokens. These tokens will be basic building blocks (like keywords, identifiers, braces, etc.) that will make it easier to analyze and understand the structure of the code later. Here’s how it achieves this:
- Character-by-Character Scanning:
  - The scanner reads the code one character at a time.
  - It keeps track of its current position (position), the next reading position (readPosition), and the current character (ch).
- Tokenizing Individual Elements:
  - Each time NextToken() is called, the scanner determines what kind of token it’s looking at. It uses the character in ch to decide whether it’s a specific symbol (like { or }), a keyword (like package), an identifier (like a variable name), or an integer.
  - The NextToken() method uses helper functions: readChar advances the scanner by reading the next character, skipWhitespace skips spaces, tabs, and newlines, and readIdentifier and readNumber capture sequences of letters (identifiers/keywords) or digits, respectively.
- Token Creation:
  - Depending on what it reads, the scanner creates a Token with a Type and a Literal (the actual text value).
  - For example, if it reads {, it creates a token with Type: LBRACE and Literal: "{".
- Basic Error Handling:
  - If the scanner encounters a character it doesn’t recognize, it returns a token of type ILLEGAL.
The scanner.go file serves as the foundation of lexical analysis. It’s responsible for breaking the input code into recognizable pieces, which will then be used by lexer.go for further processing.
package parser

type Scanner struct {
    input        string
    position     int  // current position in input
    readPosition int  // current reading position
    ch           byte // current character
}

// NewScanner creates a new instance of Scanner
func NewScanner(input string) *Scanner {
    s := &Scanner{input: input}
    s.readChar() // Initialize first character
    return s
}

// readChar advances the scanner and updates the current character
func (s *Scanner) readChar() {
    if s.readPosition >= len(s.input) {
        s.ch = 0
    } else {
        s.ch = s.input[s.readPosition]
    }
    s.position = s.readPosition
    s.readPosition++
}

// NextToken generates the next token from input
func (s *Scanner) NextToken() Token {
    var tok Token
    s.skipWhitespace()

    switch s.ch {
    case '{':
        tok = Token{Type: LBRACE, Literal: string(s.ch)}
    case '}':
        tok = Token{Type: RBRACE, Literal: string(s.ch)}
    case 0:
        tok = Token{Type: EOF, Literal: ""}
    default:
        if isLetter(s.ch) {
            literal := s.readIdentifier()
            tokType := LookupIdent(literal)
            tok = Token{Type: tokType, Literal: literal}
            return tok
        } else if isDigit(s.ch) {
            literal := s.readNumber()
            tok = Token{Type: INT, Literal: literal}
            return tok
        } else {
            tok = Token{Type: ILLEGAL, Literal: string(s.ch)}
        }
    }
    s.readChar()
    return tok
}
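The listing above references skipWhitespace, readIdentifier, readNumber, isLetter, and isDigit, which aren’t shown here. A minimal sketch of these helpers, based on how they are described earlier, might look like this:

// skipWhitespace advances past spaces, tabs, and newlines.
func (s *Scanner) skipWhitespace() {
    for s.ch == ' ' || s.ch == '\t' || s.ch == '\n' || s.ch == '\r' {
        s.readChar()
    }
}

// readIdentifier reads a contiguous run of letters and returns it.
func (s *Scanner) readIdentifier() string {
    start := s.position
    for isLetter(s.ch) {
        s.readChar()
    }
    return s.input[start:s.position]
}

// readNumber reads a contiguous run of digits and returns it.
func (s *Scanner) readNumber() string {
    start := s.position
    for isDigit(s.ch) {
        s.readChar()
    }
    return s.input[start:s.position]
}

// isLetter reports whether ch can appear in an identifier or keyword.
func isLetter(ch byte) bool {
    return 'a' <= ch && ch <= 'z' || 'A' <= ch && ch <= 'Z' || ch == '_'
}

// isDigit reports whether ch is a decimal digit.
func isDigit(ch byte) bool {
    return '0' <= ch && ch <= '9'
}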
Testing the Scanner
To ensure our lexer correctly tokenizes input, we’ll create tests.
package parser

import (
    "testing"
)

func TestNextToken(t *testing.T) {
    input := `package main

type users struct {
    ID       int
    Name     string
    IsActive bool
}
`

    tests := []struct {
        expectedType    TokenType
        expectedLiteral string
    }{
        {PACKAGE, "package"},
        {IDENT, "main"},
        {TYPE, "type"},
        {IDENT, "users"},
        {STRUCT, "struct"},
        {LBRACE, "{"},
        {IDENT, "ID"},
        {INT, "int"},
        {IDENT, "Name"},
        {STRING, "string"},
        {IDENT, "IsActive"},
        {BOOL, "bool"},
        {RBRACE, "}"},
        {EOF, ""},
    }

    scanner := NewScanner(input)

    for i, tt := range tests {
        tok := scanner.NextToken()

        if tok.Type != tt.expectedType {
            t.Fatalf("tests[%d] - tokentype wrong. expected=%q, got=%q", i, tt.expectedType, tok.Type)
        }

        if tok.Literal != tt.expectedLiteral {
            t.Fatalf("tests[%d] - literal wrong. expected=%q, got=%q", i, tt.expectedLiteral, tok.Literal)
        }
    }
}
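Assuming the files above live together in a parser package, this test can be run from that directory with go test (for example, go test -run TestNextToken -v to see it run verbosely).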
Building the Lexer
The purpose of lexer.go is to orchestrate the tokenization process with a bit more control, allowing for backtracking and other parser-level functionalities. Here’s what it’s doing:
- Buffering for Backtracking:
  - The Lexer has a one-token buffer (buf), allowing it to store a token in case it needs to go back (known as “backtracking”). This feature is helpful when parsing, as some tokens may need to be checked multiple times to confirm they are correctly processed.
- Coordination with the Scanner:
  - The lexer uses the scanner to get tokens via nextToken().
  - This allows the lexer to step through tokens as it builds a more structured representation of the code.
- Token Management:
  - nextToken(): retrieves the next token, either from the buffer (if available) or directly from the scanner.
  - unscan(): allows the lexer to “push back” a token into the buffer, effectively enabling it to revisit a token if needed.
- Higher-Level Parsing Interface:
  - Lex() steps through each token produced by nextToken() until reaching the end of the file (EOF).
  - By handling token organization in this structured way, the lexer provides a foundation for interpreting the tokenized code and converting it into an intermediate form ready for parsing.
In essence, lexer.go is a layer on top of the scanner that adds token management features, enabling backtracking and providing an interface to walk through the tokenized code systematically. This structure is crucial as the next step (a full parser) will need to analyze and interpret tokens, possibly revisiting them, to correctly understand the overall code structure.
package parser

type Lexer struct {
    s   *Scanner
    buf struct {
        tok Token
        n   int
    }
}

// NewLexer creates a new instance of Lexer
func NewLexer(s *Scanner) *Lexer {
    return &Lexer{s: s}
}

func (p *Lexer) Lex() {
    for {
        tok := p.nextToken()
        if tok.Type == EOF {
            break
        }
    }
}

// nextToken retrieves the next token
func (p *Lexer) nextToken() Token {
    if p.buf.n != 0 {
        p.buf.n = 0
        return p.buf.tok
    }

    tok := p.s.NextToken()
    p.buf.tok = tok
    return tok
}
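The Token Management notes above mention an unscan() method for pushing a token back, but the listing doesn’t include it. With the one-token buffer shown here, a minimal sketch might be:

// unscan marks the buffered token as unread so the next call to
// nextToken returns it again instead of scanning a new one.
func (p *Lexer) unscan() {
    p.buf.n = 1
}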
Testing the Lexer
package parser

import (
    "testing"
)

func TestLexer(t *testing.T) {
    input := `package main

type users struct {
    ID       int
    Name     string
    IsActive bool
}
`

    tests := []Token{
        {Type: PACKAGE, Literal: "package"},
        {Type: IDENT, Literal: "main"},
        {Type: TYPE, Literal: "type"},
        {Type: IDENT, Literal: "users"},
        {Type: STRUCT, Literal: "struct"},
        {Type: LBRACE, Literal: "{"},
        {Type: IDENT, Literal: "ID"},
        {Type: INT, Literal: "int"},
        {Type: IDENT, Literal: "Name"},
        {Type: STRING, Literal: "string"},
        {Type: IDENT, Literal: "IsActive"},
        {Type: BOOL, Literal: "bool"},
        {Type: RBRACE, Literal: "}"},
        {Type: EOF, Literal: ""},
    }

    scanner := NewScanner(input)
    lexer := NewLexer(scanner)

    for i, expected := range tests {
        tok := lexer.nextToken()
        if tok.Type != expected.Type {
            t.Fatalf("tests[%d] - tokentype wrong. expected=%q, got=%q", i, expected.Type, tok.Type)
        }
        if tok.Literal != expected.Literal {
            t.Fatalf("tests[%d] - literal wrong. expected=%q, got=%q", i, expected.Literal, tok.Literal)
        }
    }
}

With this, we have a basic lexer that can tokenize Go struct syntax.
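To try the pieces outside of the tests, a small main program can wire up the scanner and print every token it produces. This is just an illustrative sketch; the module path example.com/lexer/parser is a placeholder for wherever the parser package actually lives:

package main

import (
    "fmt"

    "example.com/lexer/parser" // placeholder import path for the parser package above
)

func main() {
    input := `package main

type users struct {
    ID       int
    Name     string
    IsMember bool
}`

    s := parser.NewScanner(input)
    for {
        tok := s.NextToken()
        fmt.Printf("%-8s %q\n", tok.Type, tok.Literal)
        if tok.Type == parser.EOF {
            break
        }
    }
}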
In summary, scanner.go and lexer.go play essential roles in transforming raw code into a structured series of tokens, laying the groundwork for building a complete parser. scanner.go performs the detailed work of reading and identifying individual characters, while lexer.go adds higher-level control, like backtracking and token management, to make parsing more flexible and powerful. Together, they act as the first step in interpreting code, converting raw text into a form that can be further analyzed and eventually executed or compiled.
Understanding these files is crucial for anyone interested in building parsers, compilers, or interpreters, as they highlight the process of breaking down and managing code syntax. With a solid scanner and lexer in place, the next stages of parsing and building an abstract syntax tree become significantly easier, unlocking the potential for language processing and custom compilers. Whether you’re working with Go or another language, mastering these foundational components is key to designing robust and efficient language tools.