The inverted index is the data structure that powers most, if not all, Information Retrieval systems, be it a web search engine or ElasticSearch (which is powered under the hood by Apache Lucene, written in Java). I started reading about it after my senior told me that ElasticSearch uses a different kind of data indexing from a regular database. So I did what I wanted to do: started reading a little about it and started thinking about implementing it (who wouldn’t want to do that?).
In this blog, I will share how I created a toy Inverted Index data structure, and will (not) be creating a simple parser for Boolean queries to retrieve the relevant data that I want (maybe next time).
What is an Inverted Index?
Heck, what the hell is Information Retrieval? It is anything that requires us to find some piece of data in an ocean of unstructured data, given some small context. When you search for anything on the web, you get a swarm of unstructured data, good or bad (your luck, man…), based on the keywords that you give to the search engine.
Information Retrieval has been a critical part of human history (I don’t know if it’s that deep). Like you trying to find a book in a library on how to rizz a gal: you give the librarian a description of what you are looking for, or some important keywords. Using that description or those keywords, the librarian will get you the best book available at the library, or whatever the librarian thinks is the best.
Now, an Inverted Index helps us structure that vast amount of unstructured data, so that information retrieval can be as accurate as possible for the keywords provided.
How is an Inverted Index structured?
Basically, the inverted index maps each word to the documents in which it is present.
+-----------+------+ +-------------+
| word | freq | | document ID |
+-----------+------+ +-------------+
| bankai | 2 | ---> | 1 -> 2 |
| domain | 1 | ---> | 2 |
| expansion | 1 | ---> | 2 |
+-----------+------+ +-------------+
Above is the diagram that I made to represent what I mean.
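In Go terms (the language I’ll be using below), that picture is just a map from a word to its list of document IDs. A minimal sketch of the shape, ignoring the freq column (the toy version below skips frequencies too):

index := map[string][]int{
	"bankai":    {1, 2},
	"domain":    {2},
	"expansion": {2},
}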
Toy Inverted Index
So, first I want to have two strings, and I want to create an Inverted Index on them. To do so, we first have to process those strings, i.e., run them through some kind of language processing, and then create an inverted index on the processed output.
The flow of the code will look like this.
DATA --> NLP(natural language processing) --> INDEXER
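In code, that flow ends up as three function calls chained together. A sketch, where data and docID stand in for one input document (tokenize, stemmer, and indexer are all built below):

tokens := tokenize(data)        // DATA: raw string -> tokens
processed := stemmer(tokens)    // NLP: lowercase, drop stop words, lemmatize
ir := indexer(processed, docID) // INDEXER: map each word to its docIDs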
Data processing
First, we will build our tokenizer, which will tokenize the given input string stream.
/*
tokenizer.go
*/
package main

import (
	"strings"
	"unicode"
)

type TokenType int

const (
	word TokenType = iota
	punct
	whitespace
)

type Token struct {
	Value string
	Type  TokenType
}

func isSeparator(r rune) bool {
	return unicode.IsSpace(r) || unicode.IsPunct(r)
}

func tokenize(input string) []Token {
	var tokens []Token
	var currTokenType TokenType
	var currTokenVal strings.Builder
	for _, r := range input {
		if isSeparator(r) {
			// flush the word we were building, if any
			if currTokenType == word && currTokenVal.Len() > 0 {
				tokens = append(tokens, Token{
					Value: currTokenVal.String(),
					Type:  currTokenType,
				})
				currTokenVal.Reset()
			}
			currTokenType = whitespace
			if unicode.IsPunct(r) {
				currTokenType = punct
			}
			// every separator becomes its own single-rune token
			tokens = append(tokens, Token{
				Value: string(r),
				Type:  currTokenType,
			})
		} else {
			currTokenType = word
			currTokenVal.WriteRune(r)
		}
	}
	// flush a trailing word at the end of the input
	if currTokenType == word && currTokenVal.Len() > 0 {
		tokens = append(tokens, Token{
			Value: currTokenVal.String(),
			Type:  currTokenType,
		})
	}
	return tokens
}
With this we have created our tokenizer, which takes a string input and splits it into tokens.
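As a quick sanity check, here is roughly what it produces for a short input; the numbers in parentheses are the TokenType values (0 = word, 1 = punct, 2 = whitespace):

for i, t := range tokenize("Brutus killed me") {
	fmt.Printf("%d: %q (%d)\n", i+1, t.Value, t.Type)
}
// 1: "Brutus" (0)
// 2: " " (2)
// 3: "killed" (0)
// 4: " " (2)
// 5: "me" (0)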
Now, we will stem the stream of tokens that we get from the tokenizer.
/*
stemmer.go
*/
package main

import (
	"strings"

	golem "github.com/aaaton/golem/v4"
	"github.com/aaaton/golem/v4/dicts/en"
)

// Initialize the stopwords with the English stop words; you can ask an LLM to
// generate the full list for you. Map them to the empty struct,
// because empty structs are of size 0.
var stopwords = map[string]struct{}{
	// a few examples; extend this with a full English stop-word list
	"i": {}, "it": {}, "me": {}, "you": {},
	"the": {}, "a": {}, "an": {},
	"be": {}, "did": {}, "is": {}, "was": {},
	"so": {}, "with": {},
}

func removeStopWords(tokens []Token) []Token {
	result := []Token{}
	for _, token := range tokens {
		_, ok := stopwords[token.Value]
		if !ok {
			result = append(result, token)
		}
	}
	return result
}

func removePunctAndSpace(tokens []Token) []Token {
	result := []Token{}
	for _, token := range tokens {
		if token.Type == word {
			result = append(result, token)
		}
	}
	return result
}

func lemmatizedTokens(tokens []Token) []Token {
	lemmer, err := golem.New(en.New())
	if err != nil {
		panic(err)
	}
	for i, token := range tokens {
		token.Value = lemmer.Lemma(token.Value)
		tokens[i] = token
	}
	return tokens
}

func stemmer(tokens []Token) []Token {
	// lowercase everything first, so stop-word lookups and
	// lemmatization work on a consistent form
	for i, token := range tokens {
		tokens[i] = Token{
			Value: strings.ToLower(token.Value),
			Type:  token.Type,
		}
	}
	tokens = removePunctAndSpace(tokens)
	tokens = removeStopWords(tokens)
	return lemmatizedTokens(tokens)
}
First, we initialize a map that contains all the English stop words. Stop words are words that by themselves do not add any meaning to a sentence for natural language processing. So we want to filter these words out; we wouldn’t want to index words that in and of themselves add no meaning to the document (let’s save some precious space).
Then we lemmatize our words. Lemmatization is the process of converting a word to its base form, or grouping different inflected forms of a word so they can be analyzed as a single item. For example, the words good, best, and better can all be reduced to the single word good. By doing this we can index different words that mean the same thing under only a single term.
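Putting the processing stage together, here is a sketch of what stemmer does to one tokenized sentence, assuming "me" is in the stop-word list and that golem lemmatizes "killed" to "kill":

for _, t := range stemmer(tokenize("Brutus killed me")) {
	fmt.Println(t.Value)
}
// brutus
// kill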
Indexer
Now that we are done with our data processing, we can go ahead and create our indexer.
/*
indexer.go
*/
package main

import (
	"encoding/gob"
	"errors"
	"os"
	"path/filepath"
	"slices"
)

type IR struct {
	Index map[string][]int
}

func NewIR() *IR {
	return &IR{
		Index: make(map[string][]int),
	}
}

// serializer saves the index to a file, for persistent storage.
func (i *IR) serializer(indexPath string) {
	f, err := os.OpenFile(indexPath, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()
	encoder := gob.NewEncoder(f)
	err = encoder.Encode(i)
	if err != nil {
		panic(err)
	}
}

// deserializer loads any data that might already be indexed.
func (i *IR) deserializer(indexFile string) {
	f, err := os.Open(indexFile)
	if err != nil {
		if errors.Is(err, os.ErrNotExist) {
			return
		}
		panic(err)
	}
	defer f.Close()
	decoder := gob.NewDecoder(f)
	err = decoder.Decode(i)
	if err != nil {
		panic(err)
	}
}

func indexer(tokens []Token, docID int) *IR {
	cwd, err := os.Getwd()
	if err != nil {
		panic(err)
	}
	indexFile := filepath.Join(cwd, "invert.index")
	ir := NewIR()
	ir.deserializer(indexFile)
	for _, token := range tokens {
		list, found := ir.Index[token.Value]
		// if the word is not present in the index, add it
		// with the docID it was found in and go to the
		// next iteration
		if !found {
			ir.Index[token.Value] = []int{docID}
			continue
		}
		// if the docID is not already present, append it
		// to the slice of docIDs
		if !slices.Contains(list, docID) {
			ir.Index[token.Value] = append(list, docID)
		}
	}
	ir.serializer(indexFile)
	return ir
}
The structure of the Inverted Index is very simple: we are creating a map that maps each word (string) to a slice of int (acting as docIDs).
Now, we want to persist our index, so that we can feed our indexer data over multiple iterations if we want. For this I am using the encoding/gob package, which encodes Go data structures so that transferring them becomes easy. We will be saving this data to a file; we can do so because the gob encoder writes to an io.Writer and the decoder reads from an io.Reader, and opening a file gives us both.
Then creating the index is pretty easy: just loop over the tokens and map each one to its slice of docIDs.
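And because the index is just a map, looking a word up later is a single map access. A hypothetical snippet, reusing the words from the diagram earlier:

ir := NewIR()
ir.Index["bankai"] = []int{1, 2}
// every document that contains "bankai"
fmt.Println(ir.Index["bankai"]) // [1 2]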
And there you have it, your own toy Inverted Index.
Testing Indexer
Let’s add some sample data for testing our indexer.
/*
main.go
*/
package main

import (
	"flag"
	"fmt"
	"os"
	"sort"
)

func main() {
	flag.String("init", "", "Initialize the data and create an Inverted index.")
	flag.Parse()
	if len(os.Args) < 2 {
		flag.Usage()
		return
	}
	switch os.Args[1] {
	case "init":
		initializeData()
	}
}

func initializeData() {
	inputs := []struct {
		ID   int
		Data string
	}{
		{
			ID:   1,
			Data: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me",
		},
		{
			ID:   2,
			Data: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
		},
	}
	for _, input := range inputs {
		tokens := tokenize(input.Data)
		fmt.Println("Original input: ", input)
		fmt.Println("Tokenized output:")
		for i, token := range tokens {
			fmt.Printf("%d: %s (%d)\n", i+1, token.Value, token.Type)
		}
		stemmedTokens := stemmer(tokens)
		fmt.Println("Stemmed output:")
		for i, token := range stemmedTokens {
			fmt.Printf("%d: %s (%d)\n", i+1, token.Value, token.Type)
		}
		fmt.Println("Starting indexing the input.")
		ir := indexer(stemmedTokens, input.ID)
		fmt.Println("Finished indexing the input.")
		printSortedMap(ir)
	}
}

func printSortedMap(ir *IR) {
	fmt.Println("Printing the index.")
	keys := make([]string, 0, len(ir.Index))
	for key := range ir.Index {
		keys = append(keys, key)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("word: %s, list: %v\n", k, ir.Index[k])
	}
}
Now, run the command
go run . init
And you should get this output at the end:
Printing the index.
word: ambitious, list: [2]
word: brutus, list: [1 2]
word: caesar, list: [1 2]
word: capitol, list: [1]
word: enact, list: [1]
word: hath, list: [2]
word: julius, list: [1]
word: kill, list: [1]
word: let, list: [2]
word: noble, list: [2]
word: tell, list: [2]
Afterword
So, with this we have our toy Inverted Index structure that can help us with Information Retrieval.
Just know this,
Reinvent the wheel, so that you can learn how to invent the wheel
– a nobody