In the past week, I implemented a simple web crawler in Go, and I’m here to share how I did it.
Web Crawler
A web crawler is a bot or program that fetches websites, either periodically or just once. This can be done for various reasons: building a ranking of websites for a search engine so it returns better query results, or scraping data from a site.
In this blog, we will implement a very simple web crawler, so let’s get started.
Setting up
Create a project folder, and set up your Go module using
go mod init github.com/[your-github-username]/webcrawler
And let’s do the most important thing a person can do in coding:
// main.go
package main

import "fmt"

func main() {
    fmt.Println("Hello world!")
}
Phew, we have done the hardest part of our coding: printing Hello world to the console.
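Run it to make sure everything is wired up:
❯ go run .
Hello world!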
Normalize URL
Now, before fetching a website’s content, we want a way to normalize URLs, so that multiple URLs pointing to the same location but differing in protocol, letter casing, or a trailing slash all map to the same string.
We will be doing test-driven development (Uncle Bob will be happy…). Let’s write the test cases for this first.
// normalize_url_test.go
package main

import (
    "testing"

    "github.com/stretchr/testify/assert"
)

func TestNormalizeURL(t *testing.T) {
    expected := "blog.vikuuu.dev/path"
    url := "https://blog.vikuuu.dev/path"
    got, err := normalizeURL(url)
    assert.NoError(t, err)
    assert.Equal(t, expected, got)

    url = "http://blog.vikuuu.dev/path"
    got, err = normalizeURL(url)
    assert.NoError(t, err)
    assert.Equal(t, expected, got)

    url = "http://blog.vikuuu.dev/path/"
    got, err = normalizeURL(url)
    assert.NoError(t, err)
    assert.Equal(t, expected, got)

    url = "https://blog.vikuuu.dev/path/"
    got, err = normalizeURL(url)
    assert.NoError(t, err)
    assert.Equal(t, expected, got)

    expected = "vikuuu.github.com"
    url = "HTTPS://Vikuuu.github.com"
    got, err = normalizeURL(url)
    assert.NoError(t, err)
    assert.Equal(t, expected, got)

    expected = ""
    url = ""
    got, err = normalizeURL(url)
    assert.NoError(t, err)
    assert.Equal(t, expected, got)

    expected = "www.github.com/vikuuu"
    url = "http://www.github.com/Vikuuu"
    got, err = normalizeURL(url)
    assert.NoError(t, err)
    assert.Equal(t, expected, got)

    url = "://bankai.xl"
    got, err = normalizeURL(url)
    assert.Error(t, err)
}
Here I am using the testify package for testing. The test cases are pretty simple, and you can see from them what we want our function to do.
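If you haven’t pulled testify into the module yet, a quick go get takes care of it:
❯ go get github.com/stretchr/testify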
// normalize_url.go
package main

import (
    "net/url"
    "strings"
)

func normalizeURL(rawURL string) (string, error) {
    parsedURL, err := url.Parse(rawURL)
    if err != nil {
        return "", err
    }

    fullPath := parsedURL.Host + parsedURL.Path
    fullPath = strings.ToLower(fullPath)
    fullPath = strings.TrimSuffix(fullPath, "/")

    return fullPath, nil
}
Run the tests and they should all pass.
❯ go test ./...
ok github.com/Vikuuu/webcrawler 0.002s
Get URLs from HTML
Suppose we have the HTML data; now we want to extract all the links it might contain, so that we can crawl those pages too.
Let’s write some test cases for that.
// html_parse_test.go
package main

import (
    "testing"

    "github.com/stretchr/testify/assert"
)

func TestGetURLsFromHTML(t *testing.T) {
    htmlBody := `<html>
<body>
    <a href="/path/one">
        <span>Some page</span>
    </a>
    <a href="https://other.com/path/one">
        <span>some other page</span>
    </a>
</body>
</html>`
    inputUrl := "https://vikuuu.github.io"
    malformedHtml := `<html>
<ch<bankai>>
<body>
</htl`

    // Wrong raw base url provided
    wrongRawURL := "://bankai.com"
    _, err := getURLsFromHTML(htmlBody, wrongRawURL)
    assert.Error(t, err)

    // Malformed html body passed, still no error should be returned
    _, err = getURLsFromHTML(malformedHtml, inputUrl)
    assert.NoError(t, err)

    // get the valid links out
    expected := []string{"https://vikuuu.github.io/path/one", "https://other.com/path/one"}
    got, err := getURLsFromHTML(htmlBody, inputUrl)
    assert.NoError(t, err)
    assert.Equal(t, expected, got)

    // malformed url in anchor tag
    htmlBody = `<html>
<body>
    <a href="://other.com/path/one">
        <span>some other page</span>
    </a>
</body>
</html>`
    got, err = getURLsFromHTML(htmlBody, inputUrl)
    assert.Zero(t, len(got))
}
Run the tests and they will fail, and that is what we want. In TDD we first write the test cases and watch them fail (what a sadist), then write the functionality the tests describe, and then watch them pass (now this is great).
Now, write the code for getURLsFromHTML.
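The implementation below leans on the golang.org/x/net/html package, so pull it into the module first. Note that the Descendants iterator used below needs a reasonably recent version of the package (and Go 1.23 or newer for range-over-func):
❯ go get golang.org/x/net/html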
// html_parse.go
package main

import (
    "fmt"
    "net/url"
    "strings"

    "golang.org/x/net/html"
)

func getURLsFromHTML(htmlBody, rawBaseURL string) ([]string, error) {
    baseURL, err := url.Parse(rawBaseURL)
    if err != nil {
        return nil, err
    }

    htmlReader := strings.NewReader(htmlBody)
    doc, err := html.Parse(htmlReader)
    if err != nil {
        return nil, err
    }

    urls := []string{}
    for n := range doc.Descendants() {
        if n.Type == html.ElementNode && n.Data == "a" {
            for _, attr := range n.Attr {
                if attr.Key == "href" {
                    href, err := url.Parse(attr.Val)
                    if err != nil {
                        fmt.Printf("couldn't parse href '%v': %v\n", attr.Val, err)
                        continue
                    }
                    resolvedUrl := baseURL.ResolveReference(href)
                    urls = append(urls, resolvedUrl.String())
                }
            }
        }
    }

    return urls, nil
}
In this function, we first parse the given base URL. We then wrap the HTML string in an io.Reader, because we are using the golang.org/x/net/html package for parsing and its Parse function takes an io.Reader as input.
Then we traverse the tree structure produced by parsing the HTML, looking specifically for a tags and the href attribute inside them. When we find an href we parse it and resolve it against the base URL of the page we are parsing, because the URL we find may be relative to the page we are on. We append each resolved URL and finally return the urls slice.
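To make the resolution step concrete, here is a tiny standalone sketch (the file name and URLs are just for illustration) of what ResolveReference does with a relative versus an absolute href:
// resolve_example.go (illustration only, not part of the crawler)
package main

import (
    "fmt"
    "net/url"
)

func main() {
    base, _ := url.Parse("https://vikuuu.github.io/blog/")
    relative, _ := url.Parse("/path/one")
    absolute, _ := url.Parse("https://other.com/path/one")

    // a relative href is resolved against the base URL's host
    fmt.Println(base.ResolveReference(relative)) // https://vikuuu.github.io/path/one
    // an absolute href is returned as-is
    fmt.Println(base.ResolveReference(absolute)) // https://other.com/path/one
}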
Run the tests and we should now pass all the test cases.
Get HTML
Now we want to fetch the HTML content of a given URL. We won’t write tests for this, because the function isn’t pure: it has side effects. If we fetch a website’s HTML now we get one response, but there is no guarantee that fetching it tomorrow returns the same content. We mostly write test cases for pure functions.
// parse_html.go
package main

import (
    "errors"
    "io"
    "net/http"
    "strings"
    "time"
)

func getHTML(rawURL string) (string, error) {
    c := http.Client{
        Timeout: 15 * time.Second,
    }

    res, err := c.Get(rawURL)
    if err != nil {
        return "", err
    }
    defer res.Body.Close()

    // 400+ status codes are treated as errors
    if res.StatusCode >= 400 {
        return "", errors.New(res.Status)
    }

    // Content-Type must be text/html
    contentType := res.Header.Get("content-type")
    if !strings.Contains(contentType, "text/html") {
        return "", errors.New("content type not 'text/html'")
    }

    body, err := io.ReadAll(res.Body)
    if err != nil {
        return "", err
    }

    return string(body), nil
}
With this we can get the HTML content of any URL we give it.
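If you want to sanity-check it before the crawler exists, a throwaway main works; the URL here is only an example:
// main.go (temporary check, revert to the real main afterwards)
package main

import (
    "fmt"
    "log"
)

func main() {
    body, err := getHTML("https://vikuuu.github.io/")
    if err != nil {
        log.Fatalf("%s\n", err)
    }
    fmt.Printf("fetched %d bytes of HTML\n", len(body))
}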
Crawl Web Page
Now it is time to create the crawling functionality. The idea is: we define a start URL, crawl that page, extract all its links, and then crawl all of those pages too.
We want to create concurrent crawlers that can crawl multiple pages simultaneously. For that we define a struct.
// crawl.go
package main

import (
    "net/url"
    "sync"
)

type config struct {
    pages              map[string]int
    maxPages           int
    baseUrl            *url.URL
    mu                 *sync.Mutex
    concurrencyControl chan struct{}
    wg                 *sync.WaitGroup
}
In this config struct we have:
- pages: a map that counts how many times each URL is encountered.
- maxPages: the crawl limit; without it we might start crawling the whole web.
- baseUrl: the starting URL we are given.
- mu: a mutex for safely sharing the pages map between goroutines.
- concurrencyControl: a buffered channel that caps how many crawlers run at once (see the small sketch after this list).
- wg: a wait group so we don’t exit before all the crawlers finish.
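The buffered channel works like a semaphore: sending into it claims a slot, receiving from it frees the slot, and a send blocks while all slots are taken. Here is a minimal standalone sketch of the pattern, with made-up numbers:
// semaphore_sketch.go (illustration only, not part of the crawler)
package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    slots := make(chan struct{}, 2) // at most 2 workers run at once
    var wg sync.WaitGroup

    for i := 1; i <= 5; i++ {
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            slots <- struct{}{}        // claim a slot (blocks if both are busy)
            defer func() { <-slots }() // free the slot when done
            fmt.Println("worker", id, "running")
            time.Sleep(100 * time.Millisecond)
        }(i)
    }
    wg.Wait()
}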
Now let’s create a starting point from which the whole crawling process will begin.
// crawl.go
func (cfg *config) crawl() {
    // add a new goroutine to the wait group
    cfg.wg.Add(1)
    go func() {
        // acquiring a slot here reserves a place for our
        // goroutine to crawl the page
        cfg.concurrencyControl <- struct{}{}
        cfg.crawlPage(cfg.baseUrl.String())
    }()

    // keep waiting for all the goroutines to finish
    cfg.wg.Wait()
}
Now the function that crawls a page and then spawns new goroutines to crawl each of the pages it found there.
// crawl.go
import (
    "log"
    "net/url"
    "sync"
)

func (cfg *config) crawlPage(rawCurrentURL string) {
    // indicate we are done, even in the case of an early return
    defer cfg.wg.Done()
    // free up our slot so another waiting goroutine can start crawling
    defer func() { <-cfg.concurrencyControl }()

    // the pages map is shared between goroutines, so guard reads with the mutex
    cfg.mu.Lock()
    reachedMax := len(cfg.pages) >= cfg.maxPages
    cfg.mu.Unlock()
    if reachedMax {
        // max limit reached, exit!
        return
    }

    parsedCurrUrl, err := url.Parse(rawCurrentURL)
    if err != nil {
        return
    }

    if cfg.baseUrl.Host != parsedCurrUrl.Host {
        // if we are on a different host than the one we started with,
        // stop (you can change this if you want).
        return
    }

    nrmlCurrUrl, err := normalizeURL(rawCurrentURL)
    if err != nil {
        log.Printf("err normalizing url %s %s\n", rawCurrentURL, err)
        return
    }

    isFirst := cfg.addPageVisit(nrmlCurrUrl)
    // if we have already crawled this page, exit!
    if !isFirst {
        return
    }

    body, err := getHTML(rawCurrentURL)
    if err != nil {
        log.Printf("%s\n", err)
        return
    }

    urls, err := getURLsFromHTML(body, rawCurrentURL)
    if err != nil {
        log.Printf("%s\n", err)
        return
    }

    for _, url := range urls {
        cfg.wg.Add(1)
        go func(c string) {
            cfg.concurrencyControl <- struct{}{}
            cfg.crawlPage(c)
        }(url)
    }
}
func (cfg *config) addPageVisit(normalizedURL string) (isFirst bool) {
    // the mutex protects the shared pages map from concurrent access
    cfg.mu.Lock()
    defer cfg.mu.Unlock()

    _, ok := cfg.pages[normalizedURL]
    cfg.pages[normalizedURL]++
    return !ok
}
With this we have written the logic of crawling the web page.
Starting point
Now we just need to update the main.go
which will take the the arguments using terminal.
// main.go
package main

import (
    "fmt"
    "log"
    "net/url"
    "os"
    "strconv"
    "sync"
)

func main() {
    args := os.Args[1:]
    if len(args) < 3 {
        fmt.Println("usage: webcrawler <url> <max-concurrency> <max-pages>")
        os.Exit(1)
    }

    rawUrl := args[0]
    // set the concurrency control limit
    mc, err := strconv.Atoi(args[1])
    if err != nil {
        log.Fatalf("invalid max concurrency: %s\n", err)
    }
    // set the max pages to crawl limit
    mp, err := strconv.Atoi(args[2])
    if err != nil {
        log.Fatalf("invalid max pages: %s\n", err)
    }

    parsedUrl, err := url.Parse(rawUrl)
    if err != nil {
        log.Fatalf("%s\n", err)
    }

    cfg := config{
        pages:              make(map[string]int),
        maxPages:           mp,
        baseUrl:            parsedUrl,
        mu:                 &sync.Mutex{},
        concurrencyControl: make(chan struct{}, mc),
        wg:                 &sync.WaitGroup{},
    }

    log.Printf("starting crawl of %s\n", rawUrl)
    cfg.crawl()

    fmt.Printf("=============================\n"+
        "  REPORT for %s\n"+
        "=============================\n", rawUrl)
    for k, v := range cfg.pages {
        fmt.Printf("Found %d internal links to %s\n", v, k)
    }
}
With this, you can start crawling with the following command
go run . https://vikuuu.github.io/ 3 10
Be sure to add a few log statements so you can see everything in action, or you might think your program is stuck.
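Something as simple as a single log line at the top of crawlPage does the job; this is just a suggested addition, not part of the code above:
// inside crawlPage, right after the deferred cleanup
log.Printf("crawling %s\n", rawCurrentURL)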
Afterword
And voilà, with this we have our basic implementation of a web crawler.
Until next time.
Code Link: Github
Just know this,
Reinvent the wheel, so that you can learn how to invent the wheel
– a nobody