← Home

From linear to binary search

This week, I’ve been playing around with a parsing library I wrote, working out how how I can rewrite it to make (hopefully sensible) use of Go generics.

It has an Input which maintains an Index, and you can Peek characters, but also Take characters which moves the Index - i.e. moves a pointer through the string. You use it by building up bigger parsers from parsers. Each of them rolls back if it can’t parse the whole thing.

type UUID string

var hexParser = parse.RuneIn("0123456789abcdefABCDEF")
var uuidStringParser = parse.StringFrom(
	parse.StringFrom(parse.Times(8, hexParser)),
	parse.Rune('-'),
	parse.StringFrom(parse.Times(4, hexParser)),
	parse.Rune('-'),
	parse.StringFrom(parse.Times(4, hexParser)),
	parse.Rune('-'),
	parse.StringFrom(parse.Times(4, hexParser)),
	parse.Rune('-'),
	parse.StringFrom(parse.Times(12, hexParser)),
)
var uuidParser = parse.Convert(uuidStringParser, func(s string) (UUID, error) {
	return UUID(s), nil
})

func main() {
	uuid := "123e4567-e89b-12d3-a456-426655440000"
	input := parse.NewInput(uuid)
	match, ok, err := uuidParser.Parse(input)
	fmt.Println("match:", match)
	fmt.Println("ok:", ok)
	fmt.Println("err:", err)
	fmt.Println("pos:", input.Position())
}

I wanted to add a Position feature, so that you can get a more human, line index and column position, not just the byte or character index within the file. So I created an array of the character positions of each line ending in the input string and stored it within the InputString struct as newLines.

To get the current line and col from a given index, you can read (forwards or backwards) through the array until you get to the right line. That’s what I did first.

func (in *InputString) Position() (line, column int) {
	var previousLineEnd int
	for lineIndex, lineEnd := range in.newLines {
		if in.charIndex > previousLineEnd && in.charIndex < lineEnd {
			return lineIndex + 1, in.charIndex - previousLineEnd
		}
		previousLineEnd = lineEnd
	}
	return -1, -1
}

Although it was quick to write, I was unhappy that it was a linear search. The speed taken is proportional to your position in the file - i.e. it will get slower for later lines because it has to read through the preceding lines.

A binary search would be much more efficient for large files. Instead of searching each item in the array in order, it will try the item that’s in the middle of all the values, immediately discounting half of the range. The search algorithm next tries a midpoint within the remaining values, repeating this process until it finds the right answer.

I thought about writing a binary search myself, but then I found that there’s one in the standard library called sort.Search - I never thought to look in sort to find that.

func (in *InputString) Position() (line, column int) {
	lineIndex := sort.Search(len(in.newLines), func(lineIndex int) bool {
		return in.charIndex <= in.newLines[lineIndex]
	})
	var previousLineEnd int
	if lineIndex > 0 {
		previousLineEnd = in.newLines[lineIndex-1] + 1
	}
	return lineIndex, in.charIndex - previousLineEnd
}

Is it worth it?

I wrote a quick benchmark.

package bench

import (
	"math/rand"
	"sort"
	"testing"
)

func linearSearch(data []int, value int) (index int) {
	for i, v := range data {
		if v > value {
			return i
		}
	}
	return 0
}

func binarySearch(data []int, value int) (index int) {
	return sort.Search(len(data), func(index int) bool {
		return data[index] < value
	})
}

func TestLinearAndBinary(t *testing.T) {
	// Generate some lines.
	lineIndices := make([]int, 1000)
	var index int
	for i := 1; i < len(lineIndices); i++ {
		index += rand.Intn(200) + 1
		lineIndices[i] = index
	}

	for i := 0; i < index; i++ {
		l := linearSearch(lineIndices, index)
		b := binarySearch(lineIndices, index)
		if l != b {
			t.Fatalf("expected l and b to match, got l=%v, b=%v for input %d", l, b, i)
		}
	}
}

func BenchmarkLinearAndBinary(b *testing.B) {
	// Generate some lines.
	lineIndices := make([]int, 10000)
	var index int
	for i := 1; i < len(lineIndices); i++ {
		index += rand.Intn(200) + 1
		lineIndices[i] = index
	}

	data := lineIndices[:10]
	b.Run("linear: 10", func(b *testing.B) {
		bench(b, data, linearSearch)
	})
	b.Run("binary: 10", func(b *testing.B) {
		bench(b, data, binarySearch)
	})

	data = lineIndices[:100]
	b.Run("linear: 100", func(b *testing.B) {
		bench(b, data, linearSearch)
	})
	b.Run("binary: 100", func(b *testing.B) {
		bench(b, data, binarySearch)
	})

	data = lineIndices[:1000]
	b.Run("linear: 1000", func(b *testing.B) {
		bench(b, data, linearSearch)
	})
	b.Run("binary: 1000", func(b *testing.B) {
		bench(b, data, binarySearch)
	})

	data = lineIndices[:10000]
	b.Run("linear: 10000", func(b *testing.B) {
		bench(b, data, linearSearch)
	})
	b.Run("binary: 10000", func(b *testing.B) {
		bench(b, data, binarySearch)
	})
}

func bench(b *testing.B, data []int, f func([]int, int) int) {
	max := data[len(data)-1]
	for i := 0; i < b.N; i++ {
		for index := 0; index < max; index++ {
			f(data, index)
		}
	}
}

For very small input sets, it costs more to do the setup and call the function than it saves, but somewhere between 50 and 100 values, it starts paying back.

Given my use case, in most cases, there will be more than 100 lines, so it makes sense to keep it.

BenchmarkLinearAndBinary/linear:_10-10            305419              3745 ns/op
BenchmarkLinearAndBinary/binary:_10-10            152731              7827 ns/op
BenchmarkLinearAndBinary/linear:_50-10             23444             51182 ns/op
BenchmarkLinearAndBinary/binary:_50-10             18067             66441 ns/op
BenchmarkLinearAndBinary/linear:_100-10             6616            182412 ns/op
BenchmarkLinearAndBinary/binary:_100-10             7824            152947 ns/op
BenchmarkLinearAndBinary/linear:_150-10             2707            444944 ns/op
BenchmarkLinearAndBinary/binary:_150-10             4393            271768 ns/op
BenchmarkLinearAndBinary/linear:_1000-10              68          17096097 ns/op
BenchmarkLinearAndBinary/binary:_1000-10             500           2400367 ns/op
BenchmarkLinearAndBinary/linear:_10000-10              1        1559057416 ns/op
BenchmarkLinearAndBinary/binary:_10000-10             32          35949938 ns/op