An LRU in Go (Part 2)
So we created a concurrency-safe LRU in the last post, but it was too slow when used concurrently because of all the locking.
Reducing the time spent waiting on locks is not trivial, but it is doable. You can use things like the sync/atomic package to swap pointers back and forth with essentially no locking. Our situation is more complicated than that, though: we have two separate data structures that need to be updated atomically (the list itself and the index), and I don’t know that we can keep both in sync without any mutexes.
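As a rough illustration of what sync/atomic buys you when there is only a single thing to swap, here is a sketch (not part of our LRU; the config type and function names are made up):

type config struct{ limit int } // a hypothetical immutable value

var current atomic.Value // holds a *config; needs "sync/atomic"

func update(limit int) {
    // A writer publishes a completely new value...
    current.Store(&config{limit: limit})
}

func limit() int {
    // ...and readers always see either the old or the new pointer,
    // never a half-updated struct. No mutex anywhere.
    return current.Load().(*config).limit
}

With two structures to keep consistent, though, a swap like this is not enough on its own.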
We can, however, easily lessen the amount of time spent waiting on mutexes by using sharding.
You see, right now we have a single list.List to hold the data and a map for the index. This means that when one goroutine is trying to add, say, {"foo": "bar"}, it has to wait until another finishes updating {"bar": "baz"}, because even though the two keys are not related at all, the lock needs to protect the entire list and index.
A quick and easy way around this is to split the data structures into shards. Each shard has its own data structures:
type shard struct {
    cap int   // maximum number of items
    len int32 // current number of items

    sync.Mutex // protects the index and list

    idx map[interface{}]*list.Element // the index for our list
    l   *list.List                    // the actual list holding the data
}
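One detail worth spelling out: for eviction to work, each list element has to carry its key as well as its value, so that dropping the oldest element can also delete the right entry from the index. Something along these lines (entry is my name for it, not necessarily what the real code uses):

type entry struct {
    key interface{} // kept so an eviction can also delete the index entry
    val interface{}
}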
And so now, our LRU looks a bit different:
type LRU struct {
    cap       int   // the max number of items to hold
    nshards   int   // number of shards
    shardMask int64 // mask used to select correct shard for key
    shards    []*shard
}
The methods on LRU are now also much simpler: all they do is call the equivalent method on the right shard:
func (l *LRU) Add(key, val interface{}) {
    l.shard(key).add(key, val)
}
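Inside a shard, add works much like the single-lock version from the last post, except the mutex now only guards this one shard. A sketch, using the entry type from above (the real method likely differs in details such as how it maintains the len counter):

func (s *shard) add(key, val interface{}) {
    s.Lock()
    defer s.Unlock()

    // Existing key: update the value and mark it as most recently used.
    if el, ok := s.idx[key]; ok {
        el.Value.(*entry).val = val
        s.l.MoveToFront(el)
        return
    }

    // New key: insert at the front, evicting the oldest item if we are full.
    s.idx[key] = s.l.PushFront(&entry{key: key, val: val})
    if s.l.Len() > s.cap {
        oldest := s.l.Back()
        s.l.Remove(oldest)
        delete(s.idx, oldest.Value.(*entry).key)
    }
}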
So now it’s time to figure out how to select the correct shard for a given key. To avoid two different values for the same key ending up in two different shards, a given key must always map to the same shard. We do this by passing a byte representation of the key through the Fowler–Noll–Vo hash function to get a 32-bit hash, and then doing a logical AND between that hash and a shard mask.
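Concretely, the selection can be as small as this (it uses hash/fnv from the standard library; I have factored the type switch described next into a hypothetical keyBytes helper):

func (l *LRU) shard(key interface{}) *shard {
    h := fnv.New32a()
    h.Write(keyBytes(key)) // byte representation of the key, see below
    return l.shards[int64(h.Sum32())&l.shardMask]
}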
The hard part is getting a byte representation of the key in the first place. It would be trivial if the key were of a fixed type (say, int or string), but we use interface{}, which means a bit more work. The shard() function is built around a large type switch that tries the fastest possible way to get the byte representation of each given type. For instance, if the type is int, we do:
const il = strconv.IntSize / 8 // size of an int in bytes (4 or 8)

func intBytes(i int) []byte {
    b := make([]byte, il)
    b[0] = byte(i)
    b[1] = byte(i >> 8)
    b[2] = byte(i >> 16)
    b[3] = byte(i >> 24)
    if il == 8 { // on 64-bit platforms, also fill in the four high bytes
        b[4] = byte(i >> 32)
        b[5] = byte(i >> 40)
        b[6] = byte(i >> 48)
        b[7] = byte(i >> 56)
    }
    return b
}
We do similar things for all of the known types. We also check for custom types that provide a String() string method (i.e. that implement the Stringer interface). This should cover the vast majority of types people are likely to use as a key. If, however, the type is unknown and is neither a Stringer nor has a Bytes() []byte method, we fall back to gob.Encoder, which works but is far too slow to be realistically usable.
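Put together, the type switch boils down to something like this, heavily abridged (keyBytes is my name for it, and the real code covers many more concrete types before reaching for the fallbacks; it needs bytes, encoding/gob and fmt):

func keyBytes(key interface{}) []byte {
    switch k := key.(type) {
    case string:
        return []byte(k)
    case int:
        return intBytes(k)
    case interface{ Bytes() []byte }:
        return k.Bytes()
    case fmt.Stringer:
        return []byte(k.String())
    default:
        // Last resort: gob encoding. It works for almost anything,
        // but it is far too slow for a hot path.
        var buf bytes.Buffer
        gob.NewEncoder(&buf).Encode(key) // error handling omitted in this sketch
        return buf.Bytes()
    }
}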
So what does all of this do for us? We never lock the entire LRU any more: each operation locks only the small portion it touches, which means much less time spent waiting on mutexes on average.
We can play around with how many shards to use depending on how much data we store; we could even have thousands of very small shards for very fine-grained locking. Of course, sharding has a small overhead of its own, so at some point adding more shards stops paying off.
The following benchmarks were run with 10000 shards and without concurrency:
BenchmarkAdd/mostly_new-4 1000000 1006 ns/op
BenchmarkAdd/mostly_existing-4 10000000 236 ns/op
BenchmarkGet/mostly_found-4 5000000 571 ns/op
BenchmarkGet/mostly_not_found-4 10000000 289 ns/op
BenchmarkRemove/mostly_found-4 3000000 396 ns/op
BenchmarkRemove/mostly_not_found-4 10000000 299 ns/op
And these are with 10000 shards and concurrent access:
BenchmarkAddParallel/mostly_new-4 2000000 719 ns/op
BenchmarkAddParallel/mostly_existing-4 10000000 388 ns/op
BenchmarkGetParallel/mostly_found-4 10000000 147 ns/op
BenchmarkGetParallel/mostly_not_found-4 20000000 126 ns/op
BenchmarkRemoveParallel/mostly_found-4 10000000 142 ns/op
BenchmarkRemoveParallel/mostly_not_found-4 20000000 338 ns/op
Still, each shard locks completely on every access; we might be able to do better than that.