A Lock-Free Dictionary

by Jon

Development Design


Examining how a hash table-based dictionary can be made thread-safe in .NET without the use of locks

Introduction

Efficiency is a dangerous quarry. It must be pursued, or else applications will frustrate both users and developers while failing to achieve core goals; but an over-emphasis on efficiency is notorious for wasting developer time on tasks that result in obscure code that is hard to maintain, and indeed often a system that is less efficient as a whole.

Still, there are times when such concerns must be placed front and centre. Of the services that Orb provides, aggregated searching is:

  1. A key operation, upon which much of the functionality of the system hangs.
  2. Often the beginning of a customer’s engagement with the application, where commitment and patience are lowest, and potential competition from alternatives is greatest.
  3. An inherently time-consuming task, depending upon waiting for results from a variety of third-party systems which will take a varying amount of time to respond.

The collision of the inescapable delays on the one hand, and the core importance on the other, makes this an area in which efficiency deserves considerable, and continuous, attention. We can never escape the delays inherent to the operation, but we can certainly make sure that we don’t add more.

The most obvious step to reducing the delays is to make sure they all happen at the same time. The task of querying different sources is not just an “embarrassingly parallel problem” (as tasks which are relatively easy to break into parallel tasks are called), it’s a blatantly parallel one that’s already broken into such tasks for us. If we do 60 such operations (not the highest number we ever need to do) that each take 1-12 seconds, averaging 6, then this immediately moves us from a 6-minute wait to a 12-second wait.

The next step is to take advantage of some such tasks completing faster than others, so that after 1 second we’ve begun processing the early risers rather than waiting the full 12 seconds it takes for the slowest. Here though things become more complicated. These 60 different operations will each return anywhere from a handful to hundreds of items of interest that have to be combined into a single result.

We can consider each set one at a time, but this undoes much of the benefit we’ve gained from having the results come to us concurrently. Alternatively, we can interleave the processing of different sets of results and use locking to prevent the disasters that ensue when two different threads try to write to the same data structure at once. This lock, though, becomes a bottleneck, and our threads spend more time waiting politely on each other than bringing results to our users. Worse, contested locks result in context-switches, and the time taken swapping between different threads grows relative to the time taken delivering results. What we need here is a data structure that can be safely written to by multiple simultaneous threads.
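To make the problem concrete, the following is a minimal sketch of the lock-based approach just described, with hypothetical names and a deliberately simplified merge step; every thread delivering results must queue on the single lock:

    using System.Collections.Generic;

    class LockedAggregator
    {
        private readonly Dictionary<string, int> _merged = new Dictionary<string, int>();
        private readonly object _gate = new object();

        // Called concurrently as each source's results arrive. The single lock keeps the
        // dictionary consistent, but it also serialises every writer, so under heavy
        // contention threads spend their time waiting on each other rather than working.
        public void Add(string key, int score)
        {
            lock (_gate)
            {
                int existing;
                _merged.TryGetValue(key, out existing);
                _merged[key] = existing + score;
            }
        }
    }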

Today, one would probably use the ConcurrentDictionary class provided by the Framework Class Library. At the time we first looked at this problem this was not an option, for it didn’t exist (not being available in a stable release until 2010). When it became available there were still some characteristics that made it less than ideal. In particular, if one or two keys become “interesting” and are hit disproportionately by several simultaneous threads — a scenario we see quite often, as results often begin with the same series of items and “shadow” each other during processing — then the striped locking used by ConcurrentDictionary degenerates to much the same performance as using a single lock to protect the entire dictionary.

Instead, we adapted the work Dr. Cliff Click had done on a lock-free hash-table in Java for use in .NET. Recently, I returned to this work, looking at it again from first principles and considering some issues that had been ignored as irrelevant to our own concerns, producing a further-improved version of the dictionary, Ariadne.ThreadSafeDictionary, that is available as part of the Ariadne library under the EUPL.

How it Works

Dr. Click’s approach is explained well in his Google TechTalk presentation and the accompanying slides. One can adapt this reasonably directly to .NET, but doing so fails to deal well with the different characteristics of the platform. The version we have therefore differs from Click’s in some key ways, which are explained here.

Conceptually, any dictionary consists of a set of entries, composed of a key and a value, where keys are unique within the entire dictionary. In a hash-based dictionary (that is, a hash table), the location of the keys is a function of a hash code, which is itself a function of the key such that any equivalent key would produce the same hash code. Since it is rare to be able to guarantee unique hash codes,[1] duplicate hashes are dealt with in a variety of ways, most often by chaining (the hash identifies a linked list or similar structure containing the key-value entries) or by re-probing (other locations in the dictionary are used according to some algorithm).

Strictly, the hash need only be calculated once upon insertion and once upon retrieval. However, both chaining and re-probing give increasingly poor performance as the ratio of keys to “buckets” grows, so at some point the hash table is resized, which requires the re-calculation of the hash to determine the entry’s position in the new table. For this reason it is common to memoise the hash within the hash table. Depending on the implementation details, it may be possible to use this as a short-cut on key-comparisons to quickly identify some non-matches: since hash(x) != hash(y) entails x != y (but note that this does not hold vice versa — two non-equivalent items may have the same hash), and since calculating x == y may be more expensive than comparing the already-computed hash-codes, we can skip some definite non-matches by examining hashes alone, though we must still examine the keys when hashes are identical, as this could be a hash collision on different keys.
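By way of illustration, here is that short-cut in miniature; the names are purely illustrative and not the ThreadSafeDictionary internals:

    using System.Collections.Generic;

    static class HashShortcut
    {
        // Compare the memoised hash first; only fall back to the (potentially expensive)
        // key comparison when the hashes match, since equal hashes may still be a collision.
        public static bool MatchesSlot<TKey>(int storedHash, TKey storedKey, int searchHash, TKey searchKey)
        {
            if (storedHash != searchHash)
                return false;   // hash(x) != hash(y) entails x != y: a definite non-match
            return EqualityComparer<TKey>.Default.Equals(storedKey, searchKey);
        }
    }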
One of the design decisions in ThreadSafeDictionary was to consider this memoised hash to be an inherent part of the entry, rather than a “mere” optimisation. An entry therefore consists of the hash, the key, and the value: {H, K, V}.

For reasons that will be discussed later, it is necessary for us to on occasion make use of a special form of key (related to the incremental resizing of the table), and a special form of value (both to note a removed value, and also as part of the incremental resizing). One possibility is to store the key and value as type object. This would allow us to detect the special keys and values (since they need not be of the type used for either). It would however require us to cast to and from the types we are implementing the dictionary for. This cost is minor but not free in the case of reference types, and more considerable in the case of value types (requiring boxing and unboxing operations). For this reason, we use an internal entry type that stores the keys and values. We have a choice here of using separate types for each, or one type which contains both key and value. It was decided to have one type that stored both, which we shall call KV.

Our entry therefore is of the form {H, {K, V}}; that is, it is composed of a composite of a hash code and an object which is in turn composed of a key and a value.

To be able to operate upon the KV object atomically, it must be a reference type, so that we can use it with Interlocked.CompareExchange. The entry structure, though, could logically be either a reference or a value type. In general, for a type which must (as this must) be mutable, we would strongly favour reference types. Here, though, because we must very quickly scan through an array of these entries (as we re-probe if our hash function does not immediately identify the correct entry), there is a considerable gain to be made from using a value type. Mutable value types are heavily frowned upon in .NET and Mono (having been called “evil” more than once). However, in this case the gain is great enough to justify their use, and they are used only for internal structures which are not exposed outside of the assembly, where they could cause confusion.
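As a sketch of the shape this gives the internal types, going by the description above (field and type names are illustrative, and the real Ariadne types differ in detail):

    // KV is a reference type so that a slot's key/value pair can be swapped atomically
    // with Interlocked.CompareExchange; the table entry is a mutable struct that keeps
    // the memoised hash next to the KV reference in the same array element.
    internal class KV<TKey, TValue>
    {
        public TKey Key;
        public TValue Value;
    }

    internal struct Record<TKey, TValue>     // {H, {K, V}}
    {
        public int Hash;                     // 0 marks an empty slot
        public KV<TKey, TValue> KeyValue;    // null until a write completes
    }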

Between the choices of chaining and re-probing, re-probing offers two advantages. One is that chaining requires at least two different types of internal structure to work upon — the collection of chains, and the chains themselves, with added cases that have to be reasoned about in terms of lock-free thread-safety, and hence an added possibility of bugs. The second is that when scanning through the set of entries in a linear probe we have greater locality of reference. When a CPU examines or changes a memory location it does so via several caches that speed its performance. Pieces of memory are loaded into cache in small chunks called cache-lines (64 bytes on 32-bit Intel processors and AMD 64-bit and compatible, 128 bytes on Itanium processors).


Using an array of value types means that the reference for the KV object is loaded into the cache along with the hash code. Further, if the hash — or clear evidence of it not being present — is not found on the first test, there are up to 3 (on AMD 64-bit) or 7 (on 32-bit or Itanium) subsequent records already loaded into the CPU’s cache for examination. In addition, the use of value types means that the framework ensures that the array begins with 0 for all hashes and null for all KVs, which we use as our starting point. Should such optimisations seem esoteric and hard to justify, consider that in tests where reference types were used, the unit-tests took around 13% longer to run (and that includes the time for those unit tests that don’t use this structure, so 13% underestimates the gain).

An even more classic .NET/Mono rule that we break is that we manipulate the structures through publicly accessible fields. This allows us to pass the relevant field directly to the Interlocked.CompareExchange method. Ironically, one could argue that there is a safety advantage in doing this — the “evil” of mutable value types comes from their not behaving as one expects .NET/Mono objects to behave, and ignoring the normal conventions .NET/Mono has for OO concepts reminds one that those conventions don’t completely apply.
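Continuing the illustrative Record and KV sketch from earlier, this is what passing the field directly looks like; because the struct lives in the array itself, ref table[idx].KeyValue names the actual memory location rather than a copy:

    using System.Threading;

    static class SlotWriter
    {
        // Attempt to replace the pair in a slot; returns true only if no other thread
        // changed the slot between our read of 'expected' and this call.
        public static bool TryWritePair<TKey, TValue>(Record<TKey, TValue>[] table, int idx,
                                                      KV<TKey, TValue> expected,
                                                      KV<TKey, TValue> replacement)
        {
            var previous = Interlocked.CompareExchange(ref table[idx].KeyValue, replacement, expected);
            return ReferenceEquals(previous, expected);
        }
    }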

Since we cannot apply CompareExchange to two disparate fields at the same time, a vital part of the correctness of the algorithm depends upon restrictions on what state transitions are allowed for each entry. We will consider first the algorithm without the requirement to handle incremental resizing, as dealing with resizing builds upon that.

Consider that an entry starts in the state {0, ∅}; that is, the hash code is zero and the KV is null. We prohibit zero hash codes for entries (any hash of zero is replaced with 0x55555555), so a hash of zero infallibly identifies an empty record. We may transition a hash of zero to a non-zero hash, but a non-zero hash must never be changed. This produces an entry of the form {H, ∅}. Such an entry may be transitioned to contain a KV where the hash code of the key is identical to H:

[Diagram: entry state transitions {0, ∅} → {H, ∅} → {H, KV}]

With this, the entries encountered by the GET operation can be one of the following:

  1. Zero hash: The key is not in the hash table.
  2. Non-matching hash: The key is not in this entry, and never will be, but may be in the next.
  3. Matching hash, null key: A write is in progress and the GET has got ahead of its completion; report the key as not found.
  4. Matching hash: The key may or may not be in this entry; if it is, it always will be, and if it isn’t, it never will be.

The GET operation can therefore continue to look at entries, confident that it cannot be made incorrect by a simultaneous operation, until it finds the key it wants, finds a zero hash (guaranteed key-miss), or reaches the limit for re-probes (either the key is not in the table, or it is in a table that is being resized into, as will be discussed later).
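As a rough illustration of that loop, here is a simplified GET probe over the illustrative Record array from earlier; tombstones and the resize-related cases are omitted, the names are mine, and the power-of-two table size (allowing the hash to be masked rather than taken modulo) is an assumption:

    using System.Collections.Generic;

    static class Getter
    {
        private const int ReprobeLimit = 8;   // illustrative value, not Ariadne's

        public static bool TryGet<TKey, TValue>(Record<TKey, TValue>[] table, int searchHash,
                                                TKey key, out TValue value)
        {
            int mask = table.Length - 1;              // assumes power-of-two table sizes
            int idx = searchHash & mask;
            for (int probes = 0; probes < ReprobeLimit; probes++)
            {
                int storedHash = table[idx].Hash;
                if (storedHash == 0)
                    break;                            // empty slot: a guaranteed miss in this table
                if (storedHash == searchHash)
                {
                    var pair = table[idx].KeyValue;
                    if (pair == null)
                        break;                        // write in progress: report as not found
                    if (EqualityComparer<TKey>.Default.Equals(pair.Key, key))
                    {
                        value = pair.Value;
                        return true;
                    }
                    // matching hash but a different key: the key may still be further along
                }
                idx = (idx + 1) & mask;               // re-probe the next entry
            }
            value = default(TValue);
            return false;
        }
    }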

The entries encountered by a PUT operation can be one of the following:

  1. Zero hash: The key is not in the hash table, but we can try to put it here; attempt to write the hash to the table using CompareExchange, then examine the result, which will now fit one of the other categories (note that if the CompareExchange fails, it may be due to losing the race to a thread that is attempting to write the same hash code).
  2. A non-matching hash: The key is not in this entry, and never will be. Move to next entry.
  3. Matching hash, null key: Write in progress, probably started by this operation but possibly by another. Either way, try to write the KV to finish the PUT.
  4. A matching hash: The key may or may not be in this entry. If it is, it always will be and we may attempt to change the value. If it isn’t, it never will be and we can move on to the next entry.

The PUT operation can therefore continue to look at entries, confident that while it can be made incorrect by a simultaneous operation, such a case is identifiable and it can recover by continuing to apply the same logic until it writes the value, determines that it does not want to write the value (the PUT operation has variants that insist upon certain states for the value before a write is made), or reaches the limit for re-probes, in which case it will write to the next table (creating it if necessary).
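A similarly compressed sketch of the PUT loop, again with illustrative names; resizing, tombstones and the conditional PUT variants are omitted:

    using System.Collections.Generic;
    using System.Threading;

    static class Putter
    {
        private const int ReprobeLimit = 8;   // illustrative value

        public static bool TryPut<TKey, TValue>(Record<TKey, TValue>[] table, int searchHash,
                                                KV<TKey, TValue> newPair)
        {
            int mask = table.Length - 1;
            int idx = searchHash & mask;
            for (int probes = 0; probes < ReprobeLimit; probes++)
            {
                int storedHash = table[idx].Hash;
                if (storedHash == 0)
                {
                    // Try to claim the empty slot; losing this race is fine if the winner
                    // was writing the same hash, so just carry on with whatever is there now.
                    int previous = Interlocked.CompareExchange(ref table[idx].Hash, searchHash, 0);
                    storedHash = previous == 0 ? searchHash : previous;
                }
                if (storedHash == searchHash)
                {
                    var current = table[idx].KeyValue;
                    if (current == null ||
                        EqualityComparer<TKey>.Default.Equals(current.Key, newPair.Key))
                    {
                        // Either a write is in progress here or the key matches: try to
                        // install our pair; if we lose the race, re-examine this slot.
                        if (ReferenceEquals(
                                Interlocked.CompareExchange(ref table[idx].KeyValue, newPair, current),
                                current))
                            return true;
                        continue;
                    }
                    // matching hash but a different key lives here: fall through and re-probe
                }
                idx = (idx + 1) & mask;
            }
            return false;   // re-probe limit reached: the real code moves to the next table
        }
    }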

PUT operations are allowed to change a KV, but only for another KV with an equivalent key:

[Diagram: a KV replaced by another KV with an equivalent key]

Deletions are made by placing a tombstone KV which marks the key as not having a value (which leaks keys, but this is corrected for on resizing):

[Diagram: a value replaced by a tombstone KV]
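One hedged way of representing that tombstone, continuing the same illustrative sketch (the real library’s representation may well differ), is a KV subclass used purely as a marker, so that a delete is just another CompareExchange on the slot’s KV field:

    using System.Threading;

    // A marker type: a slot holding a TombstoneKV has a known key but no current value.
    internal sealed class TombstoneKV<TKey, TValue> : KV<TKey, TValue> { }

    static class Deleter
    {
        public static bool TryDelete<TKey, TValue>(Record<TKey, TValue>[] table, int idx,
                                                   KV<TKey, TValue> current)
        {
            var dead = new TombstoneKV<TKey, TValue> { Key = current.Key };   // value left at its default
            return ReferenceEquals(
                Interlocked.CompareExchange(ref table[idx].KeyValue, dead, current),
                current);
        }
    }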

In order for resizing to take place without locking, it is necessary that entries be copied to the new table in a way that safely preserves the state of the key being copied, but does not allow the copy to overwrite a later PUT to the same key (as that later PUT has logically superseded the copied value as far as the state of the dictionary as a whole is concerned).

As with Click’s hash table, this happens in a two-stage manner.

Every entry in the table is examined. Any empty entry, or entry for a deleted value, is immediately marked as a dead entry. Note that this includes an entry left in the state {H, ∅} because a PUT got as far as writing its hash code but not as far as setting the value.

Any entry with a value is marked with a primed version of the same KV. Then an attempt is made to write this entry into the new table, but only if an entry with an equivalent key is not present. In either case, the old entry is then marked as dead.

Once a table is completely full of dead entries, we can move to only using its successor, and the previous table can be garbage-collected.

[Diagram: entries being copied into the new table and marked dead]
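A condensed sketch of that two-stage copy, building on the illustrative types from the earlier sketches (including the TombstoneKV marker); the real code batches this work and, as noted below, reuses a single prime object, and PutIfAbsent here is only a stub standing in for a PUT variant that writes solely when no equivalent key is already present:

    using System.Threading;

    // A marker wrapper: a primed KV signals that this slot is in the middle of being copied.
    internal sealed class PrimedKV<TKey, TValue> : KV<TKey, TValue> { }

    static class Copier
    {
        public static void CopySlot<TKey, TValue>(Record<TKey, TValue>[] oldTable, int idx,
                                                  Record<TKey, TValue>[] newTable,
                                                  KV<TKey, TValue> deadMarker)
        {
            var pair = oldTable[idx].KeyValue;
            if (pair == null || pair is TombstoneKV<TKey, TValue>)
            {
                // Empty or deleted (including {H, null} left by an unfinished PUT): kill it outright.
                Interlocked.CompareExchange(ref oldTable[idx].KeyValue, deadMarker, pair);
                return;
            }
            // Stage one: prime the slot so concurrent operations know a copy is in flight.
            var primed = new PrimedKV<TKey, TValue> { Key = pair.Key, Value = pair.Value };
            if (!ReferenceEquals(
                    Interlocked.CompareExchange(ref oldTable[idx].KeyValue, primed, pair),
                    pair))
                return;   // the slot changed under us; a later pass copies the newer state
            // Stage two: write into the new table only if no equivalent key is already there,
            // then mark the old slot dead so that readers and writers move on to the new table.
            PutIfAbsent(newTable, oldTable[idx].Hash, pair);
            Interlocked.CompareExchange(ref oldTable[idx].KeyValue, deadMarker, primed);
        }

        // Stub only: a real implementation probes newTable much as TryPut does above,
        // but refuses to replace an entry whose key is already present.
        private static void PutIfAbsent<TKey, TValue>(Record<TKey, TValue>[] newTable,
                                                      int hash, KV<TKey, TValue> pair)
        {
        }
    }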

All threads that are performing PUT operations will help copy over a small number of entries into a new table, if one exists. It is possible also to have threads performing GET operations do so, but we have decided not to: not so much to optimise for speedy GETs (though it does that too), but so that GET operations never have to allocate as much as a single object, meaning that only PUT operations can possibly result in out-of-memory exceptions, which is likely to be less surprising (though even removing items can do so). As a further optimisation to reduce allocations, a single prime object is used for batches of copying, its values being set appropriately with each use, reducing the number of objects created by enough to have an appreciable effect upon timings.

Dead records increase the number of states a thread can encounter. A dead record with a hash of zero means that the key of interest is not in the current table, though it could be in the next. Any other dead record says that there is a next table that the key of interest could be in, but does not rule out that it could be in a subsequent record in the current table.

Any operation — whether GET or PUT — that finds that the entry it is interested in is primed, will attempt to carry out the task of copying it to the new table and marking the prime as dead, before then continuing its operation upon the new table. Hence even in the case of a thread being long delayed (or even aborted) half-way through the copying operation, the correct value will always be seen by any thread interested in that entry.

Turned up to 11

The dictionary as a whole is highly engineered towards performance in terms of time. This is, after all, clearly a class designed to be used in situations where every cycle counts. Performance improvements that would be hard to justify in other cases, such as manual tail-call optimisation using goto,[2] are brought to bear on squeezing more rapid responses out of the class. Such work is concentrated in the “core” of the class where the most frequently hit code-paths are, rather than in the “leaf” methods, and all of it must prove itself by delivering real improvements in tests. While we concentrate here on the implementation of the above algorithm, all aspects of the class are examined for performance. Notably, maintaining a record of the dictionary’s size can be a bottleneck in a way that might surprise some, as will be examined in a later article.
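For those unfamiliar with the technique, the toy example below (nothing to do with the dictionary itself) shows the general shape of such a manual tail call: a recursive call in tail position is replaced by reassigning the parameters and jumping back to the start of the method:

    class Node
    {
        public int Key;
        public int Value;
        public Node Next;
    }

    static class TailCall
    {
        // The recursive form would end with: return Find(node.Next, key);
        // here that call becomes an assignment and a goto, so no stack frame per step.
        public static int Find(Node node, int key)
        {
        restart:
            if (node == null)
                return -1;                 // not found
            if (node.Key == key)
                return node.Value;
            node = node.Next;              // what would have been the recursive argument
            goto restart;                  // manual tail call
        }
    }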

An advantage of such performance work is that it actually decreases “premature optimisation”. While our aggregation process brings many more techniques to bear on making it perform as rapidly as possible, in less time-critical areas being able to depend upon the performance of library code allows for clearer and more maintainable code, as there is nothing to gain by replacing clear and simple approaches with those that are more complex.

Comparison with Click’s Approach

The differences between this algorithm and Click’s relate entirely to the differences in the frameworks used. .NET and Mono have the advantage of allowing one to place objects and integral values adjacent in memory so that they will fill the same cache-lines (indeed, Click describes this as ideal but “hard to do in Java”). .NET’s and Mono’s boxing is different and required for more types, while also often being less visible (were one to write a more direct port of Click’s hash table, as indeed I have done, then with value types for keys and/or values it would engage in a lot of hidden boxing and unboxing behind the scenes). Differences in the semantics of compare-and-swap operations mean that Java has the advantage (on some, but not all, JVMs) of allowing write-reorderings that could improve performance, while .NET and Mono have the advantage that, in disallowing write reorderings of CAS operations, the order of write-to-hash followed by write-to-key is guaranteed without any other explicit memory barriers. I therefore don’t offer this as an improvement upon Click’s work, but rather as an adaptation more suitable for the .NET and Mono frameworks.

Comparison with ConcurrentDictionary

Since .NET now provides a thread-safe dictionary as part of the framework, it is necessary to consider whether it out-performs our ThreadSafeDictionary and whether we should abandon ours in favour of what is provided to us for free. Pride should never get in the way of making the best engineering choice.

Of course, this is a moving target: both will hopefully improve, though either could suffer from having to fix the handling of a currently unconsidered edge case, or from adding further functionality. Currently, though, tests suggest the following:

In the case of low-contention, where there is normally a single thread, but the possibility of concurrent access must be guarded against, a normal dictionary protected by a lock beats both — these classes are only appropriate tools when there is real contention.

Under contention, ConcurrentDictionary currently performs better in cases where there are a lot of insertions and deletions that balance out to a relatively stable size. ThreadSafeDictionary, however, out-performs it in cases where the dictionaries are growing, and considerably so in cases where the same key is being hit by concurrent threads — two conditions that occur in the case we were originally interested in, and in many others.

Update: With recent changes, ThreadSafeDictionary now outperforms ConcurrentDictionary in a wider range of cases.

  1. In any case, situations where one can guarantee unique hash codes over a small, known set of keys are well served by completely different approaches to thread-safe dictionaries.
  2. Anyone horrified by the use of goto should remember that when Knuth famously said that “premature optimisation is the root of all evil”, it was in the context of also defending the use of goto to increase performance. [Knuth, D. E. 1974, “Structured programming with go to statements”, Computing Surveys, Vol. 6, № 4. Upper Saddle River, NJ, USA: Yourdon Press. p. 268]