## Improving Linq2Objects—An efficiency case-study.

When Microsoft open-sourced much of the .NET Framework Class Library as part of .NET Core it was to me as if I was a car-obsessed kid who’d been chauffeured around in a luxury sedan for years when suddenly the driver popped the bonnet and let me tinker with the engine.

One area I made several contributions to, was the efficiency of the `System.Linq.Enumerable`

class, which provides Linq’s `IEnumerable`

-based functionality, along with its helper-classes.

Since these were public changes to public code, a look at some of that work gives an opportunity to explore some of the things that can be done to improve efficiency without being indiscrete about a client’s codebase.

# The Importance of Measuring

Work on efficiency can be both among the most and least enjoyable programming work one can do. On the one hand simple pleasure in shaving off seconds and tuning a component to be as tight as possible is one that has an analogy in almost every craft—every craftsperson wants to produce something that little bit better than before—but on the other it is one of the hardest areas to make any sort of promise about—one is starting with working code that was intended to be at least reasonably efficient the first time, so there’s rarely a guarantee one can go much further.

Another thorn in the side of anyone trying to make such improvements is that one can rarely know *a priori* what will actually help. While some cases will be very obviously suboptimal, there’s generally enough complexity to any code worth making more performant that any given change can end up making no difference or indeed making things worse. Certainly some attempts at improvement got no further than my own machine, after being found to be of negligible value or even detrimental.

It’s also important to consider the difference between *optimisation* and *improvement*. Often *optimisation* is used to refer to any changes that makes code faster, less memory-hungry, or both. Strictly though optimisation is making code better for a particular case. Sometimes that is appropriate; with a website for example we can decide which approach to a given problem will be more efficient for that particular site’s needs, and perhaps change the approach as usage patterns or available hardware changes.

With a library that will be used in multiple contexts what we actually want is the very opposite of *optimisation* in that we want to make general improvements that covers as many possible uses as feasible. This impacts on what must be measured; when optimising we care about the real-world performance of a single application and must measure something that simulates that as closely as possible (ultimately testing with that real-world use), but when improving libraries we must try to cover many different patterns of use and when something benefits one case at a cost to another judge the overall impact of that conflict.

Sometimes improving efficiency can be as simple as trying two roughly-equivalent approaches, measuring each and then going with the quicker. This is one of the more boring things to do, if often necessary, so it will likewise be boring to read about. Let’s look at some other ways we can improve performance.

# Working Smarter not Harder

One of the simplest ways to improve the performance of anything is to do less. If we can simply remove a bit of work then we save the time spent doing that. Sometimes there are some really easy to find redundancies that are a by-product of the incremental development of a component, and cutting them out both simplifies the code for future work as well as removing busy-work that the compiler might not have realised it could cut out. (The compiler does sometimes remove redundancies for you). This tends not to make a noticeable difference though; a single saving a few nanoseconds won’t change much, and really big wastes are going to be rare. A more common case is when we have needed work that is done in a loop but can be moved out of it. Consider:

`public static int Min(this IEnumerable<int> source)`

{

if (source == null) throw Error.ArgumentNull("source");

int value = 0;

bool hasValue = false;

foreach (int x in source)

{

if (hasValue)

{

if (x < value) value = x;

}

else

{

value = x;

hasValue = true;

}

}

if (hasValue) return value;

throw Error.NoElements();

}

This code finds the smallest integer in a sequence. On given input it must throw an exception if the input is null, a different exception if the sequence is empty, and otherwise the lowest element in the sequence.

Every time it goes through the loop it first checks if a value had already been found. If it hadn’t been found it records that one now has, and that becomes the current value, otherwise it checks if the new value is lower than the current, replacing it if it is.

While branch-prediction will help reduce the cost of that check, we’re still checking if we’ve already found a value for ever single element. Compare this to what you would do with a bag of scrabble tiles looking for the tile that was first in alphabetical order: would you ask yourself every time you looked a new tile, “do I already have a tile in my hand?” No, you are going to find out whether the bag is empty or not the first time you try to get a tile from it, and then continue accordingly.

The pattern:

`foreach (int x in source)`

{

DoSomething(x);

}

Is a shorthand for:

`using (var e = source.GetEnumerator())`

{

while(e.MoveNext())

{

int x = e.Current;

DoSomething(x);

}

}

And by breaking the loop out to this we can consider things that are only of concern for the first element just once at the very beginning:

`public static int Min(this IEnumerable<int> source)`

{

if (source == null) throw Error.ArgumentNull("source");

int value;

using (IEnumerator<int> e = source.GetEnumerator())

{

if (!e.MoveNext()) throw Error.NoElements();

value = e.Current;

while (e.MoveNext())

{

int x = e.Current;

if (x < value) value = x;

}

}

return value;

}

Here we’ve moved away from one of the more idiomatic C# structures, which is a downside (be as idiomatic as possible as much as possible), but in doing so we’ve gained in not having to check for the case of an empty sequence every time we get a new element. If the sequence is empty we’ll discover that, and throw an exception, on the first attempt to get an item. From that point on we know we have one.

On testing, this actually made the case where the sequence was empty slightly slower, but only by less than 2% and this both a rare case and one that is relatively fast anyway. Other cases were improved by somewhere between 1% and 11% depending on the input sequence, with randomised inputs (likely the best match to real-word data) coming in at around 5% improvement.

This is modest but, considering that this method may itself be called multiple times, worth the effort.

Let’s consider a slight variant. Starting with the code:

`public static int? Min(this IEnumerable<int?> source)`

{

if (source == null) throw Error.ArgumentNull("source");

int? value = null;

foreach (int? x in source)

{

if (value == null || x < value)

value = x;

}

return value;

}

This is akin to the previous, but here we have nullable integers. Unlike the previous example if the sequence is empty, or if all elements are null, we return null, otherwise we return the lowest element that isn’t null. Here we aren’t checking for an empty sequence every pass, but rather we are checking if the current value is null, either because it was empty or because we have only encountered null values so far.

There’s also a hidden null-check, the comparison `x < value`

is equivalent to `x != null && value != null && x.GetValueOrDefault() < y.GetValueOrDefault()`

so what looks like one comparison is actually three, one of which (`value != null`

) we already know the answer to.

Alternatively:

`public static int? Min(this IEnumerable<int?> source)`

{

if (source == null) throw Error.ArgumentNull("source");

int? value = null;

using (IEnumerator<int?> e = source.GetEnumerator())

{

do

{

if (!e.MoveNext()) return value;

value = e.Current;

} while (!value.HasValue);

int valueVal = value.Value;

while (e.MoveNext())

{

int? cur = e.Current;

int x = cur.GetValueOrDefault();

if (cur.HasValue && x < valueVal)

{

valueVal = x;

value = cur;

}

}

}

return value;

}

At first glance this is a lot more complicated, but mostly that is because we’ve made some hidden things visible and then in fact simplified them.

The biggest change is that there are now two partial loops. First we iterate through until we find a non-null value. If that never happens (including the empty sequence case) we can return null immediately. This is both faster for the all-null case (we never attempt the more complex comparison until we’ve a reason to) and for the empty case (a good 11% faster) but also reduces what we need to do from then on.

If we have found a value then we finish the loop with different logic, this time caring only that the new value found isn’t null and whether its less than the current value; the hidden check on the current value not being null has been removed.

There are two further improvements we can make:

`public static int? Min(this IEnumerable<int?> source)`

{

` if (source == null) throw Error.ArgumentNull("source");`

int? value = null;

using (IEnumerator<int?> e = source.GetEnumerator())

{

do

{

if (!e.MoveNext()) return value;

value = e.Current;

} while (!value.HasValue);

int valueVal = value.GetValueOrDefault();

while (e.MoveNext())

{

int? cur = e.Current;

int x = cur.GetValueOrDefault();

if (cur.HasValue & x < valueVal)

{

valueVal = x;

value = cur;

}

}

}

return value;

}

Here instead of `int valueVal = value.Value`

we have `int valueVal = value.GetValueOrDefault()`

. Logically what we want is the `Value`

property, but that has another hidden null-check. Since we know `GetValueOrDefault()`

will have the same result, and doesn’t have that hidden check this is a micro-optimisation. It only shaves off a few nanoseconds, but it’s easy to do.

The more interesting case is that we’ve changed `if (cur.HasValue && x < valueVal)`

to `if (cur.HasValue & x < valueVal)`

. The difference here is subtle. The first is equivalent to:

`if (cur.HasValue)`

{

if (x < valueVal)

{

…

The second is equivalent to:

`bool temp0 = cur.HasValue;`

bool temp1 = valueVal;

if (temp0 & temp1)

That is to say, the first short-circuits in not examining `x < valueVal`

unless it knows it needs to, the second always examines it. This might seem to be a violation of our principle of trying to have the computer do less work, and indeed it is. The difference comes in with branch cost. Modern computers require more time to branch than they do to perform simple integer comparisons. While branch prediction may help, there’s always a chance of it predicting the wrong branch. As such the saving of not having to do the `x < valueVal`

if unnecessary is less than the cost of the branch. While it remains that `&&`

will be more efficient than `&`

if the operation on the right and side is expensive (and mandatory if the operation on the left hand side determines it is impossible), when the operation on the right is cheap then `&`

wins. Unfortunately because branch-prediction does its best to help here, measuring whether this was the right thing to do or not requires testing with different sets of data, as it will help with some more than with others.

In total these changes always gave at least a slight improvement, and in some cases improved performance by over 30%.

# Quit When You’re Ahead

Doing less is great, but not as good as doing nothing at all. Let’s consider the following:

`public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)`

{

if (source == null) throw Error.ArgumentNull("source");

if (predicate == null) throw Error.ArgumentNull("predicate");

TSource result = default(TSource);

long count = 0;

foreach (TSource element in source)

{

if (predicate(element))

{

result = element;

checked { count++; }

}

}

switch (count)

{

case 0: throw Error.NoMatch();

case 1: return result;

}

throw Error.MoreThanOneMatch();

}

This method has to find the sole element in a sequence that matches some predicate. If none match, that’s an error and if two or more match, that’s also an error. However, note that it does so by keeping count of how many matches it finds. By the time we’ve found the second though, we know what the answer is going to be:

In comparison:

`public static TSource Single<TSource>(this IEnumerable<TSource> source, Func<TSource, bool> predicate)`

{

if (source == null) throw Error.ArgumentNull("source");

if (predicate == null) throw Error.ArgumentNull("predicate");

using (IEnumerator<TSource> e = source.GetEnumerator())

{

while (e.MoveNext())

{

TSource result = e.Current;

if (predicate(result))

{

while (e.MoveNext())

{

if (predicate(e.Current)) throw Error.MoreThanOneMatch();

}

return result;

}

}

}

throw Error.NoMatch();

}

Again the loop is split. First we loop to find a match. If found we loop again from that point to find a second match and throw an exception if it happens. The change in logic from “count them up and check the count is one” to “find a match and then make sure there isn’t another” not only makes a massive difference in the exceptional case, but again means there’s less work for the successful case.

This change took longer to be accepted by the project owners than the first mentioned, because of a very important difference. That this short-circuits is in and of itself a clear win (though the change to the successful case still had to be measured, to be sure), but unlike the previously described changes, this wasn’t purely a change in performance.

It also slightly changed behaviour. If someone had code that somehow depended on a sequence being exhausted by `Single()`

even when it resulted in an exception, then that code would be broken by this change. An improvement to speed isn’t a boon if suddenly applications stop working correctly. Here, it was ultimately decided that the case affected was too obscure to likely cause many problems, but the general principle is important; efficiency always comes second to correctness, and even correctness sometimes has to come second to not breaking what is currently depended upon, even if it shouldn’t have been.

# What Costs and What’s Free

Considering the short-circuiting here, it would be possible to do the same in the `Min()`

earlier. After all if a value of -2147483648 is ever encountered we can return it immediately, since no lower value is possible. This though would require testing for that value in every pass which would optimise that case at the cost of slowing down every sequence that uses it. Even though a handful of cases could become magnitudes faster, it wouldn’t be worth slowing down the vast majority of cases to benefit the minority.

On the other hand consider:

`public static double Min(this IEnumerable<double> source)`

{

` if (source == null) throw Error.ArgumentNull("source");`

double value = 0;

bool hasValue = false;

foreach (double x in source)

{

if (hasValue)

{

if (x < value || System.Double.IsNaN(x)) value = x;

}

else

{

value = x;

hasValue = true;

}

}

if (hasValue) return value;

throw Error.NoElements();

}

This is another example of the first method we looked at, but for `double`

values, and there’s an important difference. Generally with floating-point values we consider NaN (“not a number” values, such as the result of dividing zero by zero) as neither less or greater than any other number, including itself. That makes sense mathematically (as far as such values do make sense mathematically, anyway) but is no good for ordering something to consider something the minimum or maximum value. For that we need a “total ordering” (everything has a place) and the rule used is that NaN is lower than everything, even negative infinity.

To enforce that rule, we need to explicitly test for it. And since we need to test for it to be correct, then there is no cost in using that test to short-circuit. So we include that when doing similar improvements as the first we examined:

`public static double Min(this IEnumerable<double> source)`

{

if (source == null) throw Error.ArgumentNull("source");

double value;

using (IEnumerator<double> e = source.GetEnumerator())

{

if (!e.MoveNext()) throw Error.NoElements();

value = e.Current;

while(e.MoveNext())

{

double x = e.Current;

if (x < value) value = x;

else if (double.IsNaN(x)) return x;

}

}

return value;

}

It doesn’t hurt that we test for `double.IsNaN(x)`

because we had to do that anyway. As far as performance improvements go, the test is “free”. Note that if the first value is NaN we don’t short-circuit. This is a trade-off; while this penalises the case where a sequence has NaN first and then doesn’t have another NaN for a long time (if at all), covering it would require another explicit test. Since we still get the correct behaviour without test it’s better to do what’s fastest in the most common case.

In between the obviously bad idea of short-circuiting on -2147483648 and the zero-cost idea of short-circuiting on NaN, there’s this:

`public static int? Max(this IEnumerable<int?> source)`

{

if (source == null) throw Error.ArgumentNull("source");

int? value = null;

using (IEnumerator<int?> e = source.GetEnumerator())

{

do

{

if (!e.MoveNext()) return value;

value = e.Current;

} while (!value.HasValue);

int valueVal = value.GetValueOrDefault();

if (valueVal >= 0)

{

while (e.MoveNext())

{

int? cur = e.Current;

int x = cur.GetValueOrDefault();

if (x > valueVal)

{

valueVal = x;

value = cur;

}

}

}

else

{

while (e.MoveNext())

{

int? cur = e.Current;

int x = cur.GetValueOrDefault();

if (cur.HasValue & x > valueVal)

{

valueVal = x;

value = cur;

}

}

}

}

return value;

}

Note that path followed if the first non-null value found is zero or higher. In this case the value found is never tested for nullity, because the result of `GetValueOrDefault()`

will in that case be zero and hence not affect the result.

This adds some extra cost up front (so much that empty and single-item sequences are actually slower now, but they are never very expensive anyway), but when hit allows the rest of the work to gain an extra boost. We could switch to this logic part-way through, but again that would add extra cost to the cases not covered. It’s a balancing act. In particular, this is only done with nullable integers and nullable long integers, not with any other type. Why? Because positive-only sequences of integral numbers is a very common pattern in real-life data. We want to improve not just what could in theory be faster, but what will likely come up a lot in real-world use.

# Algorithmic Change and Identifying Patterns

All of the above has so far tweaked existing code without really changing the basic approach. It’s also only considered single methods in isolation. For a more complicated, but more powerful, performance improvement, let’s look at `OrderBy()`

. This method (along with `OrderByDescending()`

, `ThenBy()`

and `ThenByDescending()`

) returns a wrapper on a sequence that will return that sequence ordered according to some criteria.

When an operation is done on it, that object:

- Loads all elements into a buffer.
- Does a stable quicksort on the elements based on the criteria.
- Enumerates the elements in order.

Step 1 takes O(n) space (that is memory proportional to the number of elements) and O(n) time and step 2 then takes O(n log n) time (that is, time proportional to the number of elements multiplied by the logarithm of ifself, so if e.g. 8 elements took 24ns then 16 would take 64ns).

This is about as good as one can do, especially since it promises a stable sort (two equal objects won’t be swapped around) though some tweaks could be (and were) made.

But lets consider some frequent combinations. One of these is with `First()`

; we might order something to get a single element by a given order with `….OrderBy(…).First()`

.

Logically, this is a variant of `Min()`

, and so the object was treated to be a special-case for `First()`

. When done the logic now becomes:

- Examine each element based on the order criteria, keeping track of the “lowest” found.
- Return that element.

This is O(1) in space (same amount of memory, no matter how large the sequence) and O(n) in time (time proportional to the number of elements, so if 8 elements took 24ns then 16 would take 48ns) as well as just plain faster even for a single item!

Because we can change the algorithm used, we don’t just shave off a small, or even a large, percentage of time taken, we change the time taken by many magnitudes, and with greater savings the larger the input. Single-item sequences are more than twice as fast, but sequences with 10,000 items are over 16 times as fast, and larger sequences are improved even more.

There’s extra cost in detecting this as a special case for `First()`

, but finding other cases where it could be improved and having the relevant classes use the same interface allows for several such cases to be covered in a single test.

Another common pattern is to follow an ordering with `Skip()`

and `Take()`

so as to get a page of results: `….OrderBy(someCriteria).Skip((pageNumber - 1) * pageSize).Take(pageSize)`

.

This also offers us a chance for a significant improvement. The quicksort algorithm works as follows. Say we have the following items:

3 2 8 4 1 9 5 7 6

Pick one of these and move elements around so that all lower elements are before it, all greater elements are after. We’ll take the first item for example, which is 3 (pivot-choice has some deep considerations that also affect performance, but we’ll just pick the first item each time in this post for simpicity):

2 1 3 4 8 9 5 7 6

The value 3 now in its correct place we have two still-unsorted sub-arrays to each side of it. Sorting that to the left gives us:

1 2 3 4 8 9 5 7 6

Now let’s consider just what we have to the right:

4 8 9 5 7 6

Picking 4 to pivot we don’t move it, and there’s nothing to the left, so we’re now sorting:

8 9 5 7 6.

Picking 8 to pivot we end up with:

5 7 6 8 9.

The 9 is obviously also in the right place so we then need to sort the 5 7 6. Pivoting on 5 does nothing, but sorting the remaining two 7 6 gives us 6 7 and the whole sequence is now:

1 2 3 4 5 6 7 8 9

Note that the entire number of pivoting operations was six (seven if we include considering that the 9 couldn't be wrong). Let’s consider if we only want to get the 4th and 5th elements. That is those that would be in the shaded area when sorted:

3 2 8 4 1 9 5 7 6

After the first sort we have:

2 1 3 4 8 9 5 7 6

But we know that the elements before the 3 can’t end up in the area we care about. Sorting them is busywork we don’t need to do, so we skip it. Again pivoting on the 4 does nothing so then we pivot on the 8:

2 1 3 4 5 7 6 8 9

Pivoting on the 5 again does nothing, but we now know it’s in the correct position. Since we’ve sorted everything that’s going to end up in the area we care about, we’re done. The sequence as a whole is still out of order, but we’re not going to use those elements, so we don’t care. This time we only did four pivot operations. In general if we want `k`

elements from a sequence then while sorting the whole sequence will take O(n log n) time, finding the sub-sequence we care about will take O(n + k log k) which will be much smaller, especially when `k`

is small compared to `n`

. (If we’re really unlucky it can end up taking as much as O(n²) but that’s true of the previous approach too).

And so this is precisely what is done. Calling `Skip()`

or `Take()`

on the results of an `OrderBy()`

returns an object that not only keeps track of how ordering should be done, but on the sub-sequence we care about. Calling `Skip()`

or `Take()`

on that object returns another such object with the sub-sequence limits changed. When we finally operate on the sequence produced the partial quicksort described above is used to get the results in a fraction of the time. A similar technique allows a single element to be selected by index in O(n) time.

# Takeaway

I did plenty of other work on improving the performance of `Enumerable`

, and other contributors covered a lot of other cases and continue to do so, but examining most of these further would be mean concentrating on specific to details of `Enumerable`

and/or .NET collections, rather than general efficiency work. Instead I’m just going to reinterate the points covered here:

- Always measure performance improvements to be sure they really are improvements and are worth any decrease in readability.
- Always try to make measurements match real-world conditions as much as possible. When applicable, have measurements cover several different scenarios.
- Look for ways the code can do the same with less.
- Look for places the code can quit early.
- Consider what your improvement for one case may cost in another case, and whether it might mean an overall deterioration.
- Consider what costs you have to pay anyway, and what you might do with them.
- It’s not a real improvement if it doesn’t work!
- It’s not a real improvement if it breaks code using it, even if they shouldn’t have used it that way. Avoid such changes unless you can fix all the poor uses of it (it's private code only you our your team uses).
- Only fast-path a case that one can process faster if that case will indeed come up often. Otherwise the cost of testing whether one can fast-path will be greater than the gain made.
- The biggest gains can come not by doing the same thing better, but by taking an entirely new approach to something.
- Look for common use cases and combinations of operations, and see if they offer a chance to make an improvement.