Ayende @ Rahien

Jun 19 2026

"Optimizing" concurrent regexes

time to read 6 min | 1131 words

Tags:

I am looking into some regex work, and I ran into a performance problem. I need to run a particular regex over a large number (millions) of strings. That caused my spidey sense to… tingle.

The code in question looked something like this:

long matches = CountMatchingEntries(new Regex(@"\s+user id\s+"));

Creating a regex for each invocation is… expensive. That is why we have the RegexOptions.Compiledflag, after all. And the Regex class is thread-safe, so I did the equivalent of this code:

// class level
static Regex s_regex = new Regex(@"\s+user id\s+", RegexOptions.Compiled);
// inside a method
long matches = CountMatchingEntries(s_regex);

The performance of the system immediately took a big, stinking performance regression all over my benchmarks. At first, I was sure that I wasn’t doing something properly, so I re-wrote this using the modern approach, with source generators, like this:

public partial class MyClass
{


    [GeneratedRegex(@"\s+user id\s+", 
        RegexOptions.Compiled | RegexOptions.CultureInvariant |
        RegexOptions.IgnoreCase)]
     public static partial Regex UserIdRegex();
}

That had the exact same behavior, a major performance regression.

We are talking about this making the code six times slower. That is a huge cost for something that every fiber in my being tells me should be faster. As I started digging into things, I managed to reproduce this in an isolated manner.

The following benchmark shows the core of the problem:

using System.Diagnostics;
using System.Text.RegularExpressions;


var lines = Enumerable.Range(0, 100_000).Select(i => $"The user id #{i}").ToArray();


var regex = new Regex(@"\s+user id\s+", RegexOptions.Compiled | RegexOptions.CultureInvariant | RegexOptions.IgnoreCase);
// This is fast
Exec(() => lines.Count(l => regex.IsMatch(l)));
// This is slow
Exec(() => lines.AsParallel().Count(l => regex.IsMatch(l)));


static void Exec(Func<int> run)
{
    var before = GC.GetTotalAllocatedBytes();
    var sw = Stopwatch.StartNew();
    //var count = lines.AsParallel().Count(l => MyClass.UserIdRegex().IsMatch(l));
    var count = run();
    sw.Stop();
    var after = GC.GetTotalAllocatedBytes();
    Console.WriteLine($"Count: {count}, Time: {sw.ElapsedMilliseconds} ms, Memory: {after - before} bytes");
}

I create the sameRegex instance and run it 100,000 times. The first time, I do that using a single thread, and the second time I’m using multiple threads.

The only difference between those runs is the addition of AsParallel() to force it to use multiple threads. My code isn’t actually using this or explicit threads, but it is using a single cached Regex instance in a web environment, so under load, it is being used concurrently.

What is going on here? The parallel code is much slower because it allocates. It turns out that deep in the guts of .NET’s Regex engine we have this block of code:

RegexRunner runner = Interlocked.Exchange(ref _runner, null) ?? CreateRunner();
try
{
    // do work
}
finally
{
   _runner = runner;
}

In other words, the actual RegexRunner needs to keep some mutable state, which doesn’t play nicely with threads. In order to maintain itself under concurrent usage, the Regex “checks out” the instance when it is being used, so it can be the sole owner.

Other callers on the same instance at the same time will allocate a new runner instance. That is what is causing the massive slowdown. If we are using the Regex instance in a single-threaded manner, the code will check it in & out as needed, with zero allocations and quite fast.

If you are using that from concurrent code, you’ll allocate like crazy and can expect your performance to drop by as much as five times.

I created an issue for that, since I believe this is quite a tripping hazard in terms of performance.

The current fix that I found was to use ThreadLocal<Regex> for this, ensuring that there is no actual concurrent usage, at the cost of higher memory usage and repeated initializations.

Non obvious optimization with divide and conquer

time to read 8 min | 1546 words

0 comments

Tags:

Deep in the heart of Corax (RavenDB’s querying engine), everything deals with something called a Posting List. Posting Lists are a way for the engine to say “all of those documents have the term Fast for the field Speed”. Conceptually, a Posting List is an ordered set of document ids. In this case (and this is important, the ids in question are numeric, not RavenDB’s document id, which is a string).

An interesting problem with Posting Lists is that a term can be unique (such as a GUID) - only a single document will ever have this value. They can have a small set of values (for example, CreationDate, where only the items created on that day will share the same value). Or they can have many values (for example, Status = ‘Shipped’ for Orders).

The reason those details matter is that we can deal with the three (very) different modes in a distinct manner to reduce the cost of storage that Corax uses for Posting Lists. Internally, Corax has the notion of a PostingListId, which is just a 64-bit number. We use the least significant 2 bits to tell us what kind of Posting List this is.

Typical code to consume this looks like this:

while ((read = provider.FillPostingListIds(plIds)) > 0)
{
    for (int i = 0; i < read; i++)
    {
        var postingListId = plIds[i];
        var termType = (TermIdMask)postingListId & TermIdMask.EnsureIsSingleMask;


        switch (termType)
        {
            case TermIdMask.Single:
            {
                // Accumulate the single decoded entry ID into the entryBuffer
                // Flush to bitmap if the buffer is full
                break;
            }


            case TermIdMask.SmallPostingList:
            {
                // Decode inline via FastPForBufferedReader into the entryBuffer
                // Flush to bitmap if the buffer is full
                break;
            }


            case TermIdMask.PostingList:
            {
                // Flush any accumulated entries from the buffer first
                // Iterate through the full large posting list via FillFromPostings
                break;
            }


            default:
                throw new InvalidOperationException($"Unknown TermIdMask type: {termType}");
        }
    }
}

The code here iterates through a list of Posting List ids, handling each one of the options. This is part of handling queries such as: where CreatedAt > $lastQuarter. We have to get all the distinct dates in the range, and for each date, we need to get all the matching documents for it.

The code here is pretty simple, right? It is fairly obvious what it does, but it has a pretty big problem. It processes each one of the Posting Lists manually & separately. There are two problems with this approach:

We end up doing a lot of branching, based on which Posting List type we are processing.
We lose the ability to handle the Posting Lists in bulk.

Because we use the lowest two bits to store the type of the Posting List, we can do better. The insight behind the new approach is that (id & EnsureIsSingleMask) naturally yields 0, 1, or 2 — a perfect index into an array of buckets. Instead of a switch statement, we partition the batch in one pass with no per-item branches:

var buckets = new List<long>[3];
while ((read = provider.FillPostingListIds(plIds)) > 0)
{
   for (int i = 0; i < read; i++)
   {
       buckets[(int)(plIds[i] & 0b11)].AddUnsafe(pid);
   }
}

Of course, you’ll note that we aren’t actually processing those, just putting them in buckets. The fact that we are able to split them into groups in a branchless manner is a fun optimization task, but it doesn’t matter as much as the next stage… we can now deal with a list of Posting Lists that are already divided by type.

For example, for the list holding only a single unique value, I can run the following:

var singlesSpan = CollectionMarshal.AsSpan(buckets[0]);
EntryIdEncodings.DecodeAndDiscardFrequency(singlesSpan, singlesSpan.Length);
var singlesLen = Sorting.SortAndRemoveDuplicates(singlesSpan);
bitmap.AddRange(singlesSpan[..singlesLen]);

This is great, because now I can process the entire list using SIMD in a highly efficient manner. But things get better when we look at the other lists. In the case of the small Posting Lists, for example, the ids that I’m reading are not the actual values, but point to where the values actually reside.

This gives me the chance to actually load them from disk in an optimal, batched manner, like so:

var smallsSpan = buckets[1].ToSpan();
EntryIdEncodings.DecodeAndDiscardFrequency(smallsSpan, smallsSpan.Length);
var smallLen = Sorting.SortAndRemoveDuplicates(smallsSpan);
Container.GetAll(llt, smallsSpan[..smallLen], containerItems.ToSpan(), long.MinValue, pageLocator);


for (int i = 0; i < smallLen; i++)
{
   var item = containerItems[i];
   // read the small posting list and deal with it
}

The key here is the Container.GetAll() call, which takes the list of unique Postings List locations and loads them in an optimized manner. In the previous code style, that was actually pretty hard to do. In this manner, it falls naturally from the way we write the code.

There are also advantages to the fact that we run tight loops with the same code, instead of branching for each type. We also ended up with less code overall, which is also nice.

Oren Eini

Oren Eini

CEO of RavenDB

"Optimizing" concurrent regexes

Non obvious optimization with divide and conquer

FUTURE POSTS

RECENT SERIES

RECENT COMMENTS

Syndication

Main feed
Comments feed