Thread Safety for Performance

Safe But Not Sorry: Thread Safety for Performance

By Kirk J Krauss

There are toolkits for developing fast and reliable multithreaded applications. Whether or not you use them, some thread safety basics are always good to keep in mind. Thread safety is not, in itself, a performance matter. But when you’re using threads to distribute your computing workflow across processors, you get the best results if you make thread safety a priority.

Fighting Starvation

If software is ever going to help solve world hunger, it might best begin at home by making sure every thread gets the processor time it needs. Thread starvation can be one of the most insidious causes of poor performance. For example, a thread whose task is to clean up outdated entries from a skip list might allow the number of entries to grow out of control, in case it doesn’t get to run enough to cover all the entries.

Imagine that threads with relatively high priorities are tasked with adding, updating, and searching a skip list of time-stamped entries, such as the skip list that’s partially shown in Fig. 1. Imagine, too, that a thread with a relatively low priority is designated to delete any outdated entries. The low-priority thread may find those entries only by walking through them all. When the higher-priority threads get busy enough to keep the low-priority thread from running, the skip list becomes free to grow unsustainably as its outdated entries accumulate. If the system begins swapping pages to accommodate the growing skip list, then the higher-priority threads may become busier than ever. Their attempt to handle an ongoing workload on an increasingly overtaxed system may eventually fail, simply because the low-priority thread remains starved.

A situation like this need not involve a low-priority thread. Sometimes lock contention, even among threads of similar priority, can allow one thread to run more often than another. Whenever that happens, the effects of thread starvation can appear. However these effects come about, they often become really harmful only after a long time, and possibly just in a situation that’s intermittent and difficult to recreate.

Debugging a situation like this can be tricky, not only because a debugger doesn’t indicate thread starvation directly, but also because the ill effects of starvation that occur in the field can be challenging to elicit in a debugging scenario. A program whose memory footprint grows out of control is effectively suffering from a kind of memory leak — albeit not the kind of classic leak that involves a lost memory reference or a programmer’s forgetting to code a needed cleanup step. In just about any condition where a leak is suspected, it can be helpful to make arrangements to track down culprits — which can mean tracking how memory gets allocated in the first place.

Finding Unnecessary Heap Memory

To find out how much heap memory gets allocated for a particular data structure, one approach is to track allocated memory blocks that hold instances of it. The code in Listing 1 includes a block tracking structure with declarations for methods (implemented in Listing 2) that can be invoked when a block is allocated or deallocated along with a method for getting a total allocated memory amount.

Listing One

#include "windows.h"
#include "stdafx.h"
#include "stdio.h"

// A data structure can be used for tracking memory allocated by a custom 
// allocator.  All allocations can be tracked, or the tracking can be done 
// just for certain crucial data structures, such as skip list entries, 
// which can store a reference to its own allocator.  To keep tracked 
// allocations current, a matching custom deallocator can look up the 
// relevant Caller structure to account for the deallocation.
//
// A std::map or other standard container can be used for the same purpose 
// and would be declared something like...
//
// std::map‹int, int› callers;
//
// It's possible to store an entire call chain here, or in a related 
// structure, instead of just a caller address.  On Windows, the system 
// can collect a call chain, which can be stored in a separate structure, 
// and identify it with a hash that can be stored in lieu of the address 
// here.
//
// Also, on some systems, there's no simple conversion between an int and an 
// address.  In that case, a pointer to void, or some other numeric data 
// type, can be used to represent the address.
//
struct Caller {
    int     address;
    size_t  size;
};

// Just for our example...
#define MAX_CALLERS 0x10
#define SKIP_LIST_ALLOC_THRESHOLD 0x10000

struct memTracker {
    Caller  callers[MAX_CALLERS];  // A simple array of Caller structures.
    bool    recordAlloc(int address, size_t size);
    bool    recordDealloc(int address, size_t size);
    size_t  getAllocSize(int address);
};

static memTracker MemTracker;
// The object defined above is used in the code below.

This memory tracking code can be called for blocks that represent a scalable data structure, such as the partial skip list implementation of Listing 2, or for a module as a whole. Either way, a custom allocator and deallocator pair, such as the overloaded new and delete operators of Listing 2, can hook the memory tracking code into the necessary allocation and deallocation pathways. To keep memory tracking thread safe in an implementation like this, a lock that protects the skip list also can protect the memory tracking data.

Listing Two

// For a given caller address, records an amount of newly allocated memory.
// Returns true if successful, or false otherwise.
//
// This simple routine isn't much more than a stub, though it actually can 
// perform reasonably for tracking allocations of a single type of data 
// structure performed by just a handful of callers.  A skip list of Caller 
// structures would be a better choice for full-on tracking of allocated 
// memory.
//
bool memTracker::recordAlloc(int address, size_t size)
{
    int index = 0;

    while (callers[index].address)
    {
        if (callers[index].address == address)
        {
            callers[index].size += size;
            return true;
        }
    }

    if (index < MAX_CALLERS)
    {
        callers[index].address = address;
        callers[index].size = size;
        return true;
    }

    return false;
}

// Records a deallocation associated with an address of the caller of an 
// allocator.  In this example code, the allocator's caller address is 
// recorded in the SkipListEntry for lookup at deallocation time.
//
// Returns true if the deallocation is successfully recorded, false otherwise.
//
bool memTracker::recordDealloc(int alloc_address, size_t size)
{
    size_t finalSize = 0;
    int index = 0;
    bool bConsolidate = false;

    while (callers[index].address)
    {
        if (callers[index].address == alloc_address)
        {
            finalSize = callers[index].size - size;

            if (finalSize < 0)
            {
                return false;
            }
            else if (finalSize == 0)
            {
                bConsolidate = true;
            }
            else
            {
                callers[index].size = finalSize;
            }
        }
    }

    if (bConsolidate)
    {
        while (callers[1 + index].address)
        {
            callers[index].address = callers[1 + index].address;
            callers[index].size = callers[1 + index].size;
            ++index;
        }

        callers[index].address = callers[index].size = 0;
    }

    return true;
}

// For a given caller address of an allocator, returns the amount of 
// currently allocated memory.
//
size_t memTracker::getAllocSize(int address)
{
    size_t size = 0;
    int index = 0;

    while (callers[index].address)
    {
        if (callers[index].address == address)
        {
            size = callers[index].size;
            break;
        }
    }

    return size;
}

// A skip list is an ordered list of key/value entries.  Each entry includes 
// a set of one or more references to subsequent entries.  The number of 
// these "next entry" references, for an entry, is determined by the entry's 
// "level" field.  Like any other data structure, an entry also can include a 
// reference to the caller function or the call chain of the function that 
// allocated it.
//
struct SkipListEntry {
    int  key;
    int  value;
    int  level;
    SkipListEntry **setOfNextEntries;
#if defined(DEBUG)
    int  allocatorCaller;
#endif
    void* operator new(size_t, int, int);
    void operator delete(void*);
};

// This overloaded allocator for a skip list entry captures the addresses of 
// its own callers.  Breaks when a debugger is attached, in case a caller has 
// allocated more than a threshold number of entries that still exist.
//
void* SkipListEntry::operator new(size_t count, int key, 
                                                int value)
{
    int iLevel = 0;

    // Can grab a lock here that will protect both the skip list and the 
    // debug build's memory tracking structures.  Make sure any lock is set up
    // in advance of reaching this code.

    // This is just an example implementation; can instead call operator new.
    // Allocation also can be conditional as to whether an existing entry has 
    // the provided key.
    SkipListEntry *entry = (SkipListEntry *) malloc(
                                 sizeof(SkipListEntry) * count);
    entry->key = key;
    entry->value = value;
    entry->level = GetRandomSkipListLevel();

    // Can go on to find the new entry's place in a skip list, and to set up 
    // references from previous entries to it and from it to subsequent 
    // entries.

#if defined(DEBUG)
    int address;

    // Custom allocator can store entire stack traces per list entry.  This 
    // Windows-specific example grabs just the caller address and stores it in
    // one of the Caller structs declared above.
    if (CaptureStackBackTrace(/* FramesToSkip = */ 0, 
                              /* FramesToCapture = */ 1, 
                              (PVOID *) &address, NULL))
    {
        // Can implement a find() routine for an array or map of the Caller 
        // structs declared above, but in this example we just record a 
        // pointer to that struct here in the skip list entry.
        if (MemTracker.recordAlloc(address, count * sizeof(SkipListEntry)))
        {
            ((SkipListEntry *) entry)->allocatorCaller = address;
        }
        else
        {
            // Stop in the debugger if we get more than an expected  
            // allocation total.
            DebugBreak();
        }
    }
#endif
    // Can release the skip list + memory tracking lock here.

    return entry;
}

void SkipListEntry::operator delete(void* entry)
{
    // Can grab the skip list + memory tracking lock here.

    // Can go on to find the entry in a skip list, and to update references  
    // to it from the previous entries, so that they instead refer directly 
    // to the subsequent entries.

#if defined(DEBUG)
    MemTracker.recordDealloc(((SkipListEntry *) entry)->allocatorCaller, 
                          sizeof(SkipListEntry));
#endif

    free(entry);
    // Can release the skip list + memory tracking lock here.
    return;
}

Implementing a custom allocator to track memory can be useful, though this approach easily may become complex and bug-prone. An alternative is to use a runtime analysis tool, such as a memory profiler, which can catch overused heap memory even if it’s not part of a data structure you’ve had in mind. Tools that perform memory leak detection also typically provide memory-in-use reports. These reports, like the results of a memory profiler, can inform you about the most useful prospective memory cleanup you can investigate. For example, the code of Listing 3 allocates a significantly large memory block that doesn’t get fully initialized, as indicated via a technique I helped invent:

Listing Three

#include ‹iostream›
using namespace std;

int PartiallyInitSomeMemory(int bytes)
{
    if (bytes)
    {
        // Allocate some memory.
        void *theBytes = malloc(bytes);
        int y;
        cout << "Bytes allocated: " << bytes << endl;

        // Initialize a third of the allocated memory.
        for (int i = 0; i <= bytes / 3; i++)
        {
             *(char *) ((char *) theBytes + i) = 'a';
        }

        // Get a memory in use report here.
        cin >> y;
        free(theBytes);
        return bytes;
    }

    return 0;
}

int main(void)
{
    void * x = malloc(0x20);
    PartiallyInitSomeMemory(0x6000);
    return 0;
}

Returning to our discussion of thread safety, most runtime analysis tools that detect and report memory leaks in native code also can detect and report memory in use. The difference between the two boils down to whether the program is storing any pointers to the memory. If there’s no pointer to a block, then there’s apparently no way to deallocate it, meaning there’s a leak. But if there’s a pointer to it, it still may never get deallocated unless both:

There’s a routine designated to deallocate it, and
There’s a thread that gets a chance to invoke that routine.

By tracking memory using code like what’s shown in the listings above, or by using a memory profiler or a tool that can show memory in use by your code as it runs, you can catch onto the effects of thread starvation. When your program is growing its memory footprint unsustainably, consider the data structures that require periodic cleanup, find the routines responsible for performing it, and make sure they’re getting invoked as necessary. If they’re not, then you could have a thread starvation issue that’s hard to catch without doing some runtime analysis.

Running Fast Without Racing

Another problem that can occur when multiple threads share objects in memory is known as a race condition. This can happen when an object is not adequately protected by a lock. It tends to be another intermittent kind of problem. The more locks your code uses, the trickier debugging race conditions can get.

Race conditions often result in bizarre memory contents revealed only after the race is long over. Debugging the cause requires more than just observing the effects. You may attach a debugger and find an ordered data set that’s mysteriously out of order, an array in disarray, or values that simply aren’t what they’re supposed to be. Problems like these aren’t always caused by race conditions. You may want to rule out other potential causes before assuming that your threads aren’t sufficiently safe.

Common causes of race conditions include these:

Not recognizing that an object is shared — a reminder, directly in the object’s class name or in a namespace related to it, can help.
Using a system-provided or third-party object accessed via API calls — if you’re not sure there’s a lock inherently protecting that object, then you may need to arrange lock protection of your own around each of those calls.
Inconsistent locking — if the object isn’t consistently protected by a lock, or if the lock that’s supposed to be protecting it isn’t consistently the same lock, you can have a false sense of security.

There are a couple of steps you can take toward discovery and correction of inconsistent locking. The first step is to ensure that any object in question is always associated with an appropriate lock. A straightforward way to do this is by exclusively using an accessor routine for the object — one that always grabs the lock on the way into accessing the object and then releases it afterwards.

This can sometimes result in the lock being acquired more than once, in case, for example, one accessor relies on another in certain scenarios. When that happens, some locks will track that fact via a recursion count, so they’ll consistently need to be unlocked as many times as they’re locked. The consistent use of these accessors that always lock on the way in and unlock on the way out, no matter what path is taken, and whether they get nested or not, can help you make sure that your code is always acquiring and releasing the lock as needed.

When that first step isn’t enough, a second step is to code a consistency check. A simple check like this one can go into an object’s accessor routine:

// Compare a thread id from a Windows critical section with
// the DWORD that identifies the current thread.
//
if (cs.threadId != GetCurrentThreadId())
{
// Can output a diagnostic here, or
DebugBreak(); // or while(1) Sleep(0);
}

Deploying a lock consistency check like this one, which helps the debugger catch any object arrangement that’s getting confused, can flush out race conditions where a lock is held by an inappropriate thread.

Keeping Components Fast

A more sophisticated consistency check may be needed, in order to help pin down where the inappropriate thread is getting hold of the lock in the first place. This is particularly true when a lock can be released to make a call between components, and then reacquired while preserving the order in which all of its owning threads have released it. Consider using a data structure like this one for performing consistency checks on critical sections or mutexes; an alternate or extended version may be used if your code includes other kinds of locks.

Listing Four

#define USE_CRITICAL_SECTION 1 // Just for this example

typedef struct lock
{
    DWORD dwThreadId;
    int recursionCount;
#if defined(USE_CRITICAL_SECTION)
    CRITICAL_SECTION *cs;
#else if defined(USE_MUTEX)
    HANDLE hMutex;
#endif
} *lock_ptr;

lock_ptr lock_array[MAX_LOCKS] = {0};

A discrete lock structure, such as the structure of the example code above, can be associated with each of your locks. You can initialize an array of them at program startup. The size of the array can match the number of locks that you have.

static lock myLock1; // A lock to track in the array
static lock myLock2; // Another lock to track...

As you initialize your locks, you can track them in the array.

Listing Five

// Here, lock can be any of myLock1, myLock2, ...
//
if (lock_array[lock] == NULL)
{
    lock_array[lock] = myLock1;  // For example
    InitializeCriticalSection(lock->cs);
    lock->dwThreadId = lock->recursionCount = 0;
}

Your code can update the relevant array entry wherever it acquires a lock.

EnterCriticalSection(lockToAcquire->cs);
lockToAcquire->dwThreadId = GetCurrentThreadId();
lockToAcquire->recursionCount++;

It then can perform a consistency check against the contents of the lock (e.g. critical section) itself. The following can go right after the above code, or a similar check can be done in any code that determines the lock’s state before accessing an object in memory.

Listing Six

int i = 0;

while (i < MAX_LOCKS)
{
    if (lock_array[i] == lockToAcquire)
    {
        if (lockToAcquire->cs->OwningThread != 
            lockToAcquire->dwThreadId)
        {
            // Can output a diagnostic here, or
            DebugBreak(); // or while(1) Sleep(0);
            // Also can compare the lock’s recursion count 
            // against the recursion count tracked for it.
        }
    }
}

Don’t forget to clean up the tracked values when the lock is released.

if (--(lockToRelease->recursionCount) == 0)
{
lockToRelease->dwThreadId = 0;
}
LeaveCriticalSection(lockToRelease->cs);

The above code can help you catch situations where data structures get confused because of a race condition. This is most likely to happen when you have multiple locks per data structure, such as a read lock and a write lock. If your data structure is subject to more reads than writes, then any write whose completion is subject to a call between components (or other prospectively slow code) can grab a write-specific lock and hold it throughout the sequence. It can grab a read-specific lock just for portions of the sequence that are in your own component. The data structure will remain thread-safe, so long as...

it’s left in a consistent state whenever the read-specific lock is released, and
multiple threads executing a sequence that releases a lock to make a call are made to complete the sequence in the order they entered it.

This performance optimization is more easily done with just one lock, of a special kind I invented, which takes care of the needed thread sequencing. It’s usually cleaner yet to get any information needed during any write sequence up front, prior to grabbing any lock, and only then complete the sequence with minimal lock usage. But such a clean approach isn’t always feasible, or even in case it is, implementing it may cause a performance hit. In any case, the above code snippets illustrate a useful approach for debugging race conditions that may crop up in situations where threads need to acquire locks in a certain order.

Preventing Hangs

A classic deadlock occurs when one thread grabs one lock and waits for another one, while another thread that’s grabbed that other lock must wait for the first one. This can hang every thread in the process, unless those threads don’t need either of the locks involved in the deadlock, or any other results from the hung threads. So a deadlock is usually fatal. Fortunately, it’s also usually easy to debug. Still, a fatal hang is a fatal hang, and it may be a sign of deeper thread safety drawbacks.

Well-designed code should never hang. Here are some design considerations for deadlock prevention:

Can you arrange your objects so that each is protected by no more than one lock? There may be the possible exception of a read-specific and write-specific lock for the performance reasons outlined in the above section. In most cases, though, one lock per object class should be enough.
If multiple locks must be held at once, can you make sure they always get acquired in the same order?
Can you avoid holding locks when calling API functions, and when calling the routines of any other components, in order to prevent re-entrance into your component further down the stack while one of your locks is held?
If a lock must be held during a call out of your component, can you make sure that every entry point for your component checks on whether the lock is held and acts accordingly? You might use thread-local storage, or a field associated with the lock itself, to perform this check.

The classic deadlock scenario is just one way in which locks can involve themselves in a hang. If a thread that holds a lock goes into an indefinite loop or very long wait state, this too can effectively hang the entire process; for practical purposes the condition acts the same as a classic deadlock. If a thread that holds a lock becomes starved, the effect can be similar. It’s a good idea to leave your process a graceful way out of these situations — that is, a way in which open handles for system objects are closed, unsaved data is saved, and so on.

One way to gracefully get out of a lock is to leave a thread running — one that acquires no locks — that either handles an end process request from the user, or checks for an emergency condition (like all the other threads hanging) and ends the process on its own initiative. If a thread like this provides the user with a way to get an end process request through, then effectively your component can recognize that the process is exiting. It then can clean up accordingly. Also, with a few simple arrangements, you might even break some or all of your hung threads out of their wait state, in case that would help realize a clean exit.

Those arrangements might go something like this. Most components, including dynamically loadable modules, have code that gets invoked specifically at process exit time. For example on Windows, when a process is exiting, every DLL’s DllMain() routine runs, with that routine’s dwReason parameter’s value set to DLL_PROCESS_DETACH. When that condition arises, you can set an event, for example using code like this:

SetEvent(hEndProcessEvent);

How can setting an event help your threads out of a situation where they’re endlessly waiting on locks? By changing your locks to wait for this event, and/or on any other event of your choice, in addition to whatever synchronization object makes them wait, you can get your waiting threads to recognize the event. All your threads that hold those locks will get unlocked once that event is set. Such a lock can be based on a synchronization object that can be acquired via the same call that waits on the event...

lock_array[lock] = myLock2; // For example
lock_array[lock].hMutex = CreateMutex(/* attributes = */ NULL, /* acquire now = */ FALSE, /* name = */ NULL);

... so the code that implements the lock can wait on both objects, something like this:

Listing Seven

HANDLE hMultipleObjects[NUM_MULTIPLE_OBJECTS];
hMultipleObjects[0] = lockToAcquire->hMutex;
hMultipleObjects[1] = hEndProcessEvent;
// the list of multiple objects can continue

while (InterlockedCompareExchange(lockToAcquire->dwThreadId, 
       GetCurrentThreadId(), /* no owning thread = */ 0)
{
    // Wait on whatever other thread holds the lock, or on 
    // any other events among the multiple objects.
    WaitForMultipleObjects(NUM_MULTIPLE_OBJECTS, 
          hMultipleObjects, /* wait all = */ FALSE, 
          /* milliseconds = */ INFINITE);
}

A complementary lock release routine can be coded something like this:

InterlockedExchange(lockToRelease->dwThreadId,
/* no owning thread = */ 0);
ReleaseMutex(lockToRelease->hMutex);

In what case might you need a longer list of multiple objects? If you have unsaved data to be saved, you might want to unhang just the thread that’s going to save that data, before letting go of all the other locks. To make this work, you can set up an additional event to be triggered to get the save all done, prior to triggering the event that releases the rest of the locks to let the process exit more or less nicely.

You can add recursion count management, a TLS flag that indicates the entering/leaving lock status, a call out to an emergency cleanup-at-exit routine, and any other logging or debugging code that will help you diagnose the emergency. In any case, at the expense of potentially corrupting memory that will be released as soon as the process exits, you can arrange for that exit to be more graceful than requiring the user to sit there trying to kill your program while it’s hung.

Playing It Safe

There’s a lot more that can be said about thread safety, ranging from such introductory comments as what you’ll find here, here, and here, to discussions delving into re-entrant code, existing and proposed compiler-based techniques, and optimizations specific to certain cases. Spend time researching the techniques that make sense for you and your project, to make your multithreaded code as fast and safe as can be.

Develop for Performance