08 June, 2011

Concurrency and lock-free basics

Concurrency is the property where multiple computations are executed simultaneously. In the domain of modern game programming, that usually means having multiple threads running on multiple cores, potentially located on multiple physically separated processors. One way to ensure progress across the executing threads is to avoid critical sections and locks on shared resources, and instead design your code to use lock-free algorithms.

The definition of the term "lock-free" has changed over time and is now used to denote the property of an algorithm where the suspension of one thread will not block the potential progress of other threads. It is important to note though that lock-free does NOT imply wait-free. A thread can still stall in a lock-free algorithm, but only if at least one other thread has made some progress. In comparison, using a lock on a shared resource will stall all other threads trying to access that resource if the thread holding the lock is suspended during that critical section.

Lock-free algorithms are usually implemented using atomic read-modify-write primitives, such as CAS (compare-and-swap). An atomic instruction is instantaneous, or at least appears that way to the rest of the system, and is usually defined with a succeed-or-fail interface where a failure to perform the operation has no visible side effects on the system. In the case of a CAS, a failure occurs when another thread has changed the value, i.e. made progress. A quick pseudo-C-code example of updating a shared resource using an atomic CAS:

var my_shared_resource;

var old_value, new_value;
do {
  old_value = my_shared_resource;
  new_value = calculate_something( old_value );
} while( !atomic_cas( &my_shared_resource, old_value, new_value ) );

Basically the thread busy-loops while waiting for the atomic operation to succeed. The CAS operation only succeeds and stores the new_value if the current value of the resource equals the old_value passed in. It's not wait-free, but if the thread stalls (loops), it means that some other thread has succeeded in the operation it's performing on the shared resource.

A potential pitfall of the CAS operation is known as the ABA problem. If some other thread performs some operation that stores the value B and then restores the value A, the thread performing the CAS will think the value has not been modified and the CAS will succeed. If the value is some bit pattern that has some other intrinsic meaning (like a memory address, or a wrapping counter), you could be in serious trouble. The solution in these cases is usually to use a double-length variable, for example a 64-bit value for a 32-bit counter, using the remaining bits as an "operation counter" which is incremented every time an operation is performed. This solves the ABA problem since the restoration of value A will have a different bit pattern due to the increased counter in the top 32 bits.
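As a sketch of this counter technique (written here with C11 atomics; the names and the store_value function are mine for illustration, not from any particular codebase), the 32-bit value and the operation counter can be packed into one 64-bit word:

#include <stdatomic.h>
#include <stdint.h>

// A 32-bit value tagged with a 32-bit operation counter in the top
// half, so a restored value still fails the 64-bit compare
static _Atomic uint64_t tagged_value;

void store_value( uint32_t new_value ) {
  uint64_t old_packed = atomic_load( &tagged_value );
  uint64_t new_packed;
  do {
    // bump the operation counter on every store
    uint64_t counter = ( old_packed >> 32 ) + 1;
    new_packed = ( counter << 32 ) | new_value;
    // on failure, old_packed is reloaded with the current value
  } while( !atomic_compare_exchange_weak( &tagged_value, &old_packed, new_packed ) );
}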

So, now that we have a basic atomic operation, what can we do with it? As an example, let's compare the implementation of a LIFO/stack. A quick lock-based algorithm would look like:

var top
var mutex

function push( item ) {
  lock( mutex )
  item->next = top
  top = item
  unlock( mutex )
}

function pop() {
  lock( mutex )
  var item = top       // empty-stack check omitted for brevity
  top = item->next
  unlock( mutex )
  return item
}

In comparison, a lock-free implementation would look like:

var top

function push( item ) {
  var old_top
  do {
    old_top = top
    item->next = old_top
  } while( !atomic_cas( &top, old_top, item ) )
}

function pop() {
  var item, next_top
  do {
    item = top         // empty-stack check omitted for brevity
    next_top = item->next
  } while( !atomic_cas( &top, item, next_top ) )   // subject to the ABA problem described above
  return item
}
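For reference, here is a minimal sketch of the lock-free push written with C11 atomics (the node type and names are illustrative only, and a matching pop would still need to deal with the empty-stack and ABA issues mentioned above):

#include <stdatomic.h>
#include <stddef.h>

// Illustrative node type, not part of the pseudo-code above
typedef struct node node;
struct node {
  node* next;
  // payload omitted
};

static node* _Atomic top = NULL;

void push( node* item ) {
  node* old_top = atomic_load( &top );
  do {
    item->next = old_top;
    // on failure, old_top is refreshed with the current top
  } while( !atomic_compare_exchange_weak( &top, &old_top, item ) );
}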

05 June, 2011

Designing an API (part 2)

This is the second part of my API design post, and part three in my series of posts about building a profiling library. The initial post covered the base code for measuring elapsed time, and the second the design of the API. I'll go through some of the details of the implementation, and if you want the full code it's available on GitHub (released to the public domain). The code for the final implementation will be pushed to this repository tomorrow. This post is very code-oriented, but I promise to have more pretty pictures next time!

To recap, these are the API design guidelines I think are important to keep in mind during the implementation:
  • Orthogonal. A function should not have any side effects, and there should be only one way to perform an operation in the system.
  • Contained. Following from the rant in my previous post, we should avoid third party dependencies and prefer to use primitive or well-defined data types.
  • Specialized. A function in an API should perform a single task. Don't create functions that do completely different unrelated tasks depending on the contents of the variables passed to the function.
The first thing that I think about is memory management. Most games tend to have their own memory management and debugging systems, which in my book immediately makes raw new/delete/malloc/free calls invalid in a library. If you need to make allocations, you should provide means in the API to set callbacks (or similar constructs) for memory management to let the library use the same implementation as the game using the library.
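A sketch of what such an entry point could look like (the names here are hypothetical and not part of the library built in this series):

#include <stddef.h>

//Hypothetical callback types matching the game's own allocator
typedef void* (*library_alloc_fn)( size_t size );
typedef void  (*library_free_fn)( void* ptr );

//The game installs its allocator before initializing the library
void library_set_memory_callbacks( library_alloc_fn alloc_func, library_free_fn free_func );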

In this small library, however, I decided to take the easy route and simply require a pre-allocated memory buffer, limiting all in-library memory requirements to that buffer. The drawback is of course that the library can only handle a fixed amount of data, but since it is also continuously writing the data to an output stream, the required buffer size is proportional to the rate at which data is generated between flushes rather than to the total amount. I find that an 8k buffer is plenty of space for this purpose. The internal data structures are laid out as follows:
typedef struct _profile_block_data   profile_block_data;
typedef struct _profile_block        profile_block;

struct _profile_block_data
{
 uint32_t           id;
 uint32_t           parentid;
 uint32_t           processor;
 uint32_t           thread;
 uint64_t           start;
 uint64_t           end;
 char               name[20];
}; //sizeof( profile_block_data ) == 52

struct _profile_block
{
 profile_block_data data;
 uint32_t           previous;
 uint32_t           sibling;
 uint32_t           child;
}; //sizeof( profile_block ) == 64
The blocks can be arranged in a tree structure using the child, sibling and previous pointers (stored as indices which would fit in 16 bits, but kept at 32 bits to allow atomic swap instructions). The array of blocks is initialized in the library entry point:
void profile_initialize( char* identifier, void* buffer, profilesize_t size )
{
 profile_block* block = buffer;
 profilesize_t num_blocks = size / sizeof( profile_block );
 profilesize_t i;

 //Link all blocks into a free list through the child index
 for( i = 0; i < ( num_blocks - 1 ); ++i, ++block )
  block->child = ( i + 1 );
 block->child = 0;

 //[...]
}
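Usage from the game side could then look something like this (the identifier string and buffer size here are just examples, not requirements):

//8k of static storage handed to the library at startup
static char profile_buffer[ 8192 ];

void game_startup( void )
{
 profile_initialize( "mygame", profile_buffer, sizeof( profile_buffer ) );
}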
The library keeps a single block as the root of all currently queued blocks, and the operation of adding a block with no parent becomes a simple task of swapping the child pointer (index) of this root block. To make it thread safe we need to use an atomic compare-and-swap instruction:
static void _profile_put_root_block( profile_block* block )
{
 //Index of this block in the block array
 uint32_t self = (uint32_t)( block - _profile_blocks );
 do
 {
  block->sibling = _profile_root.child;
 } while( !_atomic_cas_32( &_profile_root.child, self, block->sibling ) ); //Store self if child still equals block->sibling
}
Each block can also be a subblock inside another block. To push such a block we keep a thread-local variable storing the currently active block, and can then swap the indices without worrying about thread safety:
static void _profile_put_simple_block( profile_block* block )
{
 //Add to current block, or if no current add to array
 profile_block* masterblock = _thread_data();
 if( masterblock )
 {
  uint32_t self = (uint32_t)( block - _profile_blocks );
  block->previous = (uint32_t)( masterblock - _profile_blocks );
  block->sibling = masterblock->child;
  if( masterblock->child )
   _profile_blocks[ masterblock->child ].previous = self;
  masterblock->child = self;
 }
 else
 {
  _profile_put_root_block( block );
 }
}
Once we want to flush the queued blocks we simply iterate the list of queued blocks in the root block, and process each tree of blocks. Again, we can get away without a lock by doing an atomic compare-and-swap on the single child index of the root block:
do
{
 block = _profile_root.child;
} while( !_atomic_cas_32( &_profile_root.child, 0, block ) ); //Store 0 if child still equals block, detaching the whole queued list
A recursive function can then traverse the block tree and rearrange the child/sibling pointers in order to create a single list of blocks arranged through the child pointer. Since only closed blocks are added to the root list, the traversal can be done without worrying about other threads modifying the data:
static profile_block* _process_profile_block( profile_block* block )
{
 profile_block* leaf = block;

 //[...] process block and write to output stream 

 if( block->child )
  leaf = _process_profile_block( _profile_blocks + block->child );
 if( block->sibling && !block->child )
 {
  block->child = block->sibling;
  block->sibling = 0;
  leaf = _process_profile_block( _profile_blocks + block->child );
 }
 if( block->sibling )
 {
  profile_block* subleaf = _process_profile_block( _profile_blocks + block->sibling );
  subleaf->child = block->child;
  block->child = block->sibling;
 }
 return leaf;
}
In the end, we get a stream of profiling block data detailing the time spent in selected parts of the code. The implementation goals are maintained: the code is lock-free and has very low overhead, there are no third-party dependencies thanks to the single pre-allocated buffer used for all storage, and each function is specialized and free of side effects through the use of thread-local storage and atomic swap instructions.

In the next post I'll present a simple UI for analyzing the data.