YetiSim Blog

Blogs about simulation and developing YetiSim.

Implementation of Parallel C++ Lists: Part 1

In my last blog I outlined primitive data types to enable easy parallel programming using my C++ library.  This weekend I began to implement my “lists with a twist”.  The implementation has an elegant thought behind it, but there are inevitable messy low-level details.  In this blog, I will outline these messy low-level details and in my next blog I will demonstrate the completed interface.

My concept of a list is heavily influenced by Python, with an exception that when an element is deleted no other elements will be moved and the list is homogeneous.  Instead, the deleted element will be internally marked to be invalid.  The intention is to eliminate internal shuffling within the list, and to provide a representation close to the hardware.  This list abstraction is intended to directly represent a collection of contiguous cache lines.  Consider a list of integers, and let’s represent empty space (i.e. deleted places) by X.  Suppose I start off with the list [1,2,3,4] and delete 2, the final list will be [1,X,3,4].  Proceeding, I delete 1 and 4 to produce the list [X,X,3,X].

To implement my list construct, I require both a bitmask which stores valid/invalid element positions, and the actual data.  For parallel efficiency, I want to ensure that there will be no cache lines shared across processors where each processor will want to write to a shared cache line.  That is, I want to be able to break up the list into chunks such that each processor can operate on a distinct piece without interfering with other processors.  I also want to ensure that positional information is associated with these list chunks, so that each processor will have the list data and positional data hot in its cache.  Some information may be shared across processors, but this information should be read-only.  This way processors can operate on sublists without contention for shared data.

In order to meet these goals, I have broken up the implementation of the list into two strategies based on the size of the type stored within the list.

Strategy #1: sizeof(T) < cache line size

In this case, at least one element of some type T can be stored in a single cache line.  Maybe it’s one, maybe it’s ten.  In this case, we store positional information at the end of the cache line.  We pad the space between the element data and the bitmask with zeros if required.

Strategy #2: sizeof(T) >= cache line size

This means that at least one entire cache line is required to store an instance of this type.  In this case, I’ve decided to store some number of elements of T, say k.  After k elements, an entire cache line will be dedicated to positional information.  This can lead to wasted space, in particular on my machine a cache line is 64 bytes.  If sizeof(T) is 65 bytes, then 63 bytes have been wasted in padding.  This serves to reduce the potential for fasle sharing.  It may also be a waste of precious cache space, however I have not been able to think of a better method.

How will k be chosen?  I’m really not sure: for now I will fix it to be a constant number, likely 10.  This number represents the granularity of data for parallel operations.  I do not want to allow the user to specify the granularity themselves, this detail should be taken care of by the library automatically.  I suspect that there is a large interval for k within which the performance will be quite efficient.  I’m reminded of a graph provided in the Threading Building Blocks book which indicates that the granularity selected isn’t terribly important (so long as it’s not too big, and not too small).

The Messy Details

At runtime the cache line size is determined by a cpuid assembly call.  I’ve provided a constant class machine which will hold information about the system gathered at runtime.  For now, this just includes the size of a cache line in bytes.  This is done at runtime because an application dependent upon my library may be run on a different machine than the one on which it was compiled.  Although a template solution would have been more elegant, runtime portability is more important.

Internally, the list data structure will contain a pointer to unsigned chars.  The unsigned char array will be allocated with TBB’s scalable_malloc.  When an item is inserted into the list, its copy constructor will be called.  Similarly when it is removed from the list, the destructor will be called for that element.  When an element is requested, it will be cast to the appropriate type and a reference returned.  Internally however only unsigned chars are managed.  This is definitely messy, but with careful thought and implementation (along with heavy documentation) I’m confident this can be done efficiently.  Thankfully, I believe this is the only data structure which requires such low level attention.


Posted by AJ Guillon  (October 19, 2008)

Leave a Reply