YetiSim Blog

Blogs about simulation and developing YetiSim.

Archive for the 'Threading Building Blocks' Category

Implementation of Parallel C++ Lists: Part 2

Last time I began to outline how I am implementing parallel lists in C++. In this blog, I will show some of the features of the completed implementation, and will discuss some refinements that could be made.

Read the rest of this entry »


Posted by AJ Guillon  (November 14, 2008)    |    Comments (0)

Implementation of Parallel C++ Lists: Part 1

In my last blog I outlined primitive data types to enable easy parallel programming using my C++ library.  This weekend I began to implement my “lists with a twist”.  The implementation has an elegant thought behind it, but there are inevitable messy low-level details.  In this blog, I will outline these messy low-level details and in my next blog I will demonstrate the completed interface.

Read the rest of this entry »


Posted by AJ Guillon  (October 19, 2008)    |    Comments (0)

Three Months? What’s Been Going On?

Three months since the last post?  Where are you?  Is YetiSim dead, or just one of those projects that starts but never goes anywhere?  Well the short answer is, I’ve taken a strategic retreat to really update my skills to make YetiSim even cooler.  Now with some updated C++ and parallel skills, it’s time to get down to making YetiSim super cool.

Read the rest of this entry »


Posted by AJ Guillon  (July 2, 2008)    |    Comments (1)

Pointer Containers ptr_vector and ptr_set

I have been extracting components from YetiSim and contributing them to tbbcommunity.org, which is a site that will store extensions to TBB’s own code.  The intention of tbbcommunity.org is to form a symbiotic relationship with the main TBB site.  Ideally tbbcommunity.org will host extra components which are not of sufficient generality or stability for inclusion with official TBB.  In my mind, I would like tbbcommunity.org to become something like the Boost libraries.  Libraries which are not included with the standard releases, but which are indispensable resources.

Anyways, enough about that and on to pointer containers!  The basic motivation for pointer containers within YetiSim is one of exception safety.  I needed a place to safely store pointers to allocated memory, without the use of smart pointers.  Boost shared_ptr is an excellent resource, however it is too slow for YetiSim.  I decided to create ownership classes which will actually own the allocated memory, and so be responsible for its proper destruction.  These classes are low-cost so long as they are lightly contested in parallel code since there are some mutexes within the code.  It has been pointed out to me that this is similar in concept to Boost pointer containers, although my classes are slightly difference since they use TBB mutexes internally.

The basic operations are all documented within the source code (available from tbbcommunity.org’s svn), and I will not repeat them here.  One item worth mentioning is the lack of copy constructor or assignment operator.  It simply does not make sense to copy a container of owned objects from one variable to another.  I could do silly things when the object is assigned to or copied, but this would result in confusing code.  Instead, I simply forbid these operations and instead provide swap().  A user can easily call swap, and its meaning is always clear.  However a user can “assign” one pointer container to another by clearing one container using deleteAll() to ensure memory is properly freed, then calling swap.

I do not provide a method of iterating over the contents of the pointer containers, simply because I do not need it right now.


Posted by AJ Guillon  (March 7, 2008)    |    Comments (1)

TBB execution_graph - Part 4: Introducing parallel_set_for

Yesterday I discussed a problem with processing interdependent entities with TBB. This doesn’t just pertain to simulations and to the execution_graph, but in general. There is no provided high-level construct which allows you to easily process elements of a range with interdependencies. Let me see if I can say this more clearly, say I have to process a collection of entities, but I also have to reprocess some of those entities over again in the algorithm. Perhaps I am processing 2D cells on a map, and have to complete a second pass to process the boundaries. This could certainly be done with the task_scheduler, and likely could be done with just the range and partitioner.

I am interested in a high level construct which would allow me to partition a set of elements into units which could be processed in parallel. This construct should then allow me to partition the set again to make another pass, and so on until I deem that the processing is completed. In this way, the problems discussed in Part 3 of my blog series on the execution_graph could be resolved. I know I promised code yesterday, but I’ll just show an idea today.

In order to ensure I’m conveying my ideas effectively, I’ve prepared some diagrams. Suppose I have a set of elements to process in some manner. I’ll represent this set as blobs of ink on a digital page, as shown below.

Set of objects to be processed

In order to process these elements, in the first pass I might partition the space into three segments. I do not have to process the entire space at once, just the segments which interest me. I’ve drawn lines around the elements I want to process. These three partitions should be processed in parallel.

First partition of space

Again after processing these segments I might not have completed my task. Next I will partition the space into another three segments which can be processed independently. Note that this time one of the segments only contains a single element to process, this is fine and entirely possible.

Second partition of space

For the last step, I will partition the space one final time into two large segments. Once again these two segments will be processed in parallel.

Third partition of space

I call the construct which will support this (tentatively), parallel_set_for. This conveys that we have a set of elements which are to be processed. We process subsets of this larger set independently, then partition the larger space into subsets again for further processing until the task is completed (as opposed to partitioning the subsets into smaller ones).

In contrast with the TBB parallel_for construct which simply divides elements for individual processing, the parallel_set_for construct might not even process all elements. Some might not affect computation so they might be omitted. Furthermore, the parallel_for construct is designed to divide work to keep all processors busy and simply complete after all elements have been processed. The parallel_set_for construct would proceed in steps, where independent subsets are selected and then processed.

In comparison to parallel_for, this processing method could form a bottleneck. If partitions are not selected properly, effort might be wasted because some partitions could have been processed in a previous step. In addition, a single element which is being processed might hold up multiple independent partitions which could process next without affecting results. I think that this could be compensated for with good interface design.

Well, that is what I’ve thought up so far. I welcome your comments. This is not a fully thought out idea, just an idea which sprung to mind today that I wanted to share with you. Good night!


Posted by AJ Guillon  (January 30, 2008)    |    Comments (1)

TBB execution_graph - Part 3: More Concept Revisions, Dividing Work, and Arguments

I hoped to have written some skeletal code by now to show you, and start to point out the classes.  However I find when I take on a software design, that I start to think about it a lot.  I ponder how things interact while I’m in the shower, or find myself visualizing the program interactions, daydreaming while I’m cooking dinner or doing dishes.  The past weekend has been filled with these moments, and now I’ll share my thoughts with you.

First, the class name transition is again misleading.  I need a better name to represent the connections between nodes in the execution graph.  The name link, or execution_path best convey the meaning.  This is a small matter, but I wanted to give you the heads up.

Now, I’ll describe my thoughts over the weekend.  Prepare yourself for a rant, and of course post comments if you have some ideas.  The question is, what does the execution_graph execute… when it is “executed” something should be passed as an argument so that the execution_graph can base its decisions upon something.  This might be an open file descriptor, an entity in a simulation, or perhaps absolutely nothing.  Perhaps the decisions to make are programmed within the links themselves so arguments are not required.  So let’s just pass a single thing, and even better let’s template this argument so a list of things could be passed.  The state machine (I mean execution_graph) might be associated with one particular type, say ExecutionArgumentType.

This idea works well if we have a collection of things to process individually, say file descriptors for sockets.  Each socket might be passed to an execution_graph for operations to take place.  We can do even better, and use tbb::parallel_for with the auto_partitioner to divide up a vector of sockets and run their execution_graph in parallel.  This is great for such easy examples, but what happens when we can’t do such a trivial division?  What if we are processing something that requires fuzzy boundaries when we partition a space.  For example, say we divide up a 2D plane into rectangles, and we have an algorithm which requires input from neighbors in a 2D cell for proper execution, like Conway’s game of life.

This is something that really ought to be handled by an execution_graph, but it’s not so easy unless the entire 2D board is passed as an argument.  So the execution_graph can’t really do this kind of processing yet.  But I’m not here to write Conway’s game of life in the execution_graph, I really don’t know how somebody would do it yet… but this doesn’t change the fact I need to give them the tools to do so.  In my mind, I need to find a clever way of letting people smarter than me to do clever things with my constructs.

This is an open problem, I’d love to say I knew how to do this.  The reality is, I will likely abstract the problem away into a class which breaks apart a larger chunk into smaller ones for processing, just like the TBB Range concept.  It may be that I use that, but I’m not sure the interfaces are sufficient.  People can then use this class in clever ways.

This brings me to the other subtopic in my blog title: dividing work.  I need to provide users with a method by which they can determine which execution_graph gets what job, or at least dividing up a pile of work to do.  I have a feeling that the solution to this problem will be incorporated with the former problem I stated, which is how to determine boundaries during processing.

That’s it for now, I’ll post another blog tomorrow night to keep you up to date with my design thoughts, and perhaps some code tidbits!  Once things get more concrete, conversations should be a bit easier.  It’s hard to convey my thoughts when I haven’t thought them all yet.


Posted by AJ Guillon  (January 29, 2008)    |    Comments (1)

TBB execution_graph - Part 2: Revising Concepts, Class Design Outline

I decided today that the name state_machine is too deceptive, and resolved to find a better name. The term state machine is particularly deceptive because nodes are not required to be states, but can execute complex logic internally. After some brainstorming, I came up with the new name execution_graph. This conveys that the construct is a graph which is associated with program execution. I have also decided to remove the state node type, and replace it with two types: compute and IO. The intention behind this is that some execution stages will be IO intensive whereas others will be computationally intensive. Differentiating behind the type of complexity entailed will be of benefit when TBB supports parallel IO operations. For now, there will be little if any difference.

Class Design

In my first implementation of a state_machine (the predecessor to the execution_graph) I relied heavily on polymorphism and dynamic_cast. I discovered two problems with this approach. First off, I only required the polymorphism in the state node class which contained logic to execute. Second, I used dynamic_cast to evaluate node types at run time which degrades performance with unnecessary overhead. In reality separate classes are not required if each node type is permitted to contain functions which are executed upon visitation of that node. This is a bug disguised as a feature. I don’t know why somebody would want a join node or fork node to execute logic upon visitation, but it simplifies the interface. I’m getting ahead of myself, let me identify the critical components I see and discuss how I envision an implementation.

class execution_graph

This will be a template class, but more on that later. This class will be the main user interface and will provide facilities for the construction of a graph in which nodes contain functions that will be executed during evaluation at runtime. It will own all transitions, nodes, and execution instances. These owned objects will all be destroyed if the instantiated execution_graph goes out of scope, so it is up to the user to ensure it remains in scope for proper execution. Users will be provided with a function like getStart() which will return a start node for them to use to build the execution graph. The logic for the evaluation of the graph itself (that is, how the graph is traversed and executed) will be provided as a template argument for maximum flexibility, along with some other classes.

class node

The node class represents a particular node within the execution graph. It could be a compute, join, fork, IO, start, final, or error node. Each node instance may only be created by another node instance, so users may not arbitrarily create them. Also, nodes will contain a pointer to the execution graph which they are associated with, so that transitions may not be created between nodes in separate execution graphs. The node class stores several functions which are called with appropriate arguments during graph execution. The declaration for the variable type which stores the function is to be provided as a template argument for maximum flexibility (e.g. I might want to use boost.function, others might want to use raw function pointers). The argument type which is provided to the stored functions is also provided as a template argument.

class transition

This class will represent a transition between nodes in the execution graph, which might be guarded or unguarded. The transition class will be provided as a template argument to the execution_graph, since others might want their execution_graphs to have special transitions. For example, YetiSim will alter the execution_graph transition logic so that simulation time is advanced by special transitions. Rather than adjusting internal implementations of a transition class just for YetiSim, by providing it as a template argument others may customize the behavior of the execution graph.

class execution_logic

This class encapsulates graph execution logic, and is provided as a template argument to execution_graph. This class will determine the parallelization strategy used in evaluation of the execution_graph. It will also determine how transitions are handled, so if a user overrides the default transition class they will also have to adjust the execution_logic class. The main idea for this class, is that it will provide certain required functions to the execution_graph which drive actual execution.


Posted by AJ Guillon  (January 25, 2008)    |    Comments (0)

TBB execution_graph - Part 1: Basic Ideas and Motivations

Not so long ago, I wrote a blog about the fact that I am separating the YetiSim state machine architecture into a generic component for contribution to the TBB project. I have no idea whether or not this component will be accepted by TBB officially, however others may find it useful so I am blogging about my design as I go. I use the term state machine to describe the construct I am building, but this is misleading. In August 2007, I started the research that lead to the YetiSim state machine construct, however it has evolved into something more sophisticated and without a proper name. Let’s start from the beginning and illustrate the initial motivations for the construct and what it does for us now.

I first looked toward TBB as a platform independent C++ threading library which could be used as the foundation of a simulation framework. Initially I was searching for a method to save and restore the stack, hence emulating coroutines for simulation. I was disappointed to learn that not only does TBB not provide any such low-level method, the usage of such methods is actually discouraged. After much thought, I realized that I could implement a state machine to provide the functionality which I required. The state machine would consist of start, final, error, fork, join and state nodes. I’ll give a brief description of the intended role of each node.

  • Start Node: This is where execution of the state machine always begins, only one is permitted per state machine.
  • Final Node: The normal end of an execution pathway, multiple nodes are permitted in a single state machine.
  • Error Node: The abnormal termination of an execution pathway, the entire state machine will be terminated as a result. This node is used as a form of assertion.
  • Fork Node: Splits a single execution pathway into multiple for potential parallel execution. Each pathway is executed independently of the others.
  • Join Node: Merges multiple execution pathways back into one.
  • State Node: Executes a C++ function or class member.

Each node has a corresponding symbol which I will not include here. These node types are based off state machine nodes found within UML. The nodes are joined by transitions which are either guarded or unguarded. The basic motivation is that an unguarded transition is always taken, whereas whether or not to proceed along a guarded transition is determined at runtime. Already I have omitted a critical detail, execution of the state machine proceeds literally. I will describe execution next.

When a state machine is executed, a pointer is created which points to the single start node within the state machine. Execution logic determines how execution proceeds by evaluating transitions and following them accordingly. Each node has a special meaning, and execution logic must take this into account while traversing the state machine. When a fork node is encountered, multiple pointers are created which might traverse the state machine in parallel. Similarly, join nodes indicate that multiple execution pathways should be tied together before execution proceeds. The join node will wait patiently for each execution pathway that can reach the join node to do so, and proceeds after that point.

In my opinion, the execution of a state machine will not have a significant runtime cost. The execution of a state machine provides a functionality that replaces the requirement for coroutines within simulation, since a transition might indicate that execution should not proceed until the simulation clock has been advanced by a certain amount. Similarly transitions might indicate probabilities, providing a functionality like that of a Markov chain. Since the logic of the state machine is dictated by the connections of nodes, they may be altered arbitrarily and potentially at runtime. This permits a highly dynamic programming environment with minimal additional cost. Keep in mind that saving and restoring the runtime stack would also have a significant runtime cost, and might require operating system calls. The state machine construct does not require any such calls, and should execute faster.

Earlier I described the name state machine to be misleading. This is because each state node may embed any logic, and not just represent a particular state. This construct is thus a superset of a state machine. Execution may proceed in many possible manners. Furthermore this state machine is easier to maintain than state machines implemented with switch logic. The adjustment of state machine logic does not have a high cost here, because only the representation of the state machine must be altered. A good design which incorporates the state machine concept will not require a reimplementation of existing code, rather just adjustment of transitions.

This construct is highly useful for simulations, but equally useful in other contexts. For example, this state machine might be highly useful for controlling parallel I/O when TBB itself supports it. It might also be the foundation for such an I/O library. Each state node might be labeled as computationally or I/O constrained and TBB could then manage the low-level details to ensure efficient execution with these hints. It also provides a simple interface to the TBB scheduler, and might permit designers to exploit parallelism in constructs that otherwise might be more complex. In fact, complex logic could be dissected into smaller chunks with this construct and represented easily as a system of nodes and transitions. This allows close inspection of the logic, and discussion with others to ensure that the system designed and requirements match properly. It is not difficult to construct a graphviz representation of the state machine, so it is fairly simple to inspect this logic visually to ensure proper execution.

I realize that programming languages and the computer execution itself may be considered to be a state machine. This is certainly true at a theoretical level, however adjustments to logic generally require a recompilation of source code. The state machine construct I propose has an easier method of dynamic adjustment. The state machine construct provides something like lego blocks, and thus allows users to assemble the pieces in any way they see fit.

Finally a theoretical discussion. The state machine construct I propose is deterministic. Although it appears similar to a Finite State Automaton (FSA), it is not. This state machine is capable of computation within its nodes, and has more node types than a FSA as well. The two are related, but are not identical.


Posted by AJ Guillon  (January 24, 2008)    |    Comments (0)

State Machine Node Types: enums vs. dynamic_cast

In the original YetiSim state machine implementation, I used polymorphism to represent the various types of nodes that could occur within the state machine. Each possible node type (Start, Final, Error, Join, Fork and State) was derived from a generic Node class. I implemented convenience functions like bool isStart(Node* node) { return dynamic_cast<Start*>(node); } to determine node types to drive the state machine execution algorithms.

I later realized that polymorphism was not required, and that I could implement a single Node class which contained an enum which identified the type of node. This could eliminate the need for calls to dynamic_cast with a relatively small increase in programming complexity. The curious cat that I am, I wondered if this implementation would have better performance than dynamic_cast for large data sets.  I wrote a sample application and timed the dynamic_cast against just looking up with an enum, and found that dynamic_cast indeed was slower.  However, with the -O2 optimization flag passed to gcc dynamic_cast seemed just as efficient.

The question now is whether or not all compilers will optimize dynamic_cast in the same way as gcc.  Could it be that this simple application was easily optimized, whereas a large scale program will show a performance hit with dynamic_cast?  These are questions I’d like to find answers to, however at this point the simple fact of the matter is that dynamic_cast must be executed at runtime to ensure the types are correct, whereas static_cast does not require any runtime overhead.  So, instead of relying upon compiler optimizations, I’ve decided to go ahead with the enum and static_cast solution.


Posted by AJ Guillon  (January 17, 2008)    |    Comments (0)

Complete State Machine Redesign

I first created the state machine construct for YetiSim in August 2007. Since that time, I’ve learned more about C++ and Threading Building Blocks. I’ve also had time to think about the design I chose, and read posts on the TBB forums from others who could benefit from a state machine construct too. I have realized that the state machine construct could be generalized for others to use, and that YetiSim could benefit from the separated component too. I would hope that if others find the state machine useful they would help write test cases, and improve the state machine.

I have undertaken the task of completely redesigning the state machine construct for contribution to official TBB.  After this has been done, YetiSim should be reduced to a few template arguments to the new state machine construct.  This also permits me to focus on the simulation aspects of YetiSim, and not on the implementation details of the state machine since it is packaged as a separate project.

I will be blogging here with more details as the redesign progresses.  Stay tuned for more updates.


Posted by AJ Guillon  (January 17, 2008)    |    Comments (0)