Three months since the last post? Where are you? Is YetiSim dead, or just one of those projects that starts but never goes anywhere? Well the short answer is, I’ve taken a strategic retreat to really update my skills to make YetiSim even cooler. Now with some updated C++ and parallel skills, it’s time to get down to making YetiSim super cool.
YetiSim Blog
Archive for the 'YetiSim Development' Category
TBB execution_graph - Part 4: Introducing parallel_set_for
Yesterday I discussed a problem with processing interdependent entities with TBB. This doesn’t just pertain to simulations and to the execution_graph, but in general. There is no provided high-level construct which allows you to easily process elements of a range with interdependencies. Let me see if I can say this more clearly, say I have to process a collection of entities, but I also have to reprocess some of those entities over again in the algorithm. Perhaps I am processing 2D cells on a map, and have to complete a second pass to process the boundaries. This could certainly be done with the task_scheduler, and likely could be done with just the range and partitioner.
I am interested in a high level construct which would allow me to partition a set of elements into units which could be processed in parallel. This construct should then allow me to partition the set again to make another pass, and so on until I deem that the processing is completed. In this way, the problems discussed in Part 3 of my blog series on the execution_graph could be resolved. I know I promised code yesterday, but I’ll just show an idea today.
In order to ensure I’m conveying my ideas effectively, I’ve prepared some diagrams. Suppose I have a set of elements to process in some manner. I’ll represent this set as blobs of ink on a digital page, as shown below.

In order to process these elements, in the first pass I might partition the space into three segments. I do not have to process the entire space at once, just the segments which interest me. I’ve drawn lines around the elements I want to process. These three partitions should be processed in parallel.

Again after processing these segments I might not have completed my task. Next I will partition the space into another three segments which can be processed independently. Note that this time one of the segments only contains a single element to process, this is fine and entirely possible.

For the last step, I will partition the space one final time into two large segments. Once again these two segments will be processed in parallel.

I call the construct which will support this (tentatively), parallel_set_for. This conveys that we have a set of elements which are to be processed. We process subsets of this larger set independently, then partition the larger space into subsets again for further processing until the task is completed (as opposed to partitioning the subsets into smaller ones).
In contrast with the TBB parallel_for construct which simply divides elements for individual processing, the parallel_set_for construct might not even process all elements. Some might not affect computation so they might be omitted. Furthermore, the parallel_for construct is designed to divide work to keep all processors busy and simply complete after all elements have been processed. The parallel_set_for construct would proceed in steps, where independent subsets are selected and then processed.
In comparison to parallel_for, this processing method could form a bottleneck. If partitions are not selected properly, effort might be wasted because some partitions could have been processed in a previous step. In addition, a single element which is being processed might hold up multiple independent partitions which could process next without affecting results. I think that this could be compensated for with good interface design.
Well, that is what I’ve thought up so far. I welcome your comments. This is not a fully thought out idea, just an idea which sprung to mind today that I wanted to share with you. Good night!
TBB execution_graph - Part 3: More Concept Revisions, Dividing Work, and Arguments
I hoped to have written some skeletal code by now to show you, and start to point out the classes. However I find when I take on a software design, that I start to think about it a lot. I ponder how things interact while I’m in the shower, or find myself visualizing the program interactions, daydreaming while I’m cooking dinner or doing dishes. The past weekend has been filled with these moments, and now I’ll share my thoughts with you.
First, the class name transition is again misleading. I need a better name to represent the connections between nodes in the execution graph. The name link, or execution_path best convey the meaning. This is a small matter, but I wanted to give you the heads up.
Now, I’ll describe my thoughts over the weekend. Prepare yourself for a rant, and of course post comments if you have some ideas. The question is, what does the execution_graph execute… when it is “executed” something should be passed as an argument so that the execution_graph can base its decisions upon something. This might be an open file descriptor, an entity in a simulation, or perhaps absolutely nothing. Perhaps the decisions to make are programmed within the links themselves so arguments are not required. So let’s just pass a single thing, and even better let’s template this argument so a list of things could be passed. The state machine (I mean execution_graph) might be associated with one particular type, say ExecutionArgumentType.
This idea works well if we have a collection of things to process individually, say file descriptors for sockets. Each socket might be passed to an execution_graph for operations to take place. We can do even better, and use tbb::parallel_for with the auto_partitioner to divide up a vector of sockets and run their execution_graph in parallel. This is great for such easy examples, but what happens when we can’t do such a trivial division? What if we are processing something that requires fuzzy boundaries when we partition a space. For example, say we divide up a 2D plane into rectangles, and we have an algorithm which requires input from neighbors in a 2D cell for proper execution, like Conway’s game of life.
This is something that really ought to be handled by an execution_graph, but it’s not so easy unless the entire 2D board is passed as an argument. So the execution_graph can’t really do this kind of processing yet. But I’m not here to write Conway’s game of life in the execution_graph, I really don’t know how somebody would do it yet… but this doesn’t change the fact I need to give them the tools to do so. In my mind, I need to find a clever way of letting people smarter than me to do clever things with my constructs.
This is an open problem, I’d love to say I knew how to do this. The reality is, I will likely abstract the problem away into a class which breaks apart a larger chunk into smaller ones for processing, just like the TBB Range concept. It may be that I use that, but I’m not sure the interfaces are sufficient. People can then use this class in clever ways.
This brings me to the other subtopic in my blog title: dividing work. I need to provide users with a method by which they can determine which execution_graph gets what job, or at least dividing up a pile of work to do. I have a feeling that the solution to this problem will be incorporated with the former problem I stated, which is how to determine boundaries during processing.
That’s it for now, I’ll post another blog tomorrow night to keep you up to date with my design thoughts, and perhaps some code tidbits! Once things get more concrete, conversations should be a bit easier. It’s hard to convey my thoughts when I haven’t thought them all yet.
TBB execution_graph - Part 2: Revising Concepts, Class Design Outline
I decided today that the name state_machine is too deceptive, and resolved to find a better name. The term state machine is particularly deceptive because nodes are not required to be states, but can execute complex logic internally. After some brainstorming, I came up with the new name execution_graph. This conveys that the construct is a graph which is associated with program execution. I have also decided to remove the state node type, and replace it with two types: compute and IO. The intention behind this is that some execution stages will be IO intensive whereas others will be computationally intensive. Differentiating behind the type of complexity entailed will be of benefit when TBB supports parallel IO operations. For now, there will be little if any difference.
Class Design
In my first implementation of a state_machine (the predecessor to the execution_graph) I relied heavily on polymorphism and dynamic_cast. I discovered two problems with this approach. First off, I only required the polymorphism in the state node class which contained logic to execute. Second, I used dynamic_cast to evaluate node types at run time which degrades performance with unnecessary overhead. In reality separate classes are not required if each node type is permitted to contain functions which are executed upon visitation of that node. This is a bug disguised as a feature. I don’t know why somebody would want a join node or fork node to execute logic upon visitation, but it simplifies the interface. I’m getting ahead of myself, let me identify the critical components I see and discuss how I envision an implementation.
class execution_graph
This will be a template class, but more on that later. This class will be the main user interface and will provide facilities for the construction of a graph in which nodes contain functions that will be executed during evaluation at runtime. It will own all transitions, nodes, and execution instances. These owned objects will all be destroyed if the instantiated execution_graph goes out of scope, so it is up to the user to ensure it remains in scope for proper execution. Users will be provided with a function like getStart() which will return a start node for them to use to build the execution graph. The logic for the evaluation of the graph itself (that is, how the graph is traversed and executed) will be provided as a template argument for maximum flexibility, along with some other classes.
class node
The node class represents a particular node within the execution graph. It could be a compute, join, fork, IO, start, final, or error node. Each node instance may only be created by another node instance, so users may not arbitrarily create them. Also, nodes will contain a pointer to the execution graph which they are associated with, so that transitions may not be created between nodes in separate execution graphs. The node class stores several functions which are called with appropriate arguments during graph execution. The declaration for the variable type which stores the function is to be provided as a template argument for maximum flexibility (e.g. I might want to use boost.function, others might want to use raw function pointers). The argument type which is provided to the stored functions is also provided as a template argument.
class transition
This class will represent a transition between nodes in the execution graph, which might be guarded or unguarded. The transition class will be provided as a template argument to the execution_graph, since others might want their execution_graphs to have special transitions. For example, YetiSim will alter the execution_graph transition logic so that simulation time is advanced by special transitions. Rather than adjusting internal implementations of a transition class just for YetiSim, by providing it as a template argument others may customize the behavior of the execution graph.
class execution_logic
This class encapsulates graph execution logic, and is provided as a template argument to execution_graph. This class will determine the parallelization strategy used in evaluation of the execution_graph. It will also determine how transitions are handled, so if a user overrides the default transition class they will also have to adjust the execution_logic class. The main idea for this class, is that it will provide certain required functions to the execution_graph which drive actual execution.
TBB execution_graph - Part 1: Basic Ideas and Motivations
Not so long ago, I wrote a blog about the fact that I am separating the YetiSim state machine architecture into a generic component for contribution to the TBB project. I have no idea whether or not this component will be accepted by TBB officially, however others may find it useful so I am blogging about my design as I go. I use the term state machine to describe the construct I am building, but this is misleading. In August 2007, I started the research that lead to the YetiSim state machine construct, however it has evolved into something more sophisticated and without a proper name. Let’s start from the beginning and illustrate the initial motivations for the construct and what it does for us now.
I first looked toward TBB as a platform independent C++ threading library which could be used as the foundation of a simulation framework. Initially I was searching for a method to save and restore the stack, hence emulating coroutines for simulation. I was disappointed to learn that not only does TBB not provide any such low-level method, the usage of such methods is actually discouraged. After much thought, I realized that I could implement a state machine to provide the functionality which I required. The state machine would consist of start, final, error, fork, join and state nodes. I’ll give a brief description of the intended role of each node.
- Start Node: This is where execution of the state machine always begins, only one is permitted per state machine.
- Final Node: The normal end of an execution pathway, multiple nodes are permitted in a single state machine.
- Error Node: The abnormal termination of an execution pathway, the entire state machine will be terminated as a result. This node is used as a form of assertion.
- Fork Node: Splits a single execution pathway into multiple for potential parallel execution. Each pathway is executed independently of the others.
- Join Node: Merges multiple execution pathways back into one.
- State Node: Executes a C++ function or class member.
Each node has a corresponding symbol which I will not include here. These node types are based off state machine nodes found within UML. The nodes are joined by transitions which are either guarded or unguarded. The basic motivation is that an unguarded transition is always taken, whereas whether or not to proceed along a guarded transition is determined at runtime. Already I have omitted a critical detail, execution of the state machine proceeds literally. I will describe execution next.
When a state machine is executed, a pointer is created which points to the single start node within the state machine. Execution logic determines how execution proceeds by evaluating transitions and following them accordingly. Each node has a special meaning, and execution logic must take this into account while traversing the state machine. When a fork node is encountered, multiple pointers are created which might traverse the state machine in parallel. Similarly, join nodes indicate that multiple execution pathways should be tied together before execution proceeds. The join node will wait patiently for each execution pathway that can reach the join node to do so, and proceeds after that point.
In my opinion, the execution of a state machine will not have a significant runtime cost. The execution of a state machine provides a functionality that replaces the requirement for coroutines within simulation, since a transition might indicate that execution should not proceed until the simulation clock has been advanced by a certain amount. Similarly transitions might indicate probabilities, providing a functionality like that of a Markov chain. Since the logic of the state machine is dictated by the connections of nodes, they may be altered arbitrarily and potentially at runtime. This permits a highly dynamic programming environment with minimal additional cost. Keep in mind that saving and restoring the runtime stack would also have a significant runtime cost, and might require operating system calls. The state machine construct does not require any such calls, and should execute faster.
Earlier I described the name state machine to be misleading. This is because each state node may embed any logic, and not just represent a particular state. This construct is thus a superset of a state machine. Execution may proceed in many possible manners. Furthermore this state machine is easier to maintain than state machines implemented with switch logic. The adjustment of state machine logic does not have a high cost here, because only the representation of the state machine must be altered. A good design which incorporates the state machine concept will not require a reimplementation of existing code, rather just adjustment of transitions.
This construct is highly useful for simulations, but equally useful in other contexts. For example, this state machine might be highly useful for controlling parallel I/O when TBB itself supports it. It might also be the foundation for such an I/O library. Each state node might be labeled as computationally or I/O constrained and TBB could then manage the low-level details to ensure efficient execution with these hints. It also provides a simple interface to the TBB scheduler, and might permit designers to exploit parallelism in constructs that otherwise might be more complex. In fact, complex logic could be dissected into smaller chunks with this construct and represented easily as a system of nodes and transitions. This allows close inspection of the logic, and discussion with others to ensure that the system designed and requirements match properly. It is not difficult to construct a graphviz representation of the state machine, so it is fairly simple to inspect this logic visually to ensure proper execution.
I realize that programming languages and the computer execution itself may be considered to be a state machine. This is certainly true at a theoretical level, however adjustments to logic generally require a recompilation of source code. The state machine construct I propose has an easier method of dynamic adjustment. The state machine construct provides something like lego blocks, and thus allows users to assemble the pieces in any way they see fit.
Finally a theoretical discussion. The state machine construct I propose is deterministic. Although it appears similar to a Finite State Automaton (FSA), it is not. This state machine is capable of computation within its nodes, and has more node types than a FSA as well. The two are related, but are not identical.
State Machine Node Types: enums vs. dynamic_cast
In the original YetiSim state machine implementation, I used polymorphism to represent the various types of nodes that could occur within the state machine. Each possible node type (Start, Final, Error, Join, Fork and State) was derived from a generic Node class. I implemented convenience functions like bool isStart(Node* node) { return dynamic_cast<Start*>(node); } to determine node types to drive the state machine execution algorithms.
I later realized that polymorphism was not required, and that I could implement a single Node class which contained an enum which identified the type of node. This could eliminate the need for calls to dynamic_cast with a relatively small increase in programming complexity. The curious cat that I am, I wondered if this implementation would have better performance than dynamic_cast for large data sets. I wrote a sample application and timed the dynamic_cast against just looking up with an enum, and found that dynamic_cast indeed was slower. However, with the -O2 optimization flag passed to gcc dynamic_cast seemed just as efficient.
The question now is whether or not all compilers will optimize dynamic_cast in the same way as gcc. Could it be that this simple application was easily optimized, whereas a large scale program will show a performance hit with dynamic_cast? These are questions I’d like to find answers to, however at this point the simple fact of the matter is that dynamic_cast must be executed at runtime to ensure the types are correct, whereas static_cast does not require any runtime overhead. So, instead of relying upon compiler optimizations, I’ve decided to go ahead with the enum and static_cast solution.
Complete State Machine Redesign
I first created the state machine construct for YetiSim in August 2007. Since that time, I’ve learned more about C++ and Threading Building Blocks. I’ve also had time to think about the design I chose, and read posts on the TBB forums from others who could benefit from a state machine construct too. I have realized that the state machine construct could be generalized for others to use, and that YetiSim could benefit from the separated component too. I would hope that if others find the state machine useful they would help write test cases, and improve the state machine.
I have undertaken the task of completely redesigning the state machine construct for contribution to official TBB. After this has been done, YetiSim should be reduced to a few template arguments to the new state machine construct. This also permits me to focus on the simulation aspects of YetiSim, and not on the implementation details of the state machine since it is packaged as a separate project.
I will be blogging here with more details as the redesign progresses. Stay tuned for more updates.
YetiSim 0.2.0 Released: Major Speed Improvements
Today I committed a substantial number of changes to YetiSim to the SVN repository, and have incremented the release version to 0.2.0. This release boasts a 2X - 3X speed increase over 0.1.0. You can download this snapshot from the home page.
A number of interfaces both at the user and internal library level have been changed. The use of boost::shared_ptr has been entirely eliminated due to performance reasons. The MemoryManager class has been introduced, and all memory management operations are confined to this class. Admittedly, this class is fairly simple right now but that is because the tbb::scalable_allocator already does pooling for us, and probably does a pretty good job of memory management already.
This release is a step backward in terms of exception safety, however this will be addressed soon. Enjoy the new and improved YetiSim. Further performance improvements are on their way, so stay tuned.
Pointer and Reference Semantics
I have been refactoring YetiSim to move away from boost::shared_ptr to the new MemoryManager class, and replacing the shared_ptrs with references and pointers. Where should references be returned or passed as arguments, versus pointers? What meaning should be inferred by the programmer based on this? For example, what does the prototype X& foo(Y& y) mean vs. X* foo(Y* y). There must be a global standard for how to interpret pointers and references in prototypes in YetiSim, otherwise chaos will ensue.
After some discussions on #tbb, I decided to create and adhere to specific conventions within YetiSim. These are the conventions for arguments:
- References as constructor arguments: A reference that is used within a constructor argument may be stored by the constructed class. The class does not assume ownership of the referenced data, but may store an internal reference to it for future use. Be careful of this situation for class destruction.
- References as function arguments: If a reference has been passed as an argument to a function, then the function is free to use that argument for the duration of the call. Unless explicit in the documentation, the reference will not be saved for future use. There should be very few, if any, cases of functions saving references for later use.
- Pointers as a constructor argument: When a pointer is used in a constructor, ownership of the data is transferred to the constructed class. This class is now responsible to calling appropriate MemoryManager functions to delete the data. The pointer may naturally be saved for future use, however when passing the data to other functions or classes it should now be done by reference to convey ownership is not being passed.
- Pointers as a function argument: The function which has been called via a pointer assumes responsibility for ownership. The function may save the pointer for future use, but the function (or class if a member function) is now responsible for deleting the data by calling appropriate MemoryManager member functions.
Similarly there are conventions for the return part of the function prototype.
- References as a return type: If a reference is returned, then the object may be freely used, but ownership is not acquired. The reference may be stored for future use, but this has to be done with care in case the reference disappears. There must be a guarantee that the object cannot hold the reference if the referenced object disappears. A good example of this, is the ThreadOfExecution class. It holds StateMachineExecution references, but part of the garbage collection of a StateMachineExecution is to delete the ThreadOfExecution objects. Thus, a ThreadOfExecution could never reference a defunct StateMachineExecution object.
- Pointers as a return type: This represents a transfer of ownership from the called function. It means that the called function will no longer be responsible for the destruction of the object. If you called the function that returned a pointer, you now own that memory. You are now responsible for its destruction, or passing ownership to someone else.
Another complexity is the use of references to access an object that might have been destroyed after throwing an exception. This complexity must be part of the exception handling strategy, and must be considered within the design itself. For now, I ignore this problem because it will have to be dealt with in the exception handling strategy.
So in summary, a pointer means transfer of ownership. A reference means that you can just use the object freely, storing references to it or otherwise so long as you know that it won’t be destroyed when you don’t intend for it to be. These are the global semantics for YetiSim references and pointers. If there are changes, I will edit this blog or post a comment.
YetiSim State Machine Branch Prediction
A few years ago, I read about the Intel Itanium processor. One of the ideas that stuck with me was branch prediction. The concept in my head is likely different now than what the processor actually does, but the idea is what is important here. When a branch is encountered if we guess an execution path, we can start processing as if we should proceed down that path. If our guess is wrong, we take a minor hit in performance. A correct guess could result in a substantial performance gain.
The YetiSim state machine has guarded transitions, that are a lot like branches. If we are able to guess which branch we will proceed along, then we could begin processing as if that is the correct path. This is an example of how YetiSim performance can improve by further refinement of the implementation, without affecting the user interface. I like this capability within YetiSim, since I believe simulation users (like processor users) should not be overly affected by the new architecture. Readers knowledgeable about the Itanium could certainly point out problems with my analogy right now, but let’s just stay focused on YetiSim.
Branch prediction within YetiSim has a natural mapping to the state machine execution. Statistics would have to be gathered during execution to improve the predictions made. Fortunately, YetiSim will be bundled already with just the tools required. This is not something that I would do however, I would have to ask someone with more statistics experience than me to build the models for the branch predictions. This is something that I would like to experiment with, after YetiSim becomes more stable. The goal is to construct a stable simulator right now, not to wander around with a never ending list of desired features. Still, this is a glimpse of the possible future of YetiSim.
A more immediate use of the branch statistics, is to dynamically adjust the order of branch evaluation so that the list of guards to consider is sorted by most frequently taken transitions. This will reap some benefits at minor cost, and is easily implemented.
I’m not entirely sure about the details of the branch prediction, I have some ideas that I will outline but will not discuss in detail right now. One of the ideas I have been toying with, is to separate out data from YetiSim so that it could be stored in a database. Alternatively, it could fall back to just ordinary memory allocated from the heap. The main idea is that if users restricted themselves to using YetiSim data storage constructs, that data could be saved elsewhere. The execution of state machines could be captured with a data snapshot, and a database which provides transaction support could be used. Only the correct path is committed to the database, the other transactions are silently discarded. I’m not a database expert, someone else would have to provide the details here… but this is a general idea which could be of use. Unfortunately it has disadvantages too… classes must use YetiSim data storage constructs.
Anyways, here’s a glimpse of what I’m thinking about.