VTune still does not work, but I’m working on that. In the meantime, Zach showed me how to use gprof. The flat profile for a simulation run of 100,000 clocks was:
Each sample counts as 0.01 seconds.
     %  cumulative      self                 self    total
  time     seconds   seconds       calls   s/call   s/call  name
 14.99       17.35     17.35   137532699     0.00     0.00  boost::detail::atomic_exchange_and_add(int*, int)
 12.22       31.49     14.15   114224959     0.00     0.00  boost::detail::atomic_increment(int*)
  4.48       36.67      5.18    31662490     0.00     0.00  boost::detail::atomic_conditional_increment(int*)
  4.06       41.37      4.70                                std::bad_alloc::bad_alloc()
  2.95       44.79      3.42   117077426     0.00     0.00  boost::detail::shared_count::~shared_count()
  2.94       48.20      3.41     5342911     0.00     0.00  yeti::ThreadOfExecution::executeNodeLogic()
  2.54       51.14      2.95    97465810     0.00     0.00  boost::detail::shared_count::shared_count(boost::detail::shared_count const&)
  2.36       53.87      2.73   131417618     0.00     0.00  boost::detail::sp_counted_base::release()
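The output is messy to read, but the important finding is that a large share of execution time is spent on shared_ptr operations: the reference-counting entries listed above (the atomic_* functions, shared_count, and sp_counted_base::release) alone account for roughly 40% of the samples. This indicates that the internals of YetiSim have to be adjusted so that, as Zach suggests, we pass const shared_ptr<T>& where possible instead of copying the pointer.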
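To make the cost concrete, here is a minimal sketch of the difference between passing a shared_ptr by value and by const reference. It uses std::shared_ptr rather than boost::shared_ptr so that it compiles without Boost, on the assumption that the two behave the same way here. Node and both function names are hypothetical and are not taken from YetiSim.

```cpp
#include <memory>

// Hypothetical stand-in for a simulation object; not a YetiSim type.
struct Node {
    int value = 0;
};

// Pass by value: the parameter is a copy of the caller's pointer, so the
// reference count is atomically incremented on entry and decremented
// again when the parameter is destroyed on exit.
long useCountByValue(std::shared_ptr<Node> node) {
    return node.use_count();
}

// Pass by const reference: no copy is made, so the reference count is
// not touched at all.
long useCountByConstRef(const std::shared_ptr<Node>& node) {
    return node.use_count();
}
```

With a single owner in the caller, useCountByValue reports 2 while useCountByConstRef reports 1. The extra count in the first case is the atomic increment and decrement pair that would be paid on every such call.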
I’ll be looking at the code in detail to determine how this problem can be addressed and how performance can be improved. I suspect the solution will be to create containers that actually own the objects, and to use const references everywhere else. This should keep the main benefit of shared_ptr (automatic cleanup of the owned objects) while avoiding most of the reference-counting overhead.
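A rough sketch of what such an owning container could look like, assuming C++14 or later for std::make_unique. NodeStore, Node, and sumValues are hypothetical names for illustration only. The real design will depend on how YetiSim objects are created and how long they need to live.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a simulation object; not a YetiSim type.
struct Node {
    explicit Node(int v) : value(v) {}
    int value;
};

// Hypothetical owning container: it is the sole owner of every Node, and
// the Nodes are destroyed automatically when the store is destroyed.
class NodeStore {
public:
    // Creates a Node owned by the store and returns a non-owning reference.
    // Each Node is heap-allocated, so the reference stays valid even when
    // the vector reallocates, for as long as the store itself is alive.
    const Node& add(int value) {
        nodes_.push_back(std::make_unique<Node>(value));
        return *nodes_.back();
    }

    std::size_t size() const { return nodes_.size(); }

    // Bounds-checked access; throws std::out_of_range on a bad index.
    const Node& at(std::size_t i) const { return *nodes_.at(i); }

private:
    std::vector<std::unique_ptr<Node>> nodes_;
};

// Everything else works through const references: no reference counting
// and no atomic operations on the hot path.
int sumValues(const NodeStore& store) {
    int total = 0;
    for (std::size_t i = 0; i < store.size(); ++i) {
        total += store.at(i).value;
    }
    return total;
}
```

The trade-off is that plain references do not keep an object alive, so the store has to outlive everything that refers to its contents.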