So, ignoring whether the code is good or not, what's the correct way to do this benchmark? Is there any way to read the timers atomically, to prevent your code from getting preempted in a tiny section, or to have the kernel notify you if it gets preempted? Maybe something can be hacked in with transactional memory infrastructure?
This benchmark runs 32 software threads on a machine with 8 CPU threads. Getting preempted a lot is a major part of what's being measured. But samples that got preempted at exactly the wrong spot need to be filtered out, which is tricky.
I'm sure there's a succinct way to do this using perf_event in C, but off the top of my head: instead of looking at the wall clock in isolation, you could compare it to the task clock to see whether there was any pause in execution. Just make sure you're actually counting the time spent in the kernel during whatever syscalls you're trying to include; IIRC that's just a flag you can set when opening the event.
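Here's a rough sketch of that wall-clock vs. task-clock comparison. To be clear about what it is not: it doesn't use perf_event at all. It swaps in plain clock_gettime with CLOCK_THREAD_CPUTIME_ID, which as far as I know counts both user and kernel CPU time for the calling thread on Linux. The names `now_ns`, `time_section`, `looks_preempted`, and the `slack_ns` threshold are all made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* One timing sample (hypothetical struct, not from any library). */
struct sample {
    int64_t wall_ns; /* elapsed CLOCK_MONOTONIC time */
    int64_t cpu_ns;  /* CPU time this thread consumed (CLOCK_THREAD_CPUTIME_ID) */
};

/* Read a clock as nanoseconds; returns -1 if clock_gettime fails. */
int64_t now_ns(clockid_t id) {
    struct timespec ts;
    if (clock_gettime(id, &ts) != 0)
        return -1;
    return (int64_t)ts.tv_sec * 1000000000LL + (int64_t)ts.tv_nsec;
}

/* Time fn(arg) on both clocks. The wall-clock reads bracket the
 * CPU-clock reads, so wall_ns should not come out smaller than cpu_ns
 * by more than clock granularity. */
struct sample time_section(void (*fn)(void *), void *arg) {
    int64_t w0 = now_ns(CLOCK_MONOTONIC);
    int64_t c0 = now_ns(CLOCK_THREAD_CPUTIME_ID);
    fn(arg);
    int64_t c1 = now_ns(CLOCK_THREAD_CPUTIME_ID);
    int64_t w1 = now_ns(CLOCK_MONOTONIC);
    struct sample s = { w1 - w0, c1 - c0 };
    return s;
}

/* Heuristic: if wall time exceeds CPU time by more than slack_ns, the
 * thread spent that gap off-CPU. slack_ns is an assumed tuning knob. */
int looks_preempted(struct sample s, int64_t slack_ns) {
    return (s.wall_ns - s.cpu_ns) > slack_ns;
}
```

You'd call `time_section(work, arg)` per sample and drop the ones where `looks_preempted(s, slack)` is true. Two caveats: the gap also includes time the thread voluntarily blocked (sleeping, waiting on a lock), not only preemption, and the clock reads themselves still aren't atomic, so this narrows the window rather than closing it.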