Hi,

> I was recently comparing commstime values for my python implementation of
> CSP-style primitives and kroc and came away with some surprising (to me
> anyway :o) results.  I tried the commstime metrics on two different
> machines, a 2.4 GHz Compaq laptop and a 1.0 GHz Dell.
>
> for the python implementation (using threading.Thread):
>
> Compaq: 935 milliseconds for the commstime loop
> Dell: 895 milliseconds
>
> for kroc (1.4.0-pre2):
>
> Compaq: 440 nanoseconds
> Dell: 385 nanoseconds
>
> It wasn't terribly surprising to me that the results for the python
> implementation would be similar (no chance it would fit in cache, using OS
> scheduled threads, etc), but it was surprising to me that the results for
> kroc were similar on both machines and that in both cases the slower
> machine had better results.
>
> (The reasons could certainly be different; when there are that many orders
> of magnitude difference between implementations it would be naive to assume
> that the same factors *must* account for superficially similar results.)
>
> What factors (other than cache effects) might account for the lack of
> scaling for kroc?  (Fred, you might have the best insight here.)
>
> My tentative theory for the slower machine getting better results is that
> it has fewer background processes (they started out life as hard-drive
> image clones, but I turned off a few background processes because the Dell
> also has only 256 MB RAM, and for some of the things I'm doing it was
> paging to disk enough to be disruptive), but I'm also open to other
> possibilities at this point and need to test that theory more fully.
>
> I'd also be interested in hearing about counterexamples, if anybody has
> them.

I've also seen these kinds of discrepancies from commstime, but I'm still
building a list of reasons (I'm pretty sure it's a combination of factors).
KRoC 1.4.0-pre2 out of the box is probably more optimised for a Pentium-3
than for later processors (I've only just got myself a P4 -- not HT -- so
that may change before long).

One of the things that really kills commstime is its output -- the extra CPU
used by scrolling an xterm made a significant impact.  It's probably worth
noting that KRoC TIMERs report wall-clock time, so anything else that
happens on the machine will affect the figures (I guess this is the case for
python too); there's a small C sketch of the wall-clock vs CPU-time
difference a few paragraphs below.  Terminals other than xterm (the GNOME
and KDE flavours) use even more CPU for output and scrolling.  Redirecting
the output to a file (and probably hard-wiring the SEQ or PAR delta) will
give more accurate figures.

Hyperthreaded machines fare better here, since the output overhead is often
soaked up by the hyperthread (in a fairly random way, depending on whatever
Linux decided to schedule and where, plus which "virtual CPU" interrupts get
delivered to).

On particular CPUs, the length of the pipeline and the cost of a cache miss
definitely have an effect.  Although it's a bit counter-intuitive, shoving a
couple of "nop"s into the generated assembly often makes things go faster
(presumably because it avoids a pipeline stall -- there are status registers
in some CPUs that could tell me this, but I haven't investigated that yet).
In theory, commstime shouldn't cause many cache misses, but all the extra
work that happens during output might trigger some.  Cache effects are
fairly hard to pin down; since the CPU caches on physical rather than
virtual addresses, what gets evicted and when is pretty hard to tell.
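To make the wall-clock point concrete, here's a minimal C sketch (nothing to
do with commstime itself; the file name and loop count are made up) that
times a busy loop both ways -- CPU stolen by other processes (a scrolling
xterm, daemons, etc.) shows up in the wall-clock figure but not the CPU one:

    /* wallclock-vs-cpu.c -- illustrative only: the difference between
     * wall-clock time and process CPU time around a busy loop.
     * Compile with:  gcc -O2 -o wvc wallclock-vs-cpu.c
     */
    #include <stdio.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    static double wallclock_us (void)
    {
        struct timeval tv;

        gettimeofday (&tv, NULL);
        return (tv.tv_sec * 1000000.0) + tv.tv_usec;
    }

    static double cputime_us (void)
    {
        struct rusage ru;

        getrusage (RUSAGE_SELF, &ru);
        return (ru.ru_utime.tv_sec * 1000000.0) + ru.ru_utime.tv_usec
             + (ru.ru_stime.tv_sec * 1000000.0) + ru.ru_stime.tv_usec;
    }

    int main (void)
    {
        volatile long i, sink = 0;
        double w0 = wallclock_us (), c0 = cputime_us ();

        for (i = 0; i < 200000000L; i++) {    /* stand-in for the benchmark loop */
            sink += i;
        }
        printf ("wall-clock: %.0f us, CPU: %.0f us\n",
                wallclock_us () - w0, cputime_us () - c0);
        return 0;
    }

Run it once on a quiet machine and once while dragging a window around and
the wall-clock number moves while the CPU number barely does.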
Memory arrangement can also have some impact -- two DIMMs go faster than one
on many P4-and-later systems (because accesses can be interleaved).  This is
possibly a feature of the motherboard and the memory controller in the
chipset.

On interrupt-processing overhead: it's often a good idea to unplug the
network cable before running a benchmark (more network traffic == more
overhead, and windoze machines seem to enjoy broadcasting to the network,
multiplayer games especially).

In my experience, if you want a good baseline for occam benchmarks, make
sure it's the only thing running..  This typically involves fixing the
length of the benchmark [1], stuffing in a sched_setscheduler() call to use
the FIFO scheduler (there's a minimal C sketch in the P.S. below), and
running as root.  You won't see any response from the machine until the
benchmark's done, but..

[1] if the benchmark doesn't quit on its own, Ctrl-C won't kill it (neither
    will SysRq afaik).

Just to add, the times we've reported in various papers are for normal
environments (not SCHED_FIFO'd ones).

Finally, certain applications and daemons really damage benchmarks while
they're running, particularly those that wake up from select() periodically,
do nothing (or very little), then go back to sleep: animated GIFs on web
pages, clock applications in the dock/wharf/etc., things that poll removable
devices to see if you've put something in, and so on.

Cheers,
-- Fred
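P.S.  For anyone who wants to try the SCHED_FIFO trick, a minimal C sketch
(Linux-specific; the file name and the choice of maximum priority are just
for illustration, not what we actually do) looks something like this:

    /* fifo-wrap.c -- minimal sketch: switch the current process to the
     * SCHED_FIFO real-time class before running a benchmark.  Must be run
     * as root; until the process blocks or exits, nothing of equal or lower
     * priority gets the CPU, so make sure the benchmark terminates!
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <sched.h>

    int main (void)
    {
        struct sched_param sp;

        sp.sched_priority = sched_get_priority_max (SCHED_FIFO);
        if (sched_setscheduler (0, SCHED_FIFO, &sp) != 0) {
            perror ("sched_setscheduler (are you root?)");
            return EXIT_FAILURE;
        }

        /* ... run the fixed-length benchmark here, e.g. by calling into
         * it or exec()ing it ... */

        return EXIT_SUCCESS;
    }

Whatever goes in the middle has to finish on its own, for the reason in [1]
above.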