
Re: commstime not scaling



Hi,

> I was recently comparing commstime values for my python implementation of
> CSP-style primitives and kroc and came away with some surprising (to me
> anyway :o) results.  I tried the commstime metrics on two different
> machines, a 2.4 GHz Compaq laptop and a 1.0 GHz Dell.
>
> for the python implementation (using threading.Thread):
>
> Compaq: 935 milliseconds for the commstime loop
> Dell: 895 milliseconds
>
> for kroc (1.4.0-pre2):
>
> Compaq: 440 nanoseconds
> Dell: 385 nanoseconds
>
> It wasn't terribly surprising to me that the results for the python
> implementation would be similar (no chance it would fit in cache, using OS
> scheduled threads, etc), but it was surprising to me that results for kroc
> were similar on both machines and that in both cases the slower machine had
> better results.
>
> (The reasons could certainly be different; when there are that many orders
> of magnitude difference between implementations it would be naive to assume
> that the same factors *must* account for superficially similar results.)
>
> What factors (other than cache effects) might account for the lack of
> scaling for kroc?  (Fred, you might have the best insight here.)
>
> My tentative theory for the slower machine getting better results is that it
> has fewer background processes (they started out life as hard drive image
> clones, but I turned off a few background processes because the dell also
> has only 256 MB Ram, and for some of the things I'm doing it was paging to
> disk enough to be disruptive), but I'm also open to other possibilities at
> this point and need to test that theory more fully.
>
> I'd also be interested in hearing about counterexamples, if anybody has
> them.

I've also seen these kinds of discrepancies from commstime, but I'm still
building a list of reasons (I'm pretty sure it's a combination of
factors).

KRoC 1.4.0-pre2 out of the box is probably more optimised for a Pentium-3
than for later processors (I've only just got myself a P4 -- not HT -- so
that may change before long).  One of the things that really kills
commstime is its output -- the extra CPU used by scrolling an xterm made
a significant impact.  It's probably worth noting that KRoC TIMERs report
wall-clock times, so anything else that happens on the machine will impact
it (I guess this is the case for python too).  Other terminals that aren't
xterms (gnome and KDE flavours) use even more CPU for output and scrolling.
Redirecting the output to a file (and probably hard-wiring SEQ or PAR
delta) will give more accurate figures.  Hyperthreaded machines fare
better here, since the output overhead is often soaked up by the
hyperthread (in a fairly random way, whatever Linux decided to schedule
and where, plus to which "virtual CPU" interrupts get delivered).
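
As a rough illustration of keeping output out of the timed region, here is
a minimal Python sketch of a fixed-length commstime-style ring.  It is not
the original poster's implementation: the names (commstime, n_values) are
made up for the example, and queue.Queue(maxsize=1) is only a buffered
stand-in for a synchronous CSP channel.  It records both wall-clock and CPU
time so that a gap between the two hints at interference from elsewhere on
the machine.

```python
import queue
import threading
import time


def commstime(n_values=2000):
    """Time n_values trips round a prefix -> delta -> succ ring.

    Returns (last value seen, wall seconds per value, CPU seconds per value).
    The channels are one-slot queues (an assumption of this sketch), not
    true rendezvous channels.
    """
    a, b, c, d = (queue.Queue(maxsize=1) for _ in range(4))

    def prefix():
        a.put(0)                  # inject the initial value
        while True:
            a.put(c.get())        # then copy c -> a forever

    def delta():
        while True:
            v = a.get()
            d.put(v)              # sequential delta: consumer first...
            b.put(v)              # ...then on to succ

    def succ():
        while True:
            c.put(b.get() + 1)

    # Daemon threads: they simply block once the consumer stops reading.
    for body in (prefix, delta, succ):
        threading.Thread(target=body, daemon=True).start()

    wall0 = time.perf_counter()   # wall-clock timer
    cpu0 = time.process_time()    # CPU time used by this process only
    last = None
    for _ in range(n_values):     # fixed length, no printing inside the loop
        last = d.get()
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return last, wall / n_values, cpu / n_values


if __name__ == "__main__":
    last, wall, cpu = commstime()
    # One line of output, after timing has stopped.
    print(f"last={last} wall={wall * 1e6:.1f}us cpu={cpu * 1e6:.1f}us per value")
```

Running it with stdout redirected to a file takes the terminal out of the
picture altogether.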

On particular CPUs, the length of the pipeline and cost of a cache miss
definitely have an effect.  Although it's a bit counter-intuitive, shoving
in a couple of "nop"s in the generated assembly often makes things go
faster (because, presumably, it avoids a pipeline stall -- there are
status registers in some CPUs that can tell me this, but I haven't
investigated that yet).  In theory, commstime shouldn't cause many cache
misses, but all the extra stuff that happens during output might trigger
that.  Cache effects are fairly hard to pin down;  since the CPU caches
on physical and not virtual addresses, what's going to get evicted and
when is pretty hard to tell.

Memory arrangement can also have some impact -- two DIMMs go faster than
one DIMM on many >= P4 systems (because accesses can be interleaved).  This
is possibly a feature of the motherboard and the memory controller in the
chipset.

On interrupt-processing overhead: it's often a good idea to unplug the
network cable before running a benchmark (more network traffic == more
overhead, and windoze machines seem to enjoy broadcasting to the network,
multiplayer games especially).

In my experience, if you want a good baseline for occam benchmarks, make
sure it's the only thing running.  This typically involves fixing the
length of the benchmark [1], stuffing in a sched_setscheduler() call to
use a FIFO scheduler, and running as root.  You won't see any response
from the machine until the benchmark's done, though.  Just to add, the
times we've reported in various papers are for normal environments (not
SCHED_FIFO'd ones).

[1] If the benchmark doesn't quit on its own, Ctrl-C won't kill it
(neither will SysRq, afaik).
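
For the python side of the comparison, the same idea can be sketched with
the os module's wrappers around sched_setscheduler().  This is only an
illustration under assumptions: run_with_fifo and its priority argument are
hypothetical names, the calls are only available on Linux-like systems, and
the benchmark passed in must be of fixed length for the reason given in [1].
Without root (or the relevant capability) the call fails, and the sketch
simply falls back to the normal scheduler.

```python
import os


def run_with_fifo(benchmark, priority=1):
    """Run benchmark() under SCHED_FIFO if permitted.

    Returns (benchmark result, True if SCHED_FIFO was actually in effect).
    The previous policy is restored afterwards.
    """
    used_fifo = False
    old_policy = old_param = None
    if hasattr(os, "sched_setscheduler"):
        try:
            old_policy = os.sched_getscheduler(0)   # 0 == this process
            old_param = os.sched_getparam(0)
            os.sched_setscheduler(0, os.SCHED_FIFO, os.sched_param(priority))
            used_fifo = True
        except OSError:
            # Typically a PermissionError when not running as root.
            used_fifo = False
    try:
        result = benchmark()
    finally:
        if used_fifo:
            os.sched_setscheduler(0, old_policy, old_param)
    return result, used_fifo


if __name__ == "__main__":
    result, used_fifo = run_with_fifo(lambda: sum(range(1_000_000)))
    print("SCHED_FIFO in effect:", used_fifo)
```

Reporting whether the FIFO policy was actually in effect makes it harder to
mix up figures from the two kinds of environment.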

Finally, certain applications and daemons that are running really damage
benchmarks, particularly those that wake up from select() periodically,
do nothing (or very little) then go back to sleep.  Animated GIFs on
webpages, clock applications in the dock/wharf/etc., things that poll
removable devices periodically to see if you've put something in, etc.
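
One rough way to see whether other activity got in the way of a run is to
count context switches around it.  The sketch below is not something the
figures above relied on; it is a diagnostic using Python's resource module
(Unix only), and count_switches is a made-up name.  A high involuntary count
suggests the benchmark was pushed off the CPU, although the count also
includes ordinary timeslice expiry, so treat it as an indicator only.

```python
import resource


def count_switches(benchmark):
    """Run benchmark() and report context switches charged to this process.

    Returns (result, involuntary switches, voluntary switches).
    """
    before = resource.getrusage(resource.RUSAGE_SELF)
    result = benchmark()
    after = resource.getrusage(resource.RUSAGE_SELF)
    involuntary = after.ru_nivcsw - before.ru_nivcsw   # preempted by the kernel
    voluntary = after.ru_nvcsw - before.ru_nvcsw       # blocked or yielded
    return result, involuntary, voluntary


if __name__ == "__main__":
    result, inv, vol = count_switches(lambda: sum(range(5_000_000)))
    print(f"involuntary={inv} voluntary={vol}")
```

Comparing the involuntary count with and without the usual desktop clutter
running gives a feel for how much a given daemon or applet costs.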


Cheers,

-- Fred
