
Re: CommsTime times?

Dyke/Oyvind et al.,

Enclosed is the "commstime.occ" file from the latest KRoC/Linux/PC release.
It lets the user choose whether to use the delta process from the course
library (that contains the correct PAR output) or to use a special seq.delta
process (that does its outputs in SEQ).  It makes heavy use of our course
I/O processes for communicating with the runner of the program - these would
need to be modified if another I/O library were being used.  It picks up
prefix and succ from the course library as well - those PROCs are listed
separately at the end of this email.


  1. commstime, running as compiled code for a general-purpose processor,
     spends almost all its time in the run-time kernel - time spent in
     user application code is negligible.  What it measures, therefore, are
     overheads imposed by that kernel - primarily those for channel
     communication, context switching and process startup/shutdown.

  2. commstime has a tiny memory footprint.  It should fit entirely within
     the fastest cache on most target processors.  Accordingly, it gives
     the lowest results for those overheads you are going to see.  [In real
     programs, the overheads introduced by cache misses need to be added.
     There is another program to benchmark those overheads (as reported
     in the "KRoC" paper in WoTUG-19). I can make this available to any
     who ask.  That benchmark indicates roughly a factor of 3 increase
     in those overheads if you miss the cache every time (but depends,
     of course, upon your caching architecture and relative memory speeds).]

  3. when quoting commstime figures, you must say whether the delta has
     PAR output or SEQ output.

  4. the simplest time to quote is the cycle time (i.e. the time for
     commstime to produce and consume each natural number).

  5. For the SEQ version of delta, there are 4 channel communications
     (of one INT) per cycle.  So, dividing the cycle time by 4 gives the
     communication overhead.  However, each channel communication has 2 ends
     - input and output.  In KRoC, each input or output event results in
     a context switch, so dividing the cycle time by 8 gives the context
     switch (plus half an INT communication) overhead.  On the transputer,
     the second process arriving at a communication point did the transfer
     and rescheduled the first - but then carried on itself without
     switching context (I think).  We could easily do this in KRoC but
     choose not to (because the overhead of scheduling another process,
     rather than ourselves, is so low and it seems somehow fairer).

  6. With the PAR version of delta, the communication loads are the same
     as for the SEQ delta, but there is the additional overhead of starting
     up and shutting down one new process per cycle.  As with the transputer,
     KRoC spawns a new process for all PAR components except for the last
     one (which is run by the process that executes the PAR).  So, if we
     subtract the commstime cycle time with a SEQ-delta from the same for
     a PAR-delta, we get the startup/shutdown overhead for a process.
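
As a worked example of notes 5 and 6, here is a minimal sketch (it is not
part of commstime.occ) that applies the same arithmetic to a pair of cycle
times.  The PROC names "report" and "derive" are made up for illustration,
and the two constants are just the KRoC 1.2.3a figures quoted later in this
message - substitute your own measurements.  It assumes the course library
is available, as for commstime itself.

#USE "course.lib"

-- hypothetical helper: prints a label followed by a time in nanoseconds
PROC report (VAL []BYTE label, VAL INT nanosecs, CHAN OF BYTE out)
  SEQ
    out.string (label, 0, out)
    out.number (nanosecs, 0, out)
    out.string (" nanosecs*c*n", 0, out)
:

-- hypothetical top-level process: derives the overheads from two cycle times
PROC derive (CHAN OF BYTE keyboard, screen, error)
  VAL INT seq.cycle IS 1272:    -- cycle time with the SEQ delta (nanosecs)
  VAL INT par.cycle IS 1472:    -- cycle time with the PAR delta (nanosecs)
  SEQ
    -- 4 channel communications per cycle
    report ("Communication    = ", seq.cycle / 4, screen)
    -- 2 ends per communication, so 8 context switches per cycle
    report ("Context switch   = ", seq.cycle / 8, screen)
    -- the PAR delta adds one process startup/shutdown per cycle
    report ("Startup/shutdown = ", par.cycle - seq.cycle, screen)
:

With those inputs it should report 318, 159 and 200 nanoseconds respectively.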


  KROC version 1.0 targeting sparc-sun-solaris2.5.1 (driver V1.36)
  Ultra Sparc II - 400 MHz.  (Many users)

    cycle time (SEQ delta) =  850 nanoseconds (+/- 20)
    cycle time (PAR delta) =  875 nanoseconds (+/- 20)
    context switch time    =  107 nanoseconds (+/-  2)
    startup/shutdown       =       (too noisy to tell)

  KROC version 1.2.3a targeting i386-pc-linux (driver V1.35)
  Pentium III - 500 MHz.  (Single user)

    cycle time (SEQ delta) = 1272 nanoseconds (+/-  1)
    cycle time (PAR delta) = 1472 nanoseconds (+/-  1)
    context switch time    =  159 nanoseconds (+/-  0)
    startup/shutdown       =  200 nanoseconds (+/-  2)

The following is for the upcoming Linux release, which has a more efficient
interface between user code and the occam kernel:

  KROC version 1.3.0beta targeting i386-pc-linux (NOT RELEASED YET)
  Pentium III - 500 MHz.  (Single user)

    cycle time (SEQ delta) =  724 nanoseconds (+/-  1)
    cycle time (PAR delta) =  868 nanoseconds (+/-  1)
    context switch time    =   91 nanoseconds (+/-  0)
    startup/shutdown       =  144 nanoseconds (+/-  2)

  KROC version 1.3.0beta targeting i386-pc-linux (NOT RELEASED YET)
  Compiled with a new in-lining flag (==> slightly larger executables)
  Pentium III - 500 MHz.  (Single user)

    cycle time (SEQ delta) =  450 nanoseconds (+/-  1)
    cycle time (PAR delta) =  557 nanoseconds (+/-  1)
    context switch time    =   56 nanoseconds (+/-  0)
    startup/shutdown       =  107 nanoseconds (+/-  2)

For comparison, here are the JCSP results (the CommsTime.java code is
one of the jcsp-demos in the release).  Please note that these timings
are in *microseconds* (not *nanoseconds*).  JCSP channels are currently
implemented on top of the standard Java threads model - which uses
native OS threads (which have high overheads).  This accounts for the
100-fold (roughly) higher overheads.  These JCSP timings were also
taken on a slower machine - which accounts for the further factor of 2.

  JCSP version 1.0-rc2 (running under Sun's JDK1.2.2)
  Pentium II - 266 MHz.

    cycle time (SEQ delta) =  212 microseconds (+/-  2)
    cycle time (PAR delta) =  226 microseconds (+/-  2)
    context switch time    =   27 microseconds (+/-  0)
    startup/shutdown       =   24 microseconds (+/-  4)

It's not that the Java version is slow - it's the occam kernel that's so
quick!  Work is progressing on building a special JVM that includes the
basics of the occam kernel to support directly the JCSP primitives.
This was presented by Jim Moores at the recent WoTUG-23 conference
and is in the proceedings.

I don't have CCSP figures (Jim?).  But these will be much the same as
those for occam - CCSP and the Linux/PC KRoC share the same kernel.


#USE "course.lib"

--{{{  PROC seq.delta (CHAN OF INT in, out.0, out.1)
PROC seq.delta (CHAN OF INT in, out.0, out.1)
  WHILE TRUE
    INT n:
    SEQ
      in ? n
      out.0 ! n    -- the two outputs are done one after the other
      out.1 ! n
:
--}}}

--{{{  PROC consume (VAL INT n.loops, CHAN OF INT in, CHAN OF BYTE out)
PROC consume (VAL INT n.loops, CHAN OF INT in, CHAN OF BYTE out)
  TIMER tim:
  INT t0, t1:
  INT value:
  SEQ
    --{{{  warm-up loop
    VAL INT warm.up IS 16:
    SEQ i = 0 FOR warm.up
      in ? value
    --}}}
    WHILE TRUE
      SEQ
        tim ? t0
        --{{{  bench-mark loop
        SEQ i = 0 FOR n.loops
          in ? value
        --}}}
        tim ? t1
        --{{{  report
        VAL INT microsecs IS t1 MINUS t0:
        VAL INT64 nanosecs IS 1000 * (INT64 microsecs):
        SEQ
          out.string ("Last value received = ", 0, out)
          out.number (value, 0, out)
          out.string ("*c*n", 0, out)
          out.string ("Time = ", 0, out)
          out.number (microsecs, 0, out)
          out.string (" microsecs*c*n", 0, out)
          out.string ("Time per loop = ", 0, out)
          out.number (INT (nanosecs/(INT64 n.loops)), 0, out)
          out.string (" nanosecs*c*n", 0, out)
          out.string ("Context switch = ", 0, out)
          out.number (INT ((nanosecs/(INT64 n.loops))/8), 0, out)
          out.string (" nanosecs*c*n*n", 0, out)
        --}}}
:
--}}}

--{{{  PROC comms.time (CHAN OF BYTE keyboard, screen, error)
PROC comms.time (CHAN OF BYTE keyboard, screen, error)

  BOOL use.seq.delta:

  SEQ

    --{{{  announce
    SEQ
      out.string ("*c*nCommstime in occam ...*c*n*n", 0, screen)
      out.string ("Using the SEQ-output version of the delta process*c*n", 0, screen)
      out.string ("yields a more accurate measure of context-switch time*c*n*n", 0, screen)
      out.string ("Using the PAR-output version carries an extra overhead*c*n", 0, screen)
      out.string ("of one process startup/shutdown per Commstime loop*c*n*n", 0, screen)
      out.string ("By comparing **loop** times between the SEQ and PAR versions,*c*n", 0, screen)
      out.string ("the process startup/shutdown overhead may be deduced*c*n*n", 0, screen)
    --}}}

    ask.bool ("Sequential delta? ", use.seq.delta, keyboard, screen)
    out.string ("*nCommstime starting ...*c*n*n", 0, screen)

    CHAN OF INT a, b, c, d:
    PAR
      prefix (0, b, a)
      IF
        use.seq.delta
          seq.delta (a, c, d)    -- the one defined above
        TRUE
          delta (a, c, d)        -- the one that does a parallel output
      succ (c, b)
      consume (1000000, d, screen)

:
--}}}


demo_cycles.occ (in the course library):
--	Demo cycles
--	Copyright (C) 1984 P.H. Welch
--	This library is free software; you can redistribute it and/or
--	modify it under the terms of the GNU Lesser General Public
--	License as published by the Free Software Foundation; either
--	version 2 of the License, or (at your option) any later version.
--	This library is distributed in the hope that it will be useful,
--	but WITHOUT ANY WARRANTY; without even the implied warranty of
--	MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
--	Lesser General Public License for more details.
--	You should have received a copy of the GNU Lesser General Public
--	License along with this library; if not, write to the Free Software
--	Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307  USA

-- COPYRIGHT : P.H.Welch, 1984

--{{{  basic cycles
PROC id (CHAN OF INT in, out)
  WHILE TRUE
    INT x:
    SEQ
      in ? x
      out ! x
:

PROC succ (CHAN OF INT in, out)
  WHILE TRUE
    INT x:
    SEQ
      in ? x
      out ! x PLUS 1  -- let's ignore overflow
:

PROC plus (CHAN OF INT in.1, in.2, out)
  WHILE TRUE
    INT x.1, x.2:
    SEQ
      PAR
        in.1 ? x.1
        in.2 ? x.2
      out ! x.1 PLUS x.2  -- let's ignore overflow
:

PROC delta (CHAN OF INT in, out.1, out.2)
  WHILE TRUE
    INT x:
    SEQ
      in ? x
      PAR              -- the two outputs are done in parallel
        out.1 ! x
        out.2 ! x
:

PROC prefix (VAL INT n, CHAN OF INT in, out)
  SEQ
    out ! n
    id (in, out)
:

PROC tail (CHAN OF INT in, out)
  INT any:
  SEQ
    in ? any
    id (in, out)
:
--}}}