From pa.dec.com!decwrl!uunet!sparky!kent Thu Jul 25 08:58:56 PDT 1991
Article: 2538 of comp.sources.misc
Newsgroups: comp.sources.misc
Path: pa.dec.com!decwrl!uunet!sparky!kent
From: Martin Fouts <fouts@clipper.ingr.com>
Subject:  v21i038:  cloops - Livermore Loops in C, Part03/03
Message-ID: <1991Jul25.020740.28375@sparky.IMD.Sterling.COM>
Summary: The Livermore loops, rewritten from Fortran into C
X-Md4-Signature: 66bbb34c53801c38f1ef1b86799174b7
Keywords: Fortran, C, Benchmarks, Livermore Loops
Sender: kent@sparky.IMD.Sterling.COM (Kent Landfield)
Reply-To: Martin Fouts <uunet!clipper.clipper.ingr.com!fouts>
Organization: Sterling Software, IMD
References: <csm-v21i036=cloops.210138@sparky.imd.sterling.com>
Date: Thu, 25 Jul 1991 02:07:40 GMT
Approved: kent@sparky.imd.sterling.com
Lines: 714

Submitted-by: Martin Fouts <fouts@clipper.ingr.com>
Posting-number: Volume 21, Issue 38
Archive-name: cloops/part03
Environment: Cray2, Alliant, Convex, Amdahl, Sun, SGI

-----     cut here for Part03
#! /bin/sh
# This is a shell archive.  Remove anything before this line, then unpack
# it by saving it into a file and typing "sh file".  To overwrite existing
# files, type "sh file -c".  You can also feed this as standard input via
# unshar, or by typing "sh <file", e.g..  If this archive is complete, you
# will see the following message at the end:
#		"End of archive 3 (of 3)."
# Contents:  cloops.tex
# Wrapped by fouts@bozeman on Sun Jul 21 18:23:21 1991
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'cloops.tex' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'cloops.tex'\"
else
echo shar: Extracting \"'cloops.tex'\" \(28304 characters\)
sed "s/^X//" >'cloops.tex' <<'END_OF_FILE'
X% C-Loops
X%
X% A Paper Describing the C version of the Livermore Loops
X% Copyright (C) 1991 Martin Fouts
X%
X% This program is free software; you can redistribute it and/or modify
X% it under the terms of the GNU General Public License as published by
X% the Free Software Foundation; either version 1, or (at your option)
X% any later version.
X%
X% This program is distributed in the hope that it will be useful,
X% but WITHOUT ANY WARRANTY; without even the implied warranty of
X% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
X% GNU General Public License for more details.
X%
X% You should have received a copy of the GNU General Public License
X% along with this program; if not, write to the Free Software
X% Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
X%
X
X\documentstyle[12pt,twoside,fleqn,openbib]{article}
X\pagestyle{myheadings}
X\markboth{DRAFT of \today}{DRAFT of \today}
X
X\title{ The Livermore Loops in C
X\\ DRAFT of \today }
X
X\author{
XMartin Fouts \\
XMathematician \\
XNAS Systems Development Branch \\
XNASA Ames Research Center}
X
X
X\begin{document}
X\maketitle
X%\tableofcontents
X%\vfill
X%\pagebreak
X\markboth{DRAFT of \today}{DRAFT of \today}
X
X\begin{abstract}
X
XThe Livermore Loops were transliterated from Fortran into C.  The
Xresulting program was run on several machines which support both a C
Xand Fortran compiler.  The effort taken in porting the loops is
Xdiscussed, and the results are presented.  Some of the more
Xinteresting results are analyzed. Problems which arise from
Xusing the Loops as a benchmark suite are discussed. Some conclusions
Xabout numerical computation in C are drawn.
X
X\end{abstract}
X
X\section{Background}
X
XNumerical calculations, especially those involving vector operations,
Xhave long been the territory of Fortran, and it is widely accepted that
Xno other high-level language offers the same opportunity for compiler
Xoptimization of source code.
X
XThe Livermore Loops\cite{McMahon86} are widely used as a benchmark of
Xnumerically intensive calculation in Fortran, and results are available
Xfor a wide range of machines.  The loops were transliterated from
XFortran into C and results were gathered for several machines.  On
Xall machines the C compiler produces faster overall performance.
XBoth languages calculated the same results.
X
XIn performing the conversion, some difficulties using the Livermore
XLoops as a benchmark suite were uncovered.  They include the ease with
Xwhich some compilers optimize away timing loops and the lack of
Xsensitivity in the validation routines.  These difficulties are
Xdiscussed in section 3.
X
XSome semantic differences between C and Fortran were uncovered which
Xhave an effect on the precision of the results generated by the
Xprogram on certain machines.  These differences are discussed in
Xsection 3.
X
XThis experiment provides an interesting commentary on the state of
Xoptimizing compilers, the ability to write Fortran in any language,
Xand the utility of the Livermore Loops as a vector benchmark.
X
XFor the purpose of this paper, Fortran refers to the Fortran language
Xas described in the Fortran-77 standard\cite{ANSI77} and C refers to
Xthe programming language as described in \cite{Kern78}.  References to
Xspecific dialects will be labeled as such.
X
X\section{Overview of Transliteration}
X
XIn converting the Livermore Loops from Fortran to C, the major goal
Xwas to  preserve the original Fortran semantics.  Some changes were
Xnecessary, including:
X
X\begin{itemize}
X
X\item The routine \tt SPACE \rm was deleted.  Since the pointer
Xassignment isn't used in the Fortran implementation, this has no
Xeffect on the final results.
X
X\item The variable \tt lp \rm which controls the number of passes was
Xmade an input parameter.  The default is 100 passes.  When the default
Xis used, standard results are obtained.
X
X\item In kernel 15, arithmetic if statements were modified to
Xif-then-else forms.
X
X\item All occurrences of the form \tt x**2 \rm were replaced by \tt
Xx*x.
X
X\item \tt REPORT \rm was truncated.  Only the flop rates and averages
Xare reported.  Since the C version is only intended as a comparison to
XFortran on the same machine, the analysis portion is not needed.
X
X\item \tt SIGNAL \rm was reimplemented as several separate routines to
Xwork around differences in C and Fortran parameter passing.  Further,
X\tt VECTOR \rm was modified to call \tt lsignalN \rm for each array,
Xsince C has no equivalent to a \tt COMMON \rm block.  \tt SIGNAL \rm
Xwas renamed to \tt lsignalN \rm where \tt N \rm is replaced by 0, 1,
X2, or 3 to represent the number of dimensions in the array being
Xinitialized.  The separate routines are not strictly needed, but
Xclarify the intent of the code.
X
XSince \tt SIGNAL \rm is not part of the measured code, this has no
Xeffect on the reported performance.
X
X\item The number generation routine used by \tt SIGNAL \rm was
Xcarefully reimplemented to duplicate Fortran numeric semantics.
X
X\item \tt SUMO \rm was reimplemented as \tt sumo, sumo2, \rm and \tt
Xsumo3 \rm, depending on the number of dimensions in the array being
Xsummed.  As with \tt SIGNAL \rm, this was done for clarity and is in code
Xwhich is not part of that being measured.
X
X\item A command line option was added allowing only specified loops to
Xbe run.  This was done for debugging.  The reported results are from a
Xsingle run in which all loops were performed.
X
X\item All array references were converted from the 1 based subscript
Xused by Fortran to 0 based subscripts as used by C.  This was done to
Xavoid holes in the C arrays which would have resulted in different
Xmemory use patterns.  This would have an occasional small adverse
Ximpact on the C version, since some array references now involve
Xsubtracting one from the indices.
X
X\end{itemize}
X
XA major change which wasn't made was transposing array indices.
X
X\section{Language Differences Encountered}
X
XThree major differences between Fortran and C were encountered which
Xhave an impact on the transliteration.
X
X\begin{itemize}
X
X\item Different array index bases are used by the languages.  
X\item Fortran uses \tt COMMON \rm blocks to allocate global data,
Xwhile C merely requires declaring it.  
X\item C and Fortran have different rules for arithmetic precision.  Some
Xminor differences also exist.
X
X\end{itemize}
X
XEach of these will be discussed in turn.
X
XThe array index basis difference has two implications.  One of these is
Xthe way in which indices relate to memory locations in multiple
Xdimension arrays, and the other is the difference in array basing.
XC arrays are always indexed from 0 to \tt N - 1 \rm where \tt N \rm is
Xthe number of elements declared for that index of the array.  The
Xdefault basing in Fortran is to index from 1 to \tt N, \rm although 0
Xbasing is an option.  All of the arrays in the Livermore Loops are 1
Xbased.  In the transliteration, this was explicitly adjusted for.  In
Xthe main, this consists of adjusting loop indices to run from 0 to \tt
XN - 1 \rm rather than from 1 to \tt N \rm and subtracting 1 from
Xconstant indices as part of the transliteration.
X
XThis can also be a problem when integer values are calculated by a
Xprogram and then used as loop indices.  An attempt was made to find
Xall such places and convert them appropriately.  In a few cases, the
Xcalculations were left alone, and 1 was subtracted when the integer
Xvariable was used as a subscript.
X
XThe other difference in arrays has to do with multiple dimension
Xarrays.  For example, the Fortran array
X
X\begin{verbatim}
X      DIMENSION A(2,2)
X\end{verbatim}
X
X\noindent
Xis stored in ascending memory addresses as \tt A(1,1), A(2,1), A(1,2),
XA(2,2) \rm from low address to high address.  The equivalent C array
X
X\begin{verbatim}
X  float a[2][2];
X\end{verbatim}
X
X\noindent
Xis stored as \tt a[0][0], a[0][1], a[1][0], a[1][1] \rm from low
Xaddress to high address.  A complete transliteration would transpose
Xindices from left to right to preserve the actual memory ordering.
XThis was not done in this transliteration.  Assuming that the original
XFortran had been carefully coded for memory access, this would cause a
Xdegradation of performance of the C code on some systems.
X
XThe Livermore Loops utilize the fairly common Fortran practice of
Xincluding a number of related items together in a \tt COMMON \rm
Xblock in order to make global references to those items possible.
X
XOne of the side effects of this practice is that Fortran requires that
Xitems which are adjacently listed in a common block are adjacently
Xallocated in memory.  This makes possible the practice of initializing
Xan entire \tt COMMON \rm block by treating it as a single array in an
Xinitialization routine.  Figure~1 shows an example in Fortran.
X
X\begin{figure}[ht]
X\small
X\caption{\tt COMMON \rm usage in Fortran}
X\begin{verbatim}
X
XC
XC THIS ROUTINE USES THREE ELEMENTS OF A COMMON BLOCK BY SOME NAME
XC
X      SUBROUTINE AROUTINE
X      COMMON /CBLOCK/ A1, B1, C1
XC
XC USE A1, B1, C1
XC
X      RETURN
X      END
XC
XC THIS ROUTINE INITIALIZES THE COMMON BLOCK USING AN ARRAY
XC
X      SUBROUTINE INIT
X      COMMON /CBLOCK/ A(3)
X      DO 10 I = 1, 3
X   10 A(I) = I
X      RETURN
X      END
X
X\end{verbatim}
X\end{figure}
X
XThe initialization routines in the Livermore Loops take advantage of
Xthis.  There are several \tt COMMON \rm blocks, each of which contains
Xmany arrays.  There is an initialization routine which treats each \tt
XCOMMON \rm block as a single array, and initializes it in a single
Xloop.
X
XIt is possible to duplicate the adjacent storage effect of \tt COMMON
X\rm blocks in C.  Figure~2 shows an example of this usage.
X
X\begin{figure}[ht]
X\small
X\caption{Translating \tt COMMON \rm usage to C}
X\begin{verbatim}
X
X#include <stdio.h>
X
Xstruct cblock {
X  float A_[3];
X} CBLOCK;
X
X#define A1 CBLOCK.A_[0]
X#define B1 CBLOCK.A_[1]
X#define C1 CBLOCK.A_[2]
X#define A  CBLOCK.A_
X
Xvoid init()
X{
X  int i;
X  for (i = 0; i < 3; i++)
X    A[i] = i;
X}
X
Xmain()
X{
X  init();
X  fprintf(stdout,"A1 = %f, B1 = %f, C1 = %f\n", A1, B1, C1);
X}
X
X\end{verbatim}
X\end{figure}
X
XIt is possible to use this technique within the \tt struct \rm to duplicate
Xany possible naming which comes up within a \tt COMMON \rm block.  In
Xparticular, multiple arrays can be implemented by using either unions
Xand preprocessor defines, if they do not overlap, or names and
Xpreprocessor macro functions if they do overlap.
X
XBecause the only uses of the \tt COMMON \rm blocks within the Livermore
XLoops are to provide global data and to simplify initialization and
Xchecksum calculation, this approach was not taken; instead, each
Xvariable was declared as a C global, and the initialization routines
Xwere modified to explicitly initialize each array separately.  This
Xhas an impact on both the initialization and checksum software.
X
XOn machines for which \tt double \rm provides more precision than \tt
Xfloat \rm in C, numerical calculations are done differently in C than
Xin Fortran.  When all of the variables involved in a calculation are
Xof the same precision, Fortran calculates all intermediate results to
Xthat precision.  C promotes all variables to the precision of \tt
Xdouble, \rm does all calculations in double and converts to the
Xprecision of the result.  This yields different numerical results when
Xthere are subexpressions, for instance in the case
X
X\begin{verbatim}
Xresult = a * b - c * d
X\end{verbatim}
X
XFortran would calculate the two products, convert to the appropriate
Xprecision, perform the subtraction, and store the result.  C would
Xcalculate the two products in \tt double \rm precision, perform a \tt
Xdouble \rm precision subtraction, and then convert the result.  This
Xcan yield quite different numbers.
X
XThe routine which generates the initial data was carefully coded to
Xmimic Fortran conversion, so that the initial data would be the same
Xin the C case as in the Fortran case, and care was taken to see that
Xsubroutine calls were performed with the correct argument types, but
Xthe test routines calculate the results using default C promotion
Xrules.
X
X\section{Verifying the Transliteration}
X
XThe Convex compilers provide statement counting as a profiler option.
XThe Fortran and C versions of the loops were run with no optimization
Xand with statement profiling enabled.  The statement counts were
Xcompared on a statement by statement basis for the two programs.  All
Xloops were found to agree exactly.
X
XOnce the statement counts were completed, the two programs were run on
Xthe Cray 2, which has the same precision for \tt float \rm and \tt
Xdouble \rm in C.  The Fortran version was compiled with CFT77 release
X3.0 with all optimization disabled.  The C version was compiled with
Xthe UniCos 4.0 C compiler.
X
XLoops 1 through 5, 7, 11, 12, 13, 14, 16, 17, 20, and 24 produce
Xidentical checksums to 12 decimal places, which is nearly the limit of
Xfloating point precision on the Cray 2.  Loops 6, 8, 9, 10, 15, 18,
X21 and 23 all match to at least five decimal places.  Careful
Xinspection of the loops was made by running the Fortran and C versions
Xsimultaneously under a source level debugger on a Silicon Graphics
XIris 4D workstation.  Several additional bugs were corrected, but no
Xdifference in the numerical results was found.  It is suspected that
Xthe remaining differences are due to bugs in the way in which the
Xchecksums are calculated under C.
X
X\section{Overview of Results}
X
XThe various tables show the reported megaflop rates for the C and Fortran
Xversion of the Livermore Loops on several machines which were
Xavailable to the author.  Of the machines which were available, the
XIris and the Sun were chosen as representative of the current
Xgeneration of engineering workstations, the Amdahl as representative of
Xcurrent mainframes, the Convex and Alliant as representative of two
Xdifferent approaches to minisupercomputers, and the Cray 2 as a
Xrepresentative supercomputer.  All of the machines are running recent
Xreleases of the vendor's version of the Unix operating system.
X
XFor all machines, the overall performance of the C compiler was better
Xthan the overall performance of the Fortran compiler.  For the
Xprimarily Unix systems this was not surprising, but it was surprising
Xon the Cray 2.  A detailed comparison of the optimizations done by the
Xtwo Cray compilers on loops which were significantly different is
Xpresented below.
X
X\begin{table}[htbp]
X\caption{Results in Megaflops}
X\small
X\begin{tabular}{|r|rr|rr|rrr|} 
X\hline
X{\bf Manufacturer} & \multicolumn{2}{|c|}{Alliant}  &
X\multicolumn{2}{|c|}{Amdahl} & \multicolumn{3}{|c|}{Cray} \\
X{\bf Model} & \multicolumn{2}{|c|}{FX/8} & \multicolumn{2}{|c|}{5880} &
X\multicolumn{3}{|c|}{2} \\
X{\bf Compiler} & cc & fortran & cc & f77 & cc (vector) & cft77 (vector) &
Xcft77 (scalar) \\ 
X\hline
X1 & 10.72 & 15.02 & 15.02 & 1.88 & 764.77 & 96.89 & 13.53 \\
X2 & 7.76 & 0.78 & 11.64 & 1.16 & 30.28 & 18.45 & 6.40 \\
X3 & 7.51 & 1.00 & 15.02 & 1.72 & 21.39 & 69.57 & 11.29 \\
X4 & 6.55 & 0.72 & 9.00 & 1.03 & 18.32 & 32.10 & 4.16 \\
X5 & 7.06 & 0.71 & 12.00 & 1.09 & 20.36 & 9.85 & 10.67 \\
X6 & 5.95 & 0.49 & 10.82 & 1.01 & 21.29 & 9.08 & 3.90 \\
X7 & 15.92 & 19.10 & 18.73 & 2.51 & 929.28 & 147.31 & 16.35 \\
X8 & 8.07 & 14.26 & 11.56 & 1.07 & 41.67 & 113.39 & 17.55 \\
X9 & 14.72 & 10.30 & 25.75 & 1.72 & 748.91 & 113.07 & 13.88 \\
X10 & 4.54 & 5.45 & 9.09 & 0.55 & 21.40 & 35.32 & 6.68 \\
X11 & 4.29 & 0.43 & 7.50 & 0.75 & 11.00 & 9.02 & 7.57 \\
X12 & 4.00 & 5.99 & 6.66 & 0.60 & 246.53 & 30.61 & 5.24 \\
X13 & 1.68 & 0.27 & 4.48 & 0.24 & 6.44 & 1.75 & 1.66 \\
X14 & 3.06 & 1.16 & 5.46 & 0.52 & 14.61 & 7.22 & 3.88 \\
X15 & 5.29 & 2.54 & 5.86 & 0.63 & 23.05 & 6.94 & 6.66 \\
X16 & 4.54 & 0.45 & 10.60 & 1.59 & 15.28 & 4.00 & 3.86 \\
X17 & 7.79 & 1.82 & 13.63 & 2.73 & 36.74 & 8.08 & 7.81 \\
X18 & 10.62 & 7.26 & 9.07 & 0.88 & 64.29 & 100.59 & 11.61 \\
X19 & 7.27 & 0.73 & 12.12 & 1.21 & 27.08 & 12.40 & 11.34 \\
X20 & 10.91 & 1.47 & 17.93 & 2.40 & 50.41 & 12.33 & 11.39 \\
X21 & 7.13 & 1.82 & 6.07 & 0.60 & 15.16 & 23.43 & 6.13 \\
X22 & 6.06 & 5.15 & 8.59 & 0.86 & 47.05 & 90.76 & 5.94 \\
X23 & 7.10 & 0.93 & 9.33 & 0.91 & 30.69 & 9.41 & 8.99 \\
X24 & 3.53 & 6.00 & 10.00 & 1.00 & 8.32 & 1.95 & 1.87 \\
XHarmonic Mean & 5.59 & 1.09 & 9.37 & 0.87 & 22.38 & 9.26 & 5.7 \\
X\hline
X\end{tabular}
X\end{table}
X
X
X\begin{table}[htbp]
X\caption{Results in Megaflops}
X\small
X\begin{tabular}{|r|rr|rr|rr|}
X\hline
X{\bf Manufacturer} & \multicolumn{2}{|c|}{Convex} &
X\multicolumn{2}{|c|}{Silicon Graphics} & \multicolumn{2}{|c|}{Sun} \\
X{\bf Model} & \multicolumn{2}{|c|}{C1} & \multicolumn{2}{|c|}{Iris 4D}  &
X\multicolumn{2}{|c|}{Sun 3/250} \\
X{\bf Compiler} & vc & fc & Mips cc & Mips f77 & cc & f77 \\
X\hline
X1 & 1102.90 & 175.23 & 8.94 & 1.32 & 1.35 & 0.18 \\
X2 & 8.49 & 1.57 & 7.76 & 1.29 & 1.11 & 0.12 \\
X3 & 5.82 & 9.57 & 6.90 & 1.33 & 1.10 & 0.20 \\
X4 & 7.37 & 1.34 & 6.00 & 0.71 & 0.99 & 0.11 \\
X5 & 6.81 & 1.32 & 5.00 & 0.69 & 1.05 & 0.11 \\
X6 & 4.16 & 3.80 & 5.67 & 0.86 & 0.97 & 0.12 \\
X7 & 1811.50 & 331.52 & 13.38 & 1.75 & 1.93 & 0.18 \\
X8 & 10.82 & 247.95 & 12.73 & 1.25 & 1.41 & 0.16	 \\
X9 & 1428.90 & 195.82 & 13.21 & 1.56 & 1.66 & 0.16 \\
X10 & 2.57 & 7.15 & 8.26 & 0.91 & 0.74 & 0.09 \\
X11 & 4.72 & 2.18 & 3.57 & 0.63 & 0.72 & 0.08 \\
X12 & 2.55 & 16.34 & 3.57 & 0.62 & 0.75 & 0.08 \\
X13 & 1.73 & 0.56 & 2.24 & 0.14 & 0.45 & 0.05 \\
X14 & 4.24 & 1.34 & 4.27 & 0.37 & 0.73 & 0.08 \\
X15 & 3.94 & 1.65 & 5.14 & 0.61 & 1.41 & 0.23 \\
X16 & 5.18 & 0.88 & 7.57 & 0.88 & 1.18 & 0.17 \\
X17 & 7.55 & 1.31 & 11.36 & 1.82 & 0.99 & 0.16 \\
X18 & 8.50 & 11.38 & 9.99 & 1.12 & 1.35 & 0.16 \\
X19 & 7.89 & 1.83 & 5.51 & 1.01 & 0.96 & 0.12 \\
X20 & 7.17 & 1.60 & 10.83 & 1.42 & 1.72 & 0.25 \\
X21 & 3.97 & 7.66 & 6.93 & 0.91 & 0.84 & 0.12 \\
X22 & 5.60 & 101.18 & 6.87 & 0.90 & 1.94 & 0.32 \\
X23 & 6.59 & 2.52 & 9.07 & 1.05 & 1.21 & 0.14 \\
X24 & 90.46 & 9.16 & 4.35 & 0.63 & 1.13 & 0.14 \\
XHarmonic Mean & 5.62 & 2.37 & 6.09 & 0.72 & 1.02 & 0.13 \\
X\hline
X\end{tabular}
X\end{table}
X
X\section{Alliant Performance}
X
XThe Alliant Fortran compiler performs analysis to determine if a loop
Xis vectorizable or can be executed concurrently on multiple
Xprocessors.  If the loop can be executed concurrently and in vector
Xmode, it tries to find a way to use both vectorization and
Xconcurrency simultaneously.  The Alliant C compiler performs neither
Xvectorization nor concurrency analysis and always produces scalar
Xsingle-processor code.  The Alliant Fortran compiler suffers from
Xtrying too hard.  Sometimes it produces vector or concurrent code for
Xwhich the overhead of the code exceeds the gain from the facility.  In
Xthese cases, such as loops 3, 9, and 13, the scalar C code performs more
Xefficiently than the optimized Fortran code.  In other cases, such as
Xloops 1 and 7, the Fortran code outperforms the C code.  In no case
Xdoes the optimization give a speedup of more than 2 over the scalar
Xcode.
X
X\section{Amdahl Performance}
X
XThe Amdahl Fortran compiler is a straightforward port of the BSD Unix
Xf77 compiler.  It does not appear to have been tuned to the Amdahl
Xarchitecture at all and generates very poor code.  The Amdahl C
Xcompiler is more tuned to the architecture and in some cases, Amdahl
Xscalar code performance approaches that of the Cray 2. Because of the
Xlarge difference between the compilers, a detailed comparison was not
Xundertaken. 
X
X\section{Cray 2 Performance}
X
XThere are four loops (1, 7, 9, and 12) for which C achieves
X``impossible'' performance on the Cray 2, exceeding the maximum
Xperformance of a single processor.  These numbers are the result of an
Xartifact of the benchmark.  C outperforms the Fortran code on most
Xscalar loops (5, 11, 13, 16, 17, 19, 20, and 24), with one exception
X(22).  Fortran recognizes more loops as vectorizable than C, and
Xproduces better performance on the one loop (18) which both compilers
Xvectorize.  There is mixed performance on those loops which Fortran
Xvectorizes and C does not.  In some cases (2, 6, 14, 15, and 23) the
Xscalar C code is more efficient than the vector Fortran code.  In the
Xremaining cases (3, 4, 8, 10, and 21) the vector Fortran code is more
Xefficient than the scalar C code. 
X
XThe impossible performance of C is a result of an artifact of the
Xbenchmarks. To obtain a large enough time interval for meaningful
Xmeasurement, all of the loops are of the form:
X
X\begin{verbatim}
X
XFOR I = 1 TO NUMBER_OF_PASSES DO
X  BEGIN
X    do_the_loop_in_question
X  END
X
X\end{verbatim}
X
XIn the impossible cases, the loop in question is of a form similar to:
X
X\begin{verbatim}
X
XFOR J = 1 to ARRAY_LENGTH DO
X  BEGIN
X    X[J] := Y[J] op Z[J];
X  END
X
X\end{verbatim}
X
XIn this very simple case, it is possible for a reasonable compiler to
Xdetermine that the entire inner loop is invariant with respect to the
Xouter loop, move the inner loop outside the outer loop, and then throw
Xthe outer loop away because it doesn't compute anything.  If the
Xvariable on the left side of the equation never appears on the right
Xside, the loop is invariant and can be collapsed.  Close inspection
Xshows that the performance advantage of C on Loops 1, 7, 9, and 12 can
Xbe explained as a result of this optimization of the outer loop.  The
XFortran compiler misses this optimization.
X
XThe original Fortran version of Loop 2 contains a compiler directive
Xto force vectorization.  The C version does not force vectorization,
Xbut instead produces scalar code.  Further, the vector length is 101,
Xwhich is fairly short compared to the overhead introduced by setting
Xup the vector loop.  C produces faster code for this case, as a result
Xof not vectorizing the loop.  Loop 6 shows a similar phenomenon.  In
Xthis case, Fortran recognizes a conditional dependency and generates
Xcode to use vector instructions when the dependency doesn't occur.
XAgain the overhead is not justified and the C version is faster.  This
Xalso happens in loops 14 and 23. 
X
XIn other cases, the Fortran compiler vectorizes and the C compiler
Xdoes not, and the Fortran code runs faster than the C code.  The
Xperformance improvement varies between 1.5 and 3.2 times.
X
XIn all but one of the cases in which both languages generate scalar
Xcode, C generates more efficient code than Fortran.  The exception is
Xloop 22.  Fortran generates inline code for the \tt exp \rm
Xfunction, while C generates calls to a subroutine.  Although current C
Xcompilers don't usually generate inline code, the ANSI draft standard
Xprovides a mechanism for defining and implementing inlines, and some
Xcompilers do support the mechanism.
X
XOn the Cray 2, C generates shorter less convoluted loops and uses more
Xregister\footnote{Actually, local memory, which C treats as a large
Xregister file.} references than Fortran.  Overall, this appears to
Xgive C a scalar performance advantage over Fortran.  Compiling the
XFortran loops with vectorization disabled shows C scalar performance
Xto vary from 2 to 5 times faster than Fortran scalar performance.
X
X\section{Convex Performance}
X
XThe Convex compilers produce considerable documentation of their
Xactions.  Figure~3 shows part of the vectorization summary from the C
Xcompiler for the file containing the loops.  The Convex compiler
Xrecognizes a wide range of language constructs, including goto-driven
Xloops, as candidates for vectorization.  It is able to realize in some
Xcases that it should not vectorize a loop, and it is frequently able
Xto give good information about why a loop cannot be vectorized.
X
X\begin{figure}[ht]
X\small
X\caption{Vectorization using Convex C}
X\begin{verbatim}
X
XSource Iter.                                 Vector-       
XLine   Var.         Start     Stop     Step  ization     Reason
X---------------------------------------------------------------------------
X   42  k           *EXPR*    ipntp        2  None  Not enough vector code
X   53  k           *EXPR*        n        1  PART  Vector (50%)
X   99  ky          *EXPR*        n        1  FULL  Vector
X  149  l           *EXPR*       lp        1  None  Unable to distribute
X  167  ip          *EXPR*        n        1  PART  Vector (92%)
X  213  l           *EXPR*       lp        1  None  Not trip counted
X  266  k           *EXPR*        n        1  None  Inner loop: exit
X  329  l           *EXPR*       lp        1  None  Inner loop: unknown trips
X
X\end{verbatim}
X\end{figure}
X
XBoth the Convex C and Fortran compilers recognized the same loops as
Xvectorizable for the same reasons, except for loop 22, which only
XFortran recognized, and loop 24, which only C recognized.  The compiler which recognized
Xa loop as vectorizable produced much faster code.  If both compilers
Xproduced scalar code or both produced vector code, the C compiler
Xproduced more efficient code.  The C compiler appears to do a better
Xjob of managing register allocation and memory references and so
Xproduces code with lower memory bandwidth requirements.
X
X\section{Silicon Graphics Performance}
X
XThe Silicon Graphics Iris 4D uses a MIPS R2000\cite{Kane87} RISC
Xprocessor chip set, with a floating point accelerator.  RISC
Xarchitecture machines depend heavily on compiler technology to
Xgenerate efficient code in all cases, and it is not surprising that the
XMIPS machine performs well.  That Fortran did not fare as well as C
Xcan be attributed to the primary marketplace of the Iris, which uses
XC as its major programming language.  It appears that more effort has
Xgone into the C compiler than has gone into the Fortran compiler.
X
X\section{Sun 3 Performance}
X
XThe Sun 3 numbers show a classic example of a difference in code
Xgeneration quality.  Sun has spent a lot of time on the C compiler,
Xwhich is the mainstay of their product line, and not as much time on
Xthe Fortran compiler.  As a result, the C compiler produces more
Xefficient code than the Fortran compiler in all cases.
X
X\section{Conclusions}
X
XThese results demonstrate that on certain machines, using certain
Xcompiler implementations, it is possible for a C compiler to produce
Xmore efficient code than a Fortran compiler for a specific numerical
Xapplication.  Which compiler generates more efficient code appears to
Xdepend more on the maturity of the compiler than on the nature of the
Xmachine.  From the way in which the loops were transliterated, it is
Xsafe to say that it is possible to write C code which can be easily
Xoptimized, just as it is possible to write Fortran code which can be
Xeasily optimized.
X
XThis is not to say that C has all of the features which make Fortran a
Xgood choice for numerical calculation.  An informal poll of numerical
Xprogrammers who are familiar with both Fortran and C produced a
Xpartial list of features seen to be important which are not present in
Xthe C language.  The list included the integer exponentiation operator,
Xautomatic dimensioning of parameter arrays, complex numbers and
Xthe repeat count in formatted I/O.  Dynamic memory allocation and
Xpointers were viewed as an advantage of C over Fortran.
X
XCurrent compiler technology is making it more difficult to develop
Xreasonable benchmarks which measure what they intend to measure.
XAlthough the Livermore Loops have held up well over time, they are
Xstarting to fall to optimizing compilers.  In particular, the ability
Xto detect that a kernel is invariant with respect to its outer timing
Xloop has led to invalid results being returned for particular loops.
X
XAs McMahon pointed out, performance can be as much a function of the
Xmaturity of compiler technology as of the machine on which the code
Xruns.  This is as true for codes written in C as it is for Fortran.
XIt is also true for the relative performance of various compilers from
Xthe same vendor on a given machine.
X
XIt is possible to ``overoptimize.''  Vector operations include some
Xoverhead, such as pipeline start up, which must be accounted for by
Xthe optimizer.  Concurrent operations also introduce overhead.  Some
Xcompilers are more sophisticated about this than others.  The Convex
Xcompiler will not vectorize a loop if it believes that there is
Xinsufficient code to benefit.
X
XSome vendors appear to have paid more attention to vector optimization
Xin their Fortran compilers than to scalar, although many of these
Xloops could benefit from scalar optimization.  This is most apparent
Xin the difference between C and Fortran scalar performance on the Cray 2.
X
XIt is possible to write numerical codes in C and have them compiled
Xinto efficient code which produces correct results.  Any code which can
Xbe written in Fortran can also be written in C to produce the same
Xresults, and a good C compiler can generate an efficient program from
Xthe specified code.
X
X\bibliography{unix,misc}
X\bibliographystyle{plain}
X
X\end{document}
END_OF_FILE
if test 28304 -ne `wc -c <'cloops.tex'`; then
    echo shar: \"'cloops.tex'\" unpacked with wrong size!
fi
# end of 'cloops.tex'
fi
echo shar: End of archive 3 \(of 3\).
cp /dev/null ark3isdone
MISSING=""
for I in 1 2 3 ; do
    if test ! -f ark${I}isdone ; then
	MISSING="${MISSING} ${I}"
    fi
done
if test "${MISSING}" = "" ; then
    echo You have unpacked all 3 archives.
    rm -f ark[1-9]isdone
else
    echo You still need to unpack the following archives:
    echo "        " ${MISSING}
fi
##  End of shell archive.
exit 0

exit 0 # Just in case...


