CMSC 455 Lecture 1, Introduction, overview, floating point

    <- previous    index    next ->

Lecture 1, Numerical Errors

Introduction:
  Hello, my name is Jon Squire and I have been using computers
  to solve numerical problems since 1959. I have about 1 million
  lines of source code, in more than 15 languages,
  written over more than 50 years.
  How can that be? Check the numerical computation:
  1,000,000/50 years is 20,000 lines per year.
  20,000/200 working days per year is 100 lines per working day.
  Link to file type, file count, line count.

  With a lot of reuse, cut-and-paste, same programs and
  data files including scripts for many languages on many
  operating systems, easy.

  On a job, 20,000/(50 weeks*5 days per week) is 80 lines per day.
  80/8 hours is 10 lines per hour. You can do that.
  You may not save every line you type. sad.

Overview:
  You will be writing 6 small programs and a project in a language
  of your choice.
  Full details and sample code in a few lnguages will be provided.
  You will always have a weekend between a homework assignment
  and the due date.
  You may use whatever language or languages you like and you
  may experiment with other languages. Examples will be provided
  in C, Java, Python, Ruby, Matlab and others. You will see many
  languages including Fortran, Ada, Lisp, Pascal, Delphi, Scheme...
  The point is that the "syntactic sugar" of any language does
  not mean much. You may sometime convert code from some language
  into a language you like better.

  I have found some programs that I wrote can run faster in
  Java, Python, Ada, Fortran, Matlab than they run in optimized C.
  Well, the Python and Matlab use efficient library routines
  that are written in Fortran. We will cover threads and
  multiprocessor HPC concepts. There is a full course, CMSC 483
  Parallel and Distributed processing where you get to program
  a multiprocessor. 

  You will be exposed to toolkits. Very valuable to help you
  produce more and better software with less effort. Over time
  you may develop a toolkit to help others. Toolkits are
  available for numerical computation, graphics, AI, robotics,
  any many other areas.
 
  Read the syllabus.


In this course there may be no exactly correct answer.

This lecture will provide the definitions and the intuition for
  absolute error
  relative error
  round off error
  truncation error

  Then how to call intrinsic function and elementary functions.

Some terms will be used without comment in the rest of the course.
Learn them now. You should run the sample code to increase your
understanding and belief.

"Absolute error" A positive number, the difference between the
exact (analytic) answer and the computed, approximate, answer.

"Relative error" A positive number, the Absolute Error divided
by the exact answer. (Not defined if the exact answer is zero.)

Most numerical software is first tested with one or more
known solutions. Then, the software may be used on
problems where the solution is not known.

Given the exact answer is 1000 and the computed answer is 1001:
The absolute error is 1
The relative error is 1/1000 = 0.001

For a set of computed numbers there are three common absolute errors:
"Maximum Error" the largest error in the set.
"Average Error" the sum of the absolute errors divided by the number
                in the set.
"RMS Error" the root mean square of the absolute errors.
            sqrt(sum_over_set(absolute_error^2)/number_in_set)

Given the exact answer is 100 answers of 1.0 and we
computed 99 answers of 1.0 and one answer of 101.0:
The maximum error is 100.0     101.0 - 1.0
The average error is 1.0       100.0/100
The RMS error is 10.0          sqrt(100.0*100.0/100)

In some problems the main concern is the maximum error, yet
the RMS error is often the best intuitive measure of error.
Generally much better intuitive measure than average error.

Almost all Numerical Computation arithmetic is performed using
IEEE 754-1985 Standard for Binary Floating-Point Arithmetic.
The two formats that we deal with in practice are the 32 bit and
64 bit formats. You need to know how to get the format you desire
in the language you are programming. Complex numbers use two values.

                                          older
        C       Java    Fortran 95        Fortran     Ada 95        MATLAB    Python R
        ------  ------  ----------------  ----------  ------------  --------  --------
32 bit  float   float   real              real        float         N/A       float
64 bit  double  double  double precision  real*8      long_float    'default' 'default'

complex
32 bit  'none'  'none'  complex           complex     complex       N/A       N/A  
64 bit  'none'  'none'  double complex    complex*16  long_complex  'default' 'default'

'none' means not provided by the language (may be available as a library)
N/A means not available, you get the default.

IEEE Floating-Point numbers are stored as follows:
The single format 32 bit has
    1 bit for sign,  8 bits for exponent, 23 bits for fraction
                                          about 6 to 7 decimal digits
The double format 64 bit has
    1 bit for sign, 11 bits for exponent, 52 bits for fraction
                                          about 15 to 16 decimal digits

Some example numbers and their bit patterns:

   decimal
stored hexadecimal sign exponent  fraction                 significand 

                                     The "1" is not stored |
                                                           |
                    31  30....23  22....................0  |
   1.0
3F 80 00 00          0  01111111  00000000000000000000000  1.0   * 2^(127-127) 

   0.5
3F 00 00 00          0  01111110  00000000000000000000000  1.0   * 2^(126-127)

   0.75
3F 40 00 00          0  01111110  10000000000000000000000  1.1   * 2^(126-127)

   0.9999995
3F 7F FF FF          0  01111110  11111111111111111111111  1.1111* 2^(126-127)

   0.1
3D CC CC CD          0  01111011  10011001100110011001101  1.1001* 2^(123-127)
 

                          63  62...... 52  51 .....  0
   1.0
3F F0 00 00 00 00 00 00    0  01111111111  000 ... 000  1.0    * 2^(1023-1023)

   0.5
3F E0 00 00 00 00 00 00    0  01111111110  000 ... 000  1.0    * 2^(1022-1023)

   0.75
3F E8 00 00 00 00 00 00    0  01111111110  100 ... 000  1.1    * 2^(1022-1023)

   0.9999999999999995
3F EF FF FF FF FF FF FF    0  01111111110  111 ...      1.11111* 2^(1022-1023)

   0.1
3F B9 99 99 99 99 99 9A    0  01111111011  10011..1010  1.10011* 2^(1019-1023)
                                                                           |
                        sign   exponent      fraction                      |
                                                before storing subtract bias

Note that an integer in the range 0 to 2^23 -1 may be represented exactly.
Any power of two in the range -126 to +127 times such an integer may also
be represented exactly. Numbers such as 0.1, 0.3, 1.0/5.0, 1.0/9.0 are
represented approximately. 0.75 is 3/4 which is exact.
Some languages are careful to represent approximated numbers
accurate to plus or minus the least significant bit.
Other languages may be less accurate.

Now for some experiments for you to run an the computer of your choice
in the language of your choice.

In single precision floating point print:
    10^10 * sum( 1.0, 10^-7, -1.0)         answer should be  1,000
    10^10 * sum( 1.0, 0.5*10^-7, -1.0)     answer should be    500
    expression 10^10 * ( 1.0 + 10^-7 -1.0) answer should be  1,000
               10000000000.0*(1.0+10.0E-7-1.0)
The order of addition is important.
Adding a small number to 1.0 may not change the value.
This small number is less that what we call "epsilon".

error_demo1.adb Ada source code
error_demo1_ada.out output

error_demo1.c C source code
error_demo1_c.out output
epsilon.c C double and float source code
epsilon_c.out double and float output

error_demo1.f90 Fortran source code
error_demo1_f90.out output

error_demo1.java Java source code
error_demo1_java.out output

error_demo1.py3 Python source code
error_demo1_py3.out output

error_demo1.m Matlab source code
error_demo1_m.out output

error_demo1.r R source code
error_demo1_r.out output

Remember 10^-7 is 0.0000001, not a power of 2.
Thus, can not be stored exactly.
1.0e-16  = 10^-16 not exact because not a power of 2

Also, floating point arithmetic is performed in registers with
more bits than can be stored. Thus, as shown below, you may
get more precision than you expect. Do not count on it.
epsilon.c showing more precision source code
epsilon.out showing forced store output


We will cover, in the course,  methods of reducing error in areas:

statistics  sum x^2 - (sum x)^2  vs  sum(x-mean)

polynomial definition, evaluation Horner's rule vs x^N

approximation, e.g. sin(x)  truncation error vs roundoff error

derivatives

indefinite integral, definite integral, area

partial differential equations

and show sample code in many languages:
Ada 95, C, Fortran 95, Java, Python, MatLab and R

Error accumulation when computing standard deviation.
Subtracting large numbers loses significant digits.
 123456 - 123455 = 1.00000  only 1 significant digit
With only 6 digits representable
 123456.00 - 123455.99 = 1.00000 yet should be 0.010000


Two ways of computing the standard deviation are shown,
with the errors indicated for various sets of data:
Cases:
stddev(1, 2,..., 100)
stddev(10, 20,..., 1,000)              should just scale 
stddev(100, 200,..., 10,000)
stddev(1,000, 2,000,..., 100,000)
stddev(10,000, 20,000,..., 1,000,000)
stddev(10,001, 10,002,..., 10,100)     should be same as first

See computed values in  .out  files:

sigma_error.c
sigma_error_c.out

sigma_error.f90
sigma_error_f90.out

sigma_error.adb
sigma_error_ada.out

sigma_error.java  float
sigma_error_java.out

sigma_errord.java  double
sigma_errord_java.out

sigma_error.py
sigma_error_py.out

sigma_error.m
sigma_error_m.out

Using different algorithms, expect slightly different results.
Using different types, float or double, may get very different results.

Iteration needing uniform step size should use multiplication,
not addition as shown in:
bad_roundoff.c
bad_roundoff_c.out


The elementary functions are sin, cos, tan, exp, sqrt
and inverse forms asin, acos, atan, log, power
and reciprocal forms cosecant, secant, cotangent,
and hyperbolic forms sinh, cosh, tanh,
and inverse hyperbolic forms asinh, acosh, atanh, ...

The intrinsic functions are built into the language, e.g. abs (sometimes)

Note that Fortran 95 and Java need no extra information to get the
real valued elementary functions. "C" needs #include <math.h>
while Ada 95 needs 'with' and 'use' Ada.Numerics.Elementary_functions .

Note that Ada 95 overloads the function names and provides
single precision as 'float' and double precision as 'long_float'.
ef_ada.adb Ada 95, float and long float

Note that Fortran 95, java, python, overload the function names and provides
the same function name for single and double precision.
Fortran 95 names single precision as 'real' and
double precision as 'double precision'.
ef_f95.f90 Fortran 95, real and double

Note that Java provides double precision as 'double' for variables and constants.
ef_java.java Java, only double
Hyper.java create your own hyperbolic

Note that C provides double precision only as 'double' and constants.
ef_c.c C, only double is available
ef_c.out nan means not a number, bad input

Note that MATLAB has the most functions and all functions are
automatically double or double complex as needed.
ef_matlab.m MATLAB everything
ef_matlab.out automatic complex

Note that Python needs  import math  and can list available functions
test_math.py Python many, automatic conversion
test_math_py.out Python output
test_math.py3 Python many, automatic conversion
test_math_py3.out Python output


Just for fun: Power of 2 "tree"

A small program that prints epsilon, 'eps' and the largest and
smallest floating point numbers, run on one machine, shows:

float_sig.c running, may die 
type float has about six significant digits 
eps is the smallest positive number added to 1.0f that changes 1.0f
type float eps= 5.960464478E-08 
type float small= 1.401298464E-45 
this could be about 1.0E-38, above shows un-normalized IEEE floating point
type float large= 1.701411835E+38 

type double has about fifteen significant digits 
type double eps= 1.11022302462515654E-16 
type double small=4.940656458E-324 
this could be about 1.0E-304, above shows un-normalized IEEE floating point
type double large=8.988465674E+307 

The program is float_sig.c

Using 2^n = 10^k/3.32 and n bits has a largest number 2^n -1 and
k digits has a largest number 10^k -1. 24 bits would be 7 digits.
53 bits would be 16 digits, the more optimistic significant
digits for IEEE floating point.

Notes: I have chosen to keep most WEB pages plain text so that
they may be read by all browsers. Some referenced pages may
be .pdf, .jpg, .gif, .ps or other file types.
Many math books use many Greek letters. You may want to refer
to various alphabets.

If you want to stretch your concept of numbers,
check out the Continuum Hypothesis

Beware of easy methods of printing, related to epsilon
eps.py Python2 "print" Python3 "print( )"
eps.f90 Fortran "print" 

You may use any language you want. I do not run your code.
Turn in paper in class or submit your source code and output for grading.
It is OK to ask for an example in a language I have not
provided. For various "C", I just provide .h and .c.
You may add to .h files for C++
  #ifdef __cplusplus
  extern "C" {
  #endif
  // header file function prototypes

  #ifdef __cplusplus
  }
  #endif

You may use Fortran and C object files in many
languages, including Java, Python, Matlab, etc.

Last updated 12/13/2021

    <- previous    index    next ->

Lecture 1, Numerical Errors

Other links

Go to top