Basic Statistics

Over the last few years I have been collecting statistics books. I never took statistics in college since everyone I knew who took introductory statistics hated the class. Perhaps the reason for this is that statistics is poorly taught. Or perhaps people find statistics boring because they do not see how to apply statistics to applications that they find interesting.

Perhaps only sex or our survival focuses our fascination as much as money. Statistical mathematics are an indispensable tool in finance (e.g., portfolio analysis, risk analysis, statistical arbitrage). My interest in statistics arose from working on wavelet techniques for filtering and analyzing financial data (financial time series, like stock market daily close prices). Statistics provides a tool to understand whether a particular modeling technique may make money.

The rest of this web page consists of some notes on basic statistical functions. These statistical functions are not only useful for data analysis, but they are also building blocks for other techniques (like the calculation of the Hurst exponent). I've done little more than list the equations. In some cases I list the reference where I got the equation. See the reference list below for a list of statistics books. The C++ code that implements these functions is published here as well.

Simple Linear Regression

Linear regression plots a line through a cloud of points. This is shown in the plot below.


The data points are from section 6.1 of Statistics Manual by Crow, Davis and Maxfield, Dover Press.

Given a set of values, X0...XN-1 and Y0...YN-1, linear regression calculates the constant coefficients a and b for the line

The coefficient b (the slope of the line) is calculated with the equation

Where is the mean of the x values and is the mean of the y values.

The once the coefficient b is known, the coefficient a, the y-intercept, can be calculated using the equation

When calculating the linear regression it is also important to calculate the error statistics, to give a measure of how closely the linear regression line fits the data. The standard deviation of the points around the regression line is


From Cartoon Guide to Statistics and Introductory Statistics for Business and Economics

The standard error of the regression coefficient b is


From Cartoon Guide to Statistics and Introductory Statistics for Business and Economics

Correlation

Calculation of the regression line is closely related to the calculation of the correlation. Two values are correlated when a change in one causes a change, in the same direction, in the other. For example, if a respected stock market analyst publishes a report that states that she believes that a stock will go up, the is some probability that the report will be positively correlated with a rise in the stock. Two values may be anti-correlated when a change in one causes a change in the opposite direction in the other. For example, a positive change in the polls for a politician, W, who favors deficits may be negatively correlated with the direction of US dollar currency futures, D. Or, put more simply as deficits go up, the value of the dollar in foreign exchange tends to go down. The correlation coefficient, r, has a value -1 <= r <= 1. A correlation of 0 means that the two values are unrelated. If 0 < r <= 1, then there is a positive correlation. If -1 <= r < 0, then there is a negative correlation.

The value R2 is calculated using the equation


From Statistics Manual by Crow et al

The correlation is the square root of R2, times the sign of the regression coefficient b (e.g., the slope of the line).

The Autocorrelation Function (ACF)

Autocorrelation is the correlation of a data set with itself, offset by n-values. For example, autocorrelation with an offset of 5 would correlate the data set {s0, s1, s2, s3 ... sn-5} is correlated with {s4, s5, s6, s7 ... sn}. The autocorrelation function is the set of autocorrelations with offsets 1, 2, 3, 4 .. limit, where limit <= n/2.

The equation for the autocorrelation function (ACF) is

This function is related to the autocovariance, with a forward step of k elements.

The ACF equation I've used here (and implemented in software) is a modified version of the ACF given in Non-linear time series models in empirical finance by Philip Hans Frances and Kick van Kijk, Cambridge University Press, 2000, Section 2.2. Any errors are mine. I have gotten the same plots as Frances has published, using his data, so I have some reason to belive that my version is a correct translation.

Standard Deviation

The standard deviation of a data distribution is calculated from the equation below. Here the standard deviation is calculated from the square root of the "unbiased" estimate of the variance (where n-1 rather than n is used as a divisor). Some statistics text books use a form of the standard deviation where n is the devisor, but calculators and statistics software use the version shown here. Again, in the equation below, is the mean.

Math Packages vs. Writing Your Own Code

So far I've implemented all the statistics functions I've used in either C++ or Java. This has the advantage that I've developed a fairly complete understanding of the functions I've implemented. Math packages like MatLab, Mathematica and S+ provide support for basic statistics and include packages that support additional functions like wavelets. These packages also provide some level of support for plotting the results. So far I have not been able to justify the significant license fees for this software. Although implementing software forces a better understanding of the statistical functions being used, it is very time consuming. These statistical functions are a tool for what I'm trying to acomplish, rather than an end in themselves. So better math support is becoming increasingly attractive. There are three open source (or "public domain") packages that look interesting:

  1. Octave

    To quote the Octave web page:

    GNU Octave is a high-level language, primarily intended for numerical computations. It provides a convenient command line interface for solving linear and nonlinear problems numerically, and for performing other numerical experiments using a language that is mostly compatible with Matlab. It may also be used as a batch-oriented language.

    MatLab seems to be one of the most, if not the most, popular mathematics environments. In the book Ripples in Mathematics, which I used for reference for much of my wavelet work, many of the examples are given in MatLab code.

  2. The R Project for Statistical Computing

    R is a free software version of the S+ mathematics environment. I used S+ for work I did on wavelets while I was working at Prediction Company. The version of S+ we were using was heavily modified for the local environment (Prediction Company has a source license). This made it difficult to migrate to new versions of S+. So I'm not sure whether my experience with S+ is representative of the current product. I was not very happy with the S+ graphics environment. Also, S+ seems to be used much less than MatLab.

    Although S+ may have limitations, the R Project does look interesting. The documentation looks good and it has a more "professional feel" than Octave.

  3. NIST Dataplot

    Dataplot has been developed by the US National Institute of Standards and Technology. The software started being developed by James J. Filliben in 1978. It now includes a fairly sophisticated plotting and data analysis environment. The software is supported on virtually every major platform.

Copyright and use of the associated software

You may use this source code without limitation and without fee as long as you include:

This software was written and is copyrighted by Ian Kaplan, Bear Products International, www.bearcave.com, 2001.

This software is provided "as is", without any warranty or claim as to its usefulness. Anyone who uses this source code uses it at their own risk. Nor is any support provided by Ian Kaplan and Bear Products International. Please send any bug fixes or suggested source changes to:
     iank@bearcave.com

Software

The C++ source code that implements the basic statistics functions described here (along with C++ code to calculate the probability density function, which is not described here) can be downloaded by clicking here. This file is in uncompressed tar format.

The documentation for this source code, generated by Doxygen can be found here.

Books

Web Based References

Acknowledgments

I want to thank Keith Dyke who, in a time which seems increasingly long ago, in a land far away from where I am now, patiently explained basic statistics to me and helped me gain some facility with the S+ statistics package.

Ian Kaplan
May 2003
Revised: