Introduction

 

Linear Regression

A classic statistical problem is to try to determine the relationship between two random variables X and Y. For example, we might consider height and weight of a sample of adults. Linear regression attempts to explain this relationship with a straight line fit to the data. The linear regression model postulates that Y = a + bX + e, where the "error" e is a random variable with mean zero. The coefficients a and b are determined by the condition that the sum of the squared residuals is as small as possible.

For instance if   X = #(1 2 3 4) and   Y = #(4 5 6 7), then   a = 3.0, b = 1.0 and e = 0.0.

If   X = #(1 2 3 4) and   Y = #(2 4 6 8), then   a = 0.0, b = 2.0 and e = 0.0.

And if   X = #(1 2 3 4) and   Y = #(1.5 4 6.2 8), then   a = -0.5, b = 2.17 and the sum of squared residuals is 0.123.
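To make the computation concrete, here is a minimal sketch of the least squares fit for the third example above, written in Python with NumPy for illustration (the examples in this document otherwise use AIS vector notation):

    import numpy as np

    X = np.array([1.0, 2.0, 3.0, 4.0])
    Y = np.array([1.5, 4.0, 6.2, 8.0])

    # Closed-form least squares estimates of the slope b and intercept a.
    b = np.sum((X - X.mean()) * (Y - Y.mean())) / np.sum((X - X.mean()) ** 2)
    a = Y.mean() - b * X.mean()
    sse = np.sum((Y - (a + b * X)) ** 2)   # sum of squared residuals

    print(a, b, sse)   # approximately -0.5, 2.17, 0.123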

Multiple Linear Regression. The general purpose of multiple regression (the term was first used by Pearson, 1908) is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. For example, a real estate Lambda might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses it would be interesting to see whether and how these measures relate to the price for which a house is sold. For example, one might learn that the number of bedrooms is a better predictor of the price for which a house sells in a particular neighborhood than how "pretty" the house is (subjective rating). One may also detect "outliers," that is, houses that should really sell for more, given their location and characteristics.

Personnel professionals customarily use multiple regression procedures to determine equitable compensation. One can determine a number of factors or dimensions such as "amount of responsibility" (Resp) or "number of people to supervise" (No_Super) that one believes to contribute to the value of a job. The personnel analyst then usually conducts a salary survey among comparable companies in the market, recording the salaries and respective characteristics (i.e., values on dimensions) for different positions. This information can be used in a multiple regression analysis to build a regression equation of the form:

Salary = .5*Resp + .8*No_Super

Once this so-called regression line has been determined, the analyst can now easily construct a graph of the expected (predicted) salaries and the actual salaries of job incumbents in his or her company. Thus, the analyst is able to determine which position is underpaid (below the regression line) or overpaid (above the regression line), or paid equitably.

In the social and natural sciences multiple regression procedures are very widely used in research. In general, multiple regression allows the researcher to ask (and hopefully answer) the general question "what is the best predictor of ...". For example, educational researchers might want to learn what are the best predictors of success in high-school. Psychologists may want to determine which personality variable best predicts social adjustment. Sociologists may want to find out which of the multiple social indicators best predict whether or not a new immigrant group will adapt and be absorbed into society.

The general computational problem that needs to be solved in multiple regression analysis is to fit a line (or, with several predictors, a hyperplane) to a number of points.

Least Squares. In our problem, we have an independent or X variable, and a dependent or Y variable. These variables may, for example, represent IQ (intelligence as measured by a test) and school achievement (grade point average; GPA), respectively. Each point in a scatterplot of the data represents one student, that is, the respective student's IQ and GPA. The goal of linear regression procedures is to fit a line through the points. Specifically, the program will compute a line so that the squared deviations of the observed points from that line are minimized. Thus, this general procedure is sometimes also referred to as least squares estimation.

The Regression Equation. A line in a two dimensional or two-variable space is defined by the equation Y=a+b*X; in full text: the Y variable can be expressed in terms of a constant (a) and a slope (b) times the X variable. The constant is also referred to as the intercept, and the slope as the regression coefficient or B coefficient. For example, GPA may best be predicted as 1+.02*IQ. Thus, knowing that a student has an IQ of 130 would lead us to predict that her GPA would be 3.6 (since 1+.02*130=3.6). In the multivariate case, when there is more than one independent variable, the regression line cannot be visualized in the two dimensional space, but can be computed just as easily. For example, if in addition to IQ we had additional predictors of achievement (e.g., Motivation, Self-discipline) we could construct a linear equation containing all those variables. In general then, multiple regression procedures will estimate a linear equation of the form:   Y = a + b1*X1 + b2*X2 + ... + bp*Xp
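As an illustration of the multivariate case, the following sketch estimates a, b1 and b2 by ordinary least squares using NumPy; the IQ, Motivation and GPA numbers are hypothetical and chosen only to show the mechanics:

    import numpy as np

    iq         = np.array([95.0, 110.0, 130.0, 120.0, 100.0])
    motivation = np.array([ 3.0,   4.5,   4.0,   5.0,   2.5])
    gpa        = np.array([ 2.9,   3.4,   3.6,   3.7,   2.8])

    # Design matrix with a leading column of ones for the intercept a.
    X = np.column_stack([np.ones_like(iq), iq, motivation])
    (a, b1, b2), *_ = np.linalg.lstsq(X, gpa, rcond=None)

    predicted = a + b1 * iq + b2 * motivation   # predicted GPA for each student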

Unique Prediction and Partial Correlation. Note that in this equation, the regression coefficients (or B coefficients) represent the independent contributions of each independent variable to the prediction of the dependent variable. Another way to express this fact is to say that, for example, variable X1 is correlated with the Y variable, after controlling for all other independent variables. This type of correlation is also referred to as a partial correlation (this term was first used by Yule, 1907). Perhaps the following example will clarify this issue. One would probably find a significant negative correlation between hair length and height in the population (i.e., short people have longer hair). At first this may seem odd; however, if we were to add the variable Gender into the multiple regression equation, this correlation would probably disappear. This is because women, on the average, have longer hair than men; they also are shorter on the average than men. Thus, after we remove this gender difference by entering Gender into the equation, the relationship between hair length and height disappears because hair length does not make any unique contribution to the prediction of height, above and beyond what it shares in the prediction with variable Gender. Put another way, after controlling for the variable Gender, the partial correlation between hair length and height is zero.

Predicted and Residual Scores. The regression line expresses the best prediction of the dependent variable (Y), given the independent variables (X). However, nature is rarely (if ever) perfectly predictable, and usually there is substantial variation of the observed points around the fitted regression line (as in a typical scatterplot of real data). The deviation of a particular point from the regression line (its predicted value) is called the residual value.

Residual Variance and R-square. The smaller the variability of the residual values around the regression line relative to the overall variability, the better is our prediction. For example, if there is no relationship between the X and Y variables, then the ratio of the residual variability of the Y variable to the original variance is equal to 1.0. If X and Y are perfectly related then there is no residual variance and the ratio of variance would be 0.0. In most cases, the ratio would fall somewhere between these extremes, that is, between 0.0 and 1.0. 1.0 minus this ratio is referred to as R-square or the coefficient of determination. This value is immediately interpretable in the following manner. If we have an R-square of 0.4 then we know that the variability of the Y values around the regression line is 1 - 0.4 = 0.6 times the original variance; in other words we have explained 40% of the original variability, and are left with 60% residual variability. Ideally, we would like to explain most if not all of the original variability. The R-square value is an indicator of how well the model fits the data (e.g., an R-square close to 1.0 indicates that we have accounted for almost all of the variability with the variables specified in the model).
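A minimal sketch of the R-square computation, using hypothetical observed and predicted values:

    import numpy as np

    y     = np.array([2.9, 3.4, 3.6, 3.7, 2.8])   # observed Y values (hypothetical)
    y_hat = np.array([2.8, 3.3, 3.7, 3.6, 2.9])   # values predicted by the regression line

    ratio    = np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)  # residual / original variability
    r_square = 1.0 - ratio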

Interpreting the Correlation Coefficient R. Customarily, the degree to which two or more predictors (independent or X variables) are related to the dependent (Y) variable is expressed in the correlation coefficient R, which is the square root of R-square. In multiple regression, R can assume values between 0 and 1. To interpret the direction of the relationship between variables, one looks at the signs (plus or minus) of the regression or B coefficients. If a B coefficient is positive, then the relationship of this variable with the dependent variable is positive (e.g., the greater the IQ the better the grade point average); if the B coefficient is negative then the relationship is negative (e.g., the lower the class size the better the average test scores). Of course, if the B coefficient is equal to 0 then there is no relationship between the variables.

NonLinear Regression

Nonlinear regression in statistics is the problem of fitting a model y = f(x,P) + e to multidimensional x, y data, where f is a nonlinear function of x with parameters P, and where the "error" e is a random variable with mean zero. In general, there is no algebraic expression for the best-fitting parameters P, as there is in linear regression. Usually numerical optimization algorithms are applied to determine the best-fitting parameters. The goodness of fit may have many local maxima, again in contrast to linear regression, where there is usually a unique global maximum of the goodness of fit. Initial guess values for the parameters determine which of these maxima the numerical optimization will locate. Some nonlinear regression problems can be linearized, in which case an exact solution to the resulting linear regression equation can be found.

For example:

If we take the logarithm of y = A*exp(B*x) and cast it as a linear regression, it becomes log(y) = log(A) + B*x, an ordinary linear regression problem in the parameters log(A) and B, whose exact solution is well known.
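A minimal sketch of this linearization in Python; the data below are synthetic, generated from A = 2 and B = 0.5 with a little noise:

    import numpy as np

    x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
    y = 2.0 * np.exp(0.5 * x) * np.array([1.01, 0.98, 1.02, 0.99, 1.01, 1.00])

    # Ordinary linear regression on (x, log(y)): log(y) = log(A) + B*x.
    B, logA = np.polyfit(x, np.log(y), 1)   # slope, then intercept
    A = np.exp(logA)
    print(A, B)                              # close to 2.0 and 0.5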

However, performing such a linearization may bias some data points towards being more "relevant" than others, which may not be a desired effect. More complex problems, such as transcendental regression, are optimized by more complex algorithms. Other nonlinear regressions may have several goodness-of-fit maxima, and will require the scientist to input guess values for the optimized parameters.
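When no such linearization is available, the model is fit by numerical optimization starting from guess values of the parameters. Here is a minimal sketch using SciPy's curve_fit; the model, data, and guess values p0 are illustrative assumptions:

    import numpy as np
    from scipy.optimize import curve_fit

    def f(x, A, B):
        return A * np.exp(B * x)          # the nonlinear model y = f(x, P)

    x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
    y = 2.0 * np.exp(0.5 * x)

    (A, B), _ = curve_fit(f, x, y, p0=[1.0, 1.0])   # p0 holds the guess values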

Nonlinear regression fits a mathematical model to your data, and therefore requires that you choose a model. What is a model? A mathematical model is a simple description of a physical, chemical or biological state or process. Using a model can help you think about chemical and physiological processes or mechanisms, enabling you to design better experiments and make sense of the results. Your model must be expressed as a mathematical function. You can express the model as a single algebraic equation, as a set of differential equations, or as an equation written so that different models apply to different portions of your data.

Choosing a model for NonLinear regression obviously requires some understanding of the problem data and some preference for one choice of model over another.

Neural Net Regression

Neural Net regression is the problem of training a neural net model y = Nf(x,S,Ch,Wh,Co,Wo) + e on multidimensional M-Vectors x in X and real numbers y in Y, such that the trained numeric coefficients Ch, Wh, Co and Wo minimize the least squares error component e, which is a random variable with mean zero. Normally, a neural net is a complex learning machine receiving M dimensional inputs x, containing hidden layer coefficients Ch and Wh, also containing output coefficients Co and Wo, producing hidden layer internal signals S, and producing one real number output signal y.

A standard neural net, Nf, is defined by M, the number of input dimensions, and by K, the number of hidden layers. There are inputs 1 through M, layers 0 through K (each producing internal signals 1 through M), and one real number output signal.

A standard neural net operates by generating internal signals which propagate up through each layer until the final output signal is produced. The number and composition of these signals inside the neural net are as follows.

Normally each internal signal, S[k][m], is either a weighted function of the outputs from the layer below or an original input element x[m]. All of the neural nets which we will consider in this document are fully connected, receive M input signals, have zero or more hidden layers, and produce one real number output signal.
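The following sketch shows the forward pass of one such fully connected net with a single hidden layer, written in Python with NumPy. The tanh activation, the shapes of Ch, Wh, Co and Wo, and the random initialization are illustrative assumptions, not the exact definition of Nf:

    import numpy as np

    def neural_net(x, Ch, Wh, Co, Wo):
        S = np.tanh(Ch + Wh @ x)   # hidden-layer internal signals S[m]
        y = Co + Wo @ S            # single real number output signal
        return y

    M = 3                                        # number of input dimensions
    rng = np.random.default_rng(0)
    Ch, Wh = rng.normal(size=M), rng.normal(size=(M, M))
    Co, Wo = rng.normal(), rng.normal(size=M)
    y = neural_net(rng.normal(size=M), Ch, Wh, Co, Wo)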

Symbolic Regression

Symbolic regression is the induction of mathematical expressions from data. The term (first used by Koza in 1992) emphasizes the fact that the object of search is a symbolic description of a model, not just a set of optimized coefficients in a prespecified model. This is in sharp contrast with other methods of regression, including NonLinear regression, where a specific model is assumed and often only the complexity of this model can be varied.

For example:

If we start with a trigonometric function y = 3.56*cos((21.456*x1)-log(x2)) and create a training data set we might produce a three column by 1000 row matrix. The first column is the independent variable X1. The second column is the independent variable X2, and the third column is the dependent variable Y. The number of training rows, in this case 1000, is arbitrary; but, the larger the number of rows, the better for training purposes.
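A minimal sketch of building such a training matrix in Python; the sampling ranges for x1 and x2 are arbitrary assumptions (x2 is kept positive because of the log term):

    import numpy as np

    rng = np.random.default_rng(0)
    x1 = rng.uniform(-1.0, 1.0, 1000)
    x2 = rng.uniform(0.1, 10.0, 1000)
    y  = 3.56 * np.cos(21.456 * x1 - np.log(x2))

    training = np.column_stack([x1, x2, y])   # columns: X1, X2, Y (1000 rows)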

When we present the training data set, created above, to a symbolic regression machine, we will know that the Y column can be derived from the X columns using the model: y = 3.56*cos((21.456*x1)-log(x2)). However, the symbolic regression machine will not know this fact. We want the symbolic regression machine to discover this model relationship on its own without any further hints other than the training data set.

In the case of a symbolic regression machine, what is a model? For our purposes, using Analytic Information Server, a symbolic regression machine model is an AIS Lambda which inputs a number vector, x, and outputs a single number, y. Furthermore, we express all symbolic regression models in the Estimator language. If our hypothetical symbolic regression machine were to discover the relationship hidden in the training data correctly, it would return a model regression Lambda as follows: function(x) {y = 3.56*cos((21.456*x[0])-log(x[1]))}. This would be a perfect score for our hypothetical symbolic regression machine.

Grammatical Swarm Evolution

Grammatical Swarm Evolution is a just-in-time algorithm that can evolve computer programs, rulesets, or more generally sentences in any language. Rulesets could be as diverse as a regression model or a trading system for a financial market. Rather than representing the programs as syntax trees, as in Genetic Programming, a linear genome representation is used in conjunction with a grammar. Each individual genome is a vector of integer codons each of which contains the information to select production rules from a grammar. The mapping from the genotype (genome) to the phenotype (computer program) is accomplished by reading the genome, from first element to last element, and using each integer codon to select a grammar rule. Selected grammar rules are used to incrementally build a grammatically correct computer program. The mapping process has been engineered so that it will always terminate, either at or before the end of the genome, with a valid program.
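A minimal sketch of the genotype-to-phenotype mapping, using a toy arithmetic grammar rather than the Estimator grammar actually used by the machine. Each integer codon selects a production rule via codon mod number-of-rules; the step limit here is a simplification of the termination guarantee described above:

    GRAMMAR = {
        "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"]],
        "<op>":   [["+"], ["-"], ["*"]],
        "<var>":  [["x[0]"], ["x[1]"]],
    }

    def map_genome(genome, start="<expr>", max_steps=100):
        symbols, output, i = [start], [], 0
        for _ in range(max_steps):
            if not symbols:
                break
            sym = symbols.pop(0)
            if sym in GRAMMAR:
                rules  = GRAMMAR[sym]
                choice = rules[genome[i % len(genome)] % len(rules)]  # codon selects a rule
                i += 1
                symbols = list(choice) + symbols
            else:
                output.append(sym)                                    # terminal symbol
        return " ".join(output)

    print(map_genome([6, 3, 8, 5, 9, 7]))   # prints: x[0] * x[1]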

For the evolutionary component of its algorithm, Grammatical Swarm Evolution turns to a technique called Particle Swarm Optimization (PSO). The PSO algorithm was introduced by Kennedy and Eberhart in 1995. In PSO, a swarm of particles, which encode solutions to the problem of interest, move around in an n-dimensional search space in an attempt to uncover better solutions. Each of the particles has two associated properties, a current position and a velocity. Each particle has a memory of the best location in the search space that it has found so far (pbest), and knows the best location found to date by all the particles in the population (gbest). At each step of the algorithm, particles move from their current location by applying a velocity vector to their current position vector. The magnitude and direction of their velocity at each step is influenced by their velocity in the previous iteration of the algorithm, simulating momentum, and by the location of the particle relative to its pbest and the gbest. Therefore, at each step, the size and direction of each particle's move is a function of its own history, and the social influence of its peer group.

Therefore, for grammatical purposes, the genome is seen as a vector of integer codons; however, for evolutionary purposes, the genome is viewed as a discrete binary bit vector for use as a location in an n-dimensional search space. The evolutionary search mechanism is performed by the discrete binary version of PSO.
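A minimal sketch of one step of the discrete binary PSO described above, after Kennedy and Eberhart (1997): the velocity update blends momentum with attraction toward pbest and gbest, and each bit of the position is then resampled with probability sigmoid(velocity). The inertia and acceleration constants are illustrative choices:

    import numpy as np

    rng = np.random.default_rng(0)

    def pso_step(position, velocity, pbest, gbest, w=0.9, c1=2.0, c2=2.0):
        r1, r2 = rng.random(position.shape), rng.random(position.shape)
        velocity = (w * velocity
                    + c1 * r1 * (pbest - position)    # pull toward the particle's own best
                    + c2 * r2 * (gbest - position))   # pull toward the swarm's best
        prob = 1.0 / (1.0 + np.exp(-velocity))        # sigmoid squashing of each velocity
        position = (rng.random(position.shape) < prob).astype(int)
        return position, velocity

    bits = rng.integers(0, 2, 16)                     # a 16-bit genome position
    vel  = np.zeros(16)
    bits, vel = pso_step(bits, vel, pbest=bits.copy(), gbest=bits.copy())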

Combining All Of These into a Tool

The Grammatical Swarm Symbolic Regression Machine Lambda (GSM) is a learning machine which learns to select and score individuals from a universe of individuals over time. For each of a series of discrete time steps, a universe of individuals is collected. The individuals are things such as Stocks, People, Cities, etc. The discrete time steps may be weeks, days, seconds, years, microseconds, etc.

Each individual followed by the system is given a unique identifier which remains unique across all time periods studied (no two individuals ever have the same identifier). Furthermore, each time period studied is given a unique ascending integer index (i.e. week 1, week 2, etc.). So, for a series of time periods, historical information about groups of individuals is collected for each time period. The historical information collected for each individual for each time period is stored in a Number Vector and includes: the time period index; the unique identifier of the individual; and other numeric information about the individual pertinent to the current investigation. Finally, each individual in each time period is given a numeric "score" which determines the value of the individual in that time period.

During training, the GSM is given historical information for time periods 0 through T for all individuals. The GSM is also given the "score" values for each individual in each training time period from 0 through T. During training the GSM attempts to "learn" any patterns in the available historical data. The machine (GSM) is free to discover static as well as time varying patterns.

During forward prediction, the GSM is given new information for time period T+1 for all individuals. The GSM is NOT given the "score" values for each individual in the new time period T+1. During prediction the GSM attempts to use any patterns it has learned to select and score the individuals, from the universe of individuals, seen in time period T+1. Once the machine scores the individuals, in the new time period, the accuracy of the machine is determined by: (a) the least squares error on the scored individuals in time period T+1; and (b) the "order preserving" property of the estimated scores in time period T+1 (the degree to which the estimated scores preserve the natural sort ordering of the actual scores). Order preservation is a simple idea: if the estimated score for individual x is less than the estimated score for individual y, then the actual score for individual x should also be less than the actual score for individual y in time period T+1. Normally these two measures should be coincident -- especially if the least squares error is excellent. However, in cases where there is insufficient information in the training data, they may not be coincident. If a tradeoff is required, the Grammatical Swarm Symbolic Regression Machine prefers that at least natural ordering be preserved.

A time series set of vectors, X, together with a set, Y, of scores for the vectors in X are used to train a learning machine. There is also a testing set, TX and TY, of vectors similar to those in X and Y but for the testing time period (a time period not present in X or Y). After training, the machine is presented with the testing set, TX, and attempts to estimate TY. The learning machine returns a Vector EY, of estimates for TY. Each element in EY contains a numeric estimate for the corresponding value in TY. The learning machine attempts to: (a) minimize the least squared error between EY and TY; and (b) as much as possible, have the natural ordering of EY be predictive of the natural ordering of TY.

The order preservation mission of the Grammatical Swarm Symbolic Regression Machine is an important distinguishing feature between this learning machine and general regression learning machines. The GSM is trying to fit a function to the testing data, using least squares error; but, the GSM is also trying to predict the natural order of the individuals in the testing data.

In many instances the Grammatical Swarm Symbolic Regression Machine may not have an exact estimate for the scores of the individuals in the testing data. However, if the learning machine is able to predict the natural ordering of individuals in the testing data, then the machine has been partially successful even if its estimated scores are incorrect.

Let X be a set of vectors such as the numeric vector x = #(num| xtime x1 x2 x3 ... xm), and let Y be a numeric vector of "score" values. Let both X and Y be of length N.

Furthermore, let the first prefix element, xtime, of every vector, in X, contain a non-negative integer value indicating some time span of integral length, for example, if the time span were weeks, a value of 1 would indicate week one, and a value of 10 would indicate week ten, etc. (i.e. the vectors contained in X are time sequenced).

Example

An example, but by no means the only example, of X and Y would be a set of vectors of stock market data taken from the Open, High, Low, Close, and Volume numbers for all NASDAQ traded stocks over a 52 week period. In this example we have the following:

xtime   The sequential index of the current week (from 1 through 52).
x1      The unique integer identifier of the stock (stocks not traded would not appear).
x2      The current week opening price.
x3      The current week high price.
x4      The current week low price.
x5      The current week closing price.
x6      The current week share volume traded.
Y       The "score" vector of next week profits (next_week_closing_price - the_current_week_closing_price).

Similar examples can be constructed for oil exploration data over time, for the height and weight of individuals over time, etc. However, continuing with our securities example, we train our machine on the market data for four stocks over a three week period, and then test it on the fourth week, as follows:

Training Data

Time      Stock        Open       High       Low        Close      Vol        Score
x.xtime   x.x1         x.x2       x.x3       x.x4       x.x5       x.x6       y
Note: the Score column is the next week's profit.

(first week)
0         1 (Apple)    $23.45     $25.67     $23.35     $24.56     19367      3.4%
0         2 (IBM)      $143.45    $145.27    $143.15    $144.96    894676     -1.2%
0         3 (Xerox)    $13.95     $15.27     $13.35     $14.72     56832      4.8%
0         4 (GM)       $57.15     $62.17     $53.65     $62.05     3419647    9.1%

(second week)
1         1 (Apple)    $24.56     $25.38     $22.75     $25.40     12046      1.2%
1         2 (IBM)      $144.96    $144.96    $143.15    $143.23    864023     -3.2%
1         3 (Xerox)    $14.72     $16.12     $14.39     $15.43     59204      3.4%
1         4 (GM)       $62.05     $62.05     $68.00     $67.70     3219382    6.5%

(third week)
2         1 (Apple)    $25.40     $26.98     $24.75     $25.71     22056      0.8%
2         2 (IBM)      $143.23    $143.23    $136.75    $138.64    824093     -4.3%
2         3 (Xerox)    $15.43     $16.45     $15.09     $15.96     61205      -1.4%
2         4 (GM)       $67.70     $75.35     $66.39     $72.10     3619582    7.8%

We train the GSM on the training data shown above. After training, we show the machine the following testing data, TX, and ask it to return an estimate of the next week's profit, TY, for each of the four individuals. We do not show the machine the scores, TY.

Testing Data

Time      Stock        Open       High       Low        Close      Vol        Score
tx.xtime  tx.x1        tx.x2      tx.x3      tx.x4      tx.x5      tx.x6      ty
Note: the Score column is the next week's profit.

(fourth week)
3         1 (Apple)    $25.71     $26.18     $25.55     $25.92     25046      -1.2%
3         2 (IBM)      $138.64    $139.23    $131.15    $132.67    774593     -6.1%
3         3 (Xerox)    $15.96     $16.13     $15.00     $15.73     59205      2.4%
3         4 (GM)       $72.10     $77.87     $71.39     $77.73     3710582    5.8%

Resulting Estimates

After testing, the learning machine returns the following estimated scores, EY.

Estimate (ey)   Score (ty)
-2.3%           -1.2%
-7.6%           -6.1%
1.9%            2.4%
4.9%            5.8%

We calculate the least squares error on the four individuals as 1.13, so we did only a mediocre job of minimizing least squares error; however, we did a perfect job of preserving the sort order of the corresponding TY values.
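The two accuracy measures for this example can be checked with a short Python sketch (the percentages are treated as plain numbers):

    import numpy as np
    from itertools import combinations

    EY = np.array([-2.3, -7.6, 1.9, 4.9])   # estimated next week profits (%)
    TY = np.array([-1.2, -6.1, 2.4, 5.8])   # actual next week profits (%)

    mse = np.mean((EY - TY) ** 2)            # least squares error: 1.13
    order_preserved = all(
        (EY[i] < EY[j]) == (TY[i] < TY[j])   # every pair keeps its relative order
        for i, j in combinations(range(len(EY)), 2)
    )
    print(mse, order_preserved)              # 1.13 True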

FAQ

Question 1: Why must we train the learning machine on collections of individuals in each time period? Why not simply train the machine on each individual separately and perform a normal regression estimate for each individual?

Answer: Many real world estimates cannot be made unless one is aware of the competitive landscape.

For instance, suppose one is estimating the social popularity of students in the senior high school class. We can perform any number of individual regressions correlating high scores for intelligence, appearance, and social skills with social popularity. However, all of these individual regression models are greatly skewed in the case where all students in the senior high school class are male except one student who is female.

Also, suppose one is estimating the financial popularity of our Bank's Certificates of Deposit. We perform any number of individual regressions correlating our Bank's previous Certificates of Deposit with their financial popularity. However, all of these individual regression models are greatly skewed in the case where one of our competitors is advertising an aggressive interest rate two percentage points higher than ours.

Question 2: Why must we ask the learning machine to preserve the natural order of the individuals in the testing time period? Why not simply have the machine provide more accurate regression estimates for each individual in the testing time period?

Answer: Many real world estimates cannot be made for all individuals, but only for a few individuals.

For instance, suppose one is estimating currency conversion rates. Normally these rates have very small random daily changes. However, every so often, a central bank will pre-announce its intention to buy or sell its own currency. In those special cases the learning machine will want to provide its normal estimates on most currencies; yet, make a special higher than normal estimate for the currency whose central bank has pre-announced.

Many real world situations do not allow us to accurately estimate the scores of the best individuals, but only to guess which individuals might be better than others.

For instance, suppose ten monkeys and one five-year-old human child are taking a simple IQ test (this being all the information we have). We cannot, with the meager information provided, accurately estimate the IQ scores of the contestants after they take the IQ test. However, we can reasonably make the guess that the human child will have the better IQ score (whatever that score may be).

References

  1. A Discrete Binary Version of the Particle Swarm Algorithm; Proceedings of the Conference on Systems, Man and Cybernetics; J Kennedy, R Eberhart; IEEE Press; 1997.
  2. An Introduction to Genetic Algorithms; Melanie Mitchell; The MIT Press, Cambridge Massachusetts; 1996.
  3. Biologically Inspired Algorithms for Financial Modelling; Anthony Brabazon, Michael O'Neill; Springer-Verlag, Berlin; 2005.
  4. Datamining using Grammar based Genetic Programming and Applications; Man Leung Wong, Kwong Sak Leung; Kluwer Academic Publishers, Dordrecht Netherlands; 2000.
  5. Genetic Algorithms and Genetic Programming in Computational Finance; edited by Shu-Heng Chen; Kluwer Academic Publishers, Dordrecht Netherlands; 2002.
  6. Genetic Programming: On the Programming of Computers by Means of Natural Selection; John R. Koza; The MIT Press, Cambridge Massachusetts; 1992.
  7. Genetic Programming II: Automatic Discovery of Reusable Programs; John R Koza; The MIT Press, Cambridge Massachusetts; 1994.
  8. Genetic Programming III: Darwinian Invention and Problem Solving; John R Koza, Forrest H Bennett III, David Andre, Martin A Keane; Morgan Kaufmann Publishers, San Francisco, California; 1999.
  9. Genetic Programming IV: Routine Human-Competitive Machine Intelligence; John R Koza, Martin A Keane, Mathew J Streeter, William Mydlowec, Jessen Yu, Guido Lanza; Kluwer Academic Publishers, Dordrecht Netherlands; 2003.
  10. Grammatical Evolution; Michael O'Neill, Conor Ryan; Kluwer Academic Publishers, Dordrecht Netherlands; 2003.
  11. Particle Swarm Optimization; Proceedings of the IEEE International Conference on Neural Networks; J Kennedy, R Eberhart; IEEE Press; December 1995.