More precisely, multiple regression analysis helps us to predict the value of Y for given values of X1, X2, …, Xk.
For example, the yield of rice per acre depends upon the quality of seed, fertility of soil, fertilizer used, temperature, and rainfall. If one is interested in studying the joint effect of all these variables on rice yield, one can use this technique.
An additional advantage of this technique is that it also enables us to study the individual influence of each of these variables on yield.
Dependent and Independent Variables
By multiple regression, we mean models with just one dependent variable and two or more independent (explanatory) variables. The variable whose value is to be predicted is known as the dependent variable, and the ones whose known values are used for prediction are known as independent (explanatory) variables.
The Multiple Regression Model
In general, the multiple regression equation of Y on X1, X2, …, Xk is given by:
Y = b0 + b1 X1 + b2 X2 + … + bk Xk
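To make the equation concrete, here is a minimal sketch of how the coefficients b0, b1, …, bk could be estimated by ordinary least squares. The function name fit_ols and the data are hypothetical, chosen only for illustration. The sketch solves the normal equations directly with the Python standard library, whereas statistical software typically relies on more numerically stable routines.

```python
def fit_ols(rows, y):
    """Return [b0, b1, ..., bk] minimising the sum of squared errors."""
    # Prepend a constant 1 to each observation so that b0 acts as the intercept.
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    # Build the augmented normal equations [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    # The left block is now diagonal, so each coefficient is a simple ratio.
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical data generated exactly from Y = 1 + 2*X1 + 3*X2 (no noise),
# so the fit should recover b0 = 1, b1 = 2, b2 = 3 up to rounding error.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6)]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in rows]
coefficients = fit_ols(rows, y)
print([round(b, 6) for b in coefficients])
```

With real data the observations contain noise, so the estimated coefficients only approximate the underlying relationship.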
Interpreting Regression Coefficients
Here b0 is the intercept, and b1, b2, b3, …, bk are analogous to the slope in a simple linear regression equation; they are also called regression coefficients. They can be interpreted the same way as a slope. Thus, if bi = 2.5, Y is expected to increase by 2.5 units when Xi increases by 1 unit and the other independent variables are held constant.
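The interpretation of a coefficient can be checked with a small sketch. The coefficient values and the helper predict below are hypothetical, used only to show that raising one variable by 1 unit, with the others unchanged, shifts the predicted Y by exactly that variable's coefficient.

```python
def predict(coefficients, xs):
    """Evaluate b0 + b1*X1 + ... + bk*Xk for one observation."""
    b0, slopes = coefficients[0], coefficients[1:]
    return b0 + sum(b * x for b, x in zip(slopes, xs))


# Hypothetical coefficients: b0 = 10, b1 = 2.5, b2 = -1.
coefficients = [10.0, 2.5, -1.0]
base = predict(coefficients, [4.0, 3.0])
bumped = predict(coefficients, [5.0, 3.0])  # X1 raised by 1 unit, X2 unchanged
change = bumped - base
print(change)  # 2.5, the value of b1
```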
The appropriateness of the multiple regression model as a whole can be tested by the F-test in the ANOVA table. A significant F indicates a linear relationship between Y and at least one of the X's.
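As a rough illustration of the ANOVA F-test, the F statistic is the ratio of the mean square due to regression to the mean square error. The sketch below assumes the two sums of squares are already available from an ANOVA table. The function name f_statistic and the numbers are hypothetical. Judging significance additionally requires comparing the result with an F distribution on k and n - k - 1 degrees of freedom, which is not done here.

```python
def f_statistic(ss_regression, ss_error, n, k):
    """F ratio for a model with k independent variables and n observations."""
    ms_regression = ss_regression / k      # mean square due to regression
    ms_error = ss_error / (n - k - 1)      # mean square error
    return ms_regression / ms_error


# Hypothetical ANOVA table: 23 observations, 2 independent variables.
f_value = f_statistic(ss_regression=80.0, ss_error=20.0, n=23, k=2)
print(f_value)  # 40.0
```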
How Good Is the Regression?
Once a multiple regression equation has been constructed, one can check how good it is (in terms of predictive ability) by examining the coefficient of determination (R2). R2 always lies between 0 and 1.
Most statistical software reports R2 whenever a regression procedure is run. The closer R2 is to 1, the better the model fits the data and the better its predictions.
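For readers who want to see what the software computes, R2 can be sketched as one minus the ratio of the unexplained variation to the total variation in Y. The function name r_squared and both lists of values are hypothetical and serve only to illustrate the arithmetic.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_error / SS_total."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_error = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_error / ss_total


# Hypothetical observed values of Y and predictions from some fitted model.
observed = [2.0, 4.0, 6.0, 8.0]
predicted = [2.5, 3.5, 6.5, 7.5]
r2 = r_squared(observed, predicted)
print(round(r2, 2))  # 0.95
```

Here the total sum of squares is 20 and the error sum of squares is 1, so R2 = 1 - 1/20 = 0.95.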
A related question is whether the independent variables individually influence the dependent variable significantly. Statistically, it is equivalent to testing the null hypothesis that the relevant regression coefficient is zero.
This can be done using a t-test. If the t-test of a regression coefficient is significant, it indicates that the variable in question influences Y significantly while controlling for the other independent (explanatory) variables.
The multiple regression technique does not test whether the data are linear. On the contrary, it proceeds by assuming that the relationship between Y and each of the Xi's is linear. Hence, as a rule, it is prudent to always look at the scatter plots of (Y, Xi), i = 1, 2, …, k. If any plot suggests nonlinearity, one may use a suitable transformation to attain linearity.
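The t statistic behind this test is each estimated coefficient divided by its standard error. The sketch below is one way to compute it with the Python standard library. The helper names invert and coefficient_t_statistics and the data are hypothetical. It stops at the t statistics; obtaining p-values would require a t distribution with n - k - 1 degrees of freedom, which statistical software supplies.

```python
import math


def invert(M):
    """Invert a small square matrix by Gauss-Jordan elimination."""
    n = len(M)
    # Augment M with the identity matrix.
    A = [list(map(float, row)) + [float(i == j) for j in range(n)]
         for i, row in enumerate(M)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pv = A[col][col]
        A[col] = [a / pv for a in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * c for a, c in zip(A[r], A[col])]
    return [row[n:] for row in A]


def coefficient_t_statistics(rows, y):
    """Return (coefficients, t statistics), intercept first."""
    X = [[1.0] + list(r) for r in rows]
    n, p = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    xtx_inv = invert(xtx)
    b = [sum(xtx_inv[i][j] * xty[j] for j in range(p)) for i in range(p)]
    residuals = [yi - sum(bj * xj for bj, xj in zip(b, r))
                 for r, yi in zip(X, y)]
    s2 = sum(e * e for e in residuals) / (n - p)  # error variance, n - k - 1 df
    se = [math.sqrt(s2 * xtx_inv[i][i]) for i in range(p)]
    return b, [bi / sei for bi, sei in zip(b, se)]


# Hypothetical noisy data with two independent variables.
rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]
y = [9.1, 7.8, 19.2, 17.9, 29.1, 28.0]
b, t_values = coefficient_t_statistics(rows, y)
for name, bi, ti in zip(["b0", "b1", "b2"], b, t_values):
    print(name, round(bi, 3), round(ti, 2))
```

A large absolute t value relative to the critical value of the t distribution suggests that the corresponding coefficient differs from zero.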
Another important assumption is the absence of multicollinearity: the independent variables should not be strongly related among themselves. At a very basic level, this can be checked by computing the correlation coefficient between each pair of independent variables.
Other assumptions include homoscedasticity (constant variance of the errors) and normality of the errors.
Multiple regression analysis is used when one is interested in predicting a continuous dependent variable from a number of independent variables. If the dependent variable is dichotomous, then logistic regression should be used.
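The pairwise check described above might look like the following sketch. The variable names, the data, and the helper pearson are hypothetical. X2 is deliberately constructed to be almost twice X1, so that pair should show a correlation close to 1 and flag possible multicollinearity.

```python
import math
from itertools import combinations


def pearson(a, b):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical independent variables.
predictors = {
    "X1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "X2": [2.1, 3.9, 6.2, 8.0, 9.8],  # nearly 2 * X1
    "X3": [5.0, 3.0, 6.0, 2.0, 4.0],
}

# Correlation for every pair of independent variables.
pairwise = {(a, b): pearson(predictors[a], predictors[b])
            for a, b in combinations(predictors, 2)}
for pair, r in pairwise.items():
    print(pair, round(r, 3))
```

Pairwise correlations are only a first screen, since a variable can be nearly a combination of several others without any single pair being highly correlated.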