Correlation and linear regression are the most commonly used techniques for investigating the relationship between two quantitative variables.
The goal of a correlation analysis is to see whether two measurement variables co-vary and to quantify the strength of the relationship between them, whereas regression expresses the relationship in the form of an equation.
For example, for students taking Maths and English tests, we could use correlation to determine whether students who are good at Maths also tend to be good at English, and regression to determine whether marks in English can be predicted from given marks in Maths.
The starting point is to draw a scatter plot, with one variable on the X-axis and the other on the Y-axis, to get a feel for the relationship (if any) between the variables as suggested by the data. The closer the points lie to a straight line, the stronger the linear relationship between the two variables.
To test for a linear relationship between the variables and to quantify its strength, we calculate a correlation coefficient (r), most commonly the Pearson product-moment correlation coefficient. Its numerical value ranges from -1.0 to +1.0: r > 0 indicates a positive linear relationship, r < 0 indicates a negative linear relationship, and r = 0 indicates no linear relationship.
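The Pearson coefficient can be computed directly from its definition: the sum of the products of the paired deviations from the means, divided by the product of the square roots of the sums of squared deviations. A minimal sketch, using made-up Maths and English marks (not data from this text) as illustrative input:

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation coefficient for paired data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the means
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Square roots of the sums of squared deviations for each variable
    s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return s_xy / (s_x * s_y)

# Hypothetical marks for six students (illustrative values only)
maths = [55, 62, 70, 74, 81, 90]
english = [48, 60, 65, 70, 78, 85]

# A value close to +1 indicates a strong positive linear relationship
print(round(pearson_r(maths, english), 3))
```

In practice one would use a library routine such as `scipy.stats.pearsonr` or `statistics.correlation`, which also handle edge cases; the hand-rolled version above is only to show how r is built from the data.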
Bear in mind, however, that a third variable related to both of the variables being investigated may be responsible for an apparent correlation: correlation does not imply causation. Also, a nonlinear relationship may exist between two variables that would be inadequately described, or possibly even undetected, by the correlation coefficient.
The analysis consists of choosing and fitting an appropriate model, usually by the method of least squares, with a view to exploiting the relationship between the variables to estimate the expected response for a given value of the independent variable. For example, if we are interested in the effect of age on height, then by fitting a regression line we can predict the height for a given age.
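The least-squares line y = a + bx has slope b equal to the sum of cross-products of deviations divided by the sum of squared deviations of x, and intercept a chosen so the line passes through the point of means. A minimal sketch of the age-and-height example, with illustrative ages and heights that are not from this text:

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: S_xy / S_xx
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    b = s_xy / s_xx
    # Intercept: the line passes through (mean_x, mean_y)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical sample of six children (illustrative values only)
ages = [5, 6, 7, 8, 9, 10]                 # years
heights = [110, 116, 121, 128, 133, 139]   # cm

a, b = fit_line(ages, heights)
predicted = a + b * 7.5   # estimated height at age 7.5 -> 124.5 cm
```

The fitted equation can then be read as "estimated height = a + b × age", and the prediction is only trustworthy within the range of ages used to fit the line.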
Some underlying assumptions governing the uses of correlation and regression are as follows.
The observations are assumed to be independent. For correlation, both variables should be random variables, but for regression only the dependent variable Y must be random. In carrying out hypothesis tests, the response variable should follow a Normal distribution, and the variability of Y should be the same for each value of the predictor variable. A scatter diagram of the data provides an initial check of these assumptions for regression.
There are three main uses for correlation and regression.