Chapter 9

Correlation and Regression


Terminology: Correlation vs. Regression

This chapter will speak of both correlations and regressions

Both use similar mathematical procedures to provide a measure of relation: the degree to which two continuous variables vary together, or covary

The term correlation is used when 1) both variables are random variables, and 2) the end goal is simply to find a number that expresses the relation between the variables

The term regression is used when 1) one of the variables is a fixed variable, and 2) the end goal is to use the measure of relation to predict values of the random variable based on values of the fixed variable

 

Examples

In this class, height and ratings of physical attractiveness (both random variables) vary across individuals. We could ask, "What is the correlation between height and these ratings in our class?" Essentially, we are asking, "As height increases, is there any systematic increase (positive correlation) or decrease (negative correlation) in one’s rating of their own attractiveness?"

Alternatively, we could do an experiment in which the experimenter compliments a subject on their appearance one to eight times prior to obtaining a rating (note that ‘number of compliments’ is a fixed variable). We could now ask, "Can we predict a person’s rating of their attractiveness, based on the number of compliments they were given?"

Scatterplots

The first way to get some idea about a possible relation between two variables is to do a scatterplot of the variables

Let’s consider the first example discussed previously where we were interested in the possible relation between height and ratings of physical attractiveness

The following is a sample of the data from our class as it pertains to this issue:

Subject   Height   Phy
1         69       7
2         61       8
3         68       6
4         66       5
5         66       8
...       ...      ...
48        71       10
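If you want to draw a scatterplot like the ones described below yourself, here is a minimal sketch in Python with matplotlib, using just the handful of data points listed above (the full class dataset is not reproduced here):

    import matplotlib.pyplot as plt

    # A few (height, attractiveness rating) pairs from the sample table above
    height = [69, 61, 68, 66, 66, 71]
    phy    = [ 7,  8,  6,  5,  8, 10]

    plt.scatter(height, phy)                       # one point per subject
    plt.xlabel("Height (inches)")
    plt.ylabel("Rating of physical attractiveness")
    plt.title("Height vs. attractiveness rating")
    plt.show()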

 

 

 

[Scatterplot of height vs. attractiveness rating for the class data: correlation = 0.146235, or about +0.15]

[Example scatterplots illustrating a perfect negative relation (correlation = -1.0), no relation (correlation = 0.0), and a perfect positive relation (correlation = +1.0)]

 

 

Calculating the Covariance:

The first step in calculating a correlation coefficient is to quantify the covariance between two variables

For the sake of an example, consider the height and weight variables from our class dataset ...

 

 

We’ll just focus on the first 12 subjects’ data for now

 

Subject   Height (X)   Weight (Y)
1         69           108
2         61           130
3         68           135
4         66           135
5         66           120
6         63           115
7         72           150
8         62           105
9         62           115
10        67           145
11        66           132
12        63           120
Mean      65.42        125.83

Sum(X) = 785        Sum(Y) = 1510
Sum(X²) = 51473     Sum(Y²) = 192238
Sum(XY) = 99064

The covariance of these variables is computed as:

cov_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{N - 1}

But what does it mean?

The covariance formula should look familiar to you. If all the Ys were exchanged for Xs, the covariance formula would be the variance formula

Note what this formula is doing, however: it is capturing the degree to which pairs of points systematically vary around their respective means

If paired X and Y values tend to both be above or below their means at the same time, this will lead to a high positive covariance 

However, if the paired X and Y values tend to be on opposite sides of their respective means, this will lead to a high negative covariance

If there are no systematic tendencies of the sort mentioned above, the covariance will tend towards zero

 

The Computational Formula for Cov

Given its similarity to the variance formula, it shouldn’t surprise you that there is also a computationally more workable version of the covariance formula:

cov_{XY} = \frac{\sum XY - \frac{(\sum X)(\sum Y)}{N}}{N - 1}

For our height versus weight example then:

cov_{XY} = \frac{99064 - \frac{(785)(1510)}{12}}{11} = \frac{99064 - 98779.17}{11} = \frac{284.83}{11} \approx 25.89

The covariance itself tells us little about the relation we are interested in, because its magnitude depends on the standard deviations of X and Y. It must be transformed (standardized) before it is useful, as described in the next section. (A quick computational check of the covariance is sketched below.)
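To check the covariance arithmetic above, here is a small Python sketch (using NumPy and the 12 subjects' heights and weights from the table) that computes the covariance with both the definitional and the computational formula:

    import numpy as np

    X = np.array([69, 61, 68, 66, 66, 63, 72, 62, 62, 67, 66, 63], dtype=float)              # height
    Y = np.array([108, 130, 135, 135, 120, 115, 150, 105, 115, 145, 132, 120], dtype=float)  # weight
    N = len(X)

    # Definitional formula: sum of cross-products of deviations, divided by N - 1
    cov_def = np.sum((X - X.mean()) * (Y - Y.mean())) / (N - 1)

    # Computational formula: (sum of XY - (sum X)(sum Y)/N) / (N - 1)
    cov_comp = (np.sum(X * Y) - X.sum() * Y.sum() / N) / (N - 1)

    print(cov_def, cov_comp)   # both are approximately 25.89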

 

The Pearson Product-Moment Correlation Coefficient (r)

The Pearson Product-Moment Correlation Coefficient, r, is computed simply by standardizing the covariance estimate as follows:

r = \frac{cov_{XY}}{s_X s_Y}

This results in r values ranging from -1.0 to +1.0 as discussed earlier

There is another way to represent this formula. It is:

r = \frac{SP_{XY}}{\sqrt{SS_X \, SS_Y}}

where SP_{XY} is the sum of the products of X and Y, SS_X is the sum of squares for X, and SS_Y is the sum of squares for Y (see the next overhead for a description of these beasties)

 

Sums of Squares and Sums of Products

The notion of sums of squares (and products) will come up later in this chapter and will be an important part of your C08 course 

Given this, it is worth spending some time on them now 

Computationally:

SS_X = \sum X^2 - \frac{(\sum X)^2}{N}

SS_Y = \sum Y^2 - \frac{(\sum Y)^2}{N}

SP_{XY} = \sum XY - \frac{(\sum X)(\sum Y)}{N}

So, these critters are just the numerators of the variance and covariance formulae, respectively

 

Back to the Height versus Weight example:

 

Height:  Sum(X) = 785,   Sum(X²) = 51473
Weight:  Sum(Y) = 1510,  Sum(Y²) = 192238
Sum(XY) = 99064

SS_X = 51473 - \frac{785^2}{12} = 120.92

SS_Y = 192238 - \frac{1510^2}{12} = 2229.67

SP_{XY} = 99064 - \frac{(785)(1510)}{12} = 284.83

r = \frac{SP_{XY}}{\sqrt{SS_X \, SS_Y}} = \frac{284.83}{\sqrt{(120.92)(2229.67)}} \approx 0.55

So, based on the 12 subjects we examined, the correlation between height and weight was +0.55
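The same calculation can be done in a few lines of Python, continuing with the X (height) and Y (weight) arrays from the covariance sketch above; NumPy's built-in corrcoef is printed as an independent check:

    import numpy as np

    X = np.array([69, 61, 68, 66, 66, 63, 72, 62, 62, 67, 66, 63], dtype=float)
    Y = np.array([108, 130, 135, 135, 120, 115, 150, 105, 115, 145, 132, 120], dtype=float)
    N = len(X)

    SP_xy = np.sum(X * Y) - X.sum() * Y.sum() / N    # sum of products
    SS_x  = np.sum(X ** 2) - X.sum() ** 2 / N        # sum of squares for X
    SS_y  = np.sum(Y ** 2) - Y.sum() ** 2 / N        # sum of squares for Y

    r = SP_xy / np.sqrt(SS_x * SS_y)
    print(round(r, 2))                 # approximately 0.55
    print(np.corrcoef(X, Y)[0, 1])     # same value from NumPy's built-in function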

Adjusted r

Unfortunately, the r we measure using our sample is not an unbiased estimator of the population correlation coefficient (rho)

We can correct for this using the adjusted correlation coefficient which is computed as follows:

r_{adj} = \sqrt{1 - \frac{(1 - r^2)(N - 1)}{N - 2}}

So, for our example:

r_{adj} = \sqrt{1 - \frac{(1 - 0.55^2)(12 - 1)}{12 - 2}} = \sqrt{0.233} \approx 0.48
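A quick check of the adjustment in Python, assuming the adjustment formula shown above and the values r = 0.55, N = 12 from our example:

    import math

    r, N = 0.55, 12
    r_adj = math.sqrt(1 - (1 - r ** 2) * (N - 1) / (N - 2))
    print(round(r_adj, 2))   # approximately 0.48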

 

The Regression Line

Often scatterplots will include a ‘regression line’ that overlays the points in the graph:

[Scatterplot of height vs. weight with the fitted regression line overlaid]

The regression line represents the best prediction of the variable on the Y axis (Weight) for each point along the X axis (Height)

For example, my (Steve’s) data is not depicted in the graph. But if I tell you that I am 75 inches tall, you can use the graph to predict my weight

Computing the Regression Line

Going back to your high school days, you perhaps recall that any straight line can be depicted by an equation of the form:

\hat{Y} = bX + a

where \hat{Y} = the predicted value of Y

b = the slope of the line (the change in Y as a function of X)

X = the various values of X

a = the intercept of the line (the point where the line hits the Y axis)

 

Since the regression line is supposed to be the line that provides the best prediction of Y, given some value of X, we need to find values of a & b that produce a line that will be the best-fitting linear function (i.e., the predicted values of Y will come as close as possible to the obtained values of Y)

 

Minimizing Prediction Errors

The first thing we need to do when finding this function is to define what we mean by best

Typically, the approach we take is to assume that the best regression line is the one that minimizes errors in prediction, which are mathematically defined as the difference between the obtained and predicted values of Y: (Y - \hat{Y})

this difference is typically termed the residual

For reasons similar to those involved in computations of variance, we cannot simply minimize \sum (Y - \hat{Y}) because that sum will equal zero for any line passing through the point (\bar{X}, \bar{Y})

Instead, we must minimize \sum (Y - \hat{Y})^2

 

Mathematical Razzmatazz

At this point, the textbook goes through a bunch of mathematical stuff showing you how to solve for a and b by substituting the equation for a line in for \hat{Y} in the expression \sum (Y - \hat{Y})^2, and then minimizing the result

You don’t have to know any of that, just the result, which is:

b = \frac{cov_{XY}}{s_X^2} = \frac{SP_{XY}}{SS_X}

a = \bar{Y} - b\bar{X}

For our height versus weight example, b = 2.36 and a = -28.56 (you should confirm these for yourself -- as a check of your understanding)

Thus, the regression line for our data is:

\hat{Y} = 2.36X - 28.56
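Here is a short Python sketch that computes b and a from the formulas above and then uses the fitted line to make the kind of prediction described earlier (the predicted weight for someone 75 inches tall); np.polyfit is printed only as an independent least-squares check:

    import numpy as np

    X = np.array([69, 61, 68, 66, 66, 63, 72, 62, 62, 67, 66, 63], dtype=float)              # height
    Y = np.array([108, 130, 135, 135, 120, 115, 150, 105, 115, 145, 132, 120], dtype=float)  # weight
    N = len(X)

    SP_xy = np.sum(X * Y) - X.sum() * Y.sum() / N
    SS_x  = np.sum(X ** 2) - X.sum() ** 2 / N

    b = SP_xy / SS_x                  # slope, approximately 2.36
    a = Y.mean() - b * X.mean()       # intercept, approximately -28.56

    print(b, a)
    print(b * 75 + a)                 # predicted weight at a height of 75 inches
    print(np.polyfit(X, Y, 1))        # returns [slope, intercept]; should agree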

Residual (or Error) Variance

Once we have obtained a regression line, the next issue concerns how well the regression line actually fits the data

Analogous to how we calculated the variance around a mean, we can calculate the variance around a regression line, termed the residual variance or error variance and denoted as s^2_{Y \cdot X}, in the following manner:

s^2_{Y \cdot X} = \frac{\sum (Y - \hat{Y})^2}{N - 2}

This equation uses N-2 in the denominator because two degrees of freedom were lost in calculating a and b

The square root of this term is called the standard error of the estimate and is denoted as s_{Y \cdot X}

 

Calculating s^2_{Y \cdot X}

1) The hard (but logical) way

First compute the predicted value \hat{Y} = 2.36X - 28.56 for each subject, and then the residual (Y - \hat{Y}):

Subject   X    Y     Ŷ        Y - Ŷ
1         69   108   134.28   -26.28
2         61   130   115.40    14.60
3         68   135   131.92     3.08
4         66   135   127.20     7.80
5         66   120   127.20    -7.20
6         63   115   120.12    -5.12
7         72   150   141.36     8.64
8         62   105   117.76   -12.76
9         62   115   117.76    -2.76
10        67   145   129.56    15.44
11        66   132   127.20     4.80
12        63   120   120.12    -0.12

\sum (Y - \hat{Y})^2 = 1558.71

s^2_{Y \cdot X} = \frac{1558.71}{12 - 2} \approx 155.9

s_{Y \cdot X} = \sqrt{155.9} \approx 12.5

2) The easy (but don’t ask me why it works) way

In another feat of mathematical wizardry, the textbook shows how you can go from the formula above, to the following (easier to work with) formula

s^2_{Y \cdot X} \approx s_Y^2 (1 - r^2)

If we use the non-corrected value of r, we should get approximately the same answer as when we used the hard way:

s^2_{Y \cdot X} \approx s_Y^2 (1 - r^2) = 202.70 (1 - 0.55^2) \approx 141.4

The difference is due partially to rounding errors, but mostly to the fact that this "easy" formula is actually an approximation that assumes large N. When N is small, the value it gives under-estimates the actual value; multiplying by (N-1)/(N-2) corrects for this (141.4 × 11/10 ≈ 155.5, close to the hard-way value).
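A sketch of both calculations in Python, continuing with the height (X) and weight (Y) arrays from the earlier sketches; the "easy" value here is the large-N approximation discussed above:

    import numpy as np

    X = np.array([69, 61, 68, 66, 66, 63, 72, 62, 62, 67, 66, 63], dtype=float)
    Y = np.array([108, 130, 135, 135, 120, 115, 150, 105, 115, 145, 132, 120], dtype=float)
    N = len(X)

    Y_hat = 2.36 * X - 28.56                        # predicted weights from the regression line

    # Hard way: variance of the residuals around the regression line
    s2_hard = np.sum((Y - Y_hat) ** 2) / (N - 2)    # approximately 155.9

    # Easy way: s_Y^2 * (1 - r^2), a large-N approximation
    r = np.corrcoef(X, Y)[0, 1]
    s2_easy = np.var(Y, ddof=1) * (1 - r ** 2)      # approximately 141.7

    print(s2_hard, s2_easy)
    print(s2_easy * (N - 1) / (N - 2))              # the correction brings the two into line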

Hypothesis Testing

The text discusses a number of hypothesis testing situations relevant to r and b, and gives the test to be performed in each situation

I only expect you to know 1) how to test whether a computed correlation coefficient is significantly different from zero, and 2) how to calculate the power of that test

That is, the most common reason one examines correlations is to see if two variables are related

If the computed correlation coefficient is significantly different from zero, that suggests that there is a relation ... the sign of the correlation describes exactly what that relation is

 

Testing whether r is different from zero

To test whether some computed r is significantly different from zero, you first compute the following t-value:

t = \frac{r \sqrt{N - 2}}{\sqrt{1 - r^2}}

That value is then compared to a critical t with N-2 degrees of freedom

If the obtained t is more extreme than the critical t, then you can reject the null hypothesis (that the variables are not related).

For our height versus weight example:

t = \frac{0.55 \sqrt{10}}{\sqrt{1 - 0.55^2}} = \frac{1.74}{0.835} \approx 2.08

t_crit(10) = 2.23, therefore we cannot reject H_0
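A quick Python check of this test using scipy; scipy.stats.pearsonr returns r together with the two-tailed p-value for the test against zero, and the manual t computation from the formula above is shown alongside:

    import numpy as np
    from scipy import stats

    X = np.array([69, 61, 68, 66, 66, 63, 72, 62, 62, 67, 66, 63], dtype=float)
    Y = np.array([108, 130, 135, 135, 120, 115, 150, 105, 115, 145, 132, 120], dtype=float)
    N = len(X)

    r, p = stats.pearsonr(X, Y)                      # r and its two-tailed p-value
    t = r * np.sqrt(N - 2) / np.sqrt(1 - r ** 2)     # t from the formula above
    t_crit = stats.t.ppf(0.975, df=N - 2)            # critical t for alpha = .05, two-tailed

    print(r, t, t_crit, p)   # |t| < t_crit and p > .05, so H0 is not rejected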

 

Power Calculations for r=0 tests

Power calculations are rather straightforward in this situation

The effect size d equals the correlation value you expect to obtain in your experiment (recall that you always base power calculations on the assumption that some effect of a certain size does exist)

Now:

\delta = d \sqrt{N - 1}

So, for example, let’s say that we wanted to know the power of our test to find a height versus weight relation, and we believe the true relation has an r of .55

The power of our previous experiment to find that effect is:

\delta = 0.55 \sqrt{12 - 1} = 0.55 (3.32) \approx 1.82

From the power table, power is 0.44

Our power calculations suggest that our previous examination of the height versus weight relation may have failed to reach significance simply because the study had low power

Thus, we could ask, how big would the study have to be to find a significant relation with a probability of .85? (if the relation truly exists as we believe)

With math wizardry:

For power of .85, the power table requires \delta = 3.00; working backwards from \delta = d\sqrt{N - 1}:

N = \left(\frac{\delta}{d}\right)^2 + 1 = \left(\frac{3.00}{0.55}\right)^2 + 1 \approx 31

so we would need roughly 31 subjects
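A sketch of these power calculations in Python; rather than reading the printed power table, it uses the normal approximation that such tables are built from (power ≈ P(Z > z_crit - delta) for a two-tailed test at alpha = .05), so the values should match the table only approximately:

    from scipy import stats

    d, N, alpha = 0.55, 12, 0.05
    z_crit = stats.norm.ppf(1 - alpha / 2)           # about 1.96 for a two-tailed test

    delta = d * (N - 1) ** 0.5                       # delta = d * sqrt(N - 1), about 1.82
    power = 1 - stats.norm.cdf(z_crit - delta)       # about 0.44
    print(delta, power)

    # Sample size needed for power = .85: find the required delta, then solve for N
    delta_needed = z_crit + stats.norm.ppf(0.85)     # about 3.00
    N_needed = (delta_needed / d) ** 2 + 1           # about 31 subjects
    print(delta_needed, N_needed)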

Testing whether two rs are different

Testing for a difference between two independent rs (i.e. does r1-r2=0?) turns out to be a trickier issue, because the sampling distribution of the difference in two r values is not normally distributed

Fisher has shown that this problem can be compensated for by first transforming each of the r values using the following formula:

r' = \frac{1}{2} \ln \left( \frac{1 + r}{1 - r} \right)

this leads to an r' value that is approximately normally distributed and whose standard error is given by the following formula:

s_{r'} = \frac{1}{\sqrt{N - 3}}

Given all this, one can test for the difference between two independent rs using the following z-test:

z = \frac{r'_1 - r'_2}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}}

So, the steps we have to go through to test the difference of two independent rs are (a code sketch follows the list):

 
1) compute both r values
2) transform both r values
3) get a z value based on the formula above
4) find the probability associated with that z value
5) compare the obtained probability to alpha divided by two (or alpha if doing a one-tailed test)
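Here is a sketch of those steps as a small Python helper function; the function name and the example r and N values at the bottom are made up purely for illustration:

    import math
    from scipy import stats

    def fisher_z_test(r1, n1, r2, n2):
        """z-test for the difference between two independent correlations."""
        # Step 2: Fisher r-to-z transform of each correlation
        r1_prime = 0.5 * math.log((1 + r1) / (1 - r1))
        r2_prime = 0.5 * math.log((1 + r2) / (1 - r2))
        # Step 3: z based on the standard error of the difference
        se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
        z = (r1_prime - r2_prime) / se
        # Step 4: one-tailed probability associated with that z
        p_one_tailed = 1 - stats.norm.cdf(abs(z))
        return z, p_one_tailed

    # Hypothetical example: r = .55 in a sample of 12, r = .20 in a sample of 30
    z, p = fisher_z_test(0.55, 12, 0.20, 30)
    print(z, p)   # Step 5: compare p to alpha/2 (or to alpha for a one-tailed test)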

 

Front to back example:

 

Midterm    Quiz
54         6
66         8
44.5       4
78.5       9.5
76.5       8.5
47         5
70.5       8
52.5       7
64.5       8.5
65         6
51.5       4
66.5       3
53.5       9.5

Sum        790.5      87
SumSq      49524.25   640
Sum(XY) = 5444.5