Chapter 12
Multiple Comparisons Among Treatment Means
When you use an ANOVA and find a significant F, all that says is that the various means are not all equal
It does not say which means are different
The purpose of this chapter is to describe a number of different ways of testing which means are different
Before describing the tests, it is necessary to consider two different ways of thinking about error and how they are relevant to doing multiple comparisons
Error Rate per Comparison (PC)
This is simply the Type I error that we have talked about all along. So far, we have been simply setting its value at .05, a 5% chance of making an error
Familywise Error Rate (FW)
Often, after an ANOVA, we want to do a number of comparisons, not just one
The collection of comparisons we do is described as the "family"
The familywise error rate is the probability that at least one of these comparisons will include a type I error
Assuming that α′ is the per comparison error rate, then:
The per comparison error: α = α′
but, the familywise error: α = 1 - (1 - α′)^{c}
Thus, if we do two comparisons, but keep α′ at 0.05, the FW error will really be:
α = 1 - (1 - 0.05)^{2}
  = 1 - (0.95)^{2} = 1 - 0.9025 = 0.0975
Thus, there is almost a 10% chance of one of the comparisons being significant when we do two comparisons, even when the nulls are true.
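The familywise calculation above is easy to sketch in code (the function name here is mine, not from the text):

```python
# Familywise error rate for c comparisons, each run at a
# per-comparison alpha of alpha_pc: FW = 1 - (1 - alpha_pc)^c
def familywise_error(alpha_pc, c):
    return 1 - (1 - alpha_pc) ** c

fw = familywise_error(0.05, 2)  # the two-comparison case above: 0.0975
```

Running it for larger c shows how quickly the problem grows: with 10 comparisons at α′ = .05, FW is already about .40.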
The basic problem then, is that if we are doing many comparisons, we want to somehow control our familywise error so that we don’t end up concluding that differences are there, when they really are not
The various tests we will talk about differ in terms of how they do this
They will also be categorized as being either "A priori" or "post hoc"
A priori: A priori tests are comparisons that the experimenter clearly intended to test before collecting any data
Post hoc: Post hoc tests are comparisons the experimenter has decided to test after collecting the data, looking at the means, and noting which means "seem" different.
The probability of making a type I error is smaller for A priori tests because, when doing post hoc tests, you are essentially doing all possible comparisons before deciding which to test in a formal statistical manner
Steve: Significant F issue
An example for context
See page 351 for a very complete description of the Morphine Tolerance study … Siegel (1975)
Highlights:
Source      df   SS       MS      F
Treatment    4   3497.60  874.40  27.33
Within      35   1120.00   32.00
Total       39   4617.60
A Priori Comparisons
As discussed, these tests are only appropriate when the specific means to be compared were chosen before the data were collected and the means were examined (hence, a priori)
Multiple t-tests
One obvious thing to do is simply conduct t-tests across the groups of interest
However, when we do so, we use the MS_{error} in the denominator instead of using the individual or pooled variance estimates (and evaluate t using df equal to df_{error})
This is because the MS_{error} is also assumed to measure random variation, but provides a better measure than the group variances, as it is based on a larger n
Thus, the general t formula becomes:
t = (X̄_{i} - X̄_{j}) / √(MS_{error}(1/n_{i} + 1/n_{j})) … which, with equal n, reduces to t = (X̄_{i} - X̄_{j}) / √(2MS_{error}/n)
Examples
Group MS versus Group SS
Group McM versus Group MM
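These two example comparisons can be sketched with the morphine study values from the ANOVA table above (MS_{error} = 32, n = 8 per group); the function name is mine:

```python
from math import sqrt

# t-test between two group means, substituting MS_error from the overall
# ANOVA (evaluated on df_error degrees of freedom) for the usual
# pooled variance estimate.  Equal-n form: t = (Xi - Xj) / sqrt(2*MSe/n)
def protected_t(mean_i, mean_j, ms_error, n):
    return (mean_i - mean_j) / sqrt(2 * ms_error / n)

t_ms_ss = protected_t(4, 11, 32, 8)    # Group MS versus Group SS
t_mcm_mm = protected_t(29, 10, 32, 8)  # Group McM versus Group MM
```

Each t would be evaluated against the critical t for df_{error} = 35.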
Linear Contrasts
While t-tests allow us to compare one mean with another, we can use linear contrasts to compare some mean (or set of means) with some other mean (or set of means)
Must first understand the notion of a linear combination of means:
L = a_{1}X̄_{1} + a_{2}X̄_{2} + … + a_{k}X̄_{k} = Σa_{j}X̄_{j}
Note, if all the a's were 1, L would be the sum of the means … if all the a's were equal to 1/k, L would be the mean of the means
To make a linear combination a linear contrast, we simply impose the restriction that Σa_{j} = 0
So, we select our values of a_{j} in a way that defines the contrast we are interested in
For example, say we had three means and we want to compare the first two … (a_{1} = 1, a_{2} = -1, a_{3} = 0, giving L = X̄_{1} - X̄_{2})
This is simply a ttest
More Generally
For example, the contrast (a_{1} = 1/2, a_{2} = 1/2, a_{3} = -1) gives L = (X̄_{1} + X̄_{2})/2 - X̄_{3}
So, the above contrast compares the average of the first two means with the third mean
You can basically make any contrast you want as long as Σa_{j} = 0
Of course, the trick then is testing if the contrast is significant:
SS for contrasts:
While I won't work out the proof, the SS for a contrast is a component of SS_{treat}, and the value of SS_{contrast} can be quantified as:
SS_{contrast} = nL^{2} / Σa_{j}^{2}
where n is the number of subjects within each of the treatment groups
A contrast always has 1 df
Examples:
Assume three means of 1.5, 2.0 and 3.0, each based on 10 subjects
When we run the overall ANOVA, we find a SS_{treat} of 11.67
Contrast 1 (coefficients 1, -1, 0): L = 1.5 - 2.0 = -0.5 … SS_{contrast1} = 10(-0.5)^{2}/2 = 1.25
Contrast 2 (coefficients 1/2, 1/2, -1): L = (1.5 + 2.0)/2 - 3.0 = -1.25 … SS_{contrast2} = 10(-1.25)^{2}/1.5 = 10.42
Note that SS_{treat} = SS_{contrast1} + SS_{contrast2}
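A short sketch of the SS_{contrast} formula, applied to the three-mean example above (the orthogonal coefficient sets (1, -1, 0) and (1/2, 1/2, -1) are my assumption about which two contrasts were intended, chosen so the SSs sum to SS_{treat}):

```python
# SS for a linear contrast: SS = n * L^2 / sum(a_j^2),
# where L = sum(a_j * mean_j) and n is the per-group sample size.
def ss_contrast(means, coeffs, n):
    L = sum(a * m for a, m in zip(coeffs, means))
    return n * L ** 2 / sum(a ** 2 for a in coeffs)

means = [1.5, 2.0, 3.0]
ss1 = ss_contrast(means, [1, -1, 0], 10)       # mean 1 vs mean 2
ss2 = ss_contrast(means, [0.5, 0.5, -1], 10)   # avg of first two vs third
# ss1 + ss2 recovers SS_treat = 11.67, since the contrasts are orthogonal
```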
Choosing Coefficients:
Sometimes choosing the values for the coefficients is fairly straightforward, as in the previous two cases
But what about when it gets more complicated … say you have seven means, and you want to compare the average of the first 2 with the average of the last 5
The trick … think of those sets of means as forming 2 groups, Group A (means 1 & 2) and Group B (the rest). Now, write out each mean and, before all of the Group A means, put the number of Group B means; then, before all the Group B means, put the number of Group A means. Then, give one of the groups a plus sign, the other a minus: (5, 5, -2, -2, -2, -2, -2)
If you wanted to compare the first three means with the last 4, it would be: (4, 4, 4, -3, -3, -3, -3)
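The trick is mechanical enough to write as code (a sketch; the function name is mine):

```python
# Coefficients for comparing the first n_a means (Group A) with the
# remaining n_b means (Group B): Group A means each get +n_b,
# Group B means each get -n_a, so the coefficients sum to zero.
def contrast_coeffs(n_a, n_b):
    return [n_b] * n_a + [-n_a] * n_b

c1 = contrast_coeffs(2, 5)  # first 2 vs last 5: [5, 5, -2, -2, -2, -2, -2]
c2 = contrast_coeffs(3, 4)  # first 3 vs last 4: [4, 4, 4, -3, -3, -3, -3]
```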
Know about the unequal n stuff, but don’t worry about it (you will only be asked to do equal n)
Testing For Significance
Once you have the SS_{contrast}, you treat it like SS_{treat} when it comes to testing for significance. That is, you calculate an F as:
F = MS_{contrast} / MS_{error} = SS_{contrast} / MS_{error} (since a contrast has 1 df, MS_{contrast} = SS_{contrast})
That F has 1 and df_{error} degrees of freedom
For our morphine study then, we might do the following contrasts:
Critical F(1,35) = 4.12 at alpha = .05 (about 7.4 with alpha = .01)
See the text (p. 359) for a detailed description of how the SSs for these contrasts were calculated
Note: With 4 contrasts, FW error = 1 - (0.95)^{4} ≈ 0.19 … could reduce this by using a lower PC level of alpha, or by doing fewer comparisons
Orthogonal Comparisons
Sometimes contrasts are independent, sometimes not
For example, if you find that mean1 is greater than the average of means 2 & 3, that tells us nothing about whether mean4 is bigger than mean 5 … those contrasts are independent
However, if you find that mean1 is greater than the average of means 2 & 3, then chances are that mean1 is greater than mean2 … those two contrasts would not be independent
When members of a set of contrasts are independent of one another, they are termed "orthogonal contrasts"
The total SS for a complete set of orthogonal contrasts always equals SS_{treat}
This is a nice property, as it is like SS_{treat} is being decomposed into a set of independent chunks, each of relevance to the experimenter
Creating a set of orthogonal contrasts
Bonferroni t (Dunn’s test)
As mentioned several times, the problem with doing multiple comparisons is that the family wise error of the experiment increases with each comparison you do
One way to control this is to try hard to limit the number of comparisons (perhaps using contrasts instead of a bunch of t-tests)
Another way is to reduce your per comparison level of alpha to compensate for the inflation caused by doing multiple tests
If you want to continue using the tables we have, then you can only reduce alpha in crude steps (e.g., from .05 to .01)
In many cases, that may be overkill (e.g., three comparisons)
Dunn’s test allows you to do this same thing in a more precise manner
Doing a Dunn’s test
Step 1:
The first thing to do is to "compute" a value of t′ for each comparison you wish to perform
If that comparison is a planned t-test, then t′ simply equals your t_{obtained} and has the same degrees of freedom as that t
If that comparison is a linear contrast, then t′ equals the square root of the F associated with that contrast, and has degrees of freedom equal to that of MS_{error}
Step 2:
Go to the t′ table at the back of the book (p. 687) and find the critical t associated with the number of comparisons you are performing overall, and the relevant degrees of freedom
Now compare each obtained t′ with that critical t value, which is really the critical t associated with a per comparison alpha of the desired level of familywise error divided by the number of comparisons
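The adjusted per-comparison alpha behind the table lookup can be sketched directly. The first function is the Bonferroni division Dunn's test is built on; the second (my addition, not in the notes) inverts the familywise formula from earlier in the chapter exactly, and gives a very slightly larger alpha:

```python
# Per-comparison alpha that keeps familywise error near alpha_fw
# across c comparisons (Bonferroni: simple division)
def bonferroni_alpha(alpha_fw, c):
    return alpha_fw / c

# Exact inversion of FW = 1 - (1 - a')^c for a'
def exact_alpha(alpha_fw, c):
    return 1 - (1 - alpha_fw) ** (1 / c)

a_bonf = bonferroni_alpha(0.05, 4)  # 0.0125 for four comparisons
a_exact = exact_alpha(0.05, 4)      # slightly larger, about 0.0127
```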
* Don’t worry about multistage bonferronis
Example:
Post Hoc Comparisons
Whenever possible, it is best to specify a priori the comparisons you wish to do, then do linear contrasts in combination with the Bonferroni t-test
However, there are situations in which the experimenter really is not sure what outcome(s) to expect
In these situations, the correct thing to do is one of a number of post hoc comparison procedures, depending on the experimental context and how liberal versus conservative the experimenter wishes to be
We will talk about the following procedures:
You should be trying to understand not only how to do each test, but also why you might choose one procedure over another in a given experimental situation
Fisher’s Least Significant Difference (LSD)
a.k.a. Fisher’s protected t
In fact, this procedure is no different from the a priori t-test described earlier EXCEPT that it requires that the F test (from the ANOVA) be significant prior to computing the t values
The requirement of a significant overall F ensures that the familywise error for the complete null hypothesis (i.e., that all the means are equal) will remain at alpha
However, it does nothing to control for inflation of the familywise error when performing the actual comparisons
This is OK if you have only three means (see text for a description of why)
But if you have more than three, then the LSD test is very liberal (i.e., high probability of Type I errors), too liberal for most situations
The Studentized Range Statistic
Many of the tests we will discuss are based on the studentized range statistic (q)
Thus, it is important to understand it first
The mathematical definition of it is:
q_{r} = (X̄_{largest} - X̄_{smallest}) / √(MS_{error}/n)
where X̄_{largest} and X̄_{smallest} represent the largest and smallest of the treatment means, and r is the number of treatments in the set
Note that this statistic is very similar to the t statistic … in fact, for two means, q = t√2
q tables are set up according to the number of treatment means; when there are only two means, the q and t tables are equivalent
Using the Studentized Range (an example with logic)
When you do this test, you first take your means (we will use the morphine means) and arrange them in ascending order:
4 10 11 24 29
then, if you want to compare the difference between some means (say the largest and smallest), you compute a q_{r} and compare it to the value given in the q table (usual logic: if q_{obt} > q_{crit}, reject H_{0})
So, comparing the largest and smallest: q_{5} = (29 - 4) / √(32/8) = 25/2 = 12.5
From the tables, q_{5}(35 df) = 4.07. Since q_{obt} > q_{crit}, we reject H_{0} and conclude the means are significantly different
Note how large the q_{critical} is … that is because it controls for the number of means there are (as Steve will hopefully explain)
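The computation just shown, as a sketch (function name mine; values from the morphine study):

```python
from math import sqrt

# Studentized range statistic: the range of the means divided by the
# standard error of a single mean, sqrt(MS_error / n)
def q_stat(mean_max, mean_min, ms_error, n):
    return (mean_max - mean_min) / sqrt(ms_error / n)

# largest mean 29, smallest 4, MS_error = 32, n = 8 per group
q = q_stat(29, 4, 32, 8)  # 25 / 2 = 12.5, well past q_crit = 4.07
```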
Q tables:
Newman-Keuls Test
Basic goal is to sort the means into subsets of means such that means within a given subset are not different from each other, but are different from means in another subset
How to:
Example
Step 1:
Group:  MS      MM      SS      SM      McM
        X̄_{1}   X̄_{2}   X̄_{3}   X̄_{4}   X̄_{5}
Mean:    4      10      11      24      29
Step 2:
Compute the critical width for each range r: W_{r} = q_{r}(df_{error}) × √(MS_{error}/n)
W3: To be filled in
W4: To be filled in
Steps 3 & 4:
Compare each pairwise difference to the critical width W_{r} for the range it spans:

        MM    SS    SM    McM
MS       6*    7*   20*   25*
MM             1    14*   19*
SS                  13*   18*
SM                         5

(* = difference exceeds its critical width W_{r})
Step 5:
MS   MM  SS   SM  McM
 4   10  11   24  29
     ------   -------
In words then, these results suggest the following.
First, the rats who received morphine on all occasions are acting the same as those who received saline on all occasions .. suggesting that a tolerance has developed very quickly.
Those rats who received morphine 3 times, but then only saline on the test trial are significantly more sensitive to pain than those who received saline all the time, or morphine all the time. This suggests that a compensatory mechanism was operating, making the rats hypersensitive to pain when not opposed by morphine.
Finally, those rats who received morphine in their cage three times before receiving it in the testing context seem as insensitive to pain as those who received morphine for the first time at test, both groups being significantly less sensitive to pain than the SS or MM groups. This suggests the compensatory mechanism is very context specific and does not operate when the context is changed.
Unequal Sample Sizes
Once again, don’t worry about the details of dealing with unequal n … just know that if you are ever in the position of having unequal n, there are ways of dealing with it
Familywise Error (The problem with N-K)
This part is confusing, but I’ll try my best …
The problem with the Newman-Keuls is that it doesn’t control FW error very well (i.e., it tends to be fairly liberal … too liberal for the taste of many)
Situation 1:
Situation 2:
Prob of Type I error is related to the number of possible null hypotheses … FW = 1 - (1 - α)^{nulls}
So, as the number of means increases, FW error can increase considerably and is typically around 0.10 for most experiments (four or five means)
Tukey’s Test
A.K.A. the honestly significant difference (HSD) test
Tukey’s test is simply a more conservative version of the Newman-Keuls (keeps FW error at .05)
The real difference is that instead of comparing the difference between a pair of means to a q value tied to the range of those means … the q of the largest range is always used (q_{HSD} = q_{max})
So in the morphine rats example, we would compare each difference to the critical width for q_{5}, 4.07 × √(32/8) = 8.14 … producing the following results:

        MM    SS    SM    McM
MS       6     7    20*   25*
MM             1    14*   19*
SS                  13*   18*
SM                         5

(* = difference exceeds 8.14)

MS   MM  SS   SM  McM
 4   10  11   24  29
----------    -------
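The Tukey results above can be sketched in a few lines (q_{5}(35 df) = 4.07 is taken from the course table; the function name is mine):

```python
from math import sqrt

# Tukey HSD: every pair of means is compared against the SAME critical
# width, W = q_k * sqrt(MS_error / n), based on the full range of k means
def hsd_width(q_crit, ms_error, n):
    return q_crit * sqrt(ms_error / n)

w = hsd_width(4.07, 32, 8)  # 4.07 * 2 = 8.14
means = {"MS": 4, "MM": 10, "SS": 11, "SM": 24, "McM": 29}
groups = list(means)
sig = [(a, b) for i, a in enumerate(groups) for b in groups[i + 1:]
       if abs(means[a] - means[b]) > w]
# six pairs exceed 8.14; {MS, MM, SS} and {SM, McM} are homogeneous subsets
```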
The Ryan Procedure
Read through this section and note the similarities to Dunn’s test (the Bonferroni t)
However, given that using the procedure requires either specialized tables or a statistical software package … you will never be required to actually do it
Thus, get the general idea, but don’t worry about details
The Scheffé Test
The Scheffé test extends the post hoc analysis possibilities to include linear contrasts as well as comparisons between specific means
As before, a linear contrast is always described by the equation:
L = Σa_{j}X̄_{j}
With the SS_{contrast} being equal to:
SS_{contrast} = nL^{2} / Σa_{j}^{2}
And the F_{obtained} then being:
F_{obtained} = SS_{contrast} / MS_{error}
*recall that there is always 1 df associated with a contrast, so MS_{contrast} = SS_{contrast}
This is all as it was for the a priori version of contrasts. The difference is that instead of comparing the F_{obtained} to F(1, df_{error}), the F_{critical} is:
F_{critical} = (k - 1) × F(k - 1, df_{error})
Doing this will hold FW error constant for all linear contrasts
However, there is a cost … the Scheffé is the least powerful of all the post hoc procedures (i.e., very conservative)
Moral: Don’t use it when you only want to compare means or when you can justify comparisons a priori
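The Scheffé criterion is a one-liner. For the morphine study (k = 5, df_{error} = 35), the tabled value F_{.05}(4, 35) ≈ 2.64 is my approximation from a standard F table, not a number given in the notes:

```python
# Scheffe criterion: inflate the tabled F by (k - 1) so that FW error is
# controlled over ALL possible linear contrasts, not just the one tested
def scheffe_critical_f(k, f_table):
    return (k - 1) * f_table

f_crit = scheffe_critical_f(5, 2.64)  # about 10.56 for the morphine study
# compare: a single a priori contrast only needs F(1, 35) = 4.12
```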
Dunnett’s Test
This test was designed specifically for comparing a number of treatment groups with a control group
Note, this situation is somewhat different from the previous post hoc tests in that it is somewhat a priori (i.e., the "position" of the control condition can vary … Steve will explain … hopefully)
This allows the Dunnett’s test to be more powerful … FW error can be controlled in less stringent ways
All that is really involved is using a different t table when looking up the critical t … t_{d}
However, when using this test, the most typical thing to do is to calculate a critical difference … that is, when the difference between any two means exceeds this value, those means are significantly different
Doing a Dunnett’s
MS MM SS SM McM
4 10 11 24 29
Critical Difference = t_{d} × √(2MS_{error}/n)
We get the t_{d} value from the table with k=5 and df_{error}=35 … producing a value of 2.56
Critical Difference = 2.56 × √(2(32)/8) = 2.56 × 2.83 = 7.24
So, assuming the SS group is the control group … any mean that is more than 7.24 units from it is considered significantly different
That is … the SM and McM groups
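The Dunnett computation above, sketched in code (t_{d} = 2.56 comes from the notes; the function name is mine):

```python
from math import sqrt

# Dunnett's test as a critical difference: any treatment mean that is
# more than t_d * sqrt(2 * MS_error / n) from the control mean differs
# significantly from it
def dunnett_cd(t_d, ms_error, n):
    return t_d * sqrt(2 * ms_error / n)

cd = dunnett_cd(2.56, 32, 8)  # about 7.24
control = 11  # SS group mean, taken as the control
treatment = {"MS": 4, "MM": 10, "SM": 24, "McM": 29}
sig = [g for g, m in treatment.items() if abs(m - control) > cd]
# only SM and McM are more than 7.24 units from the control
```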
Which test when and why?
When I presented each test, I went through the situations in which they are typically used so I hope you have a decent idea about that
Nonetheless, read the "comparison of the alternative procedures" and "which test?" sections of the text to make sure you have a good feel for this
Make sure you understand the distinction between a priori versus post hoc tests … and the distinction between the tests within each category