Chapter 12

Multiple Comparisons Among Treatment Means

When you use an ANOVA and find a significant F, all that tells you is that the various means are not all equal

It does not tell you which means are different

The purpose of this chapter is to describe a number of different ways of testing which means are different

Before describing the tests, it is necessary to consider two different ways of thinking about error and how they are relevant to doing multiple comparisons

Error Rate per Comparison (PC)

This is simply the Type I error that we have talked about all along. So far, we have been simply setting its value at .05, a 5% chance of making an error

Familywise Error Rate (FW)

Often, after an ANOVA, we want to do a number of comparisons, not just one

The collection of comparisons we do is described as the "family"

The familywise error rate is the probability that at least one of these comparisons will include a type I error

Assuming that α′ is the per comparison error rate, then:

The per comparison error: α = α′

but the familywise error: α = 1 − (1 − α′)^c, where c is the number of comparisons

Thus, if we do two comparisons but keep α′ at 0.05, the FW error will really be:

α = 1 − (1 − 0.05)²

= 1 − (0.95)² = 1 − 0.9025 = 0.0975

Thus, there is almost a 10% chance of at least one of the comparisons coming out significant when we do two comparisons, even when the nulls are true.
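A quick Python sketch of this calculation (nothing fancy, just the formula above):

```python
# Familywise error rate for c independent comparisons, each run at a
# per comparison alpha of alpha_pc.
def familywise_error(alpha_pc: float, c: int) -> float:
    """Probability of at least one Type I error across c comparisons."""
    return 1 - (1 - alpha_pc) ** c

# Two comparisons at alpha' = .05, as in the worked example:
print(round(familywise_error(0.05, 2), 4))   # 0.0975
```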

The basic problem then, is that if we are doing many comparisons, we want to somehow control our familywise error so that we don’t end up concluding that differences are there, when they really are not

The various tests we will talk about differ in terms of how they do this

They will also be categorized as being either "A priori" or "post hoc"

A priori: A priori tests are comparisons that the experimenter clearly intended to test before collecting any data

Post hoc: Post hoc tests are comparisons the experimenter has decided to test after collecting the data, looking at the means, and noting which means "seem" different.

The probability of making a type I error is smaller for A priori tests because, when doing post hoc tests, you are essentially doing all possible comparisons before deciding which to test in a formal statistical manner


An example for context

See page 351 for a very complete description of the morphine tolerance study by Siegel (1975)

Highlights:

• paw lick latency as a measure of pain resistance
• tolerance to morphine develops quickly
• notion of a compensatory mechanism
• this mechanism is very context dependent

`Group:   M-S    M-M    S-S    S-M   Mc-M`
`          3      2     14     29     24`
`          5     12      6     20     26`
`          1     13     12     36     40`
`          8      6      4     21     32`
`          1     10     19     25     20`
`          1      7      3     18     33`
`          4     11      9     26     27`
`          9     19     21     17     30`
`Total    32     80     88    192    232`
`Mean      4     10     11     24     29`
`Var    9.99  26.32  45.16  40.58  37.95`

`Source      df       SS       MS      F`
`Treatment    4  3497.60   874.40  27.33`
`Within      35  1120.00    32.00`
`Total       39  4617.60`
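As a check, the table can be recomputed from the raw scores in a few lines of Python (a sketch, not something you need for the course):

```python
# Recompute the one-way ANOVA directly from the raw scores in the data table.
groups = {
    "M-S":  [3, 5, 1, 8, 1, 1, 4, 9],
    "M-M":  [2, 12, 13, 6, 10, 7, 11, 19],
    "S-S":  [14, 6, 12, 4, 19, 3, 9, 21],
    "S-M":  [29, 20, 36, 21, 25, 18, 26, 17],
    "Mc-M": [24, 26, 40, 32, 20, 33, 27, 30],
}

n = 8                                   # subjects per group
k = len(groups)                         # number of groups
grand_mean = sum(sum(g) for g in groups.values()) / (n * k)

ss_treat = n * sum((sum(g) / n - grand_mean) ** 2 for g in groups.values())
ss_within = sum((x - sum(g) / n) ** 2 for g in groups.values() for x in g)

ms_treat = ss_treat / (k - 1)           # df = 4
ms_within = ss_within / (k * (n - 1))   # df = 35
F = ms_treat / ms_within

# SS_treat ≈ 3497.6, SS_within = 1120, F ≈ 27.33
print(round(ss_treat, 2), round(ss_within, 2), round(F, 2))
```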

A Priori Comparisons

As discussed, these tests are only appropriate when the specific means to be compared were chosen before (i.e., a priori) the data were collected and the means examined

Multiple t-tests

One obvious thing to do is simply conduct t-tests across the groups of interest

However, when we do so, we use the MSerror in the denominator instead of using the individual or pooled variance estimates (and evaluate t using df equal to df error)

This is because the MSerror is also assumed to measure random variation, but provides a better measure than the individual group variances as it is based on a larger n

Thus, the general t formula becomes:

t = (X̄i − X̄j) / √(2MSerror / n)

Examples

Group M-S versus Group S-S:  t = (4 − 11) / √(2(32)/8) = −7/2.83 = −2.47 … this exceeds the critical t.05(35) of ±2.03, so the difference is significant

Group Mc-M versus Group M-M:  t = (29 − 10) / √(2(32)/8) = 19/2.83 = 6.72 … also significant
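In Python, these two comparisons look like this (a sketch using MSerror = 32 and n = 8 from the ANOVA table; the function name is mine):

```python
import math

# t-test between two group means, substituting MS_error for the pooled
# variance estimate (evaluated on df_error = 35 degrees of freedom).
def t_ms_error(mean_i, mean_j, ms_error, n):
    return (mean_i - mean_j) / math.sqrt(2 * ms_error / n)

t1 = t_ms_error(4, 11, 32, 8)      # M-S versus S-S
t2 = t_ms_error(29, 10, 32, 8)     # Mc-M versus M-M
print(round(t1, 2), round(t2, 2))  # -2.47 6.72
```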

Linear Contrasts

While t-tests allow us to compare one mean with another, we can use linear contrasts to compare some mean (or set of means) with some other mean (or set of means)

Must first understand the notion of a linear combination of means:

L = a1X̄1 + a2X̄2 + … + akX̄k = Σ ajX̄j

Note, if all the a’s were 1, L would be the sum of the means … if all the a’s were equal to 1/k, L would be the mean of the means

To make a linear combination a linear contrast, we simply impose the restriction that Σ aj = 0. So, we select our values of aj in a way that defines the contrast we are interested in

For example, say we had three means and we want to compare the first two … (a1 = 1, a2 = −1, a3 = 0, giving L = X̄1 − X̄2). This is simply a t-test

More generally, with coefficients (1/2, 1/2, −1): L = (X̄1 + X̄2)/2 − X̄3. So, the above contrast compares the average of the first two means with the third mean

You can basically make any contrast you want as long as Σ aj = 0. Of course, the trick then is testing if the contrast is significant:

SS for contrasts:

While I won’t work out the proof, the SS for a contrast is a component of SStreat, and the value of SScontrast can be quantified as:

SScontrast = nL² / Σ aj²

where n is the number of subjects within each of the treatment groups

Contrasts are always assumed to have 1 df

Examples:

Assume three means of 1.5, 2.0 and 3.0, each based on 10 subjects

When we run the overall ANOVA, we find a SStreat of 11.67

Contrast 1 (1, −1, 0):  L = 1.5 − 2.0 = −0.5, so SScontrast1 = 10(−0.5)²/2 = 1.25

Contrast 2 (1, 1, −2):  L = 1.5 + 2.0 − 2(3.0) = −2.5, so SScontrast2 = 10(−2.5)²/6 = 10.42

Note that SStreat = SScontrast1 + SScontrast2 = 1.25 + 10.42 = 11.67
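A few lines of Python confirm the partition (a sketch using the SScontrast formula above):

```python
# SS for a linear contrast: n * L^2 / sum(a_j^2), with L = sum(a_j * mean_j).
def ss_contrast(coefs, means, n):
    L = sum(a * m for a, m in zip(coefs, means))
    return n * L ** 2 / sum(a ** 2 for a in coefs)

means = [1.5, 2.0, 3.0]                      # three group means, n = 10 each
ss1 = ss_contrast([1, -1, 0], means, 10)     # mean1 vs mean2
ss2 = ss_contrast([1, 1, -2], means, 10)     # average of mean1, mean2 vs mean3
print(round(ss1, 2), round(ss2, 2), round(ss1 + ss2, 2))  # 1.25 10.42 11.67
```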

Choosing Coefficients:

Sometimes choosing the values for the coefficients is fairly straightforward, as in the previous two cases

But what about when it gets more complicated … say you have seven means, and you want to compare the average of the first 2 with the average of the last 5

The trick … think of those sets of means as forming 2 groups, Group A (means 1 & 2) and Group B (the rest). Now, write out each mean and, before all of the Group A means, put the number of Group B means; then, before all the Group B means, put the number of Group A means. Then give one of the groups a plus sign and the other a minus:

(5, 5, −2, −2, −2, −2, −2)

If you wanted to compare the first three means with the last 4, it would be:

(4, 4, 4, −3, −3, −3, −3)

Know about the unequal n stuff, but don’t worry about it (you will only be asked to do equal n)
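The trick can be written as a tiny helper (a Python sketch; the function name is made up for illustration):

```python
# Build contrast coefficients for comparing the average of the first n_a
# means (Group A) with the average of the remaining n_b means (Group B):
# Group A means get +n_b, Group B means get -n_a, so the coefficients sum to 0.
def contrast_coefs(n_a: int, n_b: int):
    return [n_b] * n_a + [-n_a] * n_b

print(contrast_coefs(2, 5))  # [5, 5, -2, -2, -2, -2, -2]
print(contrast_coefs(3, 4))  # [4, 4, 4, -3, -3, -3, -3]
assert sum(contrast_coefs(2, 5)) == 0
```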

Testing For Significance

Once you have the SScontrast, you treat it like SStreat when it comes to testing for significance. That is, you calculate an F as:

F = MScontrast / MSerror = SScontrast / MSerror

That F has 1 and dferror degrees of freedom

For our morphine study then, we might do the following contrasts:

`Group   M-S    M-M    S-S    S-M   Mc-M     SS      F`
`Mean   4.00  10.00  11.00  24.00  29.00`
`Con 1    -3      2     -3      2      2   1750   55**`
`Con 2     0     -1      0      0      1   1444   45**`
`Con 3    -1      0      1      0      0    196     6*`
`Con 4     0      1     -1      0      0      4  0.125`

Critical F.05(1,35) = 4.12 (about 7.42 with alpha = .01)

See the text (p. 359) for a detailed description of how the SSs for these contrasts were calculated

Note: With 4 contrasts, FWerror ≈ 1 − (.95)⁴ = .19 … could reduce this by using a lower PC level of alpha, or by doing fewer comparisons
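For the curious, a Python sketch recomputing the SS and F for each contrast from the group means (n = 8, MSerror = 32 from the ANOVA); the values match the table to rounding:

```python
# SS and F for a linear contrast; MS_contrast = SS_contrast because every
# contrast has 1 df, so F = SS_contrast / MS_error.
def contrast_F(coefs, means, n, ms_error):
    L = sum(a * m for a, m in zip(coefs, means))
    ss = n * L ** 2 / sum(a ** 2 for a in coefs)
    return ss, ss / ms_error

means = [4, 10, 11, 24, 29]            # M-S, M-M, S-S, S-M, Mc-M
results = []
for coefs in ([-3, 2, -3, 2, 2], [0, -1, 0, 0, 1],
              [-1, 0, 1, 0, 0], [0, 1, -1, 0, 0]):
    ss, F = contrast_F(coefs, means, 8, 32)
    results.append((ss, F))
    print(round(ss, 1), round(F, 2))
```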

Orthogonal Comparisons

Sometimes contrasts are independent, sometimes not

For example, if you find that mean1 is greater than the average of means 2 & 3, that tells us nothing about whether mean4 is bigger than mean 5 … those contrasts are independent

However, if you find that mean1 is greater than the average of means 2 & 3, then chances are that mean1 is greater than mean2 … those two contrasts would not be independent

When members of a set of contrasts are independent of one another, they are termed "orthogonal contrasts"

The total SS for a complete set of orthogonal contrasts always equals SStreat

This is a nice property as it is like the SStreat is being decomposed into a set of independent chunks, each of relevance to the experimenter

Creating a set of orthogonal contrasts

With equal n, two contrasts (with coefficients aj and bj) are orthogonal when Σ ajbj = 0 … a complete set for k means contains k − 1 mutually orthogonal contrasts

Bonferroni t (Dunn’s test)

As mentioned several times, the problem with doing multiple comparisons is that the familywise error of the experiment increases with each comparison you do

One way to control this, is to try hard to limit the number of comparisons (perhaps using contrasts instead of a bunch of t-tests)

Another way is to reduce your per comparison level of alpha to compensate for the inflation caused by doing multiple tests

If you want to continue using the tables we have, then you can only reduce alpha in crude steps (e.g., from .05 to .01)

In many cases, that may be overkill (e.g., when doing only three comparisons)

Dunn’s test allows you to do this same thing in a more precise manner

• basically, Dunn’s test allows you to evaluate each comparison at α′ = α/c, where c is the number of comparisons
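For c = 4 comparisons at a desired familywise alpha of .05, the adjustment works out like this (a small Python sketch):

```python
# Dunn/Bonferroni adjustment: evaluate each of c comparisons at alpha/c so
# that the familywise error rate stays at (just under) alpha.
alpha, c = 0.05, 4
alpha_pc = alpha / c                  # per comparison alpha: 0.0125
fw = 1 - (1 - alpha_pc) ** c          # actual familywise rate
print(alpha_pc, round(fw, 4))         # 0.0125 0.0491
assert fw <= alpha
```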

Doing a Dunn’s test

Step 1:

The first thing to do is to "compute" a value of t¢ for each comparison you wish to perform

If that comparison is a planned t-test, then t′ simply equals your tobtained and has the same degrees of freedom as that t

If that comparison is a linear contrast, then t′ equals the square root of the F associated with that contrast, and has degrees of freedom equal to those of MSerror

Step 2:

Go to the t′ table at the back of the book (p. 687) and find the critical t associated with the number of comparisons you are performing overall, and the relevant degrees of freedom

Now compare each obtained t′ with that critical t value, which is really the critical t associated with a per comparison alpha equal to the desired level of familywise error divided by the number of comparisons

* Don’t worry about multi-stage bonferronis

Example:

Post-Hoc Comparisons

Whenever possible, it is best to specify a priori the comparisons you wish to do, then do linear contrasts in combination with the Bonferroni t-test

However, there are situations in which the experimenter really is not sure what outcome(s) to expect

In these situations, the correct thing to do is one of a number of post-hoc comparison procedures depending on the experimental context, and how liberal versus conservative the experimenter wishes to be

We will talk about the following procedures:

• Fisher’s Least Significant Difference Procedure
• The Newman-Keuls Test
• Tukey’s Test
• The Ryan Procedure
• The Scheffé Test
• Dunnett’s Test

You should be trying to understand not only how to do each test, but also why you might choose one procedure over another in a given experimental situation

Fisher’s Least Significant Difference (LSD)

a.k.a. Fisher’s protected t

In fact, this procedure is no different from the a priori t-test described earlier EXCEPT that it requires the F test (from the ANOVA) to be significant prior to computing the t values

The requirement of a significant overall F ensures that the familywise error for the complete null hypothesis (i.e., that all the means are equal) will remain at alpha

However, it does nothing to control for inflation of the family-wise error when performing the actual comparisons

This is OK if you have only three means (see text for a description of why)

But if you have more than three, then the LSD test is very liberal (i.e., high probability of Type I errors), too liberal for most situations

The Studentized Range Statistic

Many of the tests we will discuss are based on the studentized range statistic (q)

Thus, it is important to understand it first

The mathematical definition of it is:

qr = (X̄largest − X̄smallest) / √(MSerror / n)

where X̄largest and X̄smallest represent the largest and smallest of the treatment means, and r is the number of treatments in the set

Note that this statistic is very similar to the t statistic … in fact, q tables are set up according to the number of treatment means; when there are only two means, q is simply t multiplied by √2

Using the Studentized Range (an example with logic)

When you do this test, you first take your means (we will use the morphine means) and arrange them in ascending order:

`4    10   11    24    29`

then, if you want to compare the difference between some means (say the largest and smallest), you compute a qr and compare it to the value given in the q table (usual logic: if qobt > qcrit, reject H0)

So, comparing the largest and smallest:

q5 = (29 − 4) / √(32/8) = 25/2 = 12.5

From the tables, q.05(5, 35) = 4.07. Since qobt > qcrit, we reject H0 and conclude the means are significantly different

Note how large the qcritical is … that is because it controls for the number of means in the set (as Steve will hopefully explain)
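The computation is easy to script (a sketch using the morphine numbers):

```python
import math

# Studentized range statistic for the largest vs smallest morphine means
# (MS_error = 32, n = 8 per group, r = 5 means in the set).
def q_stat(mean_max, mean_min, ms_error, n):
    return (mean_max - mean_min) / math.sqrt(ms_error / n)

q = q_stat(29, 4, 32, 8)
print(q)          # 12.5
assert q > 4.07   # q.05(5, 35) from the tables, so reject H0
```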


Newman-Keuls Test

Basic goal is to sort the means into subsets of means such that means within a given subset are not different from each other, but are different from means in another subset

How to:

• Sort the means in ascending order such that mean1 is the lowest mean … up to meank where k is the total number of means
• Calculate the Wr associated with each width, where the width r between means i and j equals i − j + 1
• Construct a matrix with the treatment means on the rows and columns, and the differences between means in the cells of the matrix
• Using rules (which I will specify momentarily), we move from right to left across the entries and compare each difference with its Wr
• Based on the pattern of significance observed, we group the means

Example

Step 1:

`M-S    M-M    S-S    S-M    Mc-M`
`X1     X2     X3     X4     X5`
`4      10     11     24     29`

Step 2: Wr = qr(dferror) × √(MSerror / n) = 2qr for these data

W2 = 5.75   W3 = 6.93   W4 = 7.63   W5 = 8.14

Steps 3 & 4:

`         M-S   M-M   S-S   S-M   Mc-M`
`          4    10    11    24    29      r     Wr`
`M-S   4         6     7    20    25      5   8.14`
`M-M  10               1    14    19      4   7.63`
`S-S  11                    13    18      3   6.93`
`S-M  24                           5      2   5.75`
`Mc-M 29`

Step 5:

`M-S     M-M    S-S    S-M    Mc-M`
`4       10     11     24     29`
`---     ----------    -----------`

In words then, these results suggest the following.

First, the rats who received morphine on all occasions are acting the same as those who received saline on all occasions .. suggesting that a tolerance has developed very quickly.

Those rats who received morphine 3 times, but then only saline on the test trial are significantly more sensitive to pain than those who received saline all the time, or morphine all the time. This suggests that a compensatory mechanism was operating, making the rats hypersensitive to pain when not opposed by morphine.

Finally, those rats who received morphine in their cage three times before receiving it in the testing context seem as non-sensitive to pain as those who received morphine for the first time at test, both groups being significantly less sensitive to pain than the S-S or M-M groups. This suggests the compensatory mechanism is very context specific and does not operate when the context is changed.
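For the curious, the matrix bookkeeping above can be sketched in a few lines of Python (simplified: it omits the stepwise stopping rule, and the q values are table values chosen to match the Wr column):

```python
import math

# Newman-Keuls bookkeeping for the morphine means (ascending order).
ordered = [("M-S", 4), ("M-M", 10), ("S-S", 11), ("S-M", 24), ("Mc-M", 29)]
q_crit = {2: 2.875, 3: 3.465, 4: 3.815, 5: 4.07}   # q_r(35), alpha = .05
ms_error, n = 32, 8

sig = {}
for i, (gi, mi) in enumerate(ordered):
    for j in range(i + 1, len(ordered)):
        gj, mj = ordered[j]
        r = j - i + 1                                # number of means spanned
        w_r = q_crit[r] * math.sqrt(ms_error / n)    # critical difference W_r
        sig[(gi, gj)] = (mj - mi) > w_r
        print(f"{gi} vs {gj}: diff = {mj - mi}, W{r} = {w_r:.2f}, "
              f"{'sig' if sig[(gi, gj)] else 'ns'}")
```

The pattern it prints reproduces the grouping in Step 5: M-S stands alone, M-M and S-S group together, and S-M and Mc-M group together.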

Unequal Sample Sizes

Once again, don’t worry about the details of dealing with unequal n … just know that if you are ever in the position of having unequal n, there are ways of dealing with it

Family-Wise Error (The problem with NK)

This part is confusing, but I’ll try my best …

The problem with the Newman-Keuls is that it doesn’t control FW error very well (i.e., it tends to be fairly liberal … too liberal for the taste of many)

The probability of a Type I error is related to the number of possible null hypotheses … FW = 1 − (1 − α)^nulls

So, as the number of means increases, FW error can increase considerably and is typically around 0.10 for most experiments (four or five means)

Tukey’s Test

A.K.A. - The honestly-significant difference (HSD) test

Tukey’s test is simply a more conservative version of the Newman-Keuls (keeps FW error at .05)

The real difference is that instead of comparing the difference between a pair of means to a q value tied to the range of those means … the q of the largest range is always used (qHSD = qmax)

So in the morphine rats example, we would compare each difference to the q5 value of 8.14 … producing the following results

`         M-S   M-M   S-S   S-M   Mc-M`
`          4    10    11    24    29      r     Wr`
`M-S   4         6     7    20    25      5   8.14`
`M-M  10               1    14    19          8.14`
`S-S  11                    13    18          8.14`
`S-M  24                           5          8.14`
`Mc-M 29`

`M-S     M-M     S-S     S-M     Mc-M`
`4       10      11      24      29`
`-------------------     ------------`
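The same comparisons in a quick Python sketch (every difference against the single critical value W5 = 8.14):

```python
import math

# Tukey's HSD: compare every pairwise difference to the largest-range
# critical value, W5 = q5(35) * sqrt(MS_error / n) = 4.07 * 2 = 8.14.
means = {"M-S": 4, "M-M": 10, "S-S": 11, "S-M": 24, "Mc-M": 29}
hsd = 4.07 * math.sqrt(32 / 8)

names = list(means)
sig = {(a, b): abs(means[a] - means[b]) > hsd
       for i, a in enumerate(names) for b in names[i + 1:]}
for pair, s in sig.items():
    print(pair, "sig" if s else "ns")
```

Note that M-S vs M-M (diff = 6), significant under the Newman-Keuls, is no longer significant here … that is the extra conservatism at work.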

The Ryan Procedure

Read through this section and note the similarities to the Dunn’s test (the Bonferroni t)

However, given that using the procedure requires either specialized tables or a statistical software package .. you will never be required to actually do it

Thus, get the general idea, but don’t worry about details

The Scheffé test

The Scheffé test extends the post-hoc analysis possibilities to include linear contrasts as well as comparisons between specific means

As before, a linear contrast is always described by the equation:

L = Σ ajX̄j

With the SScontrast being equal to:

SScontrast = nL² / Σ aj²

And the Fobtained then being:

Fobtained = MScontrast / MSerror

*recall that there is always 1 df associated with a contrast, so MScontrast = SScontrast

This is all as it was for the a-priori version of contrasts. The difference is that instead of comparing the Fobtained to an F(1,dferror), the Fcritical is:

Fcritical = (k-1) F(k-1,dferror)

Doing this will hold FW error constant for all linear contrasts

However, there is a cost … the Scheffé is the least powerful of all the post-hoc procedures (i.e., very conservative)

Moral: Don’t use it when you only want to compare pairs of means, or when you can justify the comparisons a priori
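Applying the Scheffé criterion to the four morphine contrasts shows the cost: Con 3, which was significant as an a-priori contrast, no longer reaches the critical value (a sketch; the tabled F.05(4, 35) ≈ 2.64 is an assumed, interpolated value):

```python
# Scheffe's criterion: any contrast must exceed (k - 1) * F(k - 1, df_error).
k, f_tab = 5, 2.64                    # f_tab assumed from an F table
f_scheffe = (k - 1) * f_tab           # ≈ 10.56, critical F for any contrast
obtained = {"Con 1": 54.7, "Con 2": 45.1, "Con 3": 6.1, "Con 4": 0.125}
for name, F in obtained.items():
    print(name, "sig" if F > f_scheffe else "ns")
```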

Dunnett’s Test

This test was designed specifically for comparing a number of treatment groups with a control group

Note, this situation is somewhat different from the previous post-hoc tests in that it is somewhat a-priori (i.e., the "position" of the control condition can vary .. Steve will explain .. hopefully)

This allows the Dunnett’s test to be more powerful … FW error can be controlled in less stringent ways

All that is really involved is using a different t table when looking up the critical t … td

However, when using this test, the most typical thing to do is to calculate a critical difference … that is, when the difference between any two means exceeds this value, those means are significantly different

Doing a Dunnett’s

`M-S     M-M     S-S     S-M    Mc-M`
`4       10      11      24     29`

Critical Difference = td √(2MSerror / n)

We get the td value from the table with k = 5 and dferror = 35 … producing a value of 2.56

Critical Difference = 2.56 √(2(32)/8) = 2.56(2.83) = 7.24

So, assuming the S-S group is the control group … any mean that is more than 7.24 units from it is considered significantly different

That is … the S-M and Mc-M groups
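A quick Python sketch of the whole procedure (group means from the table, td = 2.56 as above):

```python
import math

# Dunnett's test for the morphine data, with S-S (mean 11) as the control;
# t_d = 2.56 comes from the Dunnett table for k = 5 and df_error = 35.
treatment_means = {"M-S": 4, "M-M": 10, "S-M": 24, "Mc-M": 29}
control_mean = 11
cd = 2.56 * math.sqrt(2 * 32 / 8)              # critical difference ≈ 7.24

different = [g for g, m in treatment_means.items()
             if abs(m - control_mean) > cd]
print(round(cd, 2), different)   # 7.24 ['S-M', 'Mc-M']
```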

Which test when and why?

When I presented each test, I went through the situations in which they are typically used so I hope you have a decent idea about that

Nonetheless, read the "comparison of the alternative procedures" and "which test?" sections of the text to make sure you have a good feel for this

Make sure you understand the distinction between a priori versus post-hoc tests … and the distinction between the tests within each category