Wednesday, December 21, 2011

Statistics chapter 12


Chapter 12
Multiple Comparisons Among Treatment Means



When you use an ANOVA and find a significant F, all that says is that the various means are not all equal
It does not say which means are different
The purpose of this chapter is to describe a number of different ways of testing which means are different
Before describing the tests, it is necessary to consider two different ways of thinking about error and how they are relevant to doing multiple comparisons
Error Rate per Comparison (PC)
This is simply the Type I error that we have talked about all along. So far, we have been simply setting its value at .05, a 5% chance of making an error
Familywise Error Rate (FW)
Often, after an ANOVA, we want to do a number of comparisons, not just one
The collection of comparisons we do is described as the "family" 
The familywise error rate is the probability that at least one of these comparisons will include a type I error 
Assuming that α′ is the per comparison error rate and c is the number of comparisons, then:
The per comparison error: α = α′
but the familywise error: α = 1 − (1 − α′)^c
Thus, if we do two comparisons, but keep α′ at 0.05, the FW error will really be:

α = 1 − (1 − 0.05)² = 1 − (0.95)² = 1 − 0.9025 = 0.0975
Thus, there is almost a 10% chance of at least one of the comparisons coming out significant when we do two comparisons, even when the nulls are true.
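A minimal sketch of this calculation in Python (the function name is mine, purely illustrative):

    def familywise_error(alpha_pc: float, c: int) -> float:
        """FW = 1 - (1 - alpha')^c for c independent comparisons."""
        return 1 - (1 - alpha_pc) ** c

    print(familywise_error(0.05, 2))   # 0.0975, as above
    print(familywise_error(0.05, 4))   # about 0.185, relevant later in these notes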
The basic problem, then, is that if we are doing many comparisons, we want to somehow control our familywise error so that we don't end up concluding that differences are there when they really are not
The various tests we will talk about differ in terms of how they do this 
They will also be categorized as being either "A priori" or "post hoc"
A priori: A priori tests are comparisons that the experimenter clearly intended to test before collecting any data

Post hoc: Post hoc tests are comparisons the experimenter has decided to test after collecting the data, looking at the means, and noting which means "seem" different.
The probability of making a type I error is smaller for A priori tests because, when doing post hoc tests, you are essentially doing all possible comparisons before deciding which to test in a formal statistical manner
Steve: Significant F issue
An example for context
See page 351 for a very complete description of the morphine tolerance study of Siegel (1975)

Highlights:
  • paw lick latency as a measure of pain resistance
  • tolerance to morphine develops quickly
  • notion of a compensatory mechanism
  • this mechanism is very context dependent

 
         M-S    M-M    S-S    S-M    Mc-M
          3      2     14     29     24
          5     12      6     20     26
          1     13     12     36     40
          8      6      4     21     32
          1     10     19     25     20
          1      7      3     18     33
          4     11      9     26     27
          9     19     21     17     30
Total    32     80     88    192    232
Mean      4     10     11     24     29
Var    9.99  26.32  45.16  40.58  37.95

Source       df     SS         MS        F
Treatment     4     3497.60    874.40    27.33
Within       35     1120.00     32.00
Total        39     4617.60

A Priori Comparisons
As discussed, these tests are only appropriate when the specific means to be compared were chosen before (i.e., a priori) the data were collected and the means examined.

Multiple t-tests
One obvious thing to do is simply conduct t-tests across the groups of interest.
However, when we do so, we use the MSerror in the denominator instead of the individual or pooled variance estimates (and evaluate t using df equal to dferror).
This is because the MSerror is also assumed to measure random variation, but provides a better estimate than the group variances as it is based on a larger n.
Thus, the general t formula becomes:

t = (X̄i − X̄j) / √(MSerror(1/ni + 1/nj)),  evaluated with dferror degrees of freedom
(with equal n, this is just (X̄i − X̄j) / √(2MSerror/n))

Examples
Group M-S versus Group S-S
Group Mc-M versus Group M-M
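A hedged sketch of these two comparisons in Python, plugging in MSerror = 32 and n = 8 from the ANOVA table above (protected_t is my name for the helper, not the text's):

    import math

    ms_error, n = 32.0, 8   # from the morphine ANOVA table

    def protected_t(mean_i, mean_j):
        """t-test using MSerror as the variance estimate; df = dferror = 35."""
        return (mean_i - mean_j) / math.sqrt(ms_error * (2 / n))

    print(protected_t(4, 11))    # M-S vs S-S:  about -2.47
    print(protected_t(29, 10))   # Mc-M vs M-M: about  6.72

Both exceed the two-tailed critical t(35) of about 2.03 in absolute value, so both comparisons are significant.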
Linear Contrasts
While t-tests allow us to compare one mean with another, we can use linear contrasts to compare some mean (or set of means) with some other mean (or set of means)
Must first understand the notion of a linear combination of means:

L = a1X̄1 + a2X̄2 + … + akX̄k = Σ ajX̄j


Note, if all the a's were 1, L would be the sum of the means; if all the a's were equal to 1/k, L would be the mean of the means
To make a linear combination a linear contrast, we simply impose the restriction that Σ aj = 0
So, we select our values of aj in a way that defines the contrast we are interested in
For example, say we had three means and we want to compare the first two (a1 = 1, a2 = −1, a3 = 0):

L = (1)X̄1 + (−1)X̄2 + (0)X̄3 = X̄1 − X̄2


This is simply a t-test
More Generally

L = (½)X̄1 + (½)X̄2 + (−1)X̄3

So, the above contrast compares the average of the first two means with the third mean
You can basically make any contrast you want as long as Σ aj = 0
Of course, the trick then is testing if the contrast is significant:
SS for contrasts:
While I won't work out the proof, the SS for a contrast is a component of SStreat, and the value of SScontrast can be quantified as:

SScontrast = nL² / Σ aj²


where n is the number of subjects within each of the treatment groups 
Contrasts always are assumed to have 1 df
Examples:
Assume three means of 1.5, 2.0 and 3.0, each based on 10 subjects
When we run the overall ANOVA, we find a SStreat of 11.67
Contrast 1 (compare the first two means: a = 1, −1, 0):

L = 1.5 − 2.0 = −0.5;  SScontrast = 10(−0.5)² / 2 = 1.25

Contrast 2 (the average of the first two means versus the third: a = ½, ½, −1):

L = 0.75 + 1.0 − 3.0 = −1.25;  SScontrast = 10(−1.25)² / 1.5 = 10.42

Note that SStreat = SScontrast1 + SScontrast2 (1.25 + 10.42 = 11.67)
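A small Python check of these numbers (ss_contrast is an illustrative helper, assuming NumPy):

    import numpy as np

    def ss_contrast(a, means, n):
        """SScontrast = n * L**2 / sum(a_j**2), for equal group sizes n."""
        L = np.dot(a, means)
        return n * L ** 2 / np.sum(np.square(a))

    means = [1.5, 2.0, 3.0]
    ss1 = ss_contrast([1, -1, 0], means, 10)       # 1.25
    ss2 = ss_contrast([0.5, 0.5, -1], means, 10)   # about 10.42
    print(ss1, ss2, ss1 + ss2)                     # the pair sums to 11.67 = SStreat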
Choosing Coefficients:
Sometimes choosing the values for the coefficients is fairly straightforward, as in the previous two cases
But what about when it gets more complicated? Say you have seven means, and you want to compare the average of the first 2 with the average of the last 5
The trick: think of those sets of means as forming 2 groups, Group A (means 1 & 2) and Group B (the rest). Now, write out each mean, and before all of the Group A means, put the number of Group B means; then before all the Group B means, put the number of Group A means. Then give one of the groups a plus sign, the other a minus:

+5  +5  −2  −2  −2  −2  −2

If you wanted to compare the first three means with the last 4, it would be:

+4  +4  +4  −3  −3  −3  −3
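A sketch of that trick as a helper function (my naming; purely illustrative):

    def group_vs_group(k, size_a):
        """Coefficients contrasting the first size_a means with the rest."""
        size_b = k - size_a
        return [size_b] * size_a + [-size_a] * size_b

    print(group_vs_group(7, 2))   # [5, 5, -2, -2, -2, -2, -2]
    print(group_vs_group(7, 3))   # [4, 4, 4, -3, -3, -3, -3]

Note the coefficients automatically sum to zero, so the result is always a legitimate contrast.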


Know about the unequal n stuff, but don't worry about it (you will only be asked to do equal n)
Testing For Significance
Once you have the SScontrast, you treat it like SStreat when it comes to testing for significance. That is, you calculate an F as:

F = MScontrast / MSerror = SScontrast / MSerror (a contrast has 1 df, so MScontrast = SScontrast)
That F has 1 and dferror degrees of freedom. For our morphine study then, we might do the following contrasts:

Group    M-S     M-M     S-S     S-M     Mc-M      SS       F
Mean     4.00    10.00   11.00   24.00   29.00
Con 1    -3       2      -3       2       2       1750     55**
Con 2     0      -1       0       0       1       1444     45**
Con 3    -1       0       1       0       0        196      6*
Con 4     0       1      -1       0       0          4      0.125

(* p < .05, ** p < .01)

Critical F(1,35) = 4.12 at alpha .05 (about 7.4 at alpha .01)
See the text (p. 359) for a detailed description of how the SSs for these contrasts were calculated
Note: With 4 contrasts, FW error ≈ 0.20 (1 − 0.95⁴ = 0.185); we could reduce this by using a lower per-comparison level of alpha, or by doing fewer comparisons
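A quick verification of the SS and F columns (a sketch assuming NumPy; values match the table up to rounding):

    import numpy as np

    means = np.array([4, 10, 11, 24, 29])   # M-S, M-M, S-S, S-M, Mc-M
    ms_error, n = 32.0, 8
    contrasts = {"Con 1": [-3, 2, -3, 2, 2],
                 "Con 2": [0, -1, 0, 0, 1],
                 "Con 3": [-1, 0, 1, 0, 0],
                 "Con 4": [0, 1, -1, 0, 0]}

    for name, a in contrasts.items():
        a = np.array(a)
        ss = n * (a @ means) ** 2 / np.sum(a ** 2)   # SScontrast (1 df)
        print(f"{name}: SS = {ss:.1f}, F = {ss / ms_error:.2f}")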

Orthogonal Comparisons
Sometimes contrasts are independent, sometimes not
For example, if you find that mean1 is greater than the average of means 2 & 3, that tells us nothing about whether mean4 is bigger than mean5; those contrasts are independent
However, if you find that mean1 is greater than the average of means 2 & 3, then chances are that mean1 is greater than mean2; those two contrasts would not be independent
When members of a set of contrasts are independent of one another, they are termed "orthogonal contrasts" 
The total SSs for a complete set of orthogonal contrasts always equals SStreat
This is a nice property, as it is like the SStreat is being decomposed into a set of independent chunks, each of relevance to the experimenter
Creating a set of orthogonal contrasts
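The notes stop at this heading. For equal n, the standard criterion is that two contrasts are orthogonal when the products of their corresponding coefficients sum to zero (i.e., Σ ajbj = 0). A minimal check, using the two examples above (assuming NumPy):

    import numpy as np

    def orthogonal(a, b):
        """Equal-n contrasts are orthogonal when their coefficient dot product is 0."""
        return np.isclose(np.dot(a, b), 0)

    # mean1 vs the average of means 2 & 3, and mean4 vs mean5: independent
    print(orthogonal([2, -1, -1, 0, 0], [0, 0, 0, 1, -1]))   # True
    # mean1 vs the average of means 2 & 3, and mean1 vs mean2: not independent
    print(orthogonal([2, -1, -1, 0, 0], [1, -1, 0, 0, 0]))   # False (dot product = 3)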
Bonferroni t (Dunn's test)
As mentioned several times, the problem with doing multiple comparisons is that the familywise error of the experiment increases with each comparison you do
One way to control this is to try hard to limit the number of comparisons (perhaps using contrasts instead of a bunch of t-tests)
Another way is to reduce your per comparison level of alpha to compensate for the inflation caused by doing multiple tests
If you want to continue using the tables we have, then you can only reduce alpha in crude steps (e.g., from .05 to .01)
In many cases, that may be overkill (e.g., three comparisons)
Dunn's test allows you to do this same thing in a more precise manner
  • basically, Dunn's test allows you to evaluate each comparison at α′ = α/c, where c is the number of comparisons
  • Doing a Dunn's test
    Step 1:
    The first thing to do is to "compute" a value of t¢ for each comparison you wish to perform
    If that comparison is a planned t-test, then t¢ simple equals your tobtained and has the same degrees of freedom as that t
    If that comparison is a linear contrast, the t¢ equals the square root of the F associated with that contrast, and has a degrees of freedom equal to that of MSerror 
    Step 2:
    Go to the t′ table at the back of the book (p. 687) and find the critical t associated with the number of comparisons you are performing overall, and the relevant degrees of freedom
    Now compare each obtained t′ with that critical t value, which is really the critical t associated with a per comparison alpha equal to the desired familywise error divided by the number of comparisons
    * Don't worry about multi-stage Bonferroni procedures
    Example: 
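    The notes leave the example blank; as an illustrative sketch (assuming SciPy for the t quantile), with c = 4 planned comparisons and a desired familywise alpha of .05, each comparison is evaluated at α′ = .05/4 = .0125:

        from scipy import stats

        alpha_fw, c, df_error = 0.05, 4, 35
        alpha_pc = alpha_fw / c                            # alpha' = alpha / c = .0125
        t_crit = stats.t.ppf(1 - alpha_pc / 2, df_error)   # two-tailed critical t'
        print(f"alpha' = {alpha_pc}, critical t' = {t_crit:.2f}")   # roughly 2.64

    Any t′ larger in absolute value than this critical value is significant with the familywise error held at .05.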
    Post-Hoc Comparisons
    Whenever possible, it is best to specify a priori the comparisons you wish to do, then do linear contrasts in combination with the Bonferroni t-test
    However, there are situations in which the experimenter really is not sure what outcome(s) to expect
    In these situations, the correct thing to do is one of a number of post-hoc comparison procedures depending on the experimental context, and how liberal versus conservative the experimenter wishes to be 
    We will talk about the following procedures:
    • Fisher's Least Significant Difference Procedure
    • The Newman-Keuls Test
    • Tukey's Test
    • The Ryan Procedure
    • The Scheffé Test
    • Dunnett's Test
    You should be trying to understand not only how to do each test, but also why you might choose one procedure over another in a given experimental situation

    Fisher's Least Significant Difference (LSD)
    a.k.a. Fisher's protected t
    In fact, this procedure is no different from the a priori t-test described earlier EXCEPT that it requires that the F test (from the ANOVA) be significant prior to computing the t values
    The requirement of a significant overall F ensures that the family-wise error for the complete null hypothesis (i.e., that all the means are equal) will remain at alpha
    However, it does nothing to control for inflation of the family-wise error when performing the actual comparisons 
    This is OK if you have only three means (see text for a description of why)
    But if you have more than three, then the LSD test is very liberal (i.e., high probability of Type I errors), too liberal for most situations
    The Studentized Range Statistic
    Many of the tests we will discuss are based on the studentized range statistic (q)
    Thus, it is important to understand it first
    The mathematical definition of it is:

    qr = (X̄L − X̄S) / √(MSerror / n)

    where X̄L and X̄S represent the largest and smallest of the treatment means, n is the number of subjects per group, and r is the number of treatment means in the set
    Note that this statistic is very similar to the t statistic .. in fact:

    q = t√2

    q tables are set up according to the number of treatment means; when there are only two means, the q and t tests reach identical conclusions (the critical q is just the critical t times √2)
    Using the Studentized Range (an example with logic)
    When you do this test, you first take your means (we will use the morphine means) and arrange them in ascending order:
    4    10   11    24    29
    then, if you want to compare the difference between some means (say the largest and smallest), you compute a qr and compare it to the value given in the q table (usual logic: if qobt > qcrit, reject H0)
    So, comparing the largest and smallest:

    q5 = (29 − 4) / √(32/8) = 25/2 = 12.5


    From the tables, q5(35 df) = 4.07. Since qobt > qcrit, we reject H0 and conclude the means are significantly different
    Note how large the qcritical is; that is because it controls for the number of means in the set (as Steve will hopefully explain)
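    The same computation as a sketch, assuming a reasonably recent SciPy (its studentized_range distribution stands in for the printed q tables):

        import math
        from scipy.stats import studentized_range   # scipy >= 1.7

        ms_error, df_error, n = 32.0, 35, 8
        q_obt = (29 - 4) / math.sqrt(ms_error / n)          # largest minus smallest mean
        q_crit = studentized_range.ppf(0.95, 5, df_error)   # r = 5 means in the set
        print(q_obt, round(q_crit, 2))                      # 12.5 vs about 4.07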
    Newman-Keuls Test
    Basic goal is to sort the means into subsets of means such that means within a given subset are not different from each other, but are different from means in another subset 
    How to:
  • Sort the means in ascending order such that mean1 is the lowest mean, up to meank where k is the total number of means
  • Calculate the Wr associated with each width, where the width (r) between means i and j equals i − j + 1; Wr = qr(dferror) × √(MSerror/n) is the critical difference for a range of r means
  • Construct a matrix with treatment means on the rows and columns, and the differences between means in the cells of the matrix
  • Using rules (which I will specify momentarily) we move from right to left across the entries and compare each difference with its Wr
  • Based on the pattern of significance observed, we group the means
    Example
    Step 1:

    M-S   M-M   S-S   S-M   Mc-M
    X1    X2    X3    X4    X5
     4    10    11    24    29
    Step 2:


    Using Wr = qr(35) × √(MSerror/n) = 2qr:
    W2 = 5.75    W3 = 6.93    W4 = 7.63    W5 = 8.14

    Steps 3 & 4:

     
     
                M-S   M-M   S-S   S-M   Mc-M     r     Wr
                 4    10    11    24    29
    M-S     4          6     7    20    25       5    8.14
    M-M    10                1    14    19       4    7.63
    S-S    11                     13    18       3    6.93
    S-M    24                            5       2    5.75
    Mc-M   29

    Step 5:
    M-S     M-M    S-S    S-M    Mc-M
    4       10     11     24     29
    ---     ----------    -----------
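    A rough sketch of steps 2 through 4 in Python (assuming SciPy >= 1.7 for studentized_range). Note it simply flags every pairwise difference against its Wr; it omits the Newman-Keuls stopping rule that pairs inside a range already found nonsignificant are not tested:

        import math
        from scipy.stats import studentized_range

        names = ["M-S", "M-M", "S-S", "S-M", "Mc-M"]   # already sorted ascending
        means = [4, 10, 11, 24, 29]
        ms_error, df_error, n = 32.0, 35, 8
        se = math.sqrt(ms_error / n)

        for i in range(len(means)):
            for j in range(len(means) - 1, i, -1):      # right to left across each row
                r = j - i + 1                           # width of the range
                w_r = studentized_range.ppf(0.95, r, df_error) * se
                diff = means[j] - means[i]
                print(f"{names[i]} vs {names[j]}: diff = {diff:2d}, "
                      f"W_{r} = {w_r:.2f} -> {'sig' if diff > w_r else 'ns'}")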
     
    In words then, these results suggest the following.
    First, the rats who received morphine on all occasions are acting the same as those who received saline on all occasions, suggesting that a tolerance to morphine developed very quickly.
    Those rats who received morphine 3 times, but then only saline on the test trial are significantly more sensitive to pain than those who received saline all the time, or morphine all the time. This suggests that a compensatory mechanism was operating, making the rats hypersensitive to pain when not opposed by morphine.
    Finally, those rats who received morphine in their cage three times before receiving it in the testing context seem as non-sensitive to pain as those who received morphine for the first time at test, both groups being significantly less sensitive to pain than the S-S or M-M groups. This suggests the compensatory mechanism is very context specific and does not operate when the context is changed.
    Unequal Sample Sizes
    Once again, don't worry about the details of dealing with unequal n; just know that if you are ever in the position of having unequal n, there are ways of dealing with it.

    Family-Wise Error (The problem with NK)
    This part is confusing, but I'll try my best...
    The problem with the Newman-Keuls is that it doesn't control FW error very well (i.e., it tends to be fairly liberal .. too liberal for the taste of many)
    Situation 1:

    Situation 2:

    The probability of a Type I error is related to the number of possible null hypotheses: FW = 1 − (1 − α)^nulls
    So, as the number of means increases, FW error can increase considerably and is typically around 0.10 for most experiments (four or five means)
    Tukey's Test
    A.K.A. - The honestly-significant difference (HSD) test
    Tukey�s test is simply a more conservative version of the Newman-Keuls (keeps FW error at .05)
    The real difference is that instead of comparing the difference between a pair of means to a q value tied to the range of those means, the q of the largest range is always used (qHSD = qmax)
    So in the morphine rats example, we would compare each difference to the critical difference (W5) of 8.14, producing the following results
     
     
                M-S   M-M   S-S   S-M   Mc-M     Wr
                 4    10    11    24    29
    M-S     4          6     7    20    25     8.14
    M-M    10                1    14    19     8.14
    S-S    11                     13    18     8.14
    S-M    24                            5     8.14
    Mc-M   29

    M-S     M-M     S-S     S-M     Mc-M
    4       10      11      24      29
    -------------------     ------------
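    Equivalently, as a short sketch (same SciPy assumption as above), Tukey's test reduces to a single critical difference:

        import math
        from scipy.stats import studentized_range

        ms_error, df_error, n, k = 32.0, 35, 8, 5
        hsd = studentized_range.ppf(0.95, k, df_error) * math.sqrt(ms_error / n)
        print(round(hsd, 2))   # about 8.14; every pairwise difference is judged against this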
    The Ryan Procedure
    Read through this section and note the similarities to Dunn's test (the Bonferroni t)
    However, given that using the procedure requires either specialized tables or a statistical software package, you will never be required to actually do it
    Thus, get the general idea, but don't worry about details
    The Scheffé test
    The Scheffé test extends the post-hoc analysis possibilities to include linear contrasts as well as comparisons between specific means
    As before, a linear contrast is always described by the equation:

    L = Σ ajX̄j

    With the SScontrast being equal to:

    SScontrast = nL² / Σ aj²
    And the Fobtained then being:

    F = SScontrast / MSerror

    *recall that there is always 1 df associated with a contrast, so MScontrast = SScontrast
    This is all as it was for the a priori version of contrasts. The difference is that instead of comparing the Fobtained to F(1, dferror), the Fcritical is:
    Fcritical = (k − 1) F(k − 1, dferror)

    Doing this will hold FW error constant for all linear contrasts
    However, there is a cost: the Scheffé is the least powerful of all the post-hoc procedures (i.e., very conservative)
    Moral: Don't use it when you only want to compare pairs of means, or when you can justify the comparisons a priori
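    A sketch of the Scheffé critical value for the morphine design (assuming SciPy for the F quantile):

        from scipy import stats

        k, df_error = 5, 35
        f_crit = (k - 1) * stats.f.ppf(0.95, k - 1, df_error)   # (k-1) * F(k-1, dferror)
        print(round(f_crit, 2))   # about 10.6, versus 4.12 for an a priori contrast

    The jump from 4.12 to roughly 10.6 is the price paid for being allowed to test any contrast you like post hoc.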
      
    Dunnett's Test
    This test was designed specifically for comparing a number of treatment groups with a control group
    Note, this situation is somewhat different from the previous post-hoc tests in that it is somewhat a priori (i.e., the "position" of the control condition can vary .. Steve will explain .. hopefully)
    This allows Dunnett's test to be more powerful; FW error can be controlled in less stringent ways
    All that is really involved is using a different t table when looking up the critical t: td
    However, when using this test, the most typical thing to do is to calculate a critical difference .. that is, when the difference between any two means exceeds this value, those means are significantly different
    Doing a Dunnett's

    M-S     M-M     S-S     S-M    Mc-M
    4       10      11      24     29

    Critical Difference = td × √(2MSerror / n)

    We get the td value from the table with k = 5 and dferror = 35, producing a value of 2.56

    Critical Difference = 2.56 × √(2(32)/8) = 2.56 × 2.83 = 7.24

    So, assuming the S-S group is the control group, any mean that is more than 7.24 units from it is considered significantly different
    That is, the S-M and Mc-M groups
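    A minimal sketch of the whole procedure, taking td = 2.56 from the Dunnett table as given:

        import math

        ms_error, n, t_d = 32.0, 8, 2.56   # t_d from the table (k = 5, df = 35)
        crit_diff = t_d * math.sqrt(2 * ms_error / n)
        print(round(crit_diff, 2))   # 7.24

        control = 11   # S-S group mean
        for name, m in [("M-S", 4), ("M-M", 10), ("S-M", 24), ("Mc-M", 29)]:
            print(name, "sig" if abs(m - control) > crit_diff else "ns")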
    Which test when and why?
    When I presented each test, I went through the situations in which they are typically used so I hope you have a decent idea about that
    Nonetheless, read the "comparison of the alternative procedures" and "which test?" sections of the text to make sure you have a good feel for this
    Make sure you understand the distinction between a priori versus post hoc tests, and the distinction between the tests within each category
