Wednesday, December 21, 2011

Statistics chapter 12


Chapter 12
Multiple Comparisons Among Treatment Means



When you use an ANOVA and find a significant F, all that says is that the various means are not all equal
It does not say which means are different
The purpose of this chapter is to describe a number of different ways of testing which means are different
Before describing the tests, it is necessary to consider two different ways of thinking about error and how they are relevant to doing multiple comparisons
Error Rate per Comparison (PC)
This is simply the Type I error that we have talked about all along. So far, we have been simply setting its value at .05, a 5% chance of making an error
Familywise Error Rate (FW)
Often, after an ANOVA, we want to do a number of comparisons, not just one
The collection of comparisons we do is described as the "family" 
The familywise error rate is the probability that at least one of these comparisons will include a type I error 
Assuming that α′ is the per comparison error rate and c is the number of comparisons, then:
The per comparison error: α = α′
but the familywise error: α = 1 − (1 − α′)^c
Thus, if we do two comparisons, but keep α′ at 0.05, the FW error will really be:

α = 1 − (1 − 0.05)² = 1 − (0.95)² = 1 − 0.9025 = 0.0975
Thus, there is almost a 10% chance of at least one of the comparisons coming out significant when we do two comparisons, even when the nulls are true.
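A minimal sketch of this calculation in Python (the function name is mine, purely illustrative):

    def familywise_error(alpha_pc: float, c: int) -> float:
        """FW = 1 - (1 - alpha')^c for c independent comparisons."""
        return 1 - (1 - alpha_pc) ** c

    print(familywise_error(0.05, 2))   # 0.0975, as above
    print(familywise_error(0.05, 4))   # about 0.185, relevant later in these notes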
The basic problem, then, is that if we are doing many comparisons, we want to somehow control our familywise error so that we don't end up concluding that differences are there when they really are not
The various tests we will talk about differ in terms of how they do this 
They will also be categorized as being either "A priori" or "post hoc"
A priori: A priori tests are comparisons that the experimenter clearly intended to test before collecting any data

Post hoc: Post hoc tests are comparisons the experimenter has decided to test after collecting the data, looking at the means, and noting which means "seem" different.
The probability of making a type I error is smaller for A priori tests because, when doing post hoc tests, you are essentially doing all possible comparisons before deciding which to test in a formal statistical manner
Steve: Significant F issue
An example for context
See page 351 for a very complete description of the morphine tolerance study of Siegel (1975)

Highlights:
  • paw lick latency as a measure of pain resistance
  • tolerance to morphine develops quickly
  • notion of a compensatory mechanism
  • this mechanism is very context dependent

 
         M-S    M-M    S-S    S-M    Mc-M
          3      2     14     29     24
          5     12      6     20     26
          1     13     12     36     40
          8      6      4     21     32
          1     10     19     25     20
          1      7      3     18     33
          4     11      9     26     27
          9     19     21     17     30
Total    32     80     88    192    232
Mean      4     10     11     24     29
Var    9.99  26.32  45.16  40.58  37.95

Source       df     SS         MS        F
Treatment     4     3497.60    874.40    27.33
Within       35     1120.00     32.00
Total        39     4617.60

A Priori Comparisons
As discussed, these tests are only appropriate when the specific means to be compared were chosen before (i.e., a priori) the data were collected and the means examined.

Multiple t-tests
One obvious thing to do is simply conduct t-tests across the groups of interest.
However, when we do so, we use the MSerror in the denominator instead of the individual or pooled variance estimates (and evaluate t using df equal to dferror).
This is because the MSerror is also assumed to measure random variation, but provides a better estimate than the group variances as it is based on a larger n.
Thus, the general t formula becomes:

t = (X̄i − X̄j) / √(MSerror(1/ni + 1/nj)),  evaluated with dferror degrees of freedom
(with equal n, this is just (X̄i − X̄j) / √(2MSerror/n))

Examples
Group M-S versus Group S-S
Group Mc-M versus Group M-M
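A hedged sketch of these two comparisons in Python, plugging in MSerror = 32 and n = 8 from the ANOVA table above (protected_t is my name for the helper, not the text's):

    import math

    ms_error, n = 32.0, 8   # from the morphine ANOVA table

    def protected_t(mean_i, mean_j):
        """t-test using MSerror as the variance estimate; df = dferror = 35."""
        return (mean_i - mean_j) / math.sqrt(ms_error * (2 / n))

    print(protected_t(4, 11))    # M-S vs S-S:  about -2.47
    print(protected_t(29, 10))   # Mc-M vs M-M: about  6.72

Both exceed the two-tailed critical t(35) of about 2.03 in absolute value, so both comparisons are significant.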
Linear Contrasts
While t-tests allow us to compare one mean with another, we can use linear contrasts to compare some mean (or set of means) with some other mean (or set of means)
Must first understand the notion of a linear combination of means:

L = a1X̄1 + a2X̄2 + … + akX̄k = Σ ajX̄j


Note, if all the a's were 1, L would be the sum of the means; if all the a's were equal to 1/k, L would be the mean of the means
To make a linear combination a linear contrast, we simply impose the restriction that Σ aj = 0
So, we select our values of aj in a way that defines the contrast we are interested in
For example, say we had three means and we want to compare the first two (a1 = 1, a2 = −1, a3 = 0):

L = (1)X̄1 + (−1)X̄2 + (0)X̄3 = X̄1 − X̄2


This is simply a t-test
More Generally

L = (½)X̄1 + (½)X̄2 + (−1)X̄3

So, the above contrast compares the average of the first two means with the third mean
You can basically make any contrast you want as long as Σ aj = 0
Of course, the trick then is testing if the contrast is significant:
SS for contrasts:
While I won't work out the proof, the SS for a contrast is a component of SStreat, and the value of SScontrast can be quantified as:

SScontrast = nL² / Σ aj²


where n is the number of subjects within each of the treatment groups 
Contrasts always are assumed to have 1 df
Examples:
Assume three means of 1.5, 2.0 and 3.0, each based on 10 subjects
When we run the overall ANOVA, we find a SStreat of 11.67
Contrast 1 (compare the first two means: a = 1, −1, 0):

L = 1.5 − 2.0 = −0.5;  SScontrast = 10(−0.5)² / 2 = 1.25

Contrast 2 (the average of the first two means versus the third: a = ½, ½, −1):

L = 0.75 + 1.0 − 3.0 = −1.25;  SScontrast = 10(−1.25)² / 1.5 = 10.42

Note that SStreat = SScontrast1 + SScontrast2 (1.25 + 10.42 = 11.67)
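A small Python check of these numbers (ss_contrast is an illustrative helper, assuming NumPy):

    import numpy as np

    def ss_contrast(a, means, n):
        """SScontrast = n * L**2 / sum(a_j**2), for equal group sizes n."""
        L = np.dot(a, means)
        return n * L ** 2 / np.sum(np.square(a))

    means = [1.5, 2.0, 3.0]
    ss1 = ss_contrast([1, -1, 0], means, 10)       # 1.25
    ss2 = ss_contrast([0.5, 0.5, -1], means, 10)   # about 10.42
    print(ss1, ss2, ss1 + ss2)                     # the pair sums to 11.67 = SStreat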
Choosing Coefficients:
Sometimes choosing the values for the coefficients is fairly straightforward, as in the previous two cases
But what about when it gets more complicated? Say you have seven means, and you want to compare the average of the first 2 with the average of the last 5
The trick: think of those sets of means as forming 2 groups, Group A (means 1 & 2) and Group B (the rest). Now, write out each mean, and before all of the Group A means, put the number of Group B means; then before all the Group B means, put the number of Group A means. Then give one of the groups a plus sign, the other a minus:

+5  +5  −2  −2  −2  −2  −2

If you wanted to compare the first three means with the last 4, it would be:

+4  +4  +4  −3  −3  −3  −3
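A sketch of that trick as a helper function (my naming; purely illustrative):

    def group_vs_group(k, size_a):
        """Coefficients contrasting the first size_a means with the rest."""
        size_b = k - size_a
        return [size_b] * size_a + [-size_a] * size_b

    print(group_vs_group(7, 2))   # [5, 5, -2, -2, -2, -2, -2]
    print(group_vs_group(7, 3))   # [4, 4, 4, -3, -3, -3, -3]

Note the coefficients automatically sum to zero, so the result is always a legitimate contrast.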


Know about the unequal n stuff, but don't worry about it (you will only be asked to do equal n)
Testing For Significance
Once you have the SScontrast, you treat it like SStreat when it comes to testing for significance. That is, you calculate an F as:

F = MScontrast / MSerror = SScontrast / MSerror (a contrast has 1 df, so MScontrast = SScontrast)
That F has 1 and dferror degrees of freedom. For our morphine study then, we might do the following contrasts:

Group    M-S     M-M     S-S     S-M     Mc-M      SS       F
Mean     4.00    10.00   11.00   24.00   29.00
Con 1    -3       2      -3       2       2       1750     55**
Con 2     0      -1       0       0       1       1444     45**
Con 3    -1       0       1       0       0        196      6*
Con 4     0       1      -1       0       0          4      0.125

(* p < .05, ** p < .01)

Critical F(1,35) = 4.12 at alpha .05 (about 7.4 at alpha .01)
See the text (p. 359) for a detailed description of how the SSs for these contrasts were calculated
Note: With 4 contrasts, FW error ≈ 0.20 (1 − 0.95⁴ = 0.185); we could reduce this by using a lower per-comparison level of alpha, or by doing fewer comparisons
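A quick verification of the SS and F columns (a sketch assuming NumPy; values match the table up to rounding):

    import numpy as np

    means = np.array([4, 10, 11, 24, 29])   # M-S, M-M, S-S, S-M, Mc-M
    ms_error, n = 32.0, 8
    contrasts = {"Con 1": [-3, 2, -3, 2, 2],
                 "Con 2": [0, -1, 0, 0, 1],
                 "Con 3": [-1, 0, 1, 0, 0],
                 "Con 4": [0, 1, -1, 0, 0]}

    for name, a in contrasts.items():
        a = np.array(a)
        ss = n * (a @ means) ** 2 / np.sum(a ** 2)   # SScontrast (1 df)
        print(f"{name}: SS = {ss:.1f}, F = {ss / ms_error:.2f}")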

Orthogonal Comparisons
Sometimes contrasts are independent, sometimes not
For example, if you find that mean1 is greater than the average of means 2 & 3, that tells us nothing about whether mean4 is bigger than mean5; those contrasts are independent
However, if you find that mean1 is greater than the average of means 2 & 3, then chances are that mean1 is greater than mean2; those two contrasts would not be independent
When members of a set of contrasts are independent of one another, they are termed "orthogonal contrasts" 
The total SSs for a complete set of orthogonal contrasts always equals SStreat
This is a nice property, as it is like the SStreat is being decomposed into a set of independent chunks, each of relevance to the experimenter
Creating a set of orthogonal contrasts
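The notes stop at this heading. For equal n, the standard criterion is that two contrasts are orthogonal when the products of their corresponding coefficients sum to zero (i.e., Σ ajbj = 0). A minimal check, using the two examples above (assuming NumPy):

    import numpy as np

    def orthogonal(a, b):
        """Equal-n contrasts are orthogonal when their coefficient dot product is 0."""
        return np.isclose(np.dot(a, b), 0)

    # mean1 vs the average of means 2 & 3, and mean4 vs mean5: independent
    print(orthogonal([2, -1, -1, 0, 0], [0, 0, 0, 1, -1]))   # True
    # mean1 vs the average of means 2 & 3, and mean1 vs mean2: not independent
    print(orthogonal([2, -1, -1, 0, 0], [1, -1, 0, 0, 0]))   # False (dot product = 3)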
Bonferroni t (Dunn's test)
As mentioned several times, the problem with doing multiple comparisons is that the familywise error of the experiment increases with each comparison you do
One way to control this is to try hard to limit the number of comparisons (perhaps using contrasts instead of a bunch of t-tests)
Another way is to reduce your per comparison level of alpha to compensate for the inflation caused by doing multiple tests
If you want to continue using the tables we have, then you can only reduce alpha in crude steps (e.g., from .05 to .01)
In many cases, that may be overkill (e.g., three comparisons)
Dunn's test allows you to do this same thing in a more precise manner
  • basically, Dunn's test allows you to evaluate each comparison at α′ = α/c, where c is the number of comparisons
  • Doing a Dunn's test
    Step 1:
    The first thing to do is to "compute" a value of t¢ for each comparison you wish to perform
    If that comparison is a planned t-test, then t¢ simple equals your tobtained and has the same degrees of freedom as that t
    If that comparison is a linear contrast, the t¢ equals the square root of the F associated with that contrast, and has a degrees of freedom equal to that of MSerror 
    Step 2:
    Go to the t′ table at the back of the book (p. 687) and find the critical t associated with the number of comparisons you are performing overall, and the relevant degrees of freedom
    Now compare each obtained t′ with that critical t value, which is really the critical t associated with a per comparison alpha equal to the desired familywise error divided by the number of comparisons
    * Don't worry about multi-stage Bonferroni procedures
    Example: 
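    The notes leave the example blank; as an illustrative sketch (assuming SciPy for the t quantile), with c = 4 planned comparisons and a desired familywise alpha of .05, each comparison is evaluated at α′ = .05/4 = .0125:

        from scipy import stats

        alpha_fw, c, df_error = 0.05, 4, 35
        alpha_pc = alpha_fw / c                            # alpha' = alpha / c = .0125
        t_crit = stats.t.ppf(1 - alpha_pc / 2, df_error)   # two-tailed critical t'
        print(f"alpha' = {alpha_pc}, critical t' = {t_crit:.2f}")   # roughly 2.64

    Any t′ larger in absolute value than this critical value is significant with the familywise error held at .05.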
    Post-Hoc Comparisons
    Whenever possible, it is best to specify a priori the comparisons you wish to do, then do linear contrasts in combination with the Bonferroni t-test
    However, there are situations in which the experimenter really is not sure what outcome(s) to expect
    In these situations, the correct thing to do is one of a number of post-hoc comparison procedures depending on the experimental context, and how liberal versus conservative the experimenter wishes to be 
    We will talk about the following procedures:
    • Fisher's Least Significant Difference Procedure
    • The Newman-Keuls Test
    • Tukey's Test
    • The Ryan Procedure
    • The Scheffé Test
    • Dunnett's Test
    You should be trying to understand not only how to do each test, but also why you might choose one procedure over another in a given experimental situation

    Fisher's Least Significant Difference (LSD)
    a.k.a. Fisher's protected t
    In fact, this procedure is no different from the a priori t-test described earlier EXCEPT that it requires that the F test (from the ANOVA) be significant prior to computing the t values
    The requirement of a significant overall F ensures that the family-wise error for the complete null hypothesis (i.e., that all the means are equal) will remain at alpha
    However, it does nothing to control for inflation of the family-wise error when performing the actual comparisons 
    This is OK if you have only three means (see text for a description of why)
    But if you have more than three, then the LSD test is very liberal (i.e., high probability of Type I errors), too liberal for most situations
    The Studentized Range Statistic
    Many of the tests we will discuss are based on the studentized range statistic (q)
    Thus, it is important to understand it first
    The mathematical definition of it is:

    qr = (X̄L − X̄S) / √(MSerror / n)

    where X̄L and X̄S represent the largest and smallest of the treatment means, n is the number of subjects per group, and r is the number of treatment means in the set
    Note that this statistic is very similar to the t statistic .. in fact:

    q = t√2

    q tables are set up according to the number of treatment means; when there are only two means, the q and t tests reach identical conclusions (the critical q is just the critical t times √2)
    Using the Studentized Range (an example with logic)
    When you do this test, you first take your means (we will use the morphine means) and arrange them in ascending order:
    4    10   11    24    29
    then, if you want to compare the difference between some means (say the largest and smallest), you compute a qr and compare it to the value given in the q table (usual logic: if qobt > qcrit, reject H0)
    So, comparing the largest and smallest:

    q5 = (29 − 4) / √(32/8) = 25/2 = 12.5


    From the tables, q5(35 df) = 4.07. Since qobt > qcrit, we reject H0 and conclude the means are significantly different
    Note how large the qcritical is; that is because it controls for the number of means in the set (as Steve will hopefully explain)
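    The same computation as a sketch, assuming a reasonably recent SciPy (its studentized_range distribution stands in for the printed q tables):

        import math
        from scipy.stats import studentized_range   # scipy >= 1.7

        ms_error, df_error, n = 32.0, 35, 8
        q_obt = (29 - 4) / math.sqrt(ms_error / n)          # largest minus smallest mean
        q_crit = studentized_range.ppf(0.95, 5, df_error)   # r = 5 means in the set
        print(q_obt, round(q_crit, 2))                      # 12.5 vs about 4.07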
    Newman-Keuls Test
    Basic goal is to sort the means into subsets of means such that means within a given subset are not different from each other, but are different from means in another subset 
    How to:
  • Sort the means in ascending order such that mean1 is the lowest mean, up to meank where k is the total number of means
  • Calculate the Wr associated with each width, where the width (r) between means i and j equals i − j + 1; Wr = qr(dferror) × √(MSerror/n) is the critical difference for a range of r means
  • Construct a matrix with treatment means on the rows and columns, and the differences between means in the cells of the matrix
  • Using rules (which I will specify momentarily) we move from right to left across the entries and compare each difference with its Wr
  • Based on the pattern of significance observed, we group the means
    Example
    Step 1:

    M-S   M-M   S-S   S-M   Mc-M
    X1    X2    X3    X4    X5
     4    10    11    24    29
    Step 2:


    Using Wr = qr(35) × √(MSerror/n) = 2qr:
    W2 = 5.75    W3 = 6.93    W4 = 7.63    W5 = 8.14

    Steps 3 & 4:

     
     
                M-S   M-M   S-S   S-M   Mc-M     r     Wr
                 4    10    11    24    29
    M-S     4          6     7    20    25       5    8.14
    M-M    10                1    14    19       4    7.63
    S-S    11                     13    18       3    6.93
    S-M    24                            5       2    5.75
    Mc-M   29

    Step 5:
    M-S     M-M    S-S    S-M    Mc-M
    4       10     11     24     29
    ---     ----------    -----------
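    A rough sketch of steps 2 through 4 in Python (assuming SciPy >= 1.7 for studentized_range). Note it simply flags every pairwise difference against its Wr; it omits the Newman-Keuls stopping rule that pairs inside a range already found nonsignificant are not tested:

        import math
        from scipy.stats import studentized_range

        names = ["M-S", "M-M", "S-S", "S-M", "Mc-M"]   # already sorted ascending
        means = [4, 10, 11, 24, 29]
        ms_error, df_error, n = 32.0, 35, 8
        se = math.sqrt(ms_error / n)

        for i in range(len(means)):
            for j in range(len(means) - 1, i, -1):      # right to left across each row
                r = j - i + 1                           # width of the range
                w_r = studentized_range.ppf(0.95, r, df_error) * se
                diff = means[j] - means[i]
                print(f"{names[i]} vs {names[j]}: diff = {diff:2d}, "
                      f"W_{r} = {w_r:.2f} -> {'sig' if diff > w_r else 'ns'}")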
     
    In words then, these results suggest the following.
    First, the rats who received morphine on all occasions are acting the same as those who received saline on all occasions, suggesting that a tolerance to morphine developed very quickly.
    Those rats who received morphine 3 times, but then only saline on the test trial are significantly more sensitive to pain than those who received saline all the time, or morphine all the time. This suggests that a compensatory mechanism was operating, making the rats hypersensitive to pain when not opposed by morphine.
    Finally, those rats who received morphine in their cage three times before receiving it in the testing context seem as non-sensitive to pain as those who received morphine for the first time at test, both groups being significantly less sensitive to pain than the S-S or M-M groups. This suggests the compensatory mechanism is very context specific and does not operate when the context is changed.
    Unequal Sample Sizes
    Once again, don't worry about the details of dealing with unequal n; just know that if you are ever in the position of having unequal n, there are ways of dealing with it.

    Family-Wise Error (The problem with NK)
    This part is confusing, but I'll try my best...
    The problem with the Newman-Keuls is that it doesn't control FW error very well (i.e., it tends to be fairly liberal .. too liberal for the taste of many)
    Situation 1:

    Situation 2:

    The probability of a Type I error is related to the number of possible null hypotheses: FW = 1 − (1 − α)^nulls
    So, as the number of means increases, FW error can increase considerably and is typically around 0.10 for most experiments (four or five means)
    Tukey's Test
    A.K.A. - The honestly-significant difference (HSD) test
    Tukey�s test is simply a more conservative version of the Newman-Keuls (keeps FW error at .05)
    The real difference is that instead of comparing the difference between a pair of means to a q value tied to the range of those means, the q of the largest range is always used (qHSD = qmax)
    So in the morphine rats example, we would compare each difference to the critical difference (W5) of 8.14, producing the following results
     
     
                M-S   M-M   S-S   S-M   Mc-M     Wr
                 4    10    11    24    29
    M-S     4          6     7    20    25     8.14
    M-M    10                1    14    19     8.14
    S-S    11                     13    18     8.14
    S-M    24                            5     8.14
    Mc-M   29

    M-S     M-M     S-S     S-M     Mc-M
    4       10      11      24      29
    -------------------     ------------
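    Equivalently, as a short sketch (same SciPy assumption as above), Tukey's test reduces to a single critical difference:

        import math
        from scipy.stats import studentized_range

        ms_error, df_error, n, k = 32.0, 35, 8, 5
        hsd = studentized_range.ppf(0.95, k, df_error) * math.sqrt(ms_error / n)
        print(round(hsd, 2))   # about 8.14; every pairwise difference is judged against this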
    The Ryan Procedure
    Read through this section and note the similarities to Dunn's test (the Bonferroni t)
    However, given that using the procedure requires either specialized tables or a statistical software package, you will never be required to actually do it
    Thus, get the general idea, but don't worry about details
    The Scheffé test
    The Scheffé test extends the post-hoc analysis possibilities to include linear contrasts as well as comparisons between specific means
    As before, a linear contrast is always described by the equation:

    L = Σ ajX̄j

    With the SScontrast being equal to:

    SScontrast = nL² / Σ aj²
    And the Fobtained then being:

    F = SScontrast / MSerror

    *recall that there is always 1 df associated with a contrast, so MScontrast = SScontrast
    This is all as it was for the a priori version of contrasts. The difference is that instead of comparing the Fobtained to F(1, dferror), the Fcritical is:
    Fcritical = (k − 1) F(k − 1, dferror)

    Doing this will hold FW error constant for all linear contrasts
    However, there is a cost: the Scheffé is the least powerful of all the post-hoc procedures (i.e., very conservative)
    Moral: Don't use it when you only want to compare pairs of means, or when you can justify the comparisons a priori
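    A sketch of the Scheffé critical value for the morphine design (assuming SciPy for the F quantile):

        from scipy import stats

        k, df_error = 5, 35
        f_crit = (k - 1) * stats.f.ppf(0.95, k - 1, df_error)   # (k-1) * F(k-1, dferror)
        print(round(f_crit, 2))   # about 10.6, versus 4.12 for an a priori contrast

    The jump from 4.12 to roughly 10.6 is the price paid for being allowed to test any contrast you like post hoc.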
      
    Dunnett's Test
    This test was designed specifically for comparing a number of treatment groups with a control group
    Note, this situation is somewhat different from the previous post-hoc tests in that it is somewhat a priori (i.e., the "position" of the control condition can vary .. Steve will explain .. hopefully)
    This allows Dunnett's test to be more powerful; FW error can be controlled in less stringent ways
    All that is really involved is using a different t table when looking up the critical t: td
    However, when using this test, the most typical thing to do is to calculate a critical difference .. that is, when the difference between any two means exceeds this value, those means are significantly different
    Doing a Dunnett's

    M-S     M-M     S-S     S-M    Mc-M
    4       10      11      24     29

    Critical Difference = td × √(2MSerror / n)

    We get the td value from the table with k = 5 and dferror = 35, producing a value of 2.56

    Critical Difference = 2.56 × √(2(32)/8) = 2.56 × 2.83 = 7.24

    So, assuming the S-S group is the control group, any mean that is more than 7.24 units from it is considered significantly different
    That is, the S-M and Mc-M groups
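    A minimal sketch of the whole procedure, taking td = 2.56 from the Dunnett table as given:

        import math

        ms_error, n, t_d = 32.0, 8, 2.56   # t_d from the table (k = 5, df = 35)
        crit_diff = t_d * math.sqrt(2 * ms_error / n)
        print(round(crit_diff, 2))   # 7.24

        control = 11   # S-S group mean
        for name, m in [("M-S", 4), ("M-M", 10), ("S-M", 24), ("Mc-M", 29)]:
            print(name, "sig" if abs(m - control) > crit_diff else "ns")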
    Which test when and why?
    When I presented each test, I went through the situations in which they are typically used so I hope you have a decent idea about that
    Nonetheless, read the "comparison of the alternative procedures" and "which test?" sections of the text to make sure you have a good feel for this
    Make sure you understand the distinction between a priori versus post hoc tests, and the distinction between the tests within each category
