6 ONE-WAY BETWEEN-SUBJECTS ANALYSIS OF VARIANCE

6.1 Research Situations Where One-Way Between-Subjects Analysis of Variance (ANOVA) Is Used

A one-way between-subjects (between-S) analysis of variance (ANOVA) is used in research situations where the researcher wants to compare means on a quantitative Y outcome variable across two or more groups. Group membership is identified by each participant’s score on a categorical X predictor variable. ANOVA is a generalization of the t test; a t test provides information about the distance between the means on a quantitative outcome variable for just two groups, whereas a one-way ANOVA compares means on a quantitative variable across any number of groups. The categorical predictor variable in an ANOVA may represent either naturally occurring groups or groups formed by a researcher and then exposed to different interventions. When the means of naturally occurring groups are compared (e.g., a one-way ANOVA to compare mean scores on a self-report measure of political conservatism across groups based on religious affiliation), the design is nonexperimental. When the groups are formed by the researcher and the researcher administers a different type or amount of treatment to each group while controlling extraneous variables, the design is experimental.

The term between-S (like the term independent samples) tells us that each participant is a member of one and only one group and that the members of samples are not matched or paired. When the data for a study consist of repeated measures or paired or matched samples, a repeated measures ANOVA is required (see Chapter 22 for an introduction to the analysis of repeated measures). If there is more than one categorical variable or factor included in the study, factorial ANOVA is used (see Chapter 13). When there is just a single factor, textbooks often name this single factor A, and if there are additional factors, these are usually designated factors B, C, D, and so forth. If scores on the dependent Y variable are in the form of rank or ordinal data, or if the data seriously violate assumptions required for ANOVA, a nonparametric alternative to ANOVA may be preferred.

In ANOVA, the categorical predictor variable is called a factor; the groups are called the levels of this factor. In the hypothetical research example introduced in Section 6.2, the factor is called “Types of Stress,” and the levels of this factor are as follows: 1, no stress; 2, cognitive stress from a mental arithmetic task; 3, stressful social role play; and 4, mock job interview.

Comparisons among several group means could be made by calculating t tests for each pairwise comparison among the means of these four treatment groups. However, as described in Chapter 3, doing a large number of significance tests leads to an inflated risk for Type I error. If a study includes k groups, there are k(k – 1)/2 pairs of means; thus, for a set of four groups, the researcher would need to do (4 × 3)/2 = 6 t tests to make all possible pairwise comparisons. If α = .05 is used as the criterion for significance for each test, and the researcher conducts six significance tests, the probability that this set of six decisions contains at least one instance of Type I error is greater than .05. Most researchers assume that it is permissible to do a few significance tests (perhaps three or four) without taking special precautions to limit the risk of Type I error. However, if more than a few comparisons are examined, researchers usually want to use one of several methods that offer some protection against inflated risk of Type I error.
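The arithmetic in this paragraph can be sketched in a few lines of Python. The function names are hypothetical, and the familywise formula 1 − (1 − α)^c assumes that the c tests are independent. That assumption is not made in the text and does not hold exactly for pairwise t tests that share groups, so the result is only a rough indication of how quickly the risk grows.

```python
def n_pairwise_comparisons(k: int) -> int:
    """Number of distinct pairs among k group means: k(k - 1)/2."""
    return k * (k - 1) // 2


def familywise_error_if_independent(alpha: float, n_tests: int) -> float:
    """P(at least one Type I error) across n_tests.

    ASSUMPTION: the tests are independent, which is only an
    approximation for pairwise t tests that reuse the same groups.
    """
    return 1.0 - (1.0 - alpha) ** n_tests


k = 4
n_tests = n_pairwise_comparisons(k)
fw_risk = familywise_error_if_independent(0.05, n_tests)
print(f"{n_tests} pairwise tests, familywise risk about {fw_risk:.3f}")
# -> 6 pairwise tests, familywise risk about 0.265
```

Under this simplifying assumption, six tests at α = .05 carry a risk of roughly .26 of at least one Type I error, well above the nominal .05.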

One way to limit the risk of Type I error is to perform a single omnibus test that examines all the comparisons in the study as a set. The F test in a one-way ANOVA provides a single omnibus test of the hypothesis that the means of all k populations are equal, in place of many t tests for all possible pairs of groups. As a follow-up to a significant overall F test, researchers often still want to examine selected pairwise comparisons of group means to obtain more information about the pattern of differences among groups. This chapter reviews the F test, which is used to assess differences for a set of more than two group means, and some of the more popular procedures for follow-up comparisons among group means.

The overall null hypothesis for one-way ANOVA is that the means of the k populations that correspond to the groups in the study are all equal:

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k.$$

When each group has been exposed to different types or dosages of a treatment, as in a typical experiment, this null hypothesis corresponds to an assumption that the treatment has no effect on the outcome variable. The alternative hypothesis in this situation is not that all population means are unequal, but that there is at least one inequality for one pair of population means in the set or one significant contrast involving group means.

This chapter reviews the computation and interpretation of the F ratio that is used in between-S one-way ANOVA. This chapter also introduces an extremely important concept: the partition of each score into two components, namely, a first component that is predictable from the independent variable that corresponds to group membership and a second component that is due to the collective effects of all other uncontrolled or extraneous variables. Similarly, the total sum of squares can also be partitioned into a sum of squares (between groups) that reflects variability associated with treatment and a sum of squares (within groups) that is due to the influence of extraneous uncontrolled variables. From these sums of squares, we can estimate the proportion of variance in scores that is explained by, accounted for by, or predictable from group membership (or from the different treatments administered to groups). The concept of explained or predicted variance is crucial for understanding ANOVA; it is also important for the comprehension of multivariate analyses described in later chapters.

The best question ever asked by a student in any of my statistics classes was a deceptively simple “Why is there variance?” In the hypothetical research situation described in the following section, the outcome variable is a self-report measure of anxiety, and the research question is, Why is there variance in anxiety? Why do some persons report more anxiety than other persons? To what extent are the differences in amount of anxiety systematically associated with the independent variable (type of stress), and to what extent are differences in the amount of self-reported anxiety due to other factors (such as trait levels of anxiety, physiological arousal, drug use, gender, other anxiety-arousing events that each participant has experienced on the day of the study, etc.)? When we do an ANOVA, we identify the part of each individual score that is associated with group membership (and, therefore, with the treatment or intervention, if the study is an experiment) and the part of each individual score that is not associated with group membership and that is, therefore, attributable to a variety of other variables that are not controlled in the study, or “error.” In most research situations, researchers hope to show that a large part of the scores is associated with, or predictable from, group membership or treatment.

6.2 Hypothetical Research Example

As an illustration of the application of one-way ANOVA, consider the following imaginary study. Suppose that an experiment is done to compare the effects of four situations: Group 1 is tested in a “no-stress,” baseline situation; Group 2 does a mental arithmetic task; Group 3 does a stressful social role play; and Group 4 participants do a mock job interview. For this study, the X variable is a categorical variable with codes 1, 2, 3, and 4 that represent which of these four types of stress each participant received. This categorical X predictor variable is called a factor; in this case, the factor is called “Types of Stress”; the four levels of this factor correspond to no stress, mental arithmetic, stressful role play, and mock job interview. At the end of each session, the participants self-report their anxiety on a scale that ranges from 0 = no anxiety to 20 = extremely high anxiety. Scores on anxiety are, therefore, scores on a quantitative Y outcome variable. The researcher wants to know whether, overall, anxiety levels differed across these four situations and, if so, which situations elicited the highest anxiety and which treatment conditions differed significantly from baseline and from other types of stress. A convenience sample of 28 participants was obtained; each participant was randomly assigned to one of the four levels of stress. This results in k = 4 groups with n = 7 participants in each group, for a total of N = 28 participants in the entire study. The SPSS Data View worksheet that contains data for this imaginary study appears in Figure 6.1.

6.3 Assumptions About Scores on the Dependent Variable for One-Way Between-S ANOVA

The assumptions for one-way ANOVA are similar to those described for the independent samples t test (see Section 3 of Chapter 5). Scores on the dependent variable should be quantitative and, at least approximately, at the interval/ratio level of measurement. The scores should be approximately normally distributed in the entire sample and within each group, with no extreme outliers. The variances of scores should be approximately equal across groups. Observations should be independent of each other, both within groups and between groups. Preliminary screening involves the same procedures as for the t test: Histograms of scores should be examined to assess normality of distribution shape, a box and whiskers plot of the scores within each group could be used to identify outliers, and the Levene test (or some other test of homogeneity of variance) could be used to assess whether the homogeneity of variance assumption is violated. If major departures from the normal distribution shape and/or substantial differences among variances are detected, it may be possible to remedy these problems by applying data transformations, such as the base 10 logarithm, or by removing or modifying scores that are outliers, as described in Chapter 4.

Figure 6.1 Data View Worksheet for Stress/Anxiety Study in stress_anxiety.sav

6.4 Issues in Planning a Study

When an experiment is designed to compare treatment groups, the researcher needs to decide how many groups to include, how many participants to include in each group, how to assign participants to groups, and what types or dosages of treatments to administer to each group. In an experiment, the levels of the factor may correspond to different dosage levels of the same treatment variable (such as 0 mg caffeine, 100 mg caffeine, 200 mg caffeine) or to qualitatively different types of interventions (as in the stress study, where the groups received, respectively, no stress, mental arithmetic, role play, or mock job interview stress interventions). In addition, it is necessary to think about issues of experimental control, as described in research methods textbooks; for example, in experimental designs, researchers need to avoid confounds of other variables with the treatment variable. Research methods textbooks (e.g., Cozby, 2004) provide a more detailed discussion of design issues; a few guidelines are listed here.

- If the treatment variable has a curvilinear relation to the outcome variable, it is necessary to have a sufficient number of groups to describe this relation accurately—usually, at least 3 groups. On the other hand, it may not be practical or affordable to have a very large number of groups; if an absolute minimum n of 10 (or, better, 20) participants per group are included, then a study that included 15 groups would require a minimum of 150 (or 300) participants.

- In studies of interventions that may create expectancy effects, it is necessary to include one or several kinds of placebo and no treatment/control groups for comparison. For instance, studies of the effect of biofeedback on heart rate (HR) sometimes include a group that gets real biofeedback (a tone is turned on when HR increases), a group that gets noncontingent feedback (a tone is turned on and off at random in a way that is unrelated to HR), and a group that receives instructions for relaxation and sits quietly in the lab without any feedback (Burish, 1981).

- Random assignment of participants to conditions is desirable to try to ensure equivalence of the groups prior to treatment. For example, in a study where HR is the dependent variable, it would be desirable to have participants whose HRs were equal across all groups prior to the administration of any treatments. However, it cannot be assumed that random assignment will always succeed in creating equivalent groups. In studies where equivalence among groups prior to treatment is important, the researcher should collect data on behavior prior to treatment to verify whether the groups are equivalent.1

The same factors that affect the size of the t ratio (see Chapter 5) also affect the size of F ratios: the distances among group means, the amount of variability of scores within each group, and the number, n, of participants per group. Other things being equal, an F ratio tends to be larger when there are large between-group differences among dosage levels (or participant characteristics). For example, a study that used three widely spaced noise levels in decibels (dB) as the treatment variable, such as 35, 65, and 95 dB, would tend to produce larger differences in group means on arousal and a larger F ratio than a study that used three noise levels that are closer together, such as 60, 65, and 70 dB. Similar to the t test, the F ratio involves a comparison of between-group and within-group variability; the selection of homogeneous participants, standardization of testing conditions, and control over extraneous variables will tend to reduce the magnitude of within-group variability of scores, which in turn tends to produce a larger F (or t) ratio.

Finally, the F test involves degrees of freedom (df) terms based on number of participants (and also number of groups); the larger the number of participants, other things being equal, the larger the F ratio will tend to be. A researcher can usually increase the size of an F (or t) ratio by increasing the differences in the dosage levels of treatments given to groups and/or reducing the effects of extraneous variables through experimental control, and/or increasing the number of participants. In studies that involve comparisons of naturally occurring groups, such as age groups, a similar principle applies: A researcher is more likely to see age-related changes in mental processing speed in a study that compares ages 20, 50, and 80 than in a study that compares ages 20, 25, and 30.
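A small numerical sketch may make these three influences concrete. With equal group sizes, the F ratio can be computed from summary statistics as F = [n Σ(Mi − MY)² / (k − 1)] / MSwithin. The function name and all of the numbers below (group means, within-group mean square, group sizes) are hypothetical values chosen only for illustration; they do not come from any study described in this chapter.

```python
def f_from_summary(group_means, n_per_group, ms_within):
    """F ratio from group means, a common group size n, and MS_within.

    ASSUMPTION: every group has the same n, so the grand mean is
    simply the mean of the group means.
    """
    k = len(group_means)
    grand_mean = sum(group_means) / k
    ss_between = n_per_group * sum((m - grand_mean) ** 2 for m in group_means)
    ms_between = ss_between / (k - 1)
    return ms_between / ms_within


# Hypothetical values, for illustration only.
f_wide = f_from_summary([10, 15, 20], n_per_group=10, ms_within=25)    # means far apart
f_narrow = f_from_summary([14, 15, 16], n_per_group=10, ms_within=25)  # means close together
f_less_error = f_from_summary([10, 15, 20], n_per_group=10, ms_within=12.5)
f_more_n = f_from_summary([10, 15, 20], n_per_group=20, ms_within=25)

print(f_wide, f_narrow, f_less_error, f_more_n)
```

In this sketch, spreading the means farther apart, halving the within-group variability, or doubling n each increases F, which mirrors the three strategies described above.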

6.5 Data Screening

A box and whiskers plot was set up to examine the distribution of anxiety scores within each of the four stress intervention groups; this plot appears in Figure 6.2. Examination of this box and whiskers plot indicates that there are no outlier scores in any of the four groups. Other preliminary data screening that could be done includes a histogram for all the anxiety scores in the entire sample. The Levene test to assess possible violations of the homogeneity of variance assumption is reported in Section 6.11.

Figure 6.2 Boxplot Showing Distribution of Anxiety Scores Within Each of the Four Treatment Groups in the Stress/Anxiety Experiment

Table 6.1 Data for the Partition of Heart Rate (HR) Scores

NOTES: DevGrand = HR – Grand Mean. DevGroup or residual = HR – Group Mean. Effect = Group Mean – Grand Mean. Reproduced Score = Grand Mean + Effect + DevGroup.

6.6 Partition of Scores Into Components

When we do a one-way ANOVA, the analysis involves partition of each score into two components: a component of the score that is associated with group membership and a component of the score that is not associated with group membership. The sums of squares (SS) summarize how much of the variance in scores is associated with group membership and how much is not related to group membership. Examining the proportion of variance that is associated with treatment group membership is one way of approaching the general research question, Why is there variance in the scores on the outcome variable?

To illustrate the partition of scores into components, consider the hypothetical data in Table 6.1. Suppose that we measure HR for each of 6 persons in a small sample and obtain the following set of scores: HR scores of 81, 85, and 92 for the 3 female participants and HR scores of 70, 63, and 68 for the 3 male participants. In this example, the X group membership variable is gender, coded 1 = female and 2 = male, and the Y quantitative outcome variable is HR. We can partition each HR score into a component that is related to group membership (i.e., related to gender) and a component that is not related to group membership. In a textbook discussion of ANOVA using just two groups, the factor that represents gender might be called Factor A.

We will denote the HR score for person j in group i as Yij. For example, Cathy’s HR of 92 corresponds to Y13, the score for the third person in Group 1. We will denote the grand mean (i.e., the mean HR for all N = 6 persons in this dataset) as MY; for this set of scores, the value of MY is 76.5. We will denote the mean for group i as Mi. For example, the mean HR for the female group (Group 1) is M1 = 86; the mean HR for the male group (Group 2) is M2 = 67. Once we know the individual scores, the grand mean, and the group means, we can work out a partition of each individual HR score into two components.

The question we are trying to answer is, Why is there variance in HR? In other words, why do some people in this sample have HRs that are higher than the sample mean of 76.5, while others have HRs that are lower than 76.5? Is gender one of the variables that predicts whether a person’s HR will be relatively high or low? For each person, we can compute a deviation of that person’s individual Yij score from the grand mean; this tells us how far above (or below) the grand mean of HR each person’s score was. Note that in Table 6.1, the value of the grand mean is the same for all 6 participants. The value of the group mean was different for members of Group i = 1 (females) than for members of Group i = 2 (males). We can use these means to compute the following three deviations from means for each person:

$$\text{DevGrand} = Y_{ij} - M_Y, \tag{6.1}$$

$$\text{DevGroup} = Y_{ij} - M_i, \tag{6.2}$$

$$\text{Effect} = M_i - M_Y. \tag{6.3}$$

The total deviation of the individual score from the grand mean, DevGrand, can be divided into two components: the deviation of the individual score from the group mean, DevGroup, and the deviation of the group mean from the grand mean, Effect:

$$(Y_{ij} - M_Y) = (Y_{ij} - M_i) + (M_i - M_Y). \tag{6.4}$$

For example, look at the line of data for Cathy in Table 6.1. Cathy’s observed HR was 92. Cathy’s HR has a total deviation from the grand mean of (92 − 76.5) = +15.5; that is, Cathy’s HR was 15.5 beats per minute (bpm) higher than the grand mean for this set of data. Compared with the group mean for the female group, Cathy’s HR had a deviation of (92 − 86) = +6; that is, Cathy’s HR was 6 bpm higher than the mean HR for the female group. Finally, the “effect” component of Cathy’s score is found by subtracting the grand mean (MY) from the mean of the group to which Cathy belongs (Mfemale or M1): (86 − 76.5) = +9.5. This value of 9.5 tells us that the mean HR for the female group (86) was 9.5 bpm higher than the mean HR for the entire sample (and this effect is the same for all members within each group). We could, therefore, say that the “effect” of membership in the female group is to increase the predicted HR score by 9.5. (Note, however, that this example involves comparison of naturally occurring groups, males vs. females, and so we should not interpret a group difference as evidence of causality but merely as a description of group differences.)
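The partition described above can be checked numerically. The following sketch uses the six HR scores given in the text and computes DevGrand, DevGroup, and Effect for each person; the variable names and the dictionary layout are illustrative choices, not part of the original table.

```python
import math

# HR scores from the text: Group 1 = female, Group 2 = male.
hr_scores = {1: [81, 85, 92], 2: [70, 63, 68]}

all_scores = [y for ys in hr_scores.values() for y in ys]
grand_mean = sum(all_scores) / len(all_scores)                       # M_Y
group_means = {g: sum(ys) / len(ys) for g, ys in hr_scores.items()}  # M_i

partition = []
for g, ys in hr_scores.items():
    for y in ys:
        dev_grand = y - grand_mean           # Y_ij - M_Y
        dev_group = y - group_means[g]       # Y_ij - M_i (residual)
        effect = group_means[g] - grand_mean # M_i - M_Y
        # The total deviation splits exactly into the two components.
        assert math.isclose(dev_grand, dev_group + effect)
        partition.append({
            "group": g,
            "hr": y,
            "dev_grand": dev_grand,
            "dev_group": dev_group,
            "effect": effect,
            "reproduced": grand_mean + effect + dev_group,
        })

for row in partition:
    print(row)
```

The row for the HR score of 92 (Cathy) shows a total deviation of 15.5, a within-group deviation of 6, and an effect of 9.5, matching the values worked out in the paragraph above.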

Another way to look at these numbers is to set up a predictive equation based on a theoretical model. When we do a one-way ANOVA, we seek to predict each individual score (Yij) on the quantitative outcome variable from the following theoretical components: the population mean (denoted by μ), the “effect” for group i (often denoted by αi, because α is the Greek letter that corresponds to A, which is the name usually given to a single factor), and the residual associated with the score for person j in group i (denoted by εij). Estimates for μ, αi, and εij can be obtained from the sample data as follows:

$$\text{Estimate of } \mu = M_Y, \tag{6.5}$$

$$\text{Estimate of } \alpha_i = M_i - M_Y, \tag{6.6}$$

$$\text{Estimate of } \varepsilon_{ij} = Y_{ij} - M_i. \tag{6.7}$$

Note that the deviations that are calculated as estimates for αi and εij (in Equations 6.6 and 6.7) are the same components that appeared in Equation 6.4. An individual observed HR score Yij can be represented as a sum of these theoretical components, as follows:

$$Y_{ij} = \mu + \alpha_i + \varepsilon_{ij}. \tag{6.8}$$

In words, Equation 6.8 says that we can predict (or reconstruct) the observed score for person j in group i by taking the grand mean μ, adding the “effect” αi that is associated with membership in group i, and finally, adding the residual εij that tells us how much individual j’s score differed from the mean of the group that person j belonged to.

We can now label the terms in Equation 6.4 using this more formal notation:

$$\underbrace{(Y_{ij} - M_Y)}_{Y_{ij} - \mu} = \underbrace{(Y_{ij} - M_i)}_{\varepsilon_{ij}} + \underbrace{(M_i - M_Y)}_{\alpha_i}. \tag{6.9}$$

In other words, we can divide the total deviation of each person’s score from the grand mean into two components: αi, the part of the score that is associated with or predictable from group membership (in this case, gender), and εij, the part of the score that is not associated with or predictable from group membership (the part of the HR score that is due to all other variables that influence HR, such as smoking, anxiety, drug use, level of fitness, health, etc.). In words, then, Equation 6.9 says that we can predict each person’s HR from the following information: Person j’s HR = grand mean + effect of person j’s gender on HR + effects of all other variables that influence person j’s HR, such as person j’s anxiety, health, drug use, and fitness.

The collective influence of “all other variables” on scores on the outcome variable, HR in this example, is called the residual, or “error.” In most research situations, researchers hope that the components of scores that represent group differences (in this case, gender differences in HR) will be relatively large and that the components of scores that represent within-group variability in HR (in this case, differences in HR among females and among males, and thus differences due to all variables other than gender) will be relatively small.

When we do an ANOVA, we summarize the information about the sizes of these two deviations or components, (Yij − Mi) and (Mi − MY), across all the scores in the sample. We cannot summarize information just by summing these deviations; recall from Chapter 2 that the sum of deviations of scores from a sample mean always equals 0, so ∑(Yij − Mi) = 0 and ∑(Mi − MY) = 0. When we summarized information about distances of scores from a sample mean by computing a sample variance, we avoided this problem by squaring the deviations from the mean prior to summing them. The same strategy is applied in ANOVA. ANOVA begins by computing the following deviations for each score in the dataset (see Table 6.1 for an empirical example; these terms are denoted DevGrand, DevGroup, and Effect in Table 6.1).

$$(Y_{ij} - M_Y) = (Y_{ij} - M_i) + (M_i - M_Y).$$

To summarize information about the magnitudes of these score components across all the participants in the dataset, we square each term and then sum the squared deviations across all the scores in the entire dataset:

$$\sum_{i}\sum_{j} (Y_{ij} - M_Y)^2 = \sum_{i}\sum_{j} (Y_{ij} - M_i)^2 + \sum_{i}\sum_{j} (M_i - M_Y)^2. \tag{6.10}$$

These sums of squared deviations are given the following names: SStotal, sum of squares total, for the sum of the squared deviations of each score from the grand mean; SSwithin, or residual, for the sum of squared deviations of each score from its group mean, which summarizes information about within-group variability of scores; and SSbetween, or treatment, which summarizes information about the distances among (variability among) group means.

Note that, usually, when we square an equation of the form a = (b + c), as in $(Y_{ij} - M_Y)^2 = [(Y_{ij} - M_i) + (M_i - M_Y)]^2$, the result is of the form $a^2 = b^2 + c^2 + 2bc$. However, the cross-product term (which would correspond to 2bc) is missing in Equation 6.10. When the cross products are summed across all the scores, this term equals 0: within each group, (Mi − MY) is a constant and the deviations (Yij − Mi) sum to 0. The (Yij − Mi) and (Mi − MY) deviations are therefore uncorrelated with each other, and the cross-product term drops out of the sum.
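The partition of the sums of squares, and the disappearance of the cross-product term, can be verified with the six HR scores from the text. This is a minimal sketch; the variable names are illustrative.

```python
import math

# HR scores from the text: Group 1 = female, Group 2 = male.
hr_scores = {1: [81, 85, 92], 2: [70, 63, 68]}

all_scores = [y for ys in hr_scores.values() for y in ys]
grand_mean = sum(all_scores) / len(all_scores)
group_means = {g: sum(ys) / len(ys) for g, ys in hr_scores.items()}

ss_total = ss_within = ss_between = cross_product = 0.0
for g, ys in hr_scores.items():
    for y in ys:
        dev_group = y - group_means[g]        # (Y_ij - M_i)
        effect = group_means[g] - grand_mean  # (M_i - M_Y)
        ss_total += (y - grand_mean) ** 2
        ss_within += dev_group ** 2
        ss_between += effect ** 2
        cross_product += dev_group * effect

print(ss_total, ss_within, ss_between, cross_product)
# The cross products sum to 0, so SS_total = SS_within + SS_between.
assert math.isclose(cross_product, 0.0, abs_tol=1e-9)
assert math.isclose(ss_total, ss_within + ss_between)
```

For these data, SStotal = 629.5, SSwithin = 88, and SSbetween = 541.5, so most of the variability in HR in this small sample is associated with group membership.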

This example of the partition of scores into components and the partition of SStotal into SSbetween and SSwithin had just two groups, but this type of partition can be done for any number of groups. Researchers usually hope that SSbetween will be relatively large because this would be evidence that the group means are far apart and that the different participant characteristics or different types or amounts of treatments for each group are, therefore, predictively related to scores on the Y outcome variable.

This concept of variance partitioning is a fundamental and important one. In other analyses later on (such as Pearson correlation and bivariate regression), we will come across a similar type of partition; each Y score is divided into a component that is predictable from the X variable and a component that is not predictable from the X variable.

It may have occurred to you that if you knew each person’s scores on physical fitness, drug use, body weight, and anxiety level, and if these variables are predictively related to HR, then you could use information about each person’s scores on these variables to make a more accurate prediction of HR. This is exactly the reasoning that is used in analyses such as multiple regression, where we predict each person’s score on an outcome variable, such as anxiety, by additively combining the estimated effects of numerous predictor variables. It may have also occurred to you that, even if we know several factors that influence HR, we might be missing some important information; we might not know, for example, that the individual person has a family history of elevated HR. Because some important variables are almost always left out, the predicted score that we generate from a simple additive model (such as grand mean + gender effect + effects of smoking = predicted HR) rarely corresponds exactly to the person’s actual HR. The difference between the actual HR and the HR predicted from the model is called a residual or error. In ANOVA, we predict people’s scores by adding (or subtracting) the number of points that correspond to the effect for their treatment group to the grand mean.

Recall that for each subject in the caffeine group in the t test example in the preceding chapter, the best prediction for that person’s HR was the mean HR for all people in the caffeine group. The error or residual corresponded to the difference between each participant’s actual HR and the group mean, and this residual was presumably due to factors that uniquely affected that individual over and above responses to the effects of caffeine.

6.7 Computations for the One-Way Between-S ANOVA

The formulas for computation of the sums of squares, mean squares, and F ratio for the one-way between-S ANOVA are presented in this section; they are applied to the data from the hypothetical stress/anxiety study that appear in Figure 6.1.

6.7.1 Comparison Between the Independent Samples t Test and One-Way Between-S ANOVA

The one-way between-S ANOVA is a generalization of the independent samples t test. The t ratio provides a test of the null hypothesis that two population means are equal. For the independent samples t test, the null hypothesis has the following form:

$$H_0: \mu_1 = \mu_2.$$

For a one-way ANOVA with k groups, the null hypothesis is as follows:

$$H_0: \mu_1 = \mu_2 = \dots = \mu_k.$$

The computation of the independent samples t test requires that we find the following for each group: M, the sample mean; s, the sample standard deviation; and n, the number of scores in each group (see Section 6 of Chapter 5). For a one-way ANOVA, the same computations are performed; the only difference is that we have to obtain (and summarize) this information for k groups (instead of only two groups as in the t test).

Table 6.2 summarizes the information included in the computation of the independent samples t test and the one-way ANOVA to make it easier to see how similar these analyses are. For each analysis, we need to obtain information about differences between group means (for the t test, we compute M1 − M2; for the ANOVA, because we have more than two groups, we need to find the variance of the group means M1, M2, …, Mk). The variance among the k group means is called MSbetween; the formulas to compute this and other intermediate terms in ANOVA appear below. We need to obtain information about the amount of variability of scores within each group; for both the independent samples t test and ANOVA, we can begin by computing an SS term that summarizes the squared distances of all the scores in each group from their group mean. We then convert that information into a summary about the amount of variability of scores within groups, summarized across all the groups (for the independent samples t test, a summary of within-group score variability is provided by the pooled variance, $s_p^2$; for ANOVA, a summary of within-group score variability is called MSwithin).

Table 6.2 Comparison Between the Independent Samples t Test and One-Way Between-Subjects (Between-S) ANOVA

NOTES: The F ratio MSbetween/MSwithin can be thought of as a ratio of

and the F ratio can also be interpreted as information about the relative magnitude of

For two groups, $t^2$ is equivalent to F.

Any time that we compute a variance (or a mean square), we sum squared deviations of scores from group means. As discussed in Chapter 2, when we look at a set of deviations of n scores from their sample mean, only the first n − 1 of these deviations are free to vary. Because the sum of all the n deviations from a sample mean must equal 0, once we have the first n − 1 deviations, the value of the last deviation is not “free to vary”; it must have whatever value is needed to make the sum of the deviations equal to 0. For the t ratio, we computed a variance for the denominator, and we need a single df term that tells us how many independent deviations from the mean the divisor of the t test was based on. For the independent samples t test, df = n1 + n2 − 2, where n1 and n2 are the numbers of scores in Groups 1 and 2; because the total number of scores in the entire study is N = n1 + n2, we can also say that df = N − 2.

For an F ratio, we compare MSbetween with MSwithin; we need to have a separate df for each of these mean square terms to indicate how many independent deviations from the mean each MS term was based on.

The test statistic for the independent samples t test is

$$t = \frac{M_1 - M_2}{SE_{M_1 - M_2}}.$$

The test statistic for the one-way between-S ANOVA is F = MSbetween/MSwithin. In each case, we have information about differences between or among group means in the numerator and information about variability of scores within groups in the denominator. In a well-designed experiment, differences between group means are attributed primarily to the effects of the manipulated independent variable X, whereas variability of scores within groups is attributed to the effects of all other variables, collectively called error. In most research situations, researchers hope to obtain large enough values of t and F to be able to conclude that there are statistically significant differences among group means. The method for the computation of these between- and within-group mean squares is given in detail in the following few sections.
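The close relation between the two test statistics can be illustrated with the two-group HR data from Section 6.6. The sketch below assumes the pooled-variance (equal variances assumed) form of the independent samples t test, which is the version for which t² equals F; the helper names `mean` and `sum_sq` are hypothetical.

```python
import math

female = [81, 85, 92]  # Group 1 HR scores from the text
male = [70, 63, 68]    # Group 2 HR scores from the text


def mean(xs):
    return sum(xs) / len(xs)


def sum_sq(xs):
    """Sum of squared deviations of scores from their own mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


n1, n2 = len(female), len(male)
m1, m2 = mean(female), mean(male)

# Independent samples t test, pooled-variance version (assumed here).
pooled_var = (sum_sq(female) + sum_sq(male)) / (n1 + n2 - 2)
se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t = (m1 - m2) / se_diff

# One-way ANOVA on the same two groups (k = 2, so df_between = 1).
grand_mean = mean(female + male)
ms_between = (n1 * (m1 - grand_mean) ** 2 + n2 * (m2 - grand_mean) ** 2) / (2 - 1)
ms_within = pooled_var  # SS_within / (N - k) is the pooled variance
f_ratio = ms_between / ms_within

print(round(t, 3), round(t ** 2, 3), round(f_ratio, 3))
assert math.isclose(t ** 2, f_ratio)
```

With only two groups, MSwithin is the same quantity as the pooled variance, which is why the two analyses lead to the same decision.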

The by-hand computation for one-way ANOVA (with k groups and a total of N observations) involves the following steps. Complete formulas for the SS terms are provided in the following sections.

1. Compute SSbetween groups, SSwithin groups, and SStotal.

2. Compute MSbetween by dividing SSbetween by its df, k − 1.

3. Compute MSwithin by dividing SSwithin by its df, N − k.

4. Compute an F ratio: MSbetween/MSwithin.

5. Compare the obtained F value with the critical value of F from a table of the F distribution with (k − 1) and (N − k) df (using the table in Appendix C that corresponds to the desired alpha level; for example, the first table provides critical values for α = .05). If the obtained F value exceeds the tabled critical value of F for the predetermined alpha level and the available degrees of freedom, reject the null hypothesis that all the population means are equal.
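The first four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `one_way_anova` and the three small groups of scores are hypothetical and are not the data from Figure 6.1.

```python
def one_way_anova(groups):
    """By-hand one-way between-S ANOVA for a list of groups of scores."""
    k = len(groups)                               # number of groups
    n_total = sum(len(g) for g in groups)         # N, total number of scores
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]

    # Step 1: sums of squares
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((y - m) ** 2
                    for g, m in zip(groups, group_means) for y in g)
    ss_total = sum((y - grand_mean) ** 2 for g in groups for y in g)

    # Steps 2 and 3: mean squares (each SS divided by its df)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Step 4: F ratio
    f_ratio = ms_between / ms_within
    return {
        "ss_between": ss_between, "ss_within": ss_within, "ss_total": ss_total,
        "df_between": k - 1, "df_within": n_total - k,
        "ms_between": ms_between, "ms_within": ms_within, "F": f_ratio,
    }


# Hypothetical scores for k = 3 groups of n = 3 (illustration only)
scores = [[2, 3, 4], [4, 5, 6], [7, 8, 9]]
result = one_way_anova(scores)
print(result)
```

Step 5 is not automated here; the returned F would still be compared with the tabled critical value for `df_between` and `df_within` degrees of freedom.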

6.7.2 Summarizing Information About Distances Between Group Means: Computing MSbetween

The following notation will be used:

Let k be the number of groups in the study.

Let n1, n2, …, nk be the number of scores in Groups 1, 2, …, k.

Let Yij be the score of subject number j in group i; i = 1, 2, …, k.

Let M1, M2, …, Mk be the means of scores in Groups 1, 2, …, k.

Let N be the total N in the entire study; N = n1 + n2 + … + nk.

Let MY be the grand mean of all scores in the study (i.e., the total of all the individual scores, divided by N, the total number of scores).

Once we have calculated the means of each individual group (M1, M2, …, Mk) and the grand mean MY, we can summarize information about the distances of the group means, Mi, from the grand mean, MY, by computing SSbetween as follows:

$$SS_{between} = \sum_{i=1}^{k} n_i (M_i - M_Y)^2.$$

For the hypothetical data in Figure 6.1, the mean anxiety scores for Groups 1 through 4 were as follows: M1 = 9.86, M2 = 14.29, M3 = 13.57, and M4 = 17.00. The grand mean on anxiety, MY, is 13.68. Each group had n = 7 scores. Therefore, for this study,

$$SS_{between} = 7(9.86 - 13.68)^2 + 7(14.29 - 13.68)^2 + 7(13.57 - 13.68)^2 + 7(17.00 - 13.68)^2,$$

so SSbetween ≈ 182 (this agrees with the value of SSbetween in the SPSS output that is presented in Section 6.11 except for a small amount of rounding error).
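The arithmetic behind this value can be checked with a short Python sketch that uses only the rounded group means, grand mean, and n reported above; the variable names are illustrative.

```python
# Group means, grand mean, and per-group n as reported in the text (rounded values)
group_means = [9.86, 14.29, 13.57, 17.00]
grand_mean = 13.68
n_per_group = 7

# SSbetween: n times the squared distance of each group mean from the grand mean
ss_between = sum(n_per_group * (m - grand_mean) ** 2 for m in group_means)
print(round(ss_between, 2))  # about 181.99, i.e., approximately 182
```

Because the inputs are rounded to two decimals, the result differs slightly from the SPSS value, as the text notes.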

6.7.3 Summarizing Information About Variability of Scores Within Groups: Computing MSwithin

To summarize information about the variability of scores within each group, we compute MSwithin. For each group i (i = 1, 2, …, k), we first find the sum of squared deviations of scores from that group's mean, SSi. The SS for scores within group i is found by taking this sum:

$$SS_i = \sum_{j=1}^{n_i} (Y_{ij} - M_i)^2.$$

That is, for each of the k groups, find the deviation of each individual score from the group mean; square and sum these deviations for all the scores in the group. These within-group SS terms for Groups 1, 2, …, k are summed across the k groups to obtain the total SSwithin:

$$SS_{within} = \sum_{i=1}^{k} SS_i = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - M_i)^2.$$

For this dataset, we can find the SS term for Group 1 (for example) by taking the sum of the squared deviations of each individual score in Group 1 from the mean of Group 1, M1. The values are shown for by-hand computations; it can be instructive to do this as a spreadsheet, entering the value of the group mean for each participant as a new variable and computing the deviation of each score from its group mean and the squared deviation for each participant.

For the four groups of scores in the dataset in Figure 6.1, these are the values of SS for each group: SS1 = 26.86, SS2 = 27.43, SS3 = 41.71, and SS4 = 26.00.

Thus, the total value of SSwithin for this set of data is SSwithin = SS1 + SS2 + SS3 + SS4 = 26.86 + 27.43 + 41.71 + 26.00 = 122.00.

We can also find SStotal; this involves taking the deviation of every individual score from the grand mean, squaring each deviation, and summing the squared deviations across all scores and all groups:

$$SS_{total} = \sum_{i=1}^{k} \sum_{j=1}^{n_i} (Y_{ij} - M_Y)^2.$$

The grand mean MY = 13.68. This SS term includes 28 squared deviations, one for each participant in the dataset, as follows:

SStotal = (10 − 13.68)² + (10 − 13.68)² + (12 − 13.68)² + … + (18 − 13.68)² = 304.

Recall that Equation 6.10 showed that SStotal can be partitioned into SS terms that represent between- and within-group differences among scores; therefore, the SSbetween and SSwithin terms just calculated will sum to SStotal:

$$SS_{total} = SS_{between} + SS_{within}.$$

For these data, SStotal = 304, SSbetween = 182, and SSwithin = 122, so the sum of SSbetween and SSwithin equals SStotal (due to rounding error, these values differ slightly from the values that appear on the SPSS output in Section 6.11).

6.7.4 The F Ratio: Comparing MSbetween With MSwithin

An F ratio is a ratio of two mean squares. A mean square is the ratio of a sum of squares to its degrees of freedom. Note that the formula for a sample variance back in Chapter 2 was just the sum of squares for a sample divided by the degrees of freedom that correspond to the number of independent deviations that were used to compute the SS. From Chapter 2, Var(X) = SS/df.

The df terms for the two MS terms in a one-way between-S ANOVA are based on k, the number of groups, and N, the total number of scores in the entire study (where N = n1 + n2 + … + nk). The between-group SS was obtained by summing the deviations of each of the k group means from the grand mean; only the first k − 1 of these deviations are free to vary, so the between-groups df = k − 1, where k is the number of groups.

In ANOVA, the mean square between groups is calculated by dividing SSbetween by its degrees of freedom:

$$MS_{between} = \frac{SS_{between}}{k - 1}.$$

For the data in the hypothetical stress/anxiety study, SSbetween = 182, dfbetween = 4 − 1 = 3, and MSbetween = 182/3 = 60.7.

The df for each SS within-group term is given by n − 1, where n is the number of participants in each group. Thus, in this example, SS1 had n − 1 or df = 6. When we form SSwithin, we add up SS1 + SS2 + … + SSk. There are (n − 1) df associated with each SS term, and there are k groups, so the total dfwithin = k × (n − 1) = kn − k. This can also be written as

$$df_{within} = N - k,$$

where N = the total number of scores = n1 + n2 + … + nk and k is the number of groups.

We obtain MSwithin by dividing SSwithin by its corresponding df:

$$MS_{within} = \frac{SS_{within}}{N - k}.$$

For the hypothetical stress/anxiety data in Figure 6.1, MSwithin = 122/24 = 5.083.

Finally, we can set up a test statistic for the null hypothesis H0: μ1 = μ2 = … = μk by taking the ratio of MSbetween to MSwithin:

$$F = \frac{MS_{between}}{MS_{within}}.$$

For the stress/anxiety data, F = 60.702/5.083 = 11.94. This F ratio is evaluated using the F distribution with (k − 1) and (N − k) df. For this dataset, k = 4 and N = 28, so df for the F ratio are 3 and 24.

An F distribution has a shape that differs from the normal or t distribution. Because an F is a ratio of two mean squares and MS cannot be less than 0, the minimum possible value of F is 0. On the other hand, there is no fixed upper limit for the values of F. Therefore, the distribution of F tends to be positively skewed, with a lower limit of 0, as in Figure 6.3. The reject region for significance tests with F ratios consists of only one tail (at the upper end of the distribution). The first table in Appendix C shows the critical values of F for α = .05. The second and third tables in Appendix C provide critical values of F for α = .01 and α = .001. In the hypothetical study of stress and anxiety, the F ratio has df equal to 3 and 24. Using α = .05, the critical value of F from the first table in Appendix C with df = 3 in the numerator (across the top of the table) and df = 24 in the denominator (along the left-hand side of the table) is 3.01. Thus, in this situation, the α = .05 decision rule for evaluating statistical significance is to reject H0 when values of F > +3.01 are obtained. A value of 3.01 cuts off the top 5% of the area in the right-hand tail of the F distribution with df equal to 3 and 24, as shown in Figure 6.3. The obtained F = 11.94 would therefore be judged statistically significant.
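The decision rule just described can be sketched in Python using the rounded SS values from the preceding sections. The critical value 3.01 is the tabled value quoted in the text, not something the code computes. Because the rounded SSbetween = 182 is used here instead of the unrounded SPSS value, the F obtained (about 11.93) differs slightly from the 11.94 reported above.

```python
# Rounded sums of squares from the by-hand computations in the text
ss_between = 182.0
ss_within = 122.0
k, n_total = 4, 28                      # number of groups, total N

df_between = k - 1                      # 3
df_within = n_total - k                 # 24

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

# Critical value for alpha = .05 with (3, 24) df, as quoted from Appendix C in the text
F_CRITICAL_05 = 3.01
reject_h0 = f_ratio > F_CRITICAL_05

print(round(ms_between, 3), round(ms_within, 3), round(f_ratio, 2), reject_h0)
```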

Figure 6.3 F Distribution With 3 and 24 df

6.7.5 Patterns of Scores Related to the Magnitudes of MSbetween and MSwithin

It is important to understand what information about pattern in the data is contained in these SS and MS terms. SSbetween is a function of the distances among the group means (M1, M2, …, Mk); the farther apart these group means are, the larger SSbetween tends to be. Most researchers hope to find significant differences among groups, and therefore, they want SSbetween (and F) to be relatively large. SSwithin is the total of squared within-group deviations of scores from group means. SSwithin would be 0 in the unlikely event that all scores within each group were equal to each other. The greater the variability of scores within each group, the larger the value of SSwithin.

Consider the example shown in Table 6.3, which shows hypothetical data for which SSbetween would be 0 (because all the group means are equal); however, SSwithin is not 0 (because the scores vary within groups). Table 6.4 shows data for which SSbetween is not 0 (group means differ) but SSwithin is 0 (scores do not vary within groups). Table 6.5 shows data for which both SSbetween and SSwithin are nonzero. Finally, Table 6.6 shows a pattern of scores for which both SSbetween and SSwithin are 0.

Table 6.3 Data for Which SSbetween Is 0 (Because All the Group Means Are Equal), but SSwithin Is Not 0 (Because Scores Vary Within Groups)

Table 6.4 Data for Which SSbetween Is Not 0 (Because Group Means Differ), but SSwithin Is 0 (Because Scores Do Not Vary Within Groups)

Table 6.5 Data for Which SSbetween and SSwithin Are Both Nonzero

Table 6.6 Data for Which Both SSwithin and SSbetween Equal 0
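The four patterns described for Tables 6.3 through 6.6 can be illustrated with a short Python sketch. The small datasets below are hypothetical stand-ins chosen to produce each pattern; they are not the values printed in the tables, and the helper name `ss_terms` is illustrative.

```python
def ss_terms(groups):
    """Return (SSbetween, SSwithin) for a list of groups of scores."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return ss_between, ss_within


# Hypothetical data, one dataset per pattern (illustration only)
patterns = {
    "equal means, scores vary within groups": [[1, 2, 3], [1, 2, 3], [1, 2, 3]],
    "means differ, no variation within groups": [[1, 1, 1], [2, 2, 2], [3, 3, 3]],
    "means differ and scores vary": [[1, 2, 3], [2, 3, 4], [3, 4, 5]],
    "all scores identical": [[2, 2, 2], [2, 2, 2], [2, 2, 2]],
}

results = {label: ss_terms(data) for label, data in patterns.items()}
for label, (ss_b, ss_w) in results.items():
    print(f"{label}: SSbetween = {ss_b:.1f}, SSwithin = {ss_w:.1f}")
```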

6.7.6 Expected Value of F When H0 Is True

The population variances that are estimated by this ratio of sample mean squares (based on the algebra of expected mean squares; see Note 2) are as follows:

$$F = \frac{MS_{between}}{MS_{within}} \quad \text{estimates} \quad \frac{\sigma^2_{\varepsilon} + n\sigma^2_{\alpha}}{\sigma^2_{\varepsilon}},$$

where n is the number of scores in each group, $\sigma^2_{\alpha}$ is the population variance of the alpha group effects, that is, the amount of variance in the Y scores that is associated with or predictable from the group membership variable, and $\sigma^2_{\varepsilon}$ is the population error variance, that is, the variance in scores that is due to all variables other than the group membership variable or manipulated treatment variable. Earlier, the null hypothesis for a one-way ANOVA was given as

H0: μ1 = μ2 = … = μk.

An alternative way to state this null hypothesis is that all the αi effects are equal to 0 (and therefore equal to each other): α1 = α2 = … = αk = 0. Therefore, another form of the null hypothesis for ANOVA is as follows:

$$H_0: \sigma^2_{\alpha} = 0.$$

It follows that if H0 is true, then the expected value of the F ratio is close to 1.

If F is much greater than 1, we have evidence that $\sigma^2_{\alpha}$ may be larger than 0.

How is F distributed across thousands of samples? First of all, note that the MS terms in the F ratio cannot be negative (MSbetween can be 0 in rare cases where all the group means are exactly equal; MSwithin can be 0 in even rarer cases where all the scores within each group are equal; but because these are sums of squared terms, neither MS can be negative).

The sums of squared independent normal variables have a chi-square distribution; thus, each of the mean squares is distributed as a (scaled) chi-square variate. An F distribution is the distribution of a ratio of two independent chi-square variates, each divided by its df. Like chi-square, the graph of an F distribution has a lower tail that ends at 0, and it tends to be skewed with a long tail off to the right. (See Figure 6.3 for a graph of the distribution of F with df = 3 and 24.) We reject H0 for large F values (which cut off the upper 5% in the right-hand tail). Thus, F is almost always treated as a one-tailed test. (It is possible to look at the lower tail of the F distribution to evaluate whether the obtained sample F is too small for it to be likely to have arisen by chance, but this is rarely done.)

Note that for the two-group situation, F is equivalent to t². Both F and t are ratios of an estimate of between-group differences to within-group differences; between-group differences are interpreted as being primarily due to the manipulated independent variable, while within-group variability is due to the effects of extraneous variables. Both t and F are interpretable as signal-to-noise ratios, where the “signal” is the effect of the manipulated independent variable; in most research situations, we hope that this term will be relatively large. The size of the signal is evaluated relative to the magnitude of “noise,” the variability due to all other extraneous variables; in most research situations, we hope that this will be relatively small. Thus, in most research situations, we hope for values of t or F that are large enough for the null hypothesis to be rejected. A significant F can be interpreted as evidence that the between-groups independent variable had a detectable effect on the outcome variable. In rare circumstances, researchers hope to affirm the null hypothesis, that is, to demonstrate that the independent variable has no detectable effect, but this claim is actually quite difficult to prove.
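The equivalence between F and the squared t ratio in the two-group case can be verified numerically. The following Python sketch uses two small hypothetical groups (not data from this chapter) and the pooled-variance form of the independent samples t test; the variable names are illustrative.

```python
import math

# Hypothetical scores for two independent groups (illustration only)
group1 = [3, 4, 5, 6]
group2 = [5, 7, 8, 8]

n1, n2 = len(group1), len(group2)
m1, m2 = sum(group1) / n1, sum(group2) / n2
ss1 = sum((y - m1) ** 2 for y in group1)
ss2 = sum((y - m2) ** 2 for y in group2)

# Independent samples t with pooled variance, df = n1 + n2 - 2
pooled_var = (ss1 + ss2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

# One-way ANOVA on the same two groups (k = 2, so df_between = 1)
grand_mean = (sum(group1) + sum(group2)) / (n1 + n2)
ss_between = n1 * (m1 - grand_mean) ** 2 + n2 * (m2 - grand_mean) ** 2
ms_between = ss_between / (2 - 1)
ms_within = (ss1 + ss2) / (n1 + n2 - 2)
f_ratio = ms_between / ms_within

print(round(t, 3), round(t ** 2, 3), round(f_ratio, 3))
```

The sign of t depends on which group is listed first, whereas F carries no sign; only the squared t matches F.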

If we obtain an F ratio large enough to reject H0, what can we conclude? The alternative hypothesis is not that all group means differ from each other significantly but that there is at least one (and possibly more than one) significant difference between group means:

$$H_1: \mu_i \neq \mu_j \ \text{for at least one pair of groups } i \neq j.$$

A significant F tells us that there is probably at least one significant difference among group means; by itself, it does not tell us where that difference lies. It is necessary to do additional tests to identify the one or more significant differences (see Section 6.10).

6.7.7 Confidence Intervals (CIs) for Group Means

Once we have information about means and variances for each group, we can set up a CI around the mean for each group (using the formulas from Chapter 2) or a CI around any difference between a pair of group means (as described in Chapter 5). SPSS provides a plot of group means with error bars that represent 95% CIs (the SPSS menu selections to produce this graph appear in Figures 6.12 and 6.13, and the resulting graph appears in Figure 6.14; these appear at the end of the chapter along with the output from the Results section for the one-way ANOVA).

6.8 Effect-Size Index for One-Way Between-S ANOVA

By comparing the sizes of these SS terms that represent variability of scores between and within groups, we can make a summary statement about the comparative size of the effects of the independent and extraneous variables. The proportion of the total variability (SStotal) that is due to between-group differences is given by

$$\eta^2 = \frac{SS_{between}}{SS_{total}}.$$

In the context of a well-controlled experiment, these between-group differences in scores are, presumably, primarily due to the manipulated independent variable; in a nonexperimental study that compares naturally occurring groups, this proportion of variance is reported only to describe the magnitudes of differences between groups, and it is not interpreted as evidence of causality. An eta squared (η²) is an effect-size index given as a proportion of variance; if η² = .50, then 50% of the variance in the Yij scores is due to between-group differences. This is the same eta squared that was introduced in the previous chapter as an effect-size index for the independent samples t test; verbal labels that can be used to describe effect sizes are provided in Table 5.2. If the scores in a two-group t test are partitioned into components using the logic just described here and then summarized by creating sums of squares, the η² value obtained will be identical to the η² that was calculated from the t and df terms.

It is also possible to calculate eta squared from the F ratio and its df; this is useful when reading journal articles that report F tests without providing effect-size information:

$$\eta^2 = \frac{df_{between} \times F}{df_{between} \times F + df_{within}}.$$

An eta squared is interpreted as the proportion of variance in scores on the Y outcome variable that is predictable from group membership (i.e., from the score on X, the predictor variable). Suggested verbal labels for eta-squared effect sizes are given in Table 5.2.

An alternative effect-size measure sometimes used in ANOVA is called omega squared (ω²) (see W. Hays, 1994). The eta-squared index describes the proportion of variance due to between-group differences in the sample, but it is a biased estimate of the proportion of variance that is theoretically due to differences among the populations. The ω² index is essentially a (downwardly) adjusted version of eta squared that provides a more conservative estimate of variance among population means; however, eta squared is more widely used in statistical power analysis and as an effect-size measure in the literature.
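As a rough check on these effect-size indexes, the Python sketch below computes eta squared two ways (from the SS terms and from F and its df) using the rounded SS values from the stress/anxiety example. The omega-squared line uses a commonly cited formula for the one-way between-S design, (SSbetween − dfbetween × MSwithin)/(SStotal + MSwithin); that formula is not given in this section and is included here as an assumption.

```python
# Rounded values from the by-hand computations for the stress/anxiety example
ss_between, ss_within = 182.0, 122.0
ss_total = ss_between + ss_within
df_between, df_within = 3, 24

ms_between = ss_between / df_between
ms_within = ss_within / df_within
f_ratio = ms_between / ms_within

# Eta squared from the sums of squares
eta_sq = ss_between / ss_total

# Eta squared recovered from F and its df (useful when only F is reported)
eta_sq_from_f = (df_between * f_ratio) / (df_between * f_ratio + df_within)

# Omega squared: assumed standard formula, not taken from this section
omega_sq = (ss_between - df_between * ms_within) / (ss_total + ms_within)

print(round(eta_sq, 3), round(eta_sq_from_f, 3), round(omega_sq, 3))
```

The two eta-squared values agree, and omega squared comes out somewhat smaller, which is consistent with its description as a downwardly adjusted index.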

6.9 Statistical Power Analysis for One-Way Between-S ANOVA

Table 6.7 is an example of a statistical power table that can be used to make decisions about sample size when planning a one-way between-S ANOVA with k = 3 groups and α = .05. Using Table 6.7, given the number of groups, the number of participants, the predetermined alpha level, and the anticipated population effect size estimated by eta squared, the researcher can look up the minimum n of participants per group that is required to obtain various levels of statistical power. The researcher needs to make an educated guess: How large an effect is expected in the planned study? If similar studies have been conducted in the past, the eta-squared values from past research can be used to estimate effect size; if not, the researcher may have to make a guess based on less exact information. The researcher chooses the alpha level (usually .05), calculates dfbetween (which equals k − 1, where k is the number of groups in the study), and decides on the desired level of statistical power (usually .80, or 80%). Using this information, the researcher can use the tables in Cohen (1988) or in Jaccard and Becker (2002) to look up the minimum sample size per group that is needed to achieve the power of 80%. For example, using Table 6.7, for an alpha level of .05, a study with three groups and dfbetween = 2, a population eta-squared value of .15, and a desired level of power of .80, the minimum number of participants required per group would be 19.

Java applets are available on the Web for statistical power analysis; typically, if the user identifies a Java applet that is appropriate for the specific analysis (such as between-S one-way ANOVA) and enters information about alpha, the number of groups, population effect size, and desired level of power, the applet provides the minimum per-group sample size required to achieve the user-specified level of statistical power.

Table 6.7 Statistical Power for One-Way Between-S ANOVA With k = 3 Groups Using α = .05

SOURCE: Adapted from Jaccard and Becker (2002).

NOTE: Each table entry corresponds to the minimum n required in each group to obtain the level of statistical power shown.
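One way to build intuition for a power table entry is a small Monte Carlo simulation. The Python sketch below is not the method used to construct Table 6.7; it simply draws many samples with k = 3 groups of n = 19 from normal populations whose means are spaced so that the population eta squared is .15, and counts how often F exceeds the critical value. The critical value 3.17 for (2, 54) df at α = .05 is an approximate value assumed here and should be checked against an F table; the function names, seed, and number of replications are arbitrary choices.

```python
import math
import random


def f_ratio(groups):
    """One-way between-S F ratio for a list of groups of scores."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((y - m) ** 2 for g, m in zip(groups, means) for y in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def simulated_power(n_per_group, eta_sq, f_critical, reps=2000, seed=1):
    """Estimate power for k = 3 groups by simulation (error SD fixed at 1)."""
    rng = random.Random(seed)
    # Variance of the population group means relative to an error variance of 1
    effect_var = eta_sq / (1 - eta_sq)
    # Means (-d, 0, +d) have variance 2 * d**2 / 3, so solve for d
    d = math.sqrt(1.5 * effect_var)
    population_means = [-d, 0.0, d]
    rejections = 0
    for _ in range(reps):
        sample = [[rng.gauss(mu, 1.0) for _ in range(n_per_group)]
                  for mu in population_means]
        if f_ratio(sample) > f_critical:
            rejections += 1
    return rejections / reps


# Assumed approximate critical F for (2, 54) df at alpha = .05
power = simulated_power(n_per_group=19, eta_sq=0.15, f_critical=3.17)
print(round(power, 2))
```

The estimate should land in the neighborhood of .80, in line with the table entry of 19 participants per group, although the exact figure varies with the seed and the number of replications.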

6.10 Nature of Differences Among Group Means

6.10.1 Planned Contrasts

The idea behind planned contrasts is that the researcher identifies a limited number of comparisons between group means before looking at the data. The test statistic that is used for each comparison is essentially identical to a t ratio except that the denominator is usually based on the MSwithin for the entire ANOVA, rather than just the variances for the two groups involved in the comparison. Sometimes an F is reported for the significance of each contrast, but F is equivalent to t² in situations where only two group means are compared or where a contrast has only 1 df.

For the means of Groups a and b, the null hypothesis for a simple contrast between Ma and Mb is as follows:

H0: μa = μb

or

H0: μa − μb = 0.

The test statistic can be in the form of a t test:

$$t = \frac{M_a - M_b}{\sqrt{MS_{within}\left(\frac{1}{n} + \frac{1}{n}\right)}},$$

where n is the number of cases within each group in the ANOVA. (If the ns are unequal across groups, then an average value of n is used; usually, this is the harmonic mean of the ns; see Note 3.)

Note that this is essentially equivalent to an ordinary t test. In a t test, the measure of within-group variability is the pooled variance, $s_p^2$; in a one-way ANOVA, information about within-group variability is contained in the term MSwithin. In cases where an F is reported as a significance test for a contrast between a pair of group means, F is equivalent to t². The df for this t test equal N − k, where N is the total number of cases in the entire study and k is the number of groups.

When a researcher uses planned contrasts, it is possible to make other kinds of comparisons that may be more complex in form than a simple pairwise comparison of means. For instance, suppose that the researcher has a study in which there are four groups; Group 1 receives a placebo and Groups 2 to 4 all receive different antidepressant drugs. One hypothesis that may be of interest is whether the average depression score combined across the three drug groups is significantly lower than the mean depression score in Group 1, the group that received only a placebo.

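The null hypothesis that corresponds to this comparison can be written in any of the following ways:

$$H_0: \mu_1 - \frac{\mu_2 + \mu_3 + \mu_4}{3} = 0$$

or

$$H_0: (+1)\mu_1 + \left(-\tfrac{1}{3}\right)\mu_2 + \left(-\tfrac{1}{3}\right)\mu_3 + \left(-\tfrac{1}{3}\right)\mu_4 = 0.$$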

In words, this null hypothesis says that when we combine the means using certain weights (such as +1, −1/3, −1/3, and −1/3), the resulting composite is predicted to have a value of 0. This is equivalent to saying that the mean outcome averaged or combined across Groups 2 to 4 (which received three different types of medication) is equal to the mean outcome in Group 1 (which received only a placebo). Weights that define a contrast among group means are called contrast coefficients. Usually, contrast coefficients are constrained to sum to 0, and the coefficients themselves are usually given as integers for reasons of simplicity. If we multiply this set of contrast coefficients by 3 (to get rid of the fractions), we obtain the following set of contrast coefficients that can be used to see if the combined mean of Groups 2 to 4 differs from the mean of Group 1: (+3, −1, −1, −1). If we reverse the signs, we obtain the set (−3, +1, +1, +1), which still corresponds to the same contrast. The F test for a contrast detects the magnitude, and not the direction, of differences among group means; therefore, it does not matter if the signs on a set of contrast coefficients are reversed.
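To make the role of contrast coefficients concrete, the Python sketch below applies the coefficients (+3, −1, −1, −1) to the four group means from the stress/anxiety example in Section 6.7 (not to the hypothetical antidepressant study). The standard error uses the usual equal-n formula for a contrast, the square root of MSwithin × Σc²/n; that formula is not stated in this section and is an assumption here, as is the helper name `contrast_t`.

```python
import math

# Group means, per-group n, and MSwithin from the stress/anxiety example (rounded)
group_means = [9.86, 14.29, 13.57, 17.00]
n_per_group = 7
ms_within = 122.0 / 24


def contrast_t(coefficients, means, n, ms_error):
    """t ratio for a contrast among group means, assuming equal n per group."""
    if abs(sum(coefficients)) > 1e-9:
        raise ValueError("contrast coefficients should sum to 0")
    # Weighted combination of the group means
    contrast_value = sum(c * m for c, m in zip(coefficients, means))
    # Assumed standard error for an equal-n contrast
    standard_error = math.sqrt(ms_error * sum(c ** 2 for c in coefficients) / n)
    return contrast_value / standard_error


t_original = contrast_t([+3, -1, -1, -1], group_means, n_per_group, ms_within)
t_reversed = contrast_t([-3, +1, +1, +1], group_means, n_per_group, ms_within)

# Reversing the signs flips the sign of t but leaves t squared (F) unchanged
print(round(t_original, 2), round(t_reversed, 2), round(t_original ** 2, 2))
```

The resulting t would be evaluated with N − k = 24 df.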

In SPSS, users can select an option that allows them to enter a set of contrast coefficients to make many different types of comparisons among group means. To see some possible contrasts, imagine a situation in which there are k = 5 groups.

This set of contrast coefficients simply compares the means of Groups 1 and 5 (ignoring all the other groups): (+1, 0, 0, 0, −1).

This set of contrast coefficients compares the combined mean of Groups 1 to 4 with the mean of Group 5: (+1, +1, +1, +1, −4).

Contrast coefficients can be used to test for specific patterns, such as a linear trend (scores on the outcome variable might tend to increase linearly if Groups 1 through 5 correspond to equally spaced dosage levels of a drug): (−2, −1, 0, +1, +2).

A curvilinear trend can also be tested; for instance, the researcher might expect to find that the highest scores on the outcome variable occur at moderate dosage levels of the independent variable. If the five groups received five equally spaced different levels of background noise and the researcher predicts the best task performance at a moderate level of noise, an appropriate set of contrast coefficients would be (−1, 0, +2, 0, −1).

When a user specifies contrast coefficients, it is necessary to have one coefficient for each level or group in the ANOVA; if there are k groups, each contrast that is specified must include k coefficients. A user may specify more than one set of contrasts, although usually the number of contrasts does not exceed k − 1 (where k is the number of groups). The following simple guidelines are usually sufficient to understand what comparisons a given set of coefficients makes:

- Groups with positive coefficients are compared with groups with negative coefficients; groups that have coefficients of 0 are omitted from such comparisons.

- It does not matter which groups have positive versus negative coefficients; a difference can be detected by the contrast analysis whether or not the coefficients code for it in the direction of the difference.

- For contrast coefficients that represent trends, if you draw a graph that shows how the contrast coefficients change as a function of group number (X), the line shows pictorially what type of trend the contrast coefficients will detect. Thus, if you plot the coefficients (−2, −1, 0, +1, +2) as a function of the group numbers 1, 2, 3, 4, 5, you can see that these coefficients test for a linear trend. The test will detect a linear trend whether it takes the form of an increase or a decrease in mean Y values across groups.

When a researcher uses more than one set of contrasts, he or she may want to know whether those contrasts are logically independent, uncorrelated, or orthogonal. There is an easy way to check whether the contrasts implied by two sets of contrast coefficients are orthogonal or independent. Essentially, to check for orthogonality, you just compute a (shortcut) version of a correlation between the two lists of coefficients. First, you list the coefficients for Contrasts 1 and 2 (make sure that each set of coefficients sums to 0, or this shortcut will not produce valid results).

You cross-multiply each pair of corresponding coefficients (i.e., the coefficients that are applied to the same group) and then sum these cross products. In this example, you get

In this case, the result is −1. This means that the two contrasts above are not independent or orthogonal; some of the information that they contain about differences among means is redundant.

Consider a second example that illustrates a situation in which the two contrasts are orthogonal or independent:

In this second example, the curvilinear contrast is orthogonal to the linear trend contrast. In a one-way ANOVA with k groups, it is possible to have up to (k − 1) orthogonal contrasts. The preceding discussion of contrast coefficients assumed that the groups in the one-way ANOVA had equal ns. When the ns in the groups are unequal, it is necessary to adjust the values of the contrast coefficients so that they take unequal group size into account; this is done automatically in programs such as SPSS.
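The cross-product check for orthogonality is easy to sketch in Python. The example below uses the k = 5 contrast coefficient sets listed earlier in this section (linear trend, curvilinear trend, Group 1 versus Group 5, and Groups 1 to 4 versus Group 5); the function name is illustrative, and the check assumes equal ns and coefficients that sum to 0.

```python
def cross_product_sum(contrast_a, contrast_b):
    """Sum of cross products of two sets of contrast coefficients (equal ns assumed)."""
    if len(contrast_a) != len(contrast_b):
        raise ValueError("both contrasts need one coefficient per group")
    if sum(contrast_a) != 0 or sum(contrast_b) != 0:
        raise ValueError("each set of coefficients should sum to 0")
    return sum(a * b for a, b in zip(contrast_a, contrast_b))


linear = [-2, -1, 0, +1, +2]
curvilinear = [-1, 0, +2, 0, -1]
group1_vs_group5 = [+1, 0, 0, 0, -1]
groups1to4_vs_group5 = [+1, +1, +1, +1, -4]

# A sum of 0 indicates orthogonal contrasts; any other value indicates redundancy
print(cross_product_sum(linear, curvilinear))                     # 0
print(cross_product_sum(group1_vs_group5, groups1to4_vs_group5))  # 5
```

A sum of 0 for the linear and curvilinear sets matches the statement that these two contrasts are orthogonal, whereas the last two sets share information about Group 5 and are not.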

6.10.2 Post Hoc or “Protected” Tests

If the researcher wants to make all possible comparisons among groups or does not have a theoretical basis for choosing a limited number of comparisons before looking at the data, it is possible to use test procedures that limit the risk of Type I error by using “protected” tests. Protected tests use a more stringent criterion than would be used for planned contrasts in judging whether any given pair of means differs significantly. One method for setting a more stringent test criterion is the Bonferroni procedure, described in Chapter 3. The Bonferroni procedure requires that the data analyst use a more conservative (smaller) alpha level to judge whether each individual comparison between group means is statistically significant. For instance, in a one-way ANOVA with k = 5 groups, there are k × (k − 1)/2 = 10 possible pairwise comparisons of group means. If the researcher wants to limit the overall experiment-wise risk of Type I error (EWα) for the entire set of 10 comparisons to .05, one possible way to achieve this is to set the per-comparison alpha (PCα) level for each individual significance test between means at EWα/(number of post hoc tests to be performed). For example, if the experimenter wants an experiment-wise α of .05 when doing 10 post hoc comparisons between groups, the alpha level for each individual test would be set at EWα/10, or .05/10, or .005 for each individual test. The t test could be calculated using the same formula as for an ordinary t test, but it would be judged significant only if its obtained p value were less than .005. The Bonferroni procedure is extremely conservative, and many researchers prefer less conservative methods of limiting the risk of Type I error. (One way to make the Bonferroni procedure less conservative is to set the experiment-wise alpha to some higher value such as .10.)
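The Bonferroni arithmetic in this paragraph can be written as a short Python sketch; the function name `bonferroni_per_comparison_alpha` is hypothetical.

```python
def bonferroni_per_comparison_alpha(k_groups, experimentwise_alpha=0.05):
    """Return (number of pairwise comparisons, per-comparison alpha)."""
    # All possible pairwise comparisons among k group means: k(k - 1)/2
    n_comparisons = k_groups * (k_groups - 1) // 2
    # Bonferroni: divide the experiment-wise alpha by the number of tests
    per_comparison_alpha = experimentwise_alpha / n_comparisons
    return n_comparisons, per_comparison_alpha


n_tests, pc_alpha = bonferroni_per_comparison_alpha(k_groups=5)
print(n_tests, pc_alpha)

# A less conservative choice mentioned in the text: experiment-wise alpha of .10
_, pc_alpha_lenient = bonferroni_per_comparison_alpha(5, experimentwise_alpha=0.10)
print(pc_alpha_lenient)
```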

Dozens of post hoc or protected tests have been developed to make comparisons among means in ANOVA that were not predicted in advance.

Some of these procedures are intended for use with a limited number of comparisons; other tests are used to make all possible pairwise comparisons among group means. Some of the better-known post hoc tests include the Scheffé, the Newman-Keuls, and the Tukey honestly significant difference (HSD). The Tukey HSD has become popular because it is moderately conservative and easy to apply; it can be used to perform all possible pairwise comparisons of means and is available as an option in widely used computer programs such as SPSS. The menu for the SPSS one-way ANOVA procedure includes the Tukey HSD test as one of many options for post hoc tests; SPSS calls it the Tukey procedure.

The Tukey HSD test (and several similar post hoc tests) uses a different method of limiting the risk of Type I error. Essentially, the Tukey HSD test uses the same formula as a t ratio, but the resulting test ratio is labeled “q” rather than “t,” to remind the user that it should be evaluated using a different sampling distribution. The Tukey HSD test and several related post hoc tests use critical values from a distribution called the “Studentized range statistic,” and the test ratio is often denoted by the letter q:

$$q = \frac{M_a - M_b}{\sqrt{MS_{within}/n}}.$$

Values of the q ratio are compared with critical values from tables of the Studentized range statistic (see the table in Appendix F). The Studentized range statistic is essentially a modified version of the t distribution. Like t, its distribution depends on the numbers of subjects within groups, but the shape of this distribution also depends on k, the number of groups. As the number of groups (k) increases, the number of pairwise comparisons also increases. To protect against inflated risk of Type I error, larger differences between group means are required for rejection of the null hypothesis as k increases. The distribution of the Studentized range statistic is broader and flatter than the t distribution and has thicker tails; thus, when it is used to look up critical values of q that cut off the most extreme 5% of the area in the upper and lower tails, the critical values of q are larger than the corresponding critical values of t.

This formula for the Tukey HSD test could be applied by computing a q ratio for each pair of sample means and then checking to see if the obtained q for each comparison exceeded the critical value of q from the table of the Studentized range statistic. However, in practice, a computational shortcut is often preferred. The formula is rearranged so that the cutoff for judging a difference between groups to be statistically significant is given in terms of differences between means rather than in terms of values of a q ratio:

$$HSD = q_{critical} \times \sqrt{\frac{MS_{within}}{n}}.$$

Then, if the obtained difference between any pair of means (such as Ma − Mb) is greater in absolute value than this HSD, this difference between means is judged statistically significant.

An HSD criterion is computed by looking up the appropriate critical value of q, the Studentized range statistic, from a table of this distribution (see the table in Appendix F). The critical q value is a function of both n, the average number of subjects per group, and k, the number of groups in the overall one-way ANOVA. As in other test situations, most researchers use the critical value of q that corresponds to α = .05, two-tailed. This critical q value obtained from the table is multiplied by the error term to yield HSD. This HSD is used as the criterion to judge each obtained difference between sample means. The researcher then computes the absolute value of the difference between each pair of group means (M1 − M2), (M1 − M3), and so forth. If the absolute value of a difference between group means exceeds the HSD value just calculated, then that pair of group means is judged to be significantly different.
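The HSD shortcut can be sketched in Python using the rounded group means and MSwithin from the stress/anxiety example. The critical q of 3.90 (for k = 4 groups and 24 within-group df at α = .05) is an assumed table value and should be verified against Appendix F; with rounded inputs, comparisons that fall close to the cutoff may not match the SPSS output exactly.

```python
import math
from itertools import combinations

# Rounded group means, per-group n, and MSwithin from the stress/anxiety example
group_means = {1: 9.86, 2: 14.29, 3: 13.57, 4: 17.00}
n_per_group = 7
ms_within = 122.0 / 24

# Assumed critical q for k = 4 and df_within = 24 at alpha = .05 (check Appendix F)
Q_CRITICAL = 3.90

# HSD: critical q multiplied by the error term sqrt(MSwithin / n)
hsd = Q_CRITICAL * math.sqrt(ms_within / n_per_group)
print(f"HSD = {hsd:.2f}")

# Compare the absolute difference for every pair of group means with the HSD
significant = {}
for (a, mean_a), (b, mean_b) in combinations(group_means.items(), 2):
    difference = abs(mean_a - mean_b)
    significant[(a, b)] = difference > hsd
    print(f"Groups {a} vs {b}: |difference| = {difference:.2f}, "
          f"exceeds HSD: {significant[(a, b)]}")
```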

When a Tukey HSD test is requested from SPSS, SPSS provides a summary table that shows all possible pairwise comparisons of group means and reports whether each of these comparisons is significant. If the overall F for the one-way ANOVA is statistically significant, it implies that there should be at least one significant contrast among group means. However, it is possible to have situations in which a significant overall F is followed by a set of post hoc tests that do not reveal any significant differences among means. This can happen because protected post hoc tests are somewhat more conservative, and thus require slightly larger between-group differences as a basis for a decision that differences are statistically significant, than the overall one-way ANOVA.

6.11 SPSS Output and Model Results

To run the one-way between-S ANOVA procedure in SPSS, make the following menu selections from the menu at the top of the data view worksheet, as shown in Figure 6.4: <Analyze> <Compare Means> <One-Way ANOVA>. This opens up the dialog window in Figure 6.5. Enter the name of one (or several) dependent variables into the window called Dependent List; enter the name of the categorical variable that provides group membership information into the window named Factor. For this example, additional windows were accessed by clicking on the buttons marked Post Hoc, Contrasts, and Options. The SPSS screens that correspond to this series of menu selections are shown in Figures 6.6 through 6.8.

From the menu of post hoc tests, this example used the one SPSS calls Tukey (this corresponds to the Tukey HSD test). To define a contrast that compared the mean of Group 1 (no stress) with the mean of the three stress treatment groups combined, these contrast coefficients were entered one at a time: +3, −1, −1, −1. From the list of options, “Descriptive” statistics and “Homogeneity of variance test” were selected by placing checks in the boxes next to the names of these tests.

The output for this one-way ANOVA is reported in Figure 6.9. The first panel provides descriptive information about each of the groups: mean, standard deviation, n, a 95% CI for the mean, and so forth. The second panel shows the results for the Levene test of the homogeneity of variance assumption; this is an F ratio with (k − 1) and (N − k) df. The obtained F was not significant for this example; there was no evidence that the homogeneity of variance assumption had been violated. The third panel shows the ANOVA source table with the overall F; this was statistically significant, and this implies that there was at least one significant contrast between group means. In practice, a researcher would not report both planned contrasts and post hoc tests; however, both were presented for this example to show how they are obtained and reported.

Figure 6.4 SPSS Menu Selections for One-Way Between-S ANOVA

Figure 6.5 One-Way Between-S ANOVA Dialog Window

Figure 6.10 shows the output for the planned contrast that was specified by entering these contrast coefficients: (+3, −1, −1, −1). These contrast coefficients correspond to a test of the null hypothesis that the population mean anxiety of the no-stress group (Group 1) does not differ from the mean anxiety of the three stress intervention groups (Groups 2–4) combined. SPSS reported a t test for this contrast (some textbooks and programs use an F test). This t was statistically significant, and examination of the group means indicated that the mean anxiety level was significantly higher for the three stress intervention groups combined, compared with the control group.
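The arithmetic behind this contrast can be sketched as follows, using the usual equal-variance contrast t ratio: the weighted sum of the group means divided by the square root of MSwithin times the sum of squared coefficients over n. The means and the reported t(24) = −5.18 come from the Results section of this chapter; MSwithin ≈ 5.08 and n = 7 per group are assumptions (back-calculated from the reported F and df, not taken from the SPSS output), and the function name is hypothetical.

```python
import math


def contrast_t(means, coefficients, ms_within, ns):
    """Return (contrast estimate, standard error, t) for a planned contrast.

    Uses the pooled error term MS_within, i.e., the version of the test
    that assumes homogeneity of variance across groups.
    """
    if abs(sum(coefficients)) > 1e-9:
        raise ValueError("contrast coefficients must sum to zero")
    estimate = sum(c * m for c, m in zip(coefficients, means))
    standard_error = math.sqrt(
        ms_within * sum(c ** 2 / n for c, n in zip(coefficients, ns))
    )
    return estimate, standard_error, estimate / standard_error


means = [9.86, 14.29, 13.57, 17.00]   # reported group means, Groups 1 to 4
coefficients = [3, -1, -1, -1]        # Group 1 vs. Groups 2 to 4 combined
MS_WITHIN = 5.08                      # assumption: back-calculated, approximate
NS = [7, 7, 7, 7]                     # assumption: equal ns (N - k = 24, k = 4)

estimate, standard_error, t = contrast_t(means, coefficients, MS_WITHIN, NS)
print(f"contrast = {estimate:.2f}, SE = {standard_error:.2f}, t(24) = {t:.2f}")
```

Under these assumptions the result is close to the t(24) = −5.18 reported for this contrast; the negative sign reflects the lower mean in the no-stress group.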

Figure 6.6 SPSS Dialog Window, One-Way ANOVA, Post Hoc Test Menu

Figure 6.7 Specification of a Planned Contrast

NOTES:

Group                   Contrast Coefficient
1 No stress             +3
2 Mental arithmetic     −1
3 Stress role play      −1
4 Mock job interview    −1

Null hypothesis about the weighted linear composite of means that is represented by this set of contrast coefficients:

$$H_0: (+3)\mu_1 + (-1)\mu_2 + (-1)\mu_3 + (-1)\mu_4 = 0$$

Figure 6.8 SPSS Dialog Window for One-Way ANOVA: Options

Figure 6.11 shows the results for the Tukey HSD tests that compared all possible pairs of group means. The table “Multiple Comparisons” gives the difference between means for all possible pairs of means (note that each comparison appears twice; that is, Group a is compared with Group b, and in another line, Group b is compared with Group a). Examination of the “sig” or p values indicates that several of the pairwise comparisons were significant at the .05 level. The results are displayed in a more easily readable form in the last panel under the heading “Homogeneous Subsets.” Each subset consists of group means that were not significantly different from each other using the Tukey test. The no-stress group was in a subset by itself; in other words, it had significantly lower mean anxiety than any of the three stress intervention groups. The second subset consisted of the stress role play and mental arithmetic groups, which did not differ significantly in anxiety. The third subset consisted of the mental arithmetic and mock job interview groups.

Note that it is possible for a group to belong to more than one subset; the anxiety score for the mental arithmetic group was not significantly different from the stress role play or the mock job interview groups. However, because the stress role play group differed significantly from the mock job interview group, these three groups did not form one subset.

Figure 6.9 SPSS Output for One-Way ANOVA

Figure 6.10 SPSS Output for Planned Contrasts

Figure 6.11 SPSS Output for Post Hoc Test (Tukey HSD)

Note also that it is possible for all the Tukey HSD comparisons to be nonsignificant even when the overall F for the one-way ANOVA is statistically significant. This can happen because the Tukey HSD test requires a slightly larger difference between means to achieve significance.

In this imaginary example, as in some research studies, the outcome measure (anxiety) is not a standardized test for which we have norms. The numbers by themselves do not tell us whether the mock job interview participants were moderately anxious or twitching, stuttering wrecks. Studies that use standardized measures can make comparisons to test norms to help readers understand whether the group differences were large enough to be of clinical or practical importance. Alternatively, qualitative data about the behavior of participants can also help readers understand how substantial the group differences were.

SPSS one-way ANOVA does not provide an effect-size measure, but this can easily be calculated by hand. In this case, eta squared is found by taking the ratio SSbetween/SStotal from the ANOVA source table: η² = .60.
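A minimal sketch of this hand calculation is given below. Because the source table itself is not reproduced here, the example also uses an algebraically equivalent route that needs only the reported F ratio and its degrees of freedom: η² = (F × dfbetween) / (F × dfbetween + dfwithin). The function names are hypothetical.

```python
def eta_squared(ss_between, ss_total):
    """Proportion of variance in Y predictable from group membership."""
    return ss_between / ss_total


def eta_squared_from_f(f_ratio, df_between, df_within):
    """Eta squared recovered from a reported F ratio and its df.

    Follows from F = (SS_between / df_between) / (SS_within / df_within).
    """
    return (f_ratio * df_between) / (f_ratio * df_between + df_within)


# Values reported for the stress/anxiety example: F(3, 24) = 11.94.
eta_sq = eta_squared_from_f(11.94, 3, 24)
print(f"eta squared = {eta_sq:.2f}")
```

The second function is convenient when a published report gives F and df but not the sums of squares.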

To obtain a graphic summary of the group means that shows the 95% CI for each mean, use the SPSS menu selections and commands in Figures 6.12, 6.13, and 6.14; this graph appears in Figure 6.15. Generally, the group means are reported either in a summary table (as in Table 6.8) or as a graph (as in Figure 6.15); it is not necessary to include both.

Figure 6.12 SPSS Menu Selections for Graph of Cell Means With Error Bars or Confidence Intervals

Figure 6.13 Initial SPSS Dialog Window for Error Bar Graph

Figure 6.14 Dialog Window for Graph of Means With Error Bars

Figure 6.15 Means for Stress/Anxiety Data With 95% Confidence Intervals (CIs) for Each Group Mean

6.12 Summary

One-way between-S ANOVA provides a method for comparison of more than two group means. However, the overall F test for the ANOVA does not provide enough information to completely describe the pattern in the data. It is often necessary to perform additional comparisons among specific group means to provide a complete description of the pattern of differences among group means. These can be a priori or planned contrast comparisons (if a limited number of differences were predicted in advance). If the researcher did not make a limited number of predictions in advance about differences between group means, then he or she may use protected or post hoc tests to do any follow-up comparisons; the Bonferroni procedure and the Tukey HSD test were described here, but many other post hoc procedures are available.

Table 6.8 Mean Anxiety Scores Across Types of Stress

Results

A one-way between-S ANOVA was done to compare the mean scores on an anxiety scale (0 = not at all anxious, 20 = extremely anxious) for participants who were randomly assigned to one of four groups: Group 1 = control group/no stress, Group 2 = mental arithmetic, Group 3 = stressful role play, and Group 4 = mock job interview. Examination of a histogram of anxiety scores indicated that the scores were approximately normally distributed with no extreme outliers. Prior to the analysis, the Levene test for homogeneity of variance was used to examine whether there were serious violations of the homogeneity of variance assumption across groups, but no significant violation was found: F(3, 24) = .718, p = .72.

The overall F for the one-way ANOVA was statistically significant, F(3, 24) = 11.94, p < .001. This corresponded to an effect size of η² = .60; that is, about 60% of the variance in anxiety scores was predictable from the type of stress intervention. This is a large effect. The means and standard deviations for the four groups are shown in Table 6.8.

One planned contrast (comparing the mean of Group 1, no stress, to the combined means of Groups 2 to 4, the stress intervention groups) was performed. This contrast was tested using α = .05, two-tailed; the t test that assumed equal variances was used because the homogeneity of variance assumption was not violated. For this contrast, t(24) = −5.18, p < .001. The mean anxiety score for the no-stress group (M = 9.86) was significantly lower than the mean anxiety score for the three combined stress intervention groups (M = 14.95).

In addition, all possible pairwise comparisons were made using the Tukey HSD test. Based on this test (using α = .05), it was found that the no-stress group scored significantly lower on anxiety than all three stress intervention groups. The stressful role play (M = 13.57) was significantly less anxiety producing than the mock job interview (M = 17.00). The mental arithmetic task produced a mean level of anxiety (M = 14.29) that was intermediate between the other stress conditions, and it did not differ significantly from either the stress role play or the mock job interview. Overall, the mock job interview produced the highest levels of anxiety. Figure 6.15 shows a 95% CI around each group mean.

The most important concept from this chapter is the idea that a score can be divided into components (one part that is related to group membership or treatment effects and a second part that is due to the effects of all other “extraneous” variables that uniquely influence individual participants). Information about the relative sizes of these components can be summarized across all the participants in a study by computing the sum of squared deviations (SS) for the between-group and within-group deviations. Based on the SS values, it is possible to compute an effect-size estimate (η²) that describes the proportion of variance predictable from group membership (or treatment variables) in the study. Researchers usually hope to design their studies in a manner that makes the proportion of explained variance reasonably high and that produces statistically significant differences among group means. However, researchers have to keep in mind that the proportion of variance due to group differences in the artificial world of research may not correspond to the “true” strength of the influence of the variable out in the “real world.” In experiments, we create an artificial world by holding some variables constant and by manipulating the treatment variable; in nonexperimental research, we create an artificial world through our selection of participants and measures. Thus, results of our research should be interpreted with caution.
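The partition of scores into components can be made concrete with a small sketch. The data below are hypothetical scores invented for illustration (they are not the stress/anxiety data), and the function name is an assumption; the point is only that the between-group and within-group sums of squares add up to the total, and that η² is the between-group share.

```python
def partition_sums_of_squares(groups):
    """Return (SS_between, SS_within, SS_total) for a dict of group scores."""
    all_scores = [y for scores in groups.values() for y in scores]
    grand_mean = sum(all_scores) / len(all_scores)

    ss_between = 0.0
    ss_within = 0.0
    for scores in groups.values():
        group_mean = sum(scores) / len(scores)
        # (Mj - MY) is the same for every member of the group.
        ss_between += len(scores) * (group_mean - grand_mean) ** 2
        # (Yij - Mj) is the part of each score not related to group membership.
        ss_within += sum((y - group_mean) ** 2 for y in scores)

    ss_total = sum((y - grand_mean) ** 2 for y in all_scores)
    return ss_between, ss_within, ss_total


# Hypothetical data: three groups of three scores each.
hypothetical_groups = {"A": [4, 5, 6], "B": [7, 8, 9], "C": [10, 11, 12]}

ss_between, ss_within, ss_total = partition_sums_of_squares(hypothetical_groups)
print(f"SS_between = {ss_between}, SS_within = {ss_within}, SS_total = {ss_total}")
print(f"eta squared = {ss_between / ss_total:.2f}")
```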

Notes

1. If the groups are not equivalent (that is, if the groups have different means on participant characteristics that may be predictive of the outcome variable), ANCOVA (see Chapter 17) may be used to correct for the nonequivalence by statistical control. However, it is preferable to identify and correct any differences in the composition of groups prior to data collection, if possible.

2. When the formal algebra of expected mean squares is applied to work out the population variances that are estimated by the sample values of MSwithin and MSbetween, the following results are obtained if the group ns are equal (Winer, Brown, & Michels, 1991):

$$E(MS_{\text{within}}) = \sigma^2_{\varepsilon},$$

the population variance due to error (all other extraneous variables), and

$$E(MS_{\text{between}}) = \sigma^2_{\varepsilon} + n\sigma^2_{\alpha},$$

where $\sigma^2_{\alpha}$ is the population variance due to the effects of membership in the naturally occurring group or to the manipulated treatment variable. Thus, when we look at F ratio estimates, we obtain

$$F = \frac{MS_{\text{between}}}{MS_{\text{within}}}, \quad \text{which estimates} \quad \frac{\sigma^2_{\varepsilon} + n\sigma^2_{\alpha}}{\sigma^2_{\varepsilon}}.$$

The null hypothesis (that all the population means are equal) can be stated in a different form, which says the variance of the population means is zero:

$$H_0: \sigma^2_{\alpha} = 0.$$

If H0 is true and $\sigma^2_{\alpha} = 0$, then the expected value of the F ratio is approximately 1. Values of F that are substantially larger than 1, and that exceed the critical value of F from tables of the F distribution, are taken as evidence that this null hypothesis may be false.

3. Calculating the harmonic mean of cell or group ns: When the groups have different ns, the harmonic mean H of a set of unequal ns is obtained by using the following equation:

$$H = \frac{1}{\dfrac{1}{k}\sum_{j=1}^{k}\dfrac{1}{n_j}}$$

That is, sum the inverses of the ns of the groups, divide this sum by the number of groups, and then take the inverse of that result to obtain the harmonic mean (H) of the ns.
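The verbal recipe above translates directly into code. The group sizes used here are hypothetical values chosen for illustration, and the function name is an assumption; Python's standard library function statistics.harmonic_mean is included only as a cross-check.

```python
from statistics import harmonic_mean


def harmonic_mean_of_ns(ns):
    """Harmonic mean of group sizes, following the three steps in the text."""
    k = len(ns)
    sum_of_inverses = sum(1 / n for n in ns)   # sum the inverses of the ns
    mean_of_inverses = sum_of_inverses / k     # divide by the number of groups
    return 1 / mean_of_inverses                # take the inverse of the result


hypothetical_ns = [5, 10, 20]  # hypothetical unequal group sizes
h = harmonic_mean_of_ns(hypothetical_ns)
print(f"H = {h:.2f}")
print(f"cross-check with statistics.harmonic_mean: {harmonic_mean(hypothetical_ns):.2f}")
```

Note that H is smaller than the arithmetic mean of the same ns, because the smallest groups carry the most weight.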

Comprehension Questions

1. A nonexperimental study was done to assess the impact of the accident at the Three Mile Island (TMI) nuclear power plant on nearby residents (Baum, Gatchel, & Schaeffer, 1983). Data were collected from residents of the following four areas:

Group 1: Three Mile Island, where a nuclear accident occurred (n = 38)
Group 2: Frederick, with no nuclear power plant nearby (n = 27)
Group 3: Dickerson, with an undamaged coal power plant nearby (n = 24)
Group 4: Oyster Creek, with an undamaged nuclear power plant nearby (n = 32)

Several different measures of stress were taken for people in these four groups. The researchers hypothesized that residents who lived near TMI (Group 1) would score higher on a wide variety of stress measures than people who lived in the other three areas included as comparisons. One-way ANOVA was performed to assess differences among these four groups on each outcome. Selected results are reported below for you to discuss and interpret. Here are results for two of their outcome measures: Stress (total reported stress symptoms) and Depression (score on the Beck Depression Inventory). Each cell lists the mean, followed by the standard deviation in parentheses.

a. Write a Results section in which you report whether these overall differences were statistically significant for each of these two outcome variables (using α = .05). You will need to look up the critical value for F, and in this instance, you will not be able to include an exact p value. Include an eta-squared effect-size index for each of the F ratios (you can calculate this by hand from the information given in the table). Be sure to state the nature of the differences: Did the TMI group score higher or lower on these stress measures relative to the other groups?

b. Would your conclusions change if you used α = .01 instead of α = .05 as your criterion for statistical significance?

c. Name a follow-up test that could be done to assess whether all possible pairwise comparisons of group means were significant.

d. Write out the contrast coefficients to test whether the mean for Group 1 (people who lived near TMI) differed from the average for the other three comparison groups.

e. Here is some additional information about scores on the Beck Depression Inventory. For purposes of clinical diagnosis, Beck, Steer, and Brown (1996) suggested the following cutoffs:

0–13: Minimal depression
14–19: Mild depression
20–28: Moderate depression
29–63: Severe depression

In light of this additional information, what would you add to your discussion of the outcomes for depression in the TMI group versus groups from other regions? (Did the TMI accident make people severely depressed?)

2. Sigall and Ostrove (1975) did an experiment to assess whether the physical attractiveness of a defendant on trial for a crime had an effect on the severity of the sentence given in mock jury trials. Each of the participants in this study was randomly assigned to one of the following three treatment groups; every participant received a packet that described a burglary and gave background information about the accused person. The three treatment groups differed in the type of information they were given about the accused person’s appearance. Members of Group 1 were shown a photograph of an attractive person; members of Group 2 were shown a photograph of an unattractive person; members of Group 3 saw no photograph. Some of their results are described here. Each participant was asked to assign a sentence (in years) to the accused person; the researchers predicted that more attractive persons would receive shorter sentences.

a. Prior to assessment of the outcome, the researchers did a manipulation check. Members of Groups 1 and 2 rated the attractiveness (on a 1 to 9 scale, with 9 being the most attractive) of the person in the photo. They reported that for the attractive photo, M = 7.53; for the unattractive photo, M = 3.20, F(1, 108) = 184.29. Was this difference statistically significant (using α = .05)?

b. What was the effect size for the difference in (2a)?

c. Was their attempt to manipulate perceived attractiveness successful?

d. Why does the F ratio in (2a) have just df = 1 in the numerator?

e. The mean length of sentence given in the three groups was as follows:

Group 1: Attractive photo, M = 2.80
Group 2: Unattractive photo, M = 5.20
Group 3: No photo, M = 5.10

They did not report a single overall F comparing all three groups; instead, they reported selected pairwise comparisons. For Group 1 versus Group 2, F(1, 108) = 6.60, p < .025. Was this difference statistically significant? If they had done an overall F to assess the significance of differences of means among all three groups, do you think this overall F would have been statistically significant?

f. Was the difference in mean length of sentence in part (2e) in the predicted direction?

g. Calculate and interpret an effect-size estimate for this obtained F.

h. What additional information would you need about these data to do a Tukey honestly significant difference test to see whether Groups 2 and 3, as well as 1 and 3, differed significantly?

3. Suppose that a researcher has conducted a simple experiment to assess the effect of background noise level on verbal learning. The manipulated independent variable is the level of white noise in the room (Group 1 = low level = 65 dB; Group 2 = high level = 70 dB). (Here are some approximate reference values for decibel noise levels: 45 dB, whispered conversation; 65 dB, normal conversation; 80 dB, vacuum cleaner; 90 dB, chain saw or jackhammer; 120 dB, rock music played very loudly.) The outcome measure is number of syllables correctly recalled from a 20-item list of nonsense syllables. Participants ranged in age from 17 to 70 and had widely varying levels of hearing acuity; some of them habitually studied in quiet places and others preferred to study with the television or radio turned on. There were 5 participants in each of the two groups. The researcher found no significant difference in mean recall scores between these groups.

a. Describe three specific changes to the design of this noise/learning study that would be likely to increase the size of the t ratio (and, therefore, make it more likely that the researcher would find a significant effect).

b. Also, suppose that the researcher has reason to suspect that there is a curvilinear relation between noise level and task performance. What change would this require in the research design?

4. Suppose that Kim is a participant in a study that compares several coaching methods to see how they affect math Scholastic Aptitude Test (SAT) scores. The grand mean of math SAT scores for all participants (MY) in the study is 550. The group that Kim participated in had a mean math SAT score of 565. Kim’s individual score on the math SAT was 610.

a. What was the estimated residual component (εij) of Kim’s score, that is, the part of Kim’s score that was not related to the coaching method? (Both parts of this question call for specific numerical values as answers.)

b. What was the “effect” (αi) component of Kim’s score?

5. What pattern in grouped data would make SSwithin = 0? What pattern within data would make SSbetween = 0?

6. Assuming that a researcher hopes to demonstrate that a treatment or group membership variable makes a significant difference in outcomes, which term does the researcher hope will be larger, MSbetween or MSwithin? Why?

7. Explain the following equation:

(Yij − MY) = (Yij − Mj) + (Mj − MY).

What do we gain by breaking the (Yij − MY) deviation into two separate components, and what does each of these components represent? Which of the terms on the right-hand side of the equation do researchers typically hope will be large, and why?

8. What is H0 for a one-way ANOVA? If H0 is rejected, does that imply that each mean is significantly different from every other mean?

9. What information do you need to decide on a sample size that will provide adequate statistical power?

10. In the equation αj = (Mj − MY), what do we call the αj term?

11. If there is an overall significant F in a one-way ANOVA, can we conclude that the group membership or treatment variable caused the observed differences in the group means? Why or why not?

12. How can eta squared be calculated from the SS values in an ANOVA summary table?

13. Which of these types of tests is more conservative: planned contrasts or post hoc/protected tests?

14. Name two common post hoc procedures for the comparison of means in an ANOVA.

Applied Statics 164
