Quantitative Reasoning and Analysis

PRINTED BY: Patricia Sellers <PATRICIA.SELLERS@WALDENU.EDU>. Printing is for personal, private use only. No part of this book may be reproduced or transmitted without publisher’s prior permission. Violators will be prosecuted.

Chapter 5

Measures of Variability

Chapter Learning Objectives

- Understanding the importance of measuring variability
- Learning how to calculate and interpret the index of qualitative variation (IQV), range, interquartile range, the variance, and the standard deviation
- Understanding the criteria for choosing a measure of variation

In the previous chapter, we looked at measures of central tendency: the mean, the median, and the mode. With these measures, we can use a single number to describe what is average for or typical of a distribution. Although measures of central tendency can be very helpful, they tell only part of the story. In fact, when used alone, they may mislead rather than inform. Another way of summarizing a distribution of data is by selecting a single number that describes how much variation and diversity there is in the distribution. Numbers that describe diversity or variation are called measures of variability. Researchers often use measures of central tendency along with measures of variability to describe their data.

Measures of variability Numbers that describe diversity or variability in the distribution of a variable.

In this chapter, we discuss five measures of variability: the index of qualitative variation, the range, the interquartile range, the standard deviation, and the variance. Before we discuss these measures, let’s explore why they are important.

THE IMPORTANCE OF MEASURING VARIABILITY

The importance of looking at variation and diversity can be illustrated by thinking about the differences in the experiences of U.S. women. Are women united by their similarities or divided by their differences? The answer is both. To address the similarities without dealing with differences is “to misunderstand and distort that which separates as well as that which binds women together.”1 Even when we focus on one particular group of women, it is important to look at the differences as well as the commonalities. Take, for example, Asian American women. As a group, they share a number of characteristics.

Their participation in the workforce is higher than that of women in any other ethnic group. Many . . . live life supporting others, often allowing their lives to be subsumed by the needs of the extended family. . . . However, there are many circumstances when these shared experiences are not sufficient to accurately describe the condition of a particular Asian-American woman. Among Asian-American women there are those who were born in the United States . . . and . . . those who recently arrived in the United States. Asian-American women are diverse in their heritage or country of origin: China, Japan, the Philippines, Korea . . . and . . . India. . . . Although the majority of Asian-American women are working class, contrary to the stereotype of the “ever successful” Asians, there are poor, “middle-class,” and even affluent Asian-American women.2

As this example illustrates, one basis of stereotyping is treating a group as if it were totally represented by its central value, ignoring the diversity within the group. Sociologists often contribute to this type of stereotyping when their empirical generalizations, based on a statistical difference between averages, are interpreted in an overly simplistic way. All this argues for the importance of using measures of variability as well as central tendency whenever we want to characterize or compare groups. Whereas the similarities and commonalities in the experiences of Asian American women are depicted by a measure of central tendency, the diversity of their experiences can be described only by using measures of variation.

The concept of variability has implications not only for describing the diversity of social groups such as Asian American women but also for issues that are important in your everyday life. One of the most important issues facing the academic community is how to reconstruct the curriculum to make it more responsive to the needs of students. Let’s consider the issue of statistics instruction on the college level.

Statistics is perhaps the most anxiety-provoking course in any social science curriculum. Statistics courses are often the last “roadblock” preventing students from completing their major requirements. One factor, identified in numerous studies as a handicap for many students, is the “math anxiety syndrome.” This anxiety often leads to a less than optimum learning environment, with students often trying to memorize every detail of a statistical procedure rather than understand the general concept involved.

Let’s suppose that a university committee is examining the issue of how to better respond to the needs of students. In its attempt to evaluate statistics courses offered in different departments, the committee compares the grading policy in two courses. The first, offered in the sociology department, is taught by Professor Brown; the second, offered through the school of social work, is taught by Professor Yamato. The committee finds that over the years, the average grade for Professor Brown’s class has been C+. The average grade in Professor Yamato’s class is also C+. We could easily be misled by these statistics into thinking that the grading policy of both instructors is about the same. However, we need to look more closely into how the grades are distributed in each of the classes. The differences in the distribution of grades are illustrated in Figure 5.1, which displays the frequency polygon for the two classes.

Figure 5.1 Distribution of Grades for Professors Brown’s and Yamato’s Statistics Classes

Compare the shapes of these two distributions. Notice that while both distributions have the same mean, they are shaped very differently. The grades in Professor Yamato’s class are more spread out, ranging from A to F, whereas the grades for Professor Brown’s class are clustered around the mean and range only from B to C. Although the means for both distributions are identical, the grades in Professor Yamato’s class vary considerably more than the grades given by Professor Brown. The comparison between the two classes is more complex than we first thought it would be.

As this example demonstrates, information on how scores are spread from the center of a distribution is as important as information about the central tendency in a distribution. This type of information is obtained by measures of variability.

Learning Check

Look closely at Figure 5.1. Whose class would you choose to take? If you were worried that you might fail statistics, your best bet would be Professor Brown’s class where no one fails. However, if you want to keep up your GPA and are willing to work, Professor Yamato’s class is the better choice. If you had to choose one of these classes based solely on the average grades, your choice would not be well informed.

THE INDEX OF QUALITATIVE VARIATION: A BRIEF INTRODUCTION

The United States is undergoing a demographic shift from a predominantly European population to one characterized by increased racial, ethnic, and cultural diversity. These changes challenge us to rethink every conceptualization of society based solely on the experiences of European populations and force us to ask questions that focus on the experiences of different racial/ethnic groups. For instance, we may want to compare the racial/ethnic diversity in different cities, regions, or states or may want to find out if a group has become more racially and ethnically diverse over time.

The index of qualitative variation (IQV) is a measure of variability for nominal variables such as race and ethnicity. The index can vary from 0.00 to 1.00. When all the cases in the distribution are in one category, there is no variation (or diversity) and the IQV is 0.00. In contrast, when the cases in the distribution are distributed evenly across the categories, there is maximum variation (or diversity) and the IQV is 1.00.

Index of qualitative variation (IQV) A measure of variability for nominal variables. It is based on the ratio of the total number of differences in the distribution to the maximum number of possible differences within the same distribution.

Suppose you live in Maine, where the majority of residents are white and a small minority are Latino or Asian. Also suppose that your best friend lives in Hawaii, where almost half of the population is either Asian or Native Hawaiian. The distributions for these two states are presented in Table 5.1. Which is more diverse? Clearly, Hawaii, where half the population is either Asian or Native Hawaiian, is more diverse than Maine, where Asians and Latinos are but a small minority.

Table 5.1 Top Five Racial/Ethnic Groups for Two States by Percentage, 2010

*The category “Other” counts as a racial/ethnic group.

Source: U.S. Census Bureau, Statistical Abstract of the United States: 2012 Tables 18–19.

Steps for Calculating the IQV

To calculate the IQV, we use this formula:

IQV = K(100² − ΣPct²) / [100²(K − 1)]    (5.1)

where

K = the number of categories

ΣPct² = the sum of all squared percentages in the distribution

In Table 5.2, we present the squared percentages for each racial/ethnic group for Maine and Hawaii.

Table 5.2 Squared Percentages for Five Racial/Ethnic Groups for Two States

* The category “Other” counts as a racial/ethnic group.

The IQV for Maine is 0.07.

The IQV for Hawaii is 0.84.

Note that the values of the IQV for the two states support our earlier observation: In Hawaii, where the IQV is 0.84, there is considerably more racial/ethnic variation than in Maine, where the IQV is 0.07.

It is important to remember that the IQV is partially a function of the number of categories. In this example, there were four and five racial/ethnic categories in Maine and Hawaii, respectively. Had we used more categories, the IQV for both states would have been considerably higher.

To summarize, these are the steps we follow to calculate the IQV:

1. Construct a percentage distribution.
2. Square the percentages for each category.
3. Sum the squared percentages.
4. Calculate the IQV using the formula.
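The four steps can be sketched in a few lines of Python. This is only an illustration: it assumes the usual form of the index, IQV = K(100² − ΣPct²) / [100²(K − 1)], with K and ΣPct² as defined in this section. The function name `iqv` and the three distributions are hypothetical; they are not the Maine or Hawaii figures.

```python
def iqv(percentages):
    """Index of qualitative variation for a percentage distribution.

    `percentages` holds one percentage per category (step 1) and is
    expected to sum to about 100. The result lies between 0.0 and 1.0.
    """
    k = len(percentages)                          # K, the number of categories
    if k < 2:
        raise ValueError("the IQV needs at least two categories")
    sum_sq = sum(p ** 2 for p in percentages)     # steps 2 and 3: square, then sum
    return k * (100 ** 2 - sum_sq) / (100 ** 2 * (k - 1))   # step 4


# Hypothetical distributions chosen to show the two extremes and a middle case.
one_category = [100, 0, 0, 0]    # every case falls in a single category
even_split = [25, 25, 25, 25]    # cases spread evenly across categories
mostly_one = [90, 5, 3, 2]       # one dominant category

print(iqv(one_category))           # 0.0, no variation
print(iqv(even_split))             # 1.0, maximum variation
print(round(iqv(mostly_one), 2))   # 0.25
```

Multiplying any of these results by 100 expresses the IQV as a percentage of the maximum possible differences.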

A Closer Look 5.1 Statistics in Practice: Diversity at Berkeley Through the Years*,†

“Berkeley, Calif.—The photograph in Sproul Hall of the 10 Cal ‘yell leaders’ from the early 1960s, in their Bermuda shorts and letter sweaters, leaps out like an artifact from an ancient civilization. They are all fresh-faced, and in a way that is unimaginable now, they are all white.”‡

On the flagship campus of the University of California system, the center of the affirmative action debate in higher education today, the ducktails and bouffant hairdos of those 1960s cheerleaders seem indeed out of date. The University of California’s Berkeley campus was among the first of the nation’s leading universities to embrace elements of affirmative action in its admission policies and now boasts that it has one of the most diverse campuses in the United States.

The following pie charts show the racial and ethnic breakdown of undergraduates at U.C. Berkeley for 1984 and 2012. The IQVs were calculated using the percentage distribution (as shown in the pie charts) for race and ethnicity for each year. Not only has the modal category of Berkeley’s student body changed from white in 1984 to Asian in 2012, but the campus has become one of the most diverse in the United States.

Racial/Ethnic Composition of Student Body at U.C. Berkeley, 1984 and 2012

Source: *From The New York Times, June 4. © 1995 The New York Times.

†2012 data are from the Office of Planning and Analysis, University of California, Berkeley, UC Berkeley Enrollment Data.

‡The Office of Planning and Analysis, University of California, Berkeley, Cal Answers 2012.

Expressing the IQV as a Percentage

The IQV can also be expressed as a percentage rather than a proportion: Simply multiply the IQV by 100. Expressed as a percentage, the IQV would reflect the percentage of racial/ethnic differences relative to the maximum possible differences in each distribution. Thus, an IQV of 0.07 indicates that the number of racial/ethnic differences in Maine is 7.0% (0.07 × 100) of the maximum possible differences. Similarly, for Hawaii, an IQV of 0.84 means that the number of racial/ethnic differences is 84.0% (0.84 × 100) of the maximum possible differences.

Learning Check

Examine A Closer Look 5.1 and consider the impact that the number of categories of a variable has on the IQV. What would happen to the Berkeley case if Asians were broken down into two categories with 19% Chinese Americans in one and 17% Other Asians in the other? (To answer this question you will need to recalculate the IQV with these new data.)

STATISTICS IN PRACTICE: DIVERSITY IN U.S. SOCIETY

According to demographers’ projections, by the middle of this century the United States will no longer be a predominantly white society. It is estimated that the combined population of the three largest minority groups (African Americans, Asian Americans, and Latino Americans) will reach 116 million by 2015.3 Population shifts during the 1990s indicate geographic concentration of minority groups in specific regions and metropolitan areas of the United States.4 Demographers call it chain migration: Essentially, migrants use social capital, that is, specific knowledge of the migration process, to move from one area and settle in another.5 For example, as of 2008, the Los Angeles metro area was home to 4.7 million Latino Americans, placing it first among cities in total growth in the number of Latino Americans.

How do you compare the amount of diversity in different cities, states, or regions? Diversity is a characteristic of a population many of us can sense intuitively. For example, the ethnic diversity of a large city is seen in the many members of various groups encountered when walking down its streets or traveling through its neighborhoods.6

We can use the IQV to measure the amount of diversity in different regions. Table 5.3 displays the 2011 percentage breakdown of the population by race for all four regions of the United States. Based on these data, and using Formula 5.1 as in our earlier example, we have also calculated the IQV for each region. The advantage of using a single number to express diversity is demonstrated in Figure 5.2, which depicts the regional variations in diversity as expressed by the IQVs from Table 5.3. Figure 5.2 shows the wide variation in racial diversity that exists in the United States. Note that the West, with an IQV of 0.79, is the most diverse region. At the other extreme, the Midwest, whose population is overwhelmingly white, is the most homogeneous region with an IQV of 0.48.

Table 5.3 Percentage Makeup of Population for Regions by Race, 2011

Source: U.S. Census Bureau, American Community Survey, 2011.

Figure 5.2 Racial Diversity in the United States in 2011 (IQV)

THE RANGE

The simplest and most straightforward measure of variation is the range, which measures variation in interval-ratio variables. It is the difference between the highest (maximum) and the lowest (minimum) scores in the distribution:

Range = Highest score – Lowest score

In the 2010 General Social Survey (GSS), the oldest person included in the study was 89 years old and the youngest was 18. Thus, the range was 89 – 18 = 71 years.

Range A measure of variation in interval-ratio variables. It is the difference between the highest (maximum) and the lowest (minimum) scores in the distribution.

The range can also be calculated on percentages. For example, since the 1980s, relatively large communities of the elderly have become noticeable not just in the traditional retirement meccas of the Sun Belt7 but also in the Ozarks of Arkansas and the mountains of Colorado and Montana. The number of elderly persons increased in every state during the 1990s and the 2000s (Washington, D.C., is the exception), but by different amounts. As the baby boomers age into retirement, we would expect this trend to continue. Table 5.4 displays the percentage change in the elderly population from 2008 to 2015 by region and by state as predicted by the U.S. Census Bureau.8

Table 5.4 Projected Percentage Change in the Population 65 Years and Over by Region and State, 2008–2015

Source: U.S. Census Bureau, Statistical Abstract of the United States: 2010, Tables 16 and 18.

What is the range in the percentage change in state elderly population for the United States? To find the range in a distribution, simply pick out the highest and the lowest scores in the distribution and subtract. Alaska has the highest percentage change, with 50%, and Washington, D.C., has the lowest change, with −14.1%. The range is 64.1 percentage points, or 50% to −14.1%.
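The subtraction is simple enough to check by hand, but a short sketch makes the rule explicit. The helper name `value_range` is hypothetical. Only the extreme scores are quoted in the text, so each list below holds just those two values rather than a full distribution.

```python
def value_range(scores):
    """Range = highest score - lowest score."""
    return max(scores) - min(scores)


# GSS ages quoted in the text: the youngest respondent was 18, the oldest 89.
print(value_range([18, 89]))                  # 71

# Extremes quoted for the projected change in the elderly population:
# Alaska (50.0) and Washington, D.C. (-14.1). Rounding hides float noise.
print(round(value_range([50.0, -14.1]), 1))   # 64.1
```

Because the function looks only at the maximum and the minimum, every score between them could change without affecting the result.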

Although the range is simple and quick to calculate, it is a rather crude measure because it is based on only the lowest and the highest scores. These two scores might be extreme and rather atypical, which might make the range a misleading indicator of the variation in the distribution. For instance, note that among the 50 states and Washington, D.C., listed in Table 5.4, no other state has a percentage decrease like that of Washington, D.C., and only Nevada has a percentage increase nearly as high as Alaska’s. The range of 64.1 percentage points does not give us information about the variation in states between Washington, D.C., and Alaska.

Learning Check

Why can’t we use the range to describe diversity in nominal variables? The range can be used to describe diversity in ordinal variables (e.g., we can say that responses to a question ranged from “somewhat satisfied” to “very dissatisfied”), but it has no quantitative meaning. Why not?

THE INTERQUARTILE RANGE: INCREASES IN ELDERLY POPULATIONS

To remedy the limitation of the range, we can employ an alternative—the interquartile range. The interquartile range (IQR), a measure of variation for interval-ratio variables, is the width of the middle 50% of the distribution. It is defined as the difference between the lower and upper quartiles (Q1 and Q3).

IQR = Q3 − Q1

Recall that the first quartile (Q1) is the 25th percentile, the point at which 25% of the cases fall below it and 75% above it. The third quartile (Q3) is the 75th percentile, the point at which 75% of the cases fall below it and 25% above it. The IQR, therefore, defines variation for the middle 50% of the cases.

Interquartile range (IQR) The width of the middle 50% of the distribution. It is defined as the difference between the lower and upper quartiles (Q1 and Q3).

Like the range, the IQR is based on only two scores. However, because it is based on intermediate scores, rather than on the extreme scores in the distribution, it avoids some of the instability associated with the range.

These are the steps for calculating the IQR:

1. To find Q1 and Q3, order the scores in the distribution from the highest to the lowest score, or vice versa. Table 5.5 presents the data of Table 5.4 arranged in order from Alaska (50.0%) to Washington, D.C. (−14.1%).

2. Next, we need to identify the first quartile, Q1, or the 25th percentile. We have to identify the percentage increase in the elderly population associated with the state that divides the distribution so that 25% of the states are below it and 75% of the states are above it. To find Q1, we multiply N by 0.25:

(N)(0.25) = (51)(0.25) = 12.75

The first quartile falls between the 12th and the 13th states. Counting from the bottom, the 12th state is Illinois, and the percentage increase associated with it is 12.9. The 13th state is Utah, with a percentage increase of 13.8. To find the first quartile, we take the average of 12.9 and 13.8. Therefore, (12.9 + 13.8)/2 = 13.35 is the first quartile (Q1).

3. To find Q3, we have to identify the state that divides the distribution in such a way that 75% of the states are below it and 25% of the states are above it. We multiply N this time by 0.75:

(N)(0.75) = (51)(0.75) = 38.25

The third quartile falls between the 38th and the 39th states. Counting from the bottom, the 38th state is Washington, and the percentage increase associated with it is 23.2. The 39th state is Maine, with a percentage increase of 25.6. To find the third quartile, we take the average of 23.2 and 25.6. Therefore, (23.2 + 25.6)/2 = 24.4 is the third quartile (Q3).

4. We are now ready to find the IQR:

IQR = Q3 − Q1 = 24.4 − 13.35 = 11.05

The IQR of percentage increase in the elderly population is 11.05 percentage points.
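The hand procedure above can be sketched as follows. The function names and the seven-score list are hypothetical, and treating a whole-number position the same way as a fractional one is an assumption of this sketch. Statistical software often uses other quartile conventions and may return slightly different values.

```python
import math


def quartile_by_position(sorted_scores, fraction):
    """Quartile found the way the steps describe: multiply N by the
    fraction, then average the two scores on either side of that position.

    `sorted_scores` must be ordered from lowest to highest.
    """
    position = len(sorted_scores) * fraction     # e.g. (51)(0.25) = 12.75
    lower = math.floor(position)                 # 1-based position just below
    if lower < 1 or lower >= len(sorted_scores):
        raise ValueError("too few scores for this quartile")
    # 1-based positions `lower` and `lower + 1` are indexes lower - 1 and lower.
    return (sorted_scores[lower - 1] + sorted_scores[lower]) / 2


def iqr(scores):
    ordered = sorted(scores)
    return quartile_by_position(ordered, 0.75) - quartile_by_position(ordered, 0.25)


# Hypothetical scores: Q1 = (1 + 3)/2 = 2.0 and Q3 = (8 + 10)/2 = 9.0.
scores = [1, 3, 4, 6, 8, 10, 20]
print(iqr(scores))                        # 7.0

# Replacing the top score with an extreme value widens the range, not the IQR.
with_extreme = [1, 3, 4, 6, 8, 10, 100]
print(iqr(with_extreme))                  # 7.0

# The chapter's own arithmetic for the 51 state values.
q1 = (12.9 + 13.8) / 2
q3 = (23.2 + 25.6) / 2
print(round(q1, 2), round(q3, 2), round(q3 - q1, 2))   # 13.35 24.4 11.05
```

The second list shows why the IQR is more stable than the range: the range grows from 19 to 99, while the IQR stays at 7.0.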

Notice that the IQR gives us better information than the range. The range gave us a 64.1-point spread, from 50% to −14.1%, but the IQR tells us that half the states are clustered between 13.35% and 24.4%, a much narrower spread. The extreme scores represented by Alaska and Washington, D.C., have no effect on the IQR because they fall at the extreme ends of the distribution. This difference between the range and the interquartile range is also illustrated in Figure 5.3, where the extreme value of 10 children affects the value of the range but not of the IQR.

Learning Check

Why is the IQR better than the range as a measure of variability, especially when there are extreme scores in the distribution? To answer this question, you may want to examine Figure 5.3.

Figure 5.3 The Range Versus the Interquartile Range: Number of Children Among Two Groups of Women

THE BOX PLOT

A graphic device called the box plot can visually present the range, the IQR, the median, the lowest (minimum) score, and the highest (maximum) score. The box plot provides us with a way to visually examine the center, the variation, and the shape of distributions of interval-ratio variables.

Figure 5.4 is a box plot of the distribution of the 2008–2015 projected percentage increase in the elderly population displayed in Table 5.5. To construct the box plot in Figure 5.4, we used the lowest and the highest values in the distribution, the upper and lower quartiles, and the median. We can easily draw a box plot by hand following these instructions:

1. Draw a box between the lower and upper quartiles.
2. Draw a solid line within the box to mark the median.
3. Draw vertical lines (called whiskers) outside the box, extending to the lowest and highest values.
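The five values needed for these drawing instructions can be gathered in one step. The sketch below uses Python's standard `statistics` module with hypothetical scores. Its `quantiles` function interpolates between scores, so the quartiles it returns are an assumption of this sketch and may differ slightly from quartiles found by the hand method described earlier.

```python
import statistics


def five_number_summary(scores):
    """Lowest value, lower quartile, median, upper quartile, highest value."""
    ordered = sorted(scores)
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return {"min": ordered[0], "q1": q1, "median": median,
            "q3": q3, "max": ordered[-1]}


# Hypothetical scores with one high extreme value.
summary = five_number_summary([2, 4, 4, 5, 7, 9, 30])

print("box from", summary["q1"], "to", summary["q3"])        # box from 4.0 to 8.0
print("median line at", summary["median"])                   # median line at 5.0
print("whiskers to", summary["min"], "and", summary["max"])  # whiskers to 2 and 30
```

In this example the median sits nearer the lower quartile and the upper whisker is much longer than the lower one, which is the pattern a box plot shows when a few high scores stretch the distribution.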

Table 5.5 Projected Percentage Change in the Population 65 Years and Over, 2008–2015, by State, Ordered From the Highest to the Lowest

Source: U.S. Census Bureau, Statistical Abstract of the United States: 2010, Tables 16 and 18.

Figure 5.4 Box Plot of the Distribution of the Projected Percentage Increase in the Elderly Population, 2008–2015

What can we learn from creating a box plot? We can obtain a visual impression of the following properties: First, the center of the distribution is easily identified by the solid line inside the box. Second, since the box is drawn between the lower and upper quartiles, the IQR is reflected in the height of the box. Similarly, the length of the vertical lines drawn outside the box (on both ends) represents the range of the distribution.9 Both the IQR and the range give us a visual impression of the spread in the distribution. Finally, the relative position of the box and the position of the median within the box tell us whether the distribution is symmetrical or skewed. A perfectly symmetrical distribution would have the box at the center of the range as well as the median in the center of the box. When the distribution departs from symmetry, the box and/or the median will not be centered; it will be closer to the lower quartile when there are more cases with lower scores or to the upper quartile when there are more cases with higher scores.

Learning Check

Is the distribution shown in the box plot in Figure 5.4 symmetrical or skewed?

Figure 5.5 Box Plots of the Projected Percentage Increase in the Elderly Population, 2008–2015, for the Northeastern and Western Regions of the United States

Box plots are particularly useful for comparing distributions. To demonstrate box plots that are shaped quite differently, in Figure 5.5 we have used the data on the percentage increase in the elderly population (Table 5.5) to compare the pattern of change occurring between 2008 and 2015 in the northeastern and western regions of the United States.

As you can see, the box plots differ from each other considerably. What can you learn from comparing the box plots for the two regions? First, the positions of the medians highlight the dramatic increase in the elderly population in the western United States. While the Northeast (median = 20.7%) is projected to experience a steady rise in its elderly population, the West shows a much higher projected percentage increase (median = 27%). Second, both the range (illustrated by the position of the whiskers in each box plot) and the IQR (illustrated by the height of the box) are much wider in the West (range = 36.2%; IQR = 14.85%) than in the Northeast (range = 18.9%; IQR = 8.7%), indicating that there is more variability among states in the West than among those in the Northeast. Finally, the relative positions of the boxes tell us something about the different shapes of these distributions. Because its box is at about the center of its range, the northeast distribution is almost symmetrical. In contrast, with its box off center and closer to the lower end of the distribution, the distribution of percentage change in the elderly population for the western states is positively skewed.

THE VARIANCE AND THE STANDARD DEVIATION: CHANGES IN THE ELDERLY POPULATION

As of 2010, the elderly population in the United States is 13 times as large as in 1900, and it is projected to continue to increase.10 The pace and direction of these demographic changes will create compelling social, economic, and ethical choices for individuals, families, and governments.

Table 5.6 presents the projected percentage change in the elderly population for all regions of the United States.

Table 5.6 Projected Percentage Change in the Elderly Population by Region, 2008–2015

Source: U.S. Census Bureau, Statistical Abstract of the United States: 2010, Tables 16 and 18.

These percentage changes were calculated by the U.S. Census Bureau using the following formula:

Percentage change = [(Population at Time 2 − Population at Time 1) / Population at Time 1] × 100

For example, the elderly population in the west region was 6,922,129 in 2008. In 2015, the elderly population will increase to 8,881,215. Therefore, the percentage change from 2008 to 2015 is

[(8,881,215 − 6,922,129) / 6,922,129] × 100

Table 5.6 shows that between 2008 and 2015, the size of the elderly population in the United States is projected to increase by an average of 19.95%. But this average increase does not inform us about the regional variation in the elderly population. For example, will the northeastern states show a smaller-than-average increase because of the out-migration of the elderly population to the warmer climate of the Sun Belt states? Is the projected increase higher in the South because of the immigration of the elderly?

Although it is important to know the average projected percentage increase for the nation as a whole, you may also want to know whether regional increases might differ from the national average. If the regional projected increases are close to the national average, the figures will cluster around the mean, but if the regional increases deviate much from the national average, they will be widely dispersed around the mean.

Table 5.6 suggests that there is considerable regional variation. The percentage change ranges from 27.0% in the West to 14.0% in the Midwest, so the range is 13.0% (27.0% – 14.0% = 13.0%). Moreover, most of the regions are projected to deviate considerably from the national average of 19.95%. How large are these deviations on the average? We want a measure that will give us information about the overall variations among all regions in the United States and, unlike the range or the IQR, will not be based on only two scores.

Such a measure will reflect how much, on the average, each score in the distribution deviates from some central point, such as the mean. We use the mean as the reference point rather than other kinds of averages (the mode or the median) because the mean is based on all the scores in the distribution. Therefore, it is more useful as a basis from which to calculate average deviation. The sensitivity of the mean to extreme values carries over to the calculation of the average deviation, which is based on the mean. Another reason for using the mean as a reference point is that more advanced measures of variation require the use of algebraic properties that can be assumed only by using the arithmetic mean.

The variance and the standard deviation are two closely related measures of variation that increase or decrease based on how closely the scores cluster around the mean. The variance is the average of the squared deviations from the center (mean) of the distribution, and the standard deviation is the square root of the variance. Both measure variability in interval-ratio variables.

Variance A measure of variation for interval-ratio variables; it is the average of the squared deviations from the mean.

Standard deviation A measure of variation for interval-ratio variables; it is equal to the square root of the variance.

Calculating the Deviation From the Mean

Consider again the distribution of the percentage change in the elderly population for the four regions of the United States. Because we want to calculate the average difference of all the regions from the national average (the mean), it makes sense to first look at the difference between each region and the mean. This difference, called a deviation from the mean, is symbolized as (Y–Ȳ). The sum of these deviations can be symbolized as Σ(Y–Ȳ).

The calculations of these deviations for each region are displayed in Table 5.7 and Figure 5.6. We have also summed these deviations. Note that each region has either a positive or a negative deviation score. The deviation is positive when the percentage change in the elderly population is above the mean. It is negative when the percentage change is below the mean. Thus, for example, the Northeast’s deviation score of –3.95 means that its percentage change in the elderly population was 3.95 percentage points below the mean.

Figure 5.6 Illustrating Deviations From the Mean

Table 5.7 Projected Percentage Change in the Elderly Population, 2008–2015, by Region and Deviation From the Mean

You may wonder if we could calculate the average of these deviations by simply adding up the deviations and dividing by the number of scores. Unfortunately we cannot, because the sum of the deviations of scores from the mean is always zero, or algebraically Σ(Y–Ȳ) = 0. In other words, if we were to subtract the mean from each score and then add up all the deviations as we did in Table 5.7, the sum would be zero, which in turn would cause the average deviation (i.e., average difference) to compute to zero. This is always true because the mean is the center of gravity of the distribution.
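The claim that the deviations always sum to zero can be checked in a few lines, using only the definition of the mean, Ȳ = ΣY/N. The derivation below is a short sketch added for illustration; it is not taken from the tables.

```latex
\begin{align*}
\sum (Y - \bar{Y}) &= \sum Y - \sum \bar{Y} \\
                   &= \sum Y - N\bar{Y} \\
                   &= \sum Y - N \cdot \frac{\sum Y}{N} \\
                   &= 0
\end{align*}
```

The second line holds because Ȳ is a constant, so adding it once for each of the N scores gives NȲ.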

Mathematically, we can overcome this problem either by ignoring the plus and minus signs, using instead the absolute values of the deviations, or by squaring the deviations—that is, multiplying each deviation by itself to get rid of the negative sign. Since absolute values are difficult to work with mathematically, the latter method is used to compensate for the problem.

Table 5.8 presents the same information as Table 5.7, but here we have squared the actual deviations from the mean and added together the squares. The sum of the squared deviations is symbolized as Σ(Y–Ȳ)². Note that by squaring the deviations, we end up with a sum representing the deviation from the mean, which is positive. (Note that this sum will equal zero if all the cases have the same value as the mean.) In our example, this sum is Σ(Y–Ȳ)² = 108.82.

Examine Table 5.8 again and note the disproportionate contribution of the western region to the sum of the squared deviations from the mean (it actually accounts for about 45% of the sum of squares). Can you explain why? (Hint: It has something to do with the sensitivity of the mean to extreme values.)
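The West's contribution can be checked with a few lines of arithmetic. The sketch below is written in Python, which is an assumption on our part since the chapter itself works in SPSS. It uses only figures quoted in the text: the West's value of 27.0, the mean of 19.95, and the sum of squared deviations of 108.82. The variable names are illustrative.

```python
# Figures quoted in the text for the projected percentage change.
west = 27.0              # West's projected percentage change
mean = 19.95             # national average (mean of the four regions)
sum_of_squares = 108.82  # sum of squared deviations reported for Table 5.8

# Deviation of the West from the mean, and its square.
west_deviation = west - mean
west_squared = west_deviation ** 2

# Share of the total sum of squares contributed by the West.
west_share = west_squared / sum_of_squares

print(f"deviation: {west_deviation:.2f}")
print(f"squared deviation: {west_squared:.2f}")
print(f"share of sum of squares: {west_share:.1%}")
```

Because the West lies farther from the mean than any other region, squaring enlarges its deviation more than the others.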


Table 5.8 Projected Percentage Change in the Elderly Population, 2008–2015, by Region, Deviation From the Mean, and Deviation From the Mean Squared

Calculating the Variance and the Standard Deviation

The average of the squared deviations from the mean is known as the variance. The variance is symbolized as s². Remember that we are interested in the average of the squared deviations from the mean. Therefore, we need to divide the sum of the squared deviations by the number of scores (N) in the distribution. However, unlike the calculation of the mean, we will use N – 1 rather than N in the denominator.¹¹ The formula for the variance can be stated as

$$s^2 = \frac{\sum (Y - \bar{Y})^2}{N - 1} \tag{5.2}$$

where

s² = the variance

(Y–Ȳ) = the deviation from the mean

Σ(Y–Ȳ)² = the sum of the squared deviations from the mean

N = the number of scores

Note that the formula incorporates all the symbols we defined earlier. This formula means that the variance is equal to the average of the squared deviations from the mean.

Follow these steps to calculate the variance:

1. Calculate the mean, Ȳ = Σ(Y)/N.
2. Subtract the mean from each score to find the deviation, Y–Ȳ.
3. Square each deviation, (Y–Ȳ)².
4. Sum the squared deviations, Σ(Y–Ȳ)².
5. Divide the sum by N – 1, Σ(Y–Ȳ)²/(N – 1).
6. The answer is the variance.
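The steps above can be sketched as a short Python function. This is a minimal illustration, not the procedure used by SPSS: the function name `sample_variance` and the scores are hypothetical, chosen only to keep the arithmetic easy to follow.

```python
from statistics import variance as library_variance


def sample_variance(scores):
    """Variance computed with the six steps listed above (N - 1 denominator)."""
    n = len(scores)
    if n < 2:
        raise ValueError("at least two scores are needed to divide by N - 1")
    mean = sum(scores) / n                    # Step 1: the mean
    deviations = [y - mean for y in scores]   # Step 2: deviations from the mean
    squared = [d ** 2 for d in deviations]    # Step 3: squared deviations
    sum_of_squares = sum(squared)             # Step 4: sum of squared deviations
    return sum_of_squares / (n - 1)           # Steps 5 and 6: divide by N - 1


# Hypothetical scores (not from the chapter's tables).
scores = [2, 4, 4, 4, 5, 5, 7, 9]
result = sample_variance(scores)
print(result)

# Cross-check against Python's standard library, which also divides by N - 1.
print(library_variance(scores))
```

For these scores the mean is 5 and the squared deviations sum to 32, so the function returns 32/7.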

To assure yourself that you understand how to calculate the variance, go back to Table 5.8 and follow this step-by-step procedure for calculating the variance. Now plug the required quantities into Formula 5.2. Your result should look like this:

$$s^2 = \frac{\sum (Y - \bar{Y})^2}{N - 1} = \frac{108.82}{4 - 1} = \frac{108.82}{3} = 36.27$$

One problem with the variance is that it is based on squared deviations and therefore is no longer expressed in the original units of measurement. For instance, it is difficult to interpret the variance of 36.27, which represents the distribution of the percentage change in the elderly population, because this figure is expressed in squared percentages. Thus, we often take the square root of the variance and interpret it instead. This gives us the standard deviation, s.

The standard deviation, symbolized as s, is the square root of the variance, or

$$s = \sqrt{s^2}$$

The standard deviation for our example is

$$s = \sqrt{36.27} = 6.02$$

The formula for the standard deviation uses the same symbols as the formula for the variance:

$$s = \sqrt{\frac{\sum (Y - \bar{Y})^2}{N - 1}}$$

As we interpret the formula, we can say that the standard deviation is equal to the square root of the average of the squared deviations from the mean.
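The variance and standard deviation for the regional example can be reproduced from two figures given in the text: the sum of squared deviations (108.82) and the number of regions (4). The Python sketch below is illustrative only, and the variable names are our own.

```python
import math

sum_of_squares = 108.82  # sum of squared deviations reported in the text
n = 4                    # four regions

# Variance: sum of squared deviations divided by N - 1.
variance = sum_of_squares / (n - 1)

# Standard deviation: square root of the variance.
std_dev = math.sqrt(variance)

print(f"variance: {variance:.2f}")
print(f"standard deviation: {std_dev:.2f}")
```

Rounded to two decimals, the results agree with the 36.27 and 6.02 reported in the text.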

The advantage of the standard deviation is that unlike the variance, it is measured in the same units as the original data. For instance, the standard deviation for our example is 6.02. Because the original data were expressed in percentages, this number is expressed as a percentage as well. In other words, you could say, “The standard deviation is 6.02%.” But what does this mean? The actual number tells us very little by itself, but it allows us to evaluate the dispersion of the scores around the mean.

In a distribution where all the scores are identical, the standard deviation is zero (0). Zero is the lowest possible value for the standard deviation; in such a distribution, all the points would be the same, with the same mean, mode, and median. There is no variation or dispersion in the scores.

The more the standard deviation departs from zero, the more variation there is in the distribution. There is no upper limit to the value of the standard deviation. In our example, we can conclude that a standard deviation of 6.02% means that the projected percentage change in the elderly population for the four regions of the United States is widely dispersed around the mean of 19.95%.

The standard deviation can be considered a standard against which we can evaluate the positioning of scores relative to the mean and to other scores in the distribution. As we will see in more detail in Chapter 6, in most distributions, unless they are highly skewed, about 34% of all scores fall between the mean and 1 standard deviation above the mean. Another 34% of scores fall between the mean and 1 standard deviation below it. Thus, we would expect the majority of scores (68%) to fall within 1 standard deviation of the mean.

FOCUS ON INTERPRETATION: GDP (PER CAPITA) FOR SELECT COUNTRIES

The table below shows considerable variability in GDP per capita for the selected countries. We can use the standard deviation to assess the variability around the mean GDP per capita.

GDP per Capita for Select Countries


Source: GLOBAL13SSDS.

1 Current U.S. dollar in millions.

The SPSS output showing the mean and standard deviation for GDP per capita is presented below.

The table titled “Descriptive Statistics” has six columns. In the first column, we see the name of the variable “Gross Domestic Product per capita” (GDP). The next several columns tell us that there were 70 countries in our sample and that the minimum GDP per capita was $400 and the maximum was $80,700. This is quite a gap between the poorest and richest countries in our sample. The mean and standard deviation are listed in the final two columns.

The mean GDP per capita is $19,247.14, with a standard deviation of $17,628.28. We can expect about 68% of these countries to have GDP per capita values within a range of $1,618.86 ($19,247.14 – $17,628.28) to $36,875.42 ($19,247.14 + $17,628.28). Hence, based on the mean and the standard deviation, we have a pretty good indication of what would be considered a typical GDP per capita value for the majority of countries in our sample. For example, we would consider a country with a GDP per capita value of $80,700 to be extremely wealthy in comparison to other countries. More than two thirds of all countries in our sample fall closer to the mean than the country with a GDP per capita value of $80,700.
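The interval of 1 standard deviation around the mean can be recomputed directly from the two SPSS figures. The Python sketch below is an illustration with variable names of our own. The last step, which expresses the richest country's distance from the mean in standard deviation units, is an added convenience rather than something the SPSS output reports.

```python
mean_gdp = 19_247.14  # mean GDP per capita from the SPSS output
sd_gdp = 17_628.28    # standard deviation from the SPSS output

# Bounds of the interval within 1 standard deviation of the mean.
lower = mean_gdp - sd_gdp
upper = mean_gdp + sd_gdp
print(f"${lower:,.2f} to ${upper:,.2f}")

# How far the maximum value in the sample lies above the mean, in SD units.
richest = 80_700
distance_in_sds = (richest - mean_gdp) / sd_gdp
print(f"{distance_in_sds:.2f} standard deviations above the mean")
```

A value that lies several standard deviations above the mean is far outside the interval that holds the typical countries, which is why $80,700 is described as extremely wealthy.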

Another way to interpret the standard deviation is to compare it with another distribution. For instance, Table 5.9 displays the means and standard deviations of employee age for two samples drawn from a Fortune 100 corporation. Samples are divided into female clerical and female technical. Note that the mean ages for both samples are about the same—approximately 39 years of age. However, the standard deviations suggest that the distribution of age is dissimilar between the two groups. Figure 5.7 loosely illustrates this dissimilarity in the two distributions.

Figure 5.7 Illustrating the Means and Standard Deviations for Age Characteristics


Table 5.9 Age Characteristics of Female Clerical and Technical Employees

Source: Adapted from Marjorie Armstrong-Stassen, “The Effect of Gender and Organizational Level on How Survivors Appraise and Cope with Organizational Downsizing,” Journal of Applied Behavioral Science, 34, no. 2 (June 1998): 125–142. Reprinted with permission.

The relatively low standard deviation for female technical indicates that this group is relatively homogeneous in age. That is to say, most of the women’s ages, while not identical, are fairly similar. The average deviation from the mean age of 39.87 is 3.75 years. In contrast, the standard deviation for female clerical employees is about twice the standard deviation for female technical. This suggests a wider dispersion or greater heterogeneity in the ages of clerical workers. We can say that the average deviation from the mean age of 39.46 is 7.80 years for clerical workers. The larger standard deviation indicates a wider dispersion of points below or above the mean. On average, clerical employees are farther in age from their mean of 39.46.¹²

Learning Check

Take time to understand the section on standard deviation and variance. You will see these statistics in more advanced procedures. Although your instructor may require you to memorize the formulas, it is more important for you to understand how to interpret standard deviation and variance and when they can be appropriately used. Many hand calculators and all statistical software programs will calculate these measures of diversity for you, but they won’t tell you what they mean. Once you understand the meaning behind these measures, the formulas will be easier to remember.

CONSIDERATIONS FOR CHOOSING A MEASURE OF VARIATION

So far, we have considered five measures of variation: (1) the IQV, (2) the range, (3) the IQR, (4) the variance, and (5) the standard deviation. Each measure can represent the degree of variability in a distribution. But which one should we use? There is no simple answer to this question. However, in general, we tend to use only one measure of variation, and the choice of the appropriate one involves a number of considerations. These considerations and how they affect our choice of the appropriate measure are presented in the form of a decision tree in Figure 5.8.

As in choosing a measure of central tendency, one of the most basic considerations in choosing a measure of variability is the variable’s level of measurement. Valid use of any of the measures requires that the data are measured at the level appropriate for that measure or higher, as shown in Figure 5.8.

Nominal level: With nominal variables, your choice is restricted to the IQV as a measure of variability.

Ordinal level: The choice of measure of variation for ordinal variables is more problematic. The IQV can be used to reflect variability in distributions of ordinal variables, but because it is not sensitive to the rank ordering of values implied in ordinal variables, it loses some information. Another possibility is to use the IQR. However, the IQR relies on distance between two scores to express variation, information that cannot be obtained from ordinal-measured scores. The compromise is to use the IQR (reporting Q1 and Q3) alongside the median, interpreting the IQR as the range of rank-ordered values that includes the middle 50% of the observations.¹³

Interval-ratio level: For interval-ratio variables, you can choose the variance (or standard deviation), the range, or the IQR. Because the range, and to a lesser extent the IQR, is based on only two scores in the distribution (and therefore tends to be sensitive if either of the two points is extreme), the variance and/or standard deviation is usually preferred. However, if a distribution is extremely skewed so that the mean is no longer representative of the central tendency in the distribution, the range and the IQR can be used. The range and the IQR will also be useful when you are reading tables or quickly scanning data to get a rough idea of the extent of dispersion in the distribution.
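The three cases above can be summarized as a small lookup. The Python sketch below follows the prose only, since Figure 5.8 is not reproduced here. The function name `choose_measure`, its string labels, and the `extremely_skewed` flag are hypothetical conveniences, not part of any statistical package.

```python
def choose_measure(level, extremely_skewed=False):
    """Suggest a measure of variation for a given level of measurement.

    `level` is "nominal", "ordinal", or "interval-ratio". The labels are
    illustrative and simply restate the guidance in the text.
    """
    if level == "nominal":
        return "IQV"
    if level == "ordinal":
        return "IQR (report Q1 and Q3 alongside the median)"
    if level == "interval-ratio":
        if extremely_skewed:
            # The mean no longer represents the center well.
            return "range or IQR"
        return "variance or standard deviation"
    raise ValueError(f"unknown level of measurement: {level!r}")


for level in ("nominal", "ordinal", "interval-ratio"):
    print(level, "->", choose_measure(level))
print("interval-ratio (skewed) ->",
      choose_measure("interval-ratio", extremely_skewed=True))
```

The lookup returns a starting point only; the final choice still depends on the research question and on how the results will be read.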

Figure 5.8 How to Choose a Measure of Variation


READING THE RESEARCH LITERATURE: DIFFERENCES IN COLLEGE ASPIRATIONS AND EXPECTATIONS AMONG LATINO ADOLESCENTS

In Chapter 2, we discussed how frequency distributions are presented in the professional literature. We noted that most statistical tables presented in the social science literature are considerably more complex than those we describe in this book. The same can be said about measures of central tendency and variation. Most research articles use measures of central tendency and variation in ways that go beyond describing the central tendency and variation of a single variable. In this section, we refer to both the mean and the standard deviation because in most research reports the standard deviation is given along with the mean.

Table 5.10 displays data taken from a research article published in Social Problems.¹⁴ This table illustrates a common research application of the mean and standard deviation. The authors of this article examine how ethnicity plays into one’s college aspirations and expectations. Their major focus is to explore “potential differences in college aspirations and expectations across the three largest Latino groups and the potential sources of such differences.”¹⁵ We focus only on their data for Cubans and Mexicans to present a simplified example of the mean and standard deviation. Understanding the relationship between ethnicity and college aspirations and expectations is, nonetheless, critical in that the U.S. Latino population is growing faster than any other minority group, yet Latinos remain the least educated of all other people of color.

Table 5.10 Ethnicity and College Aspirations and Expectations

Source: Adapted from Stephanie A. Bohon, Monica Kirkpatrick Johnson, and Bridget K. Gorman, “College Aspirations and Expectations Among Latino Adolescents in the United States,” Social Problems 53, no. 2 (2006): 207–225. Published by the University of California Press.

Note: The authors examine variation among other ethnicities in this paper. However, to simplify this example, we focus on the descriptive statistics for Cubans and Mexicans.

Data for this study come from the National Longitudinal Study of Adolescent Health survey. This survey is a representative sample of American adolescents in Grades 7 to 12. Complex sampling strategies are employed to ensure a representative sample. Thus, factors such as variation in geographic location, type of school, racial makeup, and so on are accounted for during data collection.

Respondents were asked a variety of questions, but the authors focused specifically on questions about college aspirations and expectations. Their measure of college aspirations is based on a scale of 1 to 5 derived from the question, “How much do you want to go to college?” An answer of 1 indicated a low desire to go to college, while an answer of 5 indicated a high desire to go to college. Their measure of college expectations was also based on the same scale ranging from 1 (low) to 5 (high). However, it was derived from the question, “How likely is it that you will go to college?”

What can we conclude from examining the means and standard deviations for these variables? The first thing we should look at is the means. Are they similar or different? For both the aspirations and expectations measures, we can see that Mexicans have slightly lower aspirations (4.20) and expectations (3.70) than Cubans (4.50 and 4.30, respectively). Furthermore, the standard deviations indicate that there is more variability in each of these measures for Cubans than for Mexicans.

The researchers of this study described the data displayed in Table 5.10 as follows:

They show strong aspirations for and expectations of college attendance across each of the five groups. Important differences across ethnic groups exist, however. As anticipated, Mexicans have weaker than average . . . and Cubans have stronger than average aspirations and expectations.¹⁶

Why might this be? The authors conclude their discussion of the data presented in Table 5.10 by arguing as follows:

Differential aspirations and expectations may be explained by the considerable differences in family and household characteristics, parental hopes for their child’s educational success, and academic skills and disengagement.¹⁷

MAIN POINTS

- Measures of variability are numbers that describe how much variation or diversity there is in a distribution.

- The index of qualitative variation (IQV) is used to measure variation in nominal variables. It is based on the ratio of the total number of differences in the distribution to the maximum number of possible differences within the same distribution. IQV can vary from 0.00 to 1.00.

- The range measures variation in interval-ratio variables and is the difference between the highest (maximum) and the lowest (minimum) scores in the distribution. To find the range, subtract the lowest from the highest score in a distribution. For an ordinal variable, just report the lowest and the highest values without subtracting.

- The interquartile range (IQR) measures the width of the middle 50% of the distribution. It is defined as the difference between the lower and upper quartiles (Q1 and Q3). For an ordinal variable, just report Q1 and Q3 without subtracting.

- The box plot is a graphical device that visually presents the range, the IQR, the median, the lowest (minimum) score, and the highest (maximum) score. The box plot provides us with a way to visually examine the center, the variation, and the shape of a distribution.

- The variance and the standard deviation are two closely related measures of variation for interval-ratio variables that increase or decrease based on how closely the scores cluster around the mean. The variance is the average of the squared deviations from the center (mean) of the distribution; the standard deviation is the square root of the variance.

KEY TERMS

index of qualitative variation (IQV)
interquartile range (IQR)
measures of variability
range
standard deviation
variance

Sharpen your skills with SAGE edge at edge.sagepub.com/frankfort7e. SAGE edge for students provides a personalized approach to help you accomplish your coursework goals in an easy-to-use learning environment.

Exercises

http://edge.sagepub.com/frankfort7e

SPSS DEMONSTRATIONS

[GSS10SSDS]

Demonstration 1: Producing Measures of Variability With Frequencies

Except for the IQV, the SPSS Frequencies procedure can produce all the measures of variability we’ve reviewed in this chapter. (SPSS can be programmed to calculate the IQV, but the programming procedures are beyond the scope of our book.)

We’ll begin with Frequencies and calculate various statistics for AGE. If we click on Analyze, Descriptive Statistics, Frequencies, then on the Statistics button, we can select the appropriate measures of variability.

The measures of variability available are listed in the Dispersion box at the bottom of the dialog box (see Figure 5.9). We’ve selected the standard deviation, variance, and range, plus the mean and median (in the Central Tendency box) for reference. In the Percentile Values box, we’ve selected Quartiles to tell SPSS to calculate the values for the 25th, 50th, and 75th percentiles. SPSS also allows us to specify exact percentiles in this section (such as the 34th percentile) by typing a number in the box after “Percentile(s)” and then clicking on the Add button.

Earlier, we had seen the frequency table for the variable AGE, so after clicking on Continue, we click on Format to turn off the display table. This is done by clicking on the button for “Suppress tables with many categories” (see Figure 5.10). There are other formatting options here that you may explore later when using SPSS.

Click on Continue, then OK to run the procedure. SPSS produces the mean and the other statistics we requested (Figure 5.11). The range of age is 71 years (from 18 to 89). The standard deviation is 17.557, which indicates that there is a moderate amount of dispersion in the ages (this can also be seen from the histogram of AGE in Chapter 3). The variance, 308.248, is the square of the standard deviation (17.557).

The value of the 25th percentile is 35, the value of the 50th percentile (which is also the median) is 49, and the value of the 75th percentile is 62. Although Frequencies does not calculate the IQR, it can easily be calculated by subtracting the value of the 25th percentile from the 75th percentile, which yields a value of 27 years. Compare this value with the standard deviation.
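The relationships among these statistics can be verified by hand or with a few lines of code. The Python sketch below uses only the values reported in the SPSS output described above; the variable names are our own, and the variance is recomputed from the rounded standard deviation, so it matches the reported figure only to the displayed precision.

```python
# Values reported in the SPSS Frequencies output for AGE.
minimum, maximum = 18, 89
q1, median, q3 = 35, 49, 62
std_dev = 17.557

# Range: highest score minus lowest score.
age_range = maximum - minimum

# Interquartile range: 75th percentile minus 25th percentile.
iqr = q3 - q1

# Variance: the square of the standard deviation.
variance = std_dev ** 2

print("range:", age_range)
print("IQR:", iqr)
print("variance:", round(variance, 3))
```

The range describes the full spread of ages, the IQR describes the middle 50% of respondents, and the standard deviation describes the typical distance from the mean, so the three numbers are not expected to be equal.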


Figure 5.9 Statistics Dialog Box

Figure 5.10 Format Dialog Box

[GSS10SSDS]

Demonstration 2: Producing Variability Measures and Box Plots With Explore

Another SPSS procedure that can produce the usual measures of variability is Explore, which also produces box plots. The Explore procedure is located in the Descriptive Statistics section of the Analyze menu. In its main dialog box (Figure 5.12), the variables for which you want statistics are placed in the Dependent List box. You have the option of putting one or more nominal variables in the Factor List box; Explore will display separate statistics for each category of the nominal variable(s) you’ve selected.
