For some time now, the term stereotype has connoted one aspect of prejudice, and this linkage between stereotyping and prejudice isn’t altogether unfair. Most people can recall at least one instance when someone applied a stereotype to them, assuming something that was untrue, unflattering, and unfair. Nonetheless, there is another side to stereotypes: it can be rational to apply them when they’re generally true. For instance, young men are more likely to perpetrate violent crimes than young women, old men, or old women. Like most young men, I can recall when someone eyed me warily as I passed them on a city sidewalk late at night. While I posed no threat to them, their behavior didn’t strike me as irrational, because the stereotype they were applying was a sensible one under the circumstances.

In social psychology, the recognition that stereotypes can be hurtful has not been balanced by the recognition that they can also be accurate. Once the study of stereotypes became subsumed under prejudice research, stereotypes were assumed to be inaccurate. It was only in the 1980s that a very small group of social psychologists dared to measure whether at least some stereotypes were accurate. No sooner had this research been published than the researchers were disparaged by critics who leveled numerous charges against them. Some of those charges were insubstantial insults, but an important charge, one that has persisted in the literature, is that stereotype accuracy is impossible to measure. Here’s a recent example of that kind of skepticism:

This most basic question–“what constitutes accuracy?”–is a slippery one indeed. Surely we can agree that if a belief describes only a few members of a group, it is off the mark. Similarly, we do not expect a trait to describe all group members before it is deemed “true.” The middle group, however, is harder to find. Would a stereotype that described 30 percent of social group members be accurate? How about 50 percent or 75 percent?

That’s an excerpt from The Psychology of Prejudice and Discrimination (2009), a psychology textbook adopted by a non-trivial number of college professors. Much of this book is commendable, but this particular section exemplifies the “how do you measure it?” critique.

Here’s another appearance of that critique, a few paragraphs later:

Regardless of whether percentage estimates or measures of dispersion are employed, the question is whether people over- or underestimate the group’s actual characteristics. To make this judgment, researchers must assume there is an objective way to assess the characteristic of interest, which as we discuss below, is often difficult. Further complicating the picture, these two measures of stereotype accuracy can operate independently. Research participants might be fairly accurate, for example, in their estimates of what percentage of Asians are mathematical, but they might be inaccurate in their estimate of the variability of this characteristic. If perceivers are accurate on one measure, but not the other, does their belief have a kernel of truth? This question is difficult to answer.

To my knowledge, no critic of stereotype accuracy research has leveled the same criticisms against social science in general, even though the parallels are clear. In a number of social sciences, this is how you make a generalizable scientific claim. First, you advance a hypothesis. “Men are taller than women, on average” is a good example. Second, you conjure an opponent, the null hypothesis. In this case, the null is: “Men and women are equally tall, on average.” Third, you measure a large sample of women and men and compute the average difference in height. You then plug that average difference into a formula that tells you how likely you would be to find a difference at least that large if your opponent (the null hypothesis) were right. If that probability is less than 5%, you reject the null and claim that your finding is true and generalizable. Admittedly, you have to add caveats and hedge your claim by noting that your results should be replicated. However, when people try to replicate your study, they follow the same set of steps.
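
For readers who want to see those steps concretely, here is a minimal sketch in Python. It is one possible version of the "formula" step, not the only one: it uses a large-sample z-test for a difference in means, and the heights are simulated from assumed means and standard deviations rather than taken from any real dataset. The function name `two_sample_z_test` and all the numbers are illustrative assumptions.

```python
import random
from statistics import NormalDist, mean, stdev


def two_sample_z_test(a, b):
    """Return (mean difference, two-sided p-value) under the null of equal means.

    Large-sample approximation: the difference in sample means is treated
    as normally distributed, and the two groups may have unequal variances.
    """
    diff = mean(a) - mean(b)
    # Standard error of the difference between the two sample means.
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    z = diff / se
    # Probability of a difference at least this extreme if the null is right.
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return diff, p


# Simulated heights in inches. The means (69, 64) and SD (3) are
# illustrative assumptions, not measurements.
rng = random.Random(0)
men = [rng.gauss(69.0, 3.0) for _ in range(500)]
women = [rng.gauss(64.0, 3.0) for _ in range(500)]

diff, p = two_sample_z_test(men, women)
print(f"mean difference = {diff:.2f} in, p = {p:.3g}")
print("reject the null at the 5% level" if p < 0.05 else "fail to reject the null")
```

Note that the whole procedure turns on a single number, the difference between two averages, plus a convention (the 5% threshold) for deciding when that difference counts.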

The fundamental sex comparison here is one of averages, not of proportions. Jumping back to the textbook example, we could consider proportions to be more valid, and we could ask whether 30 percent of men are taller than all women, or 50 percent, or 75 percent. Alternatively, we could ask whether the middle 50 percent of men is uniformly taller than the middle 50 percent of women. We could also discard both proportions and averages and use statistics like the median (midpoint) or the mode (most common value). One could make a good case for any of these. However, by convention, we don’t. We rely on the average, unless there’s a specific reason to avoid using the average, because it’s good enough.

Now to the second critique: researchers must measure both the central tendency and the dispersion, and check whether both are accurate. This criticism has more merit, because people who get the average right can be wrong about the dispersion. For instance, if I think the height of the average man is 5'9" and also think that 90% of men are within 1/1000th of an inch of that height, I’ve got an accurate belief about the average and an inaccurate belief about the dispersion.

However, in daily life, stereotypes aren’t about dispersion. For instance, there’s a stereotype (an accurate one) that men are more violent than women, but there’s no comparable stereotype about dispersion. Everyone knows that we don’t live in a world where all men are violent and all women are non-violent, which means that everyone acknowledges that there’s some dispersion, but there’s no fundamental stereotype about it. For psychological reasons, it can be interesting to measure dispersion, but in daily life, stereotypes are about differences in central tendency.

Social science is similarly concerned with differences in central tendency. If you ask social psychologists to discuss the results of a famous psychological study like the marshmallow study, nearly all of them will tell you about the difference between those who resisted the temptation to eat the marshmallow and those who ate it. But, without consulting the original article, almost none of them can tell you about the dispersion.

Of course, social scientists care about the dispersion when they have to compute a standardized effect size, which allows comparisons across effects. However, this is simply because the dispersion has to go in the denominator. The purpose of the effect size is to give readers a sense of the magnitude of the difference in central tendency.
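
To illustrate the point about the denominator, here is a small sketch of one common standardized effect size, Cohen's d, which divides the difference in means by the pooled standard deviation. The choice of Cohen's d is my assumption about which effect size to show, and the scores below are made up purely for illustration.

```python
from statistics import mean, variance


def cohens_d(a, b):
    """Standardized mean difference: (mean(a) - mean(b)) / pooled SD."""
    na, nb = len(a), len(b)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / pooled_var ** 0.5


# Made-up scores: both pairs differ by 2 points on average,
# but the second pair is far more spread out.
tight_a, tight_b = [9, 10, 11, 12, 13], [7, 8, 9, 10, 11]
wide_a, wide_b = [3, 7, 11, 15, 19], [1, 5, 9, 13, 17]

print(round(cohens_d(tight_a, tight_b), 2))  # 1.26
print(round(cohens_d(wide_a, wide_b), 2))    # 0.32
```

The dispersion changes how large the same 2-point gap looks, but the quantity being reported is still the size of the difference in averages.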

So yet again, there’s a double standard. Stereotype accuracy is suspect because the true evaluation of a stereotype requires more than a mere examination of mean differences. And yet social science is all about mean differences, not because other things are unimportant, but because much of the time, knowledge of mean differences is good enough.

I’ve found this double standard puzzling. If we can develop imperfect but reasonably good conventions for what constitutes a valid scientific finding, we can also develop imperfect but reasonably good conventions for what constitutes stereotype accuracy. Establishing these conventions can also spur research to establish which stereotypes are inaccurate. The “how do we measure it” school doesn’t note that if we simply give up on accuracy measurement, we have to give up on both accuracy and inaccuracy. When only stereotype accuracy is held to a higher standard, that’s an indication of political bias. In a fairer world, we would treat both scientific validity and stereotype accuracy similarly—as problems that have pragmatic though imperfect solutions.
