Is Stereotype Threat Overcooked, Overstated, and Oversold?

The Selling of Stereotype Threat

Stereotype threat is one of the most famous and influential phenomena in all of psychology. The famous paper (Steele & Aronson, 1995) unveiling the phenomenon has been cited over 5000 times, according to Google Scholar. And for good reason.

The original studies seemed to reveal an extraordinarily striking finding. The typically very large average difference in standardized test scores between African Americans and Whites was, supposedly, a very flimsy, superficial difference, readily eliminated by a tiny tweak to the conditions under which such tests were administered. Given that, for over 50 years, educators and social scientists had found it essentially impossible to craft programs that eliminated racial achievement differences, this was a “world-changing” finding.

What was that tiny situational tweak? Frame the test as a “challenge,” rather than as a test of ability. “That’s it?” you say. That’s really it. This was supposed to work because, according to the theory of stereotype threat, African American students often become anxious over confirming racial stereotypes and this harms their performance. According to widespread interpretations of the Steele & Aronson (1995) findings, if one removes that threat, voilà! Race differences disappear.

Black bars are for African American students; Striped bars are for White students. This figure is based on results presented in Steele & Aronson, 1995, Study 2. Because they only reported tests of significance of difference, and did not report the values, the values here are approximations based on a figure they presented.

Look at those bars! They are AMAZING. When the test was framed as a “challenge” the entire racial achievement gap completely disappeared! No wonder this paper has been cited 5000 times. Eureka! Finally, someone solved the racial achievement gap problem!

The Overselling of Stereotype Threat 1.0

Except they didn’t. Stereotype threat has been overstated, overpromoted, and oversold.

How can that be? You can see with your own eyes, can’t you, that the racial difference completely evaporated when the test was framed as a challenge. What can I possibly tell you that could suggest otherwise?

The results are not what they seem. It is certainly true that if the results were mean test performance scores, which is how they are labeled on the Y-axis, they would in fact be a World Changing Result showing that a simple situational tweak can eliminate racial achievement differences.

But look at the Y-Axis label. See that nasty little, almost hidden in plain sight parenthetical “(adjusted by SAT scores)”? This figure does not show mean test scores. Because it does not show means, it cannot possibly (and does not) show that the means were equal.

It shows adjusted mean test scores, controlling for prior SAT scores. This crucially changes the meaning of the results; and it changes this World Changing Result to something a little more mundane. Specifically, this result provides no evidence at all that the racial achievement difference was even reduced, let alone eliminated, a point first made in a paper that was published over ten years ago (Sackett et al, 2004).

How can that be? This faux equating of the means is done through a statistical technique called Analysis of Covariance (ANCOVA). If the assumptions for an ANCOVA are met, and if students were truly randomly assigned to conditions, equal adjusted means do not indicate equal means; they indicate that the prior differences were simply maintained. (And if the assumptions were not met, or random assignment failed, the entire result reflects an almost impossible-to-interpret nonexperimental study.)

ANCOVA “controls” for prior SAT scores. There can be good reasons for using ANCOVA, but one should never confuse a true mean with a “covariate adjusted mean.” If we “control” for prior differences (i.e., eliminate them) and get no difference, it is not because any intervention we have conducted “eliminates” that difference; it is because we have statistically removed the difference. We can do so, of course, but “equality” becomes a statistical fiction resulting from our removal of differences and the difference is still there. Equal adjusted means in ANCOVA can be a lot like saying, “Except for the four inch average difference in height between men and women, their average heights are identical.”
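For readers who want the arithmetic, here is a sketch of the standard textbook formula for a covariate-adjusted mean in a two-group ANCOVA. The symbols are introduced here for illustration and do not appear in Steele & Aronson (1995): \bar{y}_g is group g’s raw mean on the test, \bar{x}_g is its mean on the covariate (here, prior SAT), \bar{x} is the grand mean of the covariate, and b_w is the pooled within-group slope of test score on the covariate.

```latex
\bar{y}^{\,\mathrm{adj}}_{g} = \bar{y}_{g} - b_{w}\,(\bar{x}_{g} - \bar{x})
\qquad\Longrightarrow\qquad
\bar{y}^{\,\mathrm{adj}}_{1} - \bar{y}^{\,\mathrm{adj}}_{2}
  = (\bar{y}_{1} - \bar{y}_{2}) - b_{w}\,(\bar{x}_{1} - \bar{x}_{2})
```

Setting the adjusted difference to zero gives \bar{y}_1 - \bar{y}_2 = b_w(\bar{x}_1 - \bar{x}_2). In other words, equal adjusted means say that the raw gap on the test is just what the prior gap on the covariate predicts, not that the raw gap is zero.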

One can, perhaps, see this most easily in a little analysis we recently conducted on the far less controversial topic of temperature differences between Tampa and Anchorage. We selected 20 days scattered throughout the year and, what a shock, found that Tampa averaged about 40 degrees warmer than Anchorage.

However, through the magic of ANCOVA, we can make that difference “statistically” disappear. We simply identified the temperatures in Tampa and Anchorage the day before, and now conducted an ANCOVA, “controlling for prior temperatures.” Figure 2A shows the huge 40 degree difference in actual temperature. Figure 2B shows how, controlling for prior temperatures, there is no difference in the temperature of Tampa and Anchorage.

But of course there still is a difference. Saying “there are no differences, controlling for prior differences” is a silly, completely vacuous thing to say. It is like saying, “There is no difference, after we remove the difference.”

Figures based on those appearing in Jussim et al (in press). Statistically “controlling” for the temperature difference between Tampa and Anchorage on previous days “eliminates” the difference between Tampa and Anchorage on subsequent days. Except, of course, this is a statistical fiction.
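The temperature example is easy to reproduce in spirit. What follows is a minimal sketch in Python, assuming NumPy is available. It uses simulated temperatures, not the data behind the figures: the city baselines of 75 and 35 degrees, the seasonal swing, and the day-to-day noise are all made-up values, and the helper name ancova_adjusted_means is hypothetical.

```python
import numpy as np


def ancova_adjusted_means(y, x, group):
    """Raw and covariate-adjusted group means for a one-covariate ANCOVA.

    y: outcome, x: covariate, group: array of group labels.
    Returns (raw_means, adjusted_means, pooled_within_group_slope).
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    group = np.asarray(group)
    labels = np.unique(group)

    # Center x and y within each group to get the pooled within-group slope.
    x_centered = x.copy()
    y_centered = y.copy()
    for g in labels:
        mask = group == g
        x_centered[mask] -= x[mask].mean()
        y_centered[mask] -= y[mask].mean()
    slope = (x_centered * y_centered).sum() / (x_centered ** 2).sum()

    # Adjusted mean: raw mean shifted to the grand mean of the covariate.
    grand_x = x.mean()
    raw = {str(g): y[group == g].mean() for g in labels}
    adjusted = {
        str(g): raw[str(g)] - slope * (x[group == g].mean() - grand_x)
        for g in labels
    }
    return raw, adjusted, slope


# Simulated data (assumption): 20 days scattered through the year.
rng = np.random.default_rng(0)
n_days = 20
season = rng.uniform(-15.0, 15.0, n_days)  # seasonal swing shared by both cities

# "Prior" temperatures: city baseline + season + noise.
tampa_prior = 75.0 + season + rng.normal(0.0, 2.0, n_days)
anchorage_prior = 35.0 + season + rng.normal(0.0, 2.0, n_days)

# Next-day temperatures: yesterday's temperature plus a little noise.
tampa_today = tampa_prior + rng.normal(0.0, 2.0, n_days)
anchorage_today = anchorage_prior + rng.normal(0.0, 2.0, n_days)

today = np.concatenate([tampa_today, anchorage_today])
prior = np.concatenate([tampa_prior, anchorage_prior])
city = np.array(["Tampa"] * n_days + ["Anchorage"] * n_days)

raw, adjusted, slope = ancova_adjusted_means(today, prior, city)
raw_diff = raw["Tampa"] - raw["Anchorage"]
adj_diff = adjusted["Tampa"] - adjusted["Anchorage"]

print(f"pooled within-city slope: {slope:.2f}")
print(f"raw difference (Tampa - Anchorage): {raw_diff:.1f} degrees")
print(f"adjusted difference, 'controlling' for prior temperature: {adj_diff:.1f} degrees")
```

In this simulation the raw gap stays close to the built-in 40 degrees, while the adjusted gap shrinks toward zero. That happens because the slope is close to 1 and the cities differ by about 40 degrees on the prior day as well. Nothing about the actual weather in either city has changed.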

Overselling of Stereotype Threat 2.0

For the first 10 years post publication of the original stereotype threat study, it was routinely sold as showing that “remove threat, and eliminate racial achievement differences” (Jussim, Crawford, Anglin, Stevens, and Duarte, in press; Sackett et al, 2004; see also references at the end).

In response to Sackett et al’s critique, even Steele & Aronson (2004, p. 48) acknowledged this:

“Second, Sackett et al.’s (2004) narrow focus may have also led them to worry too much about the use of covariance analysis in Steele and Aronson’s (1995) study. They worried that this analysis led readers to believe that African Americans performed as well as Whites in the nondiagnostic (no stereotype threat) condition of that experiment, when, in fact, without this adjustment, they would be shown to perform still worse than Whites, as predicted by the group difference in their SATs. We, as much as Sackett et al., regret any confusion that this common analysis may have caused.”

You need to read that passage above closely. What starts out as a critique of Sackett et al’s point buries the acknowledgement that Sackett et al were right! The critical acknowledgement is worth repeating: “…without this adjustment, they would be shown to perform still worse than Whites.” And yet, the claim “Steele & Aronson found that, remove threat, and Black=White test scores” appears over and over and over (see examples at end of this post).

But it gets worse, or, at least, not much better. Social psychologists often react quite defensively when we are accused of not being a “real” science. Some such accusations are probably leveled by people who do not like our findings, but I, for one, would rather not be in the business of giving much credence to such critiques. And “true” sciences self-correct, when they have been found to be promoting invalid conclusions.

So, how are we doing self-correction-wise? One frequently now finds claims such as the following:

Schmader, Johns, and Forbes (2008, p. 336) claimed that the original study (Steele & Aronson, 1995) showed that:

“… African American college students performed worse than their White peers on standardized test questions when this task was described to them as being diagnostic of their verbal ability but that their performance was equivalent to that of their White peers when the same questions were simply framed as an exercise in problem solving (and after accounting for prior SAT scores).”

Similarly, Walton, Spencer, and Erman (2013, p. 5) wrote:

“In a classic series of studies, Black students performed worse than White students on a GRE test described as evaluative of verbal ability, an arena in which Blacks are negatively stereotyped. But when the same test was described as nonevaluative—rendering the stereotype irrelevant—Blacks performed as well as Whites (controlling for SAT scores; Steele & Aronson, 1995).”

These statements are technically true, highly convoluted and not unique to these papers. The language needs to be convoluted, because for the statements to be technically true, the declaration that African-American and White scores are “equivalent” in nonthreatening conditions needs to be walked back by adding the parenthetical regarding “controlling for prior SAT scores.” This “walking back” renders the conclusion statistically true but as meaningful as declaring that Tampa and Anchorage have equal temperatures (controlling for prior temperature). The actual result — pre-existing differences continued even under no threat conditions — is never explicitly stated in these descriptions of Steele & Aronson (1995).

Now, in fairness, not all stereotype threat researchers engage in this overselling of the findings, and even many that do, also acknowledge, when discussing research other than Steele & Aronson (1995) that stereotype threat does not provide a complete explanation of racial achievement differences. Furthermore, the results under the “threatening” (test of verbal ability) conditions do indicate that something interesting happened in the original studies – because those are adjusted means, they indicate that racial achievement differences increased when African American students believed they were being tested on their verbal ability. Threatening African American students in this way did seem to worsen their achievement, at least in these studies.

Stereotype threat is probably not quite as bad as buying a bridge for an Arizona desert. However, most reviews and even meta-analyses of stereotype threat are conducted by “advocates”: those who have enthusiastically embraced the idea, have published lots of research “demonstrating” the importance of stereotype threat, and who rarely, if ever, subject the work to skeptical tests of falsification.

Skeptical Analyses Raise Doubts about the Power of Stereotype Threat

A “heterodox analysis” is one that challenges some entrenched orthodoxy (lower case “h” refers to the idea of challenging orthodoxies; I will use upper case “H” to refer to HeterodoxAcademy members and scholarship). Advocacy tests are those designed to “prove” how big or important some phenomenon is. They are “demonstrational” and some of us believe they are as much or more theater as science. My view is that true science involves skeptical tests to falsify cherished theories and beliefs, à la Karl Popper. Absent falsification, smart people can “prove” almost anything. They can usually do it even with strong falsification tests, but it is at least more difficult.

Merely attempting to falsify some cherished belief, however, often creates risks for the falsifier, something well known from Copernicus through Galileo and Darwin. It is just as true today, but, instead, those cherished sacred beliefs, at least in the social sciences, often involve some variation on “egalitarianism.” Stereotype threat, of course, is a great rhetorical tool in the quest for egalitarianism (“oh, look at the flimsy situational basis for racial differences in achievement”). It is, therefore, professionally risky to challenge ideas that serve egalitarian rhetoric.

There have, however, been a small handful of skeptical analyses conducted by outsiders, by people who have not staked a significant portion of their careers on the real or imagined importance of stereotype threat. For example, Flore & Wicherts (2015) performed the only meta-analysis of which I am aware that has subjected stereotype threat findings to a whole family of skeptical tests, such as p-curves, funnel and forest plots, and tests for excess of significant results. The results are not pretty and show that the effects primarily appear in the underpowered, small scale studies, and either disappear or reverse altogether in the highly powered large scale studies. Uli Schimmack has also shown that stereotype threat studies are likely to have considerable difficulty replicating.

Furthermore, as Heterodox Academy’s Amy Wax (2009) has so aptly pointed out, stereotype threat effects among African Americans have been mostly obtained in very select and unrepresentative samples. A slew of boundary conditions have been proposed, which is another way of saying, “it might only apply to select people under select circumstances.” The generalizability of the findings, and the extent to which they explain much of the very large racial differences in academic achievement, are, at best, questionable, unknown, and certainly not “settled science.”

What Will Happen Next?

Most likely, stereotype threat researchers will just keep marching on as if nothing had happened. They will likely defend their turf and ignore compelling critiques of their findings (or deny that they are compelling). If they respond at all, they will likely produce, not apologies for misleading so many people, but apologia – defenses of their positions and claims. As long as heterodox critics are few and far between, stereotype threat advocates can probably continue this way, because their work will likely be subject to only muted skeptical scrutiny, if any. With respect to advancing one’s career, getting published, getting grants, etc., this is probably a very effective strategy. Egalitarian narratives, no matter how unjustified or flimsy their scientific basis, tend to play very well among grant agencies and journal reviewers. Which brings us back to one of the main inspirations for HeterodoxAcademy: the scholarly dysfunctions that result from lack of intellectual and political diversity in the academy.

How Could it be Different?

I hope this pessimistic view is wrong. I invite any advocate of stereotype threat to come forward and acknowledge how the area has gone wrong, and how it can do better. That would be terrific, not just for how people understand stereotype threat, and not just for doing something to elevate the scientific status and credibility of my field of social psychology (though it would do both). Treating the right social disease (academic inequality) with a weak or ineffective “medicine” is as likely to be effective as treating pneumonia with aspirin. Overselling stereotype threat does a disservice to those whom the research is supposed to actually help. This is especially poignant when one considers “opportunity costs” – the millions in grant money spent on stereotype threat research that might have gone to other sorts of research left unfunded, and the journal space devoted to stereotype threat that might have gone to more solid research that might have made a greater contribution to basic understandings of human psychology and/or to academic interventions targeting phenomena likely to make larger differences, such as actually improving the quality of the education minority students receive.

References

Flore, P. C., & Wicherts, J. M. (2015). Does stereotype threat influence performance of girls in stereotyped domains? A meta-analysis. Journal of School Psychology, 53, 25-44.

Jussim, L., Crawford, J. T., Anglin, S. M., Stevens, S. T., & Duarte, J. L. (in press). Interpretations and methods: Towards a more effectively self-correcting social psychology. Journal of Experimental Social Psychology.

Sackett, P. R., Hardison, C. M., & Cullen, M. J. (2004). On interpreting stereotype threat as accounting for African American-White differences on cognitive tests. American Psychologist, 59, 7-13.

Steele, C. M., & Aronson, J. A. (2004). Stereotype threat does not live by Steele and Aronson (1995) alone. American Psychologist, 59, 47-48.

Walton, G. M., Spencer, S. J., & Erman, S. (2013). Affirmative meritocracy. Social Issues and Policy Review, 7, 1-35.

Wax, A. (2009). Stereotype threat: A case of overclaim syndrome? In C. H. Sommers (Ed.), The science on women and science (pp. 132-169). Washington, D.C.: AEI Press.

For great examples of how Steele & Aronson’s results are routinely misrepresented or misinterpreted as “showing” that “remove threat, and Black=White scores,” see:

ReducingStereotypeThreat.org (retrieved 12/29/15)

“When race was not emphasized, however, Black students performed better and equivalently with White students.”

Stereotype Threat Widens Achievement Gap (American Psychological Association, retrieved 12/29/15)

“In the no stereotype-threat condition-in which the exact same test was described as a lab task that did not indicate ability-Blacks’ performance rose to match that of equally skilled Whites.”

Pigliucci, M. (2013). What are we to make of the concept of race? Thoughts of a philosopher-scientist. Studies in the History and Philosophy of Biological and Biomedical Sciences, 44, 272-277.

“Steele and Aronson (1995), among others, looked at IQ tests and at ETS tests (e.g. SATs, GREs, etc.) to see whether human intellectual performance can be manipulated with simple psychological tricks priming negative stereotypes about a group that the subjects self-identify with. Notoriously, the trick worked, and as a result we can explain almost all of the gap between whites and blacks on intelligence tests as an artifact of stereotype threat, a previously unknown testing situation bias.” (p. 276)

Schmader, T., Johns, M., & Forbes, C. (2008). An integrated process model of stereotype threat effects on performance. Psychological Review, 115, 336-356.

“In support of this hypothesis, their experiments revealed that African American college students performed worse than their White peers on standardized test questions when this task was described to them as being diagnostic of their verbal ability but that their performance was equivalent to that of their White peers when the same questions were simply framed as an exercise in problem solving (and after accounting for prior SAT scores).” (p. 336, referring to Steele & Aronson, 1995)