Grappling with the Past

A few days ago, Joe Hilgard asked a group of psychologists on Facebook how we create a culture that allows “frank and honest cataloging and discussion of the relative incidence of p-hacking” (where p-hacking is the use of flexibility in collecting, analyzing, and reporting research in a way that increases the rate of positive results).

I’ve been thinking a lot about this over the past few days. First, I agree with Joe that part of doing science is building on the past, and uncertainty about the power and bias of our literature means that making use of our past is a struggle.

Those of us who believe that power is generally well under 50%, that publication bias is nearly 100%, and that flexibility in stopping rules, analyses, and reporting practices were used often enough to be of concern approach the published literature with skepticism about the size and reproducibility of reported effects.

This doesn’t mean that we assume every effect we encounter was found only through lots of trials of small studies that used flexible stopping, analysis, and reporting to achieve p < .05. But it does mean that we have some motivation to estimate the probabilities of these practices if we hope to make use of the literature we read.

It also means that, to a greater or lesser extent, some of us see sharing well-supported inferences about the power, extent of publication bias, and use of flexibility as a social good. Making informed inferences about these things can require a lot of work, and there’s no reason for every person researching in an area to duplicate that work. On the other hand, like every other part of science, these analyses benefit from replication, critique, and revision, so discussing them with others can make them better.

So that leaves me here: I am not confident that everything in our past is a reliable source of inference without investigation into its bias. Much as I love the idea of pressing the reboot button and starting over, I think that’s ultimately more wasteful than trying to make something of the past. I want to be able to do bias investigations, to share them with others, and to learn from the investigations others have done.

This is not about finding out who is good or bad or who is naughty or nice. This is about doing the best science I can do. And for that, I need to know how to interpret the past, which means I need a way to be able to talk about the strengths and weaknesses of the past with others.

Pretending like everything in the past is solid evidence is no longer an honest option for researchers who have accepted that small sample sizes, publication bias, and flexibility are threats to inference and parts of our research legacy. Yet, saying, “Gosh, I don’t quite believe that this study/paper/literature provides compelling evidence,” feels risky. It might be seen as an attack on the researchers (including one’s own collaborators if the research is one’s own), might be deemed uncivil, or might invite a bunch of social media backlash that would be a serious hassle and/or bummer. So Joe’s question is really important: How do we create a culture that makes this not an attack, not uncivil, and not a total bummer?

What Can We Do?

I have a few ideas. I don’t think any of them are easy, but I suspect that, like many things, the costs of doing them are likely not as high as we imagine.

  1. Stop citing weak studies, or collections of weak studies, as evidence for effects

When you think the literature supporting an idea is too weak to draw a confident inference, stop citing the literature as if it strongly supports the idea. Instead of citing the evidence, cite the ideas or hypotheses. Or stop citing the classic study you no longer trust as good evidence and cite the best study. When reviewers suggest that you omitted a classic and important finding, politely push back, explaining why your alternative citation provides better evidence.

  2. Focus on the most defensible criticism

As Jeff Sherman pointed out, it can be harder to find evidence that research makes use of flexibility than that it exhibits low power and publication bias, and an argument about flexibility feels more like a personal attack. It’s relatively easy to show that even post-hoc power (which is likely an overestimate) is low and yet every reported finding is positive. Like all evidence, this isn’t proof that power is low and that findings have been suppressed, but it’s reason to be cautious. If you can make the case for caution with power and publication bias alone, maybe don’t bring up flexibility. So long as suggesting the use of flexibility feels like a personal attack, raising it without really compelling reasons to suspect that flexible research practices were used may only weaken your case against the evidence.

That’s not to say we shouldn’t discuss research flexibility where there is good evidence for it, but I think Jeff Sherman makes another good point about such criticisms: “If I suggest that lack of double-blinding may be a problem for a study, I am specifying a particular issue. If I suggest p-hacking or researcher degrees of freedom, I am making a suggestion of unknown, unspecified monkeying around. There is a big difference.” So when suggesting that flexibility may undermine the inferences from a line of research, it’s important to be as specific about the type of flexibility and as concrete in the evidence as possible.

  3. Check yourself

Perhaps the safest place to start is with oneself. Michael Inzlicht and Michael Kraus have written about how some of their previous research shows signs of bias (and how they are changing things so that their future work shows less bias). They haven’t called out specific papers, but they’ve p-curved and TIVA’d and R-Indexed their prior papers and owned up to the fact that the work they’re doing now is better than the work they did in the past.

In admitting that their own research exhibits some forms of bias, they have opened the discussion and made it safer and easier for others to make similar admissions about themselves. Not that it was easy for them. Michael Inzlicht talks about fear, sadness, and pain in the process. But it is beautiful and brave that he not only performed the self-check anyway but went on to publish it publicly. And ultimately, he found the experience “humbling, yet gratifying.”

  4. Publish commentaries on or corrections of your previous work

I’m not going to pretend that this is at all easy or likely to be rewarded. It’s hard to remember exactly all of the studies that were run in a given research line, and, unfortunately, records may not be good enough to reconstruct that. So researchers may not know precisely the extent of publication bias in their own work. But still, for those cases where one knows that bias exists, it would benefit the entire community to admit it.

I can only think of one instance where someone has done this. Joe Hilgard wrote a blog post about a paper he had come to feel reported an unlikely finding based on (actually disclosed) flexible analyses and reporting. Vox wrote up a report complimenting Joe’s confession (and it really was brave and awesome!), but the coverage kind of gave the impression that Joe’s barely-cited paper was responsible for the collapse of the entire ego depletion literature: “All of this goes to show how individual instances of p-hacking can snowball into a pile of research that collapses when its foundations are tested.” Oops.

I doubt that that would happen to the next person who publishes a similar piece. But what will happen? One comment on Joe’s blog post asks whether he plans to retract the paper. I don’t think that’s the appropriate response to the bias in our literature, but others definitely do, so calls for retraction seem plausible. Another concern is reputation: Will you anger your friends and collaborators or develop a reputation as someone who backstabs your colleagues? If people see admitting to bias as a personal black mark, this is possible.

One way around these drawbacks is to publish a correction of a solo-authored paper or a paper authored with like-minded others. I’m on board with Andrew Gelman’s “No Retractions, Only Corrections: A manifesto”:

Maybe there should be no such thing as retraction, or maybe we could ban the word “retraction” and simply offer “corrections.” That would be fine with me. The point is never to “expunge the record,” it’s about correcting the record so that later scholars don’t take a mistaken claim as being true, or proven.

But, to the extent there are retractions, or corrections, or whatever you want to call them: Sure, just do it. It’s not a penalty or a punishment. I published corrections for two of my papers because I found that they were in error. That’s what you do when you find a mistake.

I’d love to see this opinion spread through psychology. As people who study people, psychologists know that bias happens; it’s just part of being human. Correct the record and move on. Start with thinking about the bias in your solo-authored papers. Begin talking about the idea with colleagues you already talk to about bias; warm them up to the idea of correcting their own work or your joint work. Then start leaving comments on PubPeer or on your blog or on http://psychdisclosure.org/. Or maybe even submit them as brief corrections to journals. If you’re an editor at a journal who would consider these kinds of corrections, invite them.

This is really an extension of what Michael Inzlicht and Michael Kraus have already done: start at home. By admitting our bias, we can set the example that it’s OK to have bias called out. But it can go a bit further by actually adding to the literature. If you include new data (e.g., dropped studies, conditions, or variables) or new analyses (e.g., an alternative specification of a DV), you are not just admitting bias but also contributing valuable new information that might make your correction into a meaningful paper in its own right.

  5. Publish your file-drawered studies

Make some use out of all the data you’re sitting on that was never published. You can simply post the data in an archive and make it available to meta-analysts and other researchers. You can publish it yourself as a new paper or as part of a correction. If you can’t get null or inconclusive results through traditional peer review, try an alternative outlet like the Journal of Articles in Support of the Null Hypothesis or The Winnower. The Winnower has the benefit of giving your blog post a citable DOI and pushing it through to Google Scholar. If you want to use your file drawer to make a big impact, gather all of your studies on a single topic into a publication-bias-free meta-analysis and use that to create theoretical insights and make meaningful methodological recommendations.

  6. Publish meta-scientific reviews

We already accept bias investigations in meta-analyses. Funnel plots, Egger tests, and other bias detection techniques are standard parts of meta-analysis. We are adding more and more tests to this repertoire every year.

Malte Elson brought up the idea that synthesizing whole research areas might be a more acceptable way to bring up criticisms about research flexibility, and he’s done some fantastic and detailed work cataloging flexibility in operationalizations of the CRTT. This work is specific (applies to a specific domain, a specific measure, and specific papers) but also diffuses agency across many authors. No one person is responsible for all of the flexibility, and actually attempting to figure out who has used more or less flexibility is fairly involved and just about the least interesting thing one can do with the published tools. Rather than providing, say, field-wide estimates of power, publication bias, or research flexibility, these domain-specific investigations provide the type of information needed by researchers to evaluate the papers they are using in their own work.

  7. Publicly praise and reward people who do these things

Cite corrections. Tweet and post on Facebook about how awesome people who admit bias are. Offer them jobs and promotions. If people are going to risk their reputations and relationships in trying to help others navigate the past, do everything you can to make it worth their while.

Final Thoughts

Let me be clear, doing any and all of these things is awesome, but it’s also only a beginning. Joe’s question is really about how to create a culture so that it is ok to point out specific instances of research flexibility in others’ work without ruining either one’s own or the author’s reputations. I think that admitting our own bias and examining field-wide bias will help normalize bias discussions, but they probably won’t bring us far enough.

I don’t expect everyone to make a complete catalog of their unpublished work or reveal their original planned analyses for every study they’ve ever published. Most people don’t have the time or records to do that. But we still need to be able to talk about the potential bias in their work if we want to build on it. So we have to look, and we have to talk about what we find, and it has to be ok to do that.

Some people are already doing these investigations, but my general impression is that they are not received well. I hope that talking more about bias in ourselves and in general will bring us closer to the goal of discussing specific cases of bias, but I wonder whether there is more we can do to get us there faster.

Article Level Metrics and Many Labs Replication Outcomes

Update: I’ve edited this page slightly for clarity and proofreading and to correct an error. Before doing so, I archived the original version of the post. You can see the revision history at the Internet Archive.

This blog post provides additional details and analyses for the poster I am presenting at the 2016 meeting of the Society for Personality and Social Psychology. If you’ll be at SPSP and want to chat, come by my poster during Poster Session E on Friday at 12-1:30pm. I’ll be at poster board 258.

Introduction

Over the past several years, psychologists have become increasingly concerned about the quality and replicability of the research in their field. To what extent are the findings reported in psychology journals “false positives” (reports of effects where none truly exist)? Researchers have attempted to answer this question using two approaches: by replicating previous research and by developing a series of research quality metrics that attempt to quantify the evidential value, replicability, power, and bias of the research literature.

Replication

As part of this movement, a wave of replication studies has been published, including several large-scale projects. The results of these projects have been mixed. The Many Labs 1 (ML1) project involved 36 labs all running replications of the same 13 effects (16 effects, if you count the four anchoring effects separately). In aggregate analyses on data from all of the labs, the authors found that only two failed to reject the null hypothesis of no effect. Many Labs 3 (ML3), following a similar model, attempted to replicate 10 effects (plus 3 post-hoc additions of effects from three of the replicated studies) in 21 samples. This time, aggregate analyses of the 10 planned effects failed to reject the null hypothesis of no effect for seven of the effects. (Many Labs 2 is still in progress). The Reproducibility Project: Psychology took a different approach to replication, selecting many effects from specific journals and replicating each in a single lab. Out of 97 replications, 62 failed to reject the null hypothesis of no effect, a similar rate to ML3. However, unlike ML3, these analyses were not based on large, aggregated samples. Across these three projects, in general, effect sizes shrank from original to replication. The overall replicability of psychological science remains unknown (and may not be a well-defined or readily quantifiable concept); however, it is clear that some effects can be observed relatively regularly and in many settings while others are difficult to observe, even with many subjects and carefully constructed protocols.

Metrics

At the same time, concerns about research methods that inflate false positive rates and about the effect of publication bias on the veracity of reported research (as well as increasing awareness that traditional meta-analyses are threatened by publication bias) have driven researchers to develop new techniques for evaluating the literature. The p-curve, for example, tests the evidential value of a set of studies by looking at its distribution of p-values: the shape of that distribution changes depending on whether the tested effect is true or false (and on the power of the test). The Replication Index (R-Index) attempts to quantify the replicability of a set of studies based on their post-hoc power (their power to detect an effect of the observed size). The Test of Insufficient Variance (TIVA) examines publication bias by asking whether the variance in p-values (converted to z-scores) is smaller than would be expected, suggesting that some results have been censored. A negative correlation between sample size and effect size may be taken as an indication of suppressed null results: smaller samples produce significant results only when the observed effect size is large, yet they are less powerful at detecting effects at all, so if all the small-sample studies report large effects, there may be lots of non-significant results missing from the literature. The N-Pact factor attempts to quantify the power of a set of studies by looking at its median sample size, on the assumption that more powerful studies produce more reliable estimates. These tests and indices have become popular tools for evaluating the quality of research output. Researchers are using these metrics to examine the quality of journals, of their own work, and of the papers published by others.
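To make the logic of one of these tests concrete, here is a minimal sketch of a TIVA-style check, assuming a set of reported two-tailed p-values (my own illustration, not any official implementation): convert each p-value to a z-score and ask whether the variance of those z-scores falls suspiciously far below 1.

```python
# Minimal sketch of a TIVA-style check (illustration only, not an official
# implementation): convert two-tailed p-values to z-scores and see whether
# their variance is suspiciously small (< 1), which would suggest that
# nonsignificant results were censored.
from statistics import NormalDist, variance

def tiva_variance(p_values):
    """Variance of the z-scores implied by a set of two-tailed p-values."""
    z = [NormalDist().inv_cdf(1 - p / 2) for p in p_values]
    return variance(z)  # sample variance; values well below 1 hint at censoring

# Hypothetical paper whose p-values all cluster just under .05
print(tiva_variance([0.04, 0.03, 0.045, 0.02, 0.035]))  # well below 1
```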

Their Relationship to One Another

I’ve heard colleagues dismiss papers because they contain mostly high p-values, low variance in p-values, small sample sizes, or sample sizes correlated with effect sizes. Papers with these characteristics are perceived as unreplicable and unsound. I’ve made such inferences myself. Yet, most of these metrics are not explicitly designed to index replicability. So are these judgments about replicability justified? Do research quality metrics at the article level predict replication outcomes?

Intuitively, it makes sense to think they would. Assuming papers generally conclude that effects are real, sets of studies based on large samples, with adequate power, and not exploiting flexibility in analysis seem like they ought to contain more replicable results than sets of studies based on small samples with flexible analysis plans and low power.

But there are some reasons that these metrics may not be predictive in practice. Optimistically, if researchers are doing a bang-up job of powering their studies, we would actually observe a (perhaps modest) negative correlation between effect size and sample size, because no one would be wasting huge samples on huge effects or bothering with tiny samples for tiny effects. Pessimistically, if the literature is extremely biased or extremely underpowered, the metrics may have no predictive power in the present literature at all. For example, when power is very low, the distribution of p-values becomes fairly flat. (You can observe this for yourself here. Try an effect size of d = 0.3 and sample size of n = 20. This is also shown in the first p-curve paper.) Such a literature would also have small sample sizes and low post-hoc power regardless of whether effects are true, and thus little variability in the metrics. And we would expect poor replicability, even for true effects, if sample sizes for replications were not sufficiently larger than the original, dismally powered sample. So even if some of the metrics should predict replicability in principle, they may not do so in practice.
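If you don’t have the app handy, a quick simulation makes the same point. This sketch (my own, with made-up data) draws many two-group studies with d = 0.3 and n = 20 per group and keeps only the significant results:

```python
# Quick simulation of the point above (my own sketch): with d = 0.3 and
# n = 20 per group, the significant p-values are spread fairly evenly across
# 0 to .05, unlike the pile-up below .01 produced by a well-powered design.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n, sims = 0.3, 20, 20_000

sig_p = []
for _ in range(sims):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(d, 1.0, n)
    p = stats.ttest_ind(treatment, control).pvalue
    if p < .05:                      # keep only the "publishable" results
        sig_p.append(p)

# Share of significant p-values in each .01-wide bin from .00 to .05
counts, _ = np.histogram(sig_p, bins=np.arange(0, 0.051, 0.01))
print(np.round(counts / len(sig_p), 2))  # noticeably flatter than under high power
```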

To test the predictive power of these metrics at the article level, I asked whether they are related to outcomes in ML1 and ML3. Why ML1 and ML3? I hoped that their large samples would yield fairly reliable outcome measures, and I wanted to have enough effects to have a hope of detecting a relationship, so using just one of them would be inadequate. If all the effects were usable, I would have a sample of 23 effects. Not very big, but powered at 80% to detect a correlation of r = .55. That might be optimistic, and it would be better to have even more effects included. But if I am going to use these metrics to make dismissive judgments about the replicability of effects in specific articles, I’d hope that the relationship is fairly strong.

Method

Operationalizing Replicability

There are many ways to operationalize replicability. For example, one might consider whether the original study had sufficient power to detect the observed effect size of the replication (Simonsohn’s [2015] “small telescopes”). Or whether the replication effect size falls within a prediction interval based on the original and replication effect sizes. I decided to focus on the two outcomes I hear most often discussed:

  1. Difference in effect size (continuous): How much the replication effect size (converted to Cohen’s d) differed from the original effect size (converted to the same scale).
  2. Replication success (dichotomous): Whether the replication rejected the null hypothesis of no effect at p < .05.

These are not perfect outcome operationalizations, but I believe they represent the ways many people evaluate replications. Since the research question was driven by the kinds of judgments I and others seem to be making, it made sense to me to operationalize replication outcomes in ways that appear to be common in replication discourse.
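For concreteness, here is a small sketch of how these two outcomes could be computed once effect sizes are on a common scale. The t-to-d conversion shown is the standard approximation for an equal-n between-subjects design, not necessarily the exact conversion used for every effect in this project, and all numbers are hypothetical.

```python
# Sketch of the two replication outcomes (illustrative; the t-to-d conversion
# is the standard approximation for an equal-n between-subjects design, and
# all numbers are hypothetical).
import math

def cohens_d_from_t(t, df):
    """Approximate Cohen's d from an independent-samples t-test."""
    return 2 * t / math.sqrt(df)

def replication_outcomes(d_original, d_replication, p_replication):
    return {
        "es_difference": d_replication - d_original,  # continuous outcome
        "success": p_replication < .05,               # dichotomous outcome
    }

d_orig = cohens_d_from_t(t=2.3, df=38)  # hypothetical original study
print(replication_outcomes(d_orig, d_replication=0.12, p_replication=0.21))
```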

Selecting Predictors

I again decided to use the metrics that I hear discussed frequently as predictors. These are the same metrics mentioned in the Introduction:

  1. P-Curve: Evidential Value: Test statistic (z) for evidential value of a set of studies based on p-values. Z-scores less than -1.64 indicate evidential value.
  2. P-Curve: Lacks Evidential Value: Test statistic (z) for a lack of evidential value (power less than 33%) of a set of studies based on p-values. Z-scores less than -1.64 indicate a lack of evidential value.
  3. R-Index: The difference between median post-hoc power of a set of studies and the “inflation” in the studies. Inflation is defined as the proportion of significant results minus the expected proportion of significant results. Higher values indicate greater replicability.
  4. Test of Insufficient Variance (TIVA): The variance in the converted z-scores of test statistics. For heterogeneous sets of studies (i.e., studies with different sample sizes or different methods), variance should be greater than 1. Variance less than 1 indicates that some studies have been censored.
  5. Correlation Between Effect Size and N: Pearson correlation between the observed effect sizes and sample sizes in a paper. Negative correlations may indicate publication bias.
  6. N-Pact Factor: Median sample size of included tests. Higher values generally indicate greater power.
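To make definitions 3, 5, and 6 concrete, here is a minimal sketch of how they could be computed from a paper’s test statistics and sample sizes (my own illustration of the definitions above, not the code used for the poster; the example values are made up).

```python
# Sketch of predictors 3 (R-Index), 5 (ES-N correlation), and 6 (N-Pact),
# computed from a paper's test statistics and sample sizes. Illustration of
# the definitions above only; example values are hypothetical.
from statistics import NormalDist, correlation, median  # correlation needs Python 3.10+

def observed_power(z, alpha=.05):
    """Post-hoc power of a two-tailed z test, treating |z| as the noncentrality."""
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    return 1 - NormalDist().cdf(crit - abs(z)) + NormalDist().cdf(-crit - abs(z))

def r_index(z_scores, alpha=.05):
    crit = NormalDist().inv_cdf(1 - alpha / 2)
    power = median(observed_power(z, alpha) for z in z_scores)
    success_rate = sum(abs(z) > crit for z in z_scores) / len(z_scores)
    inflation = success_rate - power      # observed minus expected successes
    return power - inflation

# Hypothetical four-study paper
zs = [2.1, 2.0, 2.4, 2.2]      # test statistics converted to z
ns = [24, 30, 22, 28]          # sample sizes
ds = [0.88, 0.74, 1.07, 0.85]  # effect sizes (Cohen's d)

print(r_index(zs))             # R-Index
print(correlation(ds, ns))     # correlation between effect size and N
print(median(ns))              # N-Pact factor
```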

I followed Simonsohn, Nelson, and Simmons’s (2015) recommendations for the inclusion of tests, using only tests of critical hypotheses. The p-curve disclosure table is available here as dataEntrySheet.csv.

Because many of these indices are based on the same information (test statistics, p-values, sample sizes), we can expect them to be correlated. For this reason, I decided to evaluate them individually. Models including multiple predictors might exhibit multicollinearity and be unsuitable.

Sample

ML1 and ML3 replicate effects from 22 articles:

| ML | Effect Name | Description | Citation |
|---|---|---|---|
| 1 | Anchoring (4 effects) | People’s quantitative judgments are biased after seeing too large or too small estimates | Jacowitz, K. E., & Kahneman, D. (1995). Measures of anchoring in estimation tasks. Personality and Social Psychology Bulletin, 21, 1161–1166. http://dx.doi.org/10.1177/01461672952111004 |
| 1 | Allowed/Forbidden | People are less likely to endorse banning anti-democracy speeches than to fail to endorse allowing them | Rugg, D. (1941). Experiments in wording questions: II. The Public Opinion Quarterly, 5, 91–92. http://dx.doi.org/10.1086/265467 |
| 1 | Retrospective gambler fallacy | People think that a rare outcome is from a longer series of events than a more common outcome | Oppenheimer, D. M., & Monin, B. (2009). The retrospective gambler’s fallacy: Unlikely events, constructing the past, and multiple universes. Judgment and Decision Making, 4, 326–334. |
| 1 | Gain vs loss framing | People are more willing to take risks to avoid losses than to procure gains | Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science, 211, 453–458. http://dx.doi.org/10.1126/science.7455683 |
| 1 | Sex differences in implicit math attitudes (and relation between I & E attitudes) | Women have more negative implicit math attitudes than men / Implicit and explicit math attitudes are positively correlated | Nosek, B. A., Banaji, M. R., & Greenwald, A. G. (2002). Math = male, me = female, therefore math ≠ me. Journal of Personality and Social Psychology, 83, 44–59. http://doi.org/10.1037//0022-3514.83.1.44 |
| 1 | Low vs high category scales | People report watching more TV when the response scale ranges from “up to half an hour” to “more than two and a half hours” than when it ranges from “up to two and a half hours” to “more than four and a half hours” | Schwarz, N., Hippler, H.-J., Deutsch, B., & Strack, F. (1985). Response scales: Effects of category range on reported behavior and comparative judgments. The Public Opinion Quarterly, 49, 388–395. http://doi.org/10.1086/268936 |
| 1 | Quote Attribution | People endorse a quotation more strongly when it is attributed to a liked rather than a disliked figure | Lorge, I., & Curtiss, C. C. (1936). Prestige, suggestion, and attitudes. The Journal of Social Psychology, 7, 386–402. http://doi.org/10.1080/00224545.1936.9919891 |
| 1 | Norm of reciprocity | People are more likely to say that foreign reporters should be allowed into their home country after first being asked whether a foreign country should allow reporters from their country | Hyman, H. H., & Sheatsley, P. B. (1950). The current status of American public opinion. In J. C. Payne (Ed.), The teaching of contemporary affairs: 21st yearbook of the National Council of Social Studies (pp. 11–34). New York, NY: National Council of Social Studies. |
| 1 | Sunk Costs | People are more likely to brave the cold to use a ticket they bought vs. one that was free | Oppenheimer, D. M., Meyvis, T., & Davidenko, N. (2009). Instructional manipulation checks: Detecting satisficing to increase statistical power. Journal of Experimental Social Psychology, 45, 867–872. http://doi.org/10.1016/j.jesp.2009.03.009 |
| 1 | Imagined contact | Imagining contact with people from different ethnic groups reduces prejudice towards those groups | Husnu, S., & Crisp, R. J. (2010). Elaboration enhances the imagined contact effect. Journal of Experimental Social Psychology, 46, 943–950. http://doi.org/10.1016/j.jesp.2010.05.014 |
| 1 | Flag Priming | People report more conservative attitudes after subtle exposure to the US flag than after no exposure | Carter, T. J., Ferguson, M. J., & Hassin, R. R. (2011). A single exposure to the American flag shifts support toward Republicanism up to 8 months later. Psychological Science, 22, 1011–1018. http://doi.org/10.1177/0956797611414726 |
| 1 | Currency Priming | People endorse the status quo more strongly after exposure to an image of money than after no exposure | Caruso, E. M., Vohs, K. D., Baxter, B., & Waytz, A. (2012). Mere exposure to money increases endorsement of free-market systems and social inequality. Journal of Experimental Psychology: General, 142, 301–306. http://doi.org/10.1037/a0029288 |
| 3 | Stroop | People are slower to name the font color of a color word when the word names a different color than when it names the same color | Stroop, J. R. (1935). Studies of interference in serial verbal reactions. Journal of Experimental Psychology, 18, 643–662. http://dx.doi.org/10.1037/h0054651 |
| 3 | Metaphoric Restructuring | People interpret an ambiguous temporal statement differently depending on whether they had just completed a spatial task in a frame of reference involving an object and a stick figure labeled “you” or involving only objects | Boroditsky, L., & Ramscar, M. (2002). The roles of body and mind in abstract thought. Psychological Science, 13, 185–189. http://doi.org/10.1111/1467-9280.00434 |
| 3 | Availability Heuristic | People overestimate the frequency of words starting with a given letter (rather than words where the letter is in the third place) because they are easier to recall | Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5, 207–232. http://doi.org/10.1016/0010-0285(73)90033-9 |
| 3 | Persistence and Conscientiousness | Persistence as measured by one personality measure is positively correlated with conscientiousness as measured by another | De Fruyt, F., & Van De Wiele, L. (2000). Cloninger’s psychobiological model of temperament and character and the five-factor model of personality. Personality and Individual Differences, 29, 441–452. http://doi.org/10.1016/S0191-8869(99)00204-4 |
| 3 | Power and perspective-taking | People made to feel high in power perform worse on a perspective-taking task | Galinsky, A. D., Magee, J. C., Inesi, M. E., & Gruenfeld, D. H. (2006). Power and perspectives not taken. Psychological Science, 17, 1068–1074. http://doi.org/10.1111/j.1467-9280.2006.01824.x |
| 3 | Weight embodiment | People judge an issue as more important when holding a heavier (rather than lighter) clipboard | Jostmann, N. B., Lakens, D., & Schubert, T. W. (2009). Weight as an embodiment of importance. Psychological Science, 20, 1169–1174. http://doi.org/10.1111/j.1467-9280.2009.02426.x |
| 3 | Warmth perceptions | People judge a room to be warmer after reading about a communal rather than an agentic person | Szymkow, A., Chandler, J., IJzerman, H., Parzuchowski, M., & Wojciszke, B. (2013). Warmer hearts, warmer rooms: How positive communal traits increase estimates of ambient temperature. Social Psychology, 44, 167–176. http://dx.doi.org/10.1027/1864-9335/a000147 |
| 3 | Elaboration likelihood model | People high in need for cognition differ more in their judgments of the persuasiveness of strong and weak arguments than people low in need for cognition | Cacioppo, J. T., Petty, R. E., & Morris, K. J. (1983). Effects of need for cognition on message evaluation, recall, and persuasion. Journal of Personality and Social Psychology, 45, 805–818. http://doi.org/10.1037/0022-3514.45.4.805 |
| 3 | Self-Esteem and subjective distance | People high in self-esteem judge past positive and negative events as more different in subjective temporal distance than participants with low self-esteem | Ross, M., & Wilson, A. E. (2002). It feels like yesterday: Self-esteem, valence of personal past experiences, and judgments of subjective distance. Journal of Personality and Social Psychology, 82, 792–803. http://doi.org/10.1037//0022-3514.82.5.792 |
| 3 | Credentials and prejudice | People are more likely to express prejudiced attitudes when they first have an opportunity to show that they are not prejudiced | Monin, B., & Miller, D. T. (2001). Moral credentials and the expression of prejudice. Journal of Personality and Social Psychology, 81, 33–43. http://dx.doi.org/10.1037/0022-3514.81.1.33 |

Exclusions

Unfortunately, out of the 22 articles, several were excluded from the analyses.

One of the articles (Norm of reciprocity; Hyman & Sheatsley, 1950) was not available online or from the ML1 team. I found a catalog entry for the book at my library, but upon taking it out found it was the wrong volume of the series. I have an interlibrary loan request pending for the correct volume, but it has not yet been delivered. Since complete information for this article could not be gathered, it was excluded.

I next excluded any articles that reported only a single study: Anchoring, Allowed/Forbidden, Quote Attribution, and Persistence and conscientiousness. Some of the metrics (e.g., p-curve, correlation between effect size and sample size) require multiple studies to compute.

One further paper was excluded. The Sunk Costs effect originated in a paper (Thaler, 1985) that does not report null-hypothesis significance testing and so cannot be used for most of the metrics. The protocol used in the replication was derived from work by Oppenheimer and colleagues (2009), reported in an article testing the utility of instructional manipulation checks for studies conducted on Mechanical Turk. This paper is not suitable for inclusion because the remaining studies in the paper do not address the same theoretical question as the replicated effect.

This left 16 papers in the final sample, all of which could be evaluated on the dichotomous outcome. However, the original papers reporting the three effects involving interactions (Elaboration Likelihood Model, Self-Esteem and Subjective Distance, and Credentials and Prejudice) did not report enough information (or did not have appropriate designs) to estimate effect sizes in Cohen’s d. Thus, these three effects were excluded from analyses involving the continuous outcome, reducing the number of included effects for those analyses to 13.

Papers with Multiple Replicated Effects

Two of the sampled papers (Jacowitz & Kahneman, 1995; Nosek et al., 2002) had multiple effects included in the replication studies. Multiple outcomes from the same article should not be evaluated separately, as their predictor metrics are not independent. I decided to average the outcomes for effects that test the same theoretical relationship and to focus on the effect that was the primary theoretical target of the original paper when the effects tested different theoretical relationships. This meant that the Anchoring effects were averaged and the implicit/explicit attitude correlation was excluded from analyses.

All of the data and code for data manipulation and primary analyses are on Github. Please feel free to ask me questions, open issues, send pull requests, or ask for files in other formats. I am happy to share.

The Data

Before I dive into results, I want to show some important features of the data. First, let’s look at the continuous outcome. The figure below shows original effect sizes on the left and replication effect sizes on the right, for those papers reporting enough information to calculate Cohen’s d. Lines connect the effect sizes from the same effects. There’s a filter that allows you to show only those effects included in the analyses (i.e., omitting those from papers that contained only one study or which I couldn’t access).

Try switching from showing all effects to showing only those whose original articles are included in the analyses. Notice that just about all of the effects whose size grew from original to replication are excluded. This might be problematic. How often do effect sizes grow from original to replication, and how do the original papers reporting such studies differ from other papers? Unfortunately, there is not enough information here to answer these questions. But perhaps there is bias in the continuous outcome in our remaining studies.

Next, we should look at the distributions of the predictors. The figure below allows you to select which predictor you want to see and to switch among the full set of 16 effects, the set of 13 effects included in the effect size analyses, and the set of 3 effects excluded from those analyses.

One thing you might notice is that for most metrics, the scores are clustered at the “worse” end. The papers in this sample mostly have low power, small sample sizes, and low variance in their z-scores. This would be consistent with a fairly biased or underpowered literature. So low predictive validity of these metrics would not be very surprising, given the small number of papers and the low variability in the predictors, and models based on these metrics should be treated with some suspicion.

Results

Ok, now that we’ve looked at the data a bit and considered the ways in which they are less-than-ideal (small sample, perhaps unrepresentative on at least one outcome, and low variability in predictors), we should be pretty well-prepared for the results.

Difference in Effect Size

For this blog post, I’ve used the absolute value of the difference between replication and original effect sizes rather than the raw difference. Since only one of the effects has a positive difference, and it’s a change of d = .01, this changes only the signs of the relationships, not their sizes. It just makes the plots and correlations a bit more intuitive to interpret: more positive values indicate greater disparity between original and replication results. The poster uses the raw difference scores (replication – original), if you want to compare.

Below is a table of correlations. The first column represents correlations with effect size difference. The degrees of freedom for all of these correlations are 11. The correlations in the other cells include all 16 of the effects and so have 14 degrees of freedom.

|   | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|
| 1. Difference in Effect Size |   |   |   |   |   |   |
| 2. P-Curve: Evidential Value | -.44 |   |   |   |   |   |
| 3. P-Curve: Lacks Evidential Value | .45 | -.99 |   |   |   |   |
| 4. R-Index | .46 | -.89 | .89 |   |   |   |
| 5. TIVA: Variance of Z | -.04 | -.37 | .39 | .35 |   |   |
| 6. Correlation between ES and N | .49 | -.37 | .37 | .42 | .40 |   |
| 7. N-Pact Factor: Median N | -.14 | .06 | -.03 | .15 | .27 | -.06 |

Overall, the correlations with effect size difference are moderately strong, but notice that some of them are in the opposite direction from what one would predict. Z-scores for evidential value decrease as effect size discrepancy increases. But negative z-scores indicate evidential value, so we would expect higher z-scores to be associated with greater discrepancy. For the other p-curve metric, negative z-scores indicate a lack of evidential value, but the correlation is positive: instead of discrepancy rising with lack of evidential value, it falls. Higher R-Index values should predict better replicability, but here they are associated with larger effect size differences. The last two correlations are in the predicted direction. As the article-level correlation between effect size and sample size increases (as potential publication bias increases), so does the difference between original and replication outcomes. And as the N-Pact factor (power) increases, discrepancy decreases (a bit).

But our predictors didn’t have much variability. Let’s see what these relationships look like plotted:

Plot of Difference Between Replication and Original Effect Sizes by Bias Metrics

We can see that only a few papers have p-curve z-scores much different from zero, or have variance in the z-scores of their studies (TIVA) greater than 1, or have sample sizes larger than 100. In general, the correlations are driven by just a few points that are outliers in their distributions.

Ok, let’s look quickly at the dichotomous outcome, with “successful” (p < .05) replications coded as 1 and “unsuccessful” (p >= .05) coded as 0. Here are the results of six different logistic regression models predicting this outcome:

| Predictor | b | SE | z | p | OR |
|---|---|---|---|---|---|
| 1. P-Curve: Evidential Value | -0.23 | 0.18 | -1.31 | .19 | 0.79 |
| 2. P-Curve: Lacks Evidential Value | 0.23 | 0.18 | 1.33 | .18 | 1.26 |
| 3. R-Index | 0.93 | 1.64 | 0.56 | .57 | 2.53 |
| 4. TIVA: Variance in Z | 0.14 | 0.58 | 0.25 | .80 | 1.16 |
| 5. Correlation Between Effect Size and N | 0.12 | 0.66 | 0.19 | .85 | 1.13 |
| 6. N-Pact Factor | -0.00 | 0.01 | -0.48 | .63 | 1.00 |
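For reference, here is a sketch of how univariate logistic models like these could be fit (illustrative only: the data frame below is a toy stand-in, not the project’s real data, and the column names are placeholders).

```python
# Sketch of univariate logistic regressions like those in the table above
# (illustrative only: this toy data frame is a stand-in for the real data,
# and the column names are placeholders).
import math
import pandas as pd
import statsmodels.formula.api as smf

toy = pd.DataFrame({
    "replication_success": [1, 0, 0, 1, 0, 1, 0, 0],  # 1 = replication p < .05
    "r_index":             [0.6, 0.4, 0.2, 0.3, 0.7, 0.5, 0.3, 0.2],
    "n_pact":              [120, 35, 150, 40, 60, 80, 45, 30],
})

for predictor in ["r_index", "n_pact"]:
    fit = smf.logit(f"replication_success ~ {predictor}", data=toy).fit(disp=0)
    b, se = fit.params[predictor], fit.bse[predictor]
    print(f"{predictor}: b = {b:.2f}, SE = {se:.2f}, OR = {math.exp(b):.2f}")
```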

Now, the p-curve relationships are in the expected direction. As is R-Index. And TIVA. We’ll plot the predicted probability of success.

Plot of Predicted Probability of Success from Six Bias Metrics

We can see that, at least, the p-curve values most strongly indicating evidential value are associated with replication successes. But there are also a couple of replication “failures” that come from papers with evidential value z-scores above 1.64, i.e., where there is evidential value in that set of studies (but not necessarily in the replicated study). And there are a few successes that come from papers that do not have evidential value (though there are none from the two papers that significantly lack evidential value). So while p-curve performs the best at predicting replication success from article-level information in this set of studies, it’s definitely not predicting all of the outcomes.

Some Closing Thoughts

I’m very grateful for the attention this project has received and for the many helpful comments and thought-provoking questions people have sent on Twitter, Facebook, email, and Github. I hope that this blog post has clarified many points about methodology and data quality. There is a lot to be desired here, and I look forward to expanding this project to make it more useful.

In particular, as a few people have suggested, it’s important to have more data. Adding replications from the Reproducibility Project, a special issue of Social Psychology, the soon-to-be-released Many Labs 2, and perhaps other sources will hopefully increase the sample size, the reliability of the estimates, and the power to detect relationships.

In addition, there may be other outcomes that would be useful. Perhaps whether the original study had the power to detect the replication effect. Or, for Many Labs studies, the proportion of labs returning significant results. That said, no replication outcome metric can replace informed scientific judgment. If a replication is poorly conducted or has low power, we should not evaluate the replicability of the effect in the same way as when the replication is competently conducted and high powered. Likewise, if an original study used invalid manipulations or measures, what does its replicability matter?

It also remains possible that article level metrics don’t predict replication outcomes for the current psychological literature. If the literature is extremely low-powered and biased, these metrics may be capable of telling us only that the literature is low-powered and biased but not which effects are likely to replicate. Or there may be too little information contained in single papers to predict replication outcomes of their studies. Perhaps author- or journal-level metrics would be better indicators, though certainly harder to assess off-the-cuff.

For now, I would probably make three recommendations. First, I would apply caution in using these metrics to make judgments about the replicability of studies from single papers; it’s not clear that they can diagnose the replicability of studies in this way. Second, when reporting studies, include enough information to aid meta-analytic and meta-scientific research: full model specifications, cell sizes, and test statistics (if you’re using frequentist statistics, please report more than a p-value). Finally, conduct more replications.