Psychologists have been talking about a research practice that goes something like this: I have a hypothesis that people are happier after they listen to Taylor Swift’s “Shake It Off” than after they listen to that Baz Luhrmann song about sunscreen. So I play “Shake It Off” to some people and “Everybody’s Free to Wear Sunscreen” to some other people. Then, I ask everyone how happy they are. I see that the people who listened to Taylor Swift rated themselves a little higher on my happiness scale than the people who listened to Baz Luhrmann. But this difference isn’t statistically significant.
So I play each of the songs to a few more people. Then, I pool my new data with the data from before and run my statistical test again. Now the difference is significant! I have something I can publish!
This is one form of “p-hacking,” or running multiple statistical tests in order to get a significant result where there wasn’t one before. A while ago, Ryne Sherman wrote an R function that simulates this process. The details of it are over at his blog. His simulations showed that, as expected, determining sample size by looking intermittently at the data increases false positives when there’s no real difference between the groups. I’ll be using his function to look at what happens when my hypotheses are correct.
But first, just to demonstrate how it works, let’s take a look at what happens when there really is no difference between groups.
For my simulations, I am starting with 30 participants per condition, comparing the groups with a two-sided t-test, and adding 30 more per condition each time I find p >= .05, up to a maximum of 270 per condition. Then, I’m repeating the study 9,999 more times.
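To make that stopping rule concrete, here is a minimal sketch of a single simulated study, written from scratch just for illustration (the function name one_study and its arguments are mine, not Sherman’s; his phack() function wraps this kind of logic and repeats it many times):
one_study <- function(mu1 = 0, mu2 = 0, sd = 1,
                      initialN = 30, hackrate = 30, maxN = 270, alpha = .05) {
  # Start with initialN participants per condition
  g1 <- rnorm(initialN, mu1, sd)
  g2 <- rnorm(initialN, mu2, sd)
  p <- t.test(g1, g2)$p.value            # two-sided by default
  # Keep adding hackrate more per condition while p >= alpha, up to maxN per condition
  while (p >= alpha && length(g1) < maxN) {
    g1 <- c(g1, rnorm(hackrate, mu1, sd))
    g2 <- c(g2, rnorm(hackrate, mu2, sd))
    p <- t.test(g1, g2)$p.value          # peek at the data again
  }
  c(p = p, n.per.cell = length(g1))
}
set.seed(4)
one_study()  # one run under the null; the simulations below repeat this 10,000 times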
Here’s what happens when the null hypothesis is true (people are just as happy after Taylor Swift as after Baz Luhrmann):
source("http://rynesherman.com/phack.r") # read in Ryne Sherman's function
set.seed(4)
res.null <- phack(initialN=30,
hackrate=30,
grp1M=0, # Group 1 has a mean of 0
grp2M=0, # So does Group 2
grp1SD=1, # Group 1 has an SD of 1
grp2SD=1, # So does Group 2
maxN=270,
alpha=.05,
alternative="two.sided",
graph=FALSE,
sims=10000)
## Loading required package: psych
## Proportion of Original Samples Statistically Significant = 0.049
## Proportion of Samples Statistically Significant After Hacking = 0.1898
## Probability of Stopping Before Reaching Significance = 0.819
## Average Number of Hacks Before Significant/Stopping = 6.973
## Average N Added Before Significant/Stopping = 209.19
## Average Total N 239.19
## Estimated r without hacking 0
## Estimated r with hacking 0
## Estimated r with hacking 0 (non-significant results not included)
The first line of the output tells me what proportion of times my first batch of 60 participants (30 per cell) was significant. As expected, it’s 5% of the time.
The second line tells me what proportion of times I achieved significance overall, including when I added more batches of 60 participants. That’s a much higher number, 19%.
Wow. I can nearly quadruple my hit rate just by looking at the data intermittently! One in five studies now returns a hit.
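(That “nearly quadruple” is just the ratio of the two proportions printed above.)
0.1898 / 0.049  # roughly 3.9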
The Average Total N is the average number of participants I ran per cell before I stopped collecting data. It’s about 239. If I am collecting data on Mechanical Turk, getting 239 people to listen to Taylor Swift and 239 to listen to Baz Luhrmann is a cakewalk. I could collect hits very easily by running tons of studies on mTurk. I’d be very productive (in terms of publication count) this way. But all of my “hits” would be false positives, and all of my papers would be reporting on false findings.
But what about when the null is false?
The first simulation assumed that there really is no difference between the groups. But I probably don’t really think that is true. More likely, I think there is a difference. I expect the Taylor Swift group to score higher than the Baz Luhrmann group. I don’t know how much higher. Maybe it’s a small effect, d = .2.
So, what happens when people really are happier listening to Taylor Swift?
set.seed(4)
res.small <- phack(initialN=30,
hackrate=30,
grp1M=.2, # Group 1 now has a mean of .2
grp2M=0,
grp1SD=1,
grp2SD=1,
maxN=270,
alpha=.05,
alternative="two.sided",
graph=FALSE,
sims=10000)
## Proportion of Original Samples Statistically Significant = 0.1205
## Proportion of Samples Statistically Significant After Hacking = 0.744
## Probability of Stopping Before Reaching Significance = 0.3006
## Average Number of Hacks Before Significant/Stopping = 4.4569
## Average N Added Before Significant/Stopping = 133.707
## Average Total N 163.707
## Estimated r without hacking 0.1
## Estimated r with hacking 0.14
## Estimated r with hacking 0.17 (non-significant results not included)
Holy hit rate, Batman! Now I’m seeing p < .05 almost 75% of the time! And this time, they are true positives!
Sure, my effect size estimate is inflated if I publish only my significant results, but I am generating significant findings at an outstanding rate.
Not only that, but I’m stopping on average after 164 participants per condition. How many participants would I need to have 75% success if I only looked at my data once? I need a power analysis for that.
library(pwr)
pwr.t.test(d = .2,
sig.level = 0.05,
power = .75,
type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 347.9784
## d = 0.2
## sig.level = 0.05
## power = 0.75
## alternative = two.sided
##
## NOTE: n is number in *each* group
348 participants per condition!! That’s more than twice as many! Looking intermittently is MUCH more efficient. My Taylor Swift = happiness paper is going to press really quickly!
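(The “more than twice as many” is just the ratio of the fixed-design n to the average sequential N from the simulation above.)
347.9784 / 163.707  # a bit over 2.1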
What if the effect were moderate, say d = .4? Here are the simulations:
set.seed(4)
res.moder <- phack(initialN=30,
hackrate=30,
grp1M=.4, # Group 1 now has a mean of .4
grp2M=0,
grp1SD=1,
grp2SD=1,
maxN=270,
alpha=.05,
alternative="two.sided",
graph=FALSE,
sims=10000)
## Proportion of Original Samples Statistically Significant = 0.3348
## Proportion of Samples Statistically Significant After Hacking = 0.9982
## Probability of Stopping Before Reaching Significance = 0.005
## Average Number of Hacks Before Significant/Stopping = 1.424
## Average N Added Before Significant/Stopping = 42.72
## Average Total N 72.72
## Estimated r without hacking 0.2
## Estimated r with hacking 0.24
## Estimated r with hacking 0.24 (non-significant results not included)
BOOM!! Batting a thousand! (Ok, .998, but that’s still really good!!)
And with only 73 participants per condition!
I’m rolling in publications! I can’t write fast enough to publish all these results.
And what would I have to do normally to get 99.8% success?
pwr.t.test(d = .4,
sig.level = 0.05,
power = .998,
type = "two.sample",
alternative = "two.sided")
##
## Two-sample t test power calculation
##
## n = 293.5578
## d = 0.4
## sig.level = 0.05
## power = 0.998
## alternative = two.sided
##
## NOTE: n is number in *each* group
Dang. That’s FOUR TIMES as many participants. Looking at the data multiple times wins again.
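Pulling the two comparisons together (these are just the rounded numbers from the outputs above, not a new analysis):
# Average N per cell when peeking every 30 vs. the fixed n per cell that
# pwr.t.test says I would need for the same overall hit rate
data.frame(d            = c(.2, .4),
           power        = c(.744, .998),
           n.sequential = c(164, 73),
           n.fixed      = c(348, 294))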
But am I really right all the time?
So looking at my data intermittently is actually a super effective way to reach p < .05 when I have even small true effects.[1] It could lead to faster research, more publications, and less participant time used! Those are substantial benefits. On the downside, I would get to play “Shake It Off” for fewer people.
Looking at data multiple times makes it easier to get true positives.
And I’m only studying true effects, right?
Probably not.
p-hacking only seems like a problem if I accept that I might be studying false effects.[2] Which I almost certainly am. At least some of the time.
But the problem is that I don’t know ahead of time which hypotheses are true or false. That’s why I am doing research to begin with.
It also seems that when I am studying true effects, and I am willing to collect large-ish samples,[3] intermittent looking should yield a high hit rate. And I should be able to get that rate without doing anything else, such as dropping conditions, to reach my desired p-value.[4] If I am looking at my data intermittently, a low hit rate should make me consider that my hypothesis is wrong, or at the very least that I am studying a very small effect.
Edit:
Alexander Etz (@AlxEtz) pointed out that it’s possible to look at the data more than once without increasing alpha. He’s right. And it can be efficient if I’m not interested in getting a precise effect size estimate. Daniel Lakens has a great post about doing this, as does Rolf Zwaan. Alex adds:
@ecsalomon There's also this preprint for a bayesian alternative http://t.co/q78gJMiwoA 😉 I almost missed a chance for a bayes plug!
— Alexander Etz (@AlxEtz) June 28, 2015
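One crude way to see a correction of the kind Lakens and Zwaan describe in action: split the overall alpha across the planned looks, Bonferroni style, and test against that stricter threshold at every peek. (This is more conservative than the sequential boundaries they actually recommend; I’m using it only because the union bound guarantees that the overall false positive rate across the 9 looks stays at or below .05.) Re-running the null simulation with phack() and the per-look alpha should bring the “After Hacking” proportion back down to .05 or less, at some cost in power:
set.seed(4)
res.corrected <- phack(initialN=30,
                       hackrate=30,
                       grp1M=0,
                       grp2M=0,
                       grp1SD=1,
                       grp2SD=1,
                       maxN=270,
                       alpha=.05/9,  # split alpha across the 9 planned looks
                       alternative="two.sided",
                       graph=FALSE,
                       sims=10000)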
1. Some people much, much smarter than I am have already written about the “optimal” strategies for winning publications, and you should read their paper because it shows just how much these strategies bias publications.
2. Or if I care about effect size estimates.
3. Even if I am only willing to test 120 people in each condition, I find significant results 9%, 41%, and 90% of the time for d = 0, .2, and .4, respectively. For a small effect, even looking at my data just four times (at 30, 60, 90, and 120 participants per cell), my hit rate is quadruple that under the null hypothesis.
4. I also modified Sherman’s original code a bit to look at what happens if I only continue adding participants when the Taylor Swift mean is bigger (but not significantly) than the Baz Luhrmann mean. I was able to find a significant effect 9%, 59%, and 93% of the time for d = 0, .2, and .4, respectively. In other words, I can still expect to find a significant result more than half the time even for effects as small as d = .2, even if the only p-hacking I do is looking at my data intermittently. (A sketch of this kind of directional stopping rule appears after these notes.)
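Here is a from-scratch sketch of the directional stopping rule in note 4: keep adding participants only while the Taylor Swift mean is higher than the Baz Luhrmann mean but the difference is not yet significant. This is my own illustration, not the modified version of Sherman’s code I actually ran, so it will not reproduce the numbers in the note exactly:
one_directional_study <- function(mu1 = 0, mu2 = 0, sd = 1,
                                  initialN = 30, hackrate = 30,
                                  maxN = 270, alpha = .05) {
  g1 <- rnorm(initialN, mu1, sd)   # Taylor Swift condition
  g2 <- rnorm(initialN, mu2, sd)   # Baz Luhrmann condition
  p <- t.test(g1, g2)$p.value
  # Add more participants only while the difference is in the predicted
  # direction but not yet significant
  while (p >= alpha && mean(g1) > mean(g2) && length(g1) < maxN) {
    g1 <- c(g1, rnorm(hackrate, mu1, sd))
    g2 <- c(g2, rnorm(hackrate, mu2, sd))
    p <- t.test(g1, g2)$p.value
  }
  p < alpha  # did this study end up "significant"?
}
set.seed(4)
mean(replicate(10000, one_directional_study(mu1 = .2)))  # hit rate for d = .2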
Very neat post. So overall we should think of this procedure (using batches of 30 per group, stopping at 270 max) as having the following properties: alpha = .19, power (for d=.2) = .75, power (for d=.4) = .99.
I was curious how that compared to a fixed-n design with the same power as each case in your post, but using the alpha level implied by the simulation instead of .05. So in this procedure, alpha = .19, power matches each case, and I’m solving for the n required per group.
d=.2, power=.75, sig.level = .19: n per group was 197 (vs 167 sequentially)
d=.4, power=.99, sig.level=.19: n per group was 165 (vs 72 sequentially)
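Presumably computed with pwr.t.test using the realized alpha, something like the calls below, which come out near those n’s:
library(pwr)
pwr.t.test(d = .2, power = .75, sig.level = .19,
           type = "two.sample", alternative = "two.sided")
pwr.t.test(d = .4, power = .99, sig.level = .19,
           type = "two.sample", alternative = "two.sided")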
As you’d expect, it isn’t as drastic as using alpha = .05 (ns of 348 and 293), but really that isn’t a fair comparison in my mind because alpha is actually .19. But you still gain a lot of efficiency just from going sequential.
Then: what happens when using the same average n as your sims, but solving for the resulting power?
d=.2, n=167, sig.level = .19: power is .46 (vs .75 sequentially)
d=.4, n=72, sig.level=.19: power is .86 (vs .99 sequentially)
Quite a bit better. Compared to fixed n, this procedure has much higher power, and compared to fixed power, this procedure requires somewhat smaller samples. But it still has pretty high alpha, so if you can tolerate that, it’s not so bad.
Oops, this:
d=.2, n=167, sig.level = .19: power is .46 (vs .75 sequentially)
should say this:
d=.2, n=167, sig.level = .19: power is .69 (vs .75 sequentially)
Not sure if you can edit comments.
Thanks, Alex! That’s a good way to think about this. Using a sequential procedure changes alpha unless some correction is applied. This procedure happens to change alpha a lot, which makes power a lot higher (especially for smaller effects).
By the way, I think my next blog post will be about how different strategies for using sequential testing without correction (e.g., as in footnote 4) affect alpha.