Do biased stats provide bogus economics? A primer on publication bias and power

By Eric Fruits, Truth on the Market, March 29, 2018
https://truthonthemarket.com/2018/03/29/do-biased-stats-provide-bogus-economics-a-primer-on-publication-bias-and-power/

If you do research involving statistical analysis, you’ve heard of John Ioannidis. If you haven’t heard of him, you will. He’s gone after the fields of medicine, psychology, and economics. He may be coming for your field next.

Ioannidis is after bias in research. He is perhaps best known for a 2005 paper “Why Most Published Research Findings Are False.” A professor at Stanford, he has built a career in the field of meta-research and may be one of the most highly cited researchers alive.

In 2017, he published “The Power of Bias in Economics Research.” He recently talked to Russ Roberts on the EconTalk podcast about his research and what it means for economics.

He focuses on two factors that contribute to bias in economics research: publication bias and low power. These are complicated topics. This post aims to provide a simplified explanation of these issues and why bias and power matter.

What is bias?

We frequently hear the word bias. “Fake news” is biased news. For dinner, I am biased toward steak over chicken. That’s different from statistical bias.

In statistics, bias means that a researcher’s estimate of a variable or effect is different from the “true” value or effect. The “true” probability of getting heads from tossing a fair coin is 50 percent. Let’s say that no matter how many times I toss a particular coin, I find that I’m getting heads about 75 percent of the time. My instrument, the coin, may be biased. I may be the most honest coin flipper, but my experiment has biased results. In other words, biased results do not imply biased research or biased researchers.
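For the programmatically inclined, here is a toy Python sketch of the coin example; the probabilities and the number of flips are illustrative assumptions, not anything from a real experiment:

```python
# A toy version of the coin example: an honest experiment with a biased
# instrument yields an estimate far from the "true" fair-coin value of 0.50.
import random

random.seed(42)

def estimate_heads_share(true_prob_heads, n_flips=10_000):
    """Estimate the probability of heads by flipping n_flips times."""
    heads = sum(random.random() < true_prob_heads for _ in range(n_flips))
    return heads / n_flips

print(estimate_heads_share(0.50))   # fair coin: estimate lands near 0.50
print(estimate_heads_share(0.75))   # biased coin: estimate settles near 0.75
```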

Publication bias

Publication bias occurs because peer-reviewed publications tend to favor publishing positive, statistically significant results and to reject insignificant results. Informally, this is known as the “file drawer” problem. Nonsignificant results remain unsubmitted in the researcher’s file drawer or, if submitted, remain in limbo in an editor’s file drawer.

Studies are more likely to be published in peer-reviewed publications if they have statistically significant findings, build on previous published research, and can potentially garner citations for the journal with sensational findings. Studies that don’t have statistically significant findings or don’t build on previous research are less likely to be published.

The importance of “sensational” findings means that ho-hum findings, even if statistically significant, are less likely to be published. For example, research finding that a 10 percent increase in the minimum wage is associated with a one-tenth of 1 percent reduction in employment (i.e., an elasticity of –0.01) would be less likely to be published than a study finding a 3 percent reduction in employment (i.e., an elasticity of –0.3).

“Man bites dog” findings—those that are counterintuitive or contradict previously published research—may be less likely to be published. A study finding an upward sloping demand curve is likely to be rejected because economists “know” demand curves slope downward.

On the other hand, man bites dog findings may also be more likely to be published. Card and Krueger’s 1994 study finding that a minimum wage hike was associated with an increase in employment among low-wage workers was published in the top-tier American Economic Review. Had the study been conducted by lesser-known economists, it’s much less likely it would have been accepted for publication. The results were sensational, judging from the attention the article got from the New York Times, the Wall Street Journal, and even the Clinton administration. Sometimes a man does bite a dog.

Low power

A study with low statistical power has a reduced chance of detecting a true effect.

Consider our criminal legal system. We seek to find criminals guilty, while ensuring the innocent go free. Using the language of statistical testing, the presumption of innocence is our null hypothesis. We set a high threshold for our test: Innocent until proven guilty, beyond a reasonable doubt. We hypothesize innocence and only after overcoming our reasonable doubt do we reject that hypothesis.

[Figure: Type I and Type II errors]

An innocent person found guilty is considered a serious error, a “miscarriage of justice.” The presumption of innocence (the null hypothesis), combined with the high burden of proof (beyond a reasonable doubt), is designed to reduce these errors. In statistics, this is known as a “Type I” error, or “false positive.” The probability of a Type I error is called alpha, which is set to some arbitrarily low number, such as 10 percent, 5 percent, or 1 percent.

Failing to convict a truly guilty person is also a serious error, though it is generally agreed to be less serious than a wrongful conviction. Statistically speaking, this is a “Type II” error, or “false negative,” and the probability of making a Type II error is called beta.

By now, it should be clear that there is a relationship between Type I and Type II errors. If we reduce the chance of a wrongful conviction, we increase the chance of letting some criminals go free. It can be shown mathematically (though not here) that, holding the evidence and sample fixed, reducing the probability of a Type I error increases the probability of a Type II error.
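Here is a minimal sketch of that trade-off for a one-sided z-test; the effect size and sample size are hypothetical numbers chosen purely for illustration:

```python
# A sketch of the alpha/beta trade-off for a one-sided z-test.
# The effect size (0.5 standard deviations) and sample size (n = 25)
# are hypothetical values chosen for illustration.
from math import sqrt
from scipy.stats import norm

effect, sigma, n = 0.5, 1.0, 25
noncentrality = effect * sqrt(n) / sigma   # how far the true effect sits from the null

for alpha in (0.10, 0.05, 0.01):
    z_crit = norm.ppf(1 - alpha)              # stricter alpha -> higher evidentiary bar
    beta = norm.cdf(z_crit - noncentrality)   # probability of missing the true effect
    print(f"alpha = {alpha:.2f} -> beta = {beta:.3f}, power = {1 - beta:.3f}")
```

As alpha falls from 10 percent to 1 percent, beta rises: a stricter standard of proof means more true effects (and more guilty defendants) slip through.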

Consider O.J. Simpson. Simpson was not found guilty in his criminal trial for murder, but he was found liable for the deaths of Nicole Simpson and Ron Goldman in a civil trial. One reason for these different outcomes is the higher burden of proof for a criminal conviction (“beyond a reasonable doubt,” alpha = 1 percent) than for a finding of civil liability (“preponderance of evidence,” alpha = 50 percent). If O.J. truly was guilty of the murders, the criminal trial was less likely to find guilt than the civil trial.

In econometrics, we construct the null hypothesis to be the opposite of the relationship we hypothesize. For example, if we hypothesize that an increase in the minimum wage decreases employment, the null hypothesis would be: “A change in the minimum wage has no impact on employment.” If the research involves regression analysis, the null hypothesis would be: “The estimated elasticity of employment with respect to the minimum wage is zero.” If we set the probability of Type I error to 5 percent, then regression results with a p-value of less than 0.05 would be sufficient to reject the null hypothesis of no relationship. If we increase the acceptable probability of Type I error, we increase the likelihood of finding a relationship, but we also increase the chance of finding a relationship where none exists.
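As a rough illustration of how that test plays out, here is a sketch run on simulated data in which no true relationship exists; the variable names and data are invented for the example, not drawn from any minimum wage study:

```python
# A sketch of a regression test of the null hypothesis "the coefficient is zero,"
# run on simulated data constructed so that NO true relationship exists.
# Variable names are hypothetical, not from any actual minimum wage dataset.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
pct_change_min_wage = rng.normal(0, 1, n)      # simulated regressor
pct_change_employment = rng.normal(0, 1, n)    # simulated outcome, unrelated by construction

X = sm.add_constant(pct_change_min_wage)
fit = sm.OLS(pct_change_employment, X).fit()

coef, p_value = fit.params[1], fit.pvalues[1]
print(f"estimated coefficient = {coef:.3f}, p-value = {p_value:.3f}")
print("reject the null at alpha = 0.05?", p_value < 0.05)  # true about 5% of the time by chance
```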

Now, we’re getting to power.

Power is the chance of detecting a true effect. In the legal system, it would be the probability that a truly guilty person is found guilty.

By definition, a low power study has a small chance of discovering a relationship that truly exists. Low power studies produce more false negatives than high power studies. If a set of studies has a power of 20 percent, then out of 100 true effects, the studies will detect, on average, only 20 of them. In other words, out of 100 truly guilty suspects, a legal system with a power of 20 percent will convict only about 20 of them.
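A quick sanity check of that arithmetic, treating each of 100 true effects as an independent 20 percent chance of detection:

```python
# With 20 percent power, roughly 20 of 100 true effects
# (or truly guilty suspects) are detected.
import random

random.seed(1)
power, true_effects = 0.20, 100
detected = sum(random.random() < power for _ in range(true_effects))
print(detected)   # close to 20 on any given run; exactly 20 in expectation
```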

Suppose we expect that 25 percent of those accused of a crime are truly guilty. The odds of guilt are then R = 0.25 / 0.75 = 0.33. Assume we set alpha to 0.05 and conclude the accused is guilty if our test statistic yields p < 0.05. Using Ioannidis’ formula for positive predictive value, we find:

  • If the power of the test is 20 percent, the probability that a “guilty” verdict reflects true guilt is 57 percent.
  • If the power of the test is 80 percent, the probability that a “guilty” verdict reflects true guilt is 84 percent.

In other words, under a low power test, a “guilty” verdict is more likely to be a wrongful conviction than under a high power test.
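For readers who want to check the arithmetic, the positive predictive value (PPV) formula is PPV = (1 − β)R / ((1 − β)R + α), where 1 − β is power, α is the probability of Type I error, and R is the prior odds of a true effect. A minimal sketch reproducing the two bullet points above:

```python
# Ioannidis' positive predictive value: the probability that a rejected null
# (a "guilty" verdict) reflects a true effect (true guilt).
def positive_predictive_value(power, alpha, prior_odds):
    return (power * prior_odds) / (power * prior_odds + alpha)

R = 0.25 / 0.75       # prior odds that the accused is truly guilty
alpha = 0.05          # probability of Type I error

print(f"PPV at 20% power: {positive_predictive_value(0.20, alpha, R):.2f}")  # ~0.57
print(f"PPV at 80% power: {positive_predictive_value(0.80, alpha, R):.2f}")  # ~0.84
```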

In our minimum wage example, a low power study is more likely to find a relationship between a change in the minimum wage and employment when no relationship truly exists. And even when a relationship truly exists, a low power study will tend to report a bigger impact than a high power study, because when power is low only the estimates that happen to be large clear the significance threshold. The figure below demonstrates this phenomenon.

[Figure: funnel graph of minimum wage research, plotting estimated employment elasticities against each study’s precision]

Across the 1,424 studies surveyed, the average elasticity of employment with respect to the minimum wage is –0.190 (i.e., a 10 percent increase in the minimum wage would be associated with a 1.9 percent decrease in employment). When adjusted for the studies’ precision, the weighted average elasticity is –0.054; by this simple comparison, the unadjusted average is roughly 3.5 times bigger than the adjusted average. Ioannidis and his coauthors estimate that, among the 60 studies with “adequate” power, the weighted average elasticity is –0.011.

(By the way, my own unpublished studies of minimum wage impacts at the state level had an estimated short-run elasticity of –0.03 and “precision” of 122 for Oregon and short-run elasticity of –0.048 and “precision” of 259 for Colorado. These results are in line with the more precise studies in the figure above.)
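To show how precision weighting works mechanically, here is a sketch that pools just those two state-level estimates, weighting each by its reported precision. It illustrates the idea only; the actual meta-analysis pools the full set of studies and may use a different weighting scheme (e.g., inverse-variance weights):

```python
# Precision-weighted average: each elasticity estimate is weighted by its
# reported precision, so noisier (less precise) estimates count for less.
estimates = [
    ("Oregon",   -0.030, 122),   # (study, short-run elasticity, reported precision)
    ("Colorado", -0.048, 259),
]

weighted_sum = sum(precision * elasticity for _, elasticity, precision in estimates)
total_precision = sum(precision for _, _, precision in estimates)
print(f"precision-weighted elasticity: {weighted_sum / total_precision:.3f}")  # about -0.042
```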

Is economics bogus?

It’s tempting to walk away from this discussion thinking all of econometrics is bogus. Ioannidis himself responds to this temptation:

Although the discipline has gotten a bad rap, economics can be quite reliable and trustworthy. Where evidence is deemed unreliable, we need more investment in the science of economics, not less.

For policymakers, the reliance on economic evidence is even more important, according to Ioannidis:

[P]oliticians rarely use economic science to make decisions and set new laws. Indeed, it is scary how little science informs political choices on a global scale. Those who decide the world’s economic fate typically have a weak scientific background or none at all.

Ioannidis and his colleagues identify several ways to address the reliability problems in economics and other fields (social psychology is one of the worst). However, these are longer-term solutions.

In the short term, researchers and policymakers should view sensational findings with skepticism, especially if those sensational findings support their own biases. That skepticism should begin with one simple question: “What’s the confidence interval?”