# Perceptions of Statistical Evidence Among Scientists

## Edinburgh / 19 May, 2014

Richard D. Morey
University of Groningen

## My collaborators

• Philosophy: Jan-Willem Romeijn (Rijksuniversiteit Groningen)
• Statistics: Paul Speckman, Jeff Rouder (University of Missouri)
• Psychology: Rink Hoekstra (Rijksuniversiteit Groningen)

## Science and evidence

### A few questions:

1. How strong is the evidence for anthropogenic global warming?
2. Is someone who rejects the theory of relativity rational?
3. Should we believe in subliminal priming, given the literature on it?
4. How strongly should we believe that $\delta>0$, given a particular data set?

These are all scientific questions, concerning evidence, reason, and belief. Only the last sounds odd.

## Evidence

That which would justify a change in a person's belief regarding a question of interest (Fox, 2011).

• Requires justification or rationality (otherwise not useful)
• Is inherently subjective (questions and belief are subjective)
• Requires accounting for belief
• Additionally: is relative (for our purposes)

## An epic disagreement

### Jerzy Neyman: Statistics is about decision and action

"The processes of [statistical inference]...are certainly not any sort of 'reasoning', at least not in the sense in which this word is used in other instances; they are acts of will." (Neyman, 1957)

"We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis." (Neyman & Pearson, 1933)

## An epic disagreement

### Ronald Fisher: Statistics is about knowledge and rationality

"...the feeling induced by a test of significance has an objective basis in that the probability statement on which it is based is a fact communicable to and verifiable by other rational minds. The level of significance in such cases fulfils the conditions of a measure of the rational grounds for the disbelief it engenders" (my emphasis; Fisher, 1959).

## Formalizing evidence and belief: Bayesian statistics

### Modern Bayesian statistics arose at the same time.

• Beliefs can be represented as "plausibilities": $0\leq p \leq 1$
• Plausibilities conform to the laws of probability (Cox, 1946; Ramsey, 1926; de Finetti, 1935; Joyce, 1998)
• Beliefs are updated in response to data $y$ by Bayesian conditionalization:

$\frac{p_y(\theta_0)}{p_y(\theta_1)} = \frac{p(\theta_0\mid y)}{p(\theta_1\mid y)} = \frac{p(y\mid\theta_0)}{p(y\mid\theta_1)}\times\frac{p(\theta_0)}{p(\theta_1)}$
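The odds form of conditionalization above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original talk: the function name `posterior_odds` and all numbers are hypothetical.

```python
def posterior_odds(lik_theta0: float, lik_theta1: float, prior_odds: float) -> float:
    """Posterior odds of theta0 over theta1: likelihood ratio times prior odds."""
    if lik_theta1 <= 0:
        raise ValueError("p(y | theta1) must be positive for a finite ratio")
    likelihood_ratio = lik_theta0 / lik_theta1
    return likelihood_ratio * prior_odds


# Hypothetical values: p(y | theta0) = 0.3, p(y | theta1) = 0.1,
# and prior odds of 1:2 against theta0.
odds = posterior_odds(0.3, 0.1, prior_odds=0.5)
print(f"posterior odds (theta0 : theta1) = {odds:.2f}")
```

The likelihood ratio is the part of the update the data contribute; two people with different prior odds who agree on the likelihoods shift their odds by the same factor.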

## Does this possible disagreement have implications?

Schrier et al. (2008): "The interpretation of systematic reviews with meta-analyses: an objective or subjective process?"

### Does magnesium improve outcomes after a heart attack (myocardial infarction, MI)?

• Eight medical researchers reviewed 23 studies, grouped as 1 + 5 meta-analyses; total $N$: 69,505
• Researchers were given typical inferential statistics for meta-analyses
• "I believe magnesium has now been shown to be beneficial for patients during the post-MI period."
• "I recommend that magnesium therapy be used in patients during the post-MI period."

## Medical disagreement: Schrier et al. (2008)

"I believe magnesium has now been shown to be beneficial for patients during the post-MI period."

## Medical disagreement: Schrier et al. (2008)

"I recommend that magnesium therapy be used in patients during the post-MI period."

## Medical disagreement: Schrier et al. (2008)

### What was the effect of the data?

• The same data moved some researchers in one direction, others in the other
• Data increased the disagreement among the researchers, in spite of large $N$
• What hope is there for research if so much data can't induce agreement?

## Why care about evidence evaluation?

### Accounting for evidence is needed for...

• Experimental data evaluation (Do I need more participants?)
• Theory building (which phenomena do I need to account for?)
• Theory evaluation (is the evidence for my theory strong or weak?)
• Evaluation of clinical trials (how much evidence for efficacy is there?)
• Basically...everywhere in science!

## How do we assess evidence?

### Rink Hoekstra and I presented various scenarios, with common statistics, to researchers $(N=118)$.

• Scenario 1: Statistical falsification
• Scenario 2: Power, Type I error rate, significance
• Scenario 3: $p$ value and sample size
• Scenario 4: $p$ value, sample size, and power (extended Q3)
• Scenario 5: Confidence interval

## Assessing evidence: Logical Falsification

### The situation

Two researchers disagree about the sex of an adult antelope skull found. It is known that for adults of this antelope species, all males have antlers between 7cm and 12cm long. All females have antlers between 3cm and 5cm long. There are no exceptions. However, their assistant — who found the skull — has not told them the length of the antlers yet, and neither researcher has seen the antlers.

### The hypotheses

Based on its location in a particular grave site, the two researchers have their own hypotheses about the sex of the antelope. Dr. Z believes that the skull belonged to a female antelope. Dr. W believes that the skull belonged to a male antelope.

## Assessing evidence: Logical Falsification

### The experiment

Dr. X, their assistant, returns with the measurements of the antlers.

### The results

The exact length of the antlers was 4.5cm.

## The question

Suppose that like Dr. X, you were completely neutral and had no preference for either hypothesis. In light of Dr. X's measurement, how does the support for Dr. W's hypothesis relate to the support for Dr. Z's?

e.g., "The evidence for Dr. W's hypothesis is 10 times stronger than the evidence for Dr. Z's hypothesis"
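One way to read this scenario, offered as a sketch and not as the survey's scoring key: under the stated rules, an antler length of 4.5cm is impossible for a male and possible for a female. Assuming $p(y=4.5\mid\text{female})>0$, the likelihood ratio for Dr. W's hypothesis (male) against Dr. Z's (female) is

$\frac{p(y=4.5\mid\text{male})}{p(y=4.5\mid\text{female})} = \frac{0}{p(y=4.5\mid\text{female})} = 0$

so the measurement logically falsifies Dr. W's hypothesis, and the evidence for Dr. Z's hypothesis relative to Dr. W's is unbounded.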

## Assessing evidence: Statistical Significance

### The situation

Two researchers disagree about the size of an effect of a genetic mutation on the weight of mice.

### The hypotheses

Dr. A believes that this genetic mutation decreases the weight of mice by 1 gram. Dr. B believes that this genetic mutation increases the weight of mice by 1 gram.

## Assessing evidence: Statistical Significance

### The experiment

The two researchers ask a neutral third researcher, Dr. C, to conduct an experiment to test their hypotheses. Because Dr. C has no preference for either hypothesis, she randomly selects, by a fair coin flip, Dr. B's hypothesis as the null hypothesis. She designs and performs an experiment so that the statistical test she performs on the data has a type I error rate of 5% and a power, if Dr. A's hypothesis is correct, of 80%.

### The results

Dr. C performs the statistical test. The results are statistically significant, indicating that the null hypothesis (Dr. B's) is to be rejected. For the purpose of this question, consider the assumptions of the statistical procedure met.

## The question

Suppose that like Dr. C, you were completely neutral and had no preference for either hypothesis before the experiment. In light of Dr. C's findings, how does the support for Dr. A's hypothesis relate to the support for Dr. B's?

e.g., "The evidence for Dr. A's hypothesis is 10 times stronger than the evidence for Dr. B's hypothesis"
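One way to turn the reported quantities into a relative-evidence statement, offered as a sketch under a stated assumption and not as the survey's intended answer: suppose the only datum is the dichotomous outcome "significant". Its probability is the power (0.80) if Dr. A is correct and the type I error rate (0.05) if Dr. B is correct, so the likelihood ratio is

$\frac{p(\text{significant}\mid\text{Dr. A})}{p(\text{significant}\mid\text{Dr. B})} = \frac{0.80}{0.05} = 16$

With equal prior odds, Bayesian conditionalization then gives posterior odds of 16 to 1 in favor of Dr. A's hypothesis.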

## Is evaluating evidence possible?

$N=55$ participants. 2 participants answered $\infty$.

## Assessing evidence: Confidence intervals

### The situation

Two researchers are studying the density of a new material, previously unknown to science. Based on their particular theoretical leanings, they disagree about what the density of the material will be.

### The hypotheses

Dr. K believes that the density of the new material is 1.2 g/cm$^3$. Dr. L believes that the density of the new material is 0.8 g/cm$^3$.

## Assessing evidence: Confidence intervals

### The experiment

The two researchers ask a neutral third researcher, Dr. Z, to conduct measurements to test the density of the new material. Dr. Z is an expert at measuring density, but due to recent cuts in funding for lab equipment, he has to use substandard equipment. To overcome this problem, Dr. Z measures the material 10 times and constructs a 95% confidence interval around the mean density measurement, based on a standard t procedure. For the purpose of this question, consider the assumptions of the statistical procedure met.
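The standard t procedure mentioned above can be sketched as follows. The ten measurements are hypothetical values invented for illustration (they are not from the survey), and the critical value is the tabulated two-sided 95% value for 9 degrees of freedom, about 2.262.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom
# (tabulated value; valid only for samples of size 10).
T_CRIT_95_DF9 = 2.262


def t_confidence_interval(values, t_crit):
    """Return (lower, upper) for the mean: mean +/- t_crit * s / sqrt(n)."""
    n = len(values)
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)  # stdev divides by n - 1
    half_width = t_crit * std_err
    return mean - half_width, mean + half_width


# Hypothetical density measurements in g/cm^3 (illustrative only).
measurements = [1.02, 0.95, 1.10, 0.98, 1.05, 0.91, 1.08, 1.00, 0.97, 1.04]
lower, upper = t_confidence_interval(measurements, T_CRIT_95_DF9)
print(f"95% CI: [{lower:.3f}, {upper:.3f}] g/cm^3")
```

The interval alone does not state how strongly the data favor one hypothesized density over the other; that judgment is what the scenario asks of respondents.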

### The results

Dr. Z reports back to the two scientists the 95% confidence interval shown below.

## Confidence intervals: Possible intervals

Four types of intervals could be presented:

## Likelihood ratios

$N=64$. One participant answered "Infinity" in each non-"Equal" condition.

## Survey: Conclusions

• Substantial disagreement exists regarding whether evidence can be extracted from classical statistical reports.
• Among researchers who do believe it can, assessments of evidence vary across orders of magnitude.
• However, evidence seems to be somewhat meaningful to researchers: trends make sense.

## Future work

Idea: Use psychometric techniques to assess evaluations of statistical evidence.

($\delta$ is the standardized effect size $(\mu-\mu_0)/\sigma$)

## Future work

### Equi-evidence curves

• Evidence is a critical idea in science and philosophy
• Current dominant statistical techniques don't quantify evidence (and aren't meant to!)
• Evidence can be formalized using Bayesian statistics
• Needed: Training in statistical methods that interface with belief! (See: BayesFactor software - Bayesian linear models in R)
• Scientific evidence is a complicated thing; statistical evidence is only one (formal) piece of the pie.