BakedIn.co

How to read a primary study

Sample size, p-value, effect size, confidence interval. Conflicts of interest. The replication crisis, honestly.

~25 min · 625 XP on completion

Most wellness content is built on misreading primary literature. A single study with a flashy headline ('coffee linked to longevity!') almost never warrants the certainty the headline suggests. The skill we're building in this chapter is not 'read every primary study' — guideline bodies do that for you. It's 'when you encounter a study being cited at you, look at four numbers and decide how much to update your beliefs.'

Four numbers matter: sample size, p-value, effect size, confidence interval. Plus three meta-questions: what kind of study is it, who funded it, has it been replicated. Six minutes per study, lifetime utility.

The hierarchy of study types

Not all studies carry the same weight. The hierarchy you'll see in every textbook and every guideline-body methods document:

  1. Case report / case series — one person, or a small group. Lowest evidential weight; useful for generating hypotheses, not as a basis for action.
  2. Cross-sectional study — snapshot of a population at a point in time. Shows correlation, can't show causation.
  3. Case-control study — compare people with an outcome to people without. Useful for rare outcomes; vulnerable to recall bias.
  4. Cohort study — follow a group forward over time. Can show temporal sequence (X happened before Y) but can't randomize, so other factors may explain associations.
  5. Randomized controlled trial (RCT) — random assignment to treatment or control. The closest individual-study design to demonstrating causation.
  6. Systematic review + meta-analysis — pools multiple RCTs (or other studies) into a single quantitative summary. The most evidentially-weighty single document.

When you see a study being cited, identify its level first. 'A new study shows…' is usually a cross-sectional or cohort study — informative but not the same as an RCT. Wellness content systematically over-claims based on lower-tier evidence.

The four numbers

1. Sample size (n)

How many participants? Reported as 'n=...' in the methods. Small studies (n<100) are vulnerable to chance findings; their results often don't replicate in larger studies. Very large studies (n>10,000) can detect tiny effects that may or may not matter clinically. Read sample size in conjunction with effect size.

Rough heuristic: a single RCT with n<200 should not change your behavior on its own. Wait for a replication or a meta-analysis.
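To see why small samples are fragile, here is a minimal Python sketch of how the uncertainty around an observed proportion shrinks as n grows. It assumes a simple normal approximation; the function name and the 30% example figure are illustrative and do not come from any particular study.

```python
import math

def margin_of_error(p_hat: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for an observed proportion.

    Uses the normal approximation, which is itself rough for very small n
    or for proportions close to 0 or 1.
    """
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

# Hypothetical scenario: 30% of participants improve, at three sample sizes.
for n in (50, 200, 10_000):
    moe = margin_of_error(0.30, n)
    print(f"n={n:>6}: 30% observed, give or take {moe * 100:.1f} percentage points")
```

Under these assumptions the same 30% comes with an uncertainty of roughly 13 points at n=50, 6 points at n=200, and under 1 point at n=10,000. Precision improves with the square root of n, so quadrupling the sample only halves the margin.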

2. P-value (p)

The probability that you'd see a result this extreme or more extreme by chance alone if there's actually no effect. p<0.05 is the conventional threshold for 'statistically significant.' It does NOT mean 'this finding is true' or 'there's a 95% chance this is real.' It only means 'if there were no effect, this result would be unusual.'

  • p<0.001 — a result this extreme would be very unusual if there were no effect, but this still says nothing about effect size or real-world importance
  • p between 0.01 and 0.05 — conventionally significant; replicates less often than people think
  • p between 0.05 and 0.10 — 'marginal' / 'trending'; should be treated as a draw, not a finding
  • p>0.10 — no statistical support for the claimed effect
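A short Python sketch can make the definition concrete. It computes a two-sided p-value for a difference between two proportions using a pooled z-test, which is one common textbook method and not necessarily what any given paper used. The function name and the event counts are hypothetical.

```python
import math

def two_proportion_p_value(events_a: int, n_a: int, events_b: int, n_b: int) -> float:
    """Two-sided p-value for a difference in proportions (pooled z-test).

    Assumes both groups are large enough for the normal approximation and
    that the pooled proportion is strictly between 0 and 1.
    """
    p_a = events_a / n_a
    p_b = events_b / n_b
    pooled = (events_a + events_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (p_a - p_b) / se
    # Two-sided tail area of the standard normal: P(|Z| >= |z|) = erfc(|z| / sqrt(2))
    return math.erfc(abs(z) / math.sqrt(2.0))

# Same observed rates (12% vs 10%) at two very different sample sizes.
p_small = two_proportion_p_value(12, 100, 10, 100)
p_large = two_proportion_p_value(1200, 10_000, 1000, 10_000)
print(f"n=100 per arm:    p = {p_small:.3f}")
print(f"n=10,000 per arm: p = {p_large:.2g}")
```

With 100 people per arm the 2-point difference gives a p-value well above 0.05; with 10,000 per arm the identical difference gives a p-value far below 0.001. The p-value reflects sample size as much as it reflects the effect, which is why it has to be read next to the effect size.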

3. Effect size

How big is the effect? The p-value speaks to 'could this be chance?'; effect size speaks to 'does the effect matter?' A statistically significant 0.3-pound weight loss is not clinically meaningful. A statistically significant 30% reduction in heart attack risk, in a population where heart attacks are common, is huge.

Effect sizes come in many forms — relative risk, odds ratio, hazard ratio, mean difference, Cohen's d. The unit depends on the study type. Two examples of how to read them:

  • 'Relative risk 0.7' — the treatment group had 70% the risk of the control group, i.e., a 30% relative reduction. (Watch: if baseline risk is 1 per 100,000, a 30% reduction is 0.3 per 100,000 — clinically tiny. Always look for absolute risk too.)
  • 'Mean difference -2.5 mmHg' — treatment lowered systolic blood pressure by 2.5 mmHg on average. (Tiny in clinical terms — guidelines treat blood pressure changes <5 mmHg as marginal.)
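The relative-versus-absolute point in the first bullet can be worked through in a few lines of Python. This is a sketch under stated assumptions: the function name is illustrative, and 'number needed to treat' (1 divided by the absolute risk reduction) is a standard summary added here for illustration, not a term used above.

```python
def risk_summary(control_risk: float, relative_risk: float) -> dict:
    """Translate a relative risk into absolute terms for a given baseline risk."""
    treated_risk = control_risk * relative_risk
    absolute_reduction = control_risk - treated_risk
    return {
        "relative_risk_reduction": 1.0 - relative_risk,
        "absolute_risk_reduction": absolute_reduction,
        # People who must be treated for one of them to avoid the outcome.
        "number_needed_to_treat": (
            1.0 / absolute_reduction if absolute_reduction > 0 else float("inf")
        ),
    }

# The same relative risk of 0.7 at two baseline risks.
rare = risk_summary(control_risk=1 / 100_000, relative_risk=0.7)
common = risk_summary(control_risk=0.10, relative_risk=0.7)
print(f"rare outcome:   NNT is about {rare['number_needed_to_treat']:,.0f}")
print(f"common outcome: NNT is about {common['number_needed_to_treat']:,.0f}")
```

Both cases are a '30% reduction', but at a baseline of 1 per 100,000 you would treat about 333,000 people to prevent one event, while at a baseline of 10% you would treat about 33.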

4. Confidence interval (CI)

Usually reported as '95% CI: 0.6 to 0.9' alongside the effect size. It is the range of effect values reasonably compatible with the data; loosely, the range within which the true effect probably falls. NARROW intervals suggest a precise estimate; WIDE intervals suggest uncertainty.

Practical rules:

  • If the 95% CI includes 1.0 for a relative risk / odds ratio / hazard ratio, the result is NOT statistically significant at the 0.05 level, whatever significance language accompanies it.
  • If the 95% CI includes 0 for a mean difference, NOT significant.
  • A wide CI (e.g., 0.5 to 2.3) often means the study was too small or the data too noisy to give a confident answer.
  • Strong findings have narrow CIs that don't cross the null.
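The rules above can be checked by hand. The Python sketch below computes an approximate 95% CI for a relative risk using the standard log-scale (Katz) method and tests whether the interval includes the null value. The function names and the event counts are hypothetical; real papers may use other interval methods.

```python
import math

def relative_risk_ci(events_t: int, n_t: int, events_c: int, n_c: int, z: float = 1.96):
    """Relative risk with an approximate 95% CI (log-scale Katz method).

    Assumes at least one event in each group; the approximation is rough
    when event counts are very small.
    """
    rr = (events_t / n_t) / (events_c / n_c)
    se_log = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    low = math.exp(math.log(rr) - z * se_log)
    high = math.exp(math.log(rr) + z * se_log)
    return rr, low, high

def includes_null(low: float, high: float, null_value: float = 1.0) -> bool:
    """True if the interval contains the 'no effect' value (1.0 for ratios, 0.0 for differences)."""
    return low <= null_value <= high

# Same relative risk of 0.7, small trial versus large trial.
for label, counts in (("small", (7, 100, 10, 100)), ("large", (700, 10_000, 1000, 10_000))):
    rr, low, high = relative_risk_ci(*counts)
    verdict = "includes 1.0" if includes_null(low, high) else "excludes 1.0"
    print(f"{label}: RR {rr:.2f}, 95% CI {low:.2f} to {high:.2f} ({verdict})")
```

In this illustration the small trial gives an interval of roughly 0.28 to 1.77, wide and spanning 1.0, while the large trial gives roughly 0.64 to 0.77. The point estimate is identical; only the precision differs.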

Three meta-questions

What kind of study is it?

Cross-sectional? Cohort? RCT? Meta-analysis? The headline rarely tells you. Look at the methods section. A 'meta-analysis of RCTs' is much stronger evidence than a 'cross-sectional study' on the same question.

Who funded it?

Every reputable journal requires a funding disclosure and a conflict-of-interest statement. Industry-funded studies aren't automatically wrong — much of clinical research is industry-funded — but the funding direction is one factor in calibrating your skepticism. A study funded by a supplement company that finds their supplement works is in a different epistemic position than a Cochrane review of independent RCTs.

Has it been replicated?

The most important question. A single study should never change your behavior unless the field has tried to replicate it and the replications agree. Wellness journalism systematically over-weights novel findings. The norm is: a single study is a hypothesis; a replicated finding is a result.

The 6-minute read protocol

When someone cites a study at you, here is the actual workflow:

  1. Find the abstract (PubMed link, journal site, or Google Scholar). Read it.
  2. Identify the study type from the methods sentence. Is it RCT? Cohort? Cross-sectional?
  3. Find the four numbers: n, p, effect size, CI. These are usually in the results section's first paragraph or in a single summary table.
  4. Read the discussion/limitations paragraph for the authors' own caveats.
  5. Search Google Scholar for the study title + 'replication' or for 'meta-analysis' on the topic.
  6. Compare to the relevant guideline body's current recommendation. If the guideline doesn't reflect this study, the study probably hasn't (yet) changed the field.

Six minutes. The output is not 'I now know what's true' — it's 'I now know how much weight this single study should carry, and whether to update my beliefs or wait for replication.'
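If it helps to keep the protocol honest, the checklist can be written down as a small Python sketch. Everything here is a hypothetical note-taking aid: the class, field names, and flag wording are invented for illustration, the thresholds are this chapter's rough heuristics rather than formal standards, and the n<200 rule is applied to any single study instead of only RCTs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class StudySummary:
    design: str            # "rct", "cohort", "cross-sectional", "meta-analysis", ...
    n: int                 # sample size
    p_value: float
    ci_low: float          # lower bound of the 95% CI
    ci_high: float         # upper bound of the 95% CI
    null_value: float      # 1.0 for ratios, 0.0 for mean differences
    replicated: bool
    funder_has_stake: bool

def triage(study: StudySummary) -> List[str]:
    """Return the caution flags this chapter's heuristics would raise."""
    flags = []
    if study.design not in ("rct", "meta-analysis"):
        flags.append("observational or descriptive design: association, not causation")
    if study.design != "meta-analysis" and study.n < 200:
        flags.append("n < 200: do not act on this study alone")
    if study.p_value >= 0.05:
        flags.append("not conventionally significant")
    if study.ci_low <= study.null_value <= study.ci_high:
        flags.append("confidence interval includes the null")
    if study.funder_has_stake:
        flags.append("funder has a stake in the result")
    if not study.replicated:
        flags.append("unreplicated: a hypothesis, not a result")
    return flags

# Hypothetical example: a small, company-funded, unreplicated RCT.
example = StudySummary("rct", 120, 0.04, 0.55, 0.98, 1.0, False, True)
for flag in triage(example):
    print("-", flag)
```

The output is a list of reasons for caution, not a verdict. How much weight the study deserves is still a judgment call.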

Most published research findings are false … the probability that a research claim is true depends on the prior probability of it being true, the statistical power of the study, the level of statistical significance, and a number of biases.
Ioannidis, JPA — 'Why Most Published Research Findings Are False', PLoS Med 2005 (retrieved 2026-05-23 via pubmed.gov/16060722)

That paper is two decades old and was provocative; it remains uncomfortably relevant. The reforms it helped trigger (pre-registration, larger samples, open data, replication initiatives) are still in progress. Skepticism toward single studies is the rational baseline.
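The dependence on prior probability, power, and significance level named in the quote can be made concrete with a few lines of Python. This is a minimal Bayes-rule sketch that ignores the bias terms the paper also discusses; the function name and the input values are illustrative assumptions, not figures taken from the paper.

```python
def positive_predictive_value(prior: float, power: float, alpha: float = 0.05) -> float:
    """Probability that a 'statistically significant' finding reflects a real effect.

    prior: share of tested hypotheses that are actually true
    power: probability the study detects a real effect
    alpha: false-positive rate when there is no effect
    """
    true_positives = power * prior
    false_positives = alpha * (1.0 - prior)
    return true_positives / (true_positives + false_positives)

# A well-powered test of a plausible hypothesis versus
# an underpowered test of a long-shot hypothesis.
plausible = positive_predictive_value(prior=0.5, power=0.8)
long_shot = positive_predictive_value(prior=0.1, power=0.2)
print(f"plausible, well powered: {plausible:.0%}")
print(f"long shot, underpowered: {long_shot:.0%}")
```

Under these illustrative inputs, a significant result is real about 94% of the time in the first case and about 31% of the time in the second, even though both cleared the same p<0.05 bar.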