A 2012 study in Science found that male fruit flies that had been repeatedly rejected by females chose food with 15% alcohol more often than males allowed to mate. The study highlights how sound experimental design—in this case, having clear treatment and control groups—can reveal underlying behavior without relying on complex statistics. The chapter emphasizes that analysis cannot fix flawed data; the quality of the data determines the relevance, validity, and usability of the results: “garbage in, garbage out.”
The best data sets have three characteristics. First, they allow inference from representative samples. A simple random sample gives every member of a set an equal chance of selection; larger samples can reduce random error, but they cannot fix bias in poorly compiled data. Second, treatment and control groups isolate causal effects. For this, randomization balances observed and unobserved traits, which is especially critical in human studies. Third, anticipatory data collection, as in the Framingham Heart Study and the Perry Preschool Study, enables researchers to link early-life factors to long-term outcomes.
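These design principles are easy to demonstrate in code. Below is a minimal Python sketch (not from the book; the population and sample sizes are illustrative) of a simple random sample followed by randomized assignment to treatment and control groups:

```python
import random

random.seed(1)

# A sampling frame: stand-in IDs for every member of the population.
population = list(range(100_000))

# Simple random sample: every member has an equal chance of selection.
sample = random.sample(population, k=1_000)

# Randomized assignment: shuffling before the split balances observed
# and unobserved traits across the two groups, in expectation.
random.shuffle(sample)
treatment, control = sample[:500], sample[500:]
```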
The author recalls falling ill in Kathmandu and completing a lengthy cross-sectional survey before tests identified a waterborne cyanobacterium treatable with antibiotics, an example of one-time, snapshot data collection.
Common data problems include selection bias (nonrepresentative sampling), treatment assignment bias (e.g., healthier patients receiving certain therapies), self-selection bias (volunteers differ from nonvolunteers), publication bias (positive results are favored, so a single outlier study may be published while many that refute it are not), recall bias (survey subjects’ memories alter with time), survivorship bias (those who drop out of a sample leave the remaining data skewed), and healthy user bias (people who adopt recommended behaviors differ in many other ways from those who do not). These pitfalls underscore that careful sampling, randomization, and transparent reporting are essential to credible research.
The chapter explains how the central limit theorem (CLT) enables strong inferences from limited data. Wheelan makes up an outlandish story to demonstrate how sample means signal a population’s identity. A bus full of marathoners goes missing in a city that is also hosting a sausage festival, whose attendees are ferried around by bus. One errant bus is found, but since its passengers’ average weight is over 220 pounds—an unlikely mean for runners—it is most likely not the lost marathon bus.
The CLT holds that means of large, random samples from any population are distributed normally around the true population mean, regardless of the population’s shape. This allows four core inferences: predicting sample characteristics from known populations, inferring population parameters from samples (as in polling), testing whether a sample likely came from a given population, and assessing whether two samples likely share a common source.
For example, to survey US household income, even with a skewed distribution, repeated samples of 1,000 households produce a normal distribution of sample means around the true mean. With at least about 30 observations, the CLT’s normal approximation typically applies. The standard error (the population’s standard deviation divided by the square root of the sample size) captures how sample means vary: It shrinks as sample size grows and expands with greater population dispersion. Consequently, about 68% of sample means fall within one standard error of the population mean, 95% within two, and 99.7% within three.
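A short simulation makes the theorem concrete. The sketch below (illustrative numbers, not the book’s; a lognormal distribution stands in for skewed household income) draws repeated samples and checks that roughly 95% of the sample means land within two standard errors of the true mean:

```python
import random
import statistics

random.seed(0)

# A right-skewed "income" population, like US household income.
population = [random.lognormvariate(11, 0.8) for _ in range(100_000)]

true_mean = statistics.mean(population)
# Standard error of the mean for samples of n = 1,000:
# population standard deviation divided by sqrt(n).
se = statistics.pstdev(population) / 1_000 ** 0.5

# Draw many samples of 1,000 households; record each sample's mean.
sample_means = [
    statistics.mean(random.sample(population, 1_000)) for _ in range(1_000)
]

within_2se = sum(abs(m - true_mean) < 2 * se for m in sample_means)
print(f"{within_2se / len(sample_means):.0%} of sample means within 2 SE")  # ~95%
```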
The bus example applies these tools. If the mean of marathoner population weight is 162 pounds, with a standard deviation of 36, then the standard error for a 62-person sample is roughly 4.6. A bus full of people weighing an average of 194 pounds is over three standard errors from the mean, making it extremely unlikely to be made up of runners. A colleague notes that in real life, logic also matters (buses rarely go missing, so any missing bus is most likely the right one). However, the scenario clarifies how the CLT quantifies uncertainty.
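The arithmetic behind that conclusion is short enough to check directly, using the chapter’s numbers:

```python
# Values from the bus example: marathoners' mean weight, standard
# deviation, and the number of passengers on the found bus.
mean_weight, sd, n = 162, 36, 62

standard_error = sd / n ** 0.5                # 36 / sqrt(62) ≈ 4.6
z = (194 - mean_weight) / standard_error      # ≈ 7 standard errors away
print(f"SE ≈ {standard_error:.1f}, z ≈ {z:.1f}")
```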
In college, Wheelan spent most of a semester blowing off one of his classes. Then, when other time commitments shifted, he decided to really learn the material for the final. His dramatic improvement on the exam prompted his professor to question whether he had cheated. This anecdote illustrates statistical inference: weighing how likely observed outcomes are under competing explanations. Statistics rarely proves anything with certainty; rather, it gauges probabilities. For example, rolling 10 consecutive sixes while gambling is astronomically unlikely by chance, making cheating plausible, yet rare events do occur (e.g., an individual has been struck by lightning multiple times).
Inference starts with a null hypothesis (e.g., that a given treatment has no effect) and an alternative hypothesis. Researchers reject the null when the observed results would be very unlikely if the null were true. In a clinical trial, the null hypothesis might posit that a new drug performs no better than a placebo; if the treatment group then shows starkly lower infection rates than the control group, the study is justified in rejecting the null. The same logic can catch cheating: In the Atlanta schools scandal, wrong-to-right erasures were 20-50 standard deviations above the norm, results so improbable by chance that investigators concluded cheating had occurred.
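As a concrete illustration of this logic, here is a minimal Python sketch of a two-proportion z-test comparing infection rates in a treatment and a control group. The trial numbers are made up for the example; the book describes the reasoning, not this code:

```python
from math import erf, sqrt

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """One-sided z-test: is group A's rate lower than group B's?"""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 0.5 * (1 + erf(z / sqrt(2)))  # standard normal CDF at z
    return z, p_value

# Hypothetical trial: 20/500 infections on the drug vs. 60/500 on placebo.
z, p = two_proportion_z_test(20, 500, 60, 500)
print(f"z = {z:.2f}, one-sided p = {p:.1e}")  # tiny p: reject the null
```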
A common threshold for rejecting the null is .05 (5%). In the bus-weight scenario, with a population mean of 162 pounds and a standard error of 4.6, a busload averaging 136 pounds falls far outside the 95% range, yielding a p-value below .0001 and justifying rejection of the null.
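That p-value can be verified in a few lines, reusing the bus example’s numbers:

```python
from math import erf, sqrt

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

se = 36 / sqrt(62)            # ≈ 4.6, as in the bus example
z = (136 - 162) / se          # ≈ -5.7 standard errors below the mean
p_value = 2 * phi(-abs(z))    # two-sided p-value
print(f"z ≈ {z:.1f}, p ≈ {p_value:.1e}")  # far below .05: reject the null
```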
Wheelan notes that statistical significance does not imply causation or meaningful magnitude. A study in the Archives of General Psychiatry found larger average brain volumes among children with autism (p = .002), with nonoverlapping 95% confidence intervals. The correlation does not clarify the causal relationship, but it does invite further study. Conversely, a 2011 paper claiming evidence of extrasensory perception met the .05 threshold but failed the standard of extraordinary evidence for extraordinary claims; its results could not be reproduced and were most likely a product of chance.
Choosing a significance level involves trade-offs between Type I errors (false positives) and Type II errors (false negatives), with different domains—spam filtering, cancer screening, counterterrorism—prioritizing errors differently.
A late-2011 New York Times/CBS poll reported widespread public anxiety, including low approval of Congress and polarized views on President Obama. Such findings rely on polling, which applies the central limit theorem: a large, representative sample mirrors the population. The “margin of error” is the 95% confidence interval around a sample proportion; ± 3% sampling error means that in 95 of 100 similar polls, results fall within three points of the true value.
Because many polling questions estimate proportions, the relevant standard error depends on both the proportion and the sample size; larger samples yield narrower margins. An exit poll of 500 voters showing 53% for one candidate has a margin of ± 2% at only about 68% confidence. Raising confidence to 95% widens the band, sometimes preventing a clear call. Increasing the sample to 2,000 shrinks the standard error enough that the same ± 2% margin holds at 95% confidence, allowing a more decisive inference.
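These figures follow from the standard error of a sample proportion, the square root of p(1 − p)/n. A quick sketch (reproducing the chapter’s exit-poll numbers, with z = 1 for roughly 68% confidence and z = 2 for roughly 95%):

```python
from math import sqrt

def margin_of_error(p, n, z=2):
    """Approximate margin of error for a sample proportion."""
    return z * sqrt(p * (1 - p) / n)

# Exit poll: 53% of voters favor one candidate.
print(f"±{margin_of_error(0.53, 500, z=1):.1%} at ~68% confidence")   # ±2.2%
print(f"±{margin_of_error(0.53, 500, z=2):.1%} at ~95% confidence")   # ±4.5%
print(f"±{margin_of_error(0.53, 2000, z=2):.1%} at ~95% confidence")  # ±2.2%
```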
Methodology matters. Representative sampling avoids self-selection biases common in call-in or online polls: Professional pollsters use random digit dialing of numbers that include cell phones, randomly select one adult per household, and make repeated callbacks to reduce nonresponse bias. Question wording also shapes results. Gallup finds over 60% of respondents support the death penalty for murder, but this number drops significantly when the question offers life without parole as an alternative. Pollsters use split-sample tests to detect such effects and avoid leading language.
Finally, respondents may misreport details about their lives or opinions, particularly on sensitive topics. For example, checking voting records shows that many poll respondents overreport how often they vote, which is why pollsters screen for likely voters rather than all voting-age adults. Interestingly, the National Opinion Research Center’s 1995 “Sex Study,” designed to inform HIV/AIDS research, achieved a nearly 80% response rate and found less sexual activity than commonly assumed.
Ultimately, the challenges of polling are twofold: first, drawing a proper sample, and second, eliciting accurate responses. The statistics calculations are straightforward, but execution and interpretation are critical.
Wheelan relies on narrative analogies to make abstract statistical principles tangible. To dramatize the most complex ideas in the book, he creates memorable, borderline nonsensical scenarios that stick in readers’ minds. The saga of the lost marathoners’ bus, in a city also hosting overweight sausage festival attendees riding other buses, illustrates this technique in Chapter 8. The far-fetched story becomes an extended metaphor for the central limit theorem, translating concepts like sample means and population characteristics into a concrete mystery. Wheelan thus demonstrates statistical reasoning in action before introducing its formal mechanics. By embedding the logic of inference within a narrative framework, the author makes the theory accessible, reinforcing the idea that Statistical Literacy is Empowering when the subject is presented in a non-abstruse way.
These chapters are structured to build the reader’s understanding of statistical inference, moving sequentially from foundational principles to complex applications. Chapter 7 establishes that the quality of any analysis is contingent upon the quality of the initial data, encapsulated in the maxim “garbage in, garbage out” (111). This chapter functions as a cautionary tale, detailing the various forms of bias that can invalidate statistical techniques. Having established the importance of sound data, Chapter 8 introduces the central limit theorem as the conceptual engine that empowers statistical generalization. Chapter 9 then builds directly on this foundation, explaining how the theorem enables hypothesis testing—the formal process of using sample data to make claims about a population. The section culminates with Chapter 10 on polling, which serves as a capstone case study, demonstrating how the preceding concepts—sampling, bias, the central limit theorem, and confidence intervals—coalesce in a real-world application. This incremental layering of concepts ensures that each new idea is anchored to a previously established one.
Throughout this section, the author frames statistical work as a form of critical inquiry that contends with inherent limitations. While Wheelan always presents Probability as a Tool for Better Decisions, the catalogue of data problems in Chapter 7, from selection bias to healthy user bias, encourages the reader to question the provenance and integrity of data before accepting a conclusion. This theme of fallibility is further developed in the discussion of hypothesis testing in Chapter 9. The trade-off between Type I and Type II errors—illustrated via examples like spam filters, cancer screenings, and counterterrorism—highlights the inevitability of statistical error and couches it as a practical, ethical dilemma. Choosing a significance level is a value-laden decision about which kind of mistake is more acceptable in a given context. This focus on the potential for error and bias presents statistics as a powerful yet imperfect tool that requires careful judgment.
By grounding statistical concepts in significant social and scientific controversies, the author reinforces the thesis that quantitative literacy is essential for modern citizenship, especially given the ease with which Statistics Can Mislead or Be Manipulated. The analysis of the Atlanta schools cheating scandal, for instance, illustrates how understanding standard deviations can uncover systemic fraud. The breakdown of the study linking autism to brain volume transforms a news headline into a lesson on comparing sample means and interpreting p-values. Furthermore, the critique of a study on extrasensory perception (ESP) serves as a lesson in scientific skepticism, demonstrating that statistical significance alone is not sufficient justification for extraordinary claims. As a critique cited in the text notes, “Claims that defy almost every law of science are by definition extraordinary and thus require extraordinary evidence” (161). This example underscores the necessity of context and prior knowledge in evaluating statistical findings. The application of inference to public opinion polling in Chapter 10 shows how the mechanics of sampling and margin of error directly shape the political discourse that informs a democratic society.