Why only well-designed experiments let us draw cause-and-effect conclusions from data.
For centuries, scientists and physicians relied on anecdotal evidence and uncontrolled observations to evaluate treatments, policies, and interventions. The results were often misleading: a new tonic might appear effective simply because patients who chose to take it were healthier to begin with. The fundamental problem—confounding variables—made it impossible to separate the effect of the treatment from pre-existing differences between groups. The development of formal experimental design over the twentieth century provided a rigorous framework for making causal inferences rather than merely detecting associations.
The central question this lesson addresses is: Under what conditions can we move beyond association and legitimately claim that one variable causes a change in another? The answer lies in understanding how experiments are designed, why random assignment is essential, and what scope of inference a given study design supports.
A well-designed experiment is the only data-collection method that supports a cause-and-effect conclusion. Observational studies, no matter how large, can only establish associations because uncontrolled lurking variables may explain observed differences. The following principles distinguish experiments from other study designs and dictate the kind of inference each supports.
The diagram above captures the essential logic of inference in statistics. When the researcher actively imposes treatments through random assignment, any systematic difference between group outcomes can be attributed to the treatment because randomization balances all other variables—both measured and unmeasured—across groups. In contrast, when subjects choose their own groups (or are sorted by nature, economics, or personal preference), a lurking variable may drive both group membership and the response, producing a spurious association. This distinction is the single most important idea in the AP Statistics curriculum on data collection.
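The contrast between self-selected groups and randomly assigned groups can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the "health" lurking variable, the noise levels, and the sample size are all hypothetical, and the treatment is deliberately given no effect at all.

```python
import random

def mean(values):
    return sum(values) / len(values)

def simulate(n=10_000, seed=0):
    """Compare self-selection with random assignment when the true effect is zero."""
    rng = random.Random(seed)
    # Hypothetical lurking variable: baseline health of each subject.
    health = [rng.gauss(0, 1) for _ in range(n)]
    # The response depends on health only; the treatment does nothing.
    outcome = [h + rng.gauss(0, 1) for h in health]

    # Self-selection: healthier subjects are more likely to take the treatment.
    self_selected = [h + rng.gauss(0, 1) > 0 for h in health]
    # Random assignment: a coin flip decides, ignoring health entirely.
    randomized = [rng.random() < 0.5 for _ in range(n)]

    def difference(took_treatment):
        treated = [y for y, t in zip(outcome, took_treatment) if t]
        control = [y for y, t in zip(outcome, took_treatment) if not t]
        return mean(treated) - mean(control)

    return difference(self_selected), difference(randomized)

observational_diff, experimental_diff = simulate()
print(f"self-selected groups:     {observational_diff:.2f}")
print(f"randomly assigned groups: {experimental_diff:.2f}")
```

Under these assumptions the self-selected comparison shows a sizeable difference even though the treatment has no effect, while the randomized comparison stays close to zero, because the coin flip spreads healthy and unhealthy subjects evenly across both groups.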
The statistical engine behind experimental inference relies on the chance variation produced by the random assignment itself under the null hypothesis. After random assignment and data collection, statisticians ask: "If the treatment had no effect, how likely is a difference at least as extreme as the one observed?" When that probability (the p-value) is sufficiently small, we reject the null hypothesis and conclude that there is convincing evidence the treatment caused the observed effect.
Consider an experiment with two groups. After measuring the response variable, compute the observed difference in means (or proportions). Under the null hypothesis that treatment has no effect, every subject would have produced the same response regardless of group assignment. We can simulate thousands of re-randomizations to build a distribution of differences that would occur by chance alone.
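The re-randomization procedure described above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the function name `rerandomization_test`, the two-sided definition of "at least as extreme", and the score lists are all assumptions made for the example, and the scores are invented rather than taken from any study.

```python
import random

def mean(values):
    return sum(values) / len(values)

def rerandomization_test(treatment, control, reps=10_000, seed=0):
    """Estimate a two-sided p-value by re-randomizing group labels."""
    rng = random.Random(seed)
    d_obs = mean(treatment) - mean(control)

    # Under the null hypothesis each response is unaffected by group,
    # so the pooled responses can be dealt out to the groups at random.
    pooled = list(treatment) + list(control)
    n_treatment = len(treatment)

    at_least_as_extreme = 0
    for _ in range(reps):
        rng.shuffle(pooled)
        d_sim = mean(pooled[:n_treatment]) - mean(pooled[n_treatment:])
        if abs(d_sim) >= abs(d_obs):
            at_least_as_extreme += 1

    return d_obs, at_least_as_extreme / reps

# Hypothetical test scores, invented for illustration only.
treatment_scores = [78, 85, 90, 82, 88, 91, 79, 86]
control_scores = [72, 80, 75, 83, 77, 74, 81, 76]

d_obs, p_value = rerandomization_test(treatment_scores, control_scores)
print(f"observed difference in means: {d_obs:.3f}")
print(f"estimated p-value: {p_value:.4f}")
```

The estimated p-value is the fraction of shuffles whose difference is at least as large in magnitude as the observed one. More repetitions give a more stable estimate, and a one-sided version would compare `d_sim >= d_obs` instead of the absolute values.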
The observed difference is
$$d_{\text{obs}} = \bar{x}_{\text{treatment}} - \bar{x}_{\text{control}},$$
where $\bar{x}_{\text{treatment}}$ and $\bar{x}_{\text{control}}$ are the sample means of the treatment and control groups, respectively. The p-value is the proportion of re-randomized differences that are at least as extreme as $d_{\text{obs}}$. A small p-value (typically < 0.05) provides evidence against the null hypothesis.
Every statistical study supports a certain scope of inference determined by two design features: whether the study used random selection (to obtain the sample from a larger population) and whether it used random assignment (to allocate subjects to treatments). These two forms of randomness serve different purposes: random selection supports generalizing results to the population, while random assignment supports causal conclusions.
On the AP exam, the most common scenario involves a study that uses random assignment but not random selection—placing it in the upper-right cell. In this case, you can conclude that the treatment caused the observed difference, but only for the subjects in the study, not for any broader population. Conversely, a well-designed survey (random selection, no random assignment) supports generalization but not causation. The ideal—both forms of randomness—is rare outside large-scale clinical trials.
A researcher wants to study whether background music improves test performance. She recruits 60 volunteers from her university and randomly assigns 30 to take a math test with classical music playing and 30 to take the same test in silence. The music group scores an average of 4.2 points higher. A significance test yields p = 0.014.
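The scope-of-inference rule can be written as a small lookup, which makes it easy to check a conclusion against a design. This is only a sketch: the names `Scope`, `scope_of_inference`, and `describe` are hypothetical helpers introduced for illustration, and the wording of the returned strings is an assumption, not official AP phrasing.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Scope:
    causal: bool         # supported by random assignment
    generalizable: bool  # supported by random selection

def scope_of_inference(random_selection: bool, random_assignment: bool) -> Scope:
    """Map the two design features to the conclusions they support."""
    return Scope(causal=random_assignment, generalizable=random_selection)

def describe(scope: Scope) -> str:
    claim = "causal conclusion supported" if scope.causal else "association only"
    reach = ("generalizes to the sampled population" if scope.generalizable
             else "applies only to the subjects studied")
    return f"{claim}; {reach}"

# The music study: volunteers (no random selection), random assignment used.
music_study = scope_of_inference(random_selection=False, random_assignment=True)
print(describe(music_study))
```

For the music study, the design supports a causal conclusion that applies only to the subjects studied. Because p = 0.014 is below the usual 0.05 threshold, there is evidence that the music caused higher scores for these 60 volunteers, but the result should not be extended to all students at the university or beyond.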
| Feature | Experiment | Observational Study |
|---|---|---|
| Treatment assignment | Researcher imposes via random assignment | Subjects self-select or are observed as-is |
| Confounding control | Randomization balances known and unknown confounders | Can only control for measured confounders (statistically) |
| Causal inference | Yes — supported | No — only association |
| Ethical feasibility | May be unethical (e.g., assigning smoking) | Can study exposures that cannot be imposed |
| Generalizability | Often limited (convenience samples) | Often broader (large random samples available) |
| Cost & complexity | Typically expensive and time-consuming | Often cheaper; can use existing data |
The concepts in this lesson lay the groundwork for every inference procedure you will encounter later in the course. When you perform a two-sample t-test, a chi-square test, or a confidence interval, the validity of your conclusion depends on how the data were collected. A statistically significant result from a randomized experiment warrants a causal interpretation; the same p-value from an observational study does not.
| Concept in This Lesson | Where It Reappears Later |
|---|---|
| Random assignment → causal claim | Conclusion step of every significance test ("there is evidence that X causes Y" vs. "there is an association") |
| Random selection → generalizability | Identifying the population to which confidence intervals or test results apply |
| Confounding variables | Explaining why an observed association may not reflect a causal relationship in regression (lurking variables) |
| P-value under H₀ | All hypothesis tests—z-tests, t-tests, chi-square tests—use the same null-hypothesis logic introduced here |
In more advanced coursework, you would encounter the Rubin Causal Model and techniques like propensity-score matching that attempt to approximate the benefits of randomization in observational data. At the AP level, however, the essential takeaway is simpler: design determines conclusion. Master the scope-of-inference framework, and you will navigate the conclusion step of any FRQ with confidence.
The ability to draw valid inferences from data depends entirely on study design. Only experiments with random assignment support cause-and-effect conclusions, because randomization balances both known and unknown confounding variables across groups. Observational studies can detect associations but cannot eliminate confounders, so causal language is never appropriate for them.
The scope of inference is determined by two dimensions: random selection (enables generalization to the population) and random assignment (enables causal claims). A well-designed experiment with control, replication, and blinding maximizes the strength of evidence. Always match your conclusion to the design: state causation only when random assignment justifies it, and limit your generalization to the population from which the sample was drawn.