Least Squares Regression
Help Questions
A teacher models number of practice problems completed ($x$) and quiz score ($y$) with a least-squares regression line to predict quiz score from practice. A scatterplot with the fitted line is shown. Which statement about the regression line is correct?

Using the line to predict for $x$ far beyond the observed practice range is not a concern because the model is linear.
A large residual means the $x$-value is far from $\bar{x}$.
The least-squares line always passes through $(\bar{x},\bar{y})$, so the point of averages lies on the line.
The least-squares line minimizes the sum of the residuals, not the sum of squared residuals.
Residuals are computed as $\hat{y}-y$, and their sum must be positive if the slope is positive.
Explanation
This question tests a key property of least-squares regression lines. The regression line always passes through the point (x̄, ȳ), whose coordinates are the mean of the x-values and the mean of the y-values. This centroid property is guaranteed mathematically and anchors the line's position. The other statements are incorrect. Residuals are computed as y - ŷ (not ŷ - y), and their sum is always zero regardless of the slope. Least squares minimizes the sum of squared residuals, not the plain sum of residuals. A large residual means the observed y is far from the predicted ŷ; it says nothing about how far x is from x̄. Extrapolating far beyond the observed range is risky even when the model is linear.
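The centroid property and the zero residual sum can be checked numerically. The following is a minimal sketch in plain Python; the data values and the helper name `fit_line` are made up for illustration and do not come from the question.

```python
# Hypothetical data: practice problems completed (x) and quiz score (y).
x = [2, 4, 5, 7, 9, 12]
y = [55, 62, 60, 71, 78, 85]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

b0, b1 = fit_line(x, y)
x_bar = sum(x) / len(x)
y_bar = sum(y) / len(y)

# The prediction at x_bar should equal y_bar (up to floating-point rounding).
pred_at_mean = b0 + b1 * x_bar

# Residuals use the convention observed minus predicted: y - y_hat.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

print(f"prediction at x_bar: {pred_at_mean:.6f}, y_bar: {y_bar:.6f}")
print(f"sum of residuals: {sum(residuals):.6f}")
```

Any data set with at least two distinct x-values should show the same two facts; only the fitted slope and intercept change.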
A school counselor models the relationship between hours of sleep the night before an exam ($x$) and exam score ($y$) for a group of students using a least-squares regression line. A scatterplot with the fitted regression line is shown. Which statement about the regression line is correct?

The regression line always passes through the point whose coordinates are $(\bar{x},\bar{y})$.
The regression line minimizes the sum of the absolute values of the residuals.
The regression line guarantees accurate predictions for any $x$-value, even far outside the observed range.
The sum of the $y$-values for the data points must equal 0 when the regression line is fit.
A residual is the horizontal distance from a point to the regression line.
Explanation
This question tests understanding of least-squares regression properties. The least-squares regression line always passes through the point (x̄, ȳ), where x̄ is the mean of the x-values and ȳ is the mean of the y-values. This point is sometimes called the centroid, or center of mass, of the data. The other statements are incorrect. Least squares minimizes the sum of squared residuals, not the sum of their absolute values. Residuals are vertical distances (y - ŷ), not horizontal ones. Predictions far outside the observed range are unreliable. Finally, it is the sum of the residuals that equals zero, not the sum of the y-values.
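Why the line must pass through $(\bar{x},\bar{y})$ can be sketched in a few lines. Write the fitted line as $\hat y = b_0 + b_1 x$; the symbols $b_0$, $b_1$, and $n$ (the number of data points) are notation introduced here, not taken from the question. Setting the derivative of $\sum (y - b_0 - b_1 x)^2$ with respect to $b_0$ equal to zero gives the first line below, and the other two follow from it:

$$
\begin{aligned}
b_0 &= \bar{y} - b_1 \bar{x} \\
\hat y(\bar{x}) &= b_0 + b_1 \bar{x} = \bar{y} \\
\sum (y - \hat y) &= n\bar{y} - n b_0 - n b_1 \bar{x} = n\bar{y} - n(\bar{y} - b_1 \bar{x}) - n b_1 \bar{x} = 0
\end{aligned}
$$

The second line is the centroid property, and the third shows that the residuals sum to zero.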
A nutritionist records daily sodium intake (mg) $x$ and systolic blood pressure (mmHg) $y$ for 18 adults and fits a least-squares regression line to predict blood pressure from sodium intake. Which statement about the regression line is correct?
A negative residual means the observed value is above the line.
The regression line is chosen so that the sum of residuals is minimized in absolute value, not squared value.
The regression line is chosen so that $\sum \hat y=0$.
The regression line must pass through the point $(0,\bar y)$.
The regression line is chosen so that the sum of squared residuals is minimized, where residuals are vertical differences $y-\hat y$.
Explanation
AP Statistics emphasizes the least-squares criterion, as in this sodium and blood pressure context. The correct statement is the one saying the line minimizes the sum of squared vertical residuals y - ŷ. This is the defining feature of the method. The statement about minimizing absolute values is a distractor: it describes a different method that is more resistant to outliers. Mini-lesson: The line does not force Σŷ = 0, and it does not have to pass through (0, ȳ); it passes through (x̄, ȳ). A negative residual means the point is below the line, not above. Use caution with extrapolation.
An environmental scientist measures daily high temperature (°C) $x$ and electricity use (kWh) $y$ for 20 days and fits a least-squares regression line to model electricity use from temperature. Which statement about the regression line is correct?
The least-squares line is the one for which the sum of squared residuals is as small as possible.
The least-squares line is the one that minimizes the sum of squared perpendicular distances from the points to the line.
The least-squares line guarantees perfect predictions when $x$ is within the observed range.
Extrapolating to temperatures far outside the observed range is always reliable because it uses all the data.
The least-squares line is chosen so that the residuals alternate in sign as $x$ increases.
Explanation
This AP Statistics question asks for the criterion that defines the least-squares line when modeling electricity use from temperature. The correct statement is that the line makes the sum of squared residuals as small as possible. This is the fundamental goal of least squares. The statement about squared perpendicular distances is a distractor: that criterion belongs to total least squares (orthogonal regression) or principal components, not standard regression. Mini-lesson: Least-squares regression uses vertical residuals because the goal is to predict y from x. Residuals may happen to alternate in sign, but that is not how the line is chosen. The line does not guarantee perfect predictions within the observed range, and extrapolation far outside it can fail if the relationship changes beyond the data.
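The "as small as possible" criterion can be checked by nudging the fitted line and recomputing the sum of squared residuals. This is a rough sketch with made-up data; `fit_line` and `sse` are hypothetical helper names, and the nudge sizes are arbitrary.

```python
# Hypothetical data: daily high temperature (x, °C) and electricity use (y, kWh).
x = [18, 21, 24, 27, 30, 33]
y = [20.5, 23.0, 24.1, 29.2, 30.8, 35.9]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def sse(b0, b1, x, y):
    """Sum of squared vertical residuals for the line y_hat = b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

b0, b1 = fit_line(x, y)
best = sse(b0, b1, x, y)

# Nudge the intercept and slope in several directions; every nudged line
# should have a larger sum of squared residuals than the fitted line.
nudged = [
    sse(b0 + db0, b1 + db1, x, y)
    for db0 in (-0.5, 0.0, 0.5)
    for db1 in (-0.05, 0.0, 0.05)
    if (db0, db1) != (0.0, 0.0)
]

print(f"least-squares SSE: {best:.4f}")
print(f"smallest SSE among nudged lines: {min(nudged):.4f}")
```

A finite set of nudges is only a spot check, not a proof; the formulas for slope and intercept are what guarantee the minimum.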
A city planner records distance from downtown (miles) $x$ and average rent (dollars) $y$ for a sample of apartments, then fits a least-squares regression line to model rent from distance. Which statement about the regression line is correct?
The least-squares line is the line that makes the maximum residual as small as possible.
Residuals should be measured as perpendicular distances so the “closest” line is found.
The least-squares line is selected so that all residuals are positive.
The least-squares line is the line that minimizes $\sum(y-\hat y)^2$.
Predictions from the line are equally trustworthy for $x$ within the observed distances and for $x$ far beyond them.
Explanation
This AP Statistics item checks the optimization behind least squares for predicting rent from distance. The correct statement is that the line minimizes $\sum(y-\hat y)^2$, the sum of squared vertical errors. This defines the method. The statement about making the maximum residual as small as possible is a distractor: that is minimax regression, not least squares. Mini-lesson: Least squares reduces overall squared error. It does not make all residuals positive (they sum to zero, so some are negative unless every point is on the line), and it does not use perpendicular distances. Predictions are most reliable within the observed x-values, not far beyond them.
A student investigates the relationship between engine size (liters) $x$ and highway fuel efficiency (mpg) $y$ for several cars and fits a least-squares regression line to predict mpg from engine size. The scatterplot with the fitted line is shown, and the goal is to use the model for prediction. Which statement about the regression line is correct?

The least-squares regression line minimizes the sum of squared vertical residuals $\sum(y-\hat y)^2$.
The regression line is chosen so that the data points are as close as possible in perpendicular distance to the line.
A residual is the horizontal difference between the observed $x$ and the predicted $\hat x$ on the line.
Because the line fits the data, it is reasonable to predict mpg for an engine size much larger than any in the data with the same confidence as within the data range.
The least-squares regression line makes the sum of the residuals equal to the slope of the line.
Explanation
This AP Statistics question on engine size and mpg tests least-squares properties. The correct statement is that the line minimizes $\sum(y-\hat y)^2$, the sum of squared vertical residuals. This is what "best fit" means for a least-squares line. The statement about perpendicular distance is a distractor: that is not the criterion for regression of y on x. Mini-lesson: Residuals are vertical, not horizontal, and the sum of the residuals is zero, not equal to the slope. Extrapolation does not deserve the same confidence as prediction within the data range, because the relationship may not stay linear outside the data.
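The warning about extrapolation can be illustrated with a small sketch. The curved "true" relationship below (`true_mpg`) is invented purely for illustration, as are the data points and the helper `fit_line`; real engine data would look different.

```python
def true_mpg(liters):
    """Assumed (made-up) curved relationship between engine size and mpg."""
    return 20 + 30 / liters

# Observed engine sizes cover only 1.0 to 3.0 liters.
x = [1.0, 1.5, 2.0, 2.5, 3.0]
y = [true_mpg(xi) for xi in x]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

b0, b1 = fit_line(x, y)

def predict(liters):
    return b0 + b1 * liters

# Error for an engine size inside the observed range vs. far outside it.
inside, outside = 2.25, 6.0
err_inside = abs(predict(inside) - true_mpg(inside))
err_outside = abs(predict(outside) - true_mpg(outside))

print(f"x={inside}: predicted {predict(inside):.1f}, assumed truth {true_mpg(inside):.1f}")
print(f"x={outside}: predicted {predict(outside):.1f}, assumed truth {true_mpg(outside):.1f}")
```

Under this assumed curve, the line tracks the data reasonably well between 1.0 and 3.0 liters but gives an impossible negative mpg at 6.0 liters, which is the kind of failure the explanation warns about.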
A fitness coach records each client’s resting heart rate (beats per minute) $x$ and time to run 1 mile (minutes) $y$ for a group of clients and fits a least-squares regression line to predict mile time from heart rate. Which statement about the regression line is correct?
The line is equally appropriate for prediction at any $x$-value, including far beyond the observed heart rates.
A residual is computed as $\hat y-y$, so points above the line have negative residuals.
The regression line is the line that minimizes the sum of absolute vertical distances to the line.
The regression line always passes through $(\bar x,\bar y)$.
The regression line is chosen so that the average of the predicted values $\hat y$ is $0$.
Explanation
AP Statistics covers least-squares regression features, such as the line passing through the mean point, as seen in this heart rate and mile time scenario. The correct statement is that the regression line always goes through (x̄, ȳ), which centers it in the data cloud. One distractor defines residuals as ŷ - y instead of y - ŷ, which reverses the sign interpretation: with the standard definition, points above the line have positive residuals. Mini-lesson on least squares: The line minimizes the sum of squared vertical residuals, not the sum of absolute vertical distances. It is most trustworthy for prediction within the data range, and extrapolation far beyond the observed heart rates can be unreliable. The average of the predicted values ŷ is not 0; it equals ȳ.
A teacher records number of absences $x$ and final course grade (percent) $y$ for students in a class and fits the least-squares regression line to predict grade from absences. Which statement about the regression line is correct?
The regression line must pass through the point $(\bar x,0)$.
If a student’s point lies below the regression line, then the residual $y-\hat y$ is negative.
If a student’s point lies below the regression line, then the residual $\hat y-y$ is negative.
The regression line is chosen so that the median residual is $0$ rather than the mean residual.
Residuals are the horizontal differences $x-\hat x$ from a point to the line.
Explanation
AP Statistics questions on least squares often test residual definitions, as in this absences and grades example. The correct statement is that for a point below the line, y < ŷ, so the residual y - ŷ is negative: the student scored lower than predicted. This sign convention is key for residual analysis. The distractor that reverses the formula to ŷ - y flips the sign, so that difference is positive for a point below the line. Mini-lesson: Residuals are vertical differences, not horizontal ones. The least-squares line makes the mean residual 0; it does not guarantee that the median residual is 0. The line passes through (x̄, ȳ), not (x̄, 0).
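The sign convention can be seen point by point in a short sketch. The absences and grades below are invented for illustration, and `fit_line` is a hypothetical helper rather than anything given in the question.

```python
# Hypothetical data: number of absences (x) and final grade in percent (y).
x = [0, 1, 2, 3, 5, 8]
y = [92, 90, 81, 85, 70, 64]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

b0, b1 = fit_line(x, y)

rows = []
for xi, yi in zip(x, y):
    y_hat = b0 + b1 * xi
    residual = yi - y_hat        # standard convention: observed minus predicted
    reversed_diff = y_hat - yi   # the reversed formula flips every sign
    # Position is decided by comparing y with y_hat directly.
    position = "below" if yi < y_hat else "above"
    rows.append((xi, residual, reversed_diff, position))
    print(f"x={xi}: y={yi}, y_hat={y_hat:.2f}, "
          f"y - y_hat={residual:+.2f}, point is {position} the line")
```

Every point labeled "below" has a negative value of `y - y_hat`, and the reversed difference has the opposite sign in every row.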
A botanist models amount of fertilizer ($x$, in grams) and plant height after 4 weeks ($y$, in cm) using a least-squares regression line to predict height from fertilizer. A scatterplot with the fitted line is shown. Which statement about the regression line is correct?

Because the line fits by least squares, predictions outside the observed fertilizer range are equally reliable.
The least-squares regression line always has $r^2=1$ because it is the best-fitting line.
The line is chosen to minimize $\sum(y-\hat{y})^2$ across all data points.
The residuals are horizontal distances, so changing the units of $x$ changes the residuals but changing units of $y$ does not.
The regression line must pass through the point $(0,0)$ if $x=0$ grams is possible.
Explanation
This question tests understanding of the least-squares criterion. The regression line is specifically chosen to minimize the sum of squared residuals, written mathematically as Σ(y - ŷ)². This optimization gives a unique line with desirable statistical properties for prediction and inference. The other statements are incorrect. The value of r² measures the proportion of variation in y explained by the line; it equals 1 only when every point lies exactly on the line. Residuals are vertical distances in y, not horizontal distances, so it is the units of y that determine the units of the residuals. Predictions outside the observed fertilizer range are less reliable than predictions within it. The line passes through (0,0) only if the fitted intercept happens to be 0 or if the model is forced through the origin; the point it is guaranteed to pass through is (x̄, ȳ).
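Two of these points, that r² is usually below 1 and that residuals live in the units of y, can be checked with a brief sketch. The fertilizer and height values are made up, and `fit_line` and `residuals_for` are hypothetical helper names.

```python
# Hypothetical data: fertilizer in grams (x) and plant height in cm (y).
x = [0, 5, 10, 15, 20, 25]
y = [10.2, 13.1, 14.8, 18.9, 19.5, 24.0]

def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def residuals_for(x, y):
    """Fit the line, then return the vertical residuals y - y_hat."""
    b0, b1 = fit_line(x, y)
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

res = residuals_for(x, y)

# r^2 = 1 - SSE / SST, the proportion of variation in y explained by the line.
y_bar = sum(y) / len(y)
sse = sum(r ** 2 for r in res)
sst = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - sse / sst

# Changing the units of x (grams to kilograms) leaves the residuals unchanged.
res_x_kg = residuals_for([xi / 1000 for xi in x], y)

# Changing the units of y (cm to mm) rescales every residual by the same factor.
res_y_mm = residuals_for(x, [yi * 10 for yi in y])

print(f"r^2 = {r_squared:.4f}")
print(f"fitted intercept = {fit_line(x, y)[0]:.3f}")
```

For this made-up data the fitted intercept is also nowhere near 0, even though x = 0 grams appears in the data.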