13  Discussion 11: Correlation & Regression (from Fall 2025)

13.0.1 Contact Information

Name: Wesley Zheng
Pronouns: He/him/his
Email: wzheng0302@berkeley.edu
Discussion: Wednesdays, 12–2 PM @ Etcheverry 3105
Office Hours: Tuesdays/Thursdays, 2–3 PM @ Warren Hall 101

Feel free to contact me by email anytime; I typically respond within a day or so!

13.1 Standard Units and Correlation

Note

The Correlation Coefficient: The average of the product of x and y when both are in standard units.

This coefficient is a number between −1 and 1 that measures the strength and direction of the linear relationship between two variables, x and y.

Code
from datascience import *
import numpy as np

13.1.1 (a)

When calculating the correlation coefficient, why do we convert data to standard units?

Note: Standard Units and Comparing Distributions
  • Changing the units of your data does not affect the correlation coefficient.
  • Standardizing allows us to compare distributions on different scales.
    • Standard units tell us the relative position of each value: how many standard deviations above or below the mean it is.
  • Example: If convert() returns 3 for a value, that value is 3 standard deviations above the mean.
Answer

We convert data to standard units in order to compare it with other data that may be measured in different units and on different scales. For example, if we wanted to compare the weights of cars (usually thousands of pounds) to the maximum speeds of cars (usually tens of miles per hour), converting to standard units allows us to effectively compare the two variables.

Moreover, using standard units gives us the following nice properties:

  • r is a pure number with no units (because of standardization).
  • r is unaffected by changing the units on either axis (because of standardization); a quick numerical check of this appears below.
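
Here is a minimal sketch of that check (the car data below is made up for illustration), showing that rescaling either variable leaves the correlation unchanged:

import numpy as np

# Hypothetical data: car weights in pounds and top speeds in mph.
weights_lbs = np.array([2800, 3200, 3500, 4100, 4600])
speeds_mph = np.array([130, 125, 118, 110, 102])

def corr(x, y):
    # Correlation: average of the product of x and y in standard units.
    x_su = (x - np.mean(x)) / np.std(x)
    y_su = (y - np.mean(y)) / np.std(y)
    return np.mean(x_su * y_su)

# Changing units (pounds -> kilograms, mph -> km/h) leaves r the same.
print(corr(weights_lbs, speeds_mph))
print(corr(weights_lbs * 0.4536, speeds_mph * 1.609))

Both calls print the same value (up to floating-point error), because standardizing removes the units before r is computed.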

13.1.2 (b)

Write a function called convert which takes in an array of elements called xs and returns an array of the values represented in standard units.

def convert(xs):
    sd = ______________________________
    mean = _____________________________
    return ____________________________
Answer
def convert(xs):
    sd = np.std(xs)
    mean = np.mean(xs)
    return (xs - mean) / sd
convert(make_array(1, 2, 3, 4, 5, 6, 7))
array([-1.5, -1. , -0.5,  0. ,  0.5,  1. ,  1.5])
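
One sanity check worth running: after conversion, the array should have mean 0 and standard deviation 1, since each value is now expressed as a number of SDs from the mean.

su = convert(make_array(1, 2, 3, 4, 5, 6, 7))
print(np.mean(su))  # 0.0 (up to floating-point error)
print(np.std(su))   # 1.0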

13.1.3 (c)

Write a function called correlation which takes in a table of data tbl along with two of its column names, x and y, and returns the correlation coefficient between those two columns.

def correlation(tbl, x, y):
  x_su = ____________________________________
  y_su = ____________________________________
  return ____________________________________
Answer
def correlation(tbl, x, y):
  x_su = convert(tbl.column(x))
  y_su = convert(tbl.column(y))
  return np.mean(x_su * y_su)
correlation(Table().with_columns("x", make_array(1, 2, 3, 4, 5), "y", make_array(1, 3, 5, 7, 9)), "x", "y")
0.99999999999999978
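
If you want to double-check the implementation, NumPy's built-in np.corrcoef computes the same quantity; it returns a 2x2 correlation matrix whose off-diagonal entry is r.

x = make_array(1, 2, 3, 4, 5)
y = make_array(1, 3, 5, 7, 9)
np.corrcoef(x, y)[0, 1]  # also ~1.0, matching the result above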

13.2 Comparing Correlation

Correlation Coefficient Visualizer!

13.2.1 (a)

Look at the following four datasets. Rank them in order from weakest correlation to strongest correlation.

Note: Interpreting Correlation
  • The magnitude (absolute value) of r represents the strength of a linear relationship.
  • Two distributions may look similar at first, but subtle trends matter:
    • Example: Distribution A has a negative trend; D has no obvious trend.
  • Textbook chapter 15.2.5 is helpful for understanding regression equations, slopes, and intercepts in both standard and original units.
  • Bonus questions can be explored at home to connect regression lines in standard units to original units.

(Weakest)  __________   __________   __________   __________  (Strongest)

Answer

D, A, B, C.

  • D has almost no visible negative or positive trend as it is basically a blob, so its correlation is near 0.
  • A has a negative correlation, but the points are not very tightly clustered around a straight line, so the strength of its correlation is greater than D's but still fairly weak.
  • B has a positive correlation, and the points are more tightly clustered around a positive sloping line, so the strength of its correlation is greater than A.
  • C has a negative correlation, and the points almost perfectly form a straight line. This indicates that the strength of the correlation is very close to 1.

Note

We have introduced correlation as a way of quantifying the strength and direction of a linear relationship between two variables. The correlation coefficient also allows us to define the best straight line describing the relationship between the two variables, known as the regression line. In fact, a remarkable mathematical result guarantees that the line defined by the slope and intercept below is always the best possible straight line we could construct, in the sense that it minimizes the mean squared error of its predictions.

\[ \text{slope} = r \cdot \frac{\text{SD}_{y}}{\text{SD}_{x}} \]

\[ \text{intercept} = \text{average of } y - \text{slope} \cdot \text{average of } x \]

Regression Line:

\[ \hat{y} = \text{slope} \cdot x + \text{intercept} \]

For every 1 SD increase in \(x\), the predicted value of \(y\) increases by \(r\) SDs of \(y\).
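
These formulas translate directly into code. Here is a minimal sketch, reusing the correlation function from earlier, that computes the slope and intercept of the regression line for predicting y from x (the example table is made up):

def slope(tbl, x, y):
    # Slope = r * SD of y / SD of x.
    r = correlation(tbl, x, y)
    return r * np.std(tbl.column(y)) / np.std(tbl.column(x))

def intercept(tbl, x, y):
    # Intercept = average of y - slope * average of x.
    return np.mean(tbl.column(y)) - slope(tbl, x, y) * np.mean(tbl.column(x))

example = Table().with_columns("x", make_array(1, 2, 3, 4, 5),
                               "y", make_array(2, 3, 5, 6, 9))
slope(example, "x", "y"), intercept(example, "x", "y")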


13.2.2 (b)

Derive the slope and intercept above using the equation for the regression line in standard units (\(y_{\text{SU}}\) represents \(y\) in standard units).

Answer \[ \begin{aligned} y_{\text{SU}} &= r \cdot x_{\text{SU}} \\ \frac{y - \text{average of } y}{\text{SD of } y} &= r \cdot \frac{x - \text{average of } x}{\text{SD of } x} \\ y - \text{average of } y &= \left(r \cdot \frac{\text{SD of } y}{\text{SD of } x}\right) \cdot (x - \text{average of } x) \\ y &= \text{slope} \cdot (x - \text{average of } x) + \text{average of } y \\ y &= \text{slope} \cdot x + (\text{average of } y - \text{slope} \cdot \text{average of } x) \\ y &= \text{slope} \cdot x + \text{intercept} \end{aligned} \]

13.3 Linear Regressi(OH)n

Will Furtado’s Visualizer 🐐!

You just submitted a ticket at Office Hours and would like to know how long it will take to receive help. However, you don’t believe the estimated wait time displayed on the queue to be very accurate, so you decide to make your own predictions based on the total number of students present at OH when you submitted your ticket. You obtain data for 100 wait times and plot them below, also fitting a regression line to the data.


13.3.1 (a)

Suppose that you submit a ticket at Office Hours when there were a total of 20 students present. Based on the regression line, what would you predict the waiting time to be?

Note: Linear Regression: Making Predictions
  • Regression line formula: y = mx + b
    • Once m and b are calculated, plug in x to predict y.
  • Visual example: find the height of the regression line for x = 20.
Answer We observe 20 students at Office Hours, so we would want to look at the height of the regression line for \(x = 20\). Looking at the scatter plot, we see that the regression line roughly passes through the point (20, 14) so we would predict the waiting time to be around 14 minutes.
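
If the raw data were available as a table, the same prediction could be computed instead of read off the plot. The sketch below reuses the slope and intercept functions from the earlier note; the table name oh and its column names are hypothetical.

def predict_wait(tbl, x, y, new_x):
    # Plug new_x into the fitted line: y-hat = slope * new_x + intercept.
    return slope(tbl, x, y) * new_x + intercept(tbl, x, y)

# With the (hypothetical) table of 100 recorded wait times:
# predict_wait(oh, "Students", "Wait Time", 20)   # roughly 14, judging from the plot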

13.3.2 (b)

You go to Office Hours right before a homework assignment is due, and despite safety concerns, you observe 70 students at Office Hours. Would it be appropriate to use your regression line to predict the waiting time? Explain.

Answer

It would not be appropriate to use the regression line to make a prediction. Most values that were used to construct the regression line were between \(x=0\) and \(x=25\). Since \(x=70\) is far out of that range, we cannot expect the regression line to make a very accurate prediction. Furthermore, since we don’t have data for \(x > 30\), we are not sure that the linear trend continues for larger values of \(x\).

Note: Extrapolation Caution
  • Data only covers certain ranges (e.g., x ≤ 30).
  • We cannot assume trends continue beyond observed data — relationships may not remain linear.
  • Other variables (like assignment difficulty) could affect outcomes — correlation ≠ causation.

13.3.3 (c)

When constructing your regression line, you find the correlation coefficient \(r\) to be roughly 0.73. Does this value of \(r\) suggest that an increase in the number of students at Office Hours causes an increase in the waiting time? Explain.

Answer

Correlation does not imply causation! Just by looking at the data, it is unclear whether we have accounted for confounding factors, or how they might contribute to the overall waiting time. For example, it is entirely possible that varying difficulty of the assignments across tickets affects the overall waiting time.

Note: Correlation vs. Causation
  • A large r does not imply a causal relationship.
  • Visualize your data to check for linearity — e.g., Anscombe’s Quartet.
  • Scatter plots show associations but cannot confirm causation.

13.3.4 (d)

Suppose you never generated the scatter plot at the beginning of this section. Knowing only that the value of \(r\) is roughly 0.73, can you assume that the two variables have a linear association? Circle the correct statement and explain.

Answer
No. A high value of \(r\) alone does not guarantee a linear association. For example, a quadratic or exponential relationship between two variables can still have a high value of \(r\). To determine whether the relationship between two variables is linear, it is a good idea to plot the data for a visual interpretation as well.
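
One quick way to do that visual check with the datascience library is Table.scatter with fit_line=True, which overlays the regression line. The table below is made up to show a case where r is high even though the trend is clearly curved:

# Made-up example: y = x^2 on positive x has a high r but is not linear.
curve = Table().with_columns("x", np.arange(1, 11),
                             "y", np.arange(1, 11) ** 2)
print(correlation(curve, "x", "y"))  # high (about 0.97), yet the relationship is quadratic
curve.scatter("x", "y", fit_line=True)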

13.4 This is Regression! (Optional)

(This question uses the same data as Question 3 on the Summer 2024 Final Exam)

Conan has an unhealthy addiction to Rocket League, a game where players play soccer but with cars instead of people. Players can pick up boost pads that are scattered across the field, which players can use to make their cars go faster! Conan plays 50 games and records how much boost he used, as well as how many times he touched the ball in a given game.


13.4.1 (a)

Select the correct option:

Answer

13.4.2 (b)

Select the correct option:

Answer
Note: Regression in Scatter Plots
  • From a scatter plot, you can infer association, but not causation.
  • There may still be causation, but without controlled experiments or accounting for confounders, no conclusions can be drawn.
  • Review worksheet section 2 for regression line formulas.

13.4.3 (c)

Conan runs some calculations and obtains the following statistics:

  • The correlation coefficient between Touches and Boost Usage was approximately 0.705.
  • The average number of Touches was 28.54 with a standard deviation of 9.51.
  • The average of Boost Usage was 1773.4 with a standard deviation of 471.7.

For the following questions, feel free to leave your answers as mathematical expressions.


13.4.3.1 (i)

Conan touched the ball 40 times in one of his games. What is this in standard units?

Answer \(\frac{40-28.54}{9.51} \approx 1.2\)

13.4.3.2 (ii)

Conan wishes to fit a regression line to the data. What would be the slope and intercept of the regression line in original units?

Answer

\(\text{slope} = r \cdot \frac{\text{SD of touches}}{\text{SD of boost usage}} = 0.705 \cdot \frac{9.51}{471.7} \approx 0.0142\)

\(\text{intercept} = \text{average of touches} - \text{slope} \cdot \text{average of boost usage} = 28.54 - 0.0142 \cdot 1773.4 \approx 3.36\)
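
As a quick arithmetic check, the same numbers can be plugged in with code (a sketch; Touches is the variable being predicted, matching the answer above):

r = 0.705
avg_touches, sd_touches = 28.54, 9.51
avg_boost, sd_boost = 1773.4, 471.7

reg_slope = r * sd_touches / sd_boost            # about 0.0142
reg_intercept = avg_touches - reg_slope * avg_boost
reg_intercept                                    # about 3.33 (3.36 above comes from rounding the slope first)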

13.4.3.3 (iii)

What would the slope and intercept be if the data were in standard units?

Answer

\(\text{slope} = r \cdot \frac{\text{SD of touches (SU)}}{\text{SD of boost usage (SU)}} = 0.705 \cdot \frac{1}{1} = 0.705\)

\(\text{intercept} = \text{average of touches (SU)} - \text{slope} \cdot \text{average of boost usage (SU)} = 0 - 0.705 \cdot 0 = 0\)

When in standard units, the slope of the regression line is just \(r\), and the intercept is always zero.
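
A minimal numerical check of this claim, reusing convert and correlation from earlier on a small made-up table:

tbl = Table().with_columns("x", make_array(1, 2, 3, 4, 5),
                           "y", make_array(2, 3, 5, 6, 9))
su = Table().with_columns("x", convert(tbl.column("x")),
                          "y", convert(tbl.column("y")))
r = correlation(tbl, "x", "y")
# In standard units both SDs are 1 and both averages are 0,
# so the slope is r * 1/1 = r and the intercept is 0 - r * 0 = 0.
slope_su = r * np.std(su.column("y")) / np.std(su.column("x"))
intercept_su = np.mean(su.column("y")) - slope_su * np.mean(su.column("x"))
slope_su, intercept_su  # (r, about 0.0)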