9  Discussion 07: Assessing Models (from Fall 2025)

9.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 12–2 PM @ Etcheverry 3105
Office Hours Tuesdays/Thursdays, 2–3 PM @ Warren Hall 101

Feel free to contact me by email. I typically respond within a day or so!


9.0.2 Announcements

Caution: Announcements
  • Regular Midterm - October 17th, Friday from 7-9pm
  • Alternate Exam - October 17th, Friday from 5-7pm

When we observe something different from what we expect in real life (e.g., four 3’s in six rolls of a fair die), a natural question to ask is “Was this unexpected behavior due to random chance, or something else?”

Hypothesis testing allows us to answer the above question in a scientific and consistent manner, using the power of computation and statistics to conduct simulations and draw conclusions from our data.
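To make the die example concrete, here is a minimal sketch in plain Python (not the datascience library) that estimates how often a fair die shows at least four 3’s in six rolls and compares the estimate with the exact binomial calculation. The function name, seed, and number of trials are illustrative choices, not part of the worksheet.

```python
import random
from math import comb

def count_threes(num_rolls, rng):
    """Roll a fair six-sided die num_rolls times and count the 3's."""
    return sum(rng.randint(1, 6) == 3 for _ in range(num_rolls))

rng = random.Random(0)  # fixed seed so the run is repeatable
trials = 100_000        # illustrative number of repetitions

# Fraction of simulated experiments with at least four 3's in six rolls.
hits = sum(count_threes(6, rng) >= 4 for _ in range(trials))
simulated = hits / trials

# Exact chance of at least four 3's in six rolls, from the binomial formula.
exact = sum(comb(6, k) * (1 / 6) ** k * (5 / 6) ** (6 - k) for k in range(4, 7))

print(simulated, exact)
```

Both numbers should come out below 1%, which is why four 3’s in six rolls feels surprising for a fair die.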

9.1 Test Statistics

Dylan is playing with a coin and he wants to test whether his coin is fair. His experiment is to toss the coin 100 times. He chooses the following null hypothesis.

Null Hypothesis: The coin is fair and any observed deviation is due to chance.

For each of the alternative hypotheses listed below, determine whether or not the test statistic is valid and give an explanation.

Note: Choosing a Test Statistic
  • If we only care about whether the coin is fair or unfair, we use a test statistic with absolute value.
  • If the alternative hypothesis is directional, we do not use absolute value.
  • For more details, check out the Hypothesis Testing Guide.

9.1.1 (a)

Alternative Hypothesis: The coin is biased towards heads.
Test Statistic: # of heads.

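Answer
Valid. A coin biased towards heads tends to produce more heads, so large values of this statistic favor the alternative hypothesis.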

9.1.2 (b)

Alternative Hypothesis: The coin is not fair.
Test Statistic: # of heads.

Answer
Not valid. We want more extreme values of our test statistic to favor the alternative hypothesis. Here the coin could be biased towards heads or towards tails, but a large number of heads only captures the first case; a bias towards tails shows up as a small number of heads.

9.1.3 (c)

Alternative Hypothesis: The coin is not fair.
Test Statistic: \(|\)# of heads - expected # of heads \(|\).

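Answer
Valid. Large values of this statistic occur when there are many more or many fewer heads than expected, and both cases favor the alternative hypothesis.
Note: Why Use Absolute Distance?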
  • We use absolute distance because bias can appear in either direction (too many heads or too many tails).
  • Larger values in either direction give evidence against fairness.
  • Think of it like “folding” a histogram of the distance test statistic at the expected mean (50 heads in 100 tosses).
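The idea of simulating this statistic under the null can be sketched as follows. This is a plain-Python illustration rather than the course's datascience code, and the observed count of 62 heads is a made-up value, since the worksheet does not report Dylan's result.

```python
import random

def abs_distance_from_expected(num_heads, num_tosses=100):
    """Test statistic from part (c): |# of heads - expected # of heads|."""
    return abs(num_heads - num_tosses / 2)

def simulate_one_statistic(rng, num_tosses=100):
    """Toss a fair coin num_tosses times and return the test statistic."""
    num_heads = sum(rng.random() < 0.5 for _ in range(num_tosses))
    return abs_distance_from_expected(num_heads, num_tosses)

rng = random.Random(8)  # fixed seed so the run is repeatable
simulated = [simulate_one_statistic(rng) for _ in range(10_000)]

observed_heads = 62  # hypothetical observation, not from the worksheet
observed_stat = abs_distance_from_expected(observed_heads)

# Empirical p-value: share of simulated statistics at least as large
# as the observed one.
p_value = sum(s >= observed_stat for s in simulated) / len(simulated)
print(observed_stat, p_value)
```

Note that 38 heads would give exactly the same statistic as 62 heads, which is the “folding” described above.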

9.1.4 (d)

Alternative Hypothesis: The coin is biased towards heads.
Test Statistic: \(|\)# of heads - expected # of heads \(|\).

Answer

Not valid. This is the opposite case of part (b): because of the absolute value, this test statistic also takes large values when the coin is biased towards tails, which is not evidence for the alternative hypothesis here.

Note: Unfair ≠ Just More Heads
  • An alternative of “the coin is not fair” is different from “the coin is biased towards heads.”
  • Under “not fair,” either outcome (too many heads or too many tails) counts as evidence.
  • For example, in 10 flips, 1 head is just as suspicious as 9 heads.

9.1.5 (e)

Alternative Hypothesis: The coin is not fair.
Test Statistic: \(\frac{1}{2}\) - proportion of heads.

Answer
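Not valid. Without the absolute value, a bias towards heads produces large negative values and a bias towards tails produces large positive values, so large values of the statistic alone do not capture every way the coin can be unfair.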
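One way to compare the statistics from parts (a) through (e) is to evaluate them on two equally lopsided outcomes. The sketch below uses hypothetical results of 90 heads and 10 heads in 100 tosses; the function names are illustrative.

```python
def count_heads(num_heads, num_tosses):
    """Statistic from parts (a) and (b): # of heads."""
    return num_heads

def abs_distance(num_heads, num_tosses):
    """Statistic from parts (c) and (d): |# of heads - expected # of heads|."""
    return abs(num_heads - num_tosses / 2)

def signed_proportion_gap(num_heads, num_tosses):
    """Statistic from part (e): 1/2 - proportion of heads."""
    return 1 / 2 - num_heads / num_tosses

# Two hypothetical outcomes that are equally far from fair.
for heads in (90, 10):
    print(
        heads,
        count_heads(heads, 100),
        abs_distance(heads, 100),
        round(signed_proportion_gap(heads, 100), 2),
    )
```

Only the absolute distance gives the same large value for both outcomes. The other two statistics move in opposite directions, so they suit a directional alternative but not “the coin is not fair.”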

9.2 Carnival Games

You are playing a wheel-spinning game at a carnival, where you can earn prizes based on where the wheel stops. The booth attendant claims the distribution of prizes is as below, but you think the game is rigged and doesn’t follow the listed probabilities.

Prize Chance
Nothing 80%
Teddy bear 2%
Pinwheel 6%
Sticker 12%

You would like to test your claim so you can report the carnival for fraud.

Note: Setting Up Hypotheses
  • Start by asking yourself:
    • What are we trying to prove?
    • How can we simulate this?
  • The null hypothesis is usually the “baseline” with a fully defined model that we can actually simulate under.
from datascience import *

9.2.1 (a)

Is the data we are working with numerical or categorical? Think about how this influences what test statistic we should use.

Answer
The data consist of four named outcomes (the prizes), not numeric measurements. The data is therefore categorical, and we should consider using TVD as our test statistic (see part (e)).

9.2.2 (b)

What is the booth attendant’s hypothesis?

Answer The distribution of prizes follows the distribution listed by the carnival. Any observed difference is simply due to chance.

9.2.3 (c)

What is your hypothesis?

Answer The distribution of prizes does not follow the distribution listed by the carnival. Any observed difference is not just due to chance.

9.2.4 (d)

Which hypothesis (of the two we defined) can you simulate under?

Answer You could simulate under the booth attendant’s hypothesis. It is a fully defined model: it gives the chance of every prize, so we can generate spins from it. Your hypothesis only says that the distribution is not the same as the carnival’s; it does not say what the chances actually are, so there is no fully defined model to simulate under.

9.2.5 (e)

What is a good statistic to use?

Answer TVD from expected distribution. When we are observing categorical distributions of data and want to compare them, we should use TVD. It suits this problem because it combines the differences across all four prize categories into a single number.
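As a reminder of how TVD is computed, here is a minimal sketch: take the absolute difference in each category, add them up, and divide by 2. The sample proportions below are made up for illustration and do not come from the worksheet.

```python
def total_variation_distance(dist_a, dist_b):
    """Half the sum of absolute differences between two distributions
    listed over the same categories in the same order."""
    return sum(abs(a - b) for a, b in zip(dist_a, dist_b)) / 2

# Order: Nothing, Teddy bear, Pinwheel, Sticker
claimed = [0.80, 0.02, 0.06, 0.12]
hypothetical_sample = [0.88, 0.01, 0.04, 0.07]  # made-up proportions

print(total_variation_distance(claimed, hypothetical_sample))
```

Here the absolute differences are 0.08, 0.01, 0.02, and 0.05, so the TVD is 0.16 / 2 = 0.08 (up to floating-point rounding).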

9.2.6 (f)

Write code that simulates playing the carnival game 1000 times, and returns an array of proportions corresponding to how often each prize was won.

prize_chances = _______________________________________________________
my_simulation = _______________________________________________________
Answer
prize_chances = make_array(0.8, 0.02, 0.06, 0.12)
my_simulation = sample_proportions(1000, prize_chances)
Note: Understanding sample_proportions
  • sample_proportions can be tricky—here’s a toy example:
    • Bag: 1 red marble + 2 blue marbles → make_array(1/3, 2/3).
    • Run: sample_proportions(5, make_array(1/3, 2/3)).
    • Imagine drawing 5 times with replacement and writing down each color.
    • At the end, record the proportion of red vs. blue.
  • One possible output: array([0.4, 0.6]), meaning 2/5 of the draws were red and 3/5 were blue.
  • For more details, see the Sampling Methods Guide.
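The marble example can also be written out by hand. The function below is a rough stand-in that mimics the behavior described in this note using only the standard library; it is not the datascience library's implementation, and its name is hypothetical.

```python
import random
from collections import Counter

def sample_proportions_sketch(sample_size, probabilities, rng=random):
    """Draw sample_size times with replacement from categories 0..k-1
    with the given probabilities, and return the proportion of draws
    that landed in each category."""
    categories = list(range(len(probabilities)))
    draws = rng.choices(categories, weights=probabilities, k=sample_size)
    counts = Counter(draws)
    return [counts[c] / sample_size for c in categories]

rng = random.Random(1)  # fixed seed so the run is repeatable

# Bag with 1 red and 2 blue marbles, drawn from 5 times with replacement.
props = sample_proportions_sketch(5, [1 / 3, 2 / 3], rng)
print(props)  # proportions of red and blue; they always sum to 1
```

Because there are only 5 draws, each proportion must be a multiple of 1/5, which is why an output like 2/5 red and 3/5 blue is possible but 1/3 and 2/3 is not.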

9.2.7 (g)

Write one line of additional code that extracts the number of teddy bears we would have won in our simulation. You may use my_simulation from the previous question.

Answer
my_simulation.item(1) * 1000
22.0
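Putting parts (d) through (f) together, a full test could be sketched as below: simulate many sets of 1000 spins under the booth attendant's model, compute the TVD each time, and see where an observed TVD falls. This version uses plain Python instead of sample_proportions, and the observed proportions are made up, since the worksheet does not give real spin data.

```python
import random
from collections import Counter

# Order: Nothing, Teddy bear, Pinwheel, Sticker
claimed = [0.80, 0.02, 0.06, 0.12]

def simulate_proportions(num_spins, probabilities, rng):
    """Proportion of spins landing on each prize under the given model."""
    categories = list(range(len(probabilities)))
    draws = rng.choices(categories, weights=probabilities, k=num_spins)
    counts = Counter(draws)
    return [counts[c] / num_spins for c in categories]

def tvd(dist_a, dist_b):
    """Total variation distance between two categorical distributions."""
    return sum(abs(a - b) for a, b in zip(dist_a, dist_b)) / 2

rng = random.Random(42)  # fixed seed so the run is repeatable
simulated_tvds = [
    tvd(simulate_proportions(1000, claimed, rng), claimed)
    for _ in range(2000)
]

observed = [0.88, 0.01, 0.04, 0.07]  # hypothetical data, not from the worksheet
observed_tvd = tvd(observed, claimed)

# Empirical p-value: share of simulated TVDs at least as large as observed.
p_value = sum(t >= observed_tvd for t in simulated_tvds) / len(simulated_tvds)
print(observed_tvd, p_value)
```

With these made-up numbers, the observed TVD sits far in the right tail of the simulated values, which would count as evidence against the booth attendant's claim.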

Suppose the wheel-spinning game received a lot of complaints at the carnival, and the owners of the game are pressured to release their true distribution of prizes as below:

Prize Chance
Nothing 90%
Teddy bear 1%
Pinwheel 3%
Sticker 6%

Use the distribution above to answer the following probability questions.


9.2.8 (a)

What is the probability of winning a prize from one spin of the wheel?

Answer

Using the Complement Rule:

\[P(winning\:a\:prize) = 1 - P(not\:winning\:a\:prize) = 1 - P(Nothing) = 1 - 0.9 = 0.1\:or\:10\%\]

9.2.9 (b)

What is the probability of winning a Teddy bear and a Sticker in two spins?

Answer

\[P(Teddy\:bear\:and\:Sticker) = 2 \times P(Teddy\:bear) \times P(Sticker) = 2 \times 0.01 \times 0.06 = 0.0012 = 0.12\%\]
We multiply by 2 because we could have won the Teddy bear and then the Sticker OR the Sticker first and then the Teddy bear.

Note: Trick for Counting Outcomes
  • Sometimes you need to multiply by 2 (or more) because different orders produce the same overall outcome.
  • Example: winning a Teddy then a Sticker, or a Sticker then a Teddy—both count!
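The factor of 2 can be checked by listing every ordered pair of outcomes for two spins and adding up the chances of the pairs that contain one Teddy bear and one Sticker. This sketch assumes the two spins are independent, as the multiplication in the answer does.

```python
from itertools import product

true_chances = {
    "Nothing": 0.90,
    "Teddy bear": 0.01,
    "Pinwheel": 0.03,
    "Sticker": 0.06,
}

matching_orders = []
prob = 0.0
# Go through all 16 ordered (first spin, second spin) pairs.
for first, second in product(true_chances, repeat=2):
    if {first, second} == {"Teddy bear", "Sticker"}:
        matching_orders.append((first, second))
        prob += true_chances[first] * true_chances[second]

print(matching_orders)  # the two orders that count
print(prob)             # about 0.0012, i.e. 0.12%
```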

9.2.10 (c)

What is the probability of winning at least one prize in 10 spins?

Answer

Complement Rule again:

\[P(at\:least\:one\:prize) = 1 - P(no\:prizes\:in\:10\:spins) = 1 - P(Nothing) ^{10} = 1 - (0.9)^{10} \approx 0.651\]
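As a sanity check on the complement rule, the exact value can be compared with a quick simulation of 10 spins repeated many times. The seed and number of trials are arbitrary choices.

```python
import random

p_nothing = 0.90

# Exact answer from the complement rule.
exact = 1 - p_nothing ** 10

rng = random.Random(10)  # fixed seed so the run is repeatable
trials = 50_000

# A spin wins a prize when the random number lands in the top 10%.
wins = sum(
    any(rng.random() >= p_nothing for _ in range(10))
    for _ in range(trials)
)
simulated = wins / trials

print(exact, simulated)
```

The two values should agree to about two decimal places, both close to 65%.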

9.3 Flu (Bonus!)

Researchers are studying the effectiveness of a particular flu vaccine. A large random sample was taken from the population of people who took the vaccine in 2016. Among the sampled people, 48% did not get the flu. Another large random sample was taken in 2017, from among the people who took the vaccine that year. Among these sampled people, 40% did not get the flu.

(Spring 2018 Midterm Question 4)


9.3.1 (a)

A researcher thinks the vaccine was less effective in 2017 than in 2016. To test this, a null hypothesis is needed. Which is the correct null hypothesis?

Answer

Option A - Incorrect as it describes a model that is difficult to simulate under. How can we quantify “less effective”?

Option B - Incorrect as the question tells us that the vaccine was not equally effective in the two samples (48% vs 40%).

Option C - Correct. The null hypothesis would state that the vaccine was equally effective in the two populations, and that the differences we observe in the two samples are simply due to chance.

Note: Sample vs. Population
  • When we say “any observed difference is due to chance,” we’re talking about differences in the sample, not the population itself.

9.3.2 (b)

The researcher says, “The observed value of my test statistic is \(40\% - 48\% = -8\%\).” To perform the test, the statistic is simulated under the null hypothesis. One of the figures below is the empirical histogram of the simulated values. Which is it?

Note: Simulating Under the Null
  • If the null says two populations are equally effective, the expected difference = 0.
  • The histogram of simulated differences will then be centered around 0.

Answer

The test statistic we are using is the difference between the two sample percentages. Under the null hypothesis, this could be positive or negative depending on the sample. This rules out (ii).

Under the null hypothesis, the two sample percentages are expected to be equal and hence the difference is expected to be 0. This rules out (i).

Only (iii) has all the right properties.
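To see why the histogram should be centered at 0 and include both signs, here is a rough simulation sketch. The problem only says the samples were “large” and gives no shared rate under the null, so the sample size of 500 and the shared rate of 44% below are assumptions made purely for illustration.

```python
import random
from statistics import mean

def simulate_difference(rng, sample_size, shared_rate):
    """One simulated value of (2017 sample % - 2016 sample %) when both
    years share the same true rate of not getting the flu."""
    pct_2016 = 100 * sum(rng.random() < shared_rate for _ in range(sample_size)) / sample_size
    pct_2017 = 100 * sum(rng.random() < shared_rate for _ in range(sample_size)) / sample_size
    return pct_2017 - pct_2016

rng = random.Random(2018)  # fixed seed so the run is repeatable
sample_size = 500          # assumed; the problem only says "large"
shared_rate = 0.44         # assumed common rate under the null

differences = [
    simulate_difference(rng, sample_size, shared_rate)
    for _ in range(2000)
]

print(mean(differences))                    # close to 0
print(min(differences), max(differences))   # values on both sides of 0
```

The simulated differences cluster around 0 and take both positive and negative values, matching the two properties used in the answer above.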