10  Discussion 08: Midterm Review (from Fall 2025)

10.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 12–2 PM @ Etcheverry 3105
Office Hours Tuesdays/Thursdays, 2–3 PM @ Warren Hall 101

Feel free to contact me by email — I typically respond within a day or so!


10.0.2 Announcements

Caution: Announcements
  • Grades for HW06, Lab06, and the Midterm have been released!
  • The mid-semester report will not include assignment drops — don’t worry, those will apply later in the term.
  • You’ll have a chance to switch lab formats later this week — the form will open on Friday.

10.1 Tables

You are given the following table called pokemon. For the following questions, fill in the blanks.

Note: Working with Tables
  • Table manipulation skills are super valuable for the course!
  • Review methods like where, group, pivot, join, and others.
  • Pro tip: Pay attention to variable names — they often tell you exactly what the problem wants.
  • Always check the expected output type: do you need a number, an array, or a table?
  • Drawing out or imagining intermediate tables can help clarify your steps.
  • Working backwards from the desired result is often a great strategy (and can help earn partial credit).
Code
import random
from datascience import *
import numpy as np
%matplotlib inline

names = ["Bulbasaur", "Charmander", "Squirtle", "Palkia", "Dialga", "Giratina"]
types = ["Grass", "Fire", "Water", "Water/Dragon", "Steel/Dragon", "Ghost/Dragon"]
hp = [45, 39, 44, 90, 100, 150]
speed = [45, 65, 43, 100, 90, 90]
generation = [1, 1, 1, 4, 4, 4]
legendary = [False, False, False, True, True, True]

type_pool = [
    "Grass", "Fire", "Water", "Electric", "Psychic",
    "Ice", "Dragon", "Steel", "Ghost", "Dark",
    "Fairy", "Ground", "Rock", "Fighting", "Bug",
    "Flying", "Poison", "Normal"
]

for i in range(734):
    names.append(f"Pokemon_{i+7}")
    if random.random() < 0.2:
        t1, t2 = random.sample(type_pool, 2)
        types.append(f"{t1}/{t2}")
    else:
        types.append(random.choice(type_pool))
    hp.append(random.randint(30, 170))
    speed.append(random.randint(20, 150))
    generation.append(random.randint(1, 8))
    legendary.append(random.random() < 0.05)

names.append("Wailord")
types.append("Water")
hp.append(170)
speed.append(60)
generation.append(3)
legendary.append(False)

pokemon = Table().with_columns(
    "Name", names,
    "Type", types,
    "HP", hp,
    "Speed", speed,
    "Generation", generation,
    "Legendary", legendary
)

pokemon.show(6)
Name Type HP Speed Generation Legendary
Bulbasaur Grass 45 45 1 False
Charmander Fire 39 65 1 False
Squirtle Water 44 43 1 False
Palkia Water/Dragon 90 100 4 True
Dialga Steel/Dragon 100 90 4 True
Giratina Ghost/Dragon 150 90 4 True

... (735 rows omitted)


10.1.1 (a)

Find the name of the Pokemon of type Water (and no other type) that has the highest HP.

water_pokemon = pokemon.____________(_____________, ____________)
water_pokemon._______(____________, _____________).column("Name").item(0)
Answer
water_pokemon = pokemon.where("Type", are.equal_to("Water"))
water_pokemon.sort("HP", descending = True).column("Name").item(0)
'Wailord'

10.1.2 (b)

Find the proportion of Fire-type Pokemon with a Speed less than 100.

fire_pokemon = pokemon.____________(____________, ____________)
fire_pokemon.____________(____________, ________________________)
.____________ / ____________._____________
Answer
fire_pokemon = pokemon.where("Type", "Fire")
fire_pokemon.where("Speed", are.below(100)).num_rows / fire_pokemon.num_rows
0.65625

10.1.3 (c)

Create a table containing Type and Generation that is sorted in decreasing order by the average HP for each pair of Type and Generation that appears in the table.

avg_hp = pokemon.____________(____________, ____________)
avg_hp.sort("HP mean",____________)
.____________(____________, ____________)
Answer
avg_hp = pokemon.group(make_array("Type", "Generation"), np.mean)
avg_hp.sort("HP mean", descending = True).select("Type", "Generation")
Type Generation
Poison/Fire 4
Ice/Fairy 3
Dark/Ground 8
Fairy/Rock 6
Ghost/Fighting 8
Flying/Grass 4
Fairy 6
Fire/Ground 3
Psychic/Ice 4
Rock/Poison 4

... (269 rows omitted)


10.1.4 (d)

Return an array that contains the ratio of legendary to non-legendary Pokemon for each generation. You may assume that the Legendary column is a column of booleans.

helper = pokemon.____________(____________, ____________)
ratios = helper.________(________) / helper.________(________)
Answer
helper = pokemon.pivot("Legendary", "Generation")
ratios = helper.column("True") / helper.column("False")
ratios
array([ 0.02272727,  0.06      ,  0.05376344,  0.07894737,  0.0952381 ,
        0.02272727,  0.07954545,  0.08641975])

10.1.5 (e) (Bonus!)

Consider another table called trainers, which contains information about Pokemon trainers and the Pokemon they own. The trainers table has two columns: Trainer, the name of the trainer and Pokemon, the name of the Pokemon. Use table operations to create a new table called pokemon_with_trainers that includes each Pokemon’s Name, Type, Generation, and their Trainer.

Code
trainers = Table().with_columns(
    "Trainer", ["Ash", "Misty", "Brock", "Wallace"],
    "Pokemon", ["Bulbasaur", "Squirtle", "Charmander", "Wailord"]
)
trainers_added = pokemon._______(_______, _______, _______)
pokemon_with_trainers = trainers_added._______(_______, _______, _______)
Answer
trainers_added = pokemon.join("Name", trainers, "Pokemon")
pokemon_with_trainers = trainers_added.drop("HP", "Speed", "Legendary")
pokemon_with_trainers
Name Type Generation Trainer
Bulbasaur Grass 1 Ash
Charmander Fire 1 Brock
Squirtle Water 1 Misty
Wailord Water 3 Wallace

10.2 Histograms

The World Happiness Report is a landmark study on the state of global happiness. This study calculated and ranked the happiness levels of 155 countries using data from the Gallup World Poll. The histogram below shows the distribution of happiness scores computed from this study in 2019. Suppose the data is stored in the table called happiness. The following code was used to generate the histogram you see below:

Code
happiness = Table.read_table("happiness.csv")
happiness.hist("Score", bins = np.arange(2.8, 8, 0.5))

Note that the histogram may look a bit different from usual for the purpose of making bar heights easier to interpret.

Note: Thinking About Histograms
  • Remember the area principle: the area of each bar corresponds to its proportion of the data.
  • We don’t know how values are distributed within a single bin — only the count (or percent) of values in that bin.
  • Without the actual counts, a histogram only gives us percentages, not raw numbers.
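The area principle from the bullets above can be sketched numerically. This is a quick hypothetical check, using the approximate bin heights read off the histogram in part (b):

```python
# Area principle: proportion in a bin = height * width.
bin_width = 0.5
heights = [38, 28, 27]  # approximate heights, in percent per unit of happiness score

# Sum the bar areas (in percent), then convert to a proportion.
proportion = sum(h * bin_width for h in heights) / 100
print(proportion)  # 0.465
```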

10.2.1 (a)

What are the units of the y-axis in the histogram?

Answer Percent per happiness score. Note that it is NOT “percent per country”. Although each data point used to generate the histogram represents a country, the values used to make the histogram are the happiness scores.

For parts (b)-(e), use the above histogram to calculate the following quantities. If it’s not possible, write “Cannot calculate” and explain your reasoning.

10.2.2 (b)

The proportion of countries with happiness scores between 4.3 and 5.8.

Answer

Each bin has width 0.5. We’re interested in the [4.3, 4.8), [4.8, 5.3), and [5.3, 5.8) bins, which have heights of approximately 38, 28, and 27 percent per unit respectively. We can use the Area Principle to calculate the sum of the areas of these bins, which in turn represents the proportion we’re looking for: \[ \begin{align*} \text{Total Area} &= \text{height}_1 \cdot \text{width} + \text{height}_2 \cdot \text{width} + \text{height}_3 \cdot \text{width} \\ &= 0.5 \cdot (\text{height}_1 + \text{height}_2 + \text{height}_3) \\ &= 0.5 \cdot (38 + 28 + 27) \\ &= 0.5 \cdot 93 \\ &= 46.5\% \quad \text{or as a proportion, } 0.465 \end{align*} \]

happiness.where("Score", are.between(4.3, 5.8)).num_rows / happiness.num_rows
0.46794871794871795

10.2.3 (c)

The number of countries with happiness scores between 4.3 and 5.8 (round to the nearest country).

Answer

We can simply multiply the proportion of countries with happiness score between 4.3 and 5.8 (calculated in the previous question) by the number of countries represented in the histogram:
Number of countries = \(46.5\% * 155\text{ countries}\approx 72 \text{ countries}\)

happiness.where("Score", are.between(4.3, 5.8)).num_rows
73

10.2.4 (d)

The number of countries with happiness scores between 6 and 7.

Answer

Cannot calculate; no bin edges line up exactly with 6 and 7, and we do not know how values are distributed within a single bin.

# Just in case you are curious about this!
happiness.where("Score", are.between(6, 7)).num_rows
36

10.2.5 (e)

The height of the new bin after combining the three leftmost bins.

Answer

Combining the three leftmost bins combines their areas. The total area of the three leftmost bins is:

\[ 0.5 \cdot (5 + 10 + 14) = 14.5\% \]

The width of the new bin is the sum of the widths of the original bins (each 0.5), so the new width is 1.5.

Now that we have the area and width of the new bin, we can calculate the height:

\[ \begin{align*} \text{Area} &= \text{width} \cdot \text{height} \\ \Rightarrow 14.5 &= 1.5 \cdot \text{height} \\ \Rightarrow \text{height} &= \frac{14.5}{1.5} \\ &\approx 9.67\% \text{ per unit} \end{align*} \]

happiness.hist("Score", bins = make_array(2.8, 4.3, 4.8, 5.3, 5.8, 6.3, 6.8, 7.3, 7.8, 8.3))

10.3 Probability

Note: Building Probability Intuition
  • Draw out what’s happening to better visualize probabilities.
  • Start simple: consider just the first trial, then extend to more trials.
  • Use the multiplication rule when combining independent events across trials.
  • If you’re unsure whether to add or multiply:
    • Use add when the problem says “or” (multiple possible ways).
    • Use multiply when the problem says “and” (several conditions must happen together).
  • Example: There’s only one way to get 3/3 heads, but three ways to get 1/3 heads.
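The last bullet can be checked directly with exact fractions for a fair coin tossed 3 times. A minimal sketch:

```python
from fractions import Fraction

p = Fraction(1, 2)  # chance of heads for a fair coin

# "and": multiply across independent tosses -- only HHH gives 3/3 heads
p_three_heads = p ** 3              # 1/8

# "or": add across the disjoint sequences HTT, THT, TTH for exactly 1/3 heads
p_one_head = 3 * p * (1 - p) ** 2   # 3/8

print(p_three_heads, p_one_head)    # 1/8 3/8
```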

10.3.1 (a)

A fair coin is tossed five times. Two possible sequences of results are HTHTH and HTHHH. Which sequence of results is more likely? Explain your answer and calculate the probability of each sequence appearing.

Answer They are equally likely since the coin is fair. By the multiplication rule, the probability that either of the two sequences appears is \((\frac{1}{2})^5\).

For parts (b) - (d), assume we have a biased coin such that the probability of getting heads is \(\frac{1}{5}\) and the probability of getting tails is \(\frac{4}{5}\). The coin is tossed 3 times.


10.3.2 (b)

What is the probability that you get exactly 2 heads? What about exactly 0 heads?

Answer

Here we need to consider all the possible outcomes that fall into this event and calculate their probabilities.
There are 3 possible outcomes: HHT, HTH, THH. The probability for each of them is \((\frac{1}{5})^2 * (\frac{4}{5})\). Therefore, the probability of getting exactly 2 heads is \(3*(\frac{1}{5})^2*(\frac{4}{5})\).

To get no heads, we must have gotten all tails, and the probability for that is: \((\frac{4}{5})^3\).

10.3.3 (c)

What is the probability of getting exactly 1 head or exactly 2 heads?

Answer We can use the addition rule to add the probability of getting exactly 1 head, and the probability of getting exactly 2 heads. There are three possible outcomes for each case. The probability is: \(3*(\frac{1}{5})*(\frac{4}{5})^2 + 3*(\frac{1}{5})^2*(\frac{4}{5})\).

10.3.4 (d)

What is the probability you get 1 or more heads?

Answer We can use the complement rule here. The complement of getting at least 1 head is getting no heads, which we’ve just calculated in the previous question. Therefore, the probability is: \(1 - (\frac{4}{5})^3\).
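A simulation can sanity-check the answers to parts (b)-(d). This sketch uses NumPy directly rather than the course’s datascience helpers; the seed and sample size are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
tosses = rng.random((100_000, 3)) < 1 / 5   # True marks a head; P(heads) = 1/5
heads = tosses.sum(axis=1)                  # number of heads in each set of 3 tosses

print(np.mean(heads == 2))   # should be near 3 * (1/5)**2 * (4/5) = 0.096
print(np.mean(heads == 0))   # should be near (4/5)**3 = 0.512
print(np.mean(heads >= 1))   # should be near 1 - (4/5)**3 = 0.488
```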

10.4 Multiple Choice

10.4.1 (a)

In the U.S. in 2000, there were 2.4 million deaths from all causes, compared to 1.9 million in 1970, which represents a 25% increase. The data shows that the public’s health got worse over the period 1970-2000.

Answer
False, because the population also grew between 1970 and 2000. It would be more appropriate to compare the total number of deaths to the total population in each year. In fact, the U.S. population in 1970 was 203 million, while in 2000 it was 281 million.
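The answer’s point can be made concrete with a crude death rate (deaths per 1,000 people), computed from the figures above:

```python
# Figures from the answer: deaths and total population in each year.
deaths_1970, pop_1970 = 1.9e6, 203e6
deaths_2000, pop_2000 = 2.4e6, 281e6

# Deaths per 1,000 people -- this adjusts for population growth.
rate_1970 = deaths_1970 / pop_1970 * 1000   # about 9.4 per 1,000
rate_2000 = deaths_2000 / pop_2000 * 1000   # about 8.5 per 1,000
print(rate_1970, rate_2000)
```

Despite more total deaths, the per-capita rate actually fell over the period.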

10.4.2 (b)

A company is interested in knowing whether women are paid less than men in their organization. They share all their salary data with you. An A/B test is the best way to examine the hypothesis that all employees in the company are paid equally.

Answer
False, there is no room for statistical inference here. We have access to the whole population, so the answer can simply be retrieved by directly looking at the data. There is no need for an A/B test here.

10.4.3 (c)

Consider a randomized controlled trial where participants are randomly split into treatment and control groups. We are 100% certain there will be no systematic differences between the treatment and control groups if the process is followed correctly.

Answer

Randomization can still give rise to significantly different treatment and control groups merely by chance, meaning there is still the possibility for systematic differences between the treatment and control groups.

As an example, if you were holding an RCT for a new energy drink, there’s a chance that the randomization, just by chance, led to one group having a large majority of coffee drinkers and the other having a large majority of non-caffeine users. In this case, the systematic difference refers to how the randomization did not effectively balance the two groups on their differences outside of the treatment.

RCTs can help minimize events like this occurring, but we cannot say it is impossible for something like this to occur.

10.4.4 (d)

A researcher considers the following scheme for splitting people into control and treatment groups. People are arranged in a line and for each person, a fair, six-sided die is rolled. If the die comes up to be a 1 or a 2, the person is allocated to the treatment group. If the die comes up to be a 3, 4, 5, or 6 then the person is allocated to the control group. This is a randomized control experiment.

Answer
True, because the participants were randomly assigned to each group through the roll of a die. This makes it a randomized controlled experiment!

10.4.5 (e)

You are conducting a hypothesis test to check whether a coin is fair. After you calculate your observed test statistic, you see that its p-value is below the 5% cutoff. At this point, you can claim with certainty that the null hypothesis can not be true.

Answer
False, remember the definition of a p-value: a p-value expresses the probability, under the null hypothesis, of observing a value of your test statistic that is at least as extreme as your observed test statistic in the direction of the alternative. As long as this probability is nonzero, we cannot claim that the null can never be true. It could be the case that we simply got an unusual sample under the null.

10.4.6 (f)

You roll a fair die a large number of times. While you are doing that, you observe the frequencies with which each face appears and you make the following statement: As I increase the number of times I roll the die, the probability histogram of the observed frequencies converges to the empirical histogram.

Answer
False, the statement should be: As I increase the number of times I roll the die, the empirical histogram of the observed frequencies converges to the probability histogram of a fair die.
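The corrected statement is easy to see in a sketch: as the number of rolls grows, the empirical frequencies settle toward 1/6 for every face. (The seed and sample sizes below are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(0)
for n in [60, 600, 60_000]:
    rolls = rng.integers(1, 7, size=n)              # faces 1-6, fair die
    freqs = np.bincount(rolls, minlength=7)[1:] / n  # empirical frequency of each face
    print(n, np.round(freqs, 3))                     # each entry approaches 1/6 ~ 0.167
```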

10.5 Simulation and Hypothesis Testing

A tortoise and a hare want to have a race on a number line! They both start at 0 and the race lasts for 100 time steps. However, they move differently. At each time step the tortoise moves 1 step forward with a \(\frac{1}{2}\) chance (and stays in place with a \(\frac{1}{2}\) chance), and the hare moves 3 steps forward with a \(\frac{1}{6}\) chance (and stays in place with a \(\frac{5}{6}\) chance).

They race, and the tortoise loses badly; the hare finishes 50 steps ahead of the tortoise. Suspicious, the tortoise decides to conduct a hypothesis test to determine whether or not the hare is actually faster.

Note: Simulation and Hypothesis Testing
  • Example: testing whether the hare is actually faster than the tortoise.
  • Use a one-sided alternative hypothesis — don’t take absolute values in the test statistic.
  • To simulate using hypothetical probabilities, use sample_proportions.
  • Be careful: the one_race function doesn’t directly compute the test statistic.
    • It gives an intermediate result, from which you can build an overlaid histogram or compute the difference to get the test statistic you need.

10.5.1 (a)

Fill in the blanks below for the null and alternative hypotheses of this test, as well as a valid test statistic.

Null Hypothesis:

Alternative Hypothesis:

Test Statistic:

Answer

Null Hypothesis: The hare is not faster than the tortoise. The hare finishing 50 steps ahead of the tortoise was simply due to random chance.

Alternative Hypothesis: The hare is faster than the tortoise. The hare finishing 50 steps ahead of the tortoise was not just due to random chance.

Test Statistic: The difference in final distances (hare minus tortoise) at the end of each race.

10.5.2 (b)

Write a function called one_race() that simulates a single race of 100 time steps. It should return a two element array of the final distances of both the tortoise and the hare (in that order) from the origin after 100 time steps.

def one_race():
    tortoise_array = ________________________________________________
    hare_array = ________________________________________________
    tortoise_sim = ________________________________________________
    hare_sim = ________________________________________________
    tortoise_distance = ________________________________________________
    hare_distance = ________________________________________________
    return make_array(tortoise_distance, hare_distance)
Answer
def one_race():
    tortoise_array = make_array(1/2, 1/2)
    hare_array = make_array(1/6, 5/6)
    tortoise_sim = sample_proportions(100, tortoise_array)
    hare_sim = sample_proportions(100, hare_array)
    tortoise_distance = tortoise_sim.item(0) * 100
    hare_distance = hare_sim.item(0) * 300
    return make_array(tortoise_distance, hare_distance)
one_race()
array([ 45.,  51.])

10.5.3 (c)

We would now like to simulate what would happen if the tortoise and the hare races 10,000 times. Complete the code below and record how far the tortoise and the hare end from the origin in the arrays tortoise_distances and hare_distances respectively.

tortoise_distances = make_array()
hare_distances = make_array()
for ________________________________________________:
    race = ________________________________________________
    one_tortoise_dist = ________________________________________________
    one_hare_dist = ________________________________________________
    tortoise_distances = ________________________________________________
    hare_distances = ________________________________________________
Answer
tortoise_distances = make_array()
hare_distances = make_array()
for i in np.arange(10000):
    race = one_race()
    one_tortoise_dist = race.item(0)
    one_hare_dist = race.item(1)
    tortoise_distances = np.append(tortoise_distances, one_tortoise_dist)
    hare_distances = np.append(hare_distances, one_hare_dist)
tortoise_distances, hare_distances
(array([ 36.,  41.,  41., ...,  55.,  59.,  52.]),
 array([ 54.,  54.,  48., ...,  48.,  69.,  72.]))

The results table (20,000 rows) contains two columns recorded after 10,000 simulations:

  • Competitor (string): the name of the competitor, either “tortoise” or “hare”
  • Distance (int): the final distance of the competitor at the end of the race


10.5.4 (d)

Using the results table, create an overlaid histogram that shows the distribution of final distances for both the tortoise and the hare.

Code
results = Table().with_columns(
    "Competitor", np.append(np.full(10000, 'tortoise'), np.full(10000, 'hare')),
    "Distance", np.append(tortoise_distances, hare_distances)
)

________________________.________________________(________________________, group=____________)

Answer
results.hist("Distance", group = "Competitor")


10.5.5 (e)

Create an array called differences where each value in the array represents how many steps the hare finished ahead of the tortoise in a given race. Then, write a line of code to calculate the observed p-value of the hypothesis test. Finally, assuming a 5% p-value cutoff, describe the different conclusions you would come to based on the possible values of the observed p-value.

differences = ________________________________________________
p_value = ________________________________________________
Answer
differences = hare_distances - tortoise_distances
p_value = np.count_nonzero(differences >= 50) / 10000

If the observed p-value is less than the 5% cutoff, we would consider this evidence against the null hypothesis and reject it in favor of the alternative hypothesis. If the observed p-value is greater than the 5% cutoff, our observation is consistent with our null hypothesis and we would fail to reject the null.

p_value
0.0

10.6 More Hypothesis Testing

Chloe is a big fan of Trader Joe’s Frozen Mac ’n Cheese, but she noticed that the cheese used in it varies from box to box. A Trader Joe’s employee provides her with some data about the 4 different cheeses used and the probability of them being used in each box:

Chloe is suspicious about this distribution. After all, Velveeta is much cheaper to use than Gruyère, and she has also never bought a box that uses Gruyère. Chloe decides to buy many boxes throughout the next month and tracks the type of cheese used in each box. She uses this to conduct a hypothesis test.


10.6.1 (a)

Write a valid null and alternative hypothesis for this experiment.

  • Null Hypothesis:
  • Alternative Hypothesis:
Answer Null Hypothesis: The types of cheese in the Frozen Mac ’n Cheese boxes are distributed according to the probability distribution provided by the employee. Any observed difference is simply due to chance.
Alternative Hypothesis: The types of cheese in the Frozen Mac ’n Cheese boxes are not distributed according to the probability distribution provided by the employee. Any observed difference is not just due to chance.

observed_proportions = make_array(0.2, 0.3, 0.45, 0.05)
employee_proportions = make_array(0.05, 0.55, 0.25, 0.15)

The array observed_proportions contains the proportions of cheese that Chloe observed in 20 boxes of Mac ’n Cheese.


10.6.2 (b)

Chloe wants to use the mean as a test statistic, but Katherine suggests that Chloe use the TVD (total variation distance) instead. Which test statistic should Chloe use in this case? Briefly justify your answer. Then, write a line of code to assign the observed value of the test statistic to observed_stat.

observed_stat = ____________________________________________________________
Answer

Katherine is correct: we should use the total variation distance because Chloe is comparing two categorical distributions (the observed distribution and the one provided by the Trader Joe’s employee).

observed_stat = sum(np.abs(observed_proportions - employee_proportions)) / 2
observed_stat
0.35000000000000003

10.6.3 (c)

Define the function one_simulated_test_stat to simulate a random sample according to the null hypothesis and return the test statistic for that sample.

def one_simulated_test_stat():
    sample_prop = ________________________________________________
    return ________________________________________________
Answer
def one_simulated_test_stat():
    sample_prop = sample_proportions(20, employee_proportions)
    return sum(abs(employee_proportions - sample_prop)) / 2
one_simulated_test_stat()
0.099999999999999964

10.6.4 (d)

Chloe simulates the test statistic 10,000 times and stores the results in an array called simulated_stats. The observed value of the test statistic is stored in observed_stat. Complete the code below so that it evaluates to the p-value of the test:

________________(simulated_stats ______ observed_stat) / __________________

Code
simulated_stats = make_array()
for i in np.arange(10000):
    simulated_stats = np.append(simulated_stats, one_simulated_test_stat())
Answer
np.count_nonzero(simulated_stats >= observed_stat) / 10000
0.0030000000000000001

10.6.5 (e)

Given that the computed p-value is 0.0825, which of the following are true? Select all that may apply.

Answer Only choice b is correct. We can only reject the null hypothesis when the observed p-value is less than the cutoff. Furthermore, there is no chance associated with whether or not the null or alternative hypothesis is true.

10.7 A/B Testing

Note: Understanding A/B Testing
  • Use an A/B test to check if two samples come from the same distribution.
  • In a permutation test, sample without replacement so that the proportions of categories stay the same when shuffling.
  • A permutation is just a reordering of the original data.
    • You can shuffle either the labels or the data column.
  • Remember: the p-value cutoff (like 0.05) represents the probability of falsely rejecting the null hypothesis — it is not the same as your observed p-value.

10.7.1 (a)

Kevin, a museum curator, has recently been given specimens of caddisflies collected from various parts of Northern California. The scientists who collected the caddisflies think that caddisflies collected at higher altitudes tend to be bigger. They tell him that the average length of the 560 caddisflies collected at high elevation is 14mm, while the average length of the 450 caddisflies collected from a slightly lower elevation is 12mm. He is not sure that this difference really matters and thinks that this could just be the result of chance in sampling.

What is an appropriate null hypothesis that Kevin can simulate under?

Answer Null Hypothesis: The distribution of specimen lengths is the same for caddisflies sampled from high elevation as those sampled from low elevation. Any observed difference between the two samples is simply due to random chance.

10.7.2 (b)

How could you test the null hypothesis in the A/B test from above? What assumption would you make to test the hypothesis, and how would you simulate under that assumption?

Answer If the null hypothesis is true – the caddisflies did not come from different distributions – then it should not matter how the samples were labeled (high elevation or low elevation). Under this assumption, you could shuffle the labels of the caddisflies and calculate your test statistic from this “relabeled” data.

10.7.3 (c)

What would be a useful test statistic for the A/B test? Remember that the direction of your test statistic should come from the initial setting.

Answer Difference in mean lengths between the two groups. Note that this is not an absolute difference – we could choose either order for subtraction, but that would affect the direction of our alternative hypothesis so we need to be careful!

Assume flies refers to the following table:

Code
high_lengths = np.random.normal(loc=14, scale=2, size=557)
low_lengths  = np.random.normal(loc=12, scale=2, size=450)

flies = Table().with_columns(
    "Elevation", ["High elevation", "Low elevation", "High elevation"] + ["High elevation"]*557 + ["Low elevation"]*450,
    "Specimen length", np.append(make_array(12.3,13.1, 12.0), np.append(high_lengths, low_lengths))
)

flies.show(3)
Elevation Specimen length
High elevation 12.3
Low elevation 13.1
High elevation 12

... (1007 rows omitted)


10.7.4 (d)

Fill in the blanks in this code to generate one value of the test statistic simulated under the null hypothesis.

def one_simulation():
    shuffled_labels = flies.______________________.column(___________)
    shuffled_flies = flies.with_columns(____________, __________________)
    grouped = shuffled_flies.___________(____________, ___________)
    means = grouped.column('Specimen length mean')
    statistic = ________________
    return statistic
Answer
def one_simulation():
    shuffled_labels = flies.sample(with_replacement = False).column('Elevation')
    shuffled_flies = flies.with_columns('Elevation', shuffled_labels)
    grouped = shuffled_flies.group('Elevation', np.mean)
    means = grouped.column('Specimen length mean')
    statistic = means.item(0) - means.item(1)
    return statistic
one_simulation()
-0.16821416320684435

10.7.5 (e)

Fill in the code below to simulate 10000 trials of our permutation test.

test_stats = ____________________
repetitions = _______________________
for i in np.arange(__________________):
    one_stat = ______________________
    test_stats = np.append(test_stats, one_stat)
Answer
test_stats = make_array()
repetitions = 10000
for i in np.arange(repetitions):
    one_stat = one_simulation()
    test_stats = np.append(test_stats, one_stat)
test_stats
array([-0.00315336, -0.08789298,  0.03439367, ..., -0.12647804,
        0.09779716,  0.11945628])

10.7.6 (f)

The histogram of test_stats is plotted below with a vertical red line indicating the observed value of our test statistic. If the p-value cutoff we use is 5%, what is the conclusion of our test?

Answer We can inspect the histogram to see that the area to the right of the observed value (which is our p-value) is greater than 5%. Since our p-value is greater than our p-value cutoff, we fail to reject the null hypothesis and conclude that the data are consistent with the null hypothesis.

10.7.7 (g)

Suppose that the null hypothesis is true. If we ran this same hypothesis test 1000 times, each time from our flies table and with a p-value cutoff of 5%, how many times would we expect to incorrectly reject the null hypothesis?

Answer We would expect to reject the null hypothesis \(1000 * 0.05 = 50\) times. A p-value cutoff of 5% represents the probability of incorrectly rejecting the null hypothesis.
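One way to see this arithmetic: under the null hypothesis, a p-value is (roughly) uniformly distributed, so the rejection rate matches the cutoff. A hypothetical sketch (the uniform draws stand in for p-values from repeated tests):

```python
import numpy as np

rng = np.random.default_rng(0)
p_values = rng.random(1000)                 # stand-in p-values under the null
false_rejections = np.sum(p_values < 0.05)  # tests that incorrectly reject
print(false_rejections)                     # expected to be around 1000 * 0.05 = 50
```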

10.7.8 (h)

What effect does decreasing our p-value cutoff have on the number of times we expect to incorrectly reject the null hypothesis?

Answer If we decrease our p-value cutoff, we are reducing the expected number of times we will incorrectly reject the null.

10.7.9 (i)

Answer the following True/False questions.


10.7.9.1 (i)

A/B testing is used to determine whether or not we believe two samples come from the same underlying distribution.

Answer
True, this is the definition of A/B testing.

10.7.9.2 (ii)

To conduct a permutation test, you should sample your data with replacement with a sample size equal to the number of rows in the original table.

Answer
False, you should sample your data without replacement; otherwise, you would not get a permutation of your data.

10.7.9.3 (iii)

A/B testing is the same as using total variation distance as a test statistic for a hypothesis test.

Answer
False, total variation distance is just a test statistic that computes the distance between two distributions. It does not involve taking a random permutation of your data.

10.8 Functions (Bonus!)

Cyrus loves completing the NYT Monday crossword puzzle, and is interested in seeing how fast he completes it in comparison with his friends. Over the past two months, Cyrus and his friend Monica have been recording their crossword completion times (in seconds) in the arrays cyrus_times and monica_times respectively. Cyrus decides to put his skills to the test by randomly selecting one of his times and comparing it to a randomly chosen time of Monica’s.

Code
cyrus_times = [210, 230, 250, 240, 225, 260, 245, 255, 235, 220]
monica_times = [260, 270, 300, 280, 265, 275, 290, 310, 295, 285]

crossword_times = Table().with_columns(
    "Cyrus", cyrus_times,
    "Monica", monica_times
)
Note: Fun with Functions
  • This type of problem might look like hypothesis testing, but the focus is on functions and string manipulation.
  • Example: one_comparison returns True/False, so it can be used directly in a conditional.
  • If a function only prints and doesn’t return, you can’t save its result in a variable.
  • Don’t forget to convert numbers to strings (like wins and trials) before concatenating them.

10.8.1 (a)

Write a function called one_comparison that randomly chooses one time from cyrus_times and one time from monica_times, and returns True if Cyrus’s time was better than Monica’s.

def one_comparison():
    return ______________________________
Answer
def one_comparison():
    return np.random.choice(cyrus_times) < np.random.choice(monica_times)
one_comparison()
True

10.8.2 (b)

Now, write a function called crossword_comparison that takes in trials, which is the number of times we randomly compare one of Cyrus’s completion times with one of Monica’s. The function should print a statement explaining the total number of times Cyrus won. For example, if Cyrus won 6 times out of 10 trials, the statement should read “Cyrus beat Monica 6 times out of 10 trials”.

def crossword_comparison(trials):
    wins = ________
    for i in ________:
        if ________:
            wins = ________
    print("Cyrus beat Monica " + ________ + " times out of " + ________ + " trials")
Answer
def crossword_comparison(trials):
    wins = 0
    for i in np.arange(trials):
        if one_comparison():
            wins = wins + 1
    print("Cyrus beat Monica " + str(wins) + " times out of " + str(trials) + " trials")
crossword_comparison(100)
Cyrus beat Monica 100 times out of 100 trials

10.8.3 (c)

Cyrus is interested in using his new function to show Monica that he is superior in crossword solving. He runs crossword_comparison over 100 trials, and assigns the output to a variable called my_wins for easy access. What is one issue with this process?

Answer

This function prints a sentence rather than returning a value or string. The variable my_wins will therefore be assigned to nothing, and could result in an error if it were to be used in any calculations.

print(crossword_comparison(100))
Cyrus beat Monica 100 times out of 100 trials
None

10.8.4 (d)

Finally, Cyrus wants to create a team of Data 8 course staff for competitive crossword-puzzle solving. However, he is particular and will only accept them if they satisfy the following two conditions:

  • He wants to create a very strong team, so he only wants to recruit people who have an average crossword completion time below 5 minutes.
  • His favorite number is 10, so of the people above, he will only recruit those whose last name is exactly 10 letters long.

Write a function that takes in a table with three columns:

  • First (str): Player’s first name
  • Last (str): Player’s last name
  • Time (int): Player’s completion time for that puzzle in seconds

and returns an array of player names (First and Last) that Cyrus will recruit for his team. If Bing Concepcion is supposed to be in the array, you may leave his name as “BingConcepcion”.

def create_team(players):
    player_means = ______________________________
    with_lengths = ______________________________
    chosen_players = ______________________________
    return ______________________________
Answer
def create_team(players):
    player_means = players.group(make_array("First", "Last"), np.mean)
    with_lengths = player_means.with_columns("Length", player_means.apply(len, "Last"))
    chosen_players = with_lengths.where("Time mean", are.below(300)).where("Length", 10)
    return np.char.add(chosen_players.column("First"), chosen_players.column("Last"))
players = Table().with_columns(
    "First", ["Cyrus", "Monica", "Bing", "Wesley", "Wayland"],
    "Last", ["McSwain", "Tsai", "Concepcion", "Zheng", "La"],
    "Time", [250, 280, 260, 310, 240]
)
create_team(players)
array(['BingConcepcion'],
      dtype='<U17')