5 Discussion 03: Visualizations, Data Types, Extending Tables (from Fall 2025)

5.0.1 Contact Information

Name	Wesley Zheng
Pronouns	He/him/his
Email	wzheng0302@berkeley.edu
Discussion	Wednesdays, 12–2 PM @ Etcheverry 3105
Office Hours	Tuesdays/Thursdays, 2–3 PM @ Warren Hall 101

Contact me by email at ease — I typically respond within a day or so!

5.0.2 Announcements

Announcements

Lab 03 is due Friday (9/12) at 5PM
Tutoring sections are available! Take advantage of this resource to get extra practice and support
Please remember to open worksheet links using your Berkeley email account — this prevents access issues (and keeps our inbox from filling up with requests)
Don’t forget to fill out the Midterm Accommodations Form if you anticipate a time conflict
Grades Update: HW01, Lab01, and Lab02 grades will be released this Friday.

An extremely important aspect of data science is visualizing the data in a precise, consistent manner.

5.1 Visualizations

A common discussion people have when visualizing data revolves around the scale of the y-axis on a chart. It is commonly believed that charts with numerical y-axis must always start at 0. After watching the video, do you think this is true? Discuss the following questions.

Code

# Don’t worry about understanding this code! It’s just here to embed the YouTube video into the webpage
from IPython.display import YouTubeVideo
YouTubeVideo("14VYnFhBKcY")

Discussion Question

Think about different types of visualizations and their purposes:

Why might we choose one visualization over another?
What should a reader be able to interpret from a visualization?

We usually tell you what visualization to make, but now it’s important to also think about why we choose certain visualizations and what they communicate.

5.1.1 (a)

In your opinion, when is it appropriate to start the y-axis at 0, and when not?

Answer

Answers vary. One possible idea is that we should start the y-axis at 0 when 0 is a good baseline value for the data we are trying to represent. Another idea is to start the y-axis at a different value if the magnitude of the changes we are trying to show is small.

5.1.2 (b)

In general, what should we consider when making a good data visualization?

Answer

Answers vary. We may want to first think about the purpose of the visualization — what are we trying to show? Is there a trend of interest? A relationship between multiple variables? These questions will guide us toward choosing not only an appropriate type of visualization but also the best ways in which we can present them.

An example: body temperature never needing to reach 0, as seen in the graphs above.

From Visualizations to Tables

Next, we’ll transition from discussing visualizations to working with tables in Python. Table methods allow us to:

Explore our data
Reveal patterns
Build meaningful visualizations later

This helps us connect how we make visualizations with why we choose them.

5.2 Aces

In tennis, a player hits an ace when their serve is untouched by the opponent. Your friend Tim is interested in analyzing the 8 male tennis professionals who have hit the most aces throughout tennis history. The table below is called tennis and includes statistics obtained from the ATP Tour. The table is sorted in decreasing order of Num. Aces, the total number of aces (as an integer) hit by that player in their career.

Code

# Initializing the table!
from datascience import *

tennis = Table().with_columns(
    "Name", ["John Isner", "Ivo Karlovic", "Roger Federer", "Feliciano Lopez",
             "Goran Ivanisevic", "Andy Roddick", "Sam Querrey", "Pete Sampras"],
    "Nationality", ["United States", "Croatia", "Switzerland", "Spain",
                    "Croatia", "United States", "United States", "United States"],
    "Num. Aces", [14470, 13728, 11478, 10261, 10237, 9074, 8879, 8858],
    "Matches Played", [772, 694, 1462, 976, 731, 776, 694, 792],
    "Height (cm)", [208, 211, 185, 188, 193, 188, 198, 185],
    "Weight (kg)", [108, 104, 85, 88, 82, 88, 95, 77]
)

tennis

Name	Nationality	Num. Aces	Matches Played	Height (cm)	Weight (kg)
John Isner	United States	14470	772	208	108
Ivo Karlovic	Croatia	13728	694	211	104
Roger Federer	Switzerland	11478	1462	185	85
Feliciano Lopez	Spain	10261	976	188	88
Goran Ivanisevic	Croatia	10237	731	193	82
Andy Roddick	United States	9074	776	188	88
Sam Querrey	United States	8879	694	198	95
Pete Sampras	United States	8858	792	185	77

Unfortunately, the code Tim wrote to analyze the data has some issues. Below are some error messages that appeared, along with what Tim was trying to calculate. Describe the issues and explain how you would change the code to fix them.

Debugging in Python

Error messages in Jupyter notebooks may look intimidating, but they are one of your best tools for learning.

Key Idea

Some errors are straightforward (like a typo in a name).
Others may be less obvious, but each error message contains clues about what went wrong and where.
Learning to read error messages carefully will save you time and frustration.

Even unusual errors can usually be traced back to a small mistake.

5.2.1 (a)

The proportion of players in the table that are from the United States.

tennis.where("Nationality", "United States") / tennis.num_rows

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 tennis.where("Nationality", "United States") / tennis.num_rows

TypeError: unsupported operand type(s) for /: 'Table' and 'int'

Answer

The code results in an error because it tries to divide a table by an integer, which is not a valid operation.
Correct code:

tennis.where("Nationality", "United States").num_rows / tennis.num_rows

0.5

5.2.2 (b)

An array of the average number of aces hit per match for each player.

tennis.column("Num. Aces") / tennis.select("Matches Played")

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 tennis.column("Num. Aces") / tennis.select("Matches Played")

TypeError: ufunc 'divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

Answer

The code results in an error because it attempts to divide an array by a table. Arrays can only be divided by an integer or another array of a similar data type and of the same length.
Correct code:

tennis.column("Num. Aces") / tennis.column("Matches Played")

array([ 18.74352332,  19.78097983,   7.85088919,  10.51331967,
        14.00410397,  11.69329897,  12.79394813,  11.18434343])

5.2.3 (c)

The name of the player with less than 10,000 aces who has played the most matches in their career.

less_aces_players = tennis.where("Num. Aces", are.below("10000"))
less_aces_most_matches = less_aces_players.sort("Matches Played", descending=True).column("Name").item(0)

---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 less_aces_players = tennis.where("Num. Aces", are.below("10000"))
      2 less_aces_most_matches = less_aces_players.sort("Matches Played", descending=True).column("Name").item(0)

File /opt/anaconda3/envs/quarto/lib/python3.13/site-packages/datascience/tables.py:1415, in Table.where(self, column_or_label, value_or_predicate, other)
   1413     else:
   1414         predicate = value_or_predicate
-> 1415     column = [predicate(x) for x in column]
   1416 return self.take(np.nonzero(column)[0])

File /opt/anaconda3/envs/quarto/lib/python3.13/site-packages/datascience/predicates.py:225, in _combinable.__call__(self, x)
    224 def __call__(self, x):
--> 225     return self.f(x)

File /opt/anaconda3/envs/quarto/lib/python3.13/site-packages/datascience/predicates.py:121, in are.below.<locals>.<lambda>(x)
    119 """Less than y."""
    120 check_iterable(y)
--> 121 return _combinable(lambda x: x < y)

UFuncTypeError: ufunc 'less' did not contain a loop with signature matching types (<class 'numpy.dtypes.Int64DType'>, <class 'numpy.dtypes.StrDType'>) -> None

Answer

The code errors because "10000" is a string, making comparisons with the values in the Num. Aces column (which are integers) invalid. Correct code:

less_aces_players = tennis.where("Num. Aces", are.below(10000))
less_aces_most_matches = less_aces_players.sort("Matches Played", descending=True).column("Name").item(0)

Note: If we wanted the code to evaluate to the actual name, we would need a new line to call less_aces_most_matches.

less_aces_most_matches

'Pete Sampras'

5.3 Ed Discussion is Our Friend

Data 8 staff love to answer questions on Ed! The table staff contains information about the staff’s Spring 2025 Data 8 Ed statistics. There are 5 columns:

Name (string): name of the staff member.
ID (int): the ID of the staff member.
Hearts per Post (float): the number of hearts awarded per post made by a staff member (0 if no posts made)
Answers (int): number of answers to Ed questions and comments.
Admin (boolean): whether or not the staff member was an admin of the Data 8 Spring 2025 Ed.

Some rows are shown below:

Code

from datascience import Table
import numpy as np

staff = Table().with_columns(
    "Name", [
        "Dagny Streit",
        "Dylan Tay",
        "Isaac Chung",
        "Mariel Del Rosario",
        "Brandon Su",
        "Marissa Lumpkin"
    ],
    "ID", [18, 7, 4, 8, 5, 0],
    "Hearts per Post", [6.5172, 0, 9.1333, 0, 0, 16],
    "Answers", [165, 102, 240, 82, 48, 22],
    "Admin", [True, False, True, False, True, True]
)

staff

Name	ID	Hearts per Post	Answers	Admin
Dagny Streit	18	6.5172	165	True
Dylan Tay	7	0	102	False
Isaac Chung	4	9.1333	240	True
Mariel Del Rosario	8	0	82	False
Brandon Su	5	0	48	True
Marissa Lumpkin	0	16	22	True

5.3.1 (a)

For each of the columns in staff, identify if the data contained in that column is numerical or categorical.

Answer

Name: Categorical
ID: Categorical
Hearts per Post: Numerical
Answers: Numerical
Admin: Categorical

One way of determining if a column is numerical or not is to perform arithmetic operations on its values. For example, do we get a meaningful result if we add two IDs together? If not, the column may be categorical.

staff.column("ID").item(0) + staff.column("ID").item(1) #This doesn't make sense!

5.3.2 (b)

Professor Sanchez wants to award a prize to a staff member based on their performance on Ed. To be eligible, the staff member must receive at least one heart per post and hold an admin role. Among those who meet these criteria, the staff member with the highest number of answers will be selected as the winner. Fill in the code blanks to determine who earns the prize.

good_staff = staff.where(_________________________).where(___________)
lucky_staff_name = ____________.sort(___________________).column(______).item(0)

Answer

good_staff = staff.where("Hearts per Post", are.above_or_equal_to(1)).where("Admin", True)
lucky_staff_name = good_staff.sort("Answers", descending=True).column("Name").item(0)

Note: If we wanted the code to evaluate to the actual name, we would need a new line to call lucky_staff_name.

lucky_staff_name

'Isaac Chung'

5.4 Warm Down (Optional)

A table named seat contains a row for each time a student submitted the attendance form in lecture on September 18th, 20th, or 22nd. The table contains four columns.

Email: a string, the email address of the student
Row: a string, the letter of the row in which they claim to be seated
Seat: an int, the number of the seat in which they claim to be seated
Date: an int, the date of the submission, either 18, 20, or 22.

(Fall 2017 Midterm Question 2 Modified)

Code

seats = Table.read_table('seat.csv') # Initializing the seats table
seats.show(3)

Email	Row	Seat	Date
sulu@berkeley.edu	C	102	20
mccoy@berkeley.edu	A	3	18
kirk@berkeley.edu	R	110	20

... (1747 rows omitted)

Fill in the blanks of the Python expressions to compute the described values. You must use all and only the lines provided. The last (or only) line of each answer should evaluate to the value described.

Practice with Exam-Style Questions

Sometimes it’s helpful to practice problems that feel like actual exam questions.

Key Lessons

There is often more than one way to solve a problem. For example, you might use a Python method or a table method to reach the same result.
Some questions involve special details, like comparing rows in lexicographic (alphabetical) order. Hints are usually given, but you’ll need to think carefully.

Working through these types of problems now gives you a taste of what to expect on exams and builds confidence.

5.4.1 (a)

The largest seat number in the seat table.

Method 1:

max(____________________________________)

Answer

max(seats.column('Seat'))

Method 2:

___________.sort(_____________, _______________)._______________.________________

Answer

seats.sort('Seat', descending=True).column('Seat').item(0)

Finding the Largest Value

A common pattern in Data 8 is:

table.sort("column_name").column("column_name").item(0)

This lets us find the row with the largest (or smallest) value in a column.

You’ll see this pattern repeatedly, so it’s worth getting comfortable with it now.

5.4.2 (b)

The total number of attendance submissions for September 20th in rows A, B, C, D, or E.

Hint: You can use Table.where predicates to compare letters lexicographically (e.g. A is below B). You could alternatively use a different method that does not involve comparing lexicographically.

filtered_seats = seats.__________(________________, __________________)
filtered_seats.__________(________________, ________________).______________

Answer

filtered_seats = seats.where("Row", are.below("F"))
filtered_seats.where("Date", 20).num_rows

filtered_seats = seats.where("Row", are.contained_in(make_array("A", "B", "C", "D", "E")))
filtered_seats.where("Date", 20).num_rows

5.5 K-Pop Enthusiasts (Bonus!)

Ethan and his friends attend various concerts. The table concerts contains information about their adventures in 2022. There are four columns:

Name: string, name of the concertgoer
Artist: string, name of the performing artist
Month: string, 1-12 corresponding to the month
Price: float, cost of the concert ticket

Some rows are shown below:

Code

# Initializing the table
concerts = Table().with_columns(
    "Name", [
        "Jeffrey", "Kristen", "Ethan", "Jeffrey", "Oscar",
        "Maya", "Liam", "Sophia", "Noah", "Emma",
        "Ethan", "Ava", "Olivia", "Ethan", "Lucas"
    ],
    "Artist", [
        "Blackpink", "Seventeen", "BTS", "Twice", "BTS",
        "Blackpink", "Seventeen", "Twice", "BTS", "Blackpink",
        "Twice", "Seventeen", "BTS", "Twice", "Twice"
    ],
    "Month", [
        11, 8, 4, 5, 4,
        12, 7, 6, 9, 11,
        5, 8, 4, 5, 6
    ],
    "Price", [
        132.62, 42.68, 70.02, 392.11, 70.02,
        145.00, 50.00, 400.00, 75.00, 150.00,
        392.11, 48.50, 72.00, 400.00, 390.00
    ]
)

concerts.show(5)

Name	Artist	Month	Price
Jeffrey	Blackpink	11	132.62
Kristen	Seventeen	8	42.68
Ethan	BTS	4	70.02
Jeffrey	Twice	5	392.11
Oscar	BTS	4	70.02

... (10 rows omitted)

5.5.1 (a)

For each of the columns in concerts, identify if the data contained in that column is numerical or categorical.

Categorical vs. Numerical Data

Not all numbers are numerical!

Categorical example: An ID column is made of numbers, but taking the sum or average of IDs doesn’t mean anything.
Numerical example: A column of heights or ages is truly numeric because averages and sums make sense.

A good rule of thumb:

If adding or averaging the values doesn’t make sense, the data is categorical.

Example: ZIP Codes

ZIP codes are a classic case: they look like numbers, but they are categorical.
You can group or compare them, but taking an average ZIP code isn’t meaningful.

Answer

Name - Categorical
Artist - Categorical
Month - Categorical
Price - Numerical

5.5.2 (b)

Assume Ethan attended two Twice concerts in 2022. Assign months_passed to the number of months in between those two concerts as an integer.

ethan_twice_months = concerts.where(___________, ___________).where(___________, ___________).column('Month')
months_passed = abs(___________________________)

Filtering with Multiple Conditions

Chaining .where() calls is the only way to filter a table by multiple conditions.

.where() returns a new table, so it can be followed by another .where().
You cannot pass multiple conditions at once.

Being mindful of the data type that each method returns is crucial.

Answer

ethan_twice_months = concerts.where('Name', 'Ethan').where('Artist', 'Twice').column('Month')
months_passed = abs(ethan_twice_months.item(0) - ethan_twice_months.item(1))

months_passed