5  Discussion 03: Visualizations, Data Types, Extending Tables (from Fall 2025)

5.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 12–2 PM @ Etcheverry 3105
Office Hours Tuesdays/Thursdays, 2–3 PM @ Warren Hall 101

Contact me by email at ease — I typically respond within a day or so!


5.0.2 Announcements

CautionAnnouncements
  • Lab 03 is due Friday (9/12) at 5PM
  • Tutoring sections are available! Take advantage of this resource to get extra practice and support
  • Please remember to open worksheet links using your Berkeley email account — this prevents access issues (and keeps our inbox from filling up with requests)
  • Don’t forget to fill out the Midterm Accommodations Form if you anticipate a time conflict
  • Grades Update: HW01, Lab01, and Lab02 grades will be released this Friday.

An extremely important aspect of data science is visualizing the data in a precise, consistent manner.

5.1 Visualizations

A common discussion people have when visualizing data revolves around the scale of the y-axis on a chart. It is commonly believed that charts with numerical y-axis must always start at 0. After watching the video, do you think this is true? Discuss the following questions.

Code
# Don’t worry about understanding this code! It’s just here to embed the YouTube video into the webpage
from IPython.display import YouTubeVideo
YouTubeVideo("14VYnFhBKcY")
NoteDiscussion Question

Think about different types of visualizations and their purposes:

  • Why might we choose one visualization over another?
  • What should a reader be able to interpret from a visualization?

We usually tell you what visualization to make, but now it’s important to also think about why we choose certain visualizations and what they communicate.


5.1.1 (a)

In your opinion, when is it appropriate to start the y-axis at 0, and when not?

Answer Answers vary. One possible idea is that we should start the y-axis at 0 when 0 is a good baseline value for the data we are trying to represent. Another idea is to start the y-axis at a different value if the magnitude of the changes we are trying to show is small.

5.1.2 (b)

In general, what should we consider when making a good data visualization?

Answer

Answers vary. We may want to first think about the purpose of the visualization — what are we trying to show? Is there a trend of interest? A relationship between multiple variables? These questions will guide us toward choosing not only an appropriate type of visualization but also the best ways in which we can present them.

An example: body temperature never needing to reach 0, as seen in the graphs above.
NoteFrom Visualizations to Tables

Next, we’ll transition from discussing visualizations to working with tables in Python. Table methods allow us to:

  • Explore our data
  • Reveal patterns
  • Build meaningful visualizations later

This helps us connect how we make visualizations with why we choose them.

5.2 Aces

In tennis, a player hits an ace when their serve is untouched by the opponent. Your friend Tim is interested in analyzing the 8 male tennis professionals who have hit the most aces throughout tennis history. The table below is called tennis and includes statistics obtained from the ATP Tour. The table is sorted in decreasing order of Num. Aces, the total number of aces (as an integer) hit by that player in their career.

Code
# Initializing the table!
from datascience import *

tennis = Table().with_columns(
    "Name", ["John Isner", "Ivo Karlovic", "Roger Federer", "Feliciano Lopez",
             "Goran Ivanisevic", "Andy Roddick", "Sam Querrey", "Pete Sampras"],
    "Nationality", ["United States", "Croatia", "Switzerland", "Spain",
                    "Croatia", "United States", "United States", "United States"],
    "Num. Aces", [14470, 13728, 11478, 10261, 10237, 9074, 8879, 8858],
    "Matches Played", [772, 694, 1462, 976, 731, 776, 694, 792],
    "Height (cm)", [208, 211, 185, 188, 193, 188, 198, 185],
    "Weight (kg)", [108, 104, 85, 88, 82, 88, 95, 77]
)

tennis
Name Nationality Num. Aces Matches Played Height (cm) Weight (kg)
John Isner United States 14470 772 208 108
Ivo Karlovic Croatia 13728 694 211 104
Roger Federer Switzerland 11478 1462 185 85
Feliciano Lopez Spain 10261 976 188 88
Goran Ivanisevic Croatia 10237 731 193 82
Andy Roddick United States 9074 776 188 88
Sam Querrey United States 8879 694 198 95
Pete Sampras United States 8858 792 185 77

Unfortunately, the code Tim wrote to analyze the data has some issues. Below are some error messages that appeared, along with what Tim was trying to calculate. Describe the issues and explain how you would change the code to fix them.

NoteDebugging in Python

Error messages in Jupyter notebooks may look intimidating, but they are one of your best tools for learning.

Key Idea

  • Some errors are straightforward (like a typo in a name).
  • Others may be less obvious, but each error message contains clues about what went wrong and where.
  • Learning to read error messages carefully will save you time and frustration.

Even unusual errors can usually be traced back to a small mistake.


5.2.1 (a)

The proportion of players in the table that are from the United States.

tennis.where("Nationality", "United States") / tennis.num_rows
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[3], line 1
----> 1 tennis.where("Nationality", "United States") / tennis.num_rows

TypeError: unsupported operand type(s) for /: 'Table' and 'int'
Answer

The code results in an error because it tries to divide a table by an integer, which is not a valid operation.
Correct code:

tennis.where("Nationality", "United States").num_rows / tennis.num_rows
0.5

5.2.2 (b)

An array of the average number of aces hit per match for each player.

tennis.column("Num. Aces") / tennis.select("Matches Played")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[5], line 1
----> 1 tennis.column("Num. Aces") / tennis.select("Matches Played")

TypeError: ufunc 'divide' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Answer

The code results in an error because it attempts to divide an array by a table. Arrays can only be divided by an integer or another array of a similar data type and of the same length.
Correct code:

tennis.column("Num. Aces") / tennis.column("Matches Played")
array([ 18.74352332,  19.78097983,   7.85088919,  10.51331967,
        14.00410397,  11.69329897,  12.79394813,  11.18434343])

5.2.3 (c)

The name of the player with less than 10,000 aces who has played the most matches in their career.

less_aces_players = tennis.where("Num. Aces", are.below("10000"))
less_aces_most_matches = less_aces_players.sort("Matches Played", descending=True).column("Name").item(0)
---------------------------------------------------------------------------
UFuncTypeError                            Traceback (most recent call last)
Cell In[7], line 1
----> 1 less_aces_players = tennis.where("Num. Aces", are.below("10000"))
      2 less_aces_most_matches = less_aces_players.sort("Matches Played", descending=True).column("Name").item(0)

File /opt/anaconda3/envs/quarto/lib/python3.13/site-packages/datascience/tables.py:1415, in Table.where(self, column_or_label, value_or_predicate, other)
   1413     else:
   1414         predicate = value_or_predicate
-> 1415     column = [predicate(x) for x in column]
   1416 return self.take(np.nonzero(column)[0])

File /opt/anaconda3/envs/quarto/lib/python3.13/site-packages/datascience/predicates.py:225, in _combinable.__call__(self, x)
    224 def __call__(self, x):
--> 225     return self.f(x)

File /opt/anaconda3/envs/quarto/lib/python3.13/site-packages/datascience/predicates.py:121, in are.below.<locals>.<lambda>(x)
    119 """Less than y."""
    120 check_iterable(y)
--> 121 return _combinable(lambda x: x < y)

UFuncTypeError: ufunc 'less' did not contain a loop with signature matching types (<class 'numpy.dtypes.Int64DType'>, <class 'numpy.dtypes.StrDType'>) -> None
Answer

The code errors because "10000" is a string, making comparisons with the values in the Num. Aces column (which are integers) invalid. Correct code:

less_aces_players = tennis.where("Num. Aces", are.below(10000))
less_aces_most_matches = less_aces_players.sort("Matches Played", descending=True).column("Name").item(0)

Note: If we wanted the code to evaluate to the actual name, we would need a new line to call less_aces_most_matches.

less_aces_most_matches
'Pete Sampras'

5.3 Ed Discussion is Our Friend

Data 8 staff love to answer questions on Ed! The table staff contains information about the staff’s Spring 2025 Data 8 Ed statistics. There are 5 columns:

  • Name (string): name of the staff member.
  • ID (int): the ID of the staff member.
  • Hearts per Post (float): the number of hearts awarded per post made by a staff member (0 if no posts made)
  • Answers (int): number of answers to Ed questions and comments.
  • Admin (boolean): whether or not the staff member was an admin of the Data 8 Spring 2025 Ed.

Some rows are shown below:

Code
from datascience import Table
import numpy as np

staff = Table().with_columns(
    "Name", [
        "Dagny Streit",
        "Dylan Tay",
        "Isaac Chung",
        "Mariel Del Rosario",
        "Brandon Su",
        "Marissa Lumpkin"
    ],
    "ID", [18, 7, 4, 8, 5, 0],
    "Hearts per Post", [6.5172, 0, 9.1333, 0, 0, 16],
    "Answers", [165, 102, 240, 82, 48, 22],
    "Admin", [True, False, True, False, True, True]
)

staff
Name ID Hearts per Post Answers Admin
Dagny Streit 18 6.5172 165 True
Dylan Tay 7 0 102 False
Isaac Chung 4 9.1333 240 True
Mariel Del Rosario 8 0 82 False
Brandon Su 5 0 48 True
Marissa Lumpkin 0 16 22 True

5.3.1 (a)

For each of the columns in staff, identify if the data contained in that column is numerical or categorical.

Answer
  • Name: Categorical
  • ID: Categorical
  • Hearts per Post: Numerical
  • Answers: Numerical
  • Admin: Categorical

One way of determining if a column is numerical or not is to perform arithmetic operations on its values. For example, do we get a meaningful result if we add two IDs together? If not, the column may be categorical.

staff.column("ID").item(0) + staff.column("ID").item(1) #This doesn't make sense!
25

5.3.2 (b)

Professor Sanchez wants to award a prize to a staff member based on their performance on Ed. To be eligible, the staff member must receive at least one heart per post and hold an admin role. Among those who meet these criteria, the staff member with the highest number of answers will be selected as the winner. Fill in the code blanks to determine who earns the prize.

good_staff = staff.where(_________________________).where(___________)
lucky_staff_name = ____________.sort(___________________).column(______).item(0)
Answer
good_staff = staff.where("Hearts per Post", are.above_or_equal_to(1)).where("Admin", True)
lucky_staff_name = good_staff.sort("Answers", descending=True).column("Name").item(0)

Note: If we wanted the code to evaluate to the actual name, we would need a new line to call lucky_staff_name.

lucky_staff_name
'Isaac Chung'

5.4 Warm Down (Optional)

A table named seat contains a row for each time a student submitted the attendance form in lecture on September 18th, 20th, or 22nd. The table contains four columns.

  • Email: a string, the email address of the student
  • Row: a string, the letter of the row in which they claim to be seated
  • Seat: an int, the number of the seat in which they claim to be seated
  • Date: an int, the date of the submission, either 18, 20, or 22.

(Fall 2017 Midterm Question 2 Modified)

Code
seats = Table.read_table('seat.csv') # Initializing the seats table
seats.show(3)
Email Row Seat Date
sulu@berkeley.edu C 102 20
mccoy@berkeley.edu A 3 18
kirk@berkeley.edu R 110 20

... (1747 rows omitted)

Fill in the blanks of the Python expressions to compute the described values. You must use all and only the lines provided. The last (or only) line of each answer should evaluate to the value described.

NotePractice with Exam-Style Questions

Sometimes it’s helpful to practice problems that feel like actual exam questions.

Key Lessons

  • There is often more than one way to solve a problem. For example, you might use a Python method or a table method to reach the same result.
  • Some questions involve special details, like comparing rows in lexicographic (alphabetical) order. Hints are usually given, but you’ll need to think carefully.

Working through these types of problems now gives you a taste of what to expect on exams and builds confidence.


5.4.1 (a)

The largest seat number in the seat table.

Method 1:

max(____________________________________)

Answer
max(seats.column('Seat'))
150

Method 2:

___________.sort(_____________, _______________)._______________.________________

Answer
seats.sort('Seat', descending=True).column('Seat').item(0)
150
NoteFinding the Largest Value

A common pattern in Data 8 is:

table.sort("column_name").column("column_name").item(0)

This lets us find the row with the largest (or smallest) value in a column.

You’ll see this pattern repeatedly, so it’s worth getting comfortable with it now.


5.4.2 (b)

The total number of attendance submissions for September 20th in rows A, B, C, D, or E.

Hint: You can use Table.where predicates to compare letters lexicographically (e.g. A is below B). You could alternatively use a different method that does not involve comparing lexicographically.

filtered_seats = seats.__________(________________, __________________)
filtered_seats.__________(________________, ________________).______________
Answer
filtered_seats = seats.where("Row", are.below("F"))
filtered_seats.where("Date", 20).num_rows
106

or

filtered_seats = seats.where("Row", are.contained_in(make_array("A", "B", "C", "D", "E")))
filtered_seats.where("Date", 20).num_rows
106

5.5 K-Pop Enthusiasts (Bonus!)

Ethan and his friends attend various concerts. The table concerts contains information about their adventures in 2022. There are four columns:

  • Name: string, name of the concertgoer
  • Artist: string, name of the performing artist
  • Month: string, 1-12 corresponding to the month
  • Price: float, cost of the concert ticket

Some rows are shown below:

Code
# Initializing the table
concerts = Table().with_columns(
    "Name", [
        "Jeffrey", "Kristen", "Ethan", "Jeffrey", "Oscar",
        "Maya", "Liam", "Sophia", "Noah", "Emma",
        "Ethan", "Ava", "Olivia", "Ethan", "Lucas"
    ],
    "Artist", [
        "Blackpink", "Seventeen", "BTS", "Twice", "BTS",
        "Blackpink", "Seventeen", "Twice", "BTS", "Blackpink",
        "Twice", "Seventeen", "BTS", "Twice", "Twice"
    ],
    "Month", [
        11, 8, 4, 5, 4,
        12, 7, 6, 9, 11,
        5, 8, 4, 5, 6
    ],
    "Price", [
        132.62, 42.68, 70.02, 392.11, 70.02,
        145.00, 50.00, 400.00, 75.00, 150.00,
        392.11, 48.50, 72.00, 400.00, 390.00
    ]
)

concerts.show(5)
Name Artist Month Price
Jeffrey Blackpink 11 132.62
Kristen Seventeen 8 42.68
Ethan BTS 4 70.02
Jeffrey Twice 5 392.11
Oscar BTS 4 70.02

... (10 rows omitted)


5.5.1 (a)

For each of the columns in concerts, identify if the data contained in that column is numerical or categorical.

NoteCategorical vs. Numerical Data

Not all numbers are numerical!

  • Categorical example: An ID column is made of numbers, but taking the sum or average of IDs doesn’t mean anything.
  • Numerical example: A column of heights or ages is truly numeric because averages and sums make sense.

A good rule of thumb:

If adding or averaging the values doesn’t make sense, the data is categorical.

NoteExample: ZIP Codes

ZIP codes are a classic case: they look like numbers, but they are categorical.
You can group or compare them, but taking an average ZIP code isn’t meaningful.

Answer Name - Categorical
Artist - Categorical
Month - Categorical
Price - Numerical

5.5.2 (b)

Assume Ethan attended two Twice concerts in 2022. Assign months_passed to the number of months in between those two concerts as an integer.

ethan_twice_months = concerts.where(___________, ___________).where(___________, ___________).column('Month')
months_passed = abs(___________________________)
NoteFiltering with Multiple Conditions

Chaining .where() calls is the only way to filter a table by multiple conditions.

  • .where() returns a new table, so it can be followed by another .where().
  • You cannot pass multiple conditions at once.

Being mindful of the data type that each method returns is crucial.

Answer
ethan_twice_months = concerts.where('Name', 'Ethan').where('Artist', 'Twice').column('Month')
months_passed = abs(ethan_twice_months.item(0) - ethan_twice_months.item(1))
months_passed
0