1 CS 189 Discussion 01: Math Review & Data Processing
1.0.1 Contact Information
| Name | Wesley Zheng |
| Pronouns | He/him/his |
| Email | wzheng0302@berkeley.edu |
| Discussion | Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3–4 PM @ Hildebrand Hall B51 |
| Office Hours | Tuesdays, 11–1 PM @ Cory Courtyard |
1.1 Calculus
Calculus is fundamental in machine learning because it allows us to measure changes in functions and compute gradients, which guide optimization.
Derivatives & Partial Derivatives
- Measure how a function changes with respect to one variable while keeping others constant.
- Partial derivatives are critical for computing gradients in multivariate functions.
Chain Rule
- For a composite function \(f(g(x))\):
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} \]
- More generally, for a function of several intermediate variables, \(f(g(x), h(x))\):
\[ \frac{d}{dx}f(g(x), h(x)) = \frac{\partial f}{\partial g}\frac{dg}{dx} + \frac{\partial f}{\partial h}\frac{dh}{dx} \]
- Used extensively in backpropagation to compute derivatives of neural network layers.
Common Activation Functions
Sigmoid:
\[ \sigma(t) = \frac{1}{1 + e^{-t}}, \quad \frac{d\sigma}{dt} = \sigma(t)(1-\sigma(t)) \]
Maps real numbers to \((0,1)\); often used for binary classification.
tanh:
\[ \tanh(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}, \quad \frac{d}{dt}\tanh(t) = 1 - \tanh^2(t) \]
Maps to \((-1,1)\), zero-centered; helps with symmetric outputs.
ReLU:
\[ \text{ReLU}(t) = \max(0,t), \quad \frac{d}{dt}\text{ReLU}(t) = \begin{cases} 1 & t > 0 \\ 0 & t \le 0 \end{cases} \]
Common in deep learning for fast, sparse activations.
Softmax:
For vector \(\mathbf{z} \in \mathbb{R}^n\),
\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}} \]
Converts vector scores into probabilities; used in multi-class classification.
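A minimal NumPy sketch of these activations and their derivatives (the function names are illustrative, not a required API):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_grad(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def relu(t):
    return np.maximum(0.0, t)

def relu_grad(t):
    # Uses the convention from the notes: derivative 0 at t = 0.
    return (t > 0).astype(float)

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

t = np.linspace(-3, 3, 5)
print(sigmoid(t), sigmoid_grad(t))
print(np.tanh(t), 1 - np.tanh(t) ** 2)      # tanh and its derivative
print(relu(t), relu_grad(t))
print(softmax(np.array([1.0, 2.0, 3.0])))   # entries sum to 1
```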
Gradient, Jacobian, Hessian
- Gradient = vector of first-order partial derivatives; points in direction of steepest ascent.
- Jacobian = matrix of all first-order partial derivatives for vector-valued functions.
- Hessian = matrix of second-order partial derivatives; describes curvature and is used in optimization.
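Gradients can be sanity-checked numerically with finite differences; a small sketch, assuming NumPy (the helper name `numerical_gradient` is illustrative):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# Example: f(x) = x1^2 + 3*x1*x2, whose gradient is [2*x1 + 3*x2, 3*x1].
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
x0 = np.array([1.0, 2.0])
print(numerical_gradient(f, x0))                      # ~ [8., 3.]
print(np.array([2 * x0[0] + 3 * x0[1], 3 * x0[0]]))   # analytic gradient
```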
1.1.1 (a)
Consider the function \[f(x, y) = \sigma(ax + by),\] where \(a, b \in \mathbb{R}\) and \(\sigma(t) = \dfrac{1}{1 + e^{-t}}\) for \(t\in \mathbb{R}.\)
1.1.1.1 (i)
Show that \(\dfrac{d\sigma}{dt} = \sigma(t)\left(1-\sigma(t)\right)\).
Answer
\[ \begin{align} \frac{d}{dt}\sigma(t) &= \frac{d}{dt}\left(\frac{1}{1+e^{-t}}\right) \\ &= \frac{e^{-t}}{(1+e^{-t})^2} \\ &= \left(\frac{1}{1+e^{-t}}\right) \left(\frac{1 + e^{-t}}{1 + e^{-t}} - \frac{1}{1 + e^{-t}}\right) \\ &= \boxed{\sigma(t)\left(1-\sigma(t)\right)} \end{align} \]
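A quick numerical check of this identity, assuming NumPy:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-4, 4, 9)
eps = 1e-6
finite_diff = (sigmoid(t + eps) - sigmoid(t - eps)) / (2 * eps)
identity = sigmoid(t) * (1 - sigmoid(t))
print(np.allclose(finite_diff, identity))  # True (up to floating-point error)
```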
1.1.1.2 (ii)
Using the result you showed in part (i) and the chain rule, compute \(\dfrac{\partial f}{\partial x}\) and \(\dfrac{\partial f}{\partial y}\).
Answer
First let’s define a new variable \(u = ax + by\) such that \(f = \sigma(u)\). By chain rule, we have \[\frac{\partial f}{\partial x} = \frac{\partial \sigma}{\partial u} \frac{\partial u}{\partial x}.\] We can calculate \(\dfrac{\partial u}{\partial x} = a\), giving us, in conjunction with the result from part (i), \[\frac{\partial f}{\partial x} = \sigma(u)\left(1-\sigma(u)\right) \cdot a.\] Substituting \(ax + by\) back in for \(u\) gives us \[\frac{\partial f}{\partial x} = \boxed{a\sigma(ax + by)\left[1-\sigma(ax + by)\right].}\]
Following the same process for \(\dfrac{\partial f}{\partial y}\), we get
\[\frac{\partial f}{\partial y} = \boxed{b\sigma(ax + by)\left[1-\sigma(ax + by)\right].}\]
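A quick finite-difference check of both partials, assuming NumPy (the constants `a`, `b` and the test point are arbitrary):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a, b = 2.0, -1.5
x, y = 0.3, 0.7
eps = 1e-6

f = lambda x, y: sigmoid(a * x + b * y)

# Finite-difference partials vs. the closed-form answers above.
df_dx_num = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
df_dy_num = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
u = a * x + b * y
df_dx = a * sigmoid(u) * (1 - sigmoid(u))
df_dy = b * sigmoid(u) * (1 - sigmoid(u))
print(np.allclose([df_dx_num, df_dy_num], [df_dx, df_dy]))  # True
```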
1.1.2 (b)
For \(\mathbf{x}=\begin{bmatrix}x_1 \\ \vdots \\ x_n\end{bmatrix}\in\mathbb{R}^n\), define \[ r(\mathbf{x})=\sum_{j=1}^n x_j^2. \] Compute the partial derivative \(\dfrac{\partial r}{\partial x_i}\) for a generic coordinate \(i \in \{1, ..., n\}\).
Answer
Since \(r(\mathbf{x})=\sum_{j=1}^n x_j^2\), we have \[ \frac{\partial r}{\partial x_i}=\boxed{2x_i}. \]
1.1.3 (c)
Let \(\mathbf{w}\in\mathbb{R}^n\) be a constant vector and define the scalar function \[ s(\mathbf{x})=\mathbf{w}^\top \mathbf{x}=\sum_{j=1}^n w_j x_j. \] Compute \(\dfrac{\partial s}{\partial x_i}\) for a generic coordinate \(i \in \{1, ..., n\}\).
Answer
With \(s(\mathbf{x})=\sum_{j=1}^n w_j x_j\) and \(\mathbf{w}\) treated as constant, only the \(j=i\) term depends on \(x_i\), so \[ \frac{\partial s}{\partial x_i}=\boxed{w_i}. \]
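A quick numerical check of the gradients from parts (b) and (c), assuming NumPy (the vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
w = rng.normal(size=n)
eps = 1e-6

r = lambda v: np.sum(v ** 2)   # part (b): gradient should be 2x
s = lambda v: w @ v            # part (c): gradient should be w

I = np.eye(n)
for func, analytic in [(r, 2 * x), (s, w)]:
    numeric = np.array([(func(x + eps * I[i]) - func(x - eps * I[i])) / (2 * eps)
                        for i in range(n)])
    print(np.allclose(numeric, analytic))  # True for both gradients
```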
1.2 Linear Algebra
Linear algebra is the language of machine learning. It allows us to represent data, transformations, and models efficiently.
Vectors & Matrices
- Vector: \(\mathbf{x} = [x_1, x_2, \dots, x_n]^T\); represents data points, features, or parameters.
- Matrix: \(\mathbf{A} \in \mathbb{R}^{m \times n}\); represents datasets, transformations, or weight layers.
- Norms: Measure vector size or distance: \[ ||\mathbf{x}||_2 = \sqrt{\sum_i x_i^2}, \quad ||\mathbf{x}||_1 = \sum_i |x_i| \] Common in regularization and optimization.
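A small NumPy sketch of these norms:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, 2))    # 5.0 (Euclidean / L2 norm)
print(np.linalg.norm(x, 1))    # 7.0 (L1 norm)
print(np.sqrt(np.sum(x ** 2)), np.sum(np.abs(x)))  # same values, computed by hand
```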
Matrix Operations
- Dot Product: \(\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i\); measures similarity between vectors.
- Matrix Multiplication: \((\mathbf{AB})_{ij} = \sum_k A_{ik} B_{kj}\); represents composed linear transformations.
- Transpose: \((\mathbf{A}^T)_{ij} = A_{ji}\).
- Trace: \(\text{tr}(\mathbf{A}) = \sum_i A_{ii}\); linear operator, appears in loss and covariance computations.
- Determinant: \(\det(\mathbf{A})\); indicates invertibility and volume scaling.
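A small NumPy sketch of these operations on toy matrices:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

print(u @ v)             # dot product: 11.0
print(A @ B)             # matrix product
print(A.T)               # transpose
print(np.trace(A))       # 5.0
print(np.linalg.det(A))  # -2.0 (nonzero, so A is invertible)
```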
Eigenvalues & Eigenvectors
- Solve \(\mathbf{A}\mathbf{v} = \lambda \mathbf{v}\); eigenvectors define directions, eigenvalues define scaling.
- Used in PCA, spectral clustering, and stability analysis.
- Symmetric matrices always have real eigenvalues and orthogonal eigenvectors.
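A small sketch, assuming NumPy, that verifies these facts on a toy symmetric matrix:

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])        # symmetric
vals, vecs = np.linalg.eigh(S)                # eigh is for symmetric/Hermitian matrices
print(vals)                                   # [1. 3.]: real eigenvalues
print(np.allclose(S @ vecs[:, 0], vals[0] * vecs[:, 0]))   # A v = lambda v
print(np.allclose(vecs.T @ vecs, np.eye(2)))               # eigenvectors are orthonormal
```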
Singular Value Decomposition (SVD)
- \(\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T\); factorizes a matrix into orthonormal bases and singular values.
- \(\Sigma\) contains singular values \(\sigma_1 \ge \dots \ge \sigma_r > 0\).
- Used for dimensionality reduction, low-rank approximation, and solving least squares problems.
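A small SVD sketch, assuming NumPy, including a rank-1 (truncated-SVD) approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)                                      # singular values, in descending order
print(np.allclose(A, U @ np.diag(s) @ Vt))    # exact reconstruction

# Best rank-1 approximation (truncated SVD).
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.linalg.norm(A - A1, 2), s[1])        # spectral-norm error equals sigma_2
```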
Matrix Properties
- Symmetric: \(\mathbf{A} = \mathbf{A}^T\); real eigenvalues, orthogonal eigenvectors.
- Positive Definite: \(\mathbf{x}^T \mathbf{A} \mathbf{x} > 0\) for all \(\mathbf{x} \neq 0\); arises in quadratic forms and convex optimization.
- Orthogonal: \(\mathbf{Q}^T \mathbf{Q} = \mathbf{Q} \mathbf{Q}^T = I\); preserves lengths and angles, common in rotations and SVD.
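A few quick checks of these properties, assuming NumPy (the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])
print(np.allclose(A, A.T))                  # symmetric
print(np.all(np.linalg.eigvalsh(A) > 0))    # all eigenvalues positive => positive definite

Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(3, 3)))
print(np.allclose(Q.T @ Q, np.eye(3)))      # orthogonal: Q^T Q = I
```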
Special Matrices in ML
- Covariance Matrix: \(\Sigma = \frac{1}{n} X^\top X\) for centered \(X\) (each feature has its mean subtracted); symmetric, PSD, describes data spread (see the sketch after this list).
- Identity Matrix: \(I_n\); neutral element for multiplication.
- Diagonal Matrix: non-zero entries only on diagonal; simplifies eigenvalue computations.
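A small sketch of the covariance computation, assuming NumPy; note the centering step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 samples, 3 features

Xc = X - X.mean(axis=0)                 # center each feature
cov = Xc.T @ Xc / X.shape[0]
print(np.allclose(cov, np.cov(X, rowvar=False, bias=True)))  # matches NumPy's estimator
print(np.all(np.linalg.eigvalsh(cov) >= -1e-12))             # PSD (up to rounding)
```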
Linear Transformations
- Matrix multiplication represents linear transformations: \(y = A x\).
- Properties like rank, null space, and column space describe how transformations act on input space.
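A tiny NumPy example of a linear transformation and its rank:

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 1.0]])   # maps R^2 -> R^3
x = np.array([1.0, -1.0])
print(A @ x)                      # the linear transformation applied to x
print(np.linalg.matrix_rank(A))   # 2: the columns are linearly independent
```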
1.2.1 (a)
Prove that \(\mathbf{A}^\top \mathbf{A}\) is symmetric for any \(\mathbf{A}\in\mathbb{R}^{m \times n}\).
Answer
A matrix \(\mathbf{S}\) is symmetric if \(\mathbf{S} = \mathbf{S}^\top\). Using \((\mathbf{XY})^\top = \mathbf{Y}^\top\mathbf{X}^\top\) and \((\mathbf{X}^\top)^\top = \mathbf{X}\), \[ \boxed{(\mathbf{A}^\top\mathbf{A})^\top = \mathbf{A}^\top (\mathbf{A}^\top)^\top = \mathbf{A}^\top\mathbf{A}}. \]
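A quick numerical illustration of this fact, assuming NumPy:

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))
S = A.T @ A
print(np.allclose(S, S.T))  # True: A^T A is symmetric
```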
1.2.2 (b)
Consider the matrix \[ \mathbf{B} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 2 & 1 \end{bmatrix} \]
Find the singular values of \(\mathbf{B}\).
Answer
The singular values are the square roots of the eigenvalues of \(\mathbf{B}^\top \mathbf{B}\). First compute: \[ \mathbf{B}^\top \mathbf{B} = \begin{bmatrix} 1 & 0 & 2 \\ 0 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 2 \\ 2 & 5 \end{bmatrix}. \]
The characteristic polynomial is: \[ \det\!\left( \begin{bmatrix}5 & 2\\2 & 5\end{bmatrix} - \lambda I \right) = (5-\lambda)^2 - 4 = \lambda^2 - 10\lambda + 21. \] This factors as \((\lambda - 7)(\lambda - 3)\). The eigenvalues are the roots of this expression, so they are \(\lambda_1 = 7\) and \(\lambda_2 = 3\). Therefore, the singular values are: \[ \boxed{\sigma_1 = \sqrt{7}, \quad \sigma_2 = \sqrt{3}.} \]
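A quick check with NumPy:

```python
import numpy as np

B = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
print(np.linalg.svd(B, compute_uv=False))           # [sqrt(7), sqrt(3)] ~ [2.6458, 1.7321]
print(np.sqrt(np.linalg.eigvalsh(B.T @ B))[::-1])   # same values via eigenvalues of B^T B
```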
1.2.3 (c) (Bonus!)
Recall the spectral theorem: any symmetric matrix \(S \in \mathbb{R}^{n \times n}\) can be written as \[ S = Q \Lambda Q^\top, \] with \(Q\) orthonormal and \(\Lambda\) diagonal of eigenvalues.
Let \(A \in \mathbb{R}^{m \times n}\) have an SVD \[ A = U \Sigma V^\top, \] where \(U \in \mathbb{R}^{m \times m}\) and \(V \in \mathbb{R}^{n \times n}\) are orthonormal, and \(\Sigma \in \mathbb{R}^{m \times n}\) is rectangular diagonal with singular values \(\sigma_1 \ge \cdots \ge \sigma_r > 0\) on the main diagonal (and zeros otherwise). Use this to derive the spectral decomposition of \(A^\top A\); explicitly identify \(Q\) and \(\Lambda\) in terms of the SVD factors.
Answer
Compute \[ A^\top A = (U \Sigma V^\top)^\top (U \Sigma V^\top) = V \Sigma^\top U^\top U \Sigma V^\top = V (\Sigma^\top \Sigma) V^\top, \] since \(U^\top U = I_m\). The matrix \(\Sigma^\top \Sigma \in \mathbb{R}^{n \times n}\) is diagonal with entries \[ \Sigma^\top \Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_r^2, 0, \dots, 0). \] Thus \[ A^\top A = V \underbrace{\operatorname{diag}(\sigma_1^2, \dots, \sigma_r^2, 0, \dots, 0)}_{\Lambda} V^\top. \] This is precisely the spectral decomposition of the symmetric matrix \(A^\top A\), with \[ Q = V \quad\text{(columns are right singular vectors of $A$)}, \qquad \Lambda = \operatorname{diag}(\sigma_1^2, \dots, \sigma_r^2, 0, \dots, 0). \] In particular, the eigenvalues of \(A^\top A\) are the squared singular values of \(A\), and its orthonormal eigenvectors are the right singular vectors (the columns of \(V\)).
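A quick numerical check of this relationship, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 5x5, Vt is 3x3
evals, Q = np.linalg.eigh(A.T @ A)     # eigendecomposition of A^T A

print(np.allclose(np.sort(evals)[::-1], s ** 2))           # eigenvalues = squared singular values
print(np.allclose(A.T @ A, Vt.T @ np.diag(s ** 2) @ Vt))   # A^T A = V diag(sigma^2) V^T
```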
1.3 Probability Review
The exercises in this section use the following setup. An incoming email is spam with prior \(p(S)=0.2\) and not spam with \(p(\bar S)=0.8\). Two filters flag spam: \[ p(F_1{=}1\mid S)=0.9,\quad p(F_1{=}1\mid \bar S)=0.1,\qquad p(F_2{=}1\mid S)=0.8,\quad p(F_2{=}1\mid \bar S)=0.05, \] and \(F_1,F_2\) are conditionally independent given the class (\(S\) or \(\bar S\)).
Probability provides the framework for reasoning under uncertainty, which is essential for modeling and inference in ML.
Random Variables & Distributions
- Random Variable: \(X\) that takes values according to a probability distribution.
- Probability Distribution: Describes how probabilities are assigned (e.g., Bernoulli, Binomial, Gaussian).
- Expectation / Mean:
\[ \mathbb{E}[X] = \sum_x x \, P(X=x) \quad \text{or} \quad \int x \, p(x) dx \]
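A small sketch computing an expectation analytically and by simulation, assuming NumPy (a fair six-sided die is used as the example):

```python
import numpy as np

values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
print(values @ probs)                                      # E[X] = 3.5
rng = np.random.default_rng(0)
print(rng.choice(values, size=100_000, p=probs).mean())    # ~3.5 by Monte Carlo
```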
Conditional Probability & Independence
- Conditional Probability:
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
- Marginalization:
\[ P(A) = \sum_B P(A,B) \]
- Independence: \(A \perp B \iff P(A,B) = P(A)P(B)\).
- Conditional Independence: \(A \perp B \mid C \iff P(A,B|C) = P(A|C)P(B|C)\).
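A small sketch, assuming NumPy, of marginalization and conditioning on a toy joint table (the numbers are made up for illustration):

```python
import numpy as np

# Joint distribution P(A, B) over two binary variables: rows index A, columns index B.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])
P_A = joint.sum(axis=1)                # marginalization over B
P_B = joint.sum(axis=0)                # marginalization over A
P_A_given_B1 = joint[:, 1] / P_B[1]    # conditional P(A | B=1)
print(P_A, P_B, P_A_given_B1)
print(np.allclose(joint, np.outer(P_A, P_B)))  # False: A and B are not independent here
```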
Bayes’ Theorem
- Updates belief given evidence:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
Likelihood, Prior, Posterior
- Prior: \(P(\theta)\), belief about parameters before seeing data.
- Likelihood: \(P(D|\theta)\), probability of data given parameters.
- Posterior: \(P(\theta|D)\), updated belief after observing data.
1.3.1 (a)
What is the probability that both filters flag an email as spam?
Answer
We want to find \(p(F_1=1, F_2=1)\). Conditional independence tells us \[p(F_1=1, F_2=1\mid c) = p(F_1=1\mid c)\,p(F_2=1\mid c).\]
We can then use the definition of conditional probability to find \[p(F_1=1, F_2=1, c) = p(F_1=1, F_2=1\mid c)\,p(c).\]
Finally, we can marginalize over \(c\) using the sum rule to get \[p(F_1=1, F_2=1)=\sum_{c\in\{S,\bar S\}}p(F_1=1, F_2=1, c).\]
Putting this all together gives us
\[ \begin{aligned} p(F_1=1, F_2=1) &= \sum_{c \in \{S, \bar S\}} p(c) p(F_1=1 \mid c) p(F_2=1 \mid c) \\ &= 0.2 \cdot (0.9 \cdot 0.8) + 0.8 \cdot (0.1 \cdot 0.05) \\ &= 0.144 + 0.004 \\ &= \boxed{0.148} \end{aligned} \]
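The same computation as a short Python sketch (variable names are illustrative):

```python
# A quick check of part (a), using the probabilities given in the setup.
p_S, p_notS = 0.2, 0.8
p_f1 = {"S": 0.9, "notS": 0.1}
p_f2 = {"S": 0.8, "notS": 0.05}

p_both = p_S * p_f1["S"] * p_f2["S"] + p_notS * p_f1["notS"] * p_f2["notS"]
print(p_both)  # 0.148
```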
1.3.2 (b)
Given that both filters flag an email, what is the probability of the email being spam? (You can leave your answer as an unsimplified fraction.)
Answer
We wish to calculate \(p(S\mid F_1 =1, F_2=1)\). We can utilize Bayes’ Theorem, which gives us
\[ p(S\mid F_1 =1, F_2=1) = \frac{p(F_1 =1, F_2=1\mid S)\,p(S)}{p(F_1 =1, F_2=1)}. \]
The denominator comes from part (a), \(p(S)\) is given to us as \(0.2\), and \(p(F_1 =1, F_2=1\mid S) = p(F_1=1\mid S)\,p(F_2 = 1\mid S) = 0.9(0.8) = 0.72\) by conditional independence. Therefore, we have
\[ p(S\mid F_1 =1, F_2=1) = \frac{0.72(0.2)}{0.148} = \boxed{\frac{0.144}{0.148} \approx 0.973}. \]
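And the posterior as a short Python sketch, reusing the value from part (a):

```python
# Posterior from part (b) via Bayes' theorem.
p_S = 0.2
likelihood_S = 0.9 * 0.8   # p(F1=1, F2=1 | S) by conditional independence
p_both = 0.148             # from part (a)

posterior = likelihood_S * p_S / p_both
print(posterior)           # ~0.973
```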