1 CS 189 Discussion 01: Math Review & Data Processing
1.0.1 Contact Information
| Name | Wesley Zheng |
| Pronouns | He/him/his |
| Email | wzheng0302@berkeley.edu |
| Discussion | Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3–4 PM @ Hildebrand Hall B51 |
| Office Hours | Tuesdays, 11–1 PM @ Cory Courtyard |
1.1 Calculus
Calculus is fundamental in machine learning because it allows us to measure changes in functions and compute gradients, which guide optimization.
Derivatives & Partial Derivatives
- Measure how a function changes with respect to one variable while keeping others constant.
- Partial derivatives are critical for computing gradients in multivariate functions.
Chain Rule
- For a composite function \(f(g(x))\):
\[ \frac{df}{dx} = \frac{df}{dg} \cdot \frac{dg}{dx} \]
- More generally, for a function of several intermediate variables, \(f(g(x), h(x))\):
\[ \frac{d}{dx}f(g(x), h(x)) = \frac{\partial f}{\partial g}\frac{dg}{dx} + \frac{\partial f}{\partial h}\frac{dh}{dx} \]
- Used extensively in backpropagation to compute derivatives of neural network layers.
Common Activation Functions
Sigmoid:
\[ \sigma(t) = \frac{1}{1 + e^{-t}}, \quad \frac{d\sigma}{dt} = \sigma(t)(1-\sigma(t)) \]
Maps real numbers to \((0,1)\); often used for binary classification.
tanh:
\[ \tanh(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}, \quad \frac{d}{dt}\tanh(t) = 1 - \tanh^2(t) \]
Maps to \((-1,1)\), zero-centered; helps with symmetric outputs.
ReLU:
\[ \text{ReLU}(t) = \max(0,t), \quad \frac{d}{dt}\text{ReLU}(t) = \begin{cases} 1 & t > 0 \\ 0 & t \le 0 \end{cases} \]
Common in deep learning for fast, sparse activations.
Softmax:
For vector \(\mathbf{z} \in \mathbb{R}^n\),
\[ \text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}} \]
Converts vector scores into probabilities; used in multi-class classification.
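A minimal NumPy sketch of these activations and their derivatives (the function names are illustrative, not a required API):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_grad(t):
    s = sigmoid(t)
    return s * (1.0 - s)

def relu(t):
    return np.maximum(0.0, t)

def relu_grad(t):
    # Uses the convention from the notes: derivative 0 at t = 0.
    return (t > 0).astype(float)

def softmax(z):
    # Subtract the max for numerical stability; this does not change the result.
    e = np.exp(z - np.max(z))
    return e / e.sum()

t = np.linspace(-3, 3, 5)
print(sigmoid(t), sigmoid_grad(t))
print(np.tanh(t), 1 - np.tanh(t) ** 2)      # tanh and its derivative
print(relu(t), relu_grad(t))
print(softmax(np.array([1.0, 2.0, 3.0])))   # entries sum to 1
```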
Gradient, Jacobian, Hessian
- Gradient = vector of first-order partial derivatives; points in direction of steepest ascent.
- Jacobian = matrix of all first-order partial derivatives for vector-valued functions.
- Hessian = matrix of second-order partial derivatives; describes curvature and is used in optimization.
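Gradients can be sanity-checked numerically with finite differences; a small sketch, assuming NumPy (the helper name `numerical_gradient` is illustrative):

```python
import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    """Central-difference approximation of the gradient of a scalar function f."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# Example: f(x) = x1^2 + 3*x1*x2, whose gradient is [2*x1 + 3*x2, 3*x1].
f = lambda x: x[0] ** 2 + 3 * x[0] * x[1]
x0 = np.array([1.0, 2.0])
print(numerical_gradient(f, x0))                      # ~ [8., 3.]
print(np.array([2 * x0[0] + 3 * x0[1], 3 * x0[0]]))   # analytic gradient
```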
1.1.1 (a)
Consider the function \[f(x, y) = \sigma(ax + by),\] where \(a, b \in \mathbb{R}\) and \(\sigma(t) = \dfrac{1}{1 + e^{-t}}\) for \(t\in \mathbb{R}.\)
1.1.1.1 (i)
Show that \(\dfrac{d\sigma}{dt} = \sigma(t)\left(1-\sigma(t)\right)\).
Answer
\[ \begin{align} \frac{d}{dt}\sigma(t) &= \frac{d}{dt}\left(\frac{1}{1+e^{-t}}\right) \\ &= \frac{e^{-t}}{(1+e^{-t})^2} \\ &= \left(\frac{1}{1+e^{-t}}\right) \left(\frac{1 + e^{-t}}{1 + e^{-t}} - \frac{1}{1 + e^{-t}}\right) \\ &= \boxed{\sigma(t)\left(1-\sigma(t)\right)} \end{align} \]
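A quick numerical check of this identity, assuming NumPy:

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-4, 4, 9)
eps = 1e-6
finite_diff = (sigmoid(t + eps) - sigmoid(t - eps)) / (2 * eps)
identity = sigmoid(t) * (1 - sigmoid(t))
print(np.allclose(finite_diff, identity))  # True (up to floating-point error)
```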
1.1.1.2 (ii)
Using the result you showed in part (i) and the chain rule, compute \(\dfrac{\partial f}{\partial x}\) and \(\dfrac{\partial f}{\partial y}\).
Answer
First let’s define a new variable \(u = ax + by\) such that \(f = \sigma(u)\). By chain rule, we have \[\frac{\partial f}{\partial x} = \frac{\partial \sigma}{\partial u} \frac{\partial u}{\partial x}.\] We can calculate \(\dfrac{\partial u}{\partial x} = a\), giving us, in conjunction with the result from part (i), \[\frac{\partial f}{\partial x} = \sigma(u)\left(1-\sigma(u)\right) \cdot a.\] Substituting \(ax + by\) back in for \(u\) gives us \[\frac{\partial f}{\partial x} = \boxed{a\sigma(ax + by)\left[1-\sigma(ax + by)\right].}\]
Following the same process for \(\dfrac{\partial f}{\partial y}\), we get
\[\frac{\partial f}{\partial y} = \boxed{b\sigma(ax + by)\left[1-\sigma(ax + by)\right].}\]
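A quick finite-difference check of both partials, assuming NumPy (the constants `a`, `b` and the test point are arbitrary):

```python
import numpy as np

sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

a, b = 2.0, -1.5
x, y = 0.3, 0.7
eps = 1e-6

f = lambda x, y: sigmoid(a * x + b * y)

# Finite-difference partials vs. the closed-form answers above.
df_dx_num = (f(x + eps, y) - f(x - eps, y)) / (2 * eps)
df_dy_num = (f(x, y + eps) - f(x, y - eps)) / (2 * eps)
u = a * x + b * y
df_dx = a * sigmoid(u) * (1 - sigmoid(u))
df_dy = b * sigmoid(u) * (1 - sigmoid(u))
print(np.allclose([df_dx_num, df_dy_num], [df_dx, df_dy]))  # True
```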
1.1.2 (b)
For \(\mathbf{x}=\begin{bmatrix}x_1 \\ \vdots \\ x_n\end{bmatrix}\in\mathbb{R}^n\), define \[ r(\mathbf{x})=\sum_{j=1}^n x_j^2. \] Compute the partial derivative \(\dfrac{\partial r}{\partial x_i}\) for a generic coordinate \(i \in \{1, ..., n\}\).
Answer
Since \(r(\mathbf{x})=\sum_{j=1}^n x_j^2\), we have \[ \frac{\partial r}{\partial x_i}=\boxed{2x_i}. \]
1.1.3 (c)
Let \(\mathbf{w}\in\mathbb{R}^n\) be a constant vector and define the scalar function \[ s(\mathbf{x})=\mathbf{w}^\top \mathbf{x}=\sum_{j=1}^n w_j x_j. \] Compute \(\dfrac{\partial s}{\partial x_i}\) for a generic coordinate \(i \in \{1, ..., n\}\).
Answer
With \(s(\mathbf{x})=\sum_{j=1}^n w_j x_j\) and \(\mathbf{w}\) treated as constant, only the \(j=i\) term depends on \(x_i\), so \[ \frac{\partial s}{\partial x_i}=\boxed{w_i}. \]
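A quick numerical check of the gradients from parts (b) and (c), assuming NumPy (the vectors are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
x = rng.normal(size=n)
w = rng.normal(size=n)
eps = 1e-6

r = lambda v: np.sum(v ** 2)   # part (b): gradient should be 2x
s = lambda v: w @ v            # part (c): gradient should be w

I = np.eye(n)
for func, analytic in [(r, 2 * x), (s, w)]:
    numeric = np.array([(func(x + eps * I[i]) - func(x - eps * I[i])) / (2 * eps)
                        for i in range(n)])
    print(np.allclose(numeric, analytic))  # True for both gradients
```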
1.2 Linear Algebra
Linear algebra is the language of machine learning. It allows us to represent data, transformations, and models efficiently.
Vectors & Matrices
- Vector: \(\mathbf{x} = [x_1, x_2, \dots, x_n]^T\); represents data points, features, or parameters.
- Matrix: \(\mathbf{A} \in \mathbb{R}^{m \times n}\); represents datasets, transformations, or weight layers.
- Norms: Measure vector size or distance: \[ ||\mathbf{x}||_2 = \sqrt{\sum_i x_i^2}, \quad ||\mathbf{x}||_1 = \sum_i |x_i| \] Common in regularization and optimization.
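A small NumPy sketch of these norms:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0])
print(np.linalg.norm(x, 2))    # 5.0 (Euclidean / L2 norm)
print(np.linalg.norm(x, 1))    # 7.0 (L1 norm)
print(np.sqrt(np.sum(x ** 2)), np.sum(np.abs(x)))  # same values, computed by hand
```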
Matrix Operations
- Dot Product: \(\mathbf{u} \cdot \mathbf{v} = \sum_i u_i v_i\); measures similarity between vectors.
- Matrix Multiplication: \((\mathbf{AB})_{ij} = \sum_k A_{ik} B_{kj}\); represents composed linear transformations.
- Transpose: \((\mathbf{A}^T)_{ij} = A_{ji}\).
- Trace: \(\text{tr}(\mathbf{A}) = \sum_i A_{ii}\); linear operator, appears in loss and covariance computations.
- Determinant: \(\det(\mathbf{A})\); indicates invertibility and volume scaling.
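A small NumPy sketch of these operations on toy matrices:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[0.0, 1.0], [1.0, 0.0]])
u = np.array([1.0, 2.0])
v = np.array([3.0, 4.0])

print(u @ v)             # dot product: 11.0
print(A @ B)             # matrix product
print(A.T)               # transpose
print(np.trace(A))       # 5.0
print(np.linalg.det(A))  # -2.0 (nonzero, so A is invertible)
```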
Eigenvalues & Eigenvectors
- Solve \(\mathbf{A}\mathbf{v} = \lambda \mathbf{v}\); eigenvectors define directions, eigenvalues define scaling.
- Used in PCA, spectral clustering, and stability analysis.
- Symmetric matrices always have real eigenvalues and orthogonal eigenvectors.
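A small sketch, assuming NumPy, that verifies these facts on a toy symmetric matrix:

```python
import numpy as np

S = np.array([[2.0, 1.0], [1.0, 2.0]])        # symmetric
vals, vecs = np.linalg.eigh(S)                # eigh is for symmetric/Hermitian matrices
print(vals)                                   # [1. 3.]: real eigenvalues
print(np.allclose(S @ vecs[:, 0], vals[0] * vecs[:, 0]))   # A v = lambda v
print(np.allclose(vecs.T @ vecs, np.eye(2)))               # eigenvectors are orthonormal
```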
Singular Value Decomposition (SVD)
- \(\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T\); factorizes a matrix into orthonormal bases and singular values.
- \(\Sigma\) contains singular values \(\sigma_1 \ge \dots \ge \sigma_r > 0\).
- Used for dimensionality reduction, low-rank approximation, and solving least squares problems.
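A small SVD sketch, assuming NumPy, including a rank-1 (truncated-SVD) approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(s)                                      # singular values, in descending order
print(np.allclose(A, U @ np.diag(s) @ Vt))    # exact reconstruction

# Best rank-1 approximation (truncated SVD).
A1 = s[0] * np.outer(U[:, 0], Vt[0, :])
print(np.linalg.norm(A - A1, 2), s[1])        # spectral-norm error equals sigma_2
```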
Matrix Properties
- Symmetric: \(\mathbf{A} = \mathbf{A}^T\); real eigenvalues, orthogonal eigenvectors.
- Positive Definite: \(\mathbf{x}^T \mathbf{A} \mathbf{x} > 0\) for all \(\mathbf{x} \neq 0\); arises in quadratic forms and convex optimization.
- Orthogonal: \(\mathbf{Q}^T \mathbf{Q} = \mathbf{Q} \mathbf{Q}^T = I\); preserves lengths and angles, common in rotations and SVD.
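A few quick checks of these properties, assuming NumPy (the matrices are arbitrary examples):

```python
import numpy as np

A = np.array([[2.0, -1.0], [-1.0, 2.0]])
print(np.allclose(A, A.T))                  # symmetric
print(np.all(np.linalg.eigvalsh(A) > 0))    # all eigenvalues positive => positive definite

Q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(3, 3)))
print(np.allclose(Q.T @ Q, np.eye(3)))      # orthogonal: Q^T Q = I
```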
Special Matrices in ML
- Covariance Matrix: \(\Sigma = \frac{1}{n} X^\top X\) for centered \(X\) (each feature has its mean subtracted); symmetric, PSD, describes data spread (see the sketch after this list).
- Identity Matrix: \(I_n\); neutral element for multiplication.
- Diagonal Matrix: non-zero entries only on diagonal; simplifies eigenvalue computations.
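A small sketch of the covariance computation, assuming NumPy; note the centering step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))           # 100 samples, 3 features

Xc = X - X.mean(axis=0)                 # center each feature
cov = Xc.T @ Xc / X.shape[0]
print(np.allclose(cov, np.cov(X, rowvar=False, bias=True)))  # matches NumPy's estimator
print(np.all(np.linalg.eigvalsh(cov) >= -1e-12))             # PSD (up to rounding)
```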
Linear Transformations
- Matrix multiplication represents linear transformations: \(y = A x\).
- Properties like rank, null space, and column space describe how transformations act on input space.
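A tiny NumPy example of a linear transformation and its rank:

```python
import numpy as np

A = np.array([[1.0, 2.0], [2.0, 4.0], [0.0, 1.0]])   # maps R^2 -> R^3
x = np.array([1.0, -1.0])
print(A @ x)                      # the linear transformation applied to x
print(np.linalg.matrix_rank(A))   # 2: the columns are linearly independent
```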
1.2.1 (a)
Prove that \(\mathbf{A}^\top \mathbf{A}\) is symmetric for any \(\mathbf{A}\in\mathbb{R}^{m \times n}\).
Answer
A matrix \(\mathbf{S}\) is symmetric if \(\mathbf{S} = \mathbf{S}^\top\). Using \((\mathbf{XY})^\top = \mathbf{Y}^\top\mathbf{X}^\top\) and \((\mathbf{X}^\top)^\top = \mathbf{X}\), \[ \boxed{(\mathbf{A}^\top\mathbf{A})^\top = \mathbf{A}^\top (\mathbf{A}^\top)^\top = \mathbf{A}^\top\mathbf{A}}. \]
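A quick numerical illustration of this fact, assuming NumPy:

```python
import numpy as np

A = np.random.default_rng(0).normal(size=(4, 3))
S = A.T @ A
print(np.allclose(S, S.T))  # True: A^T A is symmetric
```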
1.2.2 (b)
Consider the matrix \[ \mathbf{B} = \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 2 & 1 \end{bmatrix} \]
Find the singular values of \(\mathbf{B}\).
Answer
The singular values are the square roots of the eigenvalues of \(\mathbf{B}^\top \mathbf{B}\). First compute: \[ \mathbf{B}^\top \mathbf{B} = \begin{bmatrix} 1 & 0 & 2 \\ 0 & 2 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 2 \\ 2 & 1 \end{bmatrix} = \begin{bmatrix} 5 & 2 \\ 2 & 5 \end{bmatrix}. \]
The characteristic polynomial is: \[ \det\!\left( \begin{bmatrix}5 & 2\\2 & 5\end{bmatrix} - \lambda I \right) = (5-\lambda)^2 - 4 = \lambda^2 - 10\lambda + 21. \] This factors as \((\lambda - 7)(\lambda - 3)\). The eigenvalues are the roots of this expression, so they are \(\lambda_1 = 7\) and \(\lambda_2 = 3\). Therefore, the singular values are: \[ \boxed{\sigma_1 = \sqrt{7}, \quad \sigma_2 = \sqrt{3}.} \]
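A quick check with NumPy:

```python
import numpy as np

B = np.array([[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]])
print(np.linalg.svd(B, compute_uv=False))           # [sqrt(7), sqrt(3)] ~ [2.6458, 1.7321]
print(np.sqrt(np.linalg.eigvalsh(B.T @ B))[::-1])   # same values via eigenvalues of B^T B
```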
1.2.3 (c) (Bonus!)
Recall the spectral theorem: any symmetric matrix \(S \in \mathbb{R}^{n \times n}\) can be written as \[ S = Q \Lambda Q^\top, \] with \(Q\) orthonormal and \(\Lambda\) diagonal of eigenvalues.
Let \(A \in \mathbb{R}^{m \times n}\) have an SVD \[ A = U \Sigma V^\top, \] where \(U \in \mathbb{R}^{m \times m}\) and \(V \in \mathbb{R}^{n \times n}\) are orthonormal, and \(\Sigma \in \mathbb{R}^{m \times n}\) is rectangular diagonal with singular values \(\sigma_1 \ge \cdots \ge \sigma_r > 0\) on the main diagonal (and zeros otherwise). Use this to derive the spectral decomposition of \(A^\top A\); explicitly identify \(Q\) and \(\Lambda\) in terms of the SVD factors.
Answer
Compute \[ A^\top A = (U \Sigma V^\top)^\top (U \Sigma V^\top) = V \Sigma^\top U^\top U \Sigma V^\top = V (\Sigma^\top \Sigma) V^\top, \] since \(U^\top U = I_m\). The matrix \(\Sigma^\top \Sigma \in \mathbb{R}^{n \times n}\) is diagonal with entries \[ \Sigma^\top \Sigma = \operatorname{diag}(\sigma_1^2, \dots, \sigma_r^2, 0, \dots, 0). \] Thus \[ A^\top A = V \underbrace{\operatorname{diag}(\sigma_1^2, \dots, \sigma_r^2, 0, \dots, 0)}_{\Lambda} V^\top. \] This is precisely the spectral decomposition of the symmetric matrix \(A^\top A\), with \[ Q = V \quad\text{(columns are right singular vectors of $A$)}, \qquad \Lambda = \operatorname{diag}(\sigma_1^2, \dots, \sigma_r^2, 0, \dots, 0). \] In particular, the eigenvalues of \(A^\top A\) are the squared singular values of \(A\), and its orthonormal eigenvectors are the right singular vectors (the columns of \(V\)).
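A quick numerical check of this relationship, assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))

U, s, Vt = np.linalg.svd(A)            # full SVD: U is 5x5, Vt is 3x3
evals, Q = np.linalg.eigh(A.T @ A)     # eigendecomposition of A^T A

print(np.allclose(np.sort(evals)[::-1], s ** 2))           # eigenvalues = squared singular values
print(np.allclose(A.T @ A, Vt.T @ np.diag(s ** 2) @ Vt))   # A^T A = V diag(sigma^2) V^T
```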
1.3 Probability Review
The exercises in this section use the following setup. An incoming email is spam with prior \(p(S)=0.2\) and not spam with \(p(\bar S)=0.8\). Two filters flag spam: \[ p(F_1{=}1\mid S)=0.9,\quad p(F_1{=}1\mid \bar S)=0.1,\qquad p(F_2{=}1\mid S)=0.8,\quad p(F_2{=}1\mid \bar S)=0.05, \] and \(F_1,F_2\) are conditionally independent given the class (\(S\) or \(\bar S\)).
Probability provides the framework for reasoning under uncertainty, which is essential for modeling and inference in ML.
Random Variables & Distributions
- Random Variable: \(X\) that takes values according to a probability distribution.
- Probability Distribution: Describes how probabilities are assigned (e.g., Bernoulli, Binomial, Gaussian).
- Expectation / Mean:
\[ \mathbb{E}[X] = \sum_x x \, P(X=x) \quad \text{or} \quad \int x \, p(x) dx \]
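A small sketch computing an expectation analytically and by simulation, assuming NumPy (a fair six-sided die is used as the example):

```python
import numpy as np

values = np.arange(1, 7)
probs = np.full(6, 1 / 6)
print(values @ probs)                                      # E[X] = 3.5
rng = np.random.default_rng(0)
print(rng.choice(values, size=100_000, p=probs).mean())    # ~3.5 by Monte Carlo
```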
Conditional Probability & Independence
- Conditional Probability:
\[ P(A|B) = \frac{P(A \cap B)}{P(B)} \]
- Marginalization:
\[ P(A) = \sum_B P(A,B) \]
- Independence: \(A \perp B \iff P(A,B) = P(A)P(B)\).
- Conditional Independence: \(A \perp B \mid C \iff P(A,B|C) = P(A|C)P(B|C)\).
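A small sketch, assuming NumPy, of marginalization and conditioning on a toy joint table (the numbers are made up for illustration):

```python
import numpy as np

# Joint distribution P(A, B) over two binary variables: rows index A, columns index B.
joint = np.array([[0.10, 0.30],
                  [0.20, 0.40]])
P_A = joint.sum(axis=1)                # marginalization over B
P_B = joint.sum(axis=0)                # marginalization over A
P_A_given_B1 = joint[:, 1] / P_B[1]    # conditional P(A | B=1)
print(P_A, P_B, P_A_given_B1)
print(np.allclose(joint, np.outer(P_A, P_B)))  # False: A and B are not independent here
```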
Bayes’ Theorem
- Updates belief given evidence:
\[ P(A|B) = \frac{P(B|A)P(A)}{P(B)} \]
Likelihood, Prior, Posterior
- Prior: \(P(\theta)\), belief about parameters before seeing data.
- Likelihood: \(P(D|\theta)\), probability of data given parameters.
- Posterior: \(P(\theta|D)\), updated belief after observing data.
1.3.1 (a)
What is the probability that both filters flag an email as spam?
Answer
We want to find \(p(F_1=1, F_2=1)\). Conditional independence tells us \[p(F_1=1, F_2=1\mid c) = p(F_1=1\mid c)\,p(F_2=1\mid c).\]
We can then use the definition of conditional probability to find \[p(F_1=1, F_2=1, c) = p(F_1=1, F_2=1\mid c)\,p(c).\]
Finally, we can marginalize over \(c\) using the sum rule to get \[p(F_1=1, F_2=1)=\sum_{c\in\{S,\bar S\}}p(F_1=1, F_2=1, c).\]
Putting this all together gives us
\[ \begin{aligned} p(F_1=1, F_2=1) &= \sum_{c \in \{S, \bar S\}} p(c) p(F_1=1 \mid c) p(F_2=1 \mid c) \\ &= 0.2 \cdot (0.9 \cdot 0.8) + 0.8 \cdot (0.1 \cdot 0.05) \\ &= 0.144 + 0.004 \\ &= \boxed{0.148} \end{aligned} \]
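The same computation as a short Python sketch (variable names are illustrative):

```python
# A quick check of part (a), using the probabilities given in the setup.
p_S, p_notS = 0.2, 0.8
p_f1 = {"S": 0.9, "notS": 0.1}
p_f2 = {"S": 0.8, "notS": 0.05}

p_both = p_S * p_f1["S"] * p_f2["S"] + p_notS * p_f1["notS"] * p_f2["notS"]
print(p_both)  # 0.148
```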
1.3.2 (b)
Given that both filters flag an email, what is the probability of the email being spam? (You can leave your answer as an unsimplified fraction.)
Answer
We wish to calculate \(p(S\mid F_1 =1, F_2=1)\). We can utilize Bayes’ Theorem, which gives us
\[ p(S\mid F_1 =1, F_2=1) = \frac{p(F_1 =1, F_2=1\mid S)\,p(S)}{p(F_1 =1, F_2=1)}. \]
The denominator comes from part (a), \(p(S)\) is given to us as \(0.2\), and \(p(F_1 =1, F_2=1\mid S) = p(F_1=1\mid S)\,p(F_2 = 1\mid S) = 0.9(0.8) = 0.72\) by conditional independence. Therefore, we have
\[ p(S\mid F_1 =1, F_2=1) = \frac{0.72(0.2)}{0.148} = \boxed{\frac{0.144}{0.148} \approx 0.973}. \]
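And the posterior as a short Python sketch, reusing the value from part (a):

```python
# Posterior from part (b) via Bayes' theorem.
p_S = 0.2
likelihood_S = 0.9 * 0.8   # p(F1=1, F2=1 | S) by conditional independence
p_both = 0.148             # from part (a)

posterior = likelihood_S * p_S / p_both
print(posterior)           # ~0.973
```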