4 CS 189 Discussion 04: Ridge Regression (From Spring 2026)

4.0.1 Contact Information

Name	Wesley Zheng
Pronouns	He/him/his
Email	wzheng0302@berkeley.edu
Discussion	Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3-4 PM @ Hildebrand Hall B51
Office Hours	Tuesdays, 11–1 PM @ Cory Courtyard

4.1 Gradients and Hessians

Let \(f:\mathbb{R}^n\to \mathbb{R}\). The of \(f\) is defined as \[\nabla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}(x) \\\vdots\\ \frac{\partial f}{\partial x_n}(x) \end{bmatrix}.\] Similarly, the of \(f\) is defined as \[\nabla^2 f(x) = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} (x)&\cdots&\frac{\partial^2 f}{\partial x_1 \partial x_n} (x) \\ \vdots&\ddots&\vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} (x) &\cdots& \frac{\partial^2 f}{\partial x_n^2} (x) \end{bmatrix}.\] Find the gradient and Hessian with respect to \(x\in\mathbb{R}^n\) for the following functions.

Vector Calculus Essentials

Vector calculus underlies optimization and backpropagation. Core objects: gradients (first order), Jacobians (multi-output first order), Hessians (second order), and convexity (global guarantees).

Gradients

For \[ f:\mathbb{R}^d \to \mathbb{R}, \] \[ \nabla f(x)= \begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \vdots\\ \frac{\partial f}{\partial x_d} \end{bmatrix}. \]

Vector of partial derivatives (often viewed as transposed partials).
Same structure as input vector.
Direction of steepest increase; \(-\nabla f\) gives descent.
\(\nabla f(x)=0\) → critical point.

First-order approximation: \[ f(x+\Delta x)\approx f(x)+\nabla f(x)^T\Delta x. \]

Jacobians

For \[ f:\mathbb{R}^n\to\mathbb{R}^m, \] \[ J_f(x)= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \cdots & \frac{\partial f_1}{\partial x_n}\\ \vdots & \ddots & \vdots\\ \frac{\partial f_m}{\partial x_1} & \cdots & \frac{\partial f_m}{\partial x_n} \end{bmatrix}. \]

Matrix form of first derivatives.
Shape: outputs × inputs.
Local linearization: \[ f(x+\Delta x)\approx f(x)+J_f(x)\Delta x. \]
Central to backpropagation (via chain rule).

Chain Rule

If \(z=g(x)\) and \(y=f(z)\):

Scalar: \[ \frac{dy}{dx}=\frac{dy}{dz}\frac{dz}{dx}. \]

Vector: \[ \nabla_x f(g(x)) = J_g(x)^T \nabla_z f(z). \]

Foundation of backpropagation.

Hessians

For \[ f:\mathbb{R}^d\to\mathbb{R}, \] \[ H_f(x)=\left[\frac{\partial^2 f}{\partial x_i \partial x_j}\right]. \]

Matrix of second-order partial derivatives.
Describes curvature.
Symmetric (if derivatives continuous).
Determines critical point type:
- \(H \succ 0\) → local minimum
- \(H \prec 0\) → local maximum
- Indefinite → saddle

Second-order approximation: \[ f(x+\Delta x)\approx f(x)+\nabla f(x)^T\Delta x +\tfrac12 \Delta x^T H_f(x)\Delta x. \]

Convexity

Definition: \[ f(\theta x+(1-\theta)y) \le \theta f(x)+(1-\theta)f(y). \]

First-order condition: \[ f(y)\ge f(x)+\nabla f(x)^T(y-x). \]

Second-order condition: \[ H_f(x)\succeq 0 \;\; \forall x. \]

If convex → every local minimum is global.

Common Matrix Calculus Formulas

Quadratic form: \[ f(x)=x^T A x \] \[ \nabla_x f= \begin{cases} 2Ax & A=A^T\\ (A+A^T)x & \text{general} \end{cases} \]

Linear: \[ f(x)=a^T x \quad\Rightarrow\quad \nabla_x f=a. \]

Squared norm: \[ \|x\|^2=x^T x \quad\Rightarrow\quad \nabla_x=2x. \]

Least squares: \[ f(w)=\|Xw-y\|^2 \quad\Rightarrow\quad \nabla_w=2X^T(Xw-y). \]

Trace rule: \[ \frac{\partial}{\partial X}\mathrm{tr}(AX)=A^T. \]

Matrix-vector product: \[ y=Wx \] \[ \frac{\partial y}{\partial x}=W, \qquad \frac{\partial y}{\partial W}=x^T. \]

4.1.1 (a)

\(f(x) = w^\top x\) for fixed \(w\in\mathbb{R}^n\).

Answer

Expand the definition as \(f(x) = \sum_{k=1}^n w_kx_k\), so \(\frac{\partial f}{\partial x_j} = w_j\) and \(\frac{\partial^2 f}{\partial x_i \partial x_j} = 0\), thus \[\nabla f(x) = w \quad\text{and}\quad \nabla^2 f(x) = 0_{n\times n}.\]

Wesley’s Notes

Differentiate the sum: \(f(x) = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n\). Take the derivative with respect to \(x_j\): \(w_j x_j\) differentiates to \(w_j\); all other terms are constants with respect to \(x_j\). Thus \(\frac{\partial f}{\partial x_j} = w_j\).

So the gradient becomes \(\nabla f(x) = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_n \end{bmatrix} = w\).

The Hessian contains all second derivatives: \(\nabla^2 f(x) = \bigl[\frac{\partial^2 f}{\partial x_i \partial x_j}\bigr]\). But we already found \(\frac{\partial f}{\partial x_j} = w_j\). Since \(w_j\) is a constant, differentiating again gives \(\frac{\partial^2 f}{\partial x_i \partial x_j} = 0\) for all \(i, j\).

Since every second derivative is zero, \(\nabla^2 f(x) = 0_{n\times n}\), which is the \(n\times n\) zero matrix.

4.1.2 (b)

\(f(x) = x^\top x\).

Answer

Expand the definition as \(f(x) = \sum_{k=1}^n x_k^2\), so \(\frac{\partial f}{\partial x_j} = 2x_j\) and \(\frac{\partial^2 f}{\partial x_i\partial x_j} = \begin{cases} 2, &i=j,\\ 0, &i\neq j, \end{cases}\) thus \[\nabla f(x) = 2x \quad\text{and}\quad \nabla^2 f(x) = 2I_{n\times n}.\]

Wesley’s Notes

Since \(f(x) = x_1^2 + x_2^2 + \cdots + x_n^2\), take the derivative with respect to \(x_j\). Only the term \(x_j^2\) depends on \(x_j\): \(\frac{\partial}{\partial x_j}(x_j^2) = 2x_j\). All other terms are constants with respect to \(x_j\). So \(\frac{\partial f}{\partial x_j} = 2x_j\).

Thus the gradient vector becomes \(\nabla f(x) = \begin{bmatrix} 2x_1 \\ 2x_2 \\ \vdots \\ 2x_n \end{bmatrix}\), which is simply \(\nabla f(x) = 2x\).

The Hessian contains all second derivatives \(\frac{\partial^2 f}{\partial x_i \partial x_j}\). We already know \(\frac{\partial f}{\partial x_j} = 2x_j\). Now differentiate again.

Case 1: \(i = j\): \(\frac{\partial}{\partial x_j}(2x_j) = 2\), so \(\frac{\partial^2 f}{\partial x_j^2} = 2\).

Case 2: \(i \neq j\): \(2x_j\) does not depend on \(x_i\), so \(\frac{\partial^2 f}{\partial x_i \partial x_j} = 0\).

The Hessian has 2 on the diagonal and 0 off the diagonal. This is just 2 times the identity matrix: \(\nabla^2 f(x) = 2I_{n\times n}\).

4.1.3 (c)

\(f(x) = x^\top Ax\) for fixed \(A\in\mathbb{R}^{n\times n}\).

Answer

Expand the definition as \[\begin{align*} f(x) &= \sum_{k,\ell\in[n]} A_{k\ell} x_k x_\ell \\ &= \sum_{k\in [n]} A_{kk}x_k^2 + \sum_{k,\ell\in [n]; k\neq \ell} A_{k\ell}x_kx_\ell, \end{align*}\] where we let \([n] = \{1,\dots,n\}\).

Wesley’s Notes

The function is \(f(x) = x^\top A x\). Write it using coordinates. First compute \((Ax)_k = \sum_{\ell=1}^n A_{k\ell} x_\ell\). Then \(x^\top A x = \sum_{k=1}^n x_k (Ax)_k\). Substitute: \(f(x) = \sum_{k=1}^n x_k \bigl(\sum_{\ell=1}^n A_{k\ell} x_\ell\bigr)\), so \(f(x) = \sum_{k,\ell\in[n]} A_{k\ell} x_k x_\ell\). This is what the solution writes.

We splits the sum: \(f(x) = \sum_k A_{kk} x_k^2 + \sum_{k\neq \ell} A_{k\ell} x_k x_\ell\). Why? Because diagonal terms behave differently when differentiating.

Thus, \[\begin{align*} \frac{\partial f}{\partial x_j} &= 2A_{jj}x_j + \sum_{k\in [n]; k\neq j} (A_{jk} + A_{kj})x_k \\ &= \sum_{k\in [n]} A_{jk}x_k + \sum_{k\in [n]}A_{kj}x_k \\ &= (Ax)_j + (A^\top x)_j \end{align*}\]

Wesley’s Notes

Look at which terms contain \(x_j\).

First type: \(x_j^2\) from the diagonal term \(A_{jj} x_j^2\). Derivative: \(\frac{\partial}{\partial x_j}(A_{jj} x_j^2) = 2A_{jj} x_j\).

Second type: \(x_j x_k\) from terms like \(A_{jk} x_j x_k\). Derivative: \(\frac{\partial}{\partial x_j}(A_{jk} x_j x_k) = A_{jk} x_k\).

Third type: \(x_k x_j\) from terms like \(A_{kj} x_k x_j\). Derivative: \(\frac{\partial}{\partial x_j}(A_{kj} x_k x_j) = A_{kj} x_k\).

Putting these together gives \(\frac{\partial f}{\partial x_j} = 2A_{jj} x_j + \sum_{k\neq j} (A_{jk} + A_{kj}) x_k\).

Notice the expression can be regrouped: \(\sum_k A_{jk} x_k + \sum_k A_{kj} x_k\). The first sum is the \(j\)-th entry of \(Ax\): \((Ax)_j = \sum_k A_{jk} x_k\). The second sum is the \(j\)-th entry of \(A^\top x\): \((A^\top x)_j = \sum_k A_{kj} x_k\). So \(\frac{\partial f}{\partial x_j} = (Ax)_j + (A^\top x)_j\).

Stacking these components gives \(\nabla f(x) = Ax + A^\top x\). Factor: \(\nabla f(x) = (A + A^\top)x\).

and \[\begin{align*} \frac{\partial^2 f}{\partial x_i\partial x_j} &= A_{ij} + A_{ji} \\ &= (A + A^\top)_{ij}, \end{align*}\]

Wesley’s Notes

We already have \(\nabla f(x) = (A + A^\top)x\). This is a linear function of \(x\). The derivative of a linear map \(Bx\) is just the matrix \(B\). Thus \(\nabla^2 f(x) = A + A^\top\).

thus \[\nabla f(x) = (A+A^\top)x\quad\text{and}\quad\nabla^2f(x) = A + A^\top.\]

4.1.4 (d)

\(f(x) = \|Wx - b\|_2^2\) for fixed \(W\in \mathbb{R}^{m\times n}\) and \(b\in\mathbb{R}^m\).

Answer

Expand the definition as \[\begin{align*} f(x) &= (Wx-b)^\top (Wx-b) \\ &= x^\top (W^\top W)x - (b^\top W)x - x^\top (W^\top b) + b^\top b \\ &= x^\top (W^\top W)x - 2(b^\top W)x + b^\top b. \end{align*}\]

Wesley’s Notes

\(f(x) = \|Wx - b\|_2^2\). This means \(f(x) = (Wx - b)^\top (Wx - b)\) because the squared \(\ell_2\) norm satisfies \(\|v\|_2^2 = v^\top v\).

Expand \((Wx - b)^\top (Wx - b)\). First note \((Wx - b)^\top = x^\top W^\top - b^\top\). So \(f(x) = (x^\top W^\top - b^\top)(Wx - b)\). Multiply out like a quadratic: \(f(x) = x^\top W^\top W x - b^\top W x - x^\top W^\top b + b^\top b\). This matches the first expansion in the solution.

Notice \(b^\top W x\) is a scalar. For scalars, \(b^\top W x = x^\top W^\top b\). So the two middle terms are equal, giving \(f(x) = x^\top (W^\top W)x - 2(b^\top W)x + b^\top b\).

By linearity, using Parts (a) and (c), \[\begin{align*} \nabla f(x) &= (W^\top W + (W^\top W)^\top)x - 2(b^\top W)^\top \\ &= 2W^\top W x - 2W^\top b \\ &= 2W^\top (Wx - b) \end{align*}\]

Wesley’s Notes

Differentiate term-by-term.

First term: From part (c), \(\nabla(x^\top A x) = (A + A^\top)x\). Here \(A = W^\top W\), so \(\nabla(x^\top W^\top W x) = (W^\top W + (W^\top W)^\top)x\). But \(W^\top W\) is symmetric: \((W^\top W)^\top = W^\top W\), so \(= 2W^\top W x\).

Second term: Differentiate \(-2(b^\top W)x\). This is linear in \(x\); the gradient of \(a^\top x\) is \(a\). So \(\nabla[-2(b^\top W)x] = -2(b^\top W)^\top\). But \((b^\top W)^\top = W^\top b\), thus \(-2W^\top b\).

Third term: \(b^\top b\) is constant, so derivative \(= 0\).

and \[\begin{align*} \nabla^2 f(x) &= W^\top W + (W^\top W)^\top \\ &= 2W^\top W. \end{align*}\]

Wesley’s Notes

The gradient is \(\nabla f(x) = 2W^\top W x - 2W^\top b\). Derivative again: \(2W^\top W x \to 2W^\top W\); \(-2W^\top b \to 0\). So \(\nabla^2 f(x) = 2W^\top W\).

4.2 Mixture of Gaussians

Let \(x_1,\dots,x_n\in\mathbb{R}\) be i.i.d. samples. Consider a mixture of \(K\) univariate Gaussians \[p(x_i \mid \theta) = \sum_{k=1}^K \pi_k\cdot\mathcal N(x_i\mid \mu_k, \sigma^2_k),\] where \(\mu_k\in\mathbb{R}\), \(\sigma^2_k > 0\), \(\pi_k > 0\) with \(\sum_{k=1}^K \pi_k = 1\), and \(\theta = \{\pi_k, \mu_k, \sigma^2_k\}_{k=1}^K\). Suppose we are given a latent cluster assignment for each point \(x_i\), i.e., define indicator variables \(z_{ik}\) such that \[z_{ik} = \begin{cases} 1, &x_i \text{ belongs to component } k,\\ 0, &\text{otherwise}, \end{cases}\] with \(\sum_{k=1}^K z_{ik} = 1\) for all \(i\). Find the maximum likelihood estimates of \(\pi_k,\mu_k, \sigma^2_k\) for all \(k\). Recall that the PDF of a univariate Gaussian with mean \(\mu\) and variance \(\sigma^2\) is \[p(x\mid \mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).\]

Answer

The log-likelihood is \[\begin{align*} \ell(\theta) &= \log p(x, z\mid \theta) \\ &= \log\left(\prod_{i=1}^n \prod_{k=1}^K \left(\pi_k\cdot \mathcal N(x_i \mid \mu_k, \sigma^2_k)\right)^{z_{ik}}\right) \\ &= \sum_{i=1}^n \sum_{k=1}^K z_{ik} \left(\log \pi_k + \log \mathcal N(x_i \mid \mu_k, \sigma^2_k)\right) \\ &= \sum_{i=1}^n \sum_{k=1}^K z_{ik} \left(\log \pi_k + \log \left(\frac{1}{\sqrt{2\pi\sigma_k^2} }\exp\left(-\frac{(x_i -\mu_k)^2}{2\sigma_k^2}\right)\right)\right) \\ &= \sum_{i=1}^n \sum_{k=1}^K z_{ik} \left(\log \pi_k -\frac12\log(2\pi) -\frac12\log\sigma_k^2 - \frac{(x_i-\mu_k)^2}{2\sigma^2_k}\right). \end{align*}\]

Wesley’s Notes

We are given cluster assignments \(z_{ik} \in \{0,1\}\) with \(\sum_{k=1}^K z_{ik} = 1\), so each point belongs to exactly one component.

For each \(x_i\), only one \(z_{ik}\) equals 1. Thus the joint likelihood is \(p(x, z \mid \theta) = \prod_{i=1}^n \prod_{k=1}^K \bigl(\pi_k \mathcal N(x_i \mid \mu_k, \sigma_k^2)\bigr)^{z_{ik}}\). The exponent \(z_{ik}\) selects the correct component.

Differentiating \(\ell\) with respect to \(\mu_k\) gives \[\begin{align*} \frac{\partial\ell}{\partial\mu_k} &=\sum_{i=1}^n z_{ik}\left(\frac{x_i - \mu_k}{\sigma_k^2}\right). \end{align*}\] Setting this equal to zero gives \[\begin{align*} \widehat{\mu_k}&= \frac{\sum_{i=1}^n z_{ik} x_i}{\sum_{i=1}^n z_{ik}}. \end{align*}\]

Next, differentiating \(\ell\) with respect to \(\sigma^2_k\) gives \[\begin{align*} \frac{\partial\ell}{\partial\sigma_k^2} &= \sum_{i=1}^n z_{ik}\left(- \frac{1}{2\sigma^2_k} + \frac{(x_i - \mu_k)^2}{2(\sigma_k^2)^2}\right). \end{align*}\] Setting this equal to zero and utilizing \(\widehat{\mu_k}\) in place of \(\mu_k\) gives \[\begin{align*} \widehat{\sigma_k^2} = \frac{\sum_{i=1}^n z_{ik} (x_i - \widehat{\mu_k})^2}{\sum_{i=1}^n z_{ik}}. \end{align*}\]

Finally, note that we may isolate the contributions of all \(\pi_k\) to \(\ell\) as \[\sum_{i=1}^n \sum_{k=1}^K z_{ik} \log \pi_k.\] We seek to maximize this expression, subject to the constraint that \(\sum_{k=1}^K \pi_k = 1\). Thus, we construct the Lagrangian \[\mathcal{L}(\pi_1,\dots,\pi_K, \lambda) = \sum_{i=1}^n \sum_{k=1}^K z_{ik} \log \pi_k - \lambda\left(\sum_{k=1}^K \pi_k - 1\right).\] Differentiating \(\mathcal L\) with respect to \(\pi_k\) gives \[\begin{align*} \frac{\partial\mathcal L}{\partial \pi_k} = \sum_{i=1}^n \frac{z_{ik}}{\pi_k} - \lambda. \end{align*}\] Setting this equal to zero gives \[\pi_k = \frac{1}{\lambda}\sum_{i=1}^n z_{ik}.\] We then recall the constraint \(\sum_{k=1}^K \pi_k = 1\), where the LHS can be rewritten as \[\sum_{k=1}^K \pi_k = \sum_{k=1}^K \left(\frac1\lambda\sum_{i=1}^n z_{ik}\right) = \frac1\lambda\sum_{i=1}^n\left(\sum_{k=1}^K z_{ik} \right)= \frac1\lambda\sum_{i=1}^n(1) = \frac n\lambda,\] thus \[\lambda = n.\] It follows that \[\widehat{\pi_k} = \frac1n\sum_{i=1}^n z_{ik}.\]

4.3 Ridge Regression

Consider the regression model \[y = X\beta + \varepsilon,\] where \(X\in \mathbb{R}^{n\times d}\), \(y\in \mathbb{R}^n\), and \(\beta\in\mathbb{R}^d\). extends classical linear regression by including an \(\ell_2\)-regularization term on \(\beta\), i.e., its objective function is \[J(\beta) = \|y - X\beta\|_2^2 + \lambda \|\beta\|_2^2,\] where \(\lambda > 0\) is a hyperparameter. Find the minimizer \(\widehat\beta\).

Ridge Regression (L2-Regularized Linear Regression)

Ridge regression is linear regression with an added L2 penalty to control model complexity and improve generalization.

Model: \[ y = Xw + \varepsilon \]

Objective (L2-regularized least squares): \[ \min_w \; \|Xw - y\|^2 + \lambda \|w\|^2. \]

Adds an L2 loss term on the parameters.
Penalizes large coefficients to reduce overfitting.
Especially helpful when features are highly correlated (multicollinearity).
Shrinks weights smoothly rather than eliminating them.

Effect of \(\lambda\) (regularization strength): - \(\lambda \to 0\) → same as ordinary linear regression. - \(\lambda \to \infty\) → all coefficients shrink toward \(0\). - Larger \(\lambda\) → simpler model, higher bias, lower variance.

Closed-form solution: \[ w = (X^T X + \lambda I)^{-1} X^T y. \]

Probabilistic interpretation (Bayesian view): - Assumes Gaussian noise in targets: \[ y \mid w \sim \mathcal{N}(Xw, \sigma^2 I). \] - Assumes parameters come from a zero-mean normal prior: \[ w \sim \mathcal{N}(0, \tau^2 I). \]

The L2 penalty corresponds to the log of this Gaussian prior, meaning: - Small weights are more likely. - Large weights are unlikely.

Thus, ridge regression = maximum a posteriori (MAP) estimate under a normal prior on parameters.

Answer

We rewrite the objective function as \[\begin{align*} J(\beta) &= (y-X\beta)^\top(y-X\beta) + \lambda\beta^\top\beta. \end{align*}\]

Wesley’s Notes

The squared norm identity is \(\|v\|_2^2 = v^\top v\). So the objective becomes \(J(\beta) = (y - X\beta)^\top(y - X\beta) + \lambda \beta^\top \beta\). This is just rewriting the norms.

Taking the gradient with respect to \(\beta\) yields \[\begin{align*} \nabla J(\beta) &= -2X^\top(y-X\beta) + 2\lambda \beta \\ &= 2(X^\top X + \lambda I_{d\times d})\beta - 2X^\top y. \end{align*}\]

Setting this equal to zero yields \[\begin{align*} \widehat\beta = (X^\top X + \lambda I_{d\times d})^{-1} X^\top y. \end{align*}\]

This expression is well-defined if and only if \(X^\top X + \lambda I_{d\times d}\) is invertible. Indeed, \(X^\top X + \lambda I_{d\times d}\) is positive definite, as for \(v\in \mathbb{R}^d\) such that \(v\neq 0\), \[\begin{align*} v^\top (X^\top X + \lambda I_{d\times d})v = v^\top X^\top Xv + \lambda v^\top v = \|Xv\|_2^2 + \lambda \|v\|_2^2 > 0, \end{align*}\] from which it follows that \(X^\top X + \lambda I_{d\times d}\) is invertible (refer to Discussion 3).

(Note. If \(\lambda = 0\), our expression for \(\widehat\beta\) is well-defined if and only if \(X^\top X\) is invertible. If not, then there exist multiple minimizers for the objective function. Here, we may utilize instead of a proper inverse. For instance, utilizing the Moore-Penrose pseudoinverse yields the minimizer with smallest Euclidean norm.)

Wesley’s Notes

If \(\lambda = 0\), we recover OLS: \(\widehat\beta = (X^\top X)^{-1} X^\top y\). But \(X^\top X\) might not be invertible when features are linearly dependent or \(d > n\). Adding \(\lambda I\) fixes this.

Ridge regression shrinks coefficients toward zero. Large \(\lambda\) → stronger regularization. Small \(\lambda\) → closer to ordinary least squares.