7 CS 189 Discussion 07: Gradient Descent & ROC Curves
7.0.1 Contact Information
| Name | Wesley Zheng |
| Pronouns | He/him/his |
| Email | wzheng0302@berkeley.edu |
| Discussion | Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3–4 PM @ Hildebrand Hall B51 |
| Office Hours | Tuesdays, 11–1 PM @ Cory Courtyard |
7.1 Gradient Descent on Squared Loss
Consider a regression problem with design matrix \(X\in \mathbb{R}^{n\times d}\), corresponding labels \(y\in \mathbb{R}^n\), parameter vector \(\theta\in \mathbb{R}^d\), and loss function \[ L(\theta) = \|y - X\theta\|_2^2.\] Assume that \(d > n\), i.e., the system is over-parametrized, and that \(X\) is full-rank. Gradient descent generates a sequence of points \(\{\theta_k\}_{k\geq 0}\) according to the update rule \[\theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k),\] where \(\{\alpha_k\}_{k\geq 0}\) is a fixed sequence of step sizes. Assume that \(\theta_0 = 0\) and that \(\{\alpha_k\}_{k\geq 0}\) is chosen so that gradient descent converges to a minimizer of the loss function.
Gradient Descent is an iterative optimization algorithm used to minimize a loss function by repeatedly updating model parameters in the direction that most rapidly decreases the loss.
Core Idea
- At each step, parameters are updated by moving in the negative direction of the gradient of the loss function.
- The gradient indicates the direction of steepest increase, so moving opposite reduces the loss.
Update Rule
Parameter update equation: \[ w_{t+1} = w_t - \eta \nabla L(w_t) \]
where:
- \(w_t\) = parameters at iteration \(t\)
- \(\nabla L(w_t)\) = gradient of the loss function at \(w_t\)
- \(\eta\) = learning rate controlling the step size
Key Properties
- Repeatedly updates parameters to gradually reduce the loss.
- Stops when:
- the algorithm converges, or
- a maximum number of iterations is reached.
- Widely used to train machine learning models.
- Particularly effective for large-scale optimization problems.
Variants
- Batch Gradient Descent
- Uses the entire dataset to compute the gradient for each update.
- Stochastic Gradient Descent (SGD)
- Updates parameters using one training example at a time.
- Mini-batch Gradient Descent
- Computes gradients using small batches of data, balancing efficiency and stability.
Convergence Considerations
- The learning rate \(\eta\) strongly affects convergence:
- Too large → updates may diverge.
- Too small → slow convergence.
- Convergence behavior also depends on properties of the loss function, such as smoothness and convexity.
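The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not the worksheet's code: the synthetic data, learning rate, and iteration count are assumptions chosen so batch gradient descent converges.

```python
import numpy as np

# Minimal sketch of batch gradient descent on the squared loss
# L(w) = ||y - Xw||^2; the synthetic data, learning rate, and
# iteration count are illustrative assumptions.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))       # n = 50 samples, d = 3 features
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true                         # noiseless labels for clarity

def gradient_descent(X, y, eta=0.005, n_iters=5_000):
    w = np.zeros(X.shape[1])           # start at the origin
    for _ in range(n_iters):
        grad = -2 * X.T @ (y - X @ w)  # gradient of the squared loss
        w = w - eta * grad             # step opposite the gradient
    return w

w_hat = gradient_descent(X, y)
assert np.allclose(w_hat, w_true, atol=1e-6)
```

With a step size this far below the stability threshold, the iterates contract toward the least-squares solution at every step; too large an `eta` would make them diverge, as noted above.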
7.1.1 (a)
Show that there exist infinitely-many \(\theta\in\mathbb{R}^d\) such that \(X\theta = y\).
Answer
Since \(X\) has full row rank \(n\) and maps \(\mathbb{R}^d\to\mathbb{R}^n\), its image is all of \(\mathbb{R}^n\); that is, \(X\) is surjective onto \(\mathbb{R}^n\). So for any \(y\in\mathbb{R}^n\), there exists some vector \(\theta^*\) such that \(X\theta^* = y\), and at least one solution exists.
Now, since \(d > n\), the matrix has more columns than rows. By the rank–nullity theorem, \(rank(X) + dim(Ker(X)) = d\); since \(rank(X) = n\), we get \(dim(Ker(X)) = d - n \geq 1\). So the kernel (nullspace) is not just \(\{0\}\): there exists a nonzero vector \(z\in Ker(X)\) such that \(Xz = 0\).
It follows that for any of the infinitely-many scalars \(\alpha\in\mathbb{R}\), \(\theta' = \theta^* + \alpha z\) is also a solution, since \[X\theta' = X(\theta^* + \alpha z) = X\theta^* + \alpha Xz = y + \alpha\cdot 0 = y.\]
\(X\theta = y\) constrains only \(n\) directions. But \(\theta\) lives in \(d\)-dimensional space. The remaining \(d - n\) directions don’t affect the prediction: you can move in those directions freely without changing \(X\theta\). Those directions are exactly \(Ker(X)\).
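A quick numerical sanity check of this argument, with a small random full-rank \(X\) (the sizes, seed, and scalars below are arbitrary illustrative choices):

```python
import numpy as np

# Numerical sketch of part (a): one particular solution plus any multiple
# of a kernel vector still solves X theta = y. Sizes and seed are arbitrary.
rng = np.random.default_rng(0)
n, d = 3, 5
X = rng.standard_normal((n, d))        # full row rank with probability 1
y = rng.standard_normal(n)

theta_star = np.linalg.pinv(X) @ y     # one particular solution
_, _, Vt = np.linalg.svd(X)
z = Vt[-1]                             # a right-singular vector in Ker(X)

assert np.allclose(X @ z, 0)           # z is (numerically) in the kernel
for alpha in [0.0, 1.0, -3.7, 100.0]:  # any scalar gives another solution
    assert np.allclose(X @ (theta_star + alpha * z), y)
```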
7.1.2 (b)
Show that the gradient descent limit \(\theta_\infty\) is the minimum Euclidean norm solution to \(X\theta = y\).
Answer
We know that \(X\theta_\infty = y\) by the assumption of convergence and Part (a).
Gradient descent is assumed to converge to a minimizer of the loss \(L(\theta) = \|y - X\theta\|_2^2\). The minimum value of this loss is 0, because from part (a) there exist solutions with \(X\theta = y\).
So at convergence, \(X\theta_\infty = y\), meaning \(\theta_\infty\) is one of the many exact solutions. The question is: which one does gradient descent pick?
We compute \(\nabla L(\theta)\) to rewrite the update rule as \[\theta_{k+1} = \theta_k + 2\alpha_k X^\top (y - X\theta_k).\]
The loss is \(L(\theta) = \|y - X\theta\|_2^2\). The gradient is \(\nabla L(\theta) = -2 X^\top (y - X\theta)\).
So the GD update \(\theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k)\) becomes \(\theta_{k+1} = \theta_k + 2\alpha_k X^\top (y - X\theta_k)\).
Important observation: every step moves in a direction of the form \(X^\top(\cdot)\), i.e., within the row space \(Ker(X)^\perp\).
We show via induction that all \(\theta_k\in Ker(X)^\perp\). This is trivial for \(k = 0\), and assuming that \(\theta_k \in Ker(X)^\perp\) for an arbitrary index \(k\), we have that for \(z\in Ker(X)\), \[\langle X^\top (y - X\theta_k), z\rangle = (y - X\theta_k)^\top Xz = (y - X\theta_k)^\top 0 = 0,\] thus \(X^\top (y - X\theta_k)\in Ker(X)^\perp\), which implies that \(\theta_{k+1}\in Ker(X)^\perp\) by the update rule. It follows that \(\theta_\infty \in Ker(X)^\perp\).
Now, recall that \(X\theta_\infty = y\). We want to show that for arbitrary \(\theta'\in\mathbb{R}^d\) such that \(X\theta' = y\), \(\|\theta_\infty\|_2 \leq \|\theta'\|_2\). Indeed, \(\theta' - \theta_\infty\in Ker(X)\) since \(X(\theta' - \theta_\infty) = y - y = 0\), thus there exists \(z\in Ker(X)\) such that \(\theta' = \theta_\infty + z\).
It follows that \[\|\theta'\|_2^2 = (\theta')^\top \theta' = (\theta_\infty + z)^\top (\theta_\infty + z) = \|\theta_\infty\|_2^2 + 2 \theta_\infty^\top z + \|z\|_2^2 = \|\theta_\infty\|_2^2 + \|z\|_2^2 \geq \|\theta_\infty\|_2^2\] since \(\theta_\infty^\top z = 0\). Thus, \(\theta_\infty\) is the minimum Euclidean norm solution to \(X\theta = y\).
\(z\in Ker(X)\) and \(\theta_\infty\in Ker(X)^\perp\), so they are orthogonal: \(\theta_\infty^\top z = 0\). Thus \(\|\theta'\|_2^2 = \|\theta_\infty\|_2^2 + \|z\|_2^2\). Since \(\|z\|_2^2 \geq 0\), \(\|\theta'\|_2^2 \geq \|\theta_\infty\|_2^2\).
In particular, the minimum Euclidean norm solution to \(X\theta = y\) is the unique solution contained in \(Ker(X)^\perp\).
All solutions look like \(\theta^* + z\), \(z\in Ker(X)\). So the solution set is a flat affine space.
Gradient descent: starts at 0; only moves inside \(Ker(X)^\perp\); never adds any kernel component.
So it finds the point in the solution set closest to the origin. That is exactly the minimum-norm solution.
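The argument above can be checked numerically: run the update rule from \(\theta_0 = 0\) and compare the limit with the minimum-norm solution. The problem sizes, seed, step size, and iteration count below are arbitrary assumptions.

```python
import numpy as np

# Numerical sketch of part (b): GD from theta_0 = 0 on an
# over-parametrized problem (d > n) converges to the min-norm solution.
rng = np.random.default_rng(0)
n, d = 3, 8
X = rng.standard_normal((n, d))        # full row rank with probability 1
y = rng.standard_normal(n)

theta = np.zeros(d)                    # theta_0 = 0: no kernel component
alpha = 0.01                           # fixed step size, small enough to converge
for _ in range(50_000):
    theta = theta + 2 * alpha * X.T @ (y - X @ theta)

theta_min_norm = np.linalg.pinv(X) @ y # the minimum-norm solution
assert np.allclose(X @ theta, y)       # converged to an exact solution
assert np.allclose(theta, theta_min_norm, atol=1e-6)
```

Every iterate stays in \(Ker(X)^\perp\), so no kernel component ever appears; the limit is the solution closest to the origin.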
7.1.3 (c)
Show that \(\theta_\infty = X^\top (XX^\top)^{-1} y\).
Answer
Since \(rank(X) = n\) and \(XX^\top\in \mathbb{R}^{n\times n}\), it follows that \(XX^\top\) is invertible, thus \((XX^\top)^{-1}\) is well-defined.
If \(X\) has full row rank, then \(XX^\top\) is invertible.
For any vector \(v\), \(v^\top XX^\top v = \|X^\top v\|_2^2 \geq 0\), with equality only when \(X^\top v = 0\); since \(X\) has full row rank, this forces \(v = 0\). So \(XX^\top\) is positive definite \(\Rightarrow\) invertible.
Thus \((XX^\top)^{-1}\) exists.
Further, \[X(X^\top (XX^\top)^{-1}y) = (XX^\top)(XX^\top)^{-1}y = y,\] thus \(X^\top (XX^\top)^{-1}y\) is a solution to \(X\theta = y\).
By Part (b), it remains to be shown that \(X^\top (XX^\top)^{-1} y\in Ker(X)^\perp\).
From part (b) we already proved: the gradient descent limit is the unique solution in \(Ker(X)^\perp\).
So now we just need to show that this candidate solution lies in that subspace.
Indeed, for arbitrary \(z\in Ker(X)\), \[\langle X^\top (XX^\top)^{-1} y, z\rangle = y^\top ((XX^\top)^{-1})^\top Xz = y^\top ((XX^\top)^{-1})^\top 0 = 0,\] which yields the result.
Take any \(z\in Ker(X)\), which means \(Xz = 0\). Compute the inner product: \(\langle X^\top (XX^\top)^{-1} y, z\rangle\).
Rewrite using transpose properties: \(= y^\top ((XX^\top)^{-1})^\top Xz\). But \(Xz = 0\), so this becomes \(= y^\top ((XX^\top)^{-1})^\top 0 = 0\).
Thus \(\langle X^\top (XX^\top)^{-1} y, z \rangle = 0\) for all \(z\in Ker(X)\). Therefore \(X^\top (XX^\top)^{-1} y\in Ker(X)^\perp\).
We now know two things:
- \(X^\top (XX^\top)^{-1} y\) solves \(X\theta = y\).
- It lies in \(Ker(X)^\perp\).
From part (b): the unique solution in \(Ker(X)^\perp\) is the minimum-norm solution, which is exactly the gradient descent limit.
Therefore \(\theta_\infty = X^\top (XX^\top)^{-1} y\).
This formula is the Moore–Penrose pseudoinverse solution: \(\theta_\infty = X^\dagger y\), where \(X^\dagger = X^\top (XX^\top)^{-1}\).
This is the standard solution when \(d > n\) (more parameters than data) and we want the smallest-norm weights.
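A quick check of the closed form against NumPy's pseudoinverse; the random problem below is an illustrative assumption.

```python
import numpy as np

# Sketch checking theta_inf = X^T (X X^T)^{-1} y against NumPy's
# pseudoinverse on a random full-row-rank problem (sizes are arbitrary).
rng = np.random.default_rng(1)
n, d = 4, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# Solve (X X^T) v = y rather than forming the inverse explicitly.
theta_inf = X.T @ np.linalg.solve(X @ X.T, y)

assert np.allclose(X @ theta_inf, y)                  # solves X theta = y
assert np.allclose(theta_inf, np.linalg.pinv(X) @ y)  # matches X^† y
```

Using `np.linalg.solve` instead of `np.linalg.inv` is the standard numerically safer choice when a formula contains \((XX^\top)^{-1}y\).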
7.2 Area Under ROC Curve
Suppose we observe \((X,Y)\), where \(X\in\mathcal X\) and \(Y\in \{0,1\}\). Here, \(Y = 1\) and \(Y = 0\) denote the “positive” and “negative” classes, respectively. Let \(\phi:\mathcal X\to\mathbb{R}\) be a score function, where \(X\) is assigned the score \(S = \phi(X)\). Assume that the conditional distribution of \(S\) given \(Y = 0\) admits a continuous PDF \(p_0(s)\), and that the conditional distribution of \(S\) given \(Y = 1\) admits a continuous PDF \(p_1(s)\).
Now, for each \(t\in\mathbb{R}\), we may derive a binary classifier that outputs the label \(1\) for an observation \(X\) if and only if \(\phi(X) > t\). The true positive rate (TPR) and false positive rate (FPR) at this threshold are \[\text{TPR}(t) = \Pr[\phi(X) > t \mid Y = 1]\qquad\text{and}\qquad \text{FPR}(t) = \Pr[\phi(X) > t \mid Y = 0],\] respectively. Then, the receiver operating characteristic (ROC) curve is the parametric curve \[\{(\text{FPR}(t),\text{TPR}(t)) : t\in\mathbb{R}\}\subset [0,1]^2.\] Show that the area under the ROC curve (AUC) equals the probability that a randomly-drawn positive sample is scored higher than a randomly-drawn negative sample under \(\phi\). More precisely, show that \[\text{AUC} = \Pr[S^+ \geq S^-],\] where \(S^+\) and \(S^-\) are independent random variables with PDFs \(p_1(s)\) and \(p_0(s)\), respectively.
A ROC curve evaluates the performance of a binary classifier by visualizing the trade-off between correctly identifying positives and incorrectly classifying negatives as positives across different thresholds.
Definition
The ROC curve plots True Positive Rate (TPR) against False Positive Rate (FPR).
Each point on the curve corresponds to a different classification threshold.
True Positive Rate (Recall / Sensitivity): \[ \text{TPR} = \frac{TP}{TP + FN} \]
False Positive Rate: \[ \text{FPR} = \frac{FP}{FP + TN} \]
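As a tiny worked example of the two rates, with hypothetical confusion-matrix counts:

```python
# Tiny worked example of TPR and FPR; the confusion-matrix counts
# (TP, FN, FP, TN) below are hypothetical.
TP, FN, FP, TN = 40, 10, 5, 45

tpr = TP / (TP + FN)   # fraction of actual positives correctly flagged
fpr = FP / (FP + TN)   # fraction of actual negatives incorrectly flagged
print(tpr, fpr)        # prints: 0.8 0.1
```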
Properties
- Measures classifier performance independent of a specific threshold.
- Shows the trade-off between sensitivity and false alarms.
- The top-left corner represents ideal performance:
- high TPR
- low FPR
- The diagonal line corresponds to random guessing.
- Curves closer to the top-left corner indicate better classifiers.
Area Under the Curve (AUC)
A common summary statistic for ROC performance.
\[ \text{AUC} = \text{Area under the ROC curve} \]
Interpretation:
- AUC = 1.0 → perfect classifier
- AUC = 0.5 → random classifier
- 0.5 < AUC < 1 → varying degrees of predictive power
Probabilistic interpretation:
AUC equals the probability that the classifier ranks a random positive example higher than a random negative example.
How the Curve is Constructed
- Sort predictions by model score or predicted probability.
- Gradually lower the classification threshold.
- At each threshold:
- compute TPR
- compute FPR
- Plot the resulting points to form the ROC curve.
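The construction steps above can be sketched on toy data; the scores and labels below are made up for illustration.

```python
import numpy as np

# Sketch of ROC construction: sweep the threshold from high to low and
# record (FPR, TPR) at each step. The toy scores and labels are made up.
scores = np.array([0.9, 0.8, 0.7, 0.55, 0.5, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,    0,   1,   0,   0])

P, N = labels.sum(), (labels == 0).sum()
fpr_pts, tpr_pts = [0.0], [0.0]          # threshold above all scores
for t in np.sort(scores)[::-1]:          # gradually lower the threshold
    predicted_pos = scores >= t
    tpr_pts.append((predicted_pos & (labels == 1)).sum() / P)
    fpr_pts.append((predicted_pos & (labels == 0)).sum() / N)

# Trapezoidal area under the piecewise-linear ROC curve.
auc = sum((f2 - f1) * (t1 + t2) / 2
          for f1, f2, t1, t2 in zip(fpr_pts, fpr_pts[1:], tpr_pts, tpr_pts[1:]))
print(auc)  # 0.8125: 13 of the 16 positive/negative pairs are ranked correctly
```

Note that the area computed from the curve matches the pairwise-ranking count, previewing the probabilistic interpretation proved below.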
Answer
\[ \begin{align*} \text{AUC} &= \int_0^1 \text{TPR} \,d(\text{FPR}) \\ &= -\int_{-\infty}^\infty \text{TPR}(t)\,\frac{d(\text{FPR}(t))}{dt}\,dt \end{align*} \]
The ROC curve plots \((\text{FPR}(t), \text{TPR}(t))\) as the threshold \(t\) varies. The area under the ROC curve is \(\text{AUC} = \int_0^1 \text{TPR} \,d(\text{FPR})\). This is a line integral along the ROC curve. Since both TPR and FPR depend on \(t\), we reparameterize using \(t\).
By definition: \(\text{TPR}(t) = \Pr[S > t \mid Y = 1]\). Since the positive scores have PDF \(p_1(s)\): \(\text{TPR}(t) = \int_t^\infty p_1(s)\,ds\).
Similarly: \(\text{FPR}(t) = \Pr[S > t \mid Y = 0]\), and with PDF \(p_0(s)\): \(\text{FPR}(t) = \int_t^\infty p_0(s)\,ds\).
We change variables: \(\text{AUC} = \int_0^1 \text{TPR} \,d(\text{FPR})\) becomes \(\text{AUC} = -\int_{-\infty}^\infty \text{TPR}(t)\,\frac{d}{dt}\text{FPR}(t)\,dt\). The minus sign appears because as \(t\) increases, FPR decreases.
\[ \begin{align*} &= -\int_{-\infty}^\infty \text{TPR}(t)\,\frac{d}{dt}\left(\int_t^\infty p_0(s)\,ds\right)\,dt \\ &= -\int_{-\infty}^\infty -\text{TPR}(t)\,p_0(t)\,dt \\ &= \int_{-\infty}^\infty \text{TPR}(t)\,p_0(t)\,dt \end{align*} \]
Recall \(\text{FPR}(t) = \int_t^\infty p_0(s)\,ds\). By the fundamental theorem of calculus: \(\frac{d}{dt}\text{FPR}(t) = -p_0(t)\).
Substitute into the integral: \(\text{AUC} = -\int_{-\infty}^\infty \text{TPR}(t)(-p_0(t))\,dt = \int_{-\infty}^\infty \text{TPR}(t)\,p_0(t)\,dt\).
\[ \begin{align*} &= \int_{-\infty}^\infty \left(\int_t^\infty p_1(s)\,ds\right)\,p_0(t)\,dt \\ &= \int_{-\infty}^\infty \int_t^\infty p_1(s)\,p_0(t)\,ds\,dt \\ &= \iint_{\{(s,t)\,:\,s\geq t\}} p_1(s)\,p_0(t)\,ds\,dt \end{align*} \]
Recall \(\text{TPR}(t) = \int_t^\infty p_1(s)\,ds\). So \(\text{AUC} = \int_{-\infty}^\infty \bigl(\int_t^\infty p_1(s)\,ds\bigr)\,p_0(t)\,dt\).
Rewriting: \(\text{AUC} = \int_{-\infty}^\infty \int_t^\infty p_1(s)\,p_0(t)\,ds\,dt\). This integrates over all pairs \((s,t)\) satisfying \(s \geq t\).
So \(\text{AUC} = \iint_{\{s \geq t\}} p_1(s)\,p_0(t)\,ds\,dt\).
\[ \begin{align*} &= \iint_{\{(s,t)\,:\,s\geq t\}} f_{S^+,S^-} (s,t)\, ds\,dt \\ &= \Pr[S^+ \geq S^-]. \end{align*} \]
Let \(S^+ \sim p_1(s)\), \(S^- \sim p_0(s)\), and assume they are independent. Then their joint PDF is \(f_{S^+,S^-}(s,t) = p_1(s)\,p_0(t)\).
Thus \(\text{AUC} = \iint_{s \geq t} f_{S^+,S^-}(s,t)\,ds\,dt\).
That double integral is exactly \(\Pr[S^+ \geq S^-]\): the probability that a random positive example receives a higher score than a random negative example.
Imagine this experiment: pick a random positive sample; pick a random negative sample; compare their scores. If the positive score is higher \(\Rightarrow\) the classifier ranked them correctly.
Then: AUC = probability the classifier ranks the pair correctly.
Examples: AUC = 0.5 \(\Rightarrow\) random guessing; AUC = 1 \(\Rightarrow\) perfect ranking; AUC = 0.8 \(\Rightarrow\) positive beats negative 80% of the time.
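A Monte Carlo sanity check of this identity, assuming illustrative Gaussian score densities (the distributions below are not from the worksheet):

```python
import numpy as np
from math import erf, sqrt

# Monte Carlo sketch of AUC = Pr[S+ >= S-], assuming illustrative Gaussian
# score densities: p1 = N(1, 1) for positives, p0 = N(0, 1) for negatives.
rng = np.random.default_rng(0)
s_pos = rng.normal(1.0, 1.0, 100_000)  # independent draws of S+
s_neg = rng.normal(0.0, 1.0, 100_000)  # independent draws of S-

pairwise = (s_pos >= s_neg).mean()     # empirical Pr[S+ >= S-]

# Closed form: S+ - S- ~ N(1, 2), so Pr[S+ >= S-] = Phi(1/sqrt(2)) ~ 0.76.
exact = 0.5 * (1 + erf((1 / sqrt(2)) / sqrt(2)))
assert abs(pairwise - exact) < 0.01
```

For these two densities the score distributions overlap substantially, so the AUC lands around 0.76: better than random guessing, far from perfect ranking.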