7  CS 189 Discussion 07: Gradient Descent & ROC Curves

7.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3-4 PM @ Hildebrand Hall B51
Office Hours Tuesdays, 11–1 PM @ Cory Courtyard

7.1 Gradient Descent on Squared Loss

Consider a regression problem with design matrix \(X\in \mathbb{R}^{n\times d}\), corresponding labels \(y\in \mathbb{R}^n\), parameter vector \(\theta\in \mathbb{R}^d\), and loss function \[ L(\theta) = \|y - X\theta\|_2^2.\] Assume that \(d > n\), i.e., the system is over-parametrized, and that \(X\) is full-rank. Gradient descent generates a sequence of points \(\{\theta_k\}_{k\geq 0}\) according to the update rule \[\theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k),\] where \(\{\alpha_k\}_{k\geq 0}\) is a fixed sequence of step sizes. Assume that \(\theta_0 = 0\) and that \(\{\alpha_k\}_{k\geq 0}\) is chosen so that gradient descent converges to a minimizer of the loss function.

Important: Gradient Descent

Gradient Descent is an iterative optimization algorithm used to minimize a loss function by repeatedly updating model parameters in the direction that most rapidly decreases the loss.

Core Idea

  • At each step, parameters are updated by moving in the negative direction of the gradient of the loss function.
  • The gradient indicates the direction of steepest increase, so moving opposite reduces the loss.

Update Rule

  • Parameter update equation: \[ w_{t+1} = w_t - \eta \nabla L(w_t) \]

    where:

    • \(w_t\) = parameters at iteration \(t\)
    • \(\nabla L(w_t)\) = gradient of the loss function at \(w_t\)
    • \(\eta\) = learning rate controlling the step size

Key Properties

  • Repeatedly updates parameters to gradually reduce the loss.
  • Stops when:
    • the algorithm converges, or
    • a maximum number of iterations is reached.
  • Widely used to train machine learning models.
  • Particularly effective for large-scale optimization problems.

Variants

  • Batch Gradient Descent
    • Uses the entire dataset to compute the gradient for each update.
  • Stochastic Gradient Descent (SGD)
    • Updates parameters using one training example at a time.
  • Mini-batch Gradient Descent
    • Computes gradients using small batches of data, balancing efficiency and stability.
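The three variants above can be sketched side by side on the squared loss \(L(w) = \|y - Xw\|_2^2\); the only difference is how many rows of \(X\) are used to estimate the gradient at each step. This is a rough sketch with made-up data, not part of the worksheet:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.standard_normal((n, d))
y = X @ rng.standard_normal(d)      # noiseless labels, so every variant can interpolate

def grad(w, Xb, yb):
    """Gradient of ||yb - Xb w||^2 with respect to w."""
    return -2 * Xb.T @ (yb - Xb @ w)

def run(batch_size, eta=1e-3, steps=2000):
    w = np.zeros(d)
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        w = w - eta * grad(w, X[idx], y[idx])
    return w

w_batch = run(batch_size=n)    # batch GD: the entire dataset each step
w_sgd   = run(batch_size=1)    # SGD: a single example each step
w_mini  = run(batch_size=16)   # mini-batch: a small subset each step
```

With noiseless labels all three runs drive the loss toward zero; the mini-batch run illustrates the usual trade between per-step cost and gradient noise.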

Convergence Considerations

  • The learning rate \(\eta\) strongly affects convergence:
    • Too large → updates may diverge.
    • Too small → slow convergence.
  • Convergence behavior also depends on properties of the loss function, such as smoothness and convexity.
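The learning-rate trade-off is easiest to see on the simplest possible loss, \(L(w) = w^2\), where the GD iterate has the closed form \(w_{t+1} = (1 - 2\eta)\,w_t\) and converges iff \(|1 - 2\eta| < 1\). A toy sketch, not from the worksheet:

```python
def gd_1d(eta, steps=50, w0=1.0):
    """Run gradient descent on L(w) = w**2, whose gradient at w is 2w."""
    w = w0
    for _ in range(steps):
        w -= eta * 2 * w          # w <- (1 - 2*eta) * w
    return w

w_slow = gd_1d(eta=0.01)   # converges slowly: (0.98)**50 ~ 0.36
w_fast = gd_1d(eta=0.4)    # converges quickly: (0.2)**50, essentially 0
w_div  = gd_1d(eta=1.5)    # |1 - 2*eta| = 2 > 1: the iterates diverge
```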

7.1.1 (a)

Show that there exist infinitely-many \(\theta\in\mathbb{R}^d\) such that \(X\theta = y\).

Answer

Note that \(X\) is surjective onto \(\mathbb{R}^n\): since \(X\) has full row rank \(n\) and maps \(\mathbb{R}^d\to\mathbb{R}^n\), its image is all of \(\mathbb{R}^n\). So for any \(y\in\mathbb{R}^n\), there exists some \(\theta^*\) such that \(X\theta^* = y\), i.e., at least one solution exists.

Now, since \(d > n\), the matrix has more columns than rows. By the rank–nullity theorem, \(rank(X) + dim(Ker(X)) = d\); since \(rank(X) = n\), we get \(dim(Ker(X)) = d - n \geq 1\). So the kernel (nullspace) is not just \(\{0\}\): there exists a nonzero vector \(z\in Ker(X)\) such that \(Xz = 0\).

It follows that for any of the infinitely-many scalars \(\alpha\in\mathbb{R}\), \(\theta' = \theta^* + \alpha z\) is also a solution, since \[X\theta' = X(\theta^* + \alpha z) = X\theta^* + \alpha Xz = y + \alpha\cdot 0 = y.\]

\(X\theta = y\) constrains only \(n\) directions. But \(\theta\) lives in \(d\)-dimensional space. The remaining \(d - n\) directions don’t affect the prediction—you can move in those directions freely without changing \(X\theta\). Those directions are exactly \(Ker(X)\).
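A quick numeric sanity check of this argument (a sketch with made-up dimensions, not part of the worksheet): build a random wide \(X\), take one particular solution, extract a kernel vector, and verify that every shift along the kernel still solves the system:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 3, 5
X = rng.standard_normal((n, d))   # full row rank with probability 1
y = rng.standard_normal(n)

# One particular solution (any solution would do here).
theta_star = X.T @ np.linalg.solve(X @ X.T, y)

# A nonzero kernel vector: remove the row-space component of a random vector.
v = rng.standard_normal(d)
z = v - X.T @ np.linalg.solve(X @ X.T, X @ v)

for alpha in (0.0, 1.0, -3.7, 100.0):
    assert np.allclose(X @ (theta_star + alpha * z), y)  # still a solution
```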


7.1.2 (b)

Show that the gradient descent limit \(\theta_\infty\) is the minimum Euclidean norm solution to \(X\theta = y\).

Answer

We know that \(X\theta_\infty = y\) by the assumption of convergence and Part (a).

Gradient descent is assumed to converge to a minimizer of the loss \(L(\theta) = \|y - X\theta\|_2^2\). The minimum value of this loss is 0, because from part (a) there exist solutions with \(X\theta = y\).

So at convergence, \(X\theta_\infty = y\), meaning \(\theta_\infty\) is one of the many exact solutions. The question is: which one does gradient descent pick?

We compute \(\nabla L(\theta)\) to rewrite the update rule as \[\theta_{k+1} = \theta_k + 2\alpha_k X^\top (y - X\theta_k).\]

The loss is \(L(\theta) = \|y - X\theta\|_2^2\). The gradient is \(\nabla L(\theta) = -2 X^\top (y - X\theta)\).

So the GD update \(\theta_{k+1} = \theta_k - \alpha_k \nabla L(\theta_k)\) becomes \(\theta_{k+1} = \theta_k + 2\alpha_k X^\top (y - X\theta_k)\).

Important observation: every update adds a vector of the form \(X^\top(\cdot)\), which lies in the row space of \(X\), i.e., in \(Ker(X)^\perp\).

We show via induction that all \(\theta_k\in Ker(X)^\perp\). This is trivial for \(k = 0\), and assuming that \(\theta_k \in Ker(X)^\perp\) for an arbitrary index \(k\), we have that for \(z\in Ker(X)\), \[\langle X^\top (y - X\theta_k), z\rangle = (y - X\theta_k)^\top Xz = (y - X\theta_k)^\top 0 = 0,\] thus \(X^\top (y - X\theta_k)\in Ker(X)^\perp\), which implies that \(\theta_{k+1}\in Ker(X)^\perp\) by the update rule. It follows that \(\theta_\infty \in Ker(X)^\perp\).

Now, recall that \(X\theta_\infty = y\). We want to show that for arbitrary \(\theta'\in\mathbb{R}^d\) such that \(X\theta' = y\), \(\|\theta_\infty\|_2 \leq \|\theta'\|_2\). Indeed, \(\theta' - \theta_\infty\in Ker(X)\) since \(X(\theta' - \theta_\infty) = y - y = 0\), thus there exists \(z\in Ker(X)\) such that \(\theta' = \theta_\infty + z\).

It follows that \[\|\theta'\|_2^2 = (\theta')^\top \theta' = (\theta_\infty + z)^\top (\theta_\infty + z) = \|\theta_\infty\|_2^2 + 2 \theta_\infty^\top z + \|z\|_2^2 = \|\theta_\infty\|_2^2 + \|z\|_2^2 \geq \|\theta_\infty\|_2^2\] since \(\theta_\infty^\top z = 0\). Thus, \(\theta_\infty\) is the minimum Euclidean norm solution to \(X\theta = y\).

\(z\in Ker(X)\) and \(\theta_\infty\in Ker(X)^\perp\), so they are orthogonal: \(\theta_\infty^\top z = 0\). Thus \(\|\theta'\|_2^2 = \|\theta_\infty\|_2^2 + \|z\|_2^2\). Since \(\|z\|_2^2 \geq 0\), \(\|\theta'\|_2^2 \geq \|\theta_\infty\|_2^2\).

In particular, the minimum Euclidean norm solution to \(X\theta = y\) is the unique solution contained in \(Ker(X)^\perp\).

All solutions look like \(\theta^* + z\), \(z\in Ker(X)\). So the solution set is a flat affine space.

Gradient descent: starts at 0; only moves inside \(Ker(X)^\perp\); never adds any kernel component.

So it finds the point in the solution set closest to the origin. That is exactly the minimum-norm solution.
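This can be checked numerically (a sketch with arbitrary data, not from the worksheet): run the update \(\theta_{k+1} = \theta_k + 2\alpha X^\top(y - X\theta_k)\) from \(\theta_0 = 0\) and compare the limit against the minimum-norm solution computed directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 10
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

theta = np.zeros(d)                # theta_0 = 0, as in the problem
alpha = 0.01                       # small constant step size (works for this X)
for _ in range(50000):
    theta = theta + 2 * alpha * X.T @ (y - X @ theta)

# Minimum Euclidean norm solution, computed in closed form.
theta_mn = X.T @ np.linalg.solve(X @ X.T, y)
assert np.allclose(theta, theta_mn, atol=1e-5)
```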


7.1.3 (c)

Show that \(\theta_\infty = X^\top (XX^\top)^{-1} y\).

Answer

Since \(rank(X) = n\) and \(XX^\top\in \mathbb{R}^{n\times n}\), the matrix \(XX^\top\) is invertible, so \((XX^\top)^{-1}\) is well-defined. To see this, note that for any vector \(v\), \(v^\top XX^\top v = \|X^\top v\|_2^2\), which is zero only if \(X^\top v = 0\); because the rows of \(X\) are linearly independent, this forces \(v = 0\). So \(XX^\top\) is positive definite, hence invertible.

Further, \[X(X^\top (XX^\top)^{-1}y) = (XX^\top)(XX^\top)^{-1}y = y,\] thus \(X^\top (XX^\top)^{-1}y\) is a solution to \(X\theta = y\).

By Part (b), it remains to be shown that \(X^\top (XX^\top)^{-1} y\in Ker(X)^\perp\).

From part (b) we already proved: the gradient descent limit is the unique solution in \(Ker(X)^\perp\).

So now we just need to show that this candidate solution lies in that subspace.

Indeed, take arbitrary \(z\in Ker(X)\), so that \(Xz = 0\). Then \[\langle X^\top (XX^\top)^{-1} y, z\rangle = y^\top ((XX^\top)^{-1})^\top Xz = y^\top ((XX^\top)^{-1})^\top 0 = 0.\] Since this holds for all \(z\in Ker(X)\), the candidate \(X^\top (XX^\top)^{-1} y\) lies in \(Ker(X)^\perp\), which yields the result.

We now know two things:

  1. \(X^\top (XX^\top)^{-1} y\) solves \(X\theta = y\).
  2. It lies in \(Ker(X)^\perp\).

From part (b): the unique solution in \(Ker(X)^\perp\) is the minimum-norm solution, which is exactly the gradient descent limit.

Therefore \(\theta_\infty = X^\top (XX^\top)^{-1} y\).

This formula is the Moore–Penrose pseudoinverse solution: \(\theta_\infty = X^\dagger y\), where \(X^\dagger = X^\top (XX^\top)^{-1}\).

This is the standard solution when \(d > n\) (more parameters than data) and we want the smallest-norm weights.
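As a small check (assuming NumPy is available; not part of the worksheet), the closed form agrees with NumPy's built-in pseudoinverse when \(X\) has full row rank:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 3, 7
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

theta_closed = X.T @ np.linalg.solve(X @ X.T, y)   # X^T (X X^T)^{-1} y
theta_pinv = np.linalg.pinv(X) @ y                 # Moore-Penrose pseudoinverse

assert np.allclose(theta_closed, theta_pinv)       # same vector
assert np.allclose(X @ theta_closed, y)            # and it solves X theta = y
```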

7.2 Area Under ROC Curve

Suppose we observe \((X,Y)\), where \(X\in\mathcal X\) and \(Y\in \{0,1\}\). Here, \(Y = 1\) and \(Y = 0\) denote the “positive” and “negative” classes, respectively. Let \(\phi:\mathcal X\to\mathbb{R}\) be a score function, which assigns \(X\) the score \(S = \phi(X)\). Assume that the conditional distribution of \(S\) given \(Y = 0\) admits a continuous PDF \(p_0(s)\), and that the conditional distribution of \(S\) given \(Y = 1\) admits a continuous PDF \(p_1(s)\).

Now, for each \(t\in\mathbb{R}\), we may derive a threshold classifier that outputs the label \(1\) for an observation \(X\) if and only if \(\phi(X) > t\). The true positive rate (TPR) and false positive rate (FPR) at this threshold are \[\text{TPR}(t) = \Pr[\phi(X) > t \mid Y = 1]\qquad\text{and}\qquad \text{FPR}(t) = \Pr[\phi(X) > t \mid Y = 0],\] respectively. Then, the receiver operating characteristic (ROC) curve is the parametric curve \[\{(\text{FPR}(t),\text{TPR}(t)) : t\in\mathbb{R}\}\subset [0,1]^2.\] Show that the area under the ROC curve (AUC) equals the probability that a randomly-drawn positive sample is scored higher than a randomly-drawn negative sample under \(\phi\). More precisely, show that \[\text{AUC} = \Pr[S^+ \geq S^-],\] where \(S^+\) and \(S^-\) are independent random variables with PDFs \(p_1(s)\) and \(p_0(s)\), respectively.

Important: Receiver Operating Characteristic (ROC) Curve

A ROC curve evaluates the performance of a binary classifier by visualizing the trade-off between correctly identifying positives and incorrectly classifying negatives as positives across different thresholds.

Definition

  • The ROC curve plots True Positive Rate (TPR) against False Positive Rate (FPR).

  • Each point on the curve corresponds to a different classification threshold.

  • True Positive Rate (Recall / Sensitivity): \[ \text{TPR} = \frac{TP}{TP + FN} \]

  • False Positive Rate: \[ \text{FPR} = \frac{FP}{FP + TN} \]

Properties

  • Measures classifier performance independent of a specific threshold.
  • Shows the trade-off between sensitivity and false alarms.
  • The top-left corner represents ideal performance:
    • high TPR
    • low FPR
  • The diagonal line corresponds to random guessing.
  • Curves closer to the top-left corner indicate better classifiers.

Area Under the Curve (AUC)

  • A common summary statistic for ROC performance.

  • \[ \text{AUC} = \text{Area under the ROC curve} \]

  • Interpretation:

    • AUC = 1.0 → perfect classifier
    • AUC = 0.5 → random classifier
    • 0.5 < AUC < 1 → varying degrees of predictive power
  • Probabilistic interpretation:
    AUC equals the probability that the classifier ranks a random positive example higher than a random negative example.

How the Curve is Constructed

  • Sort predictions by model score or predicted probability.
  • Gradually lower the classification threshold.
  • At each threshold:
    • compute TPR
    • compute FPR
  • Plot the resulting points to form the ROC curve.
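The construction above can be sketched on a handful of hypothetical scores (the data here are made up, and predictions use \(\phi(X) \geq t\) with \(t\) swept over the observed scores so that every corner of the empirical staircase is visited):

```python
import numpy as np

# Hypothetical scores and labels (1 = positive), listed in decreasing score order.
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2])
labels = np.array([1,   1,   0,   1,   0,    1,   0,   0])

P, N = labels.sum(), (1 - labels).sum()
thresholds = np.concatenate(([np.inf], np.sort(scores)[::-1]))  # high -> low

# One (FPR, TPR) point per threshold.
fpr = np.array([((scores >= t) & (labels == 0)).sum() / N for t in thresholds])
tpr = np.array([((scores >= t) & (labels == 1)).sum() / P for t in thresholds])

# Area under the staircase: TPR is constant across each FPR jump,
# so summing TPR * (FPR step) gives the exact area.
auc = float(np.sum(np.diff(fpr) * tpr[1:]))
```

Here 13 of the 16 (positive, negative) pairs are ranked correctly, and `auc` comes out to exactly 13/16, previewing the ranking interpretation proved below.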

Answer

\[ \begin{align*} \text{AUC} &= \int_0^1 \text{TPR} \,d(\text{FPR}) \\ &= -\int_{-\infty}^\infty \text{TPR}(t)\,\frac{d(\text{FPR}(t))}{dt}\,dt \end{align*} \]

The ROC curve plots \((\text{FPR}(t), \text{TPR}(t))\) as the threshold \(t\) varies. The area under the ROC curve is \(\text{AUC} = \int_0^1 \text{TPR} \,d(\text{FPR})\). This is a line integral along the ROC curve. Since both TPR and FPR depend on \(t\), we reparameterize using \(t\).

By definition: \(\text{TPR}(t) = \Pr[S > t \mid Y = 1]\). Since the positive scores have PDF \(p_1(s)\): \(\text{TPR}(t) = \int_t^\infty p_1(s)\,ds\).

Similarly: \(\text{FPR}(t) = \Pr[S > t \mid Y = 0]\), and with PDF \(p_0(s)\): \(\text{FPR}(t) = \int_t^\infty p_0(s)\,ds\).

We change variables: \(\text{AUC} = \int_0^1 \text{TPR} \,d(\text{FPR})\) becomes \(\text{AUC} = -\int_{-\infty}^\infty \text{TPR}(t)\,\frac{d}{dt}\text{FPR}(t)\,dt\). The minus sign appears because as \(t\) increases, FPR decreases.

\[ \begin{align*} &= -\int_{-\infty}^\infty \text{TPR}(t)\,\frac{d}{dt}\left(\int_t^\infty p_0(s)\,ds\right)\,dt \\ &= -\int_{-\infty}^\infty -\text{TPR}(t)\,p_0(t)\,dt \\ &= \int_{-\infty}^\infty \text{TPR}(t)\,p_0(t)\,dt \end{align*} \]

Recall \(\text{FPR}(t) = \int_t^\infty p_0(s)\,ds\). By the fundamental theorem of calculus: \(\frac{d}{dt}\text{FPR}(t) = -p_0(t)\).

Substitute into the integral: \(\text{AUC} = -\int_{-\infty}^\infty \text{TPR}(t)(-p_0(t))\,dt = \int_{-\infty}^\infty \text{TPR}(t)\,p_0(t)\,dt\).

\[ \begin{align*} &= \int_{-\infty}^\infty \left(\int_t^\infty p_1(s)\,ds\right)\,p_0(t)\,dt \\ &= \int_{-\infty}^\infty \int_t^\infty p_1(s)\,p_0(t)\,ds\,dt \\ &= \iint_{\{(s,t)\,:\,s\geq t\}} p_1(s)\,p_0(t)\,ds\,dt \end{align*} \]

Recall \(\text{TPR}(t) = \int_t^\infty p_1(s)\,ds\). So \(\text{AUC} = \int_{-\infty}^\infty \bigl(\int_t^\infty p_1(s)\,ds\bigr)\,p_0(t)\,dt\).

Rewriting: \(\text{AUC} = \int_{-\infty}^\infty \int_t^\infty p_1(s)\,p_0(t)\,ds\,dt\). This integrates over all pairs \((s,t)\) satisfying \(s \geq t\).

So \(\text{AUC} = \iint_{\{s \geq t\}} p_1(s)\,p_0(t)\,ds\,dt\).

\[ \begin{align*} &= \iint_{\{(s,t)\,:\,s\geq t\}} f_{S^+,S^-} (s,t)\, ds\,dt \\ &= \Pr[S^+ \geq S^-]. \end{align*} \]

Let \(S^+ \sim p_1(s)\), \(S^- \sim p_0(s)\), and assume they are independent. Then their joint PDF is \(f_{S^+,S^-}(s,t) = p_1(s)\,p_0(t)\).

Thus \(\text{AUC} = \iint_{s \geq t} f_{S^+,S^-}(s,t)\,ds\,dt\).

That double integral is exactly \(\Pr[S^+ \geq S^-]\): the probability that a random positive example receives a higher score than a random negative example.

Imagine this experiment: pick a random positive sample; pick a random negative sample; compare their scores. If the positive score is higher \(\Rightarrow\) the classifier ranked them correctly.

Then: AUC = probability the classifier ranks the pair correctly.

Examples: AUC = 0.5 \(\Rightarrow\) random guessing; AUC = 1 \(\Rightarrow\) perfect ranking; AUC = 0.8 \(\Rightarrow\) positive beats negative 80% of the time.
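A Monte Carlo sketch of this ranking interpretation, using hypothetical Gaussian score distributions (not from the worksheet): the fraction of correctly ranked (positive, negative) pairs matches the closed-form \(\Pr[S^+ \geq S^-]\):

```python
import numpy as np
from math import erf

rng = np.random.default_rng(0)
s_pos = rng.normal(1.0, 1.0, size=2000)   # S+ ~ p1 = N(1, 1)
s_neg = rng.normal(0.0, 1.0, size=2000)   # S- ~ p0 = N(0, 1)

# Fraction of (positive, negative) pairs ranked correctly: empirical Pr[S+ >= S-].
pairwise = (s_pos[:, None] >= s_neg[None, :]).mean()

# For these Gaussians, S+ - S- ~ N(1, 2), so Pr[S+ >= S-] = Phi(1/sqrt(2)),
# and Phi(1/sqrt(2)) = 0.5 * (1 + erf(0.5)) ~ 0.76.
exact = 0.5 * (1 + erf(0.5))
```

The pairwise fraction is exactly the empirical AUC of the scoring rule on this sample, so the two numbers agree up to Monte Carlo error.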