8  CS 189 Discussion 08: Gradient Descent II

8.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3–4 PM @ Hildebrand Hall B51
Office Hours Tuesdays, 11–1 PM @ Cory Courtyard

8.1 Gradient Descent Convergence

Let \(f:\mathbb{R}^n\to\mathbb{R}\) be smooth. Assume there exist constants \(\mu,L > 0\), where \(\mu \leq L\), such that \(f\) is \(\mu\)-strongly-convex and has \(L\)-Lipschitz gradients, i.e., \[\mu I_{n\times n} \preceq \nabla^2 f(x) \preceq L I_{n\times n}\qquad\text{for all $x\in\mathbb{R}^n$}.\] It can be shown that there exists a local minimizer \(x^*\) of \(f\).

Gradient descent generates a sequence of points \(\{x_k\}_{k\geq 0}\) according to the update rule \[x_{k+1} = x_k - \alpha \nabla f(x_k),\] where \(\alpha\) is a fixed step size. Let \(x_0\) be a fixed initial iterate.
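The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not course-provided code; the quadratic objective and all names here are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_iters):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Example: f(x) = ||x||^2 / 2, so grad f(x) = x and the minimizer is x* = 0.
x_final = gradient_descent(lambda x: x, x0=[4.0, -2.0], alpha=0.5, num_iters=50)
```

Each iteration moves opposite the gradient, scaled by \(\alpha\); for this toy objective every step halves the distance to the origin.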

Important: Strong Convexity & Smoothness

Strong Convexity (μ-strong convexity)

  • Function curves upward everywhere (bowl-shaped).
  • Guarantees a unique global minimum.
  • Mathematically: \[ \mu I \preceq \nabla^2 f(x) \]
  • Prevents flat regions → leads to faster and more reliable convergence.

Smoothness (L-Lipschitz Gradients)

  • Gradients do not change too abruptly.
  • Bounds curvature from above: \[ \nabla^2 f(x) \preceq L I \]
  • Ensures stable gradient descent updates (avoids sudden large jumps).
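For a concrete instance of both bounds, consider a quadratic \(f(x) = \frac12 x^\top A x\) with symmetric positive definite \(A\): the Hessian is \(A\) everywhere, so \(\mu\) and \(L\) are just its extreme eigenvalues. The matrix below is a made-up example.

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x has constant Hessian A,
# so mu and L can be taken as the extreme eigenvalues of A.
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
eigvals = np.linalg.eigvalsh(A)      # sorted in ascending order
mu, L = eigvals[0], eigvals[-1]
# mu * I <= A <= L * I holds, with equality along the extreme eigenvectors.
```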

8.1.1 (a)

Show that \(x^*\) is the unique local minimizer of \(f\).

Answer

Let \(x'\) be some (local) minimizer. By way of contradiction, assume that \(x' \neq x^*\).

We reduce the problem to 1D by looking at the line segment between \(x^*\) and \(x'\). If both are minimizers, then along this line the function should behave “flat” at both endpoints.

So we study \(g(t)\), which tracks \(f\) along that line.

Now, define \(g : [0,1]\to\mathbb{R}\) by \[g(t) = f(x^* + t(x' - x^*))\qquad\text{for $t\in [0,1]$}.\] Then, \[g'(t) = \nabla f(x^* + t(x' - x^*))^\top (x' - x^*),\] where \(g'(0) = g'(1) = 0\), and \[g''(t) = (x' - x^*)^\top \nabla^2 f(x^* + t(x' - x^*)) (x' - x^*) \geq \mu \|x' - x^*\|_2^2.\]

Strong convexity means curvature is strictly positive in every direction.

So along this line, the second derivative \(g''(t)\) is strictly positive → the function is strictly convex in 1D.

That means it cannot have two flat minima unless the points are identical.

Since \(x' \neq x^*\), we have \(\|x' - x^*\|_2 > 0\), so \(g''(t) \geq \mu \|x' - x^*\|_2^2 > 0\) for all \(t\in [0,1]\). Thus \[g'(1) - g'(0) = \int_0^1 g''(t)\, dt > 0,\] contradicting \(g'(0) = g'(1) = 0\). Thus, \(x^*\) is the unique local minimizer.

If both endpoints were minima, we would have \(g'(0)=g'(1)=0\).

But strict convexity implies the derivative must strictly increase → impossible for both endpoints to have zero slope.

Contradiction ⇒ only one minimizer exists.
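The 1D restriction argument can be checked numerically on a quadratic. The Hessian and the two candidate points below are hypothetical, chosen only to illustrate the bound \(g''(t) \geq \mu\|x'-x^*\|_2^2\).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # symmetric positive definite Hessian
mu = np.linalg.eigvalsh(A)[0]        # strong-convexity constant

x_star = np.array([0.0, 0.0])        # minimizer of f(x) = 0.5 x^T A x
x_prime = np.array([1.0, -2.0])      # hypothetical second "minimizer"
d = x_prime - x_star

# g(t) = f(x* + t d)  =>  g''(t) = d^T A d, constant in t for a quadratic.
g_dd = d @ A @ d
lower_bound = mu * (d @ d)
# g'' >= mu * ||d||^2 > 0, so g' cannot vanish at both endpoints.
```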


8.1.2 (b)

Show that if \[0 < \alpha < \frac 2 L,\] then gradient descent converges to \(x^*\).

Answer

Define the update mapping \(T:\mathbb{R}^n\to\mathbb{R}^n\) such that \[T(x) = x - \alpha \nabla f(x).\]

Instead of thinking about iterations, we think of GD as repeatedly applying a function \(T\).

If \(T\) is a contraction (shrinks distances), then repeatedly applying it will converge to a fixed point.

Since \(J_T(x) = I - \alpha \nabla^2 f(x)\), the Hessian bounds give \[(1 - \alpha L) I \preceq J_T(x) \preceq (1 - \alpha\mu)I,\] so all eigenvalues of \(J_T(x)\) have magnitude at most \(M = \max\{|1 - \alpha \mu|, |1 - \alpha L|\}\), where \(M < 1\).

Start from: \[ \mu I \preceq \nabla^2 f(x) \preceq L I \]

Multiply by \(-\alpha\) (flips inequality): \[ -\alpha L I \preceq -\alpha \nabla^2 f(x) \preceq -\alpha \mu I \]

Add \(I\) everywhere: \[ (1 - \alpha L) I \preceq I - \alpha \nabla^2 f(x) \preceq (1 - \alpha \mu) I \]

Recognize: \[ J_T(x) = I - \alpha \nabla^2 f(x) \]

So we get the bound.

If \(A \preceq B \preceq C\) for symmetric matrices, then all eigenvalues of \(B\) lie between those of \(A\) and \(C\).

So every eigenvalue \(\lambda\) of \(J_T(x)\) satisfies: \[ 1 - \alpha L \leq \lambda \leq 1 - \alpha \mu \]

Thus: \[ |\lambda| \leq \max\{|1 - \alpha \mu|, |1 - \alpha L|\} = M \]

We assume: \[ 0 < \alpha < \frac{2}{L} \]

Check endpoints:

  • \(|1 - \alpha L| < 1\) because \(0 < \alpha L < 2\), so \(-1 < 1 - \alpha L < 1\)
  • Since \(\mu \le L\), we also get \(|1 - \alpha \mu| < 1\)

So both terms are strictly less than 1 ⇒ \(M < 1\).

Interpretation: every direction shrinks → contraction.

It follows that for all \(x_1,x_2\in\mathbb{R}^n\), \[\|T(x_2) - T(x_1)\|_2 \leq M \|x_2 - x_1\|_2.\] As \(T(x^*) = x^*\), it follows from induction that \[\|x_k - x^*\|_2 \leq M^k \|x_0 - x^*\|_2.\]

Using the vector mean value theorem: \[ T(x_2) - T(x_1) = \left(\int_0^1 J_T(x_1 + t(x_2 - x_1)) dt \right)(x_2 - x_1) \]

Taking norms: \[ \|T(x_2) - T(x_1)\| \leq \sup_x \|J_T(x)\| \cdot \|x_2 - x_1\| \]

Now:

  • eigenvalues of \(J_T(x)\) are bounded by \(M\)
  • for symmetric matrices, operator norm = max eigenvalue magnitude

So: \[ \|J_T(x)\| \le M \]

⇒ contraction inequality.

As \(k\to\infty\), \(M^k\to 0\), thus \(x_k\to x^*\), proving convergence.
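The contraction bound \(\|x_k - x^*\|_2 \leq M^k \|x_0 - x^*\|_2\) can be verified on a small diagonal quadratic. The constants below are arbitrary choices satisfying \(0 < \alpha < 2/L\).

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x with A = diag(mu, L); minimizer x* = 0.
mu, L = 1.0, 10.0
A = np.diag([mu, L])
alpha = 0.15                          # satisfies 0 < alpha < 2/L = 0.2
M = max(abs(1 - alpha * mu), abs(1 - alpha * L))

x = np.array([5.0, 5.0])
x0_dist = np.linalg.norm(x)
for k in range(1, 51):
    x = x - alpha * (A @ x)           # T(x) = x - alpha * grad f(x)
    # Contraction bound from the text: ||x_k - x*|| <= M^k ||x0 - x*||
    assert np.linalg.norm(x) <= M**k * x0_dist + 1e-12
```

Every iterate stays inside the geometrically shrinking ball predicted by the bound, so the sequence converges to \(x^* = 0\).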

8.1.3 (c)

For fixed \(\epsilon > 0\), show that \[k\in \mathcal O\left(\log \frac{1}{\epsilon}\right)\] iterations are needed to achieve \[\|x_k - x^*\|_2 \leq \epsilon.\]

Answer By Part (b), it is sufficient to bound \(k\) to achieve \[M^k \|x_0 - x^*\|_2 \leq \epsilon.\] Since \(M < 1\), taking logarithms and isolating \(k\) gives \[k \geq \frac{\log(\|x_0 - x^*\|_2 / \epsilon)}{\log(1/M)},\] so \(k\in \mathcal O\left(\log \frac{1}{\epsilon}\right)\) iterations suffice.
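Solving \(M^k \|x_0 - x^*\|_2 \leq \epsilon\) for the smallest integer \(k\) can be done directly; the constants below are illustrative values, not taken from the problem.

```python
import math

def iterations_needed(M, dist0, eps):
    """Smallest k with M**k * dist0 <= eps, assuming 0 < M < 1."""
    if dist0 <= eps:
        return 0
    return math.ceil(math.log(dist0 / eps) / math.log(1.0 / M))

# Example: contraction factor 0.85, starting distance 10, target 1e-6.
k = iterations_needed(M=0.85, dist0=10.0, eps=1e-6)
```

Halving \(\epsilon\) only adds a constant number of iterations, which is exactly the \(\mathcal O(\log \frac{1}{\epsilon})\) behavior.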

8.1.4 (d)

Compute the optimal step size \(\alpha^*\).

Answer

The contraction factor \(M = \max\{|1 - \alpha\mu|, |1 - \alpha L|\}\) is minimized when the two terms balance, i.e., \(1 - \alpha\mu = \alpha L - 1\), which yields \[\alpha^* = \frac{2}{L + \mu}.\]

We are balancing two “worst-case directions”:

  • one controlled by μ (flatter direction)
  • one controlled by L (steeper direction)

Optimal α equalizes the two errors so neither dominates.

Here, \[M^* = \frac{\kappa - 1}{\kappa + 1},\] where \[\kappa = \frac{L}{\mu}\] denotes the condition number.

Condition number κ measures how “stretched” the problem is.

  • κ ≈ 1 → fast convergence (nice round bowl)
  • κ large → slow convergence (elongated bowl)

This explains why ill-conditioned problems are hard for GD.
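A quick comparison makes the effect of \(\kappa\) concrete: on a badly conditioned problem, the optimal step \(\alpha^* = 2/(L+\mu)\) beats a common default like \(\alpha = 1/L\), but both rates degrade toward 1 as \(\kappa\) grows. The constants are illustrative.

```python
mu, L = 1.0, 100.0                    # badly conditioned: kappa = 100
kappa = L / mu
alpha_naive = 1.0 / L                 # a common default step size
alpha_star = 2.0 / (L + mu)           # optimal step from above

def rate(alpha):
    """Contraction factor M for a given step size."""
    return max(abs(1 - alpha * mu), abs(1 - alpha * L))

M_naive = rate(alpha_naive)           # = 1 - mu/L = 0.99
M_star = rate(alpha_star)             # = (kappa - 1)/(kappa + 1) ~ 0.98
```

Both factors are close to 1, so convergence is slow either way; the elongated bowl intuition above is exactly what these numbers measure.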

8.2 Universal Approximation

Let \(f:[0,1]\to\mathbb{R}\) be continuous. It can be shown that for every \(\epsilon > 0\), there exists \(\delta > 0\), such that for all \(x,y\in [0,1]\), if \(|x-y| < \delta\), then \(|f(x) - f(y)| < \epsilon\).

For fixed \(\epsilon > 0\), show that there exists a function \(F:[0,1]\to\mathbb{R}\), where \[F(x) = b_0 + \sum_{i=1}^N a_i \text{ReLU} (w_ix + b_i)\] for some \(N\in\mathbb{N}\) and \(a_i,w_i,b_i\in\mathbb{R}\), such that \[\max_{x\in [0,1]} |f(x) - F(x)| < \epsilon.\] Here, \(\text{ReLU}(x) = \max\{0, x\}\).


Answer

Choose \(\delta > 0\) such that for all \(x,y\in [0,1]\), if \(|x-y| < \delta\), then \(|f(x) - f(y)| < \epsilon\).

Continuity ⇒ function doesn’t jump abruptly.

So we can approximate it locally with simple functions (like lines).

Choose a partition \[0 = x_0 < x_1 < \cdots < x_{n-1} < x_n = 1\] for some \(n\in\mathbb{N}\), such that \[x_k - x_{k-1} < \delta\qquad\text{for all $k\in [n]$.}\] Define the linear interpolation \[p(x) = f(x_{k-1}) + m_k (x - x_{k-1})\qquad\text{for $x\in [x_{k-1}, x_k]$, $k\in [n]$},\] where \[m_k = \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}.\] For \(x\in [x_{k-1}, x_k]\), \(p(x)\) is a convex combination of \(f(x_{k-1})\) and \(f(x_k)\), each of which is within \(\epsilon\) of \(f(x)\) since \(|x - x_{k-1}|, |x_k - x| < \delta\); hence \(|f(x) - p(x)| < \epsilon\). Applying this across all intervals \([x_{k-1}, x_k]\) yields that \[\max_{x\in [0,1]}|f(x) - p(x)| < \epsilon.\] Now, define \[F(x) = C + m_1 \text{ReLU}(x) + \sum_{k=1}^{n-1} (m_{k+1} - m_k) \text{ReLU}(x - x_k)\] for some \(C\in\mathbb{R}\).

Key idea: ReLU builds piecewise linear functions.

Each ReLU introduces a “kink” (change in slope).

By stacking them, we exactly reproduce the slopes of the linear interpolation.

Note that both \(p\) and \(F\) have the same derivative, as for all \(x\in (x_{k-1}, x_k)\), \[p'(x) = F'(x) = m_k.\] It follows from continuity that by choosing \(C\) such that \(F(x_0) = f(x_0)\), we have that \[p(x) = F(x)\qquad\text{for all $x\in [0,1]$},\] where \(F\) has the desired form.
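The construction above can be carried out numerically. The sketch below builds \(F\) from the slopes of the linear interpolant and checks the uniform error against a target function; \(f = \sin\) and the knot spacing are arbitrary choices for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_interpolant(f, knots):
    """Build F(x) = C + m_1 ReLU(x) + sum_k (m_{k+1} - m_k) ReLU(x - x_k),
    matching the linear interpolation of f on the given knots (knots[0] = 0)."""
    knots = np.asarray(knots, dtype=float)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)   # m_1, ..., m_n
    C = vals[0]                               # makes F(x_0) = f(x_0)

    def F(x):
        out = C + slopes[0] * relu(x - knots[0])
        for k in range(1, len(slopes)):
            # Each ReLU adds a kink that changes the slope by m_{k+1} - m_k.
            out = out + (slopes[k] - slopes[k - 1]) * relu(x - knots[k])
        return out

    return F

f = np.sin
knots = np.linspace(0.0, 1.0, 101)    # spacing 0.01, playing the role of delta
F = relu_interpolant(f, knots)
xs = np.linspace(0.0, 1.0, 1000)
max_err = np.max(np.abs(f(xs) - F(xs)))
```

Each ReLU term is zero to the left of its knot and changes the slope by \(m_{k+1} - m_k\) to the right, so \(F\) reproduces the piecewise-linear interpolant exactly; the uniform error is then driven by the knot spacing alone.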