8  CS 189 Discussion 08: Gradient Descent II

8.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3–4 PM @ Hildebrand Hall B51
Office Hours Tuesdays, 11–1 PM @ Cory Courtyard

8.1 Gradient Descent Convergence

Let \(f:\mathbb{R}^n\to\mathbb{R}\) be smooth. Assume there exist constants \(\mu,L > 0\), where \(\mu \leq L\), such that \(f\) is \(\mu\)-strongly-convex and has \(L\)-Lipschitz gradients, i.e., \[\mu I_{n\times n} \preceq \nabla^2 f(x) \preceq L I_{n\times n}\qquad\text{for all $x\in\mathbb{R}^n$}.\] It can be shown that there exists a local minimizer \(x^*\) of \(f\).

Gradient descent generates a sequence of points \(\{x_k\}_{k\geq 0}\) according to the update rule \[x_{k+1} = x_k - \alpha \nabla f(x_k),\] where \(\alpha\) is a fixed step size. Let \(x_0\) be a fixed initial iterate.
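The update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not course-provided code; the quadratic objective and all names here are assumptions chosen for the example.

```python
import numpy as np

def gradient_descent(grad_f, x0, alpha, num_iters):
    """Iterate x_{k+1} = x_k - alpha * grad_f(x_k) with a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Example: f(x) = ||x||^2 / 2, so grad f(x) = x and the minimizer is x* = 0.
x_final = gradient_descent(lambda x: x, x0=[4.0, -2.0], alpha=0.5, num_iters=50)
```

Each iteration moves opposite the gradient, scaled by \(\alpha\); for this toy objective every step halves the distance to the origin.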

Important: Strong Convexity & Smoothness

Strong Convexity (μ-strong convexity)

  • Function curves upward everywhere (bowl-shaped).
  • Guarantees a unique global minimum.
  • Mathematically: \[ \mu I \preceq \nabla^2 f(x) \]
  • Prevents flat regions → leads to faster and more reliable convergence.

Smoothness (L-Lipschitz Gradients)

  • Gradients do not change too abruptly.
  • Bounds curvature from above: \[ \nabla^2 f(x) \preceq L I \]
  • Ensures stable gradient descent updates (avoids sudden large jumps).
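For a concrete instance of both bounds, consider a quadratic \(f(x) = \frac12 x^\top A x\) with symmetric positive definite \(A\): the Hessian is \(A\) everywhere, so \(\mu\) and \(L\) are just its extreme eigenvalues. The matrix below is a made-up example.

```python
import numpy as np

# Quadratic f(x) = 0.5 * x^T A x has constant Hessian A,
# so mu and L can be taken as the extreme eigenvalues of A.
A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
eigvals = np.linalg.eigvalsh(A)      # sorted in ascending order
mu, L = eigvals[0], eigvals[-1]
# mu * I <= A <= L * I holds, with equality along the extreme eigenvectors.
```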

8.1.1 (a)

Show that \(x^*\) is the unique local minimizer of \(f\).

Answer

Let \(x'\) be some (local) minimizer. By way of contradiction, assume that \(x' \neq x^*\).

We reduce the problem to 1D by looking at the line segment between \(x^*\) and \(x'\). If both are minimizers, then along this line the function should behave “flat” at both endpoints.

So we study \(g(t)\), which tracks \(f\) along that line.

Now, define \(g : [0,1]\to\mathbb{R}\) by \[g(t) = f(x^* + t(x' - x^*))\qquad\text{for $t\in [0,1]$}.\] Then, \[g'(t) = \nabla f(x^* + t(x' - x^*))^\top (x' - x^*),\] where \(g'(0) = g'(1) = 0\), and \[g''(t) = (x' - x^*)^\top \nabla^2 f(x^* + t(x' - x^*)) (x' - x^*) \geq \mu \|x' - x^*\|_2^2.\]

Strong convexity means curvature is strictly positive in every direction.

So along this line, the second derivative \(g''(t)\) is strictly positive → the function is strictly convex in 1D.

That means it cannot have two flat minima unless the points are identical.

Since \(x' \neq x^*\), we have \(\|x' - x^*\|_2 > 0\), so \(g''(t) \geq \mu \|x' - x^*\|_2^2 > 0\) for all \(t\in [0,1]\). Thus \[g'(1) - g'(0) = \int_0^1 g''(t)\, dt > 0,\] contradicting \(g'(0) = g'(1) = 0\). Thus, \(x^*\) is the unique local minimizer.

If both endpoints were minima, we would have \(g'(0)=g'(1)=0\).

But strict convexity implies the derivative must strictly increase → impossible for both endpoints to have zero slope.

Contradiction ⇒ only one minimizer exists.
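The 1D restriction argument can be checked numerically on a quadratic. The Hessian and the two candidate points below are hypothetical, chosen only to illustrate the bound \(g''(t) \geq \mu\|x'-x^*\|_2^2\).

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])           # symmetric positive definite Hessian
mu = np.linalg.eigvalsh(A)[0]        # strong-convexity constant

x_star = np.array([0.0, 0.0])        # minimizer of f(x) = 0.5 x^T A x
x_prime = np.array([1.0, -2.0])      # hypothetical second "minimizer"
d = x_prime - x_star

# g(t) = f(x* + t d)  =>  g''(t) = d^T A d, constant in t for a quadratic.
g_dd = d @ A @ d
lower_bound = mu * (d @ d)
# g'' >= mu * ||d||^2 > 0, so g' cannot vanish at both endpoints.
```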


8.1.2 (b)

Show that if \[0 < \alpha < \frac 2 L,\] then gradient descent converges to \(x^*\).

Answer

Define the update mapping \(T:\mathbb{R}^n\to\mathbb{R}^n\) such that \[T(x) = x - \alpha \nabla f(x).\]

Instead of thinking about iterations, we think of GD as repeatedly applying a function \(T\).

If \(T\) is a contraction (shrinks distances), then repeatedly applying it will converge to a fixed point.

Since \(J_T(x) = I - \alpha \nabla^2 f(x)\), the Hessian bounds give \[(1 - \alpha L) I \preceq J_T(x) \preceq (1 - \alpha\mu)I,\] so all eigenvalues of \(J_T(x)\) have magnitude at most \(M = \max\{|1 - \alpha \mu|, |1 - \alpha L|\}\), where \(M < 1\).

Start from: \[ \mu I \preceq \nabla^2 f(x) \preceq L I \]

Multiply by \(-\alpha\) (flips inequality): \[ -\alpha L I \preceq -\alpha \nabla^2 f(x) \preceq -\alpha \mu I \]

Add \(I\) everywhere: \[ (1 - \alpha L) I \preceq I - \alpha \nabla^2 f(x) \preceq (1 - \alpha \mu) I \]

Recognize: \[ J_T(x) = I - \alpha \nabla^2 f(x) \]

So we get the bound.

If \(A \preceq B \preceq C\) for symmetric matrices, then all eigenvalues of \(B\) lie between those of \(A\) and \(C\).

So every eigenvalue \(\lambda\) of \(J_T(x)\) satisfies: \[ 1 - \alpha L \leq \lambda \leq 1 - \alpha \mu \]

Thus: \[ |\lambda| \leq \max\{|1 - \alpha \mu|, |1 - \alpha L|\} = M \]

We assume: \[ 0 < \alpha < \frac{2}{L} \]

Check endpoints:

  • \(|1 - \alpha L| < 1\) because \(0 < \alpha L < 2\), so \(-1 < 1 - \alpha L < 1\)
  • Since \(\mu \le L\), we also get \(|1 - \alpha \mu| < 1\)

So both terms are strictly less than 1 ⇒ \(M < 1\).

Interpretation: every direction shrinks → contraction.

It follows that for all \(x_1,x_2\in\mathbb{R}^n\), \[\|T(x_2) - T(x_1)\|_2 \leq M \|x_2 - x_1\|_2.\] As \(T(x^*) = x^*\), it follows from induction that \[\|x_k - x^*\|_2 \leq M^k \|x_0 - x^*\|_2.\]

Using the vector mean value theorem: \[ T(x_2) - T(x_1) = \left(\int_0^1 J_T(x_1 + t(x_2 - x_1)) dt \right)(x_2 - x_1) \]

Taking norms: \[ \|T(x_2) - T(x_1)\| \leq \sup_x \|J_T(x)\| \cdot \|x_2 - x_1\| \]

Now:

  • eigenvalues of \(J_T(x)\) are bounded by \(M\)
  • for symmetric matrices, operator norm = max eigenvalue magnitude

So: \[ \|J_T(x)\| \le M \]

⇒ contraction inequality.

As \(k\to\infty\), \(M^k\to 0\), thus \(x_k\to x^*\), proving convergence.
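The contraction bound \(\|x_k - x^*\|_2 \leq M^k \|x_0 - x^*\|_2\) can be verified on a small diagonal quadratic. The constants below are arbitrary choices satisfying \(0 < \alpha < 2/L\).

```python
import numpy as np

# Quadratic f(x) = 0.5 x^T A x with A = diag(mu, L); minimizer x* = 0.
mu, L = 1.0, 10.0
A = np.diag([mu, L])
alpha = 0.15                          # satisfies 0 < alpha < 2/L = 0.2
M = max(abs(1 - alpha * mu), abs(1 - alpha * L))

x = np.array([5.0, 5.0])
x0_dist = np.linalg.norm(x)
for k in range(1, 51):
    x = x - alpha * (A @ x)           # T(x) = x - alpha * grad f(x)
    # Contraction bound from the text: ||x_k - x*|| <= M^k ||x0 - x*||
    assert np.linalg.norm(x) <= M**k * x0_dist + 1e-12
```

Every iterate stays inside the geometrically shrinking ball predicted by the bound, so the sequence converges to \(x^* = 0\).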

8.1.3 (c)

For fixed \(\epsilon > 0\), show that \[k\in \mathcal O\left(\log \frac{1}{\epsilon}\right)\] iterations are needed to achieve \[\|x_k - x^*\|_2 \leq \epsilon.\]

Answer By Part (b), it is sufficient to bound \(k\) to achieve \[M^k \|x_0 - x^*\|_2 \leq \epsilon.\] Since \(M < 1\), taking logarithms and isolating \(k\) gives \[k \geq \frac{\log(\|x_0 - x^*\|_2 / \epsilon)}{\log(1/M)},\] so \(k\in \mathcal O\left(\log \frac{1}{\epsilon}\right)\) iterations suffice.
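Solving \(M^k \|x_0 - x^*\|_2 \leq \epsilon\) for the smallest integer \(k\) can be done directly; the constants below are illustrative values, not taken from the problem.

```python
import math

def iterations_needed(M, dist0, eps):
    """Smallest k with M**k * dist0 <= eps, assuming 0 < M < 1."""
    if dist0 <= eps:
        return 0
    return math.ceil(math.log(dist0 / eps) / math.log(1.0 / M))

# Example: contraction factor 0.85, starting distance 10, target 1e-6.
k = iterations_needed(M=0.85, dist0=10.0, eps=1e-6)
```

Halving \(\epsilon\) only adds a constant number of iterations, which is exactly the \(\mathcal O(\log \frac{1}{\epsilon})\) behavior.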

8.1.4 (d)

Compute the optimal step size \(\alpha^*\).

Answer

The contraction factor \(M = \max\{|1 - \alpha\mu|, |1 - \alpha L|\}\) is minimized when the two terms balance, i.e., \(1 - \alpha\mu = \alpha L - 1\), which yields \[\alpha^* = \frac{2}{L + \mu}.\]

We are balancing two “worst-case directions”:

  • one controlled by μ (flatter direction)
  • one controlled by L (steeper direction)

Optimal α equalizes the two errors so neither dominates.

Here, \[M^* = \frac{\kappa - 1}{\kappa + 1},\] where \[\kappa = \frac{L}{\mu}\] denotes the condition number.

Condition number κ measures how “stretched” the problem is.

  • κ ≈ 1 → fast convergence (nice round bowl)
  • κ large → slow convergence (elongated bowl)

This explains why ill-conditioned problems are hard for GD.
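A quick comparison makes the effect of \(\kappa\) concrete: on a badly conditioned problem, the optimal step \(\alpha^* = 2/(L+\mu)\) beats a common default like \(\alpha = 1/L\), but both rates degrade toward 1 as \(\kappa\) grows. The constants are illustrative.

```python
mu, L = 1.0, 100.0                    # badly conditioned: kappa = 100
kappa = L / mu
alpha_naive = 1.0 / L                 # a common default step size
alpha_star = 2.0 / (L + mu)           # optimal step from above

def rate(alpha):
    """Contraction factor M for a given step size."""
    return max(abs(1 - alpha * mu), abs(1 - alpha * L))

M_naive = rate(alpha_naive)           # = 1 - mu/L = 0.99
M_star = rate(alpha_star)             # = (kappa - 1)/(kappa + 1) ~ 0.98
```

Both factors are close to 1, so convergence is slow either way; the elongated bowl intuition above is exactly what these numbers measure.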

8.2 Universal Approximation

Let \(f:[0,1]\to\mathbb{R}\) be continuous. It can be shown that for every \(\epsilon > 0\), there exists \(\delta > 0\), such that for all \(x,y\in [0,1]\), if \(|x-y| < \delta\), then \(|f(x) - f(y)| < \epsilon\).

For fixed \(\epsilon > 0\), show that there exists a function \(F:[0,1]\to\mathbb{R}\), where \[F(x) = b_0 + \sum_{i=1}^N a_i \text{ReLU} (w_ix + b_i)\] for some \(N\in\mathbb{N}\) and \(a_i,w_i,b_i\in\mathbb{R}\), such that \[\max_{x\in [0,1]} |f(x) - F(x)| < \epsilon.\] Here, \(\text{ReLU}(x) = \max\{0, x\}\).


Answer

Choose \(\delta > 0\) such that for all \(x,y\in [0,1]\), if \(|x-y| < \delta\), then \(|f(x) - f(y)| < \epsilon\).

Continuity ⇒ function doesn’t jump abruptly.

So we can approximate it locally with simple functions (like lines).

Choose a partition \[0 = x_0 < x_1 < \cdots < x_{n-1} < x_n = 1\] for some \(n\in\mathbb{N}\), such that \[x_k - x_{k-1} < \delta\qquad\text{for all $k\in [n]$.}\] Define the linear interpolation \[p(x) = f(x_{k-1}) + m_k (x - x_{k-1})\qquad\text{for $x\in [x_{k-1}, x_k]$, $k\in [n]$},\] where \[m_k = \frac{f(x_k) - f(x_{k-1})}{x_k - x_{k-1}}.\] For \(x\in [x_{k-1}, x_k]\), \(p(x)\) is a convex combination of \(f(x_{k-1})\) and \(f(x_k)\), each of which is within \(\epsilon\) of \(f(x)\) since \(|x - x_{k-1}|, |x_k - x| < \delta\); hence \(|f(x) - p(x)| < \epsilon\). Applying this across all intervals \([x_{k-1}, x_k]\) yields that \[\max_{x\in [0,1]}|f(x) - p(x)| < \epsilon.\] Now, define \[F(x) = C + m_1 \text{ReLU}(x) + \sum_{k=1}^{n-1} (m_{k+1} - m_k) \text{ReLU}(x - x_k)\] for some \(C\in\mathbb{R}\).

Key idea: ReLU builds piecewise linear functions.

Each ReLU introduces a “kink” (change in slope).

By stacking them, we exactly reproduce the slopes of the linear interpolation.

Note that both \(p\) and \(F\) have the same derivative, as for all \(x\in (x_{k-1}, x_k)\), \[p'(x) = F'(x) = m_k.\] It follows from continuity that by choosing \(C\) such that \(F(x_0) = f(x_0)\), we have that \[p(x) = F(x)\qquad\text{for all $x\in [0,1]$},\] where \(F\) has the desired form.
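The construction above can be carried out numerically. The sketch below builds \(F\) from the slopes of the linear interpolant and checks the uniform error against a target function; \(f = \sin\) and the knot spacing are arbitrary choices for illustration.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def relu_interpolant(f, knots):
    """Build F(x) = C + m_1 ReLU(x) + sum_k (m_{k+1} - m_k) ReLU(x - x_k),
    matching the linear interpolation of f on the given knots (knots[0] = 0)."""
    knots = np.asarray(knots, dtype=float)
    vals = f(knots)
    slopes = np.diff(vals) / np.diff(knots)   # m_1, ..., m_n
    C = vals[0]                               # makes F(x_0) = f(x_0)

    def F(x):
        out = C + slopes[0] * relu(x - knots[0])
        for k in range(1, len(slopes)):
            # Each ReLU adds a kink that changes the slope by m_{k+1} - m_k.
            out = out + (slopes[k] - slopes[k - 1]) * relu(x - knots[k])
        return out

    return F

f = np.sin
knots = np.linspace(0.0, 1.0, 101)    # spacing 0.01, playing the role of delta
F = relu_interpolant(f, knots)
xs = np.linspace(0.0, 1.0, 1000)
max_err = np.max(np.abs(f(xs) - F(xs)))
```

Each ReLU term is zero to the left of its knot and changes the slope by \(m_{k+1} - m_k\) to the right, so \(F\) reproduces the piecewise-linear interpolant exactly; the uniform error is then driven by the knot spacing alone.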