3  CS 189 Discussion 03: Gaussians and K-Means

3.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3–4 PM @ Hildebrand Hall B51
Office Hours Tuesdays, 11–1 PM @ Cory Courtyard

3.1 Linear Algebra

A symmetric matrix \(\Sigma\in \mathbb{R}^{n\times n}\) is positive definite if for all \(x\in\mathbb{R}^n\) such that \(x\neq 0\), \(x^\top \Sigma x > 0\). Show that \(\Sigma\) is positive definite if and only if there exists an invertible matrix \(A\in\mathbb{R}^{n\times n}\) such that \(\Sigma = AA^\top\).

Answer

Since \(\Sigma\) is symmetric, by the Spectral Theorem, there exist an orthogonal matrix \(V\in\mathbb{R}^{n\times n}\) of eigenvectors and a diagonal matrix \(\Lambda\in\mathbb{R}^{n\times n}\) of eigenvalues, such that \(\Sigma = V\Lambda V^\top\).

  • Assume \(\Sigma\) is positive definite. Thus, all eigenvalues \(\lambda_1,\dots,\lambda_n\) of \(\Sigma\) are strictly positive (refer to Discussion 2). Now, let \(B = \text{diag}(\sqrt{\lambda_1},\dots,\sqrt{\lambda_n})\). Then, \(\Lambda = BB^\top\), so \[\Sigma = V\Lambda V^\top = VBB^\top V^\top = (VB)(VB)^\top,\] where \(VB\) is invertible since both \(V\) and \(B\) are invertible.

If \(\Sigma\) is positive definite, then \(x^\top \Sigma x > 0\) for all \(x\neq 0\). For symmetric matrices, this is equivalent to all eigenvalues being strictly positive: \(\lambda_1,\dots,\lambda_n > 0\).

  • Assume \(\Sigma = AA^\top\) for some invertible matrix \(A\in\mathbb{R}^{n\times n}\). For \(x\neq 0\), \[x^\top \Sigma x = x^\top AA^\top x = (A^\top x)^\top (A^\top x) = \|A^\top x\|_2^2>0,\] where we use that \(A^\top x\neq 0\) since \(A^\top\) is invertible.

Because \(A^\top\) is invertible, it cannot map a nonzero vector to \(0\), so \(A^\top x \neq 0\) when \(x\neq 0\). Hence \(\|A^\top x\|_2^2 > 0\).

Therefore \(x^\top \Sigma x > 0\) for all \(x\neq 0\), so \(\Sigma\) is positive definite.
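Both directions of the proof can be sanity-checked numerically. This is a small NumPy sketch (the matrix size and seed are arbitrary): it builds a positive definite \(\Sigma = AA^\top\), verifies its eigenvalues are strictly positive, and then reconstructs a factor \(VB\) from the spectral decomposition as in the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a positive definite Sigma = A A^T from a random invertible A.
A = rng.standard_normal((4, 4))
Sigma = A @ A.T

# Forward direction: all eigenvalues of Sigma are strictly positive.
eigvals = np.linalg.eigvalsh(Sigma)
assert np.all(eigvals > 0)

# Reverse direction: reconstruct a factor VB from the spectral decomposition
# Sigma = V Lambda V^T with B = diag(sqrt(lambda_i)).
lam, V = np.linalg.eigh(Sigma)
B = np.diag(np.sqrt(lam))
factor = V @ B                      # plays the role of A in the proof
assert np.allclose(factor @ factor.T, Sigma)
```

In practice one would use `np.linalg.cholesky(Sigma)`, which produces a lower-triangular factor with the same property.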

3.2 Multivariate Gaussians

Important: Gaussian (Normal) Distribution Essentials

The Gaussian (normal) distribution is one of the most important concepts in probability and machine learning. It appears in modeling noise, measurement error, feature distributions, and as a key assumption behind many algorithms.

Parameters and Intuition

  • \(\mu\) (mean / expectation):
    • The center of the distribution.
    • Controls where the bell curve is located.
    • In ML, this often represents the “typical” or average value of a feature.
  • \(\sigma^2\) (variance):
    • Measures how spread out the distribution is.
    • Larger \(\sigma^2\) → wider, flatter curve.
    • Smaller \(\sigma^2\) → narrower, more concentrated curve.
    • \(\sigma = \sqrt{\sigma^2}\) is the standard deviation.

Probability Density Function (PDF)

  • A 1D Gaussian has the form: \[ p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right). \]
  • The \(\exp(\cdot)\) means: \[ \exp(z) = e^z, \] where \(e \approx 2.718\) (Euler’s number).
  • Key intuition:
    • Points closer to \(\mu\) are more likely.
    • Probability decreases smoothly as you move away from the mean.
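The density formula above is straightforward to evaluate directly; this minimal sketch checks the key intuition that points closer to \(\mu\) have higher density (the helper name `gaussian_pdf` is ours, not from any library):

```python
import numpy as np

def gaussian_pdf(x, mu=0.0, sigma2=1.0):
    """1D Gaussian density N(mu, sigma2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# The density peaks at the mean and decays symmetrically away from it.
assert gaussian_pdf(0.0) > gaussian_pdf(1.0) > gaussian_pdf(2.0)
# Standard normal height at the mean is 1/sqrt(2*pi) ≈ 0.399.
assert np.isclose(gaussian_pdf(0.0), 1 / np.sqrt(2 * np.pi))
```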

Standard Normal Distribution

  • The standard normal is: \[ \mathcal{N}(0,1), \] with mean \(0\) and variance \(1\).
  • Any Gaussian can be converted into a standard normal: \[ Z = \frac{X - \mu}{\sigma}. \]
  • This idea is used in:
    • Feature normalization
    • Z-scores
    • Statistical testing
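Standardization is exactly this \(Z = (X-\mu)/\sigma\) transformation applied with sample estimates. A quick sketch (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=5.0, scale=2.0, size=100_000)   # X ~ N(5, 4)

# Standardize: Z = (X - mu) / sigma, using the sample mean and std.
z = (x - x.mean()) / x.std()

# After standardization the sample has mean 0 and standard deviation 1.
assert np.isclose(z.mean(), 0.0, atol=1e-10)
assert np.isclose(z.std(), 1.0, atol=1e-10)
```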

Linear Transformations

  • If \[ X \sim \mathcal{N}(\mu, \sigma^2), \quad Y = aX + b, \] then \[ Y \sim \mathcal{N}(a\mu + b,\; a^2\sigma^2). \]
  • Intuition:
    • Adding \(b\) shifts the distribution.
    • Multiplying by \(a\) stretches or compresses it.
  • This explains why scaling features changes their variance.
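The closure of Gaussians under affine maps can be verified by simulation (a sketch; the parameters \(a=3\), \(b=-2\), and the sample size are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
mu, sigma2 = 1.0, 4.0
a, b = 3.0, -2.0

x = rng.normal(mu, np.sqrt(sigma2), size=500_000)
y = a * x + b

# Y should be close to N(a*mu + b, a^2 * sigma2) = N(1, 36).
assert abs(y.mean() - (a * mu + b)) < 0.05
assert abs(y.var() - a ** 2 * sigma2) < 0.5
```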

Sums of Gaussians

  • If \[ X \sim \mathcal{N}(\mu_1, \sigma_1^2), \quad Y \sim \mathcal{N}(\mu_2, \sigma_2^2), \] and \(X, Y\) are independent, then: \[ X + Y \sim \mathcal{N}(\mu_1 + \mu_2,\; \sigma_1^2 + \sigma_2^2). \]
  • Intuition:
    • Means add.
    • Variances add.
  • In ML, this models:
    • Signal + noise
    • Aggregated errors
    • Combined uncertainty
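"Means add, variances add" is easy to see empirically (a sketch with arbitrary parameters and seed):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(1.0, 2.0, size=500_000)   # X ~ N(1, 4)
y = rng.normal(-1.0, 3.0, size=500_000)  # Y ~ N(-1, 9), independent of X
s = x + y

# Independent Gaussians: X + Y ~ N(1 + (-1), 4 + 9) = N(0, 13).
assert abs(s.mean() - 0.0) < 0.05
assert abs(s.var() - 13.0) < 0.5
```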

Multivariate Normal (Multiple Features)

  • For vectors: \[ X \sim \mathcal{N}(\mu, \Sigma), \] where:
    • \(\mu \in \mathbb{R}^d\) is a mean vector.
    • \(\Sigma \in \mathbb{R}^{d \times d}\) is a covariance matrix.
  • What \(\Sigma\) tells us:
    • Diagonal entries = variances of each feature.
    • Off-diagonal entries = covariances between features (how pairs of features vary together).
  • Why this matters in ML:
    • Models relationships between features.
    • Used in:
      • Gaussian Naive Bayes
      • Gaussian Mixture Models (GMMs)
      • PCA (assumes Gaussian-like structure)
      • Kalman filters
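The roles of \(\mu\) and \(\Sigma\) can be checked by sampling (a sketch; the 2D mean and covariance below are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(4)
mu = np.array([0.0, 2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # positive off-diagonal: features co-vary

# Draw correlated 2D Gaussian samples and check the empirical moments.
samples = rng.multivariate_normal(mu, Sigma, size=200_000)
emp_cov = np.cov(samples, rowvar=False)

assert np.allclose(samples.mean(axis=0), mu, atol=0.05)
assert np.allclose(emp_cov, Sigma, atol=0.05)
```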

Log of a Gaussian (Very Important in ML)

  • Taking the log of the Gaussian density gives:
    • A quadratic expression in \(x\).
    • Sums instead of products.
  • Benefits:
    • Easier optimization.
    • More numerically stable.
  • Many ML methods optimize log-likelihood, not likelihood:
    • Linear regression
    • Logistic regression
    • Maximum likelihood estimation (MLE)
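The numerical-stability point can be demonstrated directly: summing log-densities stays finite, while multiplying raw densities underflows to zero for even modest sample sizes (a sketch; `gaussian_log_pdf` is our own helper):

```python
import numpy as np

def gaussian_log_pdf(x, mu, sigma2):
    # log N(x; mu, sigma2): a quadratic in x plus a normalization constant.
    return -0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)

rng = np.random.default_rng(5)
data = rng.normal(0.0, 1.0, size=1000)

# Log turns the product of densities into a sum of log-densities.
log_lik = gaussian_log_pdf(data, 0.0, 1.0).sum()

with np.errstate(divide="ignore"):
    # The direct product of 1000 densities underflows to 0.0.
    naive = np.log(np.prod(np.exp(gaussian_log_pdf(data, 0.0, 1.0))))

assert np.isfinite(log_lik)   # stable
assert np.isneginf(naive)     # log of an underflowed product
```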

Central Limit Theorem (CLT)

  • The average (or sum) of many independent random variables tends to look Gaussian, even if the original variables are not.
  • This explains why Gaussians appear everywhere:
    • Measurement noise
    • Aggregated user behavior
    • Averaged features
  • In practice:
    • Justifies modeling noise as Gaussian.
    • Helps explain why errors in models often look bell-shaped.
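A quick CLT simulation (a sketch; averaging 50 uniform draws, which are individually far from Gaussian): the sample means concentrate at \(1/2\) with variance \((1/12)/50\), matching the CLT prediction.

```python
import numpy as np

rng = np.random.default_rng(6)

# Average many i.i.d. Uniform(0,1) draws (mean 1/2, variance 1/12).
means = rng.uniform(0, 1, size=(100_000, 50)).mean(axis=1)

# CLT: averages are approximately N(1/2, (1/12)/50).
assert abs(means.mean() - 0.5) < 0.005
assert abs(means.var() - (1 / 12) / 50) < 0.001
```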

Why Gaussians Matter in Machine Learning

  • Common assumptions:
    • Noise is Gaussian.
    • Features are approximately Gaussian.
  • Appears in:
    • Linear regression (Gaussian noise assumption)
    • Bayesian models
    • Generative modeling
    • Density estimation
  • Mathematically convenient:
    • Closed-form solutions
    • Easy to differentiate
    • Stable under addition and scaling

3.2.1 (a)

A random vector \(X\in\mathbb{R}^d\) follows a multivariate Gaussian distribution with mean \(\mu\in\mathbb{R}^d\) and covariance matrix \(\Sigma\in\mathbb{R}^{d\times d}\), i.e., \(X\sim\mathcal N(\mu, \Sigma)\), if its PDF is \[f_X(x) = \frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\cdot\exp\left(-\frac12 (x-\mu)^\top \Sigma^{-1} (x-\mu)\right).\] The corresponding MGF is \[M_X(t) = \exp\left(t^\top \mu + \frac12 t^\top \Sigma t\right).\] (We assume non-degeneracy, i.e., that \(\Sigma\) is positive definite.) Show that \(X\) is multivariate Gaussian if and only if for all \(a\in\mathbb{R}^d\), \(a^\top X\) follows a univariate Gaussian distribution.

Answer

Define \(\mu = \mathbb{E}[X]\) and \(\Sigma = \text{Cov}(X)\).

  • Assume that \(X\) is multivariate Gaussian, i.e., \(X\sim\mathcal N(\mu, \Sigma)\). Let \(a\in\mathbb{R}^d\). Then, the MGF of \(Y = a^\top X\) is \[M_Y(s) = \mathbb{E}[e^{sY}] = \mathbb{E}[e^{sa^\top X}] = M_X(sa) = \exp(s\cdot a^\top\mu + \frac12 s^2\cdot a^\top \Sigma a).\]

This is the MGF of a univariate Gaussian with mean \(a^\top\mu\) and variance \(a^\top \Sigma a\), so \(a^\top X \sim \mathcal N(a^\top\mu,\, a^\top \Sigma a)\).

  • Assume that for all \(a\in\mathbb{R}^d\), \(a^\top X\) is univariate Gaussian. Note that \[\mathbb{E}[a^\top X] = a^\top \mathbb{E}[X] = a^\top \mu\] and that \[ \begin{aligned} \mathrm{Var}(a^\top X) &= \mathbb{E}[(a^\top X)^2] - \mathbb{E}[a^\top X]^2 \\ &= \mathbb{E}[a^\top X X^\top a] - \mathbb{E}[a^\top X]^2 \\ &= a^\top \mathbb{E}[X X^\top] a - a^\top \mathbb{E}[X]\mathbb{E}[X]^\top a \\ &= a^\top \big(\mathbb{E}[X X^\top] - \mathbb{E}[X]\mathbb{E}[X]^\top\big) a \\ &= a^\top \Sigma a. \end{aligned} \]

Using linearity of expectation, \[ \mathbb{E}[a^\top X] = a^\top \mathbb{E}[X]. \]

Thus, the MGF of \(a^\top X\) is \[\mathbb{E}\left[\exp(s\cdot a^\top X)\right] = \exp\left(s\cdot a^\top\mu + \frac12 s^2\cdot a^\top\Sigma a\right).\] Taking \(s = 1\) yields that \[\mathbb{E}\left[\exp(a^\top X)\right] = \exp\left(a^\top\mu + \frac12 a^\top\Sigma a\right).\] As this holds for all \(a\in\mathbb{R}^d\), it follows that the MGF of \(X\) is \[M_X(a) = \exp\left(a^\top\mu + \frac12 a^\top \Sigma a\right),\] i.e., \(X\) is multivariate Gaussian. In both implications, we utilize that a distribution is uniquely determined by its MGF.

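The forward direction can be checked by simulation (a sketch with an arbitrary seed and projection vector \(a\)): any linear projection of a multivariate Gaussian should match \(\mathcal N(a^\top\mu,\, a^\top\Sigma a)\).

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -1.0, 0.5])
A = rng.standard_normal((3, 3))
Sigma = A @ A.T                     # a random positive definite covariance

X = rng.multivariate_normal(mu, Sigma, size=300_000)
a = np.array([2.0, 0.5, -1.0])
proj = X @ a                        # samples of a^T X

# a^T X ~ N(a^T mu, a^T Sigma a): compare empirical moments.
assert abs(proj.mean() - a @ mu) < 0.05
assert abs(proj.var() - a @ Sigma @ a) < 0.5
```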


3.2.2 (b)

Let \(x_1,\dots,x_n\in\mathbb{R}^d\) be i.i.d. samples from a multivariate Gaussian \(\sim\mathcal N(\mu,\Sigma)\) (assume that \(\Sigma\) is positive definite). Find the maximum likelihood estimates of \(\mu\) and \(\Sigma\). You may use that for a symmetric invertible matrix \(A\in\mathbb{R}^{d\times d}\) and a symmetric matrix \(B\in\mathbb{R}^{d\times d}\), \[\nabla_A \log(\det(A)) = A^{-1}\] and \[\nabla_A \text{Tr} (A^{-1} B) = -A^{-1}BA^{-1}.\]

Answer

The \(\log\)-likelihood for all \(n\) samples is

\[ \begin{aligned} \ell(\mu,\Sigma) &= \log p(x_1,\dots,x_n\mid \mu,\Sigma) \\ &= \log \prod_{i=1}^n p(x_i\mid \mu,\Sigma) \\ &= \sum_{i=1}^n \log p(x_i\mid \mu, \Sigma) \\ &= \sum_{i=1}^n \log \left(\frac{1}{(2\pi)^{d/2}\det(\Sigma)^{1/2}}\cdot\exp\left(-\frac12 (x_i-\mu)^\top \Sigma^{-1} (x_i-\mu)\right)\right) \\ &= \sum_{i=1}^n \left(-\frac d2\log(2\pi) - \frac12\log (\det(\Sigma)) - \frac12 (x_i-\mu)^\top\Sigma^{-1}(x_i-\mu)\right) \\ &= -\frac{nd}{2}\log(2\pi) - \frac n2 \log(\det(\Sigma)) - \frac12 \sum_{i=1}^n (x_i-\mu)^\top\Sigma^{-1}(x_i-\mu). \end{aligned} \]

Since \(x_1,\dots,x_n \sim \mathcal N(\mu,\Sigma)\) independently, the joint likelihood is \[ p(x_1,\dots,x_n \mid \mu,\Sigma)=\prod_{i=1}^n p(x_i\mid \mu,\Sigma). \]

The constant term \(-\frac{nd}{2}\log(2\pi)\) does not affect the maximizer.

Taking the gradient with respect to \(\mu\) yields

\[ \begin{aligned} \nabla_\mu \ell &= -\frac12 \sum_{i=1}^n \nabla_\mu \left((x_i-\mu)^\top\Sigma^{-1}(x_i-\mu)\right) \\ &= -\frac12\sum_{i=1}^n (-2\Sigma^{-1}(x_i - \mu)) \\ &= \Sigma^{-1} \sum_{i=1}^n (x_i - \mu). \end{aligned} \]

Setting this equal to zero yields \[\widehat\mu = \frac1n\sum_{i=1}^n x_i.\]

Using \[ \nabla_\mu\big((x-\mu)^\top A (x-\mu)\big) = -2A(x-\mu) \] for symmetric \(A\), with \(A=\Sigma^{-1}\), we get \(\nabla_\mu\big((x_i-\mu)^\top\Sigma^{-1}(x_i-\mu)\big) = -2\Sigma^{-1}(x_i-\mu)\).

Next, taking the gradient with respect to \(\Sigma\) yields

\[ \begin{aligned} \nabla_\Sigma \ell &= -\frac n2 \nabla_\Sigma(\log(\det(\Sigma))) - \frac12 \sum_{i=1}^n \nabla_\Sigma ((x_i-\mu)^\top\Sigma^{-1}(x_i-\mu)) \\ &= -\frac n2 \Sigma^{-1} - \frac12 \sum_{i=1}^n \nabla_\Sigma ((x_i-\mu)^\top\Sigma^{-1}(x_i-\mu)) \\ &= -\frac n2 \Sigma^{-1} - \frac12 \sum_{i=1}^n \nabla_\Sigma \text{Tr}((x_i-\mu)^\top\Sigma^{-1}(x_i-\mu)) \\ &= -\frac n2 \Sigma^{-1} - \frac12 \sum_{i=1}^n \nabla_\Sigma \text{Tr}(\Sigma^{-1}(x_i-\mu)(x_i-\mu)^\top) \\ &= -\frac n2 \Sigma^{-1} - \frac12 \sum_{i=1}^n \left(-\Sigma^{-1}(x_i-\mu)(x_i-\mu)^\top\Sigma^{-1}\right) \\ &= -\frac n2 \Sigma^{-1} + \frac12 \Sigma^{-1}\left(\sum_{i=1}^n (x_i-\mu)(x_i-\mu)^\top\right)\Sigma^{-1}. \end{aligned} \]

Setting this equal to zero and utilizing \(\widehat\mu\) in place of \(\mu\) yields \[\widehat\Sigma = \frac1n \sum_{i=1}^n (x_i-\widehat\mu)(x_i-\widehat\mu)^\top.\]

Use \(x^\top A x=\mathrm{Tr}(Axx^\top)\): \[ (x_i-\mu)^\top\Sigma^{-1}(x_i-\mu) = \mathrm{Tr}\big(\Sigma^{-1}(x_i-\mu)(x_i-\mu)^\top\big). \]
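The two MLE formulas are simple to apply in code. A sketch with synthetic data (the true parameters and seed are arbitrary); note the \(1/n\) normalization of \(\widehat\Sigma\), as opposed to the unbiased \(1/(n-1)\):

```python
import numpy as np

rng = np.random.default_rng(8)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[1.5, 0.4],
                       [0.4, 0.8]])
X = rng.multivariate_normal(mu_true, Sigma_true, size=100_000)

# MLE of the mean: the sample average.
mu_hat = X.mean(axis=0)

# MLE of the covariance: average outer product of centered samples
# (1/n normalization, matching the derivation above).
centered = X - mu_hat
Sigma_hat = centered.T @ centered / len(X)

assert np.allclose(mu_hat, mu_true, atol=0.05)
assert np.allclose(Sigma_hat, Sigma_true, atol=0.05)
```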

3.3 \(K\)-Means Algorithm

Consider the \(K\)-means algorithm applied to the data points \(x_1,\dots,x_n\in\mathbb{R}^d\). For each point \(x_i\), let \(r_{ik}\) denote the binary indicator variable for whether \(x_i\) belongs to cluster \(k \in \{1, \dots, K\}\). Let \(\mu_1,\dots,\mu_K\) denote the cluster centroids at any given time. The \(K\)-means algorithm alternates between the following two steps.

  • For each point \(x_i\), choose \[k \in \arg\min_{j\in \{1, \dots, K\}} \|x_i - \mu_j\|_2^2.\] (Ties are broken deterministically.) Let \(r_{ik} = 1\) and \(r_{ij} = 0\) for all \(j\neq k\).
  • For each non-empty cluster \(k\), let \[\mu_k = \frac{\sum_{i=1}^n r_{ik} x_i}{\sum_{i=1}^n r_{ik}}.\]

Formulate the objective function for this algorithm, and show that this algorithm converges after a finite number of iterations.

Important: K-Means Clustering

K-means is a simple and widely used unsupervised learning algorithm that groups data into clusters based on similarity.

Goal

  • Partition \(n\) data points into \(k\) clusters.

  • Each cluster is represented by a center (called a centroid).

  • The algorithm minimizes the total within-cluster squared distance: \[ \sum_{i=1}^{n} \|x_i - c_{\text{cluster}(i)}\|^2 \] where each point is assigned to the nearest centroid.

  • Intuition:

    • Points in the same cluster should be close to each other.
    • Clusters should be compact and well-separated.

Core Idea

  • Each point belongs to the cluster with the nearest center.
  • Each cluster center is the average (mean) of its assigned points.
  • The algorithm alternates between:
    • Assigning points → based on current centers
    • Updating centers → based on current assignments

Algorithm Steps

  1. Initialize centroids
    • Choose \(k\) starting centers.
    • Common methods:
      • Random points from the dataset
      • Random locations in space
      • K-means++ (a smarter, more stable initialization)
  2. Assign points to nearest centroid
    • For each data point:
      • Compute distance to each centroid (usually Euclidean distance).
      • Assign it to the closest one.
    • This forms \(k\) clusters.
  3. Recalculate cluster centers
    • For each cluster:
      • Compute the mean of all points assigned to it.
      • Move the centroid to that mean.
  4. Repeat steps 2–3
    • Continue until:
      • Centroids stop moving, OR
      • Point assignments stop changing, OR
      • A maximum number of iterations is reached.
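The four steps above can be sketched as a minimal NumPy implementation (illustrative only: for simplicity the initialization takes the first \(k\) data points rather than using k-means++):

```python
import numpy as np

def kmeans(X, k, n_iters=100):
    """Plain K-means on the rows of X; returns (centroids, labels)."""
    # Step 1: initialize centroids (here: the first k points; k-means++ is better).
    centroids = X[:k].copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iters):
        # Step 2: assign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (empty clusters keep their previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels

# Two well-separated blobs should be recovered as two clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
centroids, labels = kmeans(X, k=2)
assert (labels[:50] == labels[0]).all() and (labels[50:] == labels[50]).all()
assert labels[0] != labels[50]
```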

Why It Works

  • Each step reduces the total squared distance.
  • The algorithm eventually converges to a stable configuration.
  • It finds a local minimum (not always the global best clustering).

Key Intuition

  • “Cluster center” = average of the points in that group.
  • Points repeatedly get pulled toward the nearest center.
  • Centers repeatedly move toward the densest regions of their assigned points.

Important Notes

  • You must choose \(k\) (number of clusters) in advance.
    • Common method: try multiple values and use the elbow method.
  • Results depend on initialization:
    • Different starting points can give different clusters.
    • K-means++ helps make results more consistent.
  • Works best when:
    • Clusters are roughly spherical.
    • Clusters have similar sizes.
    • Distance is meaningful (numeric features, properly scaled).

Practical Tips

  • Always scale features before running K-means:
    • Use standardization or normalization.
    • Otherwise, large-scale features dominate distance.
  • Sensitive to outliers:
    • A single extreme point can pull a centroid far away.
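Feature scaling before clustering is a one-liner. A sketch with hypothetical two-feature data (income on a large scale, age on a small one), standardized column-wise so neither dominates the distance:

```python
import numpy as np

# Hypothetical features: income (large scale) and age (small scale).
X = np.array([[50_000.0, 25.0],
              [80_000.0, 40.0],
              [60_000.0, 30.0]])

# Standardize each column: zero mean, unit variance.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

assert np.allclose(X_scaled.mean(axis=0), 0.0)
assert np.allclose(X_scaled.std(axis=0), 1.0)
```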

Common Uses in Machine Learning

  • Customer segmentation
  • Image compression (color clustering)
  • Document/topic grouping
  • Preprocessing for other models
  • Vector quantization

K-means is often one of the first clustering algorithms taught because it is simple, fast, and builds strong intuition about how unsupervised learning discovers structure in data.

Answer

The objective function is given as \[J(\{r_{ik}\}, \{\mu_k\}) = \sum_{k=1}^K\sum_{i=1}^n r_{ik} \|x_i - \mu_k\|_2^2,\] where all \(r_{ik} \in \{0,1\}\) and \(\sum_{k=1}^K r_{ik} = 1\) for all \(i\). We show that neither step increases the objective.

Here \(r_{ik}\in\{0,1\}\) indicates whether \(x_i\) is assigned to cluster \(k\), with \[\sum_{k=1}^K r_{ik}=1,\] so each point belongs to exactly one cluster. Interpretation: \(J\) is the total within-cluster squared distance (each point contributes the squared distance to its assigned centroid).

  • Fix the centroids \(\mu_k\), and consider the assignment-update step. For each point \(x_i\), its contribution to the objective is \[\|x_i - \mu_k\|_2^2,\] where \(k\) denotes the cluster \(x_i\) has been assigned to. This contribution is minimized by assigning \(x_i\) to the cluster with the closest centroid, i.e., following the assignment-update rule. Thus, the contribution of \(x_i\) to the objective cannot increase. Summing across all points yields that the objective cannot increase in this step.

  • Fix the assignments \(r_{ik}\), and consider the centroid-update step. For each non-empty cluster \(k\), its contribution to the objective is \[\sum_{i\text{ s.t. }r_{ik} = 1} \|x_i - \mu_k\|_2^2,\] where \(\mu_k\) denotes the cluster’s centroid. This is minimized by letting \(\mu_k\) equal the mean of all points assigned to the cluster, i.e., following the centroid-update rule. (This follows from differentiating with respect to \(\mu_k\) and setting the gradient equal to zero.) Thus, the contribution of cluster \(k\) to the objective cannot increase. Summing across all clusters yields that the objective cannot increase in this step.

Fix the assignments \(r_{ik}\) and update centroids. For a fixed cluster \(k\), the subproblem is \[ \min_{\mu_k}\sum_{i:\,r_{ik}=1}\|x_i-\mu_k\|_2^2. \] Differentiating w.r.t. \(\mu_k\) and setting to zero gives \(\mu_k=\frac{1}{m}\sum_{i:\,r_{ik}=1}x_i\) (the cluster mean, where \(m\) is the cluster size), so this update minimizes the cluster’s contribution. Summing over \(k\) implies the total objective cannot increase in the centroid step.

Each iteration has an assignment step and centroid step, and each cannot increase \(J\), so \(J(t+1)\le J(t)\) (monotone non-increasing).

Further, there are only finitely many possible assignments, namely at most \(K^n\), so the algorithm can visit only finitely many distinct assignments. Because the objective never increases and ties are broken deterministically, no assignment can recur without the algorithm having stabilized. Consequently, the algorithm must terminate after a finite number of iterations, at which point it has converged to a (possibly local) minimum of the objective.
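The monotonicity argument can be checked empirically. This sketch (arbitrary data and seed) records \(J\) after each assignment step and verifies it never increases:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 2))
K = 3
centroids = X[rng.choice(len(X), size=K, replace=False)]

objectives = []
for _ in range(20):
    # Assignment step: nearest centroid per point (squared Euclidean distance).
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    objectives.append(dists[np.arange(len(X)), labels].sum())
    # Centroid step: mean of each non-empty cluster.
    for k in range(K):
        if np.any(labels == k):
            centroids[k] = X[labels == k].mean(axis=0)

# J is monotonically non-increasing across iterations, as the proof shows.
diffs = np.diff(objectives)
assert np.all(diffs <= 1e-9)
```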