6 CS 189 Discussion 06: Logistic Regression

6.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 11 AM–12 PM @ Wheeler 120 & Wednesdays, 3–4 PM @ Hildebrand Hall B51
Office Hours Tuesdays, 11 AM–1 PM @ Cory Courtyard

6.1 Log-Odds in Logistic Regression

Consider a binary classification dataset \(\{(x_i, y_i)\}_{i=1}^n\), where \(x_i\in \mathbb{R}^d\) and \(y_i\in \{0,1\}\). In logistic regression, the conditional probability of label \(1\) given input \(x\) is \[\Pr[Y = 1 \mid X = x; w] = \sigma(w^\top x),\] where \(w\in\mathbb{R}^d\) denotes the weight vector. (Recall that \(\sigma(z) = \frac{1}{1 + e^{-z}}\).)

Show that the log-odds \[\log \frac{\Pr[Y = 1 \mid X = x; w]}{\Pr[Y = 0 \mid X = x; w]}\] is linear in \(x\).


Answer

For \(z\in\mathbb{R}\), \[ \begin{align*} \log \frac{\sigma(z)}{1 - \sigma(z)} = \log \frac{\left(\frac{1}{1 + e^{-z}}\right)}{\left(\frac{e^{-z}}{1 + e^{-z}}\right)} = \log \frac{1}{e^{-z}} = \log e^z = z. \end{align*} \]

Logistic regression assumes \(\Pr(Y = 1 \mid X = x; w) = \sigma(w^\top x)\), where the sigmoid is \(\sigma(z) = \frac{1}{1 + e^{-z}}\).

Since this is binary classification, \(\Pr(Y = 0 \mid X = x; w) = 1 - \sigma(w^\top x)\).

Taking \(z = w^\top x\) gives \[ \begin{align*} \log \frac{\Pr[Y = 1 \mid X = x; w]}{\Pr[Y = 0 \mid X = x; w]} = \log \frac{\sigma(w^\top x)}{1 - \sigma(w^\top x)} = w^\top x. \end{align*} \]
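The identity above is easy to sanity-check numerically. A minimal sketch (the weights and inputs below are arbitrary illustrative values, not from the worksheet):

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p: float) -> float:
    return math.log(p / (1.0 - p))

# For any z, the log-odds of sigma(z) recovers z exactly.
for z in [-3.0, -0.5, 0.0, 1.2, 4.0]:
    assert abs(log_odds(sigmoid(z)) - z) < 1e-9

# With z = w^T x, the log-odds is therefore linear in x.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, -0.5]
z = sum(wi * xi for wi, xi in zip(w, x))  # w^T x
assert abs(log_odds(sigmoid(z)) - z) < 1e-9
```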

6.2 Cross-Entropy Loss in Logistic Regression

Consider a binary classification dataset \(\{(x_i, y_i)\}_{i=1}^n\), where \(x_i\in \mathbb{R}^d\) and \(y_i\in \{0,1\}\). In logistic regression, the conditional probability of label \(1\) given input \(x\) is \[\widehat p = \Pr[Y = 1 \mid X = x; w] = \sigma(w^\top x),\] where \(w\in\mathbb{R}^d\) denotes the weight vector. (Recall that \(\sigma(z) = \frac{1}{1 + e^{-z}}\).)

Now, let \(\widehat p_i\) denote the predicted probability of label \(1\) for \(x_i\). The corresponding cross-entropy loss is \[\ell(\widehat p_i, y_i) = -y_i \log \widehat p_i - (1-y_i)\log (1 - \widehat p_i),\] and the total loss across the dataset is \[J(w) = \sum_{i=1}^n \ell(\widehat p_i, y_i).\] Show that maximizing the likelihood of the observed labels is equivalent to minimizing \(J(w)\).

Important: Logistic Regression

Logistic Regression is a statistical model used for binary classification, modeling the probability that an outcome belongs to the positive class.

Model Definition

  • Models the conditional probability: \[ P(y = 1 \mid x) \]

  • Uses the sigmoid function applied to a linear predictor: \[ P(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + e^{-w^T x}} \]

  • Produces probabilities between 0 and 1.

Decision Boundary

  • Classification typically uses threshold 0.5.
  • The decision boundary occurs when: \[ w^T x = 0 \]

Training Objective

  • Maximizes the Bernoulli likelihood of the observed labels.
  • Equivalently minimizes the cross-entropy (log loss): \[ L(w) = - \sum_{i=1}^n \left[y_i \log p_i + (1-y_i)\log(1-p_i)\right] \]

Key Properties

  • The loss function is convex in \(w\), ensuring a global optimum.
  • There is no closed-form solution for the optimal parameters.
  • Typically solved using gradient descent or Newton’s method.

Interpretation of Coefficients

  • Each coefficient represents a change in log-odds: \[ \log\left(\frac{P(y=1|x)}{P(y=0|x)}\right) = w^T x \]

Extensions

  • Can be extended to multiclass classification using softmax regression.
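The training objective and optimization bullets above can be sketched in a few lines of batch gradient descent on the cross-entropy loss; the toy dataset, learning rate, and iteration count below are illustrative assumptions, not part of the discussion.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_J(w, X, y):
    # Gradient of the cross-entropy loss: sum_i (sigma(w.x_i) - y_i) x_i
    g = [0.0] * len(w)
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
        for j, xj in enumerate(xi):
            g[j] += err * xj
    return g

# Toy 1-D data with a bias feature [1, x]; labels flip around x = 0.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

w = [0.0, 0.0]
lr = 0.5
for _ in range(2000):
    g = grad_J(w, X, y)
    w = [wj - lr * gj for wj, gj in zip(w, g)]

# After training, predicted probabilities should match the labels.
preds = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]
assert all(p < 0.5 for p in preds[:2]) and all(p > 0.5 for p in preds[2:])
```

Because the loss is convex, plain gradient descent converges to the global optimum here (on separable data the weights grow without bound, but slowly, so a fixed iteration budget is fine for illustration).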

Answer

For \(i\in [n]\), the conditional distribution of \(y_i\) given \(x_i\) is \[\Pr[Y=y_i \mid X=x_i; w] = \widehat p_i^{y_i} (1 - \widehat p_i)^{1 - y_i}.\]

Logistic regression predicts \(\widehat p_i = \Pr(Y = 1 \mid X = x_i; w) = \sigma(w^\top x_i)\), where \(\sigma(z) = \frac{1}{1 + e^{-z}}\). Thus \(\Pr(Y = 0 \mid X = x_i; w) = 1 - \widehat p_i\).


Why this works: If \(y_i = 1\): \(\widehat p_i^1 (1 - \widehat p_i)^0 = \widehat p_i\). If \(y_i = 0\): \(\widehat p_i^0 (1 - \widehat p_i)^1 = 1 - \widehat p_i\). So this compactly represents the correct probability.

Assuming independence, it follows that the likelihood over the dataset is \[L(w) = \prod_{i=1}^n \widehat p_i^{y_i} (1 - \widehat p_i)^{1 - y_i}.\]


Maximizing the likelihood is equivalent to maximizing the log-likelihood \[\mathcal L(w) = \sum_{i=1}^n y_i\log \widehat p_i + (1-y_i)\log (1 - \widehat p_i),\] which is equivalent to minimizing \[-\mathcal L(w) = \sum_{i=1}^n ( -y_i\log \widehat p_i - (1-y_i)\log (1 - \widehat p_i)) = \sum_{i=1}^n \ell(\widehat p_i, y_i) = J(w).\]
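The equivalence can be checked numerically: for any weights and data, \(-\log L(w)\) equals \(J(w)\) exactly. A minimal sketch with arbitrary illustrative values:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(p, y):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

# Arbitrary weights and data points (illustrative only).
w = [1.0, -0.5]
X = [[0.2, 1.0], [-1.0, 0.3], [2.0, -2.0]]
y = [1, 0, 1]

p_hat = [sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) for xi in X]

# Likelihood of the observed labels under the model.
L = 1.0
for pi, yi in zip(p_hat, y):
    L *= pi ** yi * (1 - pi) ** (1 - yi)

# J(w): total cross-entropy loss across the dataset.
J = sum(cross_entropy(pi, yi) for pi, yi in zip(p_hat, y))

# Maximizing L is the same as minimizing J: -log L == J.
assert abs(-math.log(L) - J) < 1e-9
```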

6.3 Likelihood Ratio Test

Suppose we observe \(X\in \mathbb{R}\) drawn from one of two distributions: \(P_0\) with PDF \(p_0(x)\), or \(P_1\) with PDF \(p_1(x)\). A (possibly randomized) test is a function \[\phi : \mathbb{R} \to [0,1],\] where \(\phi(x)\) denotes the probability of deciding \(P_1\) when observing \(x\).

The size and power of the test are \(\alpha(\phi) = \mathbb{E}_{P_0}[\phi(x)]\) and \(\beta(\phi) = \mathbb{E}_{P_1} [\phi(x)]\), respectively.

Now, given \(\eta > 0\), the likelihood ratio test with threshold \(\eta\) corresponds to the decision rule \[\phi_\eta(x) = \begin{cases} 1,&\Lambda(x) > \eta,\\ 0,&\Lambda(x) < \eta, \end{cases}\] where the likelihood ratio \(\Lambda(x) = \frac{p_1(x)}{p_0(x)}\) (we employ randomization when \(\Lambda(x) = \eta\)).
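The decision rule \(\phi_\eta\) translates directly into code; the two density functions below are hypothetical placeholders used only to exercise the rule.

```python
import random

def lrt(x, p1, p0, eta, gamma, rng=random):
    """Decide between P0 and P1: return 1 for P1, 0 for P0.

    gamma is the probability of deciding P1 when Lambda(x) == eta
    (the randomized boundary case)."""
    lam = p1(x) / p0(x)
    if lam > eta:
        return 1
    if lam < eta:
        return 0
    return 1 if rng.random() < gamma else 0

# Hypothetical density values, for illustration only.
p0 = lambda x: 1.0
p1 = lambda x: 2.0 * x
assert lrt(0.9, p1, p0, eta=1.0, gamma=0.5) == 1  # Lambda(0.9) = 1.8 > 1
assert lrt(0.3, p1, p0, eta=1.0, gamma=0.5) == 0  # Lambda(0.3) = 0.6 < 1
```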


6.3.1 (a)

For fixed \(\alpha\in (0,1)\), show that among all tests with size at most \(\alpha\), the likelihood ratio test with size exactly \(\alpha\) has maximum power. (This is the Neyman-Pearson Lemma.)

Answer

We observe \(X\) drawn from either \(P_0\) with PDF \(p_0(x)\) (null hypothesis) or \(P_1\) with PDF \(p_1(x)\) (alternative). A test \(\phi(x)\in [0,1]\) gives the probability of choosing \(P_1\).

So: \(\phi(x) = 1\) \(\Rightarrow\) always choose \(P_1\); \(\phi(x) = 0\) \(\Rightarrow\) always choose \(P_0\).

The size (Type I error rate): \(\alpha(\phi) = \mathbb{E}_{P_0}[\phi(X)] = \int \phi(x)p_0(x)dx\).

The power (probability of correctly detecting \(P_1\)): \(\beta(\phi) = \mathbb{E}_{P_1}[\phi(X)] = \int \phi(x)p_1(x)dx\).

Define the likelihood ratio \(\Lambda(x) = \frac{p_1(x)}{p_0(x)}\). The LRT rule is: \(\phi_\eta(x) = 1\) if \(\Lambda(x) > \eta\), \(\phi_\eta(x) = 0\) if \(\Lambda(x) < \eta\) (with randomization if equal).

Interpretation: choose \(P_1\) when the data is much more likely under \(P_1\).

Let \(\phi_\eta\) be the likelihood ratio test with threshold \(\eta\) chosen to achieve size \(\alpha\), i.e., \[\alpha(\phi_\eta) = \alpha,\] and let \(\phi\) be an arbitrary test with \[\alpha(\phi) \leq \alpha.\]


We compare their powers: \(\beta(\phi) - \beta(\phi_\eta)\).

Then, \[ \begin{align*} \beta(\phi) - \beta(\phi_\eta) &= \int (\phi(x) - \phi_\eta(x))p_1(x)dx \\ &= \int (\phi(x) - \phi_\eta(x))\Lambda(x)p_0(x)dx \\ &= \int (\phi(x) - \phi_\eta(x))(\Lambda(x) - \eta)p_0(x)dx + \eta\int (\phi(x) - \phi_\eta(x))p_0(x)dx \\ &= \int (\phi(x) - \phi_\eta(x))(\Lambda(x) - \eta)p_0(x)dx + \eta (\alpha(\phi) - \alpha(\phi_\eta)). \end{align*} \]

Line by line: the second equality substitutes \(p_1(x) = \Lambda(x)p_0(x)\); the third rewrites \(\Lambda(x) = (\Lambda(x) - \eta) + \eta\) and splits the integral; and the last uses \(\int \phi(x)p_0(x)dx = \alpha(\phi)\) and \(\int \phi_\eta(x)p_0(x)dx = \alpha(\phi_\eta)\).

We first consider the integral by analyzing the sign of the integrand point-wise.

  • If \(\Lambda(x) > \eta\), then \(\phi_\eta(x) = 1\), so \(\phi(x)\leq \phi_\eta(x)\), thus the integrand is \(\leq 0\) here.
  • If \(\Lambda(x) < \eta\), then \(\phi_\eta(x) = 0\), so \(\phi(x) \geq \phi_\eta(x)\), thus the integrand is \(\leq 0\) here.
  • If \(\Lambda(x) = \eta\), then the integrand has value \(0\).


It follows that the integral has value \(\leq 0\), thus \[ \begin{align*} \beta(\phi) - \beta(\phi_\eta) &\leq \eta(\alpha(\phi) - \alpha(\phi_\eta)). \end{align*} \]

Finally, since \(\alpha(\phi) \leq \alpha = \alpha(\phi_\eta)\) and \(\eta > 0\), it follows that \(\beta(\phi) \leq \beta(\phi_\eta)\). Thus, the likelihood ratio test \(\phi_\eta\) has maximum power.


The proof shows something intuitive: if you want to maximize detection of \(P_1\) while controlling false positives, you should accept \(P_1\) for the data points where \(\frac{p_1(x)}{p_0(x)}\) is largest.

That is exactly what the likelihood ratio test does.
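The lemma can be illustrated concretely. For a Gaussian shift \(P_0 = \mathcal N(0,1)\) vs. \(P_1 = \mathcal N(1,1)\) (an example not in the worksheet), \(\Lambda(x)\) is increasing in \(x\), so the size-\(\alpha\) LRT rejects for large \(x\); its power exceeds that of a two-sided test with the same size.

```python
from statistics import NormalDist

# P0 = N(0,1), P1 = N(1,1): Lambda(x) is increasing in x,
# so the size-alpha LRT rejects when x > c.
std = NormalDist(0.0, 1.0)
alt = NormalDist(1.0, 1.0)
alpha = 0.05

c = std.inv_cdf(1 - alpha)       # LRT threshold: size exactly alpha
power_lrt = 1 - alt.cdf(c)

# A competing size-alpha test: reject when |x| > c2 (two-sided).
c2 = std.inv_cdf(1 - alpha / 2)
power_two_sided = (1 - alt.cdf(c2)) + alt.cdf(-c2)

# Neyman-Pearson: the LRT's power dominates.
assert abs((1 - std.cdf(c)) - alpha) < 1e-9
assert power_lrt > power_two_sided
```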


6.3.2 (b)

Let \(P_0\sim \mathcal N(\mu_0, \sigma^2)\) and \(P_1\sim \mathcal N(\mu_1, \sigma^2)\) for \(\mu_0,\mu_1\in \mathbb{R}\) and \(\sigma^2 > 0\). Assume that \(\mu_0 < \mu_1\). We observe an i.i.d. sample \(X_1,\dots,X_n\in\mathbb{R}\) drawn from either \(P_0\) or \(P_1\). For fixed \(\alpha\in (0,1)\), derive the form of the likelihood ratio test \(\phi(X_1,\dots,X_n)\) with size \(\alpha\).

Answer

We have that \[ \begin{align*} \Lambda(X_1,\dots,X_n) &= \prod_{i=1}^n \Lambda(X_i) \\ &= \prod_{i=1}^n \frac{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(X_i - \mu_1)^2}{2\sigma^2}\right)}{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(X_i - \mu_0)^2}{2\sigma^2}\right)} \\ &= \prod_{i=1}^n \exp\left(\frac{-(X_i - \mu_1)^2 + (X_i - \mu_0)^2}{2\sigma^2} \right) \\ &= \prod_{i=1}^n \exp\left(\frac{2(\mu_1-\mu_0)X_i - (\mu_1^2 - \mu_0^2)}{2\sigma^2} \right) \\ &= \exp\left(\frac{2n(\mu_1 - \mu_0)\overline X - n(\mu_1^2 - \mu_0^2)}{2\sigma^2}\right). \end{align*} \]

We test \(H_0: X_i \sim \mathcal N(\mu_0, \sigma^2)\) vs \(H_1: X_i \sim \mathcal N(\mu_1, \sigma^2)\) with \(\mu_0 < \mu_1\), and \(X_1,\dots,X_n\) i.i.d. The likelihood ratio is \(\Lambda(X_1,\dots,X_n) = \frac{p_1(X_1,\dots,X_n)}{p_0(X_1,\dots,X_n)}\).

Reject \(H_0\) when \(\Lambda(X_1,\dots,X_n) > \eta\) for some threshold \(\eta\).


Since \(\mu_1 > \mu_0\), \(\Lambda\) is monotonically increasing in \(\overline X\); thus, given a threshold \(\eta > 0\), there exists a constant \(C\) such that the conditions \[ \begin{align*} \Lambda(X_1,\dots,X_n) > \eta \qquad\text{and}\qquad \overline X > C \end{align*} \] are equivalent.


It thus suffices to find the constant \(C\) to yield a test of size \(\alpha\). By definition, \[ \begin{align*} \alpha = \Pr_{P_0} \left[\overline X > C \right] = \Pr_{P_0} \left[\frac{\overline X - \mu_0}{\sqrt{\sigma^2/n}} > \frac{C - \mu_0}{\sqrt{\sigma^2/n}}\right], \end{align*} \] where, under \(P_0\), \[ \begin{align*} \frac{\overline X - \mu_0}{\sqrt{\sigma^2/n}} \sim \mathcal N\left(0, 1\right). \end{align*} \] Thus, we let \[ \begin{align*} \frac{C - \mu_0}{\sqrt{\sigma^2/n}} = \Phi^{-1}(1 - \alpha), \end{align*} \] which gives \[ \begin{align*} C &= \mu_0 + \sqrt{\frac{\sigma^2}{n}}\cdot \Phi^{-1}(1-\alpha), \end{align*} \] where \(\Phi\) denotes the CDF of the standard normal distribution. Thus, the test is of the form \[ \begin{align*} \phi (X_1,\dots,X_n) = \begin{cases} 1, &\overline X > C,\\ 0, &\overline X < C, \end{cases} \end{align*} \] where \(C\) is defined above.

Intuitively, because \(\mu_1 > \mu_0\), a large sample mean is evidence for \(H_1\), so the optimal test simply rejects \(H_0\) when the sample mean is sufficiently large.
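A quick numerical check of the threshold formula, using Python's standard-library normal distribution; the values of \(\mu_0, \sigma^2, n, \alpha\) below are illustrative.

```python
from statistics import NormalDist
import math

mu0, sigma2, n, alpha = 0.0, 4.0, 25, 0.05  # illustrative values
mu1 = 1.0

# Threshold from the derivation: C = mu0 + sqrt(sigma^2/n) * Phi^{-1}(1 - alpha)
se = math.sqrt(sigma2 / n)
C = mu0 + se * NormalDist().inv_cdf(1 - alpha)

# Under P0, Xbar ~ N(mu0, sigma^2/n), so the size is exactly alpha.
size = 1 - NormalDist(mu0, se).cdf(C)
assert abs(size - alpha) < 1e-9

# Under P1, the power is 1 - Phi((C - mu1) / se); it grows with n.
power = 1 - NormalDist(mu1, se).cdf(C)
assert power > alpha  # better than rejecting at random
```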