9  CS 189 Discussion 09: Backpropagation & CNNs

9.0.1 Contact Information

Name Wesley Zheng
Pronouns He/him/his
Email wzheng0302@berkeley.edu
Discussion Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3-4 PM @ Hildebrand Hall B51
Office Hours Tuesdays, 11–1 PM @ Cory Courtyard

9.1 Backprop in Practice: Staged Computation (F25 Dis8 Q1)

We will apply backpropagation for the function \(f(x,y,z) = (x+y)z\).

ImportantBackpropagation

Core Idea

  • Efficiently computes gradients of loss w.r.t. all parameters
  • Uses chain rule + reuses computations

Forward Pass

  • Compute activations layer-by-layer
  • Store intermediate values
  • Produce prediction → compute loss

Backward Pass

  • Propagate error output → input
  • Apply chain rule: \[ \frac{dL}{dw} = \frac{dL}{da}\frac{da}{dw} \]
  • Reuse gradients (dynamic programming style)

Local Gradients

  • Each layer computes its own derivative
  • Modular across layers

Requirements

  • Operations must be differentiable (or use subgradients)

9.1.1 (a)

Decompose \(f\) into two simpler functions.

Answer We can conveniently express \(f\) as: \(g=x+y\), \(f=gz\). The original \(f(x,y,z)\) expression is simple enough to differentiate directly, but this particular approach of thinking about it in terms of its decomposition can help with understanding the intuition behind backprop.

9.1.2 (b)

Draw the computation graph that represents \(f\) following the composition from part (a).

Answer

9.1.3 (c)

Write the steps of the forward pass and backward pass through the network.

Answer
\[\begin{array}{l} \textbf{Forward pass:} g = x + y, \\[2pt] f = g \cdot z. \\[10pt] \textbf{Backward pass:} \\[4pt] \text{Backprop through } f = g z: \\[2pt] \dfrac{\partial f}{\partial z} = g, \quad \dfrac{\partial f}{\partial g} = z. \\[8pt] \text{Backprop through } g = x + y \text{ (using the chain rule):} \\[2pt] \dfrac{\partial f}{\partial x} = \dfrac{\partial g}{\partial x} \cdot \dfrac{\partial f}{\partial g} = 1 \cdot \dfrac{\partial f}{\partial g} = \dfrac{\partial f}{\partial g}, \\[6pt] \dfrac{\partial f}{\partial y} = \dfrac{\partial g}{\partial y} \cdot \dfrac{\partial f}{\partial g} = 1 \cdot \dfrac{\partial f}{\partial g} = \dfrac{\partial f}{\partial g}. \end{array}\]

9.1.4 (d)

Update your network drawing with the intermediate values in the forward and backward pass using the inputs \(x = -2\), \(y = 5\), and \(z = -4\).

Answer

We first run the forward pass (shown in black) of the network, which computes values from inputs to output. Then, we do the backward pass (blue), which starts at the end of the network and recursively applies the chain rule to compute the gradients.

Think of the gradients as flowing backwards through the network. By the time we reach the beginning of the network in backprop, each gate will have learned about the gradient of its output value on the final output of the entire circuit. Backprop can thus be thought of as message passing (via the gradient signal) between gates about whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.

9.2 Convolution Operation in CNNs (F25 Dis9 Q1)

In this question, we will look at the mechanism of weight sharing in convolutions. Suppose that we have an input vector with \(D=9\) dimensions: \[\mathbf{x} = [\,x_1 \quad x_2 \quad ... \quad x_9\,]^T.\]

We compute a 1D convolution with the kernel filter that has \(K=3\) weights (parameters): \[\mathbf{k} = [\,k_1 \quad k_2 \quad k_3\,]^T.\]

ImportantConvolutional Neural Networks (CNNs)

Core Idea

  • Learn spatial features from grid-like data (e.g., images)
  • Exploit local structure + parameter sharing

Convolution Layer

  • Apply small filters (kernels) across input
  • Same weights reused across locations

Feature Maps

  • Each map highlights a specific feature

Activation (e.g., ReLU)

  • Adds non-linearity

Pooling (Downsampling)

  • Reduces spatial size (e.g., max pooling)
  • Keeps important features, reduces computation

Hierarchy of Features

  • Early layers → edges
  • Middle layers → textures/parts
  • Deep layers → objects

Fully Connected Layers

  • Combine extracted features for final prediction
  • Used near output

Outcome

  • Strong performance on vision tasks (classification, detection, etc.)

9.2.1 (a)

What’s the output dimension if we apply filter \(\mathbf{k}\) with no padding and stride of 1? What’s the first element of the output? What’s the last element?

Answer The output dimension is 9 - 3 + 1 = 7. The first element is \(k_1x_1 + k_2x_2 + k_3x_3\) and the last element is \(k_1x_7 + k_2x_8 + k_3x_9\). Expressed as a dot product, these are equivalent to \([\, x_1 \quad x_2 \quad x_3 \,]\mathbf{k}\) and \([\, x_7 \quad x_8 \quad x_9 \,]\mathbf{k}\).

9.2.2 (b)

What’s the output dimension if we apply filter \(\mathbf{k}\) with padding of size 1 and stride of 2? What’s the first element of the output? What’s the second element?

Answer For the second example, the output dimension would be \((9+2-3)//2 + 1 = 5\). The first element is \(k_10 + k_2x_1 + k_3x_2\) and the second element would be \(k_1x_2 + k_2x_3 + k_3x_4\). Expressed as a dot product, these are equivalent to \([\, 0 \quad x_1 \quad x_2 \,]\mathbf{k}\) and \([\, x_2 \quad x_3 \quad x_4 \,]\mathbf{k}\).

9.2.3 (c)

Recall that CNN filters have the property of weight sharing, meaning that different portions of an image can share the same weight to extract the same set of features. A convolution is actually a linear operation, which we can express as \(\mathbf{x}' = \mathbf{K}\mathbf{x}\) for some weight matrix \(\mathbf{K}\). Find \(\mathbf{K}\) for the convolution applied in part (a). What is its dimension?

Answer The input size is \(D=9\) and the output size shrinks by \(K-1=2\), so \(\mathbf{K}\) has a dimension of \(\mathbb{R}^{7\times9}\) and is given by \[ \mathbf{K} = \begin{bmatrix} k_1 & k_2 & k_3 & 0 & 0 & \cdots & 0\\ 0 & k_1 & k_2 & k_3 & 0 & \cdots & 0\\ & & & \vdots & & & \\ 0 & \cdots & 0 & k_1 & k_2 & k_3 & 0\\ 0 & \cdots & 0 & 0 & k_1 & k_2 & k_3 \end{bmatrix}.\]

9.2.4 (d)

Consider now a 2-dimensional example with the following kernel filter and image: \[\mathbf{K} = \frac{1}{4} \begin{bmatrix} 1 & 1\\ 1 & 1 \end{bmatrix} \hspace{10mm} \mathbf{X} = \begin{bmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9 \end{bmatrix} \]

Using no padding and a stride of 1, compute the output, and describe the effect of this filter.

Answer The output is \[ \begin{bmatrix} 3 & 4 \\ 6 & 7 \end{bmatrix}. \] This is a 2D moving average filter (aka a box filter) and can be used to perform average pooling.

9.2.5 (e)

We want to know the general formula for computing the output dimension of a convolution operation. Suppose we have a square input image of dimension \(W \times W\) and a \(K \times K\) kernel filter. If we assume stride of 1 and no padding, what’s the output dimension \(W'\)? What if we applied stride of \(s\) and padding of size \(p\), how would the dimension change?

Answer

With a stride length of one and no padding, the output dimension will be \(W' = W - K + 1\).

With a stride of \(s\) and padding of size \(p\), the output dimension will be \(\lfloor\frac{W + 2p - K}{s}\rfloor + 1\). We include a \(+2p\) because the padding is equivalent to adding extra terms on both the left and right side of the array. Dividing by \(s\) follows from the fact that we increment our counter by \(+s\) instead of \(+1\) when sliding along the image. The floor division operator ensures that the resulting output is an integer.