9 CS 189 Discussion 09: Backpropagation & CNNs

9.0.1 Contact Information

Name	Wesley Zheng
Pronouns	He/him/his
Email	wzheng0302@berkeley.edu
Discussion	Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3-4 PM @ Hildebrand Hall B51
Office Hours	Tuesdays, 11–1 PM @ Cory Courtyard

9.1 Backprop in Practice: Staged Computation (F25 Dis8 Q1)

We will apply backpropagation for the function \(f(x,y,z) = (x+y)z\).

Backpropagation

Core Idea

Efficiently computes gradients of loss w.r.t. all parameters
Uses chain rule + reuses computations

Forward Pass

Compute activations layer-by-layer
Store intermediate values
Produce prediction → compute loss

Backward Pass

Propagate error output → input
Apply chain rule: \[ \frac{dL}{dw} = \frac{dL}{da}\frac{da}{dw} \]
Reuse gradients (dynamic programming style)

Local Gradients

Each layer computes its own derivative
Modular across layers

Requirements

Operations must be differentiable (or use subgradients)

9.1.1 (a)

Decompose \(f\) into two simpler functions.

Answer

We can conveniently express \(f\) as: \(g=x+y\), \(f=gz\). The original \(f(x,y,z)\) expression is simple enough to differentiate directly, but this particular approach of thinking about it in terms of its decomposition can help with understanding the intuition behind backprop.

9.1.2 (b)

Draw the computation graph that represents \(f\) following the composition from part (a).

Answer

9.1.3 (c)

Write the steps of the forward pass and backward pass through the network.

Answer

\[\begin{array}{l} \textbf{Forward pass:} g = x + y, \\[2pt] f = g \cdot z. \\[10pt] \textbf{Backward pass:} \\[4pt] \text{Backprop through } f = g z: \\[2pt] \dfrac{\partial f}{\partial z} = g, \quad \dfrac{\partial f}{\partial g} = z. \\[8pt] \text{Backprop through } g = x + y \text{ (using the chain rule):} \\[2pt] \dfrac{\partial f}{\partial x} = \dfrac{\partial g}{\partial x} \cdot \dfrac{\partial f}{\partial g} = 1 \cdot \dfrac{\partial f}{\partial g} = \dfrac{\partial f}{\partial g}, \\[6pt] \dfrac{\partial f}{\partial y} = \dfrac{\partial g}{\partial y} \cdot \dfrac{\partial f}{\partial g} = 1 \cdot \dfrac{\partial f}{\partial g} = \dfrac{\partial f}{\partial g}. \end{array}\]

9.1.4 (d)

Update your network drawing with the intermediate values in the forward and backward pass using the inputs \(x = -2\), \(y = 5\), and \(z = -4\).

Answer

We first run the forward pass (shown in black) of the network, which computes values from inputs to output. Then, we do the backward pass (blue), which starts at the end of the network and recursively applies the chain rule to compute the gradients.

Think of the gradients as flowing backwards through the network. By the time we reach the beginning of the network in backprop, each gate will have learned about the gradient of its output value on the final output of the entire circuit. Backprop can thus be thought of as message passing (via the gradient signal) between gates about whether they want their outputs to increase or decrease (and how strongly), so as to make the final output value higher.

9.2 Convolution Operation in CNNs (F25 Dis9 Q1)

In this question, we will look at the mechanism of weight sharing in convolutions. Suppose that we have an input vector with \(D=9\) dimensions: \[\mathbf{x} = [\,x_1 \quad x_2 \quad ... \quad x_9\,]^T.\]

We compute a 1D convolution with the kernel filter that has \(K=3\) weights (parameters): \[\mathbf{k} = [\,k_1 \quad k_2 \quad k_3\,]^T.\]

Convolutional Neural Networks (CNNs)

Core Idea

Learn spatial features from grid-like data (e.g., images)
Exploit local structure + parameter sharing

Convolution Layer

Apply small filters (kernels) across input
Same weights reused across locations

Feature Maps

Each map highlights a specific feature

Activation (e.g., ReLU)

Adds non-linearity

Pooling (Downsampling)

Reduces spatial size (e.g., max pooling)
Keeps important features, reduces computation

Hierarchy of Features

Early layers → edges
Middle layers → textures/parts
Deep layers → objects

Fully Connected Layers

Combine extracted features for final prediction
Used near output

Outcome

Strong performance on vision tasks (classification, detection, etc.)

9.2.1 (a)

What’s the output dimension if we apply filter \(\mathbf{k}\) with no padding and stride of 1? What’s the first element of the output? What’s the last element?

Answer

The output dimension is 9 - 3 + 1 = 7. The first element is \(k_1x_1 + k_2x_2 + k_3x_3\) and the last element is \(k_1x_7 + k_2x_8 + k_3x_9\). Expressed as a dot product, these are equivalent to \([\, x_1 \quad x_2 \quad x_3 \,]\mathbf{k}\) and \([\, x_7 \quad x_8 \quad x_9 \,]\mathbf{k}\).

9.2.2 (b)

What’s the output dimension if we apply filter \(\mathbf{k}\) with padding of size 1 and stride of 2? What’s the first element of the output? What’s the second element?

Answer

For the second example, the output dimension would be \((9+2-3)//2 + 1 = 5\). The first element is \(k_10 + k_2x_1 + k_3x_2\) and the second element would be \(k_1x_2 + k_2x_3 + k_3x_4\). Expressed as a dot product, these are equivalent to \([\, 0 \quad x_1 \quad x_2 \,]\mathbf{k}\) and \([\, x_2 \quad x_3 \quad x_4 \,]\mathbf{k}\).

9.2.3 (c)

Recall that CNN filters have the property of weight sharing, meaning that different portions of an image can share the same weight to extract the same set of features. A convolution is actually a linear operation, which we can express as \(\mathbf{x}' = \mathbf{K}\mathbf{x}\) for some weight matrix \(\mathbf{K}\). Find \(\mathbf{K}\) for the convolution applied in part (a). What is its dimension?

Answer

The input size is \(D=9\) and the output size shrinks by \(K-1=2\), so \(\mathbf{K}\) has a dimension of \(\mathbb{R}^{7\times9}\) and is given by \[ \mathbf{K} = \begin{bmatrix} k_1 & k_2 & k_3 & 0 & 0 & \cdots & 0\\ 0 & k_1 & k_2 & k_3 & 0 & \cdots & 0\\ & & & \vdots & & & \\ 0 & \cdots & 0 & k_1 & k_2 & k_3 & 0\\ 0 & \cdots & 0 & 0 & k_1 & k_2 & k_3 \end{bmatrix}.\]

9.2.4 (d)

Consider now a 2-dimensional example with the following kernel filter and image: \[\mathbf{K} = \frac{1}{4} \begin{bmatrix} 1 & 1\\ 1 & 1 \end{bmatrix} \hspace{10mm} \mathbf{X} = \begin{bmatrix} 1 & 2 & 3\\ 4 & 5 & 6\\ 7 & 8 & 9 \end{bmatrix} \]

Using no padding and a stride of 1, compute the output, and describe the effect of this filter.

Answer

The output is \[ \begin{bmatrix} 3 & 4 \\ 6 & 7 \end{bmatrix}. \] This is a 2D moving average filter (aka a box filter) and can be used to perform average pooling.

9.2.5 (e)

We want to know the general formula for computing the output dimension of a convolution operation. Suppose we have a square input image of dimension \(W \times W\) and a \(K \times K\) kernel filter. If we assume stride of 1 and no padding, what’s the output dimension \(W'\)? What if we applied stride of \(s\) and padding of size \(p\), how would the dimension change?

Answer

With a stride length of one and no padding, the output dimension will be \(W' = W - K + 1\).

With a stride of \(s\) and padding of size \(p\), the output dimension will be \(\lfloor\frac{W + 2p - K}{s}\rfloor + 1\). We include a \(+2p\) because the padding is equivalent to adding extra terms on both the left and right side of the array. Dividing by \(s\) follows from the fact that we increment our counter by \(+s\) instead of \(+1\) when sliding along the image. The floor division operator ensures that the resulting output is an integer.