10 CS 189 Discussion 10: Transformers I (From Spring 2026)
10.0.1 Contact Information
| Name | Wesley Zheng |
| Pronouns | He/him/his |
| wzheng0302@berkeley.edu | |
| Discussion | Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3-4 PM @ Hildebrand Hall B51 |
| Office Hours | Tuesdays, 11–1 PM @ Cory Courtyard |
10.1 Query-Key-Value in Transformer Attention (F25 Dis10 Q1)
The attention mechanism was a key building block introduced by the paper Attention Is All You Need in 2017 that has jump-started unprecedented advances in deep learning architectures. Attention helps machine learning models determine the relative importance of each part of an input sequence to other parts of the input sequence. In this problem, we will see how queries, keys, and values are calculated in the attention process and how they allow models to attend to their inputs.
Assume that the encoder in our transformer is processing the embeddings of three tokens:
\[ \mathbf{x}_1 = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \qquad \mathbf{x}_2 = \begin{bmatrix} -1 \\ 0 \end{bmatrix}, \qquad \mathbf{x}_3 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}. \]
Our attention layer has the following weight matrices: \[ \mathbf{W}_\mathbf{Q} = \begin{bmatrix} 1 & 0 & 2 \\ 0 & 1 & -3 \end{bmatrix}, \qquad \mathbf{W}_\mathbf{K} = \begin{bmatrix} 0 & 0 & -1 \\ -2 & 0 & 1 \end{bmatrix}, \qquad \mathbf{W}_V = \begin{bmatrix} 8 & 0 \\ 0 & 9 \end{bmatrix}. \]
The query, key, and value of each embedding vector are defined respectively as \[\mathbf{q}_i = \mathbf{W}_Q^\top \mathbf{x}_i , \qquad \vec{k}_i = \mathbf{W}_K^\top \mathbf{x}_i, \qquad \mathbf{v}_i = \mathbf{W}_V^\top \mathbf{x}_i. \]
- Transformers compute context-aware representations of tokens.
- Each token is updated by “looking at” other tokens in the sequence.
- This mechanism is called self-attention.
Each input token is projected into three vectors:
- Query (Q): what this token is searching for
- Key (K): what this token offers / represents
- Value (V): the actual information passed forward
These are produced via learned linear transformations of the input embeddings.
A token compares its query with all other tokens’ keys
This produces a set of relevance scores
Scores determine how much each other token should influence the current token
Each token’s new representation is a weighted mixture of values:
- tokens it strongly attends to contribute more
- irrelevant tokens contribute little to nothing
10.1.1 (a)
Compute the query, key, and value vectors for \(\mathbf{x}_1\), \(\mathbf{x}_2\), and \(\mathbf{x}_3\).
Answer
Remember the shapes: \(\mathbf{x}_i\in\mathbb{R}^2\), while \(\mathbf{W}_Q,\mathbf{W}_K\in\mathbb{R}^{2\times 3}\) and \(\mathbf{W}_V\in\mathbb{R}^{2\times 2}\). So \(\mathbf{W}_Q^\top\mathbf{x}_i\) and \(\mathbf{W}_K^\top\mathbf{x}_i\) land in \(\mathbb{R}^3\), but \(\mathbf{W}_V^\top\mathbf{x}_i\) stays in \(\mathbb{R}^2\).
Queries: \[ \mathbf{q}_1 = \mathbf{W}_Q^\top \mathbf{x}_1 = \begin{bmatrix}1 & 0 \\[4pt] 0 & 1 \\[4pt] 2 & -3\end{bmatrix} \begin{bmatrix}2 \\[4pt] 3\end{bmatrix} = \begin{bmatrix}2 \\[4pt] 3 \\[4pt] -5\end{bmatrix}, \qquad \mathbf{q}_2 = \mathbf{W}_Q^\top \mathbf{x}_2 = \begin{bmatrix}-1 \\[4pt] 0 \\[4pt] -2\end{bmatrix}, \qquad \mathbf{q}_3 = \mathbf{W}_Q^\top \mathbf{x}_3 = \begin{bmatrix}1 \\[4pt] -2 \\[4pt] 8\end{bmatrix}. \]
Keys: \[ \vec{k}_1 = \mathbf{W}_K^\top \mathbf{x}_1 = \begin{bmatrix}0 & -2 \\[4pt] 0 & 0 \\[4pt] -1 & 1\end{bmatrix} \begin{bmatrix}2 \\[4pt] 3\end{bmatrix} = \begin{bmatrix}-6 \\[4pt] 0 \\[4pt] 1\end{bmatrix}, \qquad \vec{k}_2 = \mathbf{W}_K^\top \mathbf{x}_2 = \begin{bmatrix}0 \\[4pt] 0 \\[4pt] 1\end{bmatrix}, \qquad \vec{k}_3 = \mathbf{W}_K^\top \mathbf{x}_3 = \begin{bmatrix}4 \\[4pt] 0 \\[4pt] -3\end{bmatrix}. \]
Values: \[ \mathbf{v}_1 = \mathbf{W}_V^\top \mathbf{x}_1 = \begin{bmatrix}8 & 0 \\[4pt] 0 & 9\end{bmatrix} \begin{bmatrix}2 \\[4pt] 3\end{bmatrix} = \begin{bmatrix}16 \\[4pt] 27\end{bmatrix}, \qquad \mathbf{v}_2 = \mathbf{W}_V^\top \mathbf{x}_2 = \begin{bmatrix}-8 \\[4pt] 0\end{bmatrix}, \qquad \mathbf{v}_3 = \mathbf{W}_V^\top \mathbf{x}_3 = \begin{bmatrix}8 \\[4pt] -18\end{bmatrix}. \]10.1.2 (b)
Recall that the attention mechanism in transformers allows the model to decide how much each token should “focus” on every other token in the sequence. To do this, the model computes an attention score for every ordered pair of tokens by first taking the dot product between the query vector of one token and the key vector of another. For example, the inner product between the \(i\)-th token’s query, \(\mathbf{q}_i\), and the \(j\)-th token’s key, \(\vec{k}_j\), is: \(z_{i, j} = \mathbf{q}_i^T \vec{k}_j\).
The inner product \(z_{i, j}\) measures how stoken \(\mathbf{x}_i\) should consider or “pay attention” to token \(\mathbf{x}_j\). Calculate the attention that token \(\mathbf{x}_3\) places on each of the three tokens, \(\mathbf{x}_1\), \(\mathbf{x}_2\), \(\mathbf{x}_3\).
Answer
Token 3’s query \(\mathbf{q}_3\) is dotted with each token’s key \(\vec{k}_j\); larger \(z_{3,j}\) means token 3 attends more strongly to token \(j\) after softmax (part (c)).
10.1.3 (c)
We would like to transform these attention scores into probabilities, so we will apply the softmax function. Taking the softmax over \(z_{3,1}\), \(z_{3,2}\), and \(z_{3,3}\), we obtain the following probabilities: \[a_{3,1} \approx 0.00247, \qquad a_{3,2} \approx 0.99753, \qquad a_{3,3} \approx 6.90 \times 10^{-13}.\]
The resulting softmax scores act as weights for forming a weighted sum of the value vectors. Write an expression for this weighted sum for \(\mathbf{x}_3\) and plug in the values you computed previously.
Answer
This is the token-3 output representation before any residual connections: softmax weights \(a_{3,j}\) pick how much of each \(\mathbf{v}_j\) to blend. Because \(a_{3,2}\approx 1\), the sum is essentially \(\mathbf{v}_2\) (up to tiny contributions from the other values).
\[ \sum_{j=1}^3 a_{3,j} \, v_j = 0.00247 \begin{bmatrix} 16 \\ 27 \end{bmatrix} + 0.99753 \begin{bmatrix} -8 \\ 0 \end{bmatrix} + 6.90 \times 10^{-13} \begin{bmatrix} 8 \\ -18 \end{bmatrix} \approx \begin{bmatrix} -7.94 \\ 0.067 \end{bmatrix} \]
Numerically, \(\approx [-7.94,\,0.07]^\top\) is extremely close to \(\mathbf{v}_2=[-8,0]^\top\) because the softmax nearly one-hot selects token 2.
10.1.4 (d)
Transformers benefit from efficient matrix operations. To parallelize our computations, we stack all our input embeddings, \(\mathbf{x}_1\), \(\mathbf{x}_2\), and \(\mathbf{x}_3\), into the rows of a data matrix \(\mathbf{x}\): \[ \mathbf{x} = \begin{bmatrix} - \mathbf{x}_1^T - \\ - \mathbf{x}_2^T - \\ - \mathbf{x}_3^T - \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ -1 & 0 \\ 1 & -2 \end{bmatrix}. \]
We want to obtain query, key, and value matrices, where \[ \mathbf{Q} = \begin{bmatrix} - \mathbf{q}_1^T - \\ - \mathbf{q}_2^T - \\ - \mathbf{q}_3^T - \end{bmatrix}, \quad \mathbf{K} = \begin{bmatrix} - \vec{k}_1^T - \\ - \vec{k}_2^T - \\ - \vec{k}_3^T - \end{bmatrix}, \quad \mathbf{V} = \begin{bmatrix} - \mathbf{v}_1^T - \\ - \mathbf{v}_2^T - \\ - \mathbf{v}_3^T - \end{bmatrix}. \]
Write an expression for these three matrices in terms of the data matrix, \(\mathbf{x}\), and the weight matrices defined in the problem statement. What are the dimensions of these matrices?
Answer
\[\begin{align*} \mathbf{Q} &= \mathbf{x} \mathbf{W}_Q \in\mathbb{R}^{3\times3} \\ \mathbf{K} &= \mathbf{x} \mathbf{W}_K \in\mathbb{R}^{3\times3} \\ \mathbf{V} &= \mathbf{x} \mathbf{W}_V \in\mathbb{R}^{3\times2} \end{align*}\]
Matrix multiply shape check: \(\mathbf{x}\in\mathbb{R}^{3\times 2}\) times \(\mathbf{W}_Q\in\mathbb{R}^{2\times 3}\) gives \(\mathbf{Q}\in\mathbb{R}^{3\times 3}\) (one row \(\mathbf{q}_i^\top\) per token). Similarly \(\mathbf{K}\in\mathbb{R}^{3\times 3}\), while \(\mathbf{V}\in\mathbb{R}^{3\times 2}\) because \(\mathbf{W}_V\in\mathbb{R}^{2\times 2}\).
To see why this is the case, let’s look just at our expression for the query matrix, \(\mathbf{Q}\). Our original expression \(q_i = \mathbf{W}_Q^\top \mathbf{x}_i\) treated \(\mathbf{q}_i\) and \(\mathbf{x}_i\) as column vectors. Now, we want to calculate each \(\mathbf{q}_i\) as a row vector so that we can stack them into the matrix \(\mathbf{Q}\). We can transpose both sides of the expression to get \(\mathbf{q}_i^T = \mathbf{x}_i^T \mathbf{W}_Q\). This leads us to the following expression:
Stacking rows turns many per-token multiplies into one batched multiply: each row of \(\mathbf{x}\) picks up the corresponding row of \(\mathbf{Q}\) after right-multiplying by \(\mathbf{W}_Q\).
10.1.5 (e)
Using the \(\mathbf{Q}\) and \(\mathbf{K}\) matrices from the previous step, show that the \((i, j)\)-th entry of the matrix product \(\mathbf{Q}\mathbf{K}^T\) is exactly \(z_{i, j}\) from part (b).
Answer
The \((i,j)\)-th entry of \(\mathbf{Q}\mathbf{K}^T\) is given by taking the dot product of the \(i\)-th row of \(\mathbf{Q}\) (which is \(\mathbf{q}_i^T\)) and the \(j\)-th column of \(\mathbf{K}^T\) (which is \(\vec{k}_j\)):
\[ (\mathbf{Q}\mathbf{K}^T)_{i, j} = \mathbf{q}_i^T \vec{k}_j. \]
\((\mathbf{Q}\mathbf{K}^T)_{ij}\) is “row \(i\) of \(\mathbf{Q}\) times column \(j\) of \(\mathbf{K}^T\)”, but column \(j\) of \(\mathbf{K}^T\) is \(\vec{k}_j\)—exactly the score definition from part (b).
10.1.6 (f)
The \(i\)-th row in the matrix product \(\mathbf{Q}\mathbf{K}^T\) represents how much \(\mathbf{x}_i\) should attend to every other token \(\mathbf{x}_j\). Recall that we apply the softmax over the attention scores \(z_{i, 1}\), \(z_{i, 2}\), \(z_{i, 3}\) to turn them into probabilities. In matrix world, this means we apply the softmax to each row of \(\mathbf{Q}\mathbf{K}^T\), such that all the entries are non-negative and the entries in each row sum to 1.
Let \(\mathbf{A}\) be the result of applying the softmax function to \(\mathbf{Q}\mathbf{K}^T\) row-wise. Show that the \(i\)-th row of \(\mathbf{A}\mathbf{V}\) is the weighted sum of the value vectors for \(\mathbf{x}_i\), using the softmax scores as weights.
Answer
Let \(\mathbf{A}\) be the \(3 \times 3\) “softmaxed” attention matrix with entries \(a_{i,j}\) and \(\mathbf{V}\) be the same value matrix defined previously with rows \(\mathbf{v}_i\). In summation notation, the \(i\)-th row of \(\mathbf{A}\mathbf{V}\) is: \[ (\mathbf{A}\mathbf{V})_{i} = \sum_{j=1}^3 a_{i,j} \mathbf{v}_j^T = a_{i,1} \mathbf{v}_1^T + a_{i,2} \mathbf{v}_2^T + a_{i,3} \mathbf{v}_3^T \]
This shows that each row in the matrix product \(\mathbf{A}\mathbf{V}\) is a weighted sum of the value vectors, with the weights coming from the softmaxed attention scores in the \(i\)-th row of \(\mathbf{A}\). In other words, multiplying the softmax matrix \(\mathbf{A}\) by \(\mathbf{V}\) is equivalent to performing the weighted sum for each token with the token’s own softmax scores.10.2 Justifying Scaled-Dot Product Attention (F25 Dis10 Q2)
In the previous problem, we worked through an example of softmax inner-product self-attention. In transformers, we actually apply scaled softmax inner-product self-attention. In this problem we’ll explore where this scaling factor comes from.
Suppose \(\mathbf{q}, \mathbf{k} \in \mathbb{R}^{D_k}\) are two random vectors with \(\mathbf{q}, \mathbf{k} \overset{iid}{\sim} \mathcal{N}(\mu\mathbf{1}, \sigma^2 \mathbf{I})\), where \(\mu\mathbf{1} \in \mathbb{R}^{D_k}\) and \(\sigma \in \mathbb{R}^+\). In other words, each component \(q_i\) of \(\mathbf{q}\) is drawn from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), and the same is true for \(k_i\) of \(\mathbf{k}\).
10.2.1 (a)
Define \(\mathbb{E}[\mathbf{q}^\top \mathbf{k}]\) in terms of \(\mu, \sigma\) and \(D_k\).
Answer
Using the definition of the dot product and linearity of expectation, \[ \mathbb{E}[\mathbf{q}^\top \mathbf{k}] = \mathbb{E}\left[\sum_{i=1}^{D_k} q_i k_i\right] = \sum_{i=1}^{D_k} \mathbb{E}[q_i k_i]. \]
Expectation passes through sums linearly; the dot product is just \(\sum_i q_i k_i\), so we can compute \(\mathbb{E}\) term-by-term.
Since \(q_i\) and \(k_i\) are i.i.d. with mean \(\mu\), \[ \mathbb{E}[\mathbf{q}^\top \mathbf{k}] = \sum_{i=1}^{D_k} \mathbb{E}[q_i k_i] = \sum_{i=1}^{D_k} \mathbb{E}[q_i] \mathbb{E}[k_i] = \sum_{i=1}^{D_k} \mu^2 = D_k \mu^2. \]
Independence of \(\mathbf{q}\) and \(\mathbf{k}\) lets \(\mathbb{E}[q_i k_i]=\mathbb{E}[q_i]\mathbb{E}[k_i]=\mu^2\) for each coordinate; summing \(D_k\) identical terms gives \(D_k\mu^2\).
10.2.2 (b)
Considering a practical case where \(\mu = 0\) and \(\sigma = 1\), define \(\mathrm{Var}(\mathbf{q}^\top \mathbf{k})\) in terms of \(D_k\).
Answer
Using the definition of the dot product, \[ \mathrm{Var}(\mathbf{q}^\top \mathbf{k}) = \mathrm{Var}\left(\sum_{i=1}^{D_k} q_i k_i\right). \]
Variance of a sum is not always the sum of variances—only when the summands are uncorrelated (here, different \(i\) pairs are independent across coordinates).
Since the \(q_i k_i\) terms are independent, the variance of the sum is the sum of the variances: \[ \mathrm{Var}(\mathbf{q}^\top \mathbf{k}) = \sum_{i=1}^{D_k} \mathrm{Var}(q_i k_i). \]
Each \(q_i, k_i \sim \mathcal{N}(0,1)\) independently, so \(q_i k_i\) has mean \(0\) and variance \[ \mathrm{Var}(q_i k_i) = \mathbb{E}[q_i^2 k_i^2] - (\mathbb{E}[q_i k_i])^2 = \mathbb{E}[q_i^2]\mathbb{E}[k_i^2] - 0 = 1 \cdot 1 = 1. \]
For standard normals, \(\mathbb{E}[q_i^2]=\mathrm{Var}(q_i)=1\); independence splits \(\mathbb{E}[q_i^2 k_i^2]\).
Therefore, \[ \mathrm{Var}(\mathbf{q}^\top \mathbf{k}) = \sum_{i=1}^{D_k} 1 = D_k. \]
So dot-product scores grow noisier in RMS sense as \(D_k\) increases—this is the statistical motivation for dividing by \(\sqrt{D_k}\) before softmax.
10.2.3 (c)
Continue to assume \(\mu = 0\) and \(\sigma = 1\). Let \(s\) be the scaling factor on the dot product. Suppose we want \(\mathbb{E}\!\left[\frac{\mathbf{q}^\top \mathbf{k}}{s}\right]\) to be 0, and \(\mathrm{Var}\!\left(\frac{\mathbf{q}^\top \mathbf{k}}{s}\right)\) to be \(\sigma = 1\). What should \(s\) be in terms of \(D_k\)?
Answer
We know \(\mathbb{E}[\mathbf{q}^\top \mathbf{k}] = 0\) and \(\mathrm{Var}(\mathbf{q}^\top \mathbf{k}) = D_k\).
For the scaled dot product, \(\mathbb{E}\!\left[\frac{\mathbf{q}^\top \mathbf{k}}{s}\right]=0\) and \[
\mathrm{Var}\!\left(\frac{\mathbf{q}^\top \mathbf{k}}{s}\right) = \frac{\mathrm{Var}(\mathbf{q}^\top \mathbf{k})}{s^2} = \frac{D_k}{s^2}.
\]
Mean stays 0 because expectation is linear and \(\mathbb{E}[\mathbf{q}^\top\mathbf{k}]=0\) already. For variance, scaling the dot product by \(1/s\) scales variance by \(1/s^2\).
The expectation is 0 regardless of \(s\), and we want variance to be \(1\), so: \[ \frac{D_k}{s^2} = 1 \implies s = \sqrt{D_k}. \] Thus, the scaling factor should be \(s = \sqrt{D_k}\).
So the famous \(\sqrt{D_k}\) scaling re-sets the score variance to \(O(1)\) as head dimension grows, keeping softmax from saturating into nearly one-hot vectors purely from random scale growth.