10 CS 189 Discussion 10: Transformers I

10.0.1 Contact Information

Name	Wesley Zheng
Pronouns	He/him/his
Email	wzheng0302@berkeley.edu
Discussion	Wednesdays, 11–12 PM @ Wheeler 120 & Wednesdays, 3-4 PM @ Hildebrand Hall B51
Office Hours	Tuesdays, 11–1 PM @ Cory Courtyard

10.1 Query-Key-Value in Transformer Attention (F25 Dis10 Q1)

The attention mechanism was a key building block introduced by the paper Attention Is All You Need in 2017 that has jump-started unprecedented advances in deep learning architectures. Attention helps machine learning models determine the relative importance of each part of an input sequence to other parts of the input sequence. In this problem, we will see how queries, keys, and values are calculated in the attention process and how they allow models to attend to their inputs.

Assume that the encoder in our transformer is processing the embeddings of three tokens:

\[ \mathbf{x}_1 = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \qquad \mathbf{x}_2 = \begin{bmatrix} -1 \\ 0 \end{bmatrix}, \qquad \mathbf{x}_3 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}. \]

Our attention layer has the following weight matrices: \[ \mathbf{W}_\mathbf{Q} = \begin{bmatrix} 1 & 0 & 2 \\ 0 & 1 & -3 \end{bmatrix}, \qquad \mathbf{W}_\mathbf{K} = \begin{bmatrix} 0 & 0 & -1 \\ -2 & 0 & 1 \end{bmatrix}, \qquad \mathbf{W}_V = \begin{bmatrix} 8 & 0 \\ 0 & 9 \end{bmatrix}. \]

The query, key, and value of each embedding vector are defined respectively as \[\mathbf{q}_i = \mathbf{W}_Q^\top \mathbf{x}_i , \qquad \vec{k}_i = \mathbf{W}_K^\top \mathbf{x}_i, \qquad \mathbf{v}_i = \mathbf{W}_V^\top \mathbf{x}_i. \]

Attention

Transformers compute context-aware representations of tokens.
Each token is updated by “looking at” other tokens in the sequence.
This mechanism is called self-attention.

Each input token is projected into three vectors:

Query (Q): what this token is searching for
Key (K): what this token offers / represents
Value (V): the actual information passed forward

These are produced via learned linear transformations of the input embeddings.

A token compares its query with all other tokens’ keys
This produces a set of relevance scores
Scores determine how much each other token should influence the current token
Each token’s new representation is a weighted mixture of values:
- tokens it strongly attends to contribute more
- irrelevant tokens contribute little to nothing

10.1.1 (a)

Compute the query, key, and value vectors for \(\mathbf{x}_1\), \(\mathbf{x}_2\), and \(\mathbf{x}_3\).

Answer

Queries: \[ \mathbf{q}_1 = \mathbf{W}_Q^\top \mathbf{x}_1 = \begin{bmatrix}1 & 0 \\[4pt] 0 & 1 \\[4pt] 2 & -3\end{bmatrix} \begin{bmatrix}2 \\[4pt] 3\end{bmatrix} = \begin{bmatrix}2 \\[4pt] 3 \\[4pt] -5\end{bmatrix}, \qquad \mathbf{q}_2 = \mathbf{W}_Q^\top \mathbf{x}_2 = \begin{bmatrix}-1 \\[4pt] 0 \\[4pt] -2\end{bmatrix}, \qquad \mathbf{q}_3 = \mathbf{W}_Q^\top \mathbf{x}_3 = \begin{bmatrix}1 \\[4pt] -2 \\[4pt] 8\end{bmatrix}. \]

Keys: \[ \vec{k}_1 = \mathbf{W}_K^\top \mathbf{x}_1 = \begin{bmatrix}0 & -2 \\[4pt] 0 & 0 \\[4pt] -1 & 1\end{bmatrix} \begin{bmatrix}2 \\[4pt] 3\end{bmatrix} = \begin{bmatrix}-6 \\[4pt] 0 \\[4pt] 1\end{bmatrix}, \qquad \vec{k}_2 = \mathbf{W}_K^\top \mathbf{x}_2 = \begin{bmatrix}0 \\[4pt] 0 \\[4pt] 1\end{bmatrix}, \qquad \vec{k}_3 = \mathbf{W}_K^\top \mathbf{x}_3 = \begin{bmatrix}4 \\[4pt] 0 \\[4pt] -3\end{bmatrix}. \]

Values: \[ \mathbf{v}_1 = \mathbf{W}_V^\top \mathbf{x}_1 = \begin{bmatrix}8 & 0 \\[4pt] 0 & 9\end{bmatrix} \begin{bmatrix}2 \\[4pt] 3\end{bmatrix} = \begin{bmatrix}16 \\[4pt] 27\end{bmatrix}, \qquad \mathbf{v}_2 = \mathbf{W}_V^\top \mathbf{x}_2 = \begin{bmatrix}-8 \\[4pt] 0\end{bmatrix}, \qquad \mathbf{v}_3 = \mathbf{W}_V^\top \mathbf{x}_3 = \begin{bmatrix}8 \\[4pt] -18\end{bmatrix}. \]

10.1.2 (b)

Recall that the attention mechanism in transformers allows the model to decide how much each token should “focus” on every other token in the sequence. To do this, the model computes an attention score for every ordered pair of tokens by first taking the dot product between the query vector of one token and the key vector of another. For example, the inner product between the \(i\)-th token’s query, \(\mathbf{q}_i\), and the \(j\)-th token’s key, \(\vec{k}_j\), is: \(z_{i, j} = \mathbf{q}_i^T \vec{k}_j\).

The inner product \(z_{i, j}\) measures how stoken \(\mathbf{x}_i\) should consider or ``pay attention” to token \(\mathbf{x}_j\). Calculate the attention that token \(\mathbf{x}_3\) places on each of the three tokens, \(\mathbf{x}_1\), \(\mathbf{x}_2\), \(\mathbf{x}_3\).

Answer

\[ z_{3,1} = \mathbf{q}_3^T \vec{k}_1 = \begin{bmatrix} 1 & -2 & 8 \end{bmatrix} \begin{bmatrix} -6 \\ 0 \\ 1 \end{bmatrix} = 2, \qquad z_{3,2} = \mathbf{q}_3^T \vec{k}_2 = 8, \qquad z_{3,3} = \mathbf{q}_3^T \vec{k}_3 = -20. \]

10.1.3 (c)

We would like to transform these attention scores into probabilities, so we will apply the softmax function. Taking the softmax over \(z_{3,1}\), \(z_{3,2}\), and \(z_{3,3}\), we obtain the following probabilities: \[a_{3,1} \approx 0.00247, \qquad a_{3,2} \approx 0.99753, \qquad a_{3,3} \approx 6.90 \times 10^{-13}.\]

The resulting softmax scores act as weights for forming a weighted sum of the value vectors. Write an expression for this weighted sum for \(\mathbf{x}_3\) and plug in the values you computed previously.

Answer

\[ \sum_{j=1}^3 a_{3,j} \, v_j = 0.00247 \begin{bmatrix} 16 \\ 27 \end{bmatrix} + 0.99753 \begin{bmatrix} -8 \\ 0 \end{bmatrix} + 6.90 \times 10^{-13} \begin{bmatrix} 8 \\ -18 \end{bmatrix} \approx \begin{bmatrix} -7.94 \\ 0.067 \end{bmatrix} \]

10.1.4 (d)

Transformers benefit from efficient matrix operations. To parallelize our computations, we stack all our input embeddings, \(\mathbf{x}_1\), \(\mathbf{x}_2\), and \(\mathbf{x}_3\), into the rows of a data matrix \(\mathbf{x}\): \[ \mathbf{x} = \begin{bmatrix} - \mathbf{x}_1^T - \\ - \mathbf{x}_2^T - \\ - \mathbf{x}_3^T - \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ -1 & 0 \\ 1 & -2 \end{bmatrix}. \]

We want to obtain query, key, and value matrices, where \[ \mathbf{Q} = \begin{bmatrix} - \mathbf{q}_1^T - \\ - \mathbf{q}_2^T - \\ - \mathbf{q}_3^T - \end{bmatrix}, \quad \mathbf{K} = \begin{bmatrix} - \vec{k}_1^T - \\ - \vec{k}_2^T - \\ - \vec{k}_3^T - \end{bmatrix}, \quad \mathbf{V} = \begin{bmatrix} - \mathbf{v}_1^T - \\ - \mathbf{v}_2^T - \\ - \mathbf{v}_3^T - \end{bmatrix}. \]

Write an expression for these three matrices in terms of the data matrix, \(\mathbf{x}\), and the weight matrices defined in the problem statement. What are the dimensions of these matrices?

Answer

\[\begin{align*} \mathbf{Q} &= \mathbf{x} \mathbf{W}_Q \in\mathbb{R}^{3\times3} \\ \mathbf{K} &= \mathbf{x} \mathbf{W}_K \in\mathbb{R}^{3\times3} \\ \mathbf{V} &= \mathbf{x} \mathbf{W}_V \in\mathbb{R}^{3\times2} \end{align*}\]

To see why this is the case, let’s look just at our expression for the query matrix, \(\mathbf{Q}\). Our original expression \(q_i = \mathbf{W}_Q^\top \mathbf{x}_i\) treated \(\mathbf{q}_i\) and \(\mathbf{x}_i\) as column vectors. Now, we want to calculate each \(\mathbf{q}_i\) as a row vector so that we can stack them into the matrix \(\mathbf{Q}\). We can transpose both sides of the expression to get \(\mathbf{q}_i^T = \mathbf{x}_i^T \mathbf{W}_Q\). This leads us to the following expression:

\[ \mathbf{Q} = \begin{bmatrix} - \mathbf{q}_1^T - \\ - \mathbf{q}_2^T - \\ - \mathbf{q}_3^T - \end{bmatrix} = \begin{bmatrix} - \mathbf{x}_1^T - \\ - \mathbf{x}_2^T - \\ - \mathbf{x}_3^T - \end{bmatrix} \mathbf{W}_Q = \mathbf{x} \mathbf{W}_Q. \]

10.1.5 (e)

Using the \(\mathbf{Q}\) and \(\mathbf{K}\) matrices from the previous step, show that the \((i, j)\)-th entry of the matrix product \(\mathbf{Q}\mathbf{K}^T\) is exactly \(z_{i, j}\) from part (b).

Answer

The \((i,j)\)-th entry of \(\mathbf{Q}\mathbf{K}^T\) is given by taking the dot product of the \(i\)-th row of \(\mathbf{Q}\) (which is \(\mathbf{q}_i^T\)) and the \(j\)-th column of \(\mathbf{K}^T\) (which is \(\vec{k}_j\)):

\[ (\mathbf{Q}\mathbf{K}^T)_{i, j} = \mathbf{q}_i^T \vec{k}_j. \]

Notice that \(\mathbf{q}_i^T \vec{k}_j\) is precisely the attention score \(z_{i,j}\) defined earlier. Therefore, \(\mathbf{Q}\mathbf{K}^T\) collects all pairwise attention scores as its entries, with the \((i,j)\)-th entry corresponding to \(z_{i,j}\).

10.1.6 (f)

The \(i\)-th row in the matrix product \(\mathbf{Q}\mathbf{K}^T\) represents how much \(\mathbf{x}_i\) should attend to every other token \(\mathbf{x}_j\). Recall that we apply the softmax over the attention scores \(z_{i, 1}\), \(z_{i, 2}\), \(z_{i, 3}\) to turn them into probabilities. In matrix world, this means we apply the softmax to each row of \(\mathbf{Q}\mathbf{K}^T\), such that all the entries are non-negative and the entries in each row sum to 1.

Let \(\mathbf{A}\) be the result of applying the softmax function to \(\mathbf{Q}\mathbf{K}^T\) row-wise. Show that the \(i\)-th row of \(\mathbf{A}\mathbf{V}\) is the weighted sum of the value vectors for \(\mathbf{x}_i\), using the softmax scores as weights.

Answer

Let \(\mathbf{A}\) be the \(3 \times 3\) “softmaxed” attention matrix with entries \(a_{i,j}\) and \(\mathbf{V}\) be the same value matrix defined previously with rows \(\mathbf{v}_i\). In summation notation, the \(i\)-th row of \(\mathbf{A}\mathbf{V}\) is: \[ (\mathbf{A}\mathbf{V})_{i} = \sum_{j=1}^3 a_{i,j} \mathbf{v}_j^T = a_{i,1} \mathbf{v}_1^T + a_{i,2} \mathbf{v}_2^T + a_{i,3} \mathbf{v}_3^T \]

This shows that each row in the matrix product \(\mathbf{A}\mathbf{V}\) is a weighted sum of the value vectors, with the weights coming from the softmaxed attention scores in the \(i\)-th row of \(\mathbf{A}\). In other words, multiplying the softmax matrix \(\mathbf{A}\) by \(\mathbf{V}\) is equivalent to performing the weighted sum for each token with the token’s own softmax scores.

10.2 Justifying Scaled-Dot Product Attention (F25 Dis10 Q2)

In the previous problem, we worked through an example of softmax inner-product self-attention. In transformers, we actually apply scaled softmax inner-product self-attention. In this problem we’ll explore where this scaling factor comes from.

Suppose \(\mathbf{q}, \mathbf{k} \in \mathbb{R}^{D_k}\) are two random vectors with \(\mathbf{q}, \mathbf{k} \overset{iid}{\sim} \mathcal{N}(\mu\mathbf{1}, \sigma^2 \mathbf{I})\), where \(\mu\mathbf{1} \in \mathbb{R}^{D_k}\) and \(\sigma \in \mathbb{R}^+\). In other words, each component \(q_i\) of \(\mathbf{q}\) is drawn from a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), and the same is true for \(k_i\) of \(\mathbf{k}\).

10.2.1 (a)

Define \(\mathbb{E}[\mathbf{q}^\top \mathbf{k}]\) in terms of \(\mu, \sigma\) and \(D_k\).

Answer

Using the definition of the dot product and linearity of expectation, \[ \mathbb{E}[\mathbf{q}^\top \mathbf{k}] = \mathbb{E}\left[\sum_{i=1}^{D_k} q_i k_i\right] = \sum_{i=1}^{D_k} \mathbb{E}[q_i k_i]. \]

Since \(q_i\) and \(k_i\) are i.i.d. with mean \(\mu\), \[ \mathbb{E}[\mathbf{q}^\top \mathbf{k}] = \sum_{i=1}^{D_k} \mathbb{E}[q_i k_i] = \sum_{i=1}^{D_k} \mathbb{E}[q_i] \mathbb{E}[k_i] = \sum_{i=1}^{D_k} \mu^2 = D_k \mu^2. \]

10.2.2 (b)

Considering a practical case where \(\mu = 0\) and \(\sigma = 1\), define \(\mathrm{Var}(\mathbf{q}^\top \mathbf{k})\) in terms of \(D_k\).

Answer

Using the definition of the dot product, \[ \mathrm{Var}(\mathbf{q}^\top \mathbf{k}) = \mathrm{Var}\left(\sum_{i=1}^{D_k} q_i k_i\right). \]

Since the \(q_i k_i\) terms are independent, the variance of the sum is the sum of the variances: \[ \mathrm{Var}(\mathbf{q}^\top \mathbf{k}) = \sum_{i=1}^{D_k} \mathrm{Var}(q_i k_i). \]

Each \(q_i, k_i \sim \mathcal{N}(0,1)\) independently, so \(q_i k_i\) has mean \(0\) and variance \[ \mathrm{Var}(q_i k_i) = \mathbb{E}[q_i^2 k_i^2] - (\mathbb{E}[q_i k_i])^2 = \mathbb{E}[q_i^2]\mathbb{E}[k_i^2] - 0 = 1 \cdot 1 = 1. \]

Therefore, \[ \mathrm{Var}(\mathbf{q}^\top \mathbf{k}) = \sum_{i=1}^{D_k} 1 = D_k. \]

10.2.3 (c)

Continue to assume \(\mu = 0\) and \(\sigma = 1\). Let \(s\) be the scaling factor on the dot product. Suppose we want \(\mathbb{E}\!\left[\frac{\mathbf{q}^\top \mathbf{k}}{s}\right]\) to be 0, and \(\mathrm{Var}\!\left(\frac{\mathbf{q}^\top \mathbf{k}}{s}\right)\) to be \(\sigma = 1\). What should \(s\) be in terms of \(D_k\)?

Answer

We know \(\mathbb{E}[\mathbf{q}^\top \mathbf{k}] = 0\) and \(\mathrm{Var}(\mathbf{q}^\top \mathbf{k}) = D_k\).
For the scaled dot product, \(\mathbb{E}\!\left[\frac{\mathbf{q}^\top \mathbf{k}}{s}\right]=0\) and \[ \mathrm{Var}\!\left(\frac{\mathbf{q}^\top \mathbf{k}}{s}\right) = \frac{\mathrm{Var}(\mathbf{q}^\top \mathbf{k})}{s^2} = \frac{D_k}{s^2}. \] The expectation is 0 regardless of \(s\), and we want variance to be \(1\), so: \[ \frac{D_k}{s^2} = 1 \implies s = \sqrt{D_k}. \] Thus, the scaling factor should be \(s = \sqrt{D_k}\).