The purpose of this post is to record notes on Google's original 2017 paper on Transformers, Attention Is All You Need. Transformers are the mathematical structures underlying modern LLMs such as ChatGPT. Section headings correspond loosely to the original paper's to provide a sense of organization. Diagrams and equations from the paper are used for summarization and analysis (a link to the paper is provided in References below).
Intro and Background
- The original Transformer paper focused on language-related tasks, specifically machine translation (English-to-German and English-to-French).
- Whereas sequential models like RNNs process inputs one step at a time through a hidden state, attention mechanisms, and Transformers in particular, model dependencies in sequences of data without being limited by the recency or proximity of states to each other.
- In other words, Transformers can capture non-local or global relationships in data through attention mechanisms.
Model Architecture
- Transformers specifically use Self-Attention (or Intra-Attention), which forms a representation of the input data by relating elements of the sequence at different positions to each other.
- Transformers are based on an encoder/decoder architecture: the encoder transforms the input data into a suitable intermediate representation, which the decoder then further processes to produce a result of interest, e.g., a language response to input text.
- It's interesting to note that fully connected or vanilla neural networks are leveraged by the architecture in addition to the Attention mechanism secret sauce. These are the Feed Forward blocks shown in the Figure 1 architecture diagram. Essentially, the vector representations coming out of the Attention mechanisms are multiplied by learned matrices at each position (sketched in code after this list).
- The encoder consists of 6 identical layers, represented on the left in Figure 1. Similarly, the decoder consists of 6 identical layers, represented on the right in Figure 1. Note that the decoder utilizes the input encoding and produces one output, e.g., one word, at a time: the output sequence so far is shifted right by one element and cycled back into the decoder to produce the next element or word (a schematic decoding loop is sketched after this list).
- So, what exactly is Attention at a technical and mathematical level? The paper states "An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key." There is a fair bit to unpack here. There are really two lookups or mappings at play: the mapping of keys to values V(K) as well as the mapping of queries to weights W(Q). Schematically, the output is then the matrix product W·V, i.e., the weighted sum of embedded values mapped from the input data. In fact, it's evident in Equation 1 below that the weights matrix W is actually a function of the Queries Q as well as the Keys K (a code sketch follows this list):

  $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (1)$$
- Comparing Figure 2's diagram of the Attention mechanism to the overall Transformer architecture in Figure 1, it's evident that the input embedding and encoding act to transform the original input sequence into the vectors V, K, and Q, which are then further processed to perform the aforementioned weighted sum via dot products (Scaled Dot-Product Attention).
- The Attention mechanism utilized in the paper's Transformer is actually an h-fold bundle of the above Scaled Dot-Product Attention, dubbed Multi-Head Attention. Basically, a set of projection matrices W_i is used to generate multiple representations (h of them) of the input data (or its transmuted Queries, Keys, and Values, i.e., its Q, K, V form), allowing the model's training process to explore a more diverse representation space (see the sketch after this list).
- Information regarding the order of the input sequence is added to the encoder/decoder architecture through positional encoding functions. The authors chose vectorial sinusoidal functions for their positional encodings, with dimensions indexed by i and the position of the input sequence's element represented by pos:

  $$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$
- Interestingly, the authors note that the choice of these positional encoding functions, whether learned during training or hard-coded sinusoids, has a negligible effect on their results.
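
To make the Feed Forward blocks mentioned above concrete, here is a minimal sketch of the position-wise feed-forward network, assuming numpy, with random matrices standing in for learned parameters. The form FFN(x) = max(0, xW1 + b1)W2 + b2 and the dimensions d_model = 512 and d_ff = 2048 are the paper's; everything else is illustrative.

```python
import numpy as np

def position_wise_ffn(x, W1, b1, W2, b2):
    # FFN(x) = max(0, x*W1 + b1)*W2 + b2, applied at each position independently
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

# Dimensions from the paper: d_model = 512, inner layer d_ff = 2048.
d_model, d_ff, seq_len = 512, 2048, 10
rng = np.random.default_rng(0)
x = rng.standard_normal((seq_len, d_model))     # a toy encoded sequence
W1 = rng.standard_normal((d_model, d_ff)) * 0.02
b1 = np.zeros(d_ff)
W2 = rng.standard_normal((d_ff, d_model)) * 0.02
b2 = np.zeros(d_model)
out = position_wise_ffn(x, W1, b1, W2, b2)      # shape (10, 512)
```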
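
The decoder's shift-and-cycle behavior is, schematically, just an autoregressive loop. In the sketch below, dummy_decoder is a hypothetical stand-in for the real decoder stack (it returns random scores), so only the loop structure should be taken literally.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 100                                        # toy vocabulary size

def dummy_decoder(encoder_output, tokens):
    # Hypothetical stand-in for the real decoder stack: returns
    # next-token scores for each position of the output so far.
    return rng.standard_normal((len(tokens), VOCAB))

def greedy_decode(encoder_output, decoder, start_token=1, end_token=2, max_len=20):
    output = [start_token]                         # the "shifted right" start symbol
    for _ in range(max_len):
        scores = decoder(encoder_output, output)   # scores at every position so far
        next_token = int(np.argmax(scores[-1]))    # pick the most likely next element
        output.append(next_token)                  # feed it back in on the next pass
        if next_token == end_token:
            break
    return output

print(greedy_decode(encoder_output=None, decoder=dummy_decoder))
```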
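
Equation 1's Scaled Dot-Product Attention can also be written out directly. The variable names mirror the paper; the random inputs and the standalone softmax are just for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Equation 1: Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # the weights W as a function of Q and K
    return weights @ V                         # the weighted sum of the values

rng = np.random.default_rng(0)
n, d_k, d_v = 5, 64, 64                        # 5 positions; per-head dims from the paper
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out = scaled_dot_product_attention(Q, K, V)    # shape (5, 64)
```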
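
And a sketch of the h-fold bundle (Multi-Head Attention), reusing scaled_dot_product_attention from the previous sketch: each head applies its own projections W_Q, W_K, W_V before attending, and the h outputs are concatenated and projected by W_O. The values h = 8 and d_model = 512 are the paper's; the random matrices again stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
h, d_model = 8, 512                            # paper's values
d_k = d_model // h                             # 512 / 8 = 64 dimensions per head

def multi_head_attention(x, head_projections, W_O):
    # MultiHead = Concat(head_1, ..., head_h) W_O,
    # where head_i = Attention(x W_Qi, x W_Ki, x W_Vi) for self-attention.
    heads = [scaled_dot_product_attention(x @ W_Q, x @ W_K, x @ W_V)
             for (W_Q, W_K, W_V) in head_projections]
    return np.concatenate(heads, axis=-1) @ W_O

head_projections = [tuple(rng.standard_normal((d_model, d_k)) * 0.02 for _ in range(3))
                    for _ in range(h)]
W_O = rng.standard_normal((d_model, d_model)) * 0.02
x = rng.standard_normal((5, d_model))          # self-attention: Q, K, V all derive from x
out = multi_head_attention(x, head_projections, W_O)   # shape (5, 512)
```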
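
Finally, the sinusoidal positional encodings above translate to a few lines: sine on even dimensions, cosine on odd ones, added elementwise to the input embeddings. This is a sketch of the formulas, not the paper's reference code.

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(same angle)
    pos = np.arange(max_len)[:, None]              # (max_len, 1) column of positions
    i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, i / d_model)  # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims get sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims get cosine
    return pe

pe = positional_encoding(max_len=50, d_model=512)  # added to the token embeddings
```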
Why Self-Attention
- Table 1 from the paper provides a nice summary of how complexity and algorithmic effectiveness for the Self-Attention-based Transformer approach compare to other state-of-the-art methods of the time, such as Recurrent Neural Networks (RNNs). Notably, Transformers can relate any two positions in a data sequence in a constant number of steps (a Maximum Path Length of O(1)), independent of the size of the data itself, which is a very powerful result.
- Interestingly, from a complexity standpoint, Transformers are noted to be more efficient, i.e., faster per layer than RNNs, when the data sequence length n is smaller than the representation dimensionality d (a toy comparison is sketched after this list). This strikes me as a quirky result. Though the authors note that this condition is often true for language translation tasks, I imagine progress must have been made since 2017 for ChatGPT to seemingly be able to process text of arbitrary length on a wide variety of language tasks with excellent speed and efficiency.
- The authors also note that Transformers may be more interpretable than other models, as the Multi-Head Attention mechanisms seem to learn different aspects of language structure at each head. This reminds me of a note I made in an earlier blog post that ChatGPT actually tokenizes words in a manner reminiscent of suffixes/prefixes/roots rather than letters; it's possible that Multi-Head Attention has really afforded a deeper capability for etymological learning.
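
To see the n < d condition from Table 1 numerically: per-layer cost is roughly O(n²·d) for self-attention versus O(n·d²) for a recurrent layer, so self-attention wins per layer exactly when n < d. A toy comparison (constant factors ignored, d_model = 512 as in the paper):

```python
def per_layer_ops(n, d):
    # Rough per-layer costs from Table 1 (constant factors ignored):
    # self-attention is O(n^2 * d); a recurrent layer is O(n * d^2).
    return {"self_attention": n**2 * d, "recurrent": n * d**2}

for n in (64, 512, 4096):                  # sequence lengths around d_model = 512
    print(n, per_layer_ops(n, d=512))
# n = 64:   self-attention cheaper (n < d)
# n = 512:  break-even (n == d)
# n = 4096: recurrent cheaper per layer (though it cannot parallelize over positions)
```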
References:
- The original paper by Google on Transformers: Attention Is All You Need, arXiv:1706.03762 (https://arxiv.org/abs/1706.03762)
- Coursera course on LLMs and Generative AI covering the Transformer architecture: "Transformers: Attention is all you need" (Coursera)