These are my notes on terminology used in regard to generative AI features and practices. The references section provides detailed information about each term mentioned.
Attention is a technique used in a neural network to address the problems described under LSTM and RNN later in this article. Each input word has a corresponding hidden state that is passed along the entire chain to the decoding stage. The hidden states are used at each step of the RNN to decode, on the assumption that there may be relevant info in every word of an input sentence. Attention can be used in both the encoding stage and the decoding stage of an RNN's sequence transduction steps.
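A minimal sketch of this idea in NumPy (my own illustration, not taken from the references): at one decoding step, every encoder hidden state is scored against the current decoder state, the scores are normalized, and the weighted sum becomes the context the decoder attends to. Shapes and weights are placeholder values.

```python
# Minimal sketch: dot-product attention over encoder hidden states at one
# decoding step. Shapes and values are illustrative placeholders.
import numpy as np

def attend(decoder_state, encoder_states):
    # Score each encoder hidden state against the current decoder state.
    scores = encoder_states @ decoder_state          # shape: (seq_len,)
    # Normalize the scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Weighted sum of the hidden states: the context vector for this step.
    return weights @ encoder_states                  # shape: (hidden_dim,)

encoder_states = np.random.randn(5, 8)   # 5 input words, hidden size 8
decoder_state = np.random.randn(8)
context = attend(decoder_state, encoder_states)
print(context.shape)                     # (8,)
```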
Processing inputs (words) in parallel is not possible with these Attention techniques, which increases the translation time for a large corpus of text.
I’ve also noticed this technique does not work for projects that support Polysynthetic languages (Kanien’keha, Iroquoian dialects…).
Analytical languages (English, French, Spanish) with shorter sentences can benefit from Attention techniques.
ByteNet is a popular Convolutional Neural Network (CNN).
Causal Transformer is also called a Unidirectional Transformer and refers to a pre-trained transformer that uses language modeling on a large corpus with long-range dependencies. The OpenAI GPT model, proposed in “Improving Language Understanding by Generative Pre-Training” by Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever, is an example.
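The "causal" or "unidirectional" part means a position may only attend to earlier positions. A minimal sketch of how that is typically enforced with a mask (my own illustration with placeholder values, not the GPT implementation):

```python
# Minimal sketch of causal (unidirectional) masking: position i may only
# attend to positions <= i. Shapes and values are illustrative placeholders.
import numpy as np

seq_len = 4
scores = np.random.randn(seq_len, seq_len)           # raw attention scores

# Lower-triangular mask: True where attention is allowed.
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
masked = np.where(mask, scores, -1e9)                 # block future positions

# Softmax row by row; future positions end up with ~0 weight.
weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(np.round(weights, 2))
```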
CNN refers to Convolutional Neural Networks, which are used for translating longer sentences that need to figure out dependencies, connections and context. Popular CNNs include WaveNet and ByteNet. CNNs are used in projects with the following traits:
- Local dependencies can be exploited
- Parallelization (per layer) is trivial
- Logarithmic distance between positions
Each word of an input can be processed at the same time and does not depend on previous words for translation. The distance between an output word and an input word is on the order of log(N) – the height of the tree created from the output to the input. This is more effective than RNNs, where the distance is on the order of N.
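A small arithmetic sketch of that comparison, assuming a convolutional stack whose receptive field doubles with each layer (as in dilated-convolution models like WaveNet/ByteNet):

```python
# Rough comparison of the path length between an output position and an input
# position N steps away: an RNN needs N sequential steps, while a conv stack
# that doubles its receptive field each layer needs about log2(N) layers.
import math

for n in (8, 64, 1024):
    rnn_steps = n                          # RNN: one step per position, O(N)
    cnn_layers = math.ceil(math.log2(n))   # CNN tree: O(log N) layers
    print(f"N={n:5d}  RNN path={rnn_steps:5d}  CNN path={cnn_layers:3d}")
```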
CNNs do not really solve the problem of figuring out dependencies when translating sentences, so Transformers were created, combining the parallelism of CNNs with Attention.
Decoder-Only Transformer is a specific type of transformer used by ChatGPT.
Decoding Stage receives all hidden states created by every input ‘word’ of a sentence. Each hidden state is used to figure out where the network should pay attention when using Attention techniques.
Encoding Stage creates hidden states from the input when using Attention techniques.
LSTM refers to Long Short-Term Memory, which changes info using multiplication and addition. The info flows through what is known as the cell state, which allows LSTMs to selectively remember important info and forget info that is not as important.
Each cell takes an input (a word for a sentence-to-sentence translation), the previous cell state and the output of the previous cell. It manipulates these inputs and generates a new cell state and a new output. LSTMs do not perform well with long sentences, as they often forget context that was provided farther away in the chain of info.
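A minimal single-step LSTM cell in NumPy, following the standard gate formulation (my own sketch; the weights here are random placeholders rather than trained values):

```python
# Minimal single-step LSTM cell: forget (f), input (i), candidate (g) and
# output (o) gates acting on the cell state. Weights are random placeholders.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    z = W @ x + U @ h_prev + b
    f, i, g, o = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g        # selectively forget old info, add new info
    h = o * np.tanh(c)            # new output based on the updated cell state
    return h, c

hidden, inputs = 8, 4
W = np.random.randn(4 * hidden, inputs)
U = np.random.randn(4 * hidden, hidden)
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in np.random.randn(6, inputs):   # a 6-"word" input sequence
    h, c = lstm_step(x, h, c, W, U, b)
```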
Three issues with LSTMs and RNNs include:
- Sequential computation inhibits parallelization
- Distance between positions is linear
- No explicit modeling of long- and short-range dependencies
Neural Machine Translation is also known as Sequence Transduction.
RNN refers to Recurrent Neural Networks, which are used for translating sentences that need to figure out dependencies, connections and context. Loops allow information to persist and be passed from one step to the next, creating a chain-like process where each word is processed separately; the resulting sentence is generated by passing a hidden state to the decoding stage, where the output is produced.
RNNs learn to use past info to figure out what the next word in a sentence will be and are effective for shorter contexts. RNNs are not effective for long-term dependencies with large gaps between the relevant info that provides context. The longer the chain of info being passed from one step to the next, the more chance the info will be lost along the chain.
RNNs perform well for short-term memory and do not consider what is a priority in the data. Long Short-Term Memory (LSTM) is a type of RNN that tries to solve long-term dependency issues.
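For contrast with the LSTM cell above, a minimal vanilla RNN step (my own sketch, random placeholder weights) showing the single hidden state that carries all past info along the chain:

```python
# Minimal vanilla RNN step: one hidden state is passed along the chain,
# one word at a time. Weights are random placeholders.
import numpy as np

def rnn_step(x, h_prev, Wx, Wh, b):
    # The new hidden state depends on the current word and the previous state.
    return np.tanh(Wx @ x + Wh @ h_prev + b)

hidden, inputs = 8, 4
Wx = np.random.randn(hidden, inputs)
Wh = np.random.randn(hidden, hidden)
b = np.zeros(hidden)

h = np.zeros(hidden)
for x in np.random.randn(6, inputs):   # process each "word" sequentially
    h = rnn_step(x, h, Wx, Wh, b)      # past info persists only through h
```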
Self-Attention creates three vectors from each of the encoder’s input vectors (word embeddings) as the first step: for each word, a Query vector, a Key vector and a Value vector. The vectors are created by multiplying the embedding by three matrices that were learned during training.
The second step calculates a score for each word of the input sentence to determine how much focus to place on other parts of the input sentence as we encode a word at a certain position. The score is calculated by taking the dot product of the query vector with the key vector of the word we are scoring.
The third and fourth steps divide the scores by the square root of the dimension of the key vectors (8 for the default dimension of 64) to get more stable gradients, and pass the results through a softmax operation so the scores are all positive and add up to 1.
The fifth step multiplies each value vector by its softmax score (in preparation for summing them). This keeps the values of the words we want to focus on intact and drowns out irrelevant words by multiplying them with tiny numbers such as 0.001.
The sixth and final step sums up the weighted value vectors to produce the output of the self-attention layer at this position for the first word.
The resulting vector is sent to the feed-forward neural network.
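The six steps above fit in a few lines of NumPy. This is my own single-head sketch with random placeholder matrices standing in for the matrices learned during training:

```python
# Minimal single-head self-attention, following the six steps described above.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv      # step 1: query/key/value vectors
    scores = Q @ K.T                       # step 2: dot-product scores
    scores /= np.sqrt(K.shape[-1])         # step 3: scale by sqrt(d_k)
    weights = softmax(scores)              # step 4: softmax to weights
    return weights @ V                     # steps 5-6: weight and sum values

seq_len, d_model, d_k = 5, 16, 8
X = np.random.randn(seq_len, d_model)      # word embeddings
Wq, Wk, Wv = (np.random.randn(d_model, d_k) for _ in range(3))
Z = self_attention(X, Wq, Wk, Wv)          # one output vector per position
print(Z.shape)                             # (5, 8)
```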
Sequence Transduction is also known as neural machine translation.
Transformers refer to a type of neural network architecture that changes or transforms an input into an output. This is accomplished by tracking relationships between sequence components and learning context.
The initial concept is credited to Jakob Uszkoreit. Noam Shazeer and Ashish Vaswani are two authors deeply involved in the work.
OpenAI uses transformers in their language models. DeepMind used transformers in their AlphaStar program to defeat a professional StarCraft player.
Transformers were made to solve the problem of sequence transduction, also known as neural machine translation. Any task that transforms an input sequence to an output sequence (speech recognition, text to speech transformation…) is being solved with transformers.
Transformers are designed to solve parallelization problems by using encoders and decoders with self-attention, which boosts how fast the model can translate from one sequence to another.
The encoders are all similar to each other and have the same architecture with two layers: Self-Attention and a Feed-Forward Neural Network.
The decoders are likewise similar to each other and share the same architecture, with an additional attention layer between those two that helps the decoder focus on relevant parts of the input sentence.
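A simplified encoder-layer sketch showing that two-layer structure (my own illustration; multi-head attention, residual connections and layer normalization from the full architecture are deliberately omitted, and all weights are random placeholders):

```python
# Simplified encoder layer: a self-attention sub-layer followed by a
# position-wise feed-forward network. Not the full architecture.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def encoder_layer(X, Wq, Wk, Wv, W1, b1, W2, b2):
    # Sub-layer 1: self-attention over all positions of the input.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    Z = softmax((Q @ K.T) / np.sqrt(K.shape[-1])) @ V
    # Sub-layer 2: the same feed-forward network applied to each position.
    return np.maximum(0, Z @ W1 + b1) @ W2 + b2

d = 16
X = np.random.randn(5, d)                            # 5 input positions
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
W1, b1 = np.random.randn(d, 32), np.zeros(32)
W2, b2 = np.random.randn(32, d), np.zeros(d)
out = encoder_layer(X, Wq, Wk, Wv, W1, b1, W2, b2)   # shape (5, 16)
```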
Unidirectional Transformer is also called a Causal Transformer.
WaveNet is a popular CNN.
REFERENCES
Alammar, Jay. “The Illustrated Transformer.” Github.io, 27 June 2018, jalammar.github.io/illustrated-transformer/.
Giacaglia, Giuliano. “Transformers.” Medium, 30 Dec. 2019, towardsdatascience.com/transformers-141e32e69591.
Olah, Christopher. “Understanding LSTM Networks.” Colah’s Blog, Aug. 2015, colah.github.io/posts/2015-08-Understanding-LSTMs.
“WaveNet: A Generative Model for Raw Audio.” Google DeepMind, 8 Sept. 2016, deepmind.google/discover/blog/wavenet-a-generative-model-for-raw-audio/.
