Today, we’re diving into something that's at the core of Transformers, and Transformers themselves are at the core of this AI revolution! Meet self-attention—the secret sauce that lets models like GPT understand the relationships between words, no matter where they appear in a sentence. It’s like giving AI superpowers to focus on the important stuff, making sense of everything from language translation to creative text generation.

Why Do We Need Self-Attention?

If I asked you, “What's the most important task in all of natural language processing (NLP)?”, what would your answer be?

Of course, it’s converting words into numbers, right? After all, computers can’t understand words like we do — they need everything in numeric form to perform any computations. But here’s the catch: converting words to numbers in a meaningful way isn’t as straightforward as it seems.

The Problem with Traditional Word Representations

Initially, the simplest way we represented words as numbers was through techniques like one-hot encoding. Each word was assigned a unique vector of 1s and 0s. While this approach works, it completely ignores relationships between words. For example, "dog" and "cat" would be as unrelated as "dog" and "table" in this representation, despite "dog" and "cat" being semantically closer.
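To see the problem concretely, here is a minimal NumPy sketch (the three-word vocabulary is made up for illustration): every pair of distinct one-hot vectors is orthogonal, so "dog" is exactly as similar to "cat" as it is to "table".

```python
import numpy as np

# Toy vocabulary; in practice this would span tens of thousands of words.
vocab = ["dog", "cat", "table"]

def one_hot(word):
    """Return a one-hot vector for a word in the toy vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

dog, cat, table = one_hot("dog"), one_hot("cat"), one_hot("table")

# Distinct one-hot vectors are orthogonal, so their dot-product
# similarity is 0 regardless of what the words actually mean.
print(np.dot(dog, cat))    # 0.0 for "dog" vs "cat"
print(np.dot(dog, table))  # 0.0 for "dog" vs "table"
```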

To address this, we developed word embeddings like Word2Vec, which represented each word as a vector in a continuous space, capturing semantic relationships like "king" being closer to "queen" than to "apple." But word embeddings have a problem of their own: they capture a static, averaged meaning rather than a dynamic, contextual one. Whether we say "river bank" or "money bank," the word "bank" gets exactly the same vector.
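A quick sketch of why this matters (the embedding table and its 2-D vectors are toy values, not real Word2Vec output): a static lookup returns the same vector for "bank" no matter what surrounds it.

```python
import numpy as np

# Toy static embedding table; real Word2Vec vectors
# have hundreds of dimensions and are learned from data.
embeddings = {
    "river": np.array([0.9, 0.1]),
    "money": np.array([0.1, 0.9]),
    "bank":  np.array([0.5, 0.5]),
}

# A static lookup ignores context entirely:
bank_in_river_bank = embeddings["bank"]  # from "river bank"
bank_in_money_bank = embeddings["bank"]  # from "money bank"
print(np.array_equal(bank_in_river_bank, bank_in_money_bank))  # True
```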

The Need for Context

But here's the thing: words change meaning depending on the context. Think about the word "bank" in these two sentences:

1. "I deposited my paycheck at the bank."
2. "We sat on the bank of the river."

In both sentences, the word "bank" is used, but its meaning is completely different. Traditional word embeddings can’t capture this change in meaning. They assign one fixed vector to "bank," even though the word means different things in different contexts.

How Self-Attention Works from First Principles

To truly capture the context of words in a sentence, we needed a way to represent each word based on its relationship with every other word in the sentence. In simple terms, we needed to weigh how important other words are when interpreting a given word.

Instead of assigning a fixed meaning to each word (as we did with traditional embeddings), we began representing each word as a weighted sum of all the words in the sentence, itself included. The key idea here is that we dynamically adjust these weights based on how relevant each word is to the word we're focusing on.
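Here is a minimal sketch of that idea, assuming for the moment that the weights have already been decided (both the 2-D embeddings and the weights below are hand-picked for illustration):

```python
import numpy as np

# Toy 2-D embeddings for the sentence "the river bank".
embeddings = {
    "the":   np.array([0.1, 0.1]),
    "river": np.array([0.9, 0.2]),
    "bank":  np.array([0.5, 0.5]),
}

# Suppose we already know how relevant each word is to "bank":
weights = {"the": 0.1, "river": 0.6, "bank": 0.3}  # sums to 1

# The new, context-aware embedding for "bank" is a weighted sum
# of every word's embedding in the sentence, "bank" itself included.
new_bank = sum(weights[w] * embeddings[w] for w in embeddings)
print(new_bank)
```

Because "river" carries most of the weight, the new vector for "bank" is pulled toward the river-related region of the space: the context reshapes the meaning.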

This idea of dynamically focusing on different parts of the sequence is exactly why the mechanism is called "attention."

In self-attention, we start with the old static embeddings, the initial vector representations of the words in the sentence. But we don't stop there. The goal of self-attention is to generate new, context-aware embeddings by representing each word as a weighted combination of all the words in the sequence.
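Before formalizing anything, here is a simplified end-to-end sketch: score every word against every other word with raw dot products, turn each row of scores into weights with a softmax, and take weighted sums. This is only an illustrative preview (real self-attention adds learned projections on top of this), and the embeddings are toy values:

```python
import numpy as np

def softmax(scores):
    """Softmax along the last axis, stabilized by subtracting the max."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)

# Rows are toy embeddings for the words of one sentence.
X = np.array([
    [0.1, 0.1],  # "the"
    [0.9, 0.2],  # "river"
    [0.5, 0.5],  # "bank"
])

# 1. Score every word against every other word via dot products.
scores = X @ X.T           # shape (3, 3)
# 2. Normalize each row into weights that sum to 1.
weights = softmax(scores)  # row i = how much word i attends to each word
# 3. Build each new embedding as a weighted sum of all embeddings.
new_X = weights @ X        # context-aware embeddings

print(weights.round(2))
print(new_X.round(2))
```

Dot products reward vectors that point in similar directions, and the softmax turns raw scores into a probability-like distribution over the sentence, which is what makes the weighted sum well behaved.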

So, how do we calculate these weights?