In the previous article, we calculated the similarities between Queries and Keys.
We can use the output of the softmax function to determine how much each input word should contribute when encoding the word “Let’s”.
## Interpreting the Weights
In this case, “Let’s” is much more similar to itself than to “go”.
So after applying softmax:
- “Let’s” gets a weight close to 1 (100%)
- “go” gets a weight close to 0 (0%)
This means:
- “Let’s” contributes almost entirely to its own encoding
- “go” contributes very little
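The weighting above can be sketched with a plain softmax. The similarity scores here are made-up numbers for illustration, not values from the article:

```python
import math

def softmax(scores):
    # Exponentiate each score, then normalize so the weights sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity scores: "Let's" vs. itself, then "Let's" vs. "go".
scores = [4.0, -2.0]
weights = softmax(scores)
print(weights)  # first weight is close to 1, second is close to 0
```

Because softmax exaggerates differences between scores, even a moderate gap in similarity produces weights near 1 and 0.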
## Creating Value Representations
To apply these weights, we create another set of values for each word.
- First, we create two values to represent “Let’s”
- Then, we scale these values by 1 (since its weight is 100%)
- Next, we create two values to represent “go”
- These values are scaled by 0 (since its weight is 0%)
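The scaling step above can be sketched as follows. The Value vectors and weights are assumed numbers for illustration (in a real transformer, the values come from multiplying each word's embedding by a learned Value weight matrix):

```python
# Hypothetical 2-dimensional Value vectors, one per word.
value_lets = [1.5, -0.3]   # the two values representing "Let's"
value_go   = [0.8, 2.1]    # the two values representing "go"

# Attention weights from the softmax step (assumed: roughly 1 and 0).
w_lets, w_go = 0.998, 0.002

# Scale each Value vector by its word's attention weight.
scaled_lets = [w_lets * v for v in value_lets]  # barely changed
scaled_go   = [w_go * v for v in value_go]      # shrunk toward zero
```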
## Combining the Values
Finally, we add the scaled values together:
The result is a new set of values that represent the word “Let’s”, now enriched by its relationship with all input words.
These final values are called the self-attention values for “Let’s”.
They combine information from all words in the sentence, weighted by how relevant each word is to “Let’s”.
We can now repeat the same process for the word “go”, which we will explore in the next article.
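Putting the whole pipeline together, here is a minimal sketch of computing the self-attention values for one word. All scores and Value vectors are assumed toy numbers, not values from the article:

```python
import math

def softmax(scores):
    # Convert raw similarity scores into weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical similarity scores and 2-dimensional Value vectors.
scores = [4.0, -2.0]                  # "Let's" vs. itself, "Let's" vs. "go"
values = [[1.5, -0.3], [0.8, 2.1]]    # Value vectors for "Let's" and "go"

weights = softmax(scores)

# Weighted sum: scale each word's Value vector by its attention weight,
# then add the scaled vectors element-wise.
attention_lets = [
    sum(w * v[dim] for w, v in zip(weights, values))
    for dim in range(2)
]
print(attention_lets)  # dominated by the Value vector for "Let's"
```

Because the weight for “go” is near zero, the result is almost identical to the Value vector for “Let's”, which is exactly what the text describes.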
Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.
Just run:

```
ipm install repo-name
```

… and you’re done!
















