Understanding Transformers Part 9: Stacking Self-Attention Layers

In the previous article, we explored how the weights are shared in self-attention.

Now we will see why we have these self-attention values instead of the initial positional encoding values.

Using Self-Attention Values

We now use the self-attention values instead of the original positional encoded values.

This is because the self-attention values for each word include information from all the other words in the sentence. This helps give each word context.

It also helps establish how each word in the input is related to the others.

If we think of this unit, along with its weights for calculating queries, keys, and values, as a self-attention cell, then we can extend this idea further.

To correctly capture relationships in more complex sentences and paragraphs, we can stack multiple self-attention cells, each with its own set of weights. These layers are applied to the position-encoded values of each word, allowing the model to learn different types of relationships.

Going back to our example, there is one more step required to fully encode the input. We will explore that in the next article.

Looking for an easier way to install tools, libraries, or entire repositories?
Try Installerpedia: a community-driven, structured installation platform that lets you install almost anything with minimal hassle and clear, reliable guidance.

Just run: