It’s the attention, stupid!

For more than 50 years, the Muppets from Sesame Street taught children their language. Now it’s time for Elmo, Bert and their friends to teach their wisdom to adults. As Sesame Street ended, they had to find a new job. With a CV boasting more than five decades of experience that many job starters would drool over, the obvious step was to stay in the education sector. ELMo [1], BERT [2], BigBird [3], ERNIE [4], and KERMIT [5] have all found their place in Natural Language Processing (NLP). Unfortunately, Cookie Monster had a different passion.

These are not just the names of Muppets, but also the names of a new type of deep learning model called the transformer [6]. Transformers have taken the scene by storm: since 2017, they have been the go-to architecture for language models, and researchers like to have a little fun by naming their models after Muppets. Transformers are interesting because they use several techniques that soften the issues previous language models suffered from. In this blog, we’ll go over how we got to this point, how transformers work, and why they are hyped up so much.


There are plenty of resources for a full history of how we got to this point. But because it is important to know why the steps towards transformers were taken, I will give a little more context. Let’s go through a sentence together:

“Big Bird is a character from Sesame Street. He has yellow fur and is quite big”

What does “He” in the second sentence refer to? For us humans it’s obviously Big Bird, but it’s more difficult for a computer. Recurrent neural networks (RNNs) have long been more attractive for NLP than convolutional neural networks (CNNs) because of their sequential nature. Their efficiency increased so much that Count von Count lost track of counting. Recurrent neural networks propagate information in a sequential fashion, but also have loops in them to retain information during a computation. By retaining the information from the previous sentence, we can make the link between “Big Bird” and “He”. To show how a recurrent neural network handles text, we can go through an autocompletion task. Let’s have another example:

“Big bird is a character from Sesame ______”.

We want to predict the word “Street” here. We start with the first word in the sentence, feed it into the RNN, and compute a value together with the hidden state. The RNN then propagates this value to the next position in the sequence and uses it together with the word “bird”. It does this until it arrives at “Sesame”, at which point we have information from all the preceding words. In the above figure, the coloured blocks represent that information. Since the previous information includes “Big bird” and “Sesame”, we can presume that the sentence is about the show, and thus that “Sesame” will be followed by “Street”. RNNs are great for such tasks where dependencies are important, but there is a catch: RNNs have a hard time dealing with longer sentences. Let’s change the first example sentence to:

“Big Bird is a character from Sesame Street. Sesame Street is a TV show for children. He has yellow fur and is quite big”

If we now want to find out that “He” refers to Big Bird, we must go all the way back to the first two words. As we go back further into a sequence, the earlier words carry less weight, while more recent words carry more. You can see this in the previous figure: the yellow block that represents the word “Big” is hard to see in the layer for “Sesame”. This phenomenon becomes even worse for longer sentences. When the weights are updated through backpropagation, the gradients that carry the information from the beginning of the sequence shrink towards zero, so that information is lost. This is called the vanishing gradient problem. The greatest advantage of RNNs thus also becomes their biggest enemy: it is hard for RNNs to retain information, even more so over longer sentences. They mostly remember the near past, and much less the far past.
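To make this concrete, here is a minimal sketch of both ideas at once: a vanilla RNN reading a sentence word by word, and the gradient that flows back through those steps shrinking. All dimensions and weights are made-up toy values, not a real trained model.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                        # toy embedding size
words = ["Big", "bird", "is", "a", "character", "from", "Sesame"]
embeddings = {w: rng.normal(size=d) for w in words}  # fake word vectors

# Vanilla RNN: one hidden state h is carried through the whole sequence.
W_xh = rng.normal(scale=0.1, size=(d, d))    # input -> hidden
W_hh = rng.normal(scale=0.1, size=(d, d))    # hidden -> hidden (the loop)

h = np.zeros(d)
grad = np.eye(d)                             # gradient flowing back through time
norms = []
for w in words:
    h = np.tanh(embeddings[w] @ W_xh + h @ W_hh)
    # Backpropagation multiplies by one Jacobian per step; with tanh
    # saturation and small weights each factor is < 1, so the product shrinks.
    jac = (1 - h ** 2)[:, None] * W_hh
    grad = jac @ grad
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the contribution of "Big" has all but vanished
```

Running this, the gradient norm at “Sesame” is orders of magnitude smaller than at “Big”, which is exactly why the early words get forgotten.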

To mitigate this problem, long short-term memory (LSTM) networks came bursting onto the scene. LSTMs are a type of RNN that solve the vanishing gradient problem, to an extent. They forget what they deem unimportant and remember what they deem important, so the network can recall earlier words more vividly. While LSTMs can be great and are still viable for some tasks, they do not fully solve the vanishing gradient problem. Another problem is that parallel computation is not possible: with RNNs, you must process the input sequentially. But I just told you that the advantage of RNNs is their sequential nature, so why is that bad? If you want to calculate the probability of a word in a sentence, you have to process every word before it first, because each value depends on the previous one. RNNs and LSTMs could not live with their own failure, and where did that bring us? Back to convolutional networks.
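The forgetting-and-remembering idea can be sketched in a few lines. This is one LSTM step with random, untrained weights, just to show the gate structure; the variable names and sizes are my own toy choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(2)
d = 8
# Four weight matrices, one per gate (random here; learned in practice).
Wf, Wi, Wc, Wo = (rng.normal(scale=0.3, size=(2 * d, d)) for _ in range(4))

def lstm_step(x, h, c):
    z = np.concatenate([x, h])            # current input + previous hidden state
    f = sigmoid(z @ Wf)                   # forget gate: 0 = drop, 1 = keep
    i = sigmoid(z @ Wi)                   # input gate: how much new info to write
    c_new = f * c + i * np.tanh(z @ Wc)   # cell state: forget old, add new
    o = sigmoid(z @ Wo)                   # output gate
    h_new = o * np.tanh(c_new)
    return h_new, c_new

h = c = np.zeros(d)
for _ in range(5):                        # a few toy steps
    h, c = lstm_step(rng.normal(size=d), h, c)
print(h.shape, c.shape)
```

The cell state `c` is the “long-term memory”: because it is updated additively rather than squashed through the recurrence at every step, gradients survive longer than in a plain RNN.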

You just want attention

In comes the transformer, putting us out of our misery by offering a solution to these issues. Transformers combine the parallelization of CNNs with the forgetting-and-remembering idea of LSTMs. A transformer consists of an encoder-decoder architecture. The encoder receives an input string and creates word embeddings from it; word embeddings are numeric vector representations of words. The decoder generates an output string. When opening the bonnet of the encoder, we can see the engine of the transformer: multi-headed attention. Before we get to what multi-headed attention is and visualise the transformer architecture, let’s first take a step back and look at attention in general.

One of the crucial ingredients that makes transformers so good, and one Swedish Chef would approve of, is attention. The main takeaway from the seminal paper “Attention Is All You Need” by Vaswani et al. [6], which introduced the transformer model, is that we don’t need to focus on everything. This sounds similar to the idea behind LSTMs. A transformer is like Sherlock Holmes: it knows where it should look. This solves the problem with longer sentences, because losing information from the far past is no longer an issue. Attention is also the main building block of a transformer.

We should take our big furry friend again as an example:

“BigBird is friends with Snuffleupagus”

What the attention mechanism can do for us here is give more context to “BigBird”. On its own, Big Bird could just mean a bird that is big. But with a little more context, the embedding can change so that it reflects that we are talking about the Muppet from Sesame Street, since Snuffleupagus is also mentioned in the sentence. So how is this done? Through the magic of self-attention. Remember that the first step in a transformer is to convert strings of words into vectors (i.e. word embeddings). These vectors carry all kinds of information, and together they span a vector space. While the individual numbers in these vectors have no direct meaning, we do know that vectors that cluster together in that space tend to have similar meanings:

Words close together in the vector space have something in common | credits: Google
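A common way to measure this closeness is cosine similarity. Here is a toy sketch with invented three-dimensional vectors (real embeddings have hundreds of dimensions and learned values):

```python
import numpy as np

# Hand-made toy "embeddings" (values invented for illustration):
# the two Sesame Street characters point in a similar direction,
# the unrelated word does not.
emb = {
    "BigBird":       np.array([0.9, 0.8, 0.1]),
    "Snuffleupagus": np.array([0.8, 0.9, 0.2]),
    "spreadsheet":   np.array([0.1, 0.0, 0.95]),
}

def cosine(a, b):
    # 1.0 = same direction, 0.0 = unrelated
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(emb["BigBird"], emb["Snuffleupagus"]))  # close to 1
print(cosine(emb["BigBird"], emb["spreadsheet"]))    # much smaller
```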

So we can expect that Big Bird and Snuffleupagus are close to each other in the vector space. But now we want to calculate new vectors for these words by including context. Although Big Bird consists of two words, for the sake of simplicity I will combine Big Bird as one vector:

To recalculate the vector for BigBird, we take the dot product of its vector with each of the other vectors in the sentence; we can call the outcomes dp, short for dot product. These dp numbers are then normalized with a softmax, multiplied with the original vectors, and summed up to form the new vector VBigBird. The same happens for the other vectors in the sentence. Notice that with this self-attention mechanism, the sequential nature of previous models is gone: how close words are to each other in the sentence no longer matters.
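The recipe above (dot products, softmax, weighted sum) fits in a few lines. This sketch uses random toy vectors in place of real word embeddings:

```python
import numpy as np

def self_attention(X):
    """Plain self-attention without learned weights, as described above."""
    dp = X @ X.T                              # dot product of every pair of vectors
    # Softmax over each row, so each word's weights sum to 1.
    w = np.exp(dp - dp.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)
    return w @ X                              # weighted sum of the original vectors

rng = np.random.default_rng(3)
X = rng.normal(size=(4, 8))                   # 4 toy word vectors ("BigBird", ...)
new_X = self_attention(X)
print(new_X.shape)  # one context-aware vector per word
```

Every word is compared with every other word in a single matrix multiplication, which is why this step parallelizes so well.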

In RNNs you have weights in the hidden state that contribute to how important a word can be. As it currently stands in our journey with transformers, there are no weights we can play around with. Why would we even want parameters? If we add learnable weights to the vectors, we can hope that the meaning comes out better and that patterns are easier to find. Looking at picture 3, there are three points where vectors play a role, so we add three weight matrices, called queries, keys, and values. The query is the word vector that you want more context for (vector 1). The keys are all the word vectors in the sentence (vector 2). The values can take different inputs depending on the task; for this task the values are the same as the keys. Now that we are adding parameters, it’s starting to look like a neural network, and we can see the similarities from before. This also means that we can use backpropagation to let the attention mechanism learn.
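Adding the three weight matrices changes the sketch only slightly. Here the weights are random stand-ins for what backpropagation would learn, and I use the scaled dot product from Vaswani et al. [6]:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 4, 8
X = rng.normal(size=(n, d))                   # toy word vectors

# Three learnable weight matrices turn the same input into
# queries, keys, and values (random here; learned in practice).
Wq, Wk, Wv = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                 # scaled dot-product attention
w = np.exp(scores - scores.max(axis=1, keepdims=True))
w = w / w.sum(axis=1, keepdims=True)          # softmax: rows sum to 1
out = w @ V                                   # context-aware output vectors

print(out.shape)
```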

Now that we have an understanding of self-attention, we can move on to the final piece of the puzzle: multi-headed attention. In the figure below, you can see the original illustration of multi-head attention from Vaswani et al. [6]:

But first, another example:

“Bert gifted Ernie a jumper for his birthday”

We want to calculate the attention for the word “gifted”. Which words should receive some attention? Bert does the gifting, Ernie is the receiver, and the jumper is the gift, so those three words should have high importance, with perhaps “birthday”, as the reason for the gift, to a lesser extent. This means we would have to split up the attention over several relations. Why split it up when there is no sequential structure in transformers? Behold! The power of parallelization! Instead of splitting up a single attention mechanism, we just add more self-attention layers in parallel. These parallel attention layers, each with its own linear projections, are called ‘heads’, which is where the name multi-headed attention comes from. This multi-headed attention mechanism is the main engine of a transformer. With the engine good to go, we can close the bonnet and look at the encoder structure:

This figure comes from the paper by Vaswani et al. [6].
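The multi-head mechanism can be sketched by projecting once and then splitting the result into independent heads. As before, the weights are random placeholders and the sizes are toy values:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(5)
n, d, heads = 7, 16, 4                        # 7 words, 4 heads of size 16/4 = 4
X = rng.normal(size=(n, d))
Wq, Wk, Wv, Wo = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4))

def split(M):
    # (words, d) -> (heads, words, d/heads): each head gets its own slice.
    return M.reshape(n, heads, d // heads).transpose(1, 0, 2)

Q, K, V = split(X @ Wq), split(X @ Wk), split(X @ Wv)
w = softmax(Q @ K.transpose(0, 2, 1) / np.sqrt(d // heads))
per_head = w @ V                              # every head attends on its own
# Concatenate the heads back together and mix them with an output matrix.
out = per_head.transpose(1, 0, 2).reshape(n, d) @ Wo
print(out.shape)
```

Each head can specialise in a different relation (giver, receiver, gift), and since the heads share no computation they all run in parallel.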

We can now see how all these pieces interact. In the original picture, the decoder is also shown; the decoder is mostly used for tasks like translating sentences, so to keep it simple I’m only showing the encoder. One thing that is new in this picture is the positional encoding. I mentioned that an advantage of transformers is that the order of the words doesn’t matter for the computation, which makes them able to parallelize. But sometimes the position a word is in can be important. For this reason, we add a positional encoding to make sure the transformer doesn’t forget the position each word has in the sequence.
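The positional encoding used in the original paper [6] is a fixed pattern of sines and cosines that is simply added to the word embeddings; the toy sizes below are my own:

```python
import numpy as np

def positional_encoding(n_positions, d):
    """Sinusoidal positional encoding from Vaswani et al. [6]; d must be even."""
    pos = np.arange(n_positions)[:, None]     # positions 0..n-1, as a column
    i = np.arange(d // 2)[None, :]            # dimension pair index, as a row
    angles = pos / np.power(10000, 2 * i / d)
    pe = np.zeros((n_positions, d))
    pe[:, 0::2] = np.sin(angles)              # even dimensions get sine
    pe[:, 1::2] = np.cos(angles)              # odd dimensions get cosine
    return pe

pe = positional_encoding(10, 8)
# This matrix is added element-wise to the embeddings before the encoder.
print(pe.shape)
```

Because each position gets a unique wave pattern, the model can recover word order even though attention itself treats the input as an unordered set.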

Highway to Sesame Street

Before you set your destination to Sesame Street and plunge yourself into transformers, there are some considerations to think about before you take this exit. Transformers need a lot of memory, time, and power to train.

Furthermore, there are bias issues with transformers [9]. While gender and racial bias in older language models is a known issue (e.g. word2vec [7, 8]), this bias is amplified even more in transformers. If the data you’re working on is sensitive and likely to suffer from such problems, think very carefully about whether you want to use transformers. While there are techniques that try to remove bias, they merely hide it rather than remove it.

Nonetheless, the hype around transformers is justified, as they make many NLP tasks easier and more accurate to work on. Bert might nag, but the station at Sesame Street will only become busier.


[1] M. Peters et al., “Deep Contextualized Word Representations”, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018. Available: 10.18653/v1/n18-1202 [Accessed 21 April 2021].

[2] J. Devlin, M. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding”, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019. Available: 10.18653/v1/n19-1423 [Accessed 21 April 2021].

[3] M. Zaheer et al., “Big Bird: Transformers for Longer Sequences”, Advances in Neural Information Processing Systems, vol. 33, pp. 17283–17297, 2020. [Accessed 21 April 2021].

[4] Z. Zhang, X. Han, Z. Liu, X. Jiang, M. Sun and Q. Liu, “ERNIE: Enhanced Language Representation with Informative Entities”, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. Available: 10.18653/v1/p19-1139 [Accessed 21 April 2021].

[5] W. Chan, N. Kitaev, K. Guu, M. Stern and J. Uszkoreit, “KERMIT: Generative Insertion-Based Modeling for Sequences”, arXiv preprint arXiv:1906.01604, 2019. [Accessed 21 April 2021].

[6] A. Vaswani et al., “Attention is All You Need”, in Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 6000–6010.

[7] T. Mikolov, K. Chen, G. Corrado and J. Dean, “Efficient Estimation of Word Representations in Vector Space”, Proceedings of the Workshop at ICLR, 2013.

[8] T. Mikolov, I. Sutskever, K. Chen, G. Corrado and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality”, Proceedings of NIPS, 2013.

[9] K. Kurita, N. Vyas, A. Pareek, A. Black and Y. Tsvetkov, “Measuring Bias in Contextualized Word Representations”, Proceedings of the First Workshop on Gender Bias in Natural Language Processing, 2019. Available: 10.18653/v1/w19-3823 [Accessed 26 April 2021].