How the Attention Mechanism Was Introduced, and the What and Why of Attention

Jai Kushwaha
4 min read · Sep 28, 2020

Introduction to seq2seq models and their applications

Sequence to sequence (seq2seq) models were first introduced by Google in 2014. So what is a seq2seq model? A seq2seq model maps an input sequence of text to an output sequence of text, where the lengths of the input and output may differ. Variants of recurrent neural networks, such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, are most commonly used as building blocks, since they mitigate the vanishing gradient problem.

The example shown in the image is of language translation from French to English.

Another example is English-to-Hindi translation, which is exactly what Google Translate does.

Sequence to Sequence Learning with Neural Networks was introduced by Ilya Sutskever (ilyasu@google.com), Oriol Vinyals (vinyals@google.com), and Quoc V. Le (qvl@google.com) at Google.

Paper Reference:

Applications of seq2seq models

· Speech Recognition

· Machine Translation

· Named entity/subject extraction

· Relation Classification

· Path Query Answering

· Speech Generation

· Chatbots

· Text Summarization

· Product Sales Forecasting

A Seq2Seq model consists of an encoder and a decoder.

So why does the seq2seq model fail?

The details of the architecture and how it functions have already been explained by my colleagues, so let us now see where this model falls short.

As we saw, the encoder takes the input and converts it into a fixed-size vector, and the decoder then makes predictions and produces the output sequence. This works fine for short sequences, but it fails for long ones, because it becomes difficult for the encoder to memorize the entire sequence in a fixed-size vector and to compress all of the contextual information from the sequence into it. As a result, model performance starts to degrade as the sequence length increases.
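The bottleneck described above can be seen in a minimal sketch. The toy RNN encoder below (with randomly initialized, untrained weights, purely for illustration) returns only its final hidden state, so a 10-token sentence and a 1,000-token document are both squeezed into the same fixed-size vector:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(inputs, hidden_size=4):
    """Toy RNN encoder: compresses a whole sequence into ONE fixed-size vector.

    `inputs` is a list of token embedding vectors; only the final hidden
    state is returned, so long sequences must squeeze all context into it.
    """
    embed_dim = inputs[0].shape[0]
    W = rng.normal(size=(hidden_size, hidden_size)) * 0.1  # state-to-state weights
    U = rng.normal(size=(hidden_size, embed_dim)) * 0.1    # input-to-state weights
    h = np.zeros(hidden_size)
    for x in inputs:
        h = np.tanh(W @ h + U @ x)  # new state mixes old state and current input
    return h                        # the single "context vector"

sentence = [rng.normal(size=3) for _ in range(10)]  # 10 token embeddings
context = encode(sentence)
print(context.shape)  # (4,) -- the same size no matter how long the input is
```

This is why performance degrades on long inputs: every intermediate state is discarded, and the decoder sees only this one vector.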

How can we overcome the problem of long sentences and improve the model's performance?

Here comes the solution: the Attention Mechanism

As the word 'attention' suggests, importance is given to specific parts of the context so that performance improves and the output starts to make sense. In simple terms, to predict a word we give importance to specific parts of the sequence instead of the entire sequence. Basically, with attention we don't throw away the intermediate encoder states; we use all of them to generate a context vector, which the decoder then uses to produce the output.
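The idea above can be sketched in a few lines. This is a simplified illustration, not the exact formulation from the paper: scores here are plain dot products between the decoder state and each encoder state (Bahdanau attention uses a small feed-forward network instead), followed by a softmax and a weighted sum over all encoder states:

```python
import numpy as np

def attention_context(decoder_state, encoder_states):
    """Compute a context vector from ALL encoder states, not just the last one."""
    scores = encoder_states @ decoder_state   # one alignment score per source position
    weights = np.exp(scores - scores.max())   # stable softmax...
    weights /= weights.sum()                  # ...-> attention weights summing to 1
    context = weights @ encoder_states        # weighted sum of all encoder states
    return context, weights

rng = np.random.default_rng(1)
enc = rng.normal(size=(6, 4))  # 6 source positions, hidden size 4
dec = rng.normal(size=4)       # current decoder hidden state
ctx, w = attention_context(dec, enc)
print(w.round(2))  # the weights show which source positions the decoder "attends" to
```

The decoder recomputes this context vector at every output step, so each generated word can focus on a different part of the source sequence.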

The attention mechanism has changed the way we work with deep learning algorithms. Fields like Natural Language Processing (NLP) and even Computer Vision have been revolutionized by it.

For example: suppose we have to read an article, or a whole book, and draw inferences from it. Like the human brain, attention is given to the specific words that the mind interprets and grasps; the rest is just blurry information.

Text Attention

Image Attention

“So, whenever the proposed model generates a sentence, it searches for a set of positions in the encoder hidden states where the most relevant information is available. This idea is called ‘Attention’.”

How it Works

Working of Attention Mechanism
Attention Layer

Attention Unit

Soft Attention

Hard Attention
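The difference between the two variants named above can be sketched briefly. Soft attention takes a differentiable weighted average over all encoder states, while hard attention samples a single state according to the attention distribution (which is non-differentiable, so it is typically trained with sampling-based methods). The weights below are hand-picked for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
states = rng.normal(size=(5, 3))               # 5 encoder states, hidden size 3
weights = np.array([0.1, 0.6, 0.1, 0.1, 0.1])  # an attention distribution

# Soft attention: differentiable weighted average over ALL states.
soft_context = weights @ states

# Hard attention: sample ONE state according to the weights
# (non-differentiable, so gradients cannot flow through the choice).
idx = rng.choice(len(states), p=weights)
hard_context = states[idx]

print(soft_context.shape, hard_context.shape)  # both (3,)
```

Soft attention is the variant used in most translation models, precisely because it trains end-to-end with ordinary backpropagation.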

Architecture

Drawbacks:

The one drawback of attention is that it is time-consuming. To overcome this problem, Google introduced the Transformer model.

Applications


I am a Senior Consultant with 11+ years of experience in Analytics and Model development, with domain expertise in BFSI.