Transformer Models

A transformer model is a neural network that learns context, and therefore meaning, by tracking relationships in sequential data, such as the words in this sentence.

Transformers apply an evolving set of mathematical techniques, known as attention or self-attention, to detect the subtle ways even distant data elements in a series influence and depend on one another.

  • Transformers, first described in a 2017 Google paper, are among the newest and most powerful classes of models invented to date. They are driving a wave of machine learning advances that some have called transformer AI.

In an August 2021 study, Stanford researchers referred to transformers as “foundation models” because they saw them triggering a paradigm change in AI. They noted that “the sheer magnitude and breadth of foundation models in recent years have expanded our conception of what is conceivable.”

Transformers Model Kits

The transformer’s architecture follows an encoder-decoder design; however, it does not rely on recurrence or convolution to generate its output.

The encoder, on the left side of the transformer architecture, maps an input sequence to a sequence of continuous representations, which is then fed to the decoder.

On the right side of the architecture, the decoder receives the encoder’s output together with the decoder’s own output from the previous time step and generates an output sequence.

The Encoder

The encoder is built from a stack of six identical layers, with each layer consisting of two sublayers.

  • The first sublayer implements multi-head self-attention. The multi-head mechanism runs several attention heads in parallel, each operating on a different linear projection of the queries, keys, and values; the heads’ outputs are then concatenated and projected to build the final result.
  • The second sublayer is a position-wise feed-forward network with two linear transformations separated by a Rectified Linear Unit (ReLU) activation.
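
The multi-head mechanism above can be sketched in a few lines of NumPy. This is a minimal illustration, not a full implementation: the weight matrices are assumed inputs, and biases and dropout are omitted.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)
    # Numerically stable softmax over the last axis
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

def multi_head_attention(X, W_q, W_k, W_v, W_o, n_heads):
    # Project X into per-head queries, keys, and values, attend in
    # parallel, then concatenate the heads and apply a final projection.
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    Q = (X @ W_q).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    heads = scaled_dot_product_attention(Q, K, V)   # (n_heads, seq_len, d_head)
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o
```

Because the heads are computed in parallel over a shared projection, the cost is comparable to a single full-dimension attention pass, while each head can specialize in a different kind of relationship.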

Each of the six layers of the transformer encoder applies the same linear transformations to every word in the input sequence, but each layer learns its own weight and bias parameters.

Additionally, each of these two sublayers is wrapped in a residual connection and followed by a layer-normalization step.
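
This “add and normalize” pattern is simple to express directly. The sketch below assumes a learnable-parameter-free layer norm for brevity; in practice each normalization layer also has a learned gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each position's feature vector to zero mean, unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, sublayer):
    # Residual connection: the sublayer's input is added back to its
    # output, and the sum is layer-normalized.
    return layer_norm(x + sublayer(x))
```

The residual path lets gradients flow directly through the stack, which is part of why transformers train well even with many layers.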

It is important to bear in mind that because the transformer does not employ recurrence, its design cannot by itself capture the relative positions of the words in the sequence. This information must be injected into the input embeddings by adding positional encodings.

Sine and cosine functions of varying frequencies are used to construct positional-encoding vectors of the same dimension as the input embeddings. These vectors are then simply added to the input embeddings to inject positional information.
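
The sinusoidal scheme can be generated as follows, with sines on the even dimensions and cosines on the odd ones (assuming an even `d_model`):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = np.arange(seq_len)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

# Injecting position information is then a plain addition:
# embeddings = token_embeddings + positional_encoding(seq_len, d_model)
```

Because each dimension oscillates at a different frequency, every position gets a distinct, bounded vector, and nearby positions get similar ones.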

The Decoder

The decoder and encoder have some commonalities.

The decoder likewise consists of a stack of six identical layers, each built from three sublayers.

  • The first sublayer takes the previous output of the decoder stack, augments it with positional information, and applies masked multi-head self-attention over it. While the encoder attends to all words in the input sequence, the decoder is modified to attend only to the words preceding the current position. Consequently, the prediction for a word at a given position can depend only on the known outputs for the words that come before it. In the multi-head attention mechanism, this is achieved by masking out the values produced by the scaled matrix multiplication of Q and K that correspond to future positions.
  • Similar to the first sublayer of the encoder, the second sublayer implements a multi-head attention mechanism. Here, however, it takes its queries from the preceding decoder sublayer and its keys and values from the encoder’s output. This allows the decoder to attend to every word in the input sequence.
  • The third sublayer implements a fully connected feed-forward network, comparable to the one in the encoder’s second sublayer. As on the encoder side, each of the three decoder sublayers is wrapped in a residual connection and followed by a normalization layer.
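
The causal masking described in the first sublayer can be made concrete with a small NumPy sketch: future positions are set to negative infinity before the softmax, so their attention weights come out exactly zero.

```python
import numpy as np

def masked_attention(Q, K, V):
    # Decoder self-attention: each position may attend only to itself
    # and to earlier positions.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    seq_len = scores.shape[0]
    # True above the diagonal, i.e. at every "future" position.
    mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
    scores = np.where(mask, -np.inf, scores)
    # Softmax: exp(-inf) = 0, so masked positions get zero weight.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

The returned weight matrix is lower triangular: row *i* distributes all of its attention over positions 0 through *i*, which is what lets the decoder be trained in parallel while still predicting left to right.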

Just as on the encoder side, the decoder’s input embeddings receive positional encodings.



  • A deep learning transformer, such as a transformer-based language model, is a powerful tool that can be applied to a range of natural language processing tasks.
  • HuggingFace’s transformers machine learning library makes it simple for developers to apply state-of-the-art transformers to common tasks like question answering, sentiment analysis, and text summarization.

For your own NLP applications, you can also fine-tune a pre-trained transformer model.
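
As a minimal sketch, the library’s `pipeline` API wraps a pretrained model behind a one-line interface; note that the default model is chosen by the library and downloaded on first use, so exact scores will vary.

```python
from transformers import pipeline

# Sentiment analysis with a default pretrained transformer.
classifier = pipeline("sentiment-analysis")
result = classifier("Transformers make state-of-the-art NLP remarkably accessible.")
print(result)  # e.g. a list like [{'label': ..., 'score': ...}]
```

The same `pipeline` entry point covers the other tasks mentioned above, such as `"question-answering"` and `"summarization"`.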