The Transformer Is Born

“A structure that doesn’t walk through time — but sees all at once.”

1. A Quiet Revolution: The Year is 2017

In the last part, we looked closely at self-attention as a strange and elegant mechanism. It allowed a sentence to look at itself, not as a sequence to be marched through, but as a whole, with weighted glimpses from every word to every other.
It broke the one-directional flow of time that had defined neural networks for decades.
And it did so with matrices.

We ended on a threshold: if attention can do this, then what else could it replace?

The answer came faster than most expected.

In 2017, a paper appeared that didn’t just refine the old paradigm. It erased it.
Attention Is All You Need was its title. Almost unassuming. As if it were just another building block in the library of machine learning progress. But it wasn’t.

At the time, recurrent networks still shaped the landscape of language processing. Architectures like LSTMs and GRUs had become familiar territory. Their mechanics were well-studied, their limitations documented. A few researchers had been experimenting with attention as an auxiliary component, often bolted onto existing architectures. None of them had dared to ask: what if attention wasn’t an addition, but the foundation?

Well, the paper did exactly that. It removed recurrence. Removed convolutions. And instead proposed a system made entirely of attention layers and simple feedforward steps. No bells. No tricks. Just a clean, repeating structure.

Eight authors, from Google Brain and Google Research. Published at NeurIPS, back when it was still called NIPS: a respected venue, but not yet the megaphone it is today.

What they offered was not a flashy product. It was a rethinking. A model that could handle sequences without stepping through them. A mechanism where position was added deliberately and not built into the structure. And perhaps most surprisingly: a design so regular, so modular, that it looked more like a circuit diagram than a deep learning architecture.

It was the beginning of a quiet shift. The field wouldn’t fully grasp it for another year.
But something fundamental had changed.

Perhaps the most remarkable part: it didn’t feel revolutionary.
Not at first. Only later, when the echoes arrived.

2. What Changed? The Transformer at a Glance

The Transformer doesn’t refine the idea of sequence modeling. It redefines it.

There is no recurrence. No state passed from one step to the next.
Instead, the entire input sequence is processed at once.
Parallel, not step by step. Not a pipe, but a field.

The input still arrives as a sequence of tokens. But the model no longer walks through them. It looks at all of them together. Self-attention enables each token to see every other, weighted by relevance. And this becomes the organizing principle.
Everything else, from embedding to normalization to feedforward transformation, supports it.

But self-attention alone doesn’t preserve order. A sentence without position is a bag of words. The Transformer handles this with positional encodings: sinusoidal patterns added to each token embedding. A structural hint. Not learned, just injected. Deliberate, not emergent.

The original Transformer model is split into two main parts: the encoder and the decoder. Both are built from the same basic building block: a stack of layers, each combining self-attention and feedforward steps.

Here is the famous illustration from the original paper:

Looks complicated. I know. But let’s tackle this illustration step by step.

The encoder receives the full input sequence and processes it in parallel.
Its job is to produce a rich, contextual representation of each token: not just as it appears, but as it relates to the others.

The decoder, in contrast, generates output one token at a time.
During training, it receives the ground-truth output sequence, shifted by one position, and learns to predict the next token.
During inference, it builds this sequence step by step. A point we’ll return to.

Inside each decoder layer, there are two attention mechanisms:

  • A masked self-attention layer, where tokens can only attend to earlier positions (to avoid peeking ahead)
  • A cross-attention layer, where each position attends to the encoder’s full output

This cross-attention is what allows the model to “translate” from source to target.
It does not memorize rules; it aligns representations across two languages, two domains, two contexts.

The structure is deceptively simple. Each block follows the same pattern: attention, feedforward, residual connections, normalization. But when stacked, six layers, twelve, or more, complexity emerges.

This wasn’t just a new model. It was a clean design. One that could be scaled, studied, and eventually repurposed.

A single call of the model predicts only the next token. There is no hidden engine producing full sentences in one sweep. To generate longer outputs, the model is called again and again, each time with one more token in place. But that loop is not part of the Transformer itself; it is built around it. Understanding this distinction between model and process will become essential later.

3. Inside the Encoder Block

Before anything can be translated, summarized, or answered, the model must first understand. Not in the human sense but in the structural one.
And that begins with the encoder.

The encoder takes the input sequence (a sentence, a paragraph, a list of tokens) and transforms it into a representation that the model can work with.
This is not a memory. It’s a contextual map: a matrix of vectors, where each position knows where it is and what surrounds it.

Let’s take a quiet example:

Input:
"Gothic 1 was a cool game around..."

This string is first tokenized, split into subwords or wordpieces depending on the tokenizer. Let’s say it becomes:

["Gothic", "1", "was", "a", "cool", "game", "around", "..."]

Each token is then passed through two components:

1. A Learned Embedding

Each token ID is mapped to a dense vector using a trainable lookup table.
This embedding matrix is part of the model’s parameters, so it is not fixed.
It evolves during training, just like attention weights and feedforward layers.

So "Gothic" might become a 512-dimensional vector, not because someone defined it, but because the model has learned, over time, how this word tends to behave in context.
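A minimal sketch of that lookup, with toy sizes and random values standing in for learned weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d_model = 10_000, 512                   # illustrative sizes
embedding = rng.normal(size=(vocab_size, d_model))  # the trainable lookup table

token_ids = np.array([17, 42, 7])                   # e.g. "Gothic", "1", "was"
vectors = embedding[token_ids]                      # one row (vector) per token
print(vectors.shape)                                # (3, 512)
```

During training, gradients flow back into exactly the rows that were looked up, so the vectors for frequent tokens are updated most often.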

2. A Positional Encoding

But the model must also know where each word appears.

Unlike RNNs, the Transformer has no sense of sequence built into its structure.
Every token can attend to every other, which is powerful, but also means that the notion of “order” must be added explicitly.

In the original Transformer, this is done with sinusoidal positional encodings:
mathematical patterns added to each embedding vector. They are not learned, but carefully designed so that relative positions are distinguishable.

Modern architectures often use learned positional embeddings instead: another trainable matrix, just like the token embeddings. But the idea is the same:
give the model a sense of place.
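A sketch of the original sinusoidal scheme, following the formula from the paper (even dimensions carry sines, odd dimensions cosines):

```python
import numpy as np

def sinusoidal_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model // 2)
    angles = pos / np.power(10_000.0, 2 * i / d_model)
    pe = np.empty((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = sinusoidal_encoding(8, 512)   # one row per token in our 8-token example
# the encoder input is then simply: X = token_embeddings + pe
```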

The result is a matrix:

\[
\mathbf{X} \in \mathbb{R}^{8 \times d} \quad \leftarrow \text{8 tokens, each embedded and position-aware}
\]

This becomes the input to the encoder stack.

The encoder itself is a repeating block, stacked multiple times.
Each block does the same three things:

  1. Self-Attention
    Every token looks at every other token (including itself) and reweighs what matters.
    "game" might attend more to "cool" than "Gothic", because the local context is stronger.
    These attention scores are turned into a new vector for each position: a kind of weighted summary.
  2. Feedforward Transformation
    Each vector then passes through a small neural network: a nonlinear transformation that lets the model reshape and abstract the information.
  3. Residual Connections + Normalization
    To keep gradients stable and learning smooth, the original inputs are added back into the output (residual), and the result is normalized.

This stack is applied layer after layer, often six, twelve, or more times, until the representation is no longer just a bag of words but a structured field of relationships.
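The three steps can be sketched in a few lines. This is a deliberately simplified single-head version (post-norm as in the original paper, ReLU feedforward, random matrices standing in for learned weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # scaled dot-product
    return softmax(scores) @ V                         # weighted summary per position

def encoder_block(X, Wq, Wk, Wv, W1, W2):
    h = layer_norm(X + self_attention(X, Wq, Wk, Wv))  # step 1 + residual + norm
    ff = np.maximum(0.0, h @ W1) @ W2                  # step 2: feedforward (ReLU)
    return layer_norm(h + ff)                          # step 3: residual + norm

rng = np.random.default_rng(0)
d = 64
X = rng.normal(size=(8, d))                            # 8 tokens, embedded
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
W1, W2 = 0.1 * rng.normal(size=(d, 4 * d)), 0.1 * rng.normal(size=(4 * d, d))
out = encoder_block(X, Wq, Wk, Wv, W1, W2)
print(out.shape)   # (8, 64): same shape in, same shape out
```

Because input and output shapes match, stacking is just applying `encoder_block` again with fresh weights per layer.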

At the end of the encoder, we have a new matrix:

\[
\mathbf{E} \in \mathbb{R}^{8 \times d} \quad \leftarrow \text{encoder outputs}
\]


Each row corresponds to a token in the input, now enriched.
It carries not just the meaning of the word, but the way it relates to its neighbors,
to the sentence, to the structure that surrounds it.

No summary token. No compression. Just a contextualized view, per token, of the input sequence.

This matrix is passed on to the decoder. It will not be changed,
only attended to, one token at a time.

4. Decoder Block: Like the Encoder, Yet Different

The following is seen entirely from the training perspective. There is no generation at work here; that comes in the inference phase, which is explained later. So please keep this in mind: training perspective.

The encoder sees the whole sentence at once. The decoder builds a sentence one word at a time but is trained to act as if it could see everything. That contradiction sits at the heart of this block.

The decoder is the second half of the Transformer. A mirror stack of layers,
but with one critical difference: it doesn’t just attend to itself.

It attends to something else: the encoded input.

A Two-Fold Attention

Each decoder block has two attention mechanisms:

  1. Masked Self-Attention
    The decoder attends to its own input — the beginning of the output sentence —
    but with a mask that blocks it from looking ahead. Every token can only see tokens to its left. This simulates the process of generation:

    “I’ve written this much — what comes next?”
  2. Cross-Attention
    Each token then attends to the output of the encoder:
    a fixed matrix representing the original input sequence.
    This is how the decoder pulls in meaning from the source.
    It’s not just predicting what comes next. It’s aligning that prediction with the context of the input.

The block then continues with the usual feedforward layer, residual connections, and normalization.
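The difference between the two attention types can be sketched directly (all names and sizes here are illustrative): queries come from the decoder side, while keys and values come from the fixed encoder output.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(decoder_h, encoder_out, Wq, Wk, Wv):
    Q = decoder_h @ Wq                 # queries: from the target side
    K = encoder_out @ Wk               # keys:    from the source side
    V = encoder_out @ Wv               # values:  from the source side
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V         # one source-aware vector per target position

rng = np.random.default_rng(0)
d = 64
decoder_h = rng.normal(size=(7, d))    # 7 target tokens so far
encoder_out = rng.normal(size=(8, d))  # 8 source tokens, fixed
Wq, Wk, Wv = (0.1 * rng.normal(size=(d, d)) for _ in range(3))
out = cross_attention(decoder_h, encoder_out, Wq, Wk, Wv)
print(out.shape)   # (7, 64): one row per target position, source mixed in
```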

What Goes In?

Let’s make this concrete with our familiar sentence:

Source (English):
"Gothic 1 was a cool game around..."

Target (German):
"Gothic 1 war ein cooles Spiel..."

During training, we already know the full target sentence.
We shift it one step to the right:

["<BOS>", "Gothic", "1", "war", "ein", "cooles", "Spiel"]

This becomes the input to the decoder.
Each position is tasked with predicting the next token.

The mask ensures that:

  • Position 0 sees only <BOS>
  • Position 1 sees <BOS>, "Gothic"
  • Position 2 sees <BOS>, "Gothic", "1"
    …and so on.

The decoder doesn’t “know” the full sentence, but it simulates guessing it, one word at a time.

Each position attends backward to what’s already been written,
and outward to the encoder’s output.
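Mechanically, the mask is just minus infinity added above the diagonal of the score matrix before the softmax. A minimal sketch:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

seq_len = 4                                    # <BOS>, "Gothic", "1", "war"
scores = np.zeros((seq_len, seq_len))          # stand-in for Q @ K.T / sqrt(d)
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                         # block every position to the right
weights = softmax(scores)

print(weights.round(2))
# row 0 puts all weight on <BOS>; row 3 spreads it over positions 0..3
```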

This structure is what allows translation to happen. Not by copying words,
but by learning how pieces of language align across structure and meaning.

What Comes Out?

At the top of the decoder stack, we now have a matrix of output vectors: one for each position in the shifted sentence.

Each vector is projected into a distribution over the vocabulary:
logits → softmax → probability.

These probabilities are compared to the actual next token, the ground truth.
This is where teacher forcing comes in:

  • The model is trained with the correct previous words.
  • It doesn’t generate during training.
  • It learns how to generate.

So position i in the decoder output predicts token i in the unshifted target.

That subtle shift of giving the model input from the past and asking it to predict the present (the teacher forcing method) is what enables efficient, parallel learning.
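In code, the shift and the comparison look roughly like this (toy token ids, random logits standing in for a real decoder’s output):

```python
import numpy as np

target = [2, 10, 11, 12, 13, 3]     # toy ids: 2 = <BOS>, 3 = <EOS>
decoder_input = target[:-1]         # the past: what the decoder is fed
labels = target[1:]                 # the present: what each position must predict

rng = np.random.default_rng(0)
vocab = 20
logits = rng.normal(size=(len(decoder_input), vocab))   # stand-in decoder output

# cross-entropy: negative log-probability assigned to the correct next token
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
loss = -log_probs[np.arange(len(labels)), labels].mean()
print(loss > 0)   # True: random logits are far from the ground truth
```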

The Confusion: A Clarifying Table

| Position | Decoder Input Token | Predicts Token |
|----------|---------------------|----------------|
| 0        | <BOS>               | “Gothic”       |
| 1        | “Gothic”            | “1”            |
| 2        | “1”                 | “war”          |
| 3        | “war”               | “ein”          |

At each position, the model does not see the token it is asked to predict.
It only sees what came before in both the source and the target.

That’s why we shift the sentence. That’s why we mask. That’s how the model learns to write. Not by copying, but by guessing what comes next when the answer is just out of reach.

This block (structurally simple, behaviorally complex) is the reason the Transformer can generate fluent language. But we haven’t yet seen it in action.

To understand how this works at runtime, we need to look at how these blocks (encoder and decoder) interact during training and generation.

What we have just seen in Section 4 is the training-time perspective on the decoder block. That means:

  • The decoder is fed the shifted target sequence (e.g. ["<BOS>", "Gothic", "1", "war", "ein", ...])
  • The model knows the entire target sentence and learns to predict each token in parallel
  • Masked self-attention ensures each token only attends to earlier ones
  • Cross-attention pulls context from the encoder’s full output
  • The outputs are token-wise predictions (logits), compared against the unshifted ground truth
  • And this is all wrapped in teacher forcing: correct previous tokens are used as input, not the model’s own predictions

5. The Grand Picture

Up to this point, we’ve looked at the Transformer as a structure. We saw layers stacked on layers, attention flowing in all directions, each block a neatly defined operation.

But a model is more than its shape. It has a rhythm, and that rhythm changes depending on whether it is learning or producing (training versus inference).

This is where we shift perspective: From what the Transformer is to what it does: from structure to process.

🧭 First: The Training Process

Let’s return to our running example.

Source (English):
"Gothic 1 was a cool game around..."

Target (German):
"Gothic 1 war ein cooles Spiel um..."

Step 1: Encode the Source

The encoder sees the full source sentence at once:

  • Tokenized into:
    ["Gothic", "1", "was", "a", "cool", "game", "around", "..."]
  • Embedded and position-encoded
  • Passed through a stack of self-attention blocks
  • Output: a matrix of contextual token vectors

This matrix will not change. It becomes the static memory of the input that is available to every decoding step.

Step 2: Prepare the Target (Shifted)

The decoder input is the German sentence, shifted right:

["<BOS>", "Gothic", "1", "war", "ein", "cooles", "Spiel", "um"]

This is what the model receives during training.

Each token is embedded, position-encoded, and masked
so that it can only attend to the tokens before it and not after.

Step 3: Predict the Next Token

Each position in the decoder stack:

  • Looks leftward into the shifted target
  • Looks outward into the encoder’s full output
  • Produces a probability distribution over the vocabulary

The model is trained to match its prediction at each position
to the corresponding token in the unshifted target:

| Position | Decoder Input Token | Predicts Token |
|----------|---------------------|----------------|
| 0        | <BOS>               | “Gothic”       |
| 1        | “Gothic”            | “1”            |
| 2        | “1”                 | “war”          |
| 3        | “war”               | “ein”          |

All this happens in parallel. One training step. No loops, no sampling, no waiting for previous predictions. Only loss, gradient, update.

Step 4: The Outer Training Loop

We’ve just described one single training step: feeding a source-target pair into the encoder–decoder and computing the loss.

But training a Transformer means repeating this process millions of times, across a full dataset.

Here’s how that works in practice:

  • The model is fed mini-batches of input–output pairs.
    Each batch contains multiple sentence pairs, padded to equal length.
  • For each batch:
  1. Tokenize and embed both source and target
  2. Shift the target sequence to form the decoder input
  3. Compute the output logits (predicted tokens)
  4. Compare predictions to the ground truth targets
  5. Compute loss (typically cross-entropy)
  6. Backpropagate the gradients
  7. Update the model weights

This is wrapped in a loop:

for batch in training_data:
    source, target = batch                   # one mini-batch of sentence pairs
    target_shifted = shift_right(target)     # prepend <BOS>, drop the last token
    output = model(source, target_shifted)   # one parallel forward pass
    loss = compare(output, target)           # cross-entropy vs. the unshifted target
    update_weights(loss)                     # backpropagation + optimizer step

The model doesn’t see the full corpus at once.
It moves through it batch by batch, learning a little more with each step.

The embedding weights, the attention patterns, the feedforward layers
are all adjusted, again and again, until the model no longer needs the correct answer to guess what comes next.


🔁 Now: The Inference Process

When the model is deployed (for translation, summarization, or any other generative task) there is no ground truth to guide it. There is only the input and the uncertainty of what comes next.

This is where the behavior of the decoder changes.

Let’s walk through it step by step using our example:

Input (German):
"Gothic 1 war ein cooles Spiel um das Jahr..."
(“Gothic 1 was a cool game around the year…”)

Goal: Generate the English translation — one token at a time.

Step 1: Encode the Input

The encoder receives the full German sentence:

  1. It is tokenized, embedded, and positionally encoded.
  2. The encoder processes all tokens in parallel through stacked self-attention and feedforward layers.
  3. The output is a matrix of contextual vectors: one per input token.

This matrix now holds the model’s internal representation of the source sentence.
It will remain unchanged. It is now a fixed reference for every decoding step.

Step 2: Initialize the Decoder

The decoder begins with a special start token:

["<BOS>"]

This single token is embedded and positionally encoded.
The decoder stack processes it by applying:

  • Masked self-attention (though it only sees one token)
  • Cross-attention to the full encoder output

The output is a vector of logits: a probability distribution over the vocabulary.

Let’s say the model predicts:

"Gothic"

This becomes the first generated token.

Step 3: Grow the Sequence

At this point, the model enters a loop. This loop is not inside the Transformer itself, but in the code that surrounds it. The encoder output stays fixed, like a memory of the input sentence. The decoder is now called repeatedly, each time with a slightly longer sequence of previously generated tokens. At every step, the model predicts one more word until the sentence is complete.

The decoder input is now:

["<BOS>", "Gothic"]

The decoder runs again (a loop step) with the new, longer sequence.

This time, masked self-attention allows "Gothic" to attend to <BOS>,
and the model can now use both that small internal history and the full encoder output to predict the next word.

Maybe it predicts:

"1"

Then:

"was", "a", "cool", "game", "around", "2001", "."

Each step repeats the same process:

  • Attend to the previous decoder tokens (masked)
  • Attend to the encoder output (unmasked)
  • Predict the next token

Step 4: Stop Condition

The loop continues until one of the following happens:

  • The model generates a special <EOS> token
  • A natural stopping point is reached (e.g. period + newline)
  • A maximum length is exceeded

At that point, the output tokens are decoded back into a human-readable string:

"Gothic 1 was a cool game around 2001."

Behind the Scenes: The Loop

While the encoder runs once, the decoder is called again and again, growing its input each time.

This loop is not part of the model; it is part of the logic around it:

generated = ["<BOS>"]
while not finished:
    logits = decoder(generated, encoder_output)  # encoder output stays fixed
    next_token = sample(logits[-1])              # only the last position is read
    generated.append(next_token)                 # the sequence grows by one token

The model itself remains unchanged, but it is applied again and again, using its own outputs as new inputs.

This is the generative mode of the Transformer:
Not parallel, not teacher-forced, but step-by-step construction of a sentence
that has never existed before.

Each new token is a guess based on all the guesses before it, and on a single frozen view of the input.

The model doesn’t know how long the sentence should be. It simply continues until it decides it’s done.
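To make that loop concrete, here is a toy version in which the whole network is replaced by a lookup table (`fake_next` is purely illustrative; a real decoder would return logits over the vocabulary):

```python
BOS, EOS = "<BOS>", "<EOS>"

# stand-in for the decoder: maps the last token to "the most likely next token"
fake_next = {BOS: "Gothic", "Gothic": "1", "1": "was", "was": "...", "...": EOS}

def generate(max_len=10):
    generated = [BOS]
    while generated[-1] != EOS and len(generated) < max_len:
        generated.append(fake_next[generated[-1]])   # one decoder call per step
    return generated

print(generate())   # ['<BOS>', 'Gothic', '1', 'was', '...', '<EOS>']
```

The structure is identical to real inference: grow the sequence one token at a time, stop on `<EOS>` or at a length limit.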

6. The Cost of Insight

The Transformer sees all tokens at once. It builds connections between every word and every other word whether they’re five syllables apart or fifty.

This is its gift. But it is not free.

Quadratic Attention

At the heart of the model lies a matrix. A matrix that holds attention scores between every pair of tokens in the sequence.

For a sentence of length n, that’s an n × n matrix.

  • 128 tokens → 16,384 attention weights
  • 512 tokens → over 250,000
  • 1024 tokens → more than a million

And that’s per head, per layer.

The cost of global awareness is quadratic growth in memory, in compute, in time.
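The arithmetic behind those numbers is easy to check:

```python
def attention_scores(n: int) -> int:
    """Pairwise attention scores for a sequence of length n (per head, per layer)."""
    return n * n

for n in (128, 512, 1024):
    print(n, attention_scores(n))   # 16384, 262144, 1048576

# with e.g. 8 heads and 6 layers, 1024 tokens already mean
# 1_048_576 * 8 * 6 = 50_331_648 scores per forward pass
```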

This limits how long your input can be. It’s why documents are often truncated,
why researchers hunt for sparse approximations, and why so much recent work focuses not on better thinking, but on cheaper attention.

Noisy Relevance

The model attends to every token, but not every token deserves it.

Sometimes the model pays attention to punctuation.
Sometimes it splits attention across meaningless filler.
Sometimes, the very openness of attention leads to diffused focus: a gaze that sees everything, and therefore sees nothing.

Expressiveness vs. Efficiency

This is the paradox:

  • The more freely tokens can interact,
  • the more burdened the model becomes by that freedom.

You can’t compress insight into a single path.
So the Transformer explores all of them and must learn, on its own,
which ones matter.

In theory, this is elegant. In practice, it is wasteful, especially when many tokens don’t need to talk to each other at all.

The Model Sees Everything, Even What It Should Ignore

There is no mechanism in the vanilla Transformer to say: this part of the input is irrelevant, so skip it. No inductive bias for locality, no temporal structure, no sliding memory.

Everything is potentially connected to everything.

That works well when the signal is dense and when every word depends on every other.

But with sparse data (long documents, code, structured tasks), this becomes a problem.

And yet: we live with this cost.

Because for all its excess, the Transformer makes no assumptions. It does not try to guess what matters in advance. It lets the data decide.

And in doing so, it gives us a model that is not just precise but adaptable.

A structure that is expensive, but open. Heavy, but general. A model that sees too much, so that eventually it learns to see just enough.

7. The Beginning, Not the End

The architecture described in 2017 was clean. Stacked layers, self-attention, feedforward blocks. No recurrence, no convolution. It was not optimized, not lean, not tuned for any particular task.

But it was complete. And that completeness made it adaptable.

What followed wasn’t just research. It was a rearrangement of the same structure: cut, mirrored, simplified, scaled.

BERT: Just the Encoder

In 2018, BERT kept the encoder and removed the decoder.

It was trained to reconstruct masked tokens inside sentences.
It does not generate; it recovers. It could read deeply, but not speak.

That was enough to transform tasks like classification, question answering, and sentence similarity. The model became an interface to language itself.

GPT: Just the Decoder

At the same time, GPT did the opposite.
It kept the decoder and trained it to generate.

Next-token prediction, over massive text corpora. No input/output pairs, just a long stretch of words, and the task of guessing what comes next.

From this, it learned to write, to summarize, to improvise. The model could speak but only from left to right.

T5 and BART: The Return to Both

Later came T5 and BART, which brought back the encoder–decoder structure.

Now the input could be anything: a question, a sentence, a command.
The output: whatever text made sense.

These models weren’t trained on specific tasks.
They were trained on everything: summarization, translation, question answering, title generation.

A model no longer built for one thing, but pretrained on the world, and fine-tuned later.

General-Purpose Thinking

This is where we are now.

The original Transformer didn’t try to define intelligence. It defined a form: a way for data to pass through space, attend, transform, and emerge.

From that form came a wave of systems that do not need separate logic for every task.

They just need input. And tokens.

The decoder writes. The encoder shapes. And attention moves through it all, as if language itself had become architecture.

Primer for Part 5: Transformer Variants on the Rise

The original Transformer was a full stack: encoder and decoder, built to translate.
But as soon as it was published, researchers began disassembling it. Not to simplify it, but to specialize it.

Some models kept only the encoder, others only the decoder.
The question wasn’t what to keep, but what the model should learn to do.

  • Do we want a model that understands text?
  • Or one that continues it?
  • Or one that rewrites, reformulates, fills in gaps?

Suddenly, the architecture wasn’t just about layers. Now it was about intent.
And with that came new objectives:

  • Masked Language Modeling: the art of guessing the missing word
  • Causal Language Modeling: the discipline of speaking forward
  • Span prediction, denoising, sequence-to-sequence transformation — each objective shaping a different kind of intelligence

In the next part, we’ll explore how this branching happened.
How GPT became a storyteller. How BERT became a reader. How models like T5 learned to be something else entirely: a kind of textual Swiss army knife.

The Transformer wasn’t the final design. It was the foundational one.

What came next wasn’t just variation. It was the beginning of purpose-built cognition.

