The most basic architecture of a neural network is called a Multi-Layer Perceptron (MLP). It consists of multiple layers of simple computational units (neurons) stacked one after another, where each neuron performs a weighted sum of its inputs followed by a non-linear activation function.
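
Concretely, here is a minimal sketch of that computation in NumPy; the layer sizes and the ReLU activation are illustrative choices, not part of any fixed definition:

```python
import numpy as np

def relu(x):
    # A common non-linear activation: negative values become zero.
    return np.maximum(0.0, x)

def mlp_forward(x, layers):
    """Each layer: weighted sum of the inputs, plus a bias, followed by a non-linearity."""
    h = x
    for W, b in layers[:-1]:
        h = relu(h @ W + b)
    W_out, b_out = layers[-1]
    return h @ W_out + b_out   # the output layer is often left linear

rng = np.random.default_rng(0)
# A tiny 4 -> 8 -> 3 network with random weights and zero biases.
layers = [(rng.normal(size=(4, 8)), np.zeros(8)),
          (rng.normal(size=(8, 3)), np.zeros(3))]
print(mlp_forward(rng.normal(size=4), layers))   # three output values
```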

At its core, a neural network is designed to be a function approximator.
You start with an input (for example, an image, a text sequence, or a number) and want to produce a corresponding output (such as a label, a translation, or a prediction). The catch is: you don’t explicitly know the true function that connects input to output. And often, the true function is way too complex to write down manually anyway.

Instead, you train the network by showing it many examples of input-output pairs. Using a loss function that measures how wrong the network’s current prediction is, the network adjusts its internal parameters (its weights and biases) to reduce this error. Over time, it learns a set of parameters that approximate the true mapping from input to output, even though the exact function was never given.

This is the most basic and profound idea behind neural networks.
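
As a hedged illustration (one possible setup, not the only one), the whole loop of predicting, measuring the error, and adjusting weights and biases can be sketched in PyTorch as follows; the toy target function y = 3x + 1, the layer sizes, and the learning rate are made up for the example:

```python
import torch
import torch.nn as nn

# Toy input-output pairs: the network only ever sees these examples, never the formula.
x = torch.linspace(-1, 1, 64).unsqueeze(1)
y = 3 * x + 1

model = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))  # a tiny MLP
loss_fn = nn.MSELoss()                           # how wrong is the current prediction?
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for step in range(500):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)                  # measure the error on the examples
    loss.backward()                              # gradients w.r.t. weights and biases
    optimizer.step()                             # adjust parameters to reduce the error

print(loss.item())                               # close to zero: the mapping was approximated
```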

Once researchers understood that basic neural networks (MLPs) can approximate functions, they also quickly realized two problems:

  1. Efficiency:
    A fully connected MLP becomes extremely inefficient for complex inputs like images. Imagine an image with 1000×1000 pixels. A simple MLP would need a million input neurons! And each of them would be connected to every neuron in the next layer, leading to an explosion in the number of parameters (see the quick calculation after this list).
    → Too slow, too memory-hungry, and too hard to train.
  2. Structure in the Data:
    Real-world data often has structure. For example, in images, nearby pixels are more related than far-away ones. An MLP treats every input equally and independently, which ignores this structure.
    → MLPs are blind to useful patterns in the data.
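
Here is the quick calculation referenced in point 1; the hidden-layer width of 1,000 is an arbitrary illustrative choice:

```python
# Parameter count of a single fully connected layer on a 1000x1000 image.
inputs = 1000 * 1000            # one input neuron per pixel
hidden = 1000                   # even a fairly modest first hidden layer
weights = inputs * hidden       # every pixel connects to every hidden neuron
biases = hidden
print(f"{weights + biases:,}")  # 1,000,001,000 -> about a billion parameters in one layer
```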

These two problems drove researchers to look for smarter architectures that could:

  • Handle large inputs more efficiently (fewer parameters, faster computation)
  • Exploit the structure of the data (such as spatial locality in images, or sequentiality in text)

This led to ideas like:

| Architecture Module | Main Idea | Motivation |
| --- | --- | --- |
| Convolutions (CNNs) | Apply the same small filter across the entire input. | Reduce parameters, exploit local patterns (e.g., edges, textures). |
| Bottlenecks | Compress information into a smaller latent space before expanding again. | Reduce computation and overfitting, make training deeper networks feasible. |
| Attention | Let the network learn which parts of the input to focus on dynamically. | Handle long-range dependencies (especially important in sequences like text). |

โžก๏ธ Efficiency and data structure awareness were the two main forces that pushed researchers beyond plain MLPs.

And how did they approach this research?

  • Inspired by biology: Early ideas like convolutional layers were inspired by studies of the visual cortex (Hubel and Wiesel’s experiments in the 1960s, where they found neurons respond to edges and simple patterns).
  • Trial and error: Researchers tried many different architectures and training strategies, often guided by intuition and small experiments.
  • Learning from failures: Problems like “vanishing gradients” (which made deep MLPs hard to train) led to inventions like bottlenecks, skip connections (ResNet), and better activation functions (ReLU).

In a way, neural network architecture research has been a constant dance between practical necessity and clever inspiration.

Timeline of Neural Network Architecture Innovations

| Time Period | Innovation | Motivation / Problem Solved |
| --- | --- | --- |
| 1950s–60s | Perceptron (Rosenblatt) | First simple model of a neuron. Showed that machines could “learn” basic patterns. |
| 1980s | Multi-Layer Perceptron (MLP) + Backpropagation (Rumelhart, Hinton, Williams) | Enabled training deep (multi-layer) networks. Recognized neural networks as general function approximators. |
| 1989–1998 | Convolutional Neural Networks (CNNs) (LeCun et al., LeNet) | Handle images efficiently by exploiting local spatial patterns (like edges). Reduced the number of parameters massively. |
| 1997 | Long Short-Term Memory (LSTM) (Hochreiter & Schmidhuber) | Solve problems in sequence data (e.g., text, time series) by addressing vanishing gradients. |
| 2012 | Deep CNNs (AlexNet, Krizhevsky et al.) | Huge breakthrough in the ImageNet competition. Deep CNNs + GPUs suddenly made neural networks mainstream. |
| 2015 | Residual Networks (ResNet) (He et al.) | Allowed extremely deep networks (>100 layers) by introducing skip connections to solve vanishing gradient problems. |
| 2015 | Bottleneck Blocks (ResNet variants) | Made deep networks more computationally efficient by compressing and expanding channels (1×1 convolutions). |
| 2017 | Attention Mechanism and Transformer architecture (Vaswani et al.) | Replaced recurrence and convolutions for sequences. Dynamically focuses on important parts of the input, allowing parallel training and better handling of long dependencies. |
| 2020s | Vision Transformers (ViTs), Swin Transformers, Hybrid Models | Brought transformer-based models into computer vision, sometimes outperforming CNNs by using self-attention across patches. |

So what would happen if we hadn’t invented architecture modules, but had an unlimited number of computers available?

Even without time or resource constraints, a plain, extremely deep Multilayer Perceptron (MLP) would still not be ideal for all tasks.
Here’s why:

  • No structure awareness: An MLP treats every input feature equally and independently.
    For example, in an image, the spatial relationships (e.g., “this pixel is next to that one”) are crucial. MLPs have no built-in way to recognize local patterns like edges, textures, or hierarchies unless they somehow relearn everything from scratch, which is extremely inefficient and unreliable.
  • Positional information is ignored: In sequences (like text or time series), the order matters. “Dog bites man” and “Man bites dog” are very different!
    MLPs are order-agnostic unless you explicitly encode the order, which is what architectures like RNNs or Transformers naturally handle.
  • Generalization is weaker:
    Specialized architectures like CNNs or attention modules allow the model to generalize better from fewer examples because they build in prior knowledge (e.g., “local features are important” in images).
    A deep MLP would need way, way more data to accidentally “stumble” on the same ideas without this built-in help.
  • Scalability to new domains:
    Modern tasks like language modeling, video analysis, and graph understanding need architectures that understand relationships between parts of the input, not just flat mappings from input to output.

So, even with infinite compute, MLPs lack inductive biases (structural assumptions about the input) that specialized modules like convolutions, attention, and others bake in.

These biases are essential for making learning efficient, effective, and scalable. To make this intuition even clearer, let’s look at two practical examples:

Example 1: Why a plain MLP struggles with images

Imagine you show a deep MLP millions of cat pictures.
The MLP sees each pixel individually, without understanding that groups of nearby pixels form shapes, and that shapes form ears, eyes, and fur patterns: the features that define a cat.

In contrast, a Convolutional Neural Network (CNN) knows by design that local pixel groups matter. It learns to detect edges, curves, and textures as building blocks for recognizing higher-level concepts like a cat’s face.

Without this built-in awareness, a plain MLP would have to independently “rediscover” the idea of edges, patterns, and hierarchies from scratch, requiring vastly more data and still struggling to generalize well.
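
One small experiment makes this built-in awareness visible (a sketch assuming circular padding, so that a shifted image wraps around cleanly): because a convolution applies the same filters everywhere, shifting the input simply shifts its feature maps, while a fully connected layer offers no such guarantee.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
conv = nn.Conv2d(1, 8, kernel_size=3, padding=1, padding_mode="circular")
image = torch.randn(1, 1, 32, 32)
shifted = torch.roll(image, shifts=(5, 3), dims=(2, 3))   # move the "cat" a few pixels

# Convolution: the feature maps of the shifted image are just the shifted feature maps.
print(torch.allclose(conv(shifted),
                     torch.roll(conv(image), shifts=(5, 3), dims=(2, 3)),
                     atol=1e-5))                           # True

# Fully connected layer on flattened pixels: the outputs simply change, with no fixed relation.
dense = nn.Linear(32 * 32, 8)
print(torch.allclose(dense(image.flatten(1)), dense(shifted.flatten(1)), atol=1e-5))  # False
```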

Example 2: Why a plain MLP struggles with text

Imagine you give a deep MLP the sentence:
“The cat sat on the mat.”

The MLP sees each word (or letter) as just another input feature. No notion of order, grammar, or context.
For the MLP, “cat sat mat the on the” might look just as valid as the correct sentence!

In contrast, architectures designed for sequences, such as Recurrent Neural Networks (RNNs) or Transformers, naturally capture the positions of words and the relationships between them.
They understand that “the cat” is different from “cat the”, and that “sat” relates to “cat” as an action.

Without this built-in sequential understanding, a plain MLP would again need to “reinvent” the rules of language through brute force, making learning slower, less reliable, and more data-hungry.
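
A hedged way to see this in code: a common trick for feeding variable-length text to an MLP is to average word embeddings into one fixed-size vector, and that representation, and therefore the MLP’s output, is completely blind to word order. The tiny vocabulary and layer sizes below are made up for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3, "mat": 4}
embed = nn.Embedding(len(vocab), 16)
mlp = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))

def score(sentence):
    ids = torch.tensor([vocab[w] for w in sentence.split()])
    return mlp(embed(ids).mean(dim=0))         # averaging throws the word order away

a = score("the cat sat on the mat")
b = score("cat sat mat the on the")            # the scrambled version from above
print(torch.allclose(a, b))                    # True: both sentences look identical to the MLP
```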

How do researchers approach a task they want to solve with neural networks?

There are two main principles researchers use:

1. Analyze the nature of the input and the output

You ask:

  • What structure does the input have?
    (e.g., is it spatial like an image, sequential like text, or graph-structured like molecules?)
  • What structure does the desired output have?
    (e.g., a label? a pixel-wise map? a sequence of tokens? a graph?)

This analysis leads you to pick modules that are naturally good at handling the structure.

Examples:

| Input → Output | Suitable Modules | Why |
| --- | --- | --- |
| Image → Label (cat vs. dog) | CNNs | Local patterns (edges, shapes) are important; spatial hierarchies matter. |
| Image → Pixel-wise segmentation | CNNs + upsampling modules (e.g., U-Nets) | Need fine-grained spatial outputs; maintain spatial resolution. |
| Image → Text description | CNN encoder + sequence decoder (RNN, Transformer) | Need to first “understand” image features, then generate sequential language. |
| Time series → Anomaly detection | RNNs, 1D CNNs, Transformers | Temporal patterns matter; need memory or dynamic attention. |
| Text → Answer a question | Transformers (BERT, T5) | Long-range dependencies and reasoning over sequences. |

Key takeaway:
โžก๏ธ Different input/output structures suggest different “default” building blocks.


2. Understand the difficulties of the mapping

You also ask:

  • What is hard about this mapping? (Long-range dependencies? Fine details? Irregular patterns?)
  • Are there known problems that often appear?
    (e.g., vanishing gradients, missing long-term memory, blurry outputs)

This leads you to special tricks or modules.

Examples:

| Problem | Solution Module | Why |
| --- | --- | --- |
| Spatial detail loss in deep networks | Skip connections (U-Net, ResNet) | Help preserve fine-grained information. |
| Vanishing gradients | Residual connections | Help gradients flow through deep networks. |
| Need to focus dynamically | Attention mechanisms | Let the network “look” where it matters most. |
| Rare anomalies in images | Autoencoders, VAEs, specialized loss functions | Model normality first; detect deviations. |
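
To make the attention row concrete, here is a minimal sketch of scaled dot-product attention, the core operation behind Transformer-style attention; the tensor shapes are illustrative:

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    """Each query compares itself to all keys and takes a weighted average of the values."""
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # similarity of queries and keys
    weights = torch.softmax(scores, dim=-1)                    # where to "look": rows sum to 1
    return weights @ v, weights

torch.manual_seed(0)
q = torch.randn(1, 6, 16)   # e.g. 6 token positions, 16-dimensional representations
k = torch.randn(1, 6, 16)
v = torch.randn(1, 6, 16)
out, weights = scaled_dot_product_attention(q, k, v)
print(out.shape, weights.shape)   # torch.Size([1, 6, 16]) torch.Size([1, 6, 6])
```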

✅ Putting it together: Choosing or inventing architecture modules is a matter of

  • analyzing structure (input + output),
  • anticipating challenges (what makes the problem hard),
  • picking or designing modules that fit these needs.

It’s almost like being an engineer matching tools to a construction project, except the construction site is information itself.


For example:

  • Attention was invented because RNNs struggled with long sequences.
  • Vision Transformers arose because CNNs sometimes missed global context in images.

Insight:
When creating new modules, researchers often observe what the network is bad at, and then invent a module that makes it better.

Outro: Understanding and Shaping Neural Network Architectures

At their heart, neural networks are powerful tools for mapping implicit information to explicit understanding.

Starting from simple function approximators like MLPs, the field evolved new architectural ideas (like convolutions, bottlenecks, and attention) to handle real-world data more efficiently and intelligently.

Choosing the right architecture, or even inventing new modules, follows two guiding principles:

  • Understand the structure of your input and output.
  • Identify the challenges specific to your task.

Architecture is not just about stacking layers. It’s about building networks that align naturally with the data and solve the real difficulties hidden in the problem.
This mindset is what has driven and continues to drive the thrilling progress in AI today.

