A Visual History — 1943 to 2026

The Architecture
of Intelligence

From cybernetics and the Perceptron to transformers, diffusion models, and the agentic web. Eight decades of ideas that built the machine mind.

8 Chapters · Interactive Timeline · Live Charts
"The story of artificial intelligence is really a story about transfer. Over eight decades, cognitive work that once required human operators has progressively migrated into computational systems."

The progression has not been linear. It moved in fits and starts, through overlapping eras interrupted by paradigm shifts, funding collapses, and sudden hardware-driven leaps. Two competing philosophies shaped everything: the symbolic camp believed intelligence was fundamentally about top-down manipulation of explicit rules. The connectionist camp argued it emerges bottom-up from neural networks processing large amounts of data. Today's most capable autonomous agents use the pattern-matching power of neural networks to execute goal-oriented planning loops that classical symbolic theorists would recognize from their work fifty years ago.

Key milestones, 1943 to 2026

Chapter 01

The genesis of
mechanized thought

1940s to 1950s  ·  Cybernetics, neurons, and the Turing Test

The intellectual foundation for modern AI came together before anyone had used the term. In the 1940s, mathematics, logic, and early neuroscience were converging, and the framework that emerged was initially called "cybernetics," built largely by mathematician Norbert Wiener.

Wiener's core ideas came out of World War II, when he was working on systems to aim anti-aircraft guns at fast-moving bombers. The problem was prediction: human pilots don't move randomly, so past behavior could be modeled statistically to forecast future trajectories. More importantly, he formalized the concept of the "feedback loop," observing that both biological and mechanical systems work by sensing their environment, processing that input, and adjusting behavior accordingly. That architecture is the same "perceive-plan-act" loop that governs modern autonomous AI agents.

In 1943, Warren McCulloch and Walter Pitts published the first mathematical model of an artificial neuron, showing that networks of simple binary threshold units could implement logical functions. A single such unit could compute AND, OR, and NOT, but not linearly inseparable functions like XOR or XNOR; handling those requires stacking units into multiple layers.
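
To make the mechanism concrete, here is a minimal sketch of a McCulloch-Pitts unit in Python. The weights and thresholds are chosen by hand purely for illustration; the original model had no learning procedure at all.

```python
def mp_neuron(inputs, weights, threshold):
    """McCulloch-Pitts unit: fire (1) if the weighted sum reaches the threshold."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) >= threshold else 0

cases = [(0, 0), (0, 1), (1, 0), (1, 1)]

# AND: both inputs must be on
for x in cases:
    print("AND", x, mp_neuron(x, weights=(1, 1), threshold=2))

# OR: either input suffices
for x in cases:
    print("OR ", x, mp_neuron(x, weights=(1, 1), threshold=1))

# XOR has no single (w1, w2, threshold) solution: it is not linearly
# separable, so it needs a second layer of units.
```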

Donald Hebb added something critical in 1949. His principle that "neurons that fire together, wire together" brought synaptic plasticity into artificial networks and pointed connectionist research toward learned, weighted connections.
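
A common formalization of Hebb's rule strengthens a connection in proportion to the product of pre- and post-synaptic activity. The sketch below uses illustrative numbers and NumPy; note that the raw rule grows weights without bound, which is why later variants add normalization.

```python
import numpy as np

eta = 0.1                        # learning rate
x = np.array([1.0, 0.0, 1.0])    # pre-synaptic activity: units 1 and 3 fire, unit 2 is silent
w = np.array([0.1, 0.1, 0.1])    # initial connection weights

for _ in range(5):
    y = w @ x                    # post-synaptic activity
    w += eta * x * y             # Hebb: co-active connections strengthen; the silent one never moves
    print(w.round(3))
```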

In 1950, Alan Turing reframed the entire question. His Imitation Game replaced "can machines think?" with a behavioral test: if a machine could convince a human interrogator, over a typed exchange, that it was human, that was sufficient evidence of intelligence. He also described "learning machines," systems that could alter their own rules through inductive processes, an early anticipation of how modern systems adjust themselves through training procedures like gradient descent.

These threads came together at the 1956 Dartmouth Summer Research Project, where the term "Artificial Intelligence" was formally adopted. Frank Rosenblatt introduced the Perceptron in 1957, the first trainable neural network. Then Minsky and Papert demolished it in 1969, proving single-layer networks couldn't compute XOR. Funding collapsed, and neural networks were set aside for more than a decade.

McCULLOCH-PITTS NEURON · 1943. Diagram: inputs x₁, x₂, x₃ with weights w₁, w₂, w₃ summed against threshold θ to produce output y; alongside, Wiener's sense → process → act feedback loop (1940s).
1943

McCulloch-Pitts Neuron

First mathematical model of an artificial neuron, proving networks could represent logical functions.

1949

Hebbian Learning

"Neurons that fire together, wire together." Introduced neuroplasticity and weighted inputs.

1950

The Turing Test

Replaced philosophical debate with a behavioral benchmark. Theorized learning machines.

1957

The Perceptron

The first trainable neural network, built by Rosenblatt. Critiqued by Minsky and Papert in 1969 for its XOR failure.

Chapter 02

The symbolic era and
the AI winters

1960s to 1980s  ·  GOFAI, Expert Systems, and two funding collapses

EXPERT SYSTEM LOGIC · 1980s. Diagram: a MYCIN-style IF-THEN-ELSE rule checks symptoms (fever above 38 °C, bacterial infection) to decide between a prescription and further tests. Caption: the "knowledge acquisition bottleneck", every edge case requires a handwritten rule.

With early neural networks written off as mathematically insufficient, the 1960s and 1970s were dominated by Symbolic AI, sometimes called "Good Old-Fashioned AI" (GOFAI). The core assumption: human intelligence could be reduced to explicit symbol manipulation according to logical rules. Intelligence was treated as a search problem.

A notable milestone was SHRDLU, Terry Winograd's natural language system built at MIT in the early 1970s. It let users command a virtual robot in a simulated "blocks world." SHRDLU could parse complex spatial commands and work out intermediate steps. It was impressive. But it worked entirely within a closed micro-world with fixed physics. Real language, with its ambiguity and grammatical complexity, was completely beyond it.

By the 1980s, symbolic AI had evolved into "Expert Systems" with rule databases encoding specialist knowledge in "if-then" form. MYCIN diagnosed bacterial infections. DENDRAL helped chemists interpret mass spectrometry. XCON, at Digital Equipment Corporation, configured computer hardware and reportedly saved millions.

The structural problem was the "knowledge acquisition bottleneck." Writing rules for every edge case was expensive, and the systems couldn't learn. As new rules were added, they conflicted with existing ones. By the late 1980s, maintaining systems like XCON cost more than they saved.

Two AI Winters followed. But the era wasn't without lasting contributions: Carl Hewitt's Actor Model (1973) and Michael Bratman's Belief-Desire-Intention (BDI) architecture both prefigure multi-agent systems built today.

1959

Dijkstra's Algorithm

Shortest-path search in weighted graphs. Remains foundational in robotics and network routing.

1973

Actor Model (Hewitt)

Autonomous "actors" as building blocks of concurrent computation. Prefigures modern multi-agent systems.

1980s

Expert Systems

MYCIN, DENDRAL, XCON. First major commercial AI deployment, and the first major commercial failure.

1984

Second AI Winter

Roger Schank and Marvin Minsky warned the 1984 AAAI conference that an "AI winter" was coming. Within a few years, funding collapsed as they had predicted.

Chapter 03

The statistical turn and
the deep learning revolution

1980s to 2010s  ·  Backpropagation, LSTMs, AlexNet, AlphaGo

When symbolic AI hit its wall, researchers made a clean break. Instead of writing rules, they built systems that find statistical patterns in raw data: decision trees, support-vector machines, ensemble methods. This was modern machine learning taking shape.

At the same time, connectionism got a second chance. In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published a paper on applying backpropagation to neural networks. Backpropagation is essentially the chain rule of calculus applied recursively: compute the loss, then propagate its error gradients backward through the network's hidden layers to adjust the weights. Combined with hidden layers and nonlinear activations, this overcame the linear-separability limits that had killed early connectionism.
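
A minimal worked example, assuming NumPy and a tiny two-layer sigmoid network: backpropagation applies the chain rule layer by layer to learn XOR, the function that defeated single-layer perceptrons. The hidden size, learning rate, and iteration count here are illustrative, and convergence depends on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 1, (2, 4)); b1 = np.zeros(4)   # input -> hidden
W2 = rng.normal(0, 1, (4, 1)); b2 = np.zeros(1)   # hidden -> output
sigmoid = lambda z: 1 / (1 + np.exp(-z))

for step in range(10_000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backward pass: chain rule applied layer by layer (squared-error loss)
    d_out = (out - y) * out * (1 - out)            # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)             # gradient pushed back to the hidden layer
    # gradient descent updates
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(out.round(2))  # should approach [[0], [1], [1], [0]]
```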

The "vanishing gradient problem" was brutal: as networks got deeper, error signals tended to either explode or vanish before reaching the early layers. Sepp Hochreiter and Jurgen Schmidhuber solved it for sequential modeling in 1997 with the Long Short-Term Memory (LSTM). Instead of a simple recurrent loop, LSTMs used a memory cell with three learnable gates: input, output, and forget. LSTMs became the backbone of NLP, machine translation, and speech recognition for the next two decades.

The real break came in 2012. Three things converged: massive labeled datasets (ImageNet), GPUs for parallel computation, and better algorithmic design. AlexNet, built by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, beat the second-place ImageNet system by 10.8 percentage points. Deep learning became the dominant AI approach almost overnight.

DeepMind's AlphaGo then mastered the game of Go using a dual-network architecture: a Policy Network to select moves, and a Value Network to estimate winning probability. First trained on human expert games, then refined through millions of self-play iterations.

BACKPROPAGATION · 1986. Diagram: input, hidden, and output layers with gradient flow running backward through the network.
1986

Backpropagation

Rumelhart, Hinton, Williams. Made training deep, multilayer networks theoretically viable.

1997

LSTM

Hochreiter and Schmidhuber solved vanishing gradients in sequential data with learnable memory gates.

2012

AlexNet

Won ImageNet by 10.8 percentage points. Deep learning became the dominant paradigm overnight.

2016

AlphaGo

Mastered Go via dual policy/value networks and self-play RL, defeating the world champion.

Chapter 04

Transformers and
the scaling laws

2010s to present  ·  Attention, Word2Vec, GPT, Kaplan vs. Chinchilla

Before the generative AI wave, NLP was stuck with sequential processing. A critical bridge appeared in 2013 with Word2Vec from Google. Word2Vec replaced sparse one-hot encodings with dense vector spaces, mapping words into coordinate systems based on surrounding context. "King minus man plus woman equals queen" actually worked in the geometry of the embedding space.
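
The arithmetic is easy to check on toy vectors. The three-dimensional "embeddings" below are hand-picked for illustration; real Word2Vec vectors have hundreds of dimensions and are learned from billions of words, but the nearest-neighbor-of-an-offset logic is the same.

```python
import numpy as np

# Hand-crafted toy embeddings: dimensions loosely stand for "royalty", "male", "female".
emb = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "apple": np.array([0.0, 0.5, 0.5]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target = emb["king"] - emb["man"] + emb["woman"]
best = max((w for w in emb if w not in {"king", "man", "woman"}),
           key=lambda w: cosine(emb[w], target))
print(best)  # "queen" is the nearest remaining vector to the offset
```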

The problem was that RNNs and LSTMs still processed words one after another. You couldn't parallelize across a full sequence, so training on large datasets was slow. That constraint broke in 2017 with "Attention Is All You Need," a paper from Google Brain and the University of Toronto that introduced the Transformer.

The Transformer dropped recurrence and convolutions entirely, replacing them with self-attention. Rather than processing sequentially, Transformers process every token in a sequence simultaneously. Running this across multiple parallel sub-spaces ("multi-head attention"), with positional encodings to preserve word order, enabled training to be massively parallelized. That was the infrastructure the LLM era needed.
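
A single attention head, sketched in NumPy under the standard scaled dot-product formulation; the projection matrices here are random stand-ins for learned parameters. Multi-head attention runs several of these in parallel over different projections and concatenates the results.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a whole sequence at once (no recurrence)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])              # every token scores every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the sequence
    return weights @ V                                    # context-mixed token representations

rng = np.random.default_rng(0)
seq_len, d_model = 5, 8                                  # e.g. "the model learns fast today"
X = rng.normal(size=(seq_len, d_model))                  # token embeddings + positional encodings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)               # (5, 8)
```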

Google introduced BERT in 2018, and OpenAI launched GPT. GPT-2 in 2019 had 1.5 billion parameters. GPT-3 in 2020 had 175 billion. These models showed something unexpected: just by scaling up parameter count and training data, models started doing things nobody had specifically trained them to do. Few-shot reasoning, translation, code generation. These "emergent abilities" weren't designed; they appeared.

In 2020, Jared Kaplan and colleagues at OpenAI showed that model performance follows a predictable power law with compute. In 2022, DeepMind's Chinchilla paper upended the field's reading of that law: GPT-3 had been massively undertrained. The compute-optimal ratio is roughly 20 tokens of training data per parameter, about 11 times more data per parameter than the field had been using.
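
The "11 times" figure is easy to reproduce. Assuming GPT-3's published numbers (175 billion parameters, roughly 300 billion training tokens), the Chinchilla-optimal budget works out as follows.

```python
# Back-of-the-envelope check on the 11x figure.
params = 175e9                                # GPT-3 parameter count
tokens_used = 300e9                           # tokens GPT-3 was reportedly trained on
chinchilla_tokens = 20 * params               # ~20 tokens per parameter -> ~3.5 trillion

print(tokens_used / params)                   # ~1.7 tokens per parameter actually used
print(chinchilla_tokens / tokens_used)        # ~11.7x more data for the same parameter count
```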

SELF-ATTENTION MATRIX · TRANSFORMER 2017. Diagram: each token of "the model learns fast today" attends to every other token with high or low attention weight.

Scaling laws: Kaplan vs. Chinchilla

For the same compute budget, Chinchilla's data-token ratio outperforms Kaplan's large-model strategy

Chart: performance versus compute budget for the Kaplan (2020) and Chinchilla (2022) strategies, with GPT-3 marked.
Chapter 05

The visual synthesis

2014 to 2021  ·  GANs, VAEs, Diffusion models, and CLIP

The same generative era that transformed NLP also transformed visual synthesis. Generative visual models learn the statistical distribution of a training dataset and generate new samples from it, rather than just classifying existing ones. The technical story is largely about where each architecture places the burden of modeling the "latent space" from which images are generated. Three architectures competed, each with a different answer.

2014

Generative Adversarial
Networks (GANs)

Diagram: a generator maps noise to a fake image; a discriminator judges real versus fake, and the loss trains both networks.

Two networks in adversarial training: a generator creates images from noise; a discriminator tries to catch fakes. StyleGAN achieved stunning photorealism.

Introduced ~2014

2013

Variational
Autoencoders (VAEs)

Diagram: an encoder maps x to a latent z ~ N(μ, σ²); a decoder reconstructs x′ from the sample.

Encoder maps data into a probabilistic latent distribution; decoder reconstructs from samples. Avoids mode collapse by design. Output tends to be blurrier.

Introduced ~2013

2020

Diffusion Models
(DDPM)

Diagram: a forward process adds Gaussian noise; a reverse process denoises step by step back to an image.

Forward process: adds Gaussian noise over hundreds of steps until pure static. Reverse process: a network learns to denoise step by step, recovering a pristine image. Slower than GANs, but immune to mode collapse. The forward half is sketched in code after this card.

DDPM by Ho et al., 2020

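A minimal sketch of the forward (noising) process described in the diffusion card above, using the closed-form expression x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 - ᾱ_t)·ε and the linear beta schedule reported by Ho et al.; the tiny random "image" is purely illustrative.

```python
import numpy as np

# DDPM forward process in closed form: x_t = sqrt(alpha_bar_t)*x_0 + sqrt(1 - alpha_bar_t)*eps
T = 1000
betas = np.linspace(1e-4, 0.02, T)            # linear noise schedule (Ho et al., 2020)
alpha_bar = np.cumprod(1.0 - betas)           # cumulative product of (1 - beta)

rng = np.random.default_rng(0)
x0 = rng.uniform(-1, 1, size=(8, 8))          # a tiny stand-in "image" scaled to [-1, 1]

def noised(x0, t):
    eps = rng.normal(size=x0.shape)           # fresh Gaussian noise
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1 - alpha_bar[t]) * eps

for t in (0, 250, 500, 999):
    print(t, round(float(np.std(noised(x0, t))), 3))   # signal fades toward pure Gaussian noise
# The reverse model is trained to predict eps from (x_t, t) and invert this, one step at a time.
```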

The commercial image generation boom was unlocked in January 2021, when OpenAI released CLIP (Contrastive Language-Image Pre-training). CLIP was trained to map images and text descriptions into the same mathematical space. Using CLIP to guide a diffusion model's denoising steps gave users the ability to steer image generation with natural language prompts. That's the backbone of Stable Diffusion, DALL-E, and Midjourney.
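
The core operation is small enough to sketch. The toy embeddings below stand in for the outputs of CLIP's image and text encoders, and the 0.07 temperature is the commonly cited value; the point is that matching images and captions score highest in a shared space, and that same score can be used to steer a diffusion model's denoising toward a prompt.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16
image_emb = rng.normal(size=(3, d))                      # stand-ins for an image encoder's outputs
text_emb = image_emb + 0.1 * rng.normal(size=(3, d))     # matching captions land nearby in the same space

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

logits = normalize(image_emb) @ normalize(text_emb).T / 0.07   # cosine similarity / temperature
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
print(probs.round(2))   # the diagonal dominates: each image matches its own caption
```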

Chapter 06

Human alignment and
the conversational turn

2017 to 2025  ·  RLHF, ChatGPT, reasoning models

Scaling up LLMs revealed a problem: training on internet text to predict the next token doesn't automatically produce a system that's safe, truthful, or useful. The statistical model is happy to confidently say false things if false things appear often in training data.

The fix came from a 2017 paper by Paul Christiano and colleagues at OpenAI and DeepMind, introducing what became known as RLHF (Reinforcement Learning from Human Feedback). Human contractors ranked pairs of AI outputs. That feedback trained a "reward model" as a proxy for human judgment. Then the main AI was optimized using Proximal Policy Optimization (PPO) to maximize the reward model's score.
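
The reward model at the center of this pipeline is typically trained with a pairwise preference loss of the form -log σ(r_chosen - r_rejected): if humans preferred response A over B, the model is penalized whenever it scores B higher. A minimal sketch, with illustrative reward values:

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected): small when the preferred response already scores higher
    return -np.log(1 / (1 + np.exp(-(r_chosen - r_rejected))))

print(preference_loss(r_chosen=2.0, r_rejected=-1.0))  # ~0.05: ranking already respected
print(preference_loss(r_chosen=-1.0, r_rejected=2.0))  # ~3.05: large penalty, gradients correct it
```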

Applied to NLP, RLHF became the core of modern alignment work. By early 2022, OpenAI had used it to fine-tune GPT-3 into InstructGPT, teaching the model to follow written instructions rather than just autocomplete text. That alignment work is what made ChatGPT work when it launched in November 2022. Millions of people who had never thought about LLMs suddenly had a coherent conversational AI in their browser.

By 2024 and 2025, the focus shifted toward reasoning. OpenAI's o1 and o3 models, alongside DeepSeek R1, used RL techniques to generate internal chain-of-thought reasoning traces before producing answers. The finding: reasoning behavior emerged from reinforcement learning without explicitly programming it in.

RLHF LOOP · 2017 ONWARDS. Diagram: a base LLM (GPT-3 / InstructGPT) produces responses A and B; a human contractor ranks them (A > B); the ranking trains a reward model; PPO optimizes the LLM against that reward.
Chapter 07

The agentic web

2022 to 2026  ·  ReAct, Toolformer, MCP, A2A protocols

LLMs, even well-aligned ones, have a fundamental limitation. They are static. An LLM is trained on data up to a cutoff date and produces text based on statistical patterns from that training. It can't look things up, run code, interact with systems, or update its knowledge. Human cognition doesn't work that way.

This gap created the agentic AI paradigm: use an LLM as a cognitive core, then wire it to external tools, memory systems, and control loops that let it act on the world. The shift is from generative AI (making content) to agentic AI (taking goal-directed actions in software environments). There's an irony worth noting: this architecture is really a return to the planning ambitions of symbolic AI, echoing the structure of the Belief-Desire-Intention (BDI) framework. The difference is that the rule-following logic that classical AI hand-coded is now handled by billions of neural parameters.

The ReAct (Reason + Act) framework in 2022 gave agents structure: generate a thought, take a domain-specific action like searching Wikipedia, observe the result, use it to inform the next thought. In 2023, the Toolformer paper from Meta showed LLMs could teach themselves to use external tools through self-supervised learning. This became "function calling."
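
A skeletal version of that loop, with `llm` and `search_wikipedia` as hypothetical stand-ins for a model call and a tool; the prompt format is simplified from the paper, but the thought → action → observation cycle is the point.

```python
def react_agent(question, llm, search_wikipedia, max_steps=5):
    """ReAct-style control loop: reason, act against a tool, observe, repeat."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        thought = llm(transcript + "Thought:")             # reason about what to do next
        transcript += f"Thought: {thought}\n"
        if "FINAL:" in thought:                            # the model signals it can answer
            return thought.split("FINAL:", 1)[1].strip()
        query = llm(transcript + "Action: search[")        # pick a domain-specific action
        observation = search_wikipedia(query)              # execute it against the world
        transcript += f"Action: search[{query}]\nObservation: {observation}\n"
    return "gave up"
```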

By 2025 and 2026, the proliferation of specialized AI agents created a fragmentation problem. Anthropic launched the Model Context Protocol (MCP) in late 2024 as a solution: an open-source standard built on JSON-RPC 2.0 that abstracts the integration layer between AI clients and external data sources. By early 2026, MCP had crossed 100 million monthly SDK downloads.
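
What "built on JSON-RPC 2.0" looks like in practice: every exchange is a JSON-RPC envelope with jsonrpc, id, method, and params fields. The envelope structure below is fixed by the JSON-RPC 2.0 spec; the tool name and arguments are illustrative stand-ins rather than a verbatim MCP exchange.

```python
import json

# Illustrative JSON-RPC 2.0 request of the kind an MCP client sends to a server.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",                   # ask the server to run one of its exposed tools
    "params": {"name": "query_database", "arguments": {"sql": "SELECT 1"}},
}
print(json.dumps(request, indent=2))
```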

A three-layer protocol stack crystallized in 2026, with governance increasingly moving to bodies like the Linux Foundation's Agentic AI Foundation and the W3C AI Agent Protocol Community Group.

MULTI-AGENT SYSTEM · 2026. Diagram: an orchestrator LLM core coordinates search, code, data, and finance agents, connected to tools over MCP and to each other over A2A.
L3

Commerce Protocols (UCP/ACP)

Agents negotiate payments, execute fulfillments, and manage procurement without requiring human authorization on every micro-transaction.

L2

Agent-to-Agent Protocol (A2A) + ANP

Routing and communication between separate autonomous agents. Incorporates W3C decentralized identity (DID) standards and encrypted agent-to-agent handshakes.

L1

Model Context Protocol (MCP)

Anthropic, 2024. Open-source standard built on JSON-RPC 2.0 for agent-to-tool integration and local context retrieval. 100M+ monthly SDK downloads by early 2026.

Conclusion

"The history of AI is not a clean line from ignorance to enlightenment. It is a back-and-forth between two different theories of what intelligence actually is."

The symbolic camp had elegance on its side: clean logic, explicit rules, inspectable reasoning. But the real world is too messy for handwritten rules to capture, and expert systems hit that wall hard. The connectionist bet proved right, once backpropagation solved the nonlinear problem, once LSTMs handled vanishing gradients in sequences, and once GPUs provided the compute to run training at scale.

The Transformer and the empirical scaling laws that followed unlocked a capability jump that surprised nearly everyone. But generation is not the same as agency. The current moment, the Agentic Web, is where the two traditions finally merge. Modern autonomous agents use the pattern-matching capabilities of large language models to execute structured planning loops that classical AI researchers would recognize. The planning logic runs on neural parameters instead of handwritten rules, but the architecture reflects both traditions.

These systems perceive, reason, and act on the digital world. That is what the pioneers of the 1950s were imagining, even if they had no way to predict the path it would take to get there.
