Chapter 10
Neural Networks & Deep Learning
Intuition behind layers, weights, and why deep learning is so powerful — and so hungry for data and compute.
What makes a neural network "neural"
Neural networks are loosely inspired by the brain — but don't take the analogy too literally. A biological neuron fires when it receives enough signal from connected neurons. An artificial neuron does something similar: it receives numerical inputs, multiplies each by a weight, sums them up, and passes the result through an activation function to produce an output.
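In code, a single artificial neuron is just a weighted sum followed by an activation. A minimal sketch in plain Python — the inputs, weights, bias, and the choice of a sigmoid activation are all illustrative:

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum: multiply each input by its weight, add them up, add a bias.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation function (sigmoid here): squashes the sum into (0, 1).
    return 1 / (1 + math.exp(-z))

neuron([0.5, -1.0, 2.0], [0.1, 0.4, 0.2], bias=0.1)
```

A whole network is nothing more than many of these, wired together in layers.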
Stack millions of these simple units in layers, connect every unit in one layer to every unit in the next, and let the whole system adjust its weights to minimise prediction error on a large dataset — and you have a neural network.
A simple neural network: inputs flow left → right through hidden layers to a prediction
Weights: what the network actually learns
Every connection between neurons has a weight — a number that determines how much that connection contributes to the next neuron's output. When we say a model is "trained," what we mean is: those weights have been adjusted, iteratively, to minimise prediction error on the training data.
A modest network with two hidden layers of 512 neurons each has on the order of 500,000 weights (the exact count depends on the input and output sizes). GPT-4 is estimated to have around 1.8 trillion parameters. Each weight is a tiny, learned fact about the patterns in the training data.
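That order-of-magnitude figure is easy to sanity-check: in a fully connected network, the weight count between two adjacent layers is just the product of their sizes. A quick sketch — the 400-dimensional input and 10-class output are assumed for illustration:

```python
def count_weights(layer_sizes):
    # Fully connected layers: weights between adjacent layers =
    # (neurons in previous layer) * (neurons in next layer).
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical network: 400 inputs, two hidden layers of 512, 10 outputs.
count_weights([400, 512, 512, 10])  # 400*512 + 512*512 + 512*10 = 472,064
```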
The training process, in brief
- Forward pass: feed an input through the network, get a prediction.
- Loss: measure how wrong the prediction was.
- Backpropagation: calculate how much each weight contributed to the error.
- Gradient descent: nudge each weight slightly in the direction that reduces error.
- Repeat billions of times.
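The loop above can be sketched end-to-end on the tiniest possible "network": a single weight fit by gradient descent. The data, learning rate, and step count are all illustrative:

```python
# Fit y = w * x to data generated with a true weight of 3,
# using mean squared error as the loss.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w, lr = 0.0, 0.05  # initial weight, learning rate

for step in range(200):
    # Forward pass + loss gradient: d/dw mean((w*x - y)^2) = mean(2*x*(w*x - y)).
    # (With one weight, "backpropagation" collapses to this single derivative.)
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    # Gradient descent: nudge the weight against the gradient.
    w -= lr * grad
```

After a couple of hundred steps, `w` sits at roughly 3.0 — the same loop, run over billions of weights and steps, is what "training" means.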
Why "deep" learning?
"Deep" refers to the number of hidden layers. Shallow networks (1–2 layers) can approximate many functions but struggle with complex patterns. Deep networks (many layers) learn hierarchical representations — early layers detect simple features, later layers combine them into increasingly abstract concepts.
In image recognition, this is literal:
- Layer 1 detects edges and gradients
- Layer 3 detects textures and simple shapes
- Layer 7 detects object parts (eyes, wheels, handles)
- Final layer combines parts into object categories
In language models, the hierarchy is less visible but equally real — early layers handle syntax and grammar, later layers handle semantics and reasoning.
Activation functions: why non-linearity matters
Without activation functions, a neural network — no matter how many layers — is just a linear transformation. Linear transformations can't model complex, curved decision boundaries. Activation functions introduce non-linearity, allowing the network to learn any shape of function given enough neurons.
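The collapse is easy to see in one dimension: two linear "layers" with no activation between them compose into a single linear function, so stacking gains nothing. The specific weights below are arbitrary:

```python
# Two linear "layers" with no activation in between...
w1, b1 = 2.0, 1.0   # layer 1: x -> w1*x + b1
w2, b2 = 3.0, -0.5  # layer 2: h -> w2*h + b2

def stacked(x):
    return w2 * (w1 * x + b1) + b2

# ...are exactly equivalent to one linear layer with combined parameters.
def collapsed(x):
    return (w2 * w1) * x + (w2 * b1 + b2)
```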
The most common activation today is ReLU (Rectified Linear Unit; the name just means it "rectifies" by zeroing out negatives). The rule is simply output = max(0, input): if a signal is positive, pass it through; if it's negative, output zero. Simple, fast, effective. Transformers use a smoother variant called GELU (Gaussian Error Linear Unit), which tapers off near zero rather than cutting hard and behaves slightly better in practice for language tasks.
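Both rules fit in a few lines of plain Python. The GELU below uses the common tanh approximation; a library implementation may differ slightly:

```python
import math

def relu(x):
    # Positive signals pass through unchanged; negative signals become zero.
    return max(0.0, x)

def gelu(x):
    # GELU (tanh approximation): like ReLU, but with a smooth taper near zero.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```

For large positive inputs both behave like the identity, and for large negative inputs both output (near) zero; they differ only in the transition region around zero.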
PM Insight
The choice of activation function is an implementation detail. What matters is why they exist: activations are what make neural networks powerful enough to learn from unstructured data — images, text, audio — that simpler models can't handle. If a use case involves unstructured data, neural networks are likely the right class of model.
Specialised architectures for different data types
The basic "fully connected" network works well for tabular data. Different data types need different architectures:
| Architecture | Best for | Product examples |
|---|---|---|
| CNN (Convolutional Neural Network) | Images, spatial data | Photo moderation, medical imaging, document parsing, visual search |
| RNN / LSTM (Recurrent NN / Long Short-Term Memory) | Sequences where order matters | Time-series forecasting, early NLP, user session modelling |
| Transformer | Language, code, multimodal | LLMs (GPT, Claude), translation, summarisation, code generation |
| GNN (Graph Neural Network) | Relationship/network data | Social networks, fraud rings, molecular property prediction |
| Diffusion models | Generative tasks | Image generation, audio synthesis, video generation |
Why deep learning needs so much data and compute
A neural network with millions of weights needs millions of examples to learn meaningful patterns rather than memorising noise. With too little data, it overfits — the weights encode specific training examples rather than generalisable rules.
Compute is needed because training involves billions of arithmetic operations per training step, across millions of steps. This is why GPUs (and TPUs) matter — they're designed for the massively parallel matrix multiplications that make up most of neural network computation.
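The core operation is worth seeing once. In a matrix multiplication, every output cell is an independent dot product, which is exactly the kind of work a GPU can spread across thousands of cores. A plain-Python sketch, serial, just to show the structure:

```python
def matmul(A, B):
    # Each output cell C[i][j] is the dot product of row i of A and
    # column j of B -- independent of every other cell, which is why
    # the whole computation parallelises so well on GPUs.
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19, 22], [43, 50]]
```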
Rough data requirements by task type
- Tabular ML (gradient boosting) → thousands to tens of thousands of examples
- Fine-tuning a pretrained model → hundreds to thousands of examples
- Training a vision model from scratch → millions of labelled images
- Training a large language model → trillions of tokens of text
PM Insight
"Let's just train a model on our data" is often impractical. For most product teams, the right starting point is a pretrained model (foundation model) that you adapt — through prompting, fine-tuning, or RAG — rather than training from scratch. The data and compute requirements for training from scratch are beyond what most companies can justify outside of core AI infrastructure.
Transfer learning: standing on giants' shoulders
Training from scratch is expensive. Transfer learning is the practical alternative: start with a model pretrained on a large general dataset, then adapt it to your specific task with far less data and compute.
This is why foundation models changed everything. A model pretrained on the entire internet has learned rich representations of language, facts, and reasoning that you can leverage for your product without reproducing that training. The cost you pay is adaptation, not education.
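The idea can be miniaturised: keep a "pretrained" feature extractor frozen and fit only a small head on the new task's data. Everything below — the stand-in extractor, the data, the learning rate — is illustrative:

```python
# A frozen "pretrained" extractor: in real transfer learning this would be
# a large model's learned layers; here it is a tiny stand-in.
def pretrained_extract(x):
    return (x, x * x)

# New-task data, generated from y = 2x + 0.5x^2 (learnable by the head).
data = [(1.0, 2.5), (2.0, 6.0), (3.0, 10.5)]
head = [0.0, 0.0]  # the only trainable weights
lr = 0.02

for _ in range(5000):
    for x, y in data:
        f1, f2 = pretrained_extract(x)  # frozen features: never updated
        pred = head[0] * f1 + head[1] * f2
        err = pred - y
        head[0] -= lr * err * f1        # only the small head learns
        head[1] -= lr * err * f2
```

Only two numbers get trained; the "expensive" part of the model is reused as-is. That asymmetry is the whole economic argument for adapting pretrained models.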
PM Playbook — Questions to ask
- Are we training from scratch or adapting a pretrained model? — almost always the latter; if the former, understand why
- What architecture is being used and why? — transformer for text, CNN for images — make sure it fits the data type
- How much labelled data do we have? — drives whether fine-tuning is viable or you need to rely on prompting
- What's the compute cost of training vs inference? — both matter for the build/buy decision
- What's the inference latency, and does that fit our user experience? — deep models can be slow; real-time features have hard constraints
- How do we handle the model being wrong? — neural networks fail confidently; plan for graceful degradation