Chapter 10
Neural Networks & Deep Learning
Intuition behind layers, weights, and why deep learning is so powerful — and so hungry for data and compute.
What makes a neural network "neural"
Neural networks are loosely inspired by the brain — but don't take the analogy too literally. A biological neuron fires when it receives enough signal from connected neurons. An artificial neuron does something similar: it receives numerical inputs, multiplies each by a weight, sums them up, and passes the result through an activation function to produce an output.
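In code, a single artificial neuron is just a weighted sum followed by an activation. A minimal sketch in plain Python — the inputs, weights, bias, and the choice of a sigmoid activation are all illustrative:

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum: multiply each input by its weight, add them up, add a bias.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Activation function (sigmoid here): squashes the sum into (0, 1).
    return 1 / (1 + math.exp(-z))

neuron([0.5, -1.0, 2.0], [0.1, 0.4, 0.2], bias=0.1)
```

A whole network is nothing more than many of these, wired together in layers.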
Stack millions of these simple units in layers, connect every unit in one layer to every unit in the next, and let the whole system adjust its weights to minimise prediction error on a large dataset — and you have a neural network.
A simple neural network: inputs flow left → right through hidden layers to a prediction
Weights: what the network actually learns
Every connection between neurons has a weight — a number that determines how much that connection contributes to the next neuron's output. When we say a model is "trained," what we mean is: those weights have been adjusted, iteratively, to minimise prediction error on the training data.
A modest network with two hidden layers of 512 neurons each has on the order of 500,000 weights (the exact count depends on the input and output sizes). GPT-4 is estimated to have around 1.8 trillion parameters. Each weight is a tiny, learned fact about the patterns in the training data.
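That order-of-magnitude figure is easy to sanity-check: in a fully connected network, the weight count between two adjacent layers is just the product of their sizes. A quick sketch — the 400-dimensional input and 10-class output are assumed for illustration:

```python
def count_weights(layer_sizes):
    # Fully connected layers: weights between adjacent layers =
    # (neurons in previous layer) * (neurons in next layer).
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Hypothetical network: 400 inputs, two hidden layers of 512, 10 outputs.
count_weights([400, 512, 512, 10])  # 400*512 + 512*512 + 512*10 = 472,064
```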
The training process, in brief
- Forward pass: feed an input through the network, get a prediction.
- Loss: measure how wrong the prediction was.
- Backpropagation: calculate how much each weight contributed to the error.
- Gradient descent: nudge each weight slightly in the direction that reduces error.
- Repeat billions of times.
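The loop above can be sketched end-to-end on the tiniest possible "network": a single weight fit by gradient descent. The data, learning rate, and step count are all illustrative:

```python
# Fit y = w * x to data generated with a true weight of 3,
# using mean squared error as the loss.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
w, lr = 0.0, 0.05  # initial weight, learning rate

for step in range(200):
    # Forward pass + loss gradient: d/dw mean((w*x - y)^2) = mean(2*x*(w*x - y)).
    # (With one weight, "backpropagation" collapses to this single derivative.)
    grad = sum(2 * x * (w * x - y) for x, y in data) / len(data)
    # Gradient descent: nudge the weight against the gradient.
    w -= lr * grad
```

After a couple of hundred steps, `w` sits at roughly 3.0 — the same loop, run over billions of weights and steps, is what "training" means.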
Why "deep" learning?
"Deep" refers to the number of hidden layers. Shallow networks (1–2 layers) can approximate many functions but struggle with complex patterns. Deep networks (many layers) learn hierarchical representations — early layers detect simple features, later layers combine them into increasingly abstract concepts.
In image recognition, this is literal:
- Layer 1 detects edges and gradients
- Layer 3 detects textures and simple shapes
- Layer 7 detects object parts (eyes, wheels, handles)
- Final layer combines parts into object categories
In language models, the hierarchy is less visible but equally real — early layers handle syntax and grammar, later layers handle semantics and reasoning.
Activation functions: why non-linearity matters
Without activation functions, a neural network — no matter how many layers — is just a linear transformation. Linear transformations can't model complex, curved decision boundaries. Activation functions introduce non-linearity, allowing the network to learn any shape of function given enough neurons.
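The collapse is easy to see in one dimension: two linear "layers" with no activation between them compose into a single linear function, so stacking gains nothing. The specific weights below are arbitrary:

```python
# Two linear "layers" with no activation in between...
w1, b1 = 2.0, 1.0   # layer 1: x -> w1*x + b1
w2, b2 = 3.0, -0.5  # layer 2: h -> w2*h + b2

def stacked(x):
    return w2 * (w1 * x + b1) + b2

# ...are exactly equivalent to one linear layer with combined parameters.
def collapsed(x):
    return (w2 * w1) * x + (w2 * b1 + b2)
```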
The most common activation today is ReLU (Rectified Linear Unit; the name just means it "rectifies" by zeroing out negatives). The rule is simply output = max(0, input): if a signal is positive, pass it through; if it's negative, output zero. Simple, fast, effective. Transformers use a smoother variant called GELU (Gaussian Error Linear Unit), which tapers off near zero rather than cutting hard and behaves slightly better in practice for language tasks.
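Both rules fit in a few lines of plain Python. The GELU below uses the common tanh approximation; a library implementation may differ slightly:

```python
import math

def relu(x):
    # Positive signals pass through unchanged; negative signals become zero.
    return max(0.0, x)

def gelu(x):
    # GELU (tanh approximation): like ReLU, but with a smooth taper near zero.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))
```

For large positive inputs both behave like the identity, and for large negative inputs both output (near) zero; they differ only in the transition region around zero.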
PM Insight
The choice of activation function is an implementation detail. What matters is why they exist: activations are what make neural networks powerful enough to learn from unstructured data — images, text, audio — that simpler models can't handle. If a use case involves unstructured data, neural networks are likely the right class of model.
Specialised architectures for different data types
The basic "fully connected" network works well for tabular data. Different data types need different architectures:
| Architecture | Best for | Product examples |
|---|---|---|
| CNN (Convolutional Neural Network) | Images, spatial data | Photo moderation, medical imaging, document parsing, visual search |
| RNN / LSTM (Recurrent NN / Long Short-Term Memory) | Sequences where order matters | Time-series forecasting, early NLP, user session modelling |
| Transformer | Language, code, multimodal | LLMs (GPT, Claude), translation, summarisation, code generation |
| GNN (Graph Neural Network) | Relationship/network data | Social networks, fraud rings, molecular property prediction |
| Diffusion models | Generative tasks | Image generation, audio synthesis, video generation |
Why deep learning needs so much data and compute
A neural network with millions of weights needs millions of examples to learn meaningful patterns rather than memorising noise. With too little data, it overfits — the weights encode specific training examples rather than generalisable rules.
Compute is needed because training involves billions of arithmetic operations per training step, across millions of steps. This is why GPUs (and TPUs) matter — they're designed for the massively parallel matrix multiplications that make up most of neural network computation.
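The core operation is worth seeing once. In a matrix multiplication, every output cell is an independent dot product, which is exactly the kind of work a GPU can spread across thousands of cores. A plain-Python sketch, serial, just to show the structure:

```python
def matmul(A, B):
    # Each output cell C[i][j] is the dot product of row i of A and
    # column j of B -- independent of every other cell, which is why
    # the whole computation parallelises so well on GPUs.
    rows, inner, cols = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(inner))
             for j in range(cols)]
            for i in range(rows)]

matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])  # [[19, 22], [43, 50]]
```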
Rough data requirements by task type
- Tabular ML (gradient boosting) → thousands to tens of thousands of examples
- Fine-tuning a pretrained model → hundreds to thousands of examples
- Training a vision model from scratch → millions of labelled images
- Training a large language model → trillions of tokens of text
PM Insight
"Let's just train a model on our data" is often impractical. For most product teams, the right starting point is a pretrained model (foundation model) that you adapt — through prompting, fine-tuning, or RAG — rather than training from scratch. The data and compute requirements for training from scratch are beyond what most companies can justify outside of core AI infrastructure.
Transfer learning: standing on giants' shoulders
Training from scratch is expensive. Transfer learning is the practical alternative: start with a model pretrained on a large general dataset, then adapt it to your specific task with far less data and compute.
This is why foundation models changed everything. A model pretrained on the entire internet has learned rich representations of language, facts, and reasoning that you can leverage for your product without reproducing that training. The cost you pay is adaptation, not education.
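The idea can be miniaturised: keep a "pretrained" feature extractor frozen and fit only a small head on the new task's data. Everything below — the stand-in extractor, the data, the learning rate — is illustrative:

```python
# A frozen "pretrained" extractor: in real transfer learning this would be
# a large model's learned layers; here it is a tiny stand-in.
def pretrained_extract(x):
    return (x, x * x)

# New-task data, generated from y = 2x + 0.5x^2 (learnable by the head).
data = [(1.0, 2.5), (2.0, 6.0), (3.0, 10.5)]
head = [0.0, 0.0]  # the only trainable weights
lr = 0.02

for _ in range(5000):
    for x, y in data:
        f1, f2 = pretrained_extract(x)  # frozen features: never updated
        pred = head[0] * f1 + head[1] * f2
        err = pred - y
        head[0] -= lr * err * f1        # only the small head learns
        head[1] -= lr * err * f2
```

Only two numbers get trained; the "expensive" part of the model is reused as-is. That asymmetry is the whole economic argument for adapting pretrained models.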
PM Playbook — Questions to ask
- Are we training from scratch or adapting a pretrained model? — almost always the latter; if the former, understand why
- What architecture is being used and why? — transformer for text, CNN for images — make sure it fits the data type
- How much labelled data do we have? — drives whether fine-tuning is viable or you need to rely on prompting
- What's the compute cost of training vs inference? — both matter for the build/buy decision
- What's the inference latency, and does that fit our user experience? — deep models can be slow; real-time features have hard constraints
- How do we handle the model being wrong? — neural networks fail confidently; plan for graceful degradation