Chapter 11
NLP & Large Language Models
How language models actually work, and the real differences between prompting, RAG, and fine-tuning that every PM building with AI needs to understand.
Text is not natural for computers
Computers work with numbers. Text is messy, ambiguous, context-dependent, and infinite in its combinations. Getting machines to understand language has been one of the hardest problems in AI, and it has largely been cracked in the last decade: first by deep learning approaches to NLP and then, decisively, by transformers and large language models.
Understanding how that works makes you a sharper evaluator of LLM capabilities, a better judge of where they'll fail, and a more informed decision-maker about how to build with them.
Step 1: Tokenization – text as numbers
Before a language model can process text, the text must be converted into numbers. This happens through tokenization: splitting text into chunks called tokens, then mapping each token to an ID from a fixed vocabulary.
Tokens are not words. They're subword units: common words stay whole, while rare or long words get split into pieces. "Tokenization" might become "Token" + "ization". Numbers, punctuation, and whitespace each get their own tokens.
Why it matters for PMs: LLM APIs charge per token and enforce context limits in tokens, so token counts drive both cost and capacity. A 1,000-word document is roughly 1,300–1,500 tokens. Code and non-English text often tokenize less efficiently: more tokens per character means more cost and more context consumed.
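The splitting step can be sketched as greedy longest-match over a learned subword vocabulary. Here's a minimal illustration in Python; the tiny vocabulary is invented for the example (real tokenizers learn tens of thousands of subword units from data):

```python
# Greedy longest-match subword tokenization, BPE-style at inference time.
# VOCAB is invented for illustration only.
VOCAB = {"token", "ization", "un", "break", "able", "the", "cat"}

def tokenize(word, vocab):
    """Match the longest known subword at each position; fall back to
    single characters so any input can be tokenized."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest piece first
            piece = word[i:j]
            if piece in vocab or j == i + 1:  # single chars always allowed
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("tokenization", VOCAB))  # ['token', 'ization']
print(tokenize("unbreakable", VOCAB))   # ['un', 'break', 'able']
print(tokenize("zebra", VOCAB))         # falls back to single characters
```

Note how an unknown word like "zebra" still tokenizes, just into more (character-level) tokens; that's the same effect that makes rare words and non-English text more expensive.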
Step 2: Embeddings – words as locations in space
Each token ID gets mapped to an embedding: a list of hundreds or thousands of numbers (a vector) that represents its meaning. Tokens with similar meanings end up close to each other in this high-dimensional space.
The classic example: in a well-trained embedding space, king − man + woman ≈ queen. The model has learned that gender is a direction in embedding space, and royalty is another, independently.
Embeddings are powerful beyond LLMs. They're used to:
- Find meaning-similar content: search that returns results related in meaning, not just keyword matches (type "cheap flights" and get results about "affordable airfare" too)
- Cluster text by topic, without predefined categories
- Power recommendations: embed users and items in the same space; nearby items are good recommendations
- Enable RAG: store documents as embeddings and retrieve the most relevant ones at query time (more on this below)
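"Close in embedding space" is usually measured with cosine similarity. A minimal sketch with invented 3-dimensional vectors (real embedding models output hundreds or thousands of dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented embeddings, chosen so the two travel phrases sit near each other.
docs = {
    "cheap flights":      [0.90, 0.10, 0.00],
    "affordable airfare": [0.85, 0.15, 0.05],
    "pasta recipes":      [0.00, 0.20, 0.95],
}
query = [0.88, 0.12, 0.02]  # pretend embedding of the user's search

ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)  # the travel documents rank first, pasta recipes last
```

This is the whole mechanism behind semantic search: no keyword matching, just distance in the learned space.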
Step 3: Attention – what transformers actually do
Before transformers, language models processed text sequentially, one word at a time. Long documents were hard because early words were "forgotten" by the time the model reached the end.

Attention, the key innovation in the transformer architecture, lets the model look at all tokens in the input simultaneously and learn which ones are relevant to each other for the task at hand.

When processing the word "it" in "The trophy didn't fit in the suitcase because it was too big," attention allows the model to figure out that "it" refers to "trophy" (because the trophy is the thing that might not fit) by learning to attend to the right context.
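The mechanism itself is compact. A sketch of scaled dot-product attention for a single query token, with toy 2-dimensional vectors standing in for "trophy", "suitcase", and "because" (the numbers are invented so the example behaves the way the paragraph describes):

```python
import math

def attention(query, keys, values):
    """Single-query scaled dot-product attention: score each key against
    the query, softmax the scores into weights, return the
    weight-averaged values."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    peak = max(scores)                      # subtract max for stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

keys = [[1.0, 0.1], [0.0, 1.0], [0.2, 0.3]]  # trophy, suitcase, because
q_it = [1.0, 0.0]                             # query vector for "it"
weights, _ = attention(q_it, keys, keys)
print(weights)  # the largest weight lands on the first ("trophy") position
```

The weights always sum to 1, so attention is literally a learned weighted average: the model decides, per token, which other tokens to blend in.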
Why "large" in large language models
Scale turns out to matter enormously. More parameters, more data, and more compute produce models that don't just get incrementally better: they develop qualitatively new capabilities (reasoning, in-context learning, instruction following) that smaller models don't have. This scaling behaviour is why the field moved so fast after GPT-3, and why model size is still a key differentiator.
How LLMs are trained
Modern LLMs go through (at least) two training phases:
1. Pretraining
Train on a massive corpus of text (the internet, books, code) to predict the next token. This teaches the model language, facts, reasoning patterns, and world knowledge. Extremely expensive: only a handful of organisations can afford to do this.
2. Fine-tuning with human feedback (RLHF)
RLHF stands for Reinforcement Learning from Human Feedback. Human raters compare model outputs and rate which are better. Those preferences train a reward model, which is then used to fine-tune the base model to produce responses humans prefer. This is what turns a raw next-token predictor into an assistant that follows instructions, is polite, and declines harmful requests. Variants like DPO (Direct Preference Optimisation) and RLAIF (RL from AI Feedback) are now common in practice, but the core idea (optimise for preferred outputs) is the same. Claude, ChatGPT, and Gemini all use some form of this.
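The "optimise for preferred outputs" idea rests on a simple pairwise loss (Bradley-Terry style) used when training reward models: the loss is small when the reward model scores the human-preferred response above the rejected one. A minimal sketch:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss: -log(sigmoid(margin)). Small when the
    reward model agrees with the human rater, large when it disagrees."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(preference_loss(2.0, 0.5))  # small: model scores the preferred response higher
print(preference_loss(0.5, 2.0))  # large: model disagrees with the rater
```

Training pushes reward scores apart in the direction humans preferred; the reward model then steers the fine-tuning of the base model.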
Prompting vs RAG vs fine-tuning
This is one of the most important decisions you'll make when building with LLMs, and it's often made without enough clarity. Here's how to think about it:
| Approach | How it works | Best for | Limitations |
|---|---|---|---|
| Prompting | Write instructions in the system/user prompt. No model changes. | Formatting, tone, task framing. Fast iteration. Zero data needed. | Doesn't add knowledge. Unreliable for complex, consistent behaviour. Prompt sits in the context window. |
| RAG (Retrieval-Augmented Generation) | At query time, retrieve relevant documents from a knowledge base and inject them into the prompt. | Domain-specific knowledge, up-to-date information, large document corpora, traceability. | Retrieval quality matters: bad retrieval means bad answers. Adds latency. Requires embedding infrastructure. |
| Fine-tuning | Train the model further on your own examples, updating its weights. | Consistent style/format, domain-specific reasoning, reducing prompt length, specialised tasks. | Requires labelled examples (hundreds to thousands). Slower iteration. Doesn't add factual knowledge reliably. Can degrade general capabilities. |
PM Insight
Start with prompting. If you can't get reliable enough behaviour through prompting, try RAG if the problem is about knowledge (what the model knows). Try fine-tuning if the problem is about behaviour (how the model responds). Do both only when you have clear evidence neither alone is enough.
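To make the RAG column concrete, here's a minimal sketch of the retrieve-then-inject loop. The retriever is naive word overlap, a deliberate simplification; real systems embed documents and queries and search a vector store. The knowledge base and prompt template are invented for illustration:

```python
# Invented mini knowledge base.
KNOWLEDGE_BASE = [
    "Refunds are processed within 5 business days.",
    "Enterprise plans include SSO and audit logs.",
    "The mobile app supports offline mode on iOS and Android.",
]

def relevance(question, doc):
    """Stand-in retrieval score: count of shared words. Real systems use
    embedding similarity instead."""
    return len(set(question.lower().split()) & set(doc.lower().split()))

def build_rag_prompt(question, top_k=1):
    """Retrieve the most relevant documents and inject them into the prompt."""
    ranked = sorted(KNOWLEDGE_BASE,
                    key=lambda d: relevance(question, d), reverse=True)
    context = "\n".join(ranked[:top_k])
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

print(build_rag_prompt("How long do refunds take?"))
```

Notice that answer quality is capped by retrieval quality: if `relevance` ranks the wrong document first, the model never sees the right facts. That's the "bad retrieval = bad answers" limitation from the table.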
Context window: the model's working memory
The context window is how much text a model can "see" at once: its working memory. Everything outside the context window doesn't exist to the model.
Context windows have grown dramatically: GPT-3 launched with a 2K-token context; modern models support 128K to 1M tokens. But larger isn't always better:
- Cost scales with context length: filling a 128K window costs 32× more than a 4K window
- "Lost in the middle": models attend better to the beginning and end of long contexts; information buried in the middle is often missed
- Latency increases: longer contexts mean slower responses
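The cost point is worth running as arithmetic, because it compounds with request volume. A back-of-envelope sketch; the per-token price below is an assumed placeholder, so check your provider's actual rate card:

```python
# Assumed, illustrative price only; real rates vary by provider and model.
PRICE_PER_1K_INPUT_TOKENS = 0.003  # dollars

def daily_input_cost(context_tokens, requests_per_day):
    """Daily input-token spend for requests that fill the given context."""
    return context_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS * requests_per_day

small = daily_input_cost(4_000, 100_000)    # filling a 4K window
large = daily_input_cost(128_000, 100_000)  # filling a 128K window
print(small, large, large / small)  # the ratio is the 32x from the bullet above
```

At the assumed price, the same traffic costs 32× more per day simply because each request carries more context; this is why "send everything" architectures get expensive fast.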
PM Insight
Context window size is a product constraint, not just a technical one. Design your LLM features around what fits: if a user's document exceeds the context window, you need chunking, RAG, or summarisation as part of your product architecture. "We'll just send the whole document" only works until it doesn't.
Hallucination: confident and wrong
LLMs generate the most statistically likely next token; they don't look things up. When they don't "know" something, they don't say "I don't know": they generate plausible-sounding text anyway. This is hallucination.

Hallucinations are not bugs that will be fixed; they're a fundamental property of how these models work. Mitigation strategies:
- RAG: ground responses in retrieved source documents
- Citation requirements: prompt the model to cite sources, which makes hallucinations detectable
- Confidence elicitation: ask the model to flag uncertainty
- Human review: for high-stakes outputs, keep a human in the loop
- Constrained outputs: if the task allows, limit outputs to structured formats (JSON, fixed schemas) that can be validated programmatically
- Tool use and search grounding: give the model tools to look up facts (web search, database queries, calculators) rather than answering from memory alone. Now standard in agentic products, and the most effective mitigation for factual hallucinations
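The constrained-outputs strategy can be sketched in a few lines: parse the model's response against a fixed schema and refuse to trust anything that doesn't conform. The schema and labels here are invented for illustration:

```python
import json

# Invented schema: a classification with a confidence score.
VALID_LABELS = {"billing", "technical", "account"}

def validate_output(raw):
    """Parse and validate a model response against a fixed schema.
    Returns the parsed dict, or None if the output can't be trusted."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    if data.get("category") not in VALID_LABELS:
        return None
    if not isinstance(data.get("confidence"), float):
        return None
    return data

print(validate_output('{"category": "billing", "confidence": 0.92}'))  # accepted
print(validate_output('Sure! The category is probably billing.'))      # None: rejected
```

Rejected outputs can trigger a retry or a fallback path, which turns "the model sometimes makes things up" into a handled error case rather than a silent one.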
Prompting patterns that actually work
Most teams start by writing a one-line instruction and wondering why results are inconsistent. Reliable LLM outputs require deliberate prompt design. Here are the patterns with the highest ROI:
Few-shot prompting
Show the model examples of the input/output pairs you want before asking it to do the task. The model infers the pattern from examples rather than relying purely on your instruction.
// Zero-shot (less reliable)
Prompt: Classify this support ticket as "billing", "technical", or "account".
Ticket: "I can't log in on my new phone."

// Few-shot (more reliable)
Prompt: Classify support tickets. Examples:
"My invoice is wrong" → billing
"The app crashes on startup" → technical
"I want to change my email address" → account
Now classify: "I can't log in on my new phone." → technical
Three to five examples is usually enough. Choose examples that cover edge cases, not just easy cases, because the model will interpolate toward your examples.
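In practice the few-shot prompt above shouldn't live as a hard-coded string; building it from data lets you version and test the examples like any other product asset. A minimal sketch:

```python
# Few-shot examples kept as data rather than embedded in a string,
# so they can be reviewed, versioned, and swapped per experiment.
EXAMPLES = [
    ("My invoice is wrong", "billing"),
    ("The app crashes on startup", "technical"),
    ("I want to change my email address", "account"),
]

def few_shot_prompt(ticket):
    """Assemble the few-shot classification prompt for one ticket."""
    lines = ["Classify support tickets. Examples:"]
    lines += [f'"{text}" -> {label}' for text, label in EXAMPLES]
    lines.append(f'Now classify: "{ticket}" ->')
    return "\n".join(lines)

print(few_shot_prompt("I can't log in on my new phone."))
```

Swapping one example in `EXAMPLES` is now a reviewable diff, and the same list can feed your eval set.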
Chain-of-thought prompting
Ask the model to reason step-by-step before giving its final answer. This dramatically improves accuracy on tasks that require multi-step reasoning, math, or judgment calls. Adding "think through this step by step" to a prompt is one of the highest-ROI prompt changes you can make.
// Without chain-of-thought
Prompt: A user has been inactive for 45 days and their subscription renews in 7 days. Should we send a win-back email?
Response: Yes.
// With chain-of-thought
Prompt: ... Think through this step by step before answering.
Response: 45 days inactive exceeds our churn risk threshold of 30 days. Renewal is in 7 days, which is within our win-back window. However, if the user already decided to cancel, a renewal email might accelerate cancellation rather than prevent it. Given the timing, a value-reminder email (not a renewal reminder) is the better choice. Recommendation: send a soft re-engagement email highlighting features they haven't used.
Role and persona specification
Telling the model what role it's playing shifts the style and emphasis of its output significantly. "You are a terse senior engineer reviewing a pull request" produces different feedback than "You are a helpful writing assistant."
Use role specification when output style or domain framing matters. Avoid vague roles ("be an expert") in favour of specific behavioural ones ("respond only with the classification label, no explanation").
Output format control
LLMs are more reliable when you specify the exact output format. For structured data, ask for JSON. For consistent classification, enumerate the valid labels. For summaries, specify the target length and structure.
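These format controls compose naturally into a single prompt template. A sketch with invented wording that pins down all three at once: an enumerated label set, a length budget, and a strict JSON shape the caller can validate:

```python
def format_controlled_prompt(ticket, labels, max_words=50):
    """Build a prompt that specifies the exact output format: valid
    labels, a summary length budget, and a JSON-only response shape.
    The template wording is illustrative, not a recommended standard."""
    return (
        f"Summarise the ticket in at most {max_words} words, then classify it.\n"
        f"Valid labels: {', '.join(labels)}.\n"
        'Respond with JSON only: {"summary": "...", "label": "..."}\n\n'
        f"Ticket: {ticket}"
    )

print(format_controlled_prompt("App crashes on startup",
                               ["billing", "technical", "account"]))
```

Because the output shape is fixed, the response can be parsed and validated programmatically, which pairs directly with the constrained-outputs hallucination mitigation above.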
PM Insight
Prompt engineering is product iteration: test, measure, iterate. Treat each prompt as a version that should be stored, tested against a fixed eval set, and changed deliberately. "We updated the prompt" is a product change that can regress quality as easily as a code change can.
How to evaluate LLM output quality
Evaluating generative AI is harder than evaluating classification models. There's no single right answer to "write a good product summary." The main approaches:
Human evaluation
The gold standard. Raters assess outputs on dimensions like accuracy, helpfulness, tone, and completeness. Reliable but slow and expensive. Use it to calibrate other evaluation approaches, and for high-stakes output categories.
Design rubrics carefully. "Is this response good?" produces inconsistent ratings. "Does this response answer the specific question asked, without adding incorrect information? (yes/no)" produces consistent ones.
LLM-as-judge
Use a capable LLM to evaluate the outputs of another LLM (or the same one). You give the judge model a rubric and ask it to score or compare outputs. Scales much better than human eval and correlates reasonably well with human judgment on many tasks.
LLM-as-judge: what to watch out for
- Position bias: judges tend to prefer whichever response appears first in a comparison. Randomise order.
- Verbosity bias: judges often rate longer responses higher, regardless of quality.
- Self-preference: a model tends to rate its own outputs higher than a different model's. Use a different model as judge, or use pairwise comparisons with randomised order.
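Order randomisation is cheap to implement. A sketch of a pairwise harness where `judge_fn` is a hypothetical stand-in for the LLM judge call; the deliberately pathological judge below always prefers whatever it sees first, and randomisation washes that bias out:

```python
import random

def judge_pair(response_a, response_b, judge_fn, rng=random):
    """Pairwise comparison with randomised presentation order.
    judge_fn stands in for the LLM judge call: it receives the two
    responses as shown and returns the index (0 or 1) it prefers."""
    flipped = rng.random() < 0.5
    shown = [response_b, response_a] if flipped else [response_a, response_b]
    winner = shown[judge_fn(shown[0], shown[1])]
    return "a" if winner == response_a else "b"

# A pathological judge with pure position bias: always prefers the first.
position_biased_judge = lambda first, second: 0

rng = random.Random(0)  # seeded for reproducibility
wins_for_a = sum(judge_pair("resp A", "resp B", position_biased_judge, rng) == "a"
                 for _ in range(1000))
print(wins_for_a)  # close to 500: the bias no longer favours either response
```

With randomised order, a purely position-biased judge degrades to a coin flip instead of systematically favouring one side, so any consistent win rate you observe reflects the responses, not the layout.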
Automated reference-based metrics
Compare model outputs to a reference (human-written) answer using word overlap or embedding similarity. These scale easily and require no human time per query, but they all require a reference answer to compare against, which limits when they're useful.
BLEU – precision-focused, designed for translation
BLEU (Bilingual Evaluation Understudy) measures how many n-grams (short word sequences) in the model's output also appear in the reference. It's a precision metric: "of everything the model said, how much of it was in the reference?"
Example: Reference: "The cat sat on the mat." The output "The cat sat." scores high BLEU (every word is in the reference); the output "A feline rested on the rug." scores low BLEU (no n-gram overlap, even though the meaning is the same).

When to use: Machine translation, where many near-synonymous references are available. Avoid for: open-ended generation or summarisation, where it penalises valid paraphrases heavily.
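The core of BLEU is clipped n-gram precision, which fits in a few lines. This sketch is a simplification: full BLEU also combines several n-gram orders geometrically and applies a brevity penalty for short outputs:

```python
from collections import Counter

def ngram_precision(output, reference, n=1):
    """Clipped n-gram precision: of the n-grams the model produced, what
    fraction also appear in the reference? Counts are clipped so
    repeating a matching word can't inflate the score."""
    def grams(text):
        words = text.lower().split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    out, ref = grams(output), grams(reference)
    overlap = sum(min(count, ref[g]) for g, count in out.items())
    return overlap / max(sum(out.values()), 1)

reference = "the cat sat on the mat"
print(ngram_precision("the cat sat", reference))                 # 1.0
print(ngram_precision("a feline rested on the rug", reference))  # 2/6: only "on", "the" match
```

The second score shows the paraphrase problem directly: a semantically correct output scores 0.33 because only two of its six words appear in the reference.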
ROUGE – recall-focused, designed for summarisation
ROUGE (Recall-Oriented Understudy for Gisting Evaluation) flips the question: "of everything in the reference, how much did the model include?" It's a recall metric. The most common variants:
- ROUGE-1: unigram (single-word) overlap between output and reference
- ROUGE-2: bigram (two-word sequence) overlap; stricter
- ROUGE-L: longest common subsequence; rewards output that preserves the order of key phrases from the reference
When to use: Summarisation evaluation; ROUGE-L is the standard benchmark metric for tasks like news summarisation. Avoid for: tasks where word choice varies widely.
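ROUGE-L's longest-common-subsequence idea is easy to see in code. A sketch of the recall side (ROUGE-L as reported usually also combines a precision term into an F-score):

```python
def rouge_l_recall(output, reference):
    """ROUGE-L recall: longest common subsequence of words between output
    and reference, divided by reference length. Preserving the
    reference's word order scores higher than merely reusing its words."""
    out, ref = output.lower().split(), reference.lower().split()
    # Standard dynamic-programming LCS table.
    dp = [[0] * (len(ref) + 1) for _ in range(len(out) + 1)]
    for i, out_word in enumerate(out):
        for j, ref_word in enumerate(ref):
            if out_word == ref_word:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1] / len(ref)

reference = "the cat sat on the mat"
print(rouge_l_recall("the cat sat", reference))  # 3/6 = 0.5
print(rouge_l_recall("sat cat the", reference))  # lower: same words, order broken
```

The second call is the instructive one: it uses exactly the same words as the first but scrambles them, so unigram overlap is unchanged while ROUGE-L drops.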
BERTScore – semantic similarity via embeddings
BERTScore doesn't count word matches. Instead, it embeds both the model output and the reference using a pre-trained language model (BERT), then measures the cosine similarity between their token embeddings. This means a paraphrase like "a feline rested on the rug" can score well against "the cat sat on the mat" because the embeddings are semantically close, even with no shared n-grams.
BERTScore correlates significantly better with human judgments than BLEU or ROUGE on most tasks. The tradeoff: it's more expensive to compute and harder to interpret (a score of 0.87 is less intuitive than "80% word overlap").
When to use: When semantic accuracy matters more than exact phrasing β summarisation, paraphrase detection, response quality where valid outputs vary in wording.
| Metric | What it measures | Handles paraphrases? | Typical use case |
|---|---|---|---|
| BLEU | N-gram precision (output → reference) | No; penalises them | Machine translation benchmarks |
| ROUGE | N-gram recall (reference → output) | No; penalises them | Summarisation benchmarks |
| BERTScore | Semantic similarity via embeddings | Yes | Open-ended generation, summarisation, response quality |
PM Insight
All three metrics require a reference answer: they tell you how close the model got to a human-written gold standard. That's the core limitation: for truly open-ended tasks (customer support replies, creative writing, product copy), no single reference is "correct." Use these metrics when your task has well-defined expected outputs; fall back to LLM-as-judge or human eval when it doesn't. If your data team reports a BLEU or ROUGE score, always ask: how many human reference answers were used, and how much do valid outputs vary?
Behavioral / task-completion evaluation
For agentic or tool-use tasks, evaluate whether the model completed the task correctly end-to-end. Did the SQL query return the right results? Did the agent book the correct flight? Binary pass/fail on concrete outcomes is more reliable than rubric scores for well-defined tasks.
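A pass/fail harness for this kind of eval is only a few lines. The eval cases and the stand-in model below are invented for illustration; in a real system `model_fn` would call your LLM and the expected outputs would come from your eval set:

```python
# Invented eval cases with exact expected outputs.
EVAL_SET = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]

def run_evals(model_fn):
    """Run every case through the model and return the binary pass rate."""
    passed = sum(model_fn(case["input"]).strip() == case["expected"]
                 for case in EVAL_SET)
    return passed / len(EVAL_SET)

# Stand-in model that gets one of the two cases right:
fake_model = lambda prompt: {"2 + 2": "4", "capital of France": "Lyon"}.get(prompt, "")
print(run_evals(fake_model))  # 0.5
```

Run the same harness before and after every prompt or model change; a drop in the pass rate is a regression you can catch before shipping.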
PM Insight
Every LLM product needs an eval set before it ships: a fixed collection of inputs with expected outputs or quality standards. Without it, you have no way to know whether a prompt change, model upgrade, or new feature made things better or worse. Build the eval set during development, not after a regression.
PM Playbook β Questions to ask
- Prompting, RAG, or fine-tuning? Start with prompting; escalate only with evidence
- Have we tried few-shot examples? Usually the highest-ROI prompt change before any other optimisation
- Do we have an eval set? If not, you can't safely change anything (prompt, model, parameters)
- How are we evaluating output quality? Human eval, LLM-as-judge, or task completion; each has different cost/reliability tradeoffs
- What's the context window size, and how does our use case fit within it?
- What's the cost per 1K tokens at our projected usage? Run the numbers before committing
- How are we handling hallucinations for this use case? Never "we'll rely on the model being accurate"
- What's the p95 latency? LLM inference can be slow, and the user-experience implications are real
- What model are we using, and have we evaluated alternatives? Cost/quality/latency tradeoffs differ significantly across providers