My Tiny Transformer Hallucinated

Like many people, I became fascinated with machine learning primarily because of the astonishing things large generative models started to accomplish around 2023. Having spent almost two decades in software development with much of that time spent trying to get people to break problems down into clear logical processes with precise definitions, the ‘fuzzy’ behavior of these new generative models felt confusing, kinda spooky, and completely disconnected from ground truths I was used to in computing. I could work with the models and even got comfortable with prompt engineering. But I still felt more like I was playing with some strange magic box rather than a computer, and I found myself lacking any real intuition for how it all actually worked.

That bothered me more than was probably reasonable, and was my main reason for returning to graduate school. I wanted to achieve a real understanding of the machinery inside that strange box.

Deep Learning, which explicitly covered larger neural networks, was a class I had been looking forward to since I first put in my grad school application. The class was exactly what I was looking for – a moderately rigorous class (in the mathematical sense) which gave me a genuine understanding of various model architectures and why they behave as they do, and helped me connect how models are built to why they work the way they do. It gave me hands-on experience building and evaluating model architectures with PyTorch alongside reading papers and working through the math.

There were a lot of really good moments, but the one that stuck with me happened in assignment 3, in which we implemented five different model types, including a small transformer (think: a very small version of the model type used in LLMs), to do German-to-English translation.

To explain the results I saw (taken from my final report):

Qualitatively my best Seq2Seq model is still frankly pretty terrible – it repeats words and seems to have very little conception of rarer word combinations (for ‘orange hat’ it just repeated the word – ‘hat hat’). It does seem to be able to structure sentences somewhat (especially when compared to the encoder-only model) but the word repetition generally breaks up the feel of the sentence being sensible. My Seq2Seq model, even with attention, still seems to be unaware of what it has said previously (‘mother’, ‘mother’, ‘and’, ‘her’, ‘mother’, ‘son’). While the LSTM does theoretically carry context forward, the compressed context seems to be a meaningful limitation.

My transformer model, on the other hand, creates sensible sentences and plausible scenarios, and even some of its substitutions feel a lot more ‘natural’ (e.g. ‘yellow dog’ for ‘terrier’ – ‘performing a board’ for ‘breaking a stick’). The transformer’s ability to ingest context and its bias for fluency in even this tiny model is almost uncanny. These generated answers could be regarded as very small hallucinations – but they’re hallucinations informed by context. The errors in even this tiny model are coherent.

Behind that observation was a startling revelation – my model, my very tiny model which I had mostly coded and entirely trained by hand, running in a completely deterministic fashion, was producing the phenomenon of hallucination – introducing content (‘yellow’) which wasn’t in the original input.

I’ll admit it: I ended up spending a bit of time googling whether ‘terrier’ had some specific association with ‘yellow’ in German. Then I realized I was being silly and started trying to figure out how this had happened. That’s when the learning really started.

‘Terrier’ was a relatively rare word in my training data. There were only 7 instances of it in the set, compared to over 2000 for ‘dog’. So when the model was trying to predict the most likely word given the full context of the sentence, it went with the most common word that fit that context, even though there was no case in my data set in which ‘terrier’ ever directly translated to ‘dog’. This is how powerful self-attention can be. And because ‘terrier’ and ‘dog’ sit relatively close together as word vectors, it was a fairly easy switch for the model to make. That is why my model picked ‘dog’ over the more literal ‘terrier’, despite it being wrong.
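The idea of two words sitting "close together" as vectors can be illustrated with cosine similarity. This is a minimal sketch only: the three-dimensional vectors below are invented for illustration and are not taken from the trained model, whose learned embeddings would have many more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings, made up for this example (not real model weights).
toy_embeddings = {
    "dog": [0.9, 0.8, 0.1],
    "terrier": [0.8, 0.9, 0.2],
    "stick": [0.1, 0.2, 0.9],
}

sim_dog_terrier = cosine_similarity(toy_embeddings["dog"], toy_embeddings["terrier"])
sim_dog_stick = cosine_similarity(toy_embeddings["dog"], toy_embeddings["stick"])

print(f"dog vs terrier: {sim_dog_terrier:.3f}")
print(f"dog vs stick:   {sim_dog_stick:.3f}")
```

With vectors like these, 'dog' scores much closer to 'terrier' than to 'stick', which is the kind of geometry that would make swapping one for the other a small step for a model.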

More puzzling to me was that the dog was specifically yellow. As I looked through my training data I realized that the pattern <color> <noun>, and specifically <color> “dog”, was surprisingly common. It was far from universal, but of the ~2000 dog references in my training data, well over half had some sort of color associated with them. The literal string “yellow dog” appeared only 25 times, but other variants were much more common – “brown dog” appeared 318 times, “white dog” 291 times, and “black dog” 272 times. The word ‘terrier’ was only associated with a color once (it was usually associated with a place name, like “Boston Terrier”) – but ‘dog’ was most often preceded by a color. So while I couldn’t quite figure out why the model would conclude that the dog was yellow specifically, I could very clearly see how, when trying to find the most common structure for the language, the model tried to put some sort of color there.
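A count like this takes only a few lines to reproduce. The sketch below is an assumption about how one might do it, not the analysis actually run for the assignment: the color list, the whitespace tokenization, and the five-sentence corpus are all made up for illustration.

```python
from collections import Counter

# Assumed color vocabulary; a real analysis would likely need a longer list.
COLORS = {"black", "white", "brown", "yellow"}


def color_before_noun(sentences, noun):
    """Count which color word (if any) directly precedes each use of `noun`.

    Returns (per-color counts, total occurrences of the noun).
    Tokenization is a plain lowercase whitespace split, so punctuation
    attached to a word would need extra handling on real data.
    """
    counts = Counter()
    total = 0
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, tok in enumerate(tokens):
            if tok != noun:
                continue
            total += 1
            if i > 0 and tokens[i - 1] in COLORS:
                counts[tokens[i - 1]] += 1
    return counts, total


# Toy corpus invented for this example.
toy_corpus = [
    "a brown dog runs on the beach",
    "the white dog jumps over a log",
    "a black dog and a brown dog play",
    "a dog sleeps",
    "a boston terrier chews a stick",
]

dog_colors, dog_total = color_before_noun(toy_corpus, "dog")
terrier_colors, terrier_total = color_before_noun(toy_corpus, "terrier")

print("dog:", dict(dog_colors), "out of", dog_total)
print("terrier:", dict(terrier_colors), "out of", terrier_total)
```

Running the same function over the English side of a real training set would show how often a noun arrives with a color attached.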

What surprised me most was that this hallucination did not require randomness. There was no temperature setting nudging the model toward being inventive. The model was simply following a path made most likely in its architecture. In transformers, hallucinations emerge from the same machinery that makes the model useful. The model is trained to produce contextually plausible next tokens, with attention helping to bind those continuations into coherent context (and structure). So when the model doesn’t have a clear piece of information to place into that structure (generally because the available examples are rare or ambiguous) that pull towards coherence can cause the model to favor fluent inventions rather than obvious failures.
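The point about randomness can be sketched with greedy decoding, under the assumption that the model simply took the highest-scoring token at each step (the post says the run was deterministic but does not spell out the decoding method). The vocabulary and scores below are invented; the sketch only shows that an argmax pick involves no sampling and that rescaling by temperature never changes which token ranks first.

```python
import math


def softmax(logits, temperature=1.0):
    """Turn raw scores into probabilities, optionally scaled by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(vocab, logits):
    """Deterministically choose the token with the highest score."""
    best = max(range(len(logits)), key=lambda i: logits[i])
    return vocab[best]


# Hypothetical next-token scores, made up for this example.
vocab = ["yellow", "brown", "small", "terrier"]
logits = [2.1, 1.9, 1.2, -0.5]

choice = greedy_pick(vocab, logits)
print("greedy choice:", choice)

# Temperature reshapes the distribution but leaves the top token unchanged.
for temperature in (0.5, 1.0, 2.0):
    probs = softmax(logits, temperature)
    top = vocab[probs.index(max(probs))]
    print(f"T={temperature}: top token is {top}")
```

Temperature only matters once tokens are sampled from the distribution. With a pure argmax, the same input always produces the same output, fluent inventions included.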

This is also why no amount of creativity-removal will ever completely eliminate hallucinations – to the model, this invention was the most structurally correct answer it had available, just as using “a” or “an” appropriately in a sentence is structurally correct and required even if those articles don’t exist in the language being translated from.

This insight would eventually lead me into a whole world of study on accuracy and reliability in LLMs. But for now, it was enough to make the strange magic box feel a little less like magic, while also revealing just how fundamental the problem of LLM reliability really is.

Image: Screenshot from my notebook, taken at a fairly random point while training my transformer model.

