GenAI Image of Irises with a distorted clock on a lamppost
AI & LLMs,  Cognitive Science

Dreams, Models, and Mistakes

Despite the fact that most of my actual research interest in AI is focused on LLMs, as an end user I use image models at least as often. I have an insatiable need for images for my TTRPG worldbuilding hobby, and tools like Midjourney and Ideogram have been more than happy to take my money and enable me.

It was through that frequent use of image models that I first started thinking there might be some relationship between generative AI and certain cognitive processes. I started to get a feel for the types of mistakes different models made, and I noticed that several common types of mistakes lined up with visual distortions I recognized from my own dreams. In particular, there were parallels between the way clocks and text tend to distort in my dreams and the distortions that often appear in images generated by Midjourney and similar diffusion-based systems.

Images showing distorted clocks

I also thought a bit about just how differently I approach creating images when I’m conscious vs. when I’m dreaming. In my waking mind, I start with structure. If I am attempting to draw a clock, first I will draw a circle, then place numbers, etc. I’m far more focused on named shapes and functionality. My dreams, by contrast, seem to start with a sort of abstract concept of ‘clock’ and then meander, often only hinting at visual details. Text I am not looking at directly often melts in my dreams in a way that reminds me quite a bit of the garbled letters in diffusion-generated text. I noticed this sort of melting/fading effect long before diffusion models were a thing, and seeing it echoed visually was kind of eerie. I started to really wonder how well that association would hold up if I looked into it.

The more I thought about it, the more I realized that this kind of associative generation was not limited to dreams and images. I do a version of it when speaking too quickly or when tired. I can reach for a word and find the wrong one, misremember a fact, or start talking before my mind has fully caught up to what I am trying to say. Most of the time my waking mind catches and corrects these errors quickly enough that they do not become very noticeable. But the underlying tendency is still there, a sort of almost automatic speaking mechanism that is part of me.

Now, to be frank, as cool as I find image generation, I am not brave enough to wade too deep academically into those waters just yet. I am not completely new to it. I have populated algorithms in what could charitably be called “writing” a simple diffusion model, and I have trained that model on pre-provided data to pretend, poorly, to generate handwritten numbers. I have read the latent diffusion paper, thought it was cool, and would even say I somewhat understand the architecture. But that is just about the sum total of my understanding of image generation.

So for me, looking into this question through LLMs made more sense. I feel more academically comfortable with language models, and this was already after my hallucinating transformer model and a fair amount of other reading on LLM accuracy and reliability. I felt I already knew enough of the foundational research to give myself a good head start. Based on that, and my general experience, my guess was that I would find a similar pattern: just as the image models’ visual errors matched the errors in my dreams, a language model’s errors would match the errors in my casual speech.

To further support my internal argument that I could move this insight from images to text, I reasoned that there are also some broad structural similarities between modern diffusion models and LLMs. Both work by translating human-facing inputs into learned internal representations: image models work with latent representations of images, while LLMs work with token embeddings that represent words or word-pieces as vectors. The architectures are very different, but both operate in learned representational spaces rather than directly manipulating meaning in the way a person consciously would. Both are also trained, in broad terms, to learn patterns from existing “good” outputs and generate new outputs that are plausible relative to those learned expectations. I do not want to flatten the differences too much, but the overlap was enough to make the comparison feel meaningful.
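To make the “token embeddings” half of that comparison a bit more concrete, here is a toy sketch. Everything in it is made up for illustration: the tiny vocabulary, the 4-dimensional vectors, the whitespace tokenizer, and the random numbers standing in for learned values. Real LLMs use learned subword tokenizers and much larger learned embedding tables.

```python
import random

# Hypothetical toy vocabulary; real models learn subword vocabularies.
VOCAB = ["the", "clock", "melts", "in", "my", "dream", "<unk>"]
EMBED_DIM = 4  # illustrative only; real embeddings are far larger


def build_embedding_table(vocab, dim, seed=0):
    """Map each token to a vector. Random values stand in for learned ones."""
    rng = random.Random(seed)
    return {tok: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for tok in vocab}


def embed(text, table):
    """Split on whitespace and look up one vector per token."""
    tokens = text.lower().split()
    # Anything outside the vocabulary falls back to the <unk> vector.
    return [table.get(tok, table["<unk>"]) for tok in tokens]


table = build_embedding_table(VOCAB, EMBED_DIM)
vectors = embed("The clock melts", table)
print(len(vectors), "tokens, each a vector of length", len(vectors[0]))
```

The point is only that, from here on, the model is working with lists of numbers rather than with words. A latent image representation plays a loosely similar role on the diffusion side.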

This was nothing more than an intuition, but the alignment between the errors of the less conscious parts of my mind (dreams, careless speech) and the errors in generative models’ output seemed interesting enough to be worth looking into, and this is when I began to engage a bit more of an academic lens.

One of the hypotheses I generally find persuasive in cognitive science is mental modularity, at least in its more restrained original form as proposed by Jerry Fodor. This is the idea that parts of the mind may operate as somewhat specialized systems for tasks like vision and language. I am not especially interested here in the later, more elaborate evolutionary versions of that argument, where the mind gets carved into a large number of highly specific adaptive modules. But the broader idea has always seemed basically right to me: the mind is not one smooth, unified reasoning machine, but a collection of specialized sub-components, each with its own representations and mechanisms, that pass information on to other systems in the mind.

So my first thought was that there could be some alignment between an LLM and one of these specialized systems. The challenge with that hypothesis is that it very quickly became the kind of idea I enjoy thinking about but could do absolutely nothing with. It is interesting to say that LLMs might resemble some specialized cognitive process, but much harder to say which process, in what way, and how I would prove that the resemblance was more than an evocative metaphor. So while I still think there may be something there, it was not a very usable academic question. It was cute, but it did not want to be operationalized.

That pushed me toward a more useful question. If LLMs are, in some broad sense, semi-automatic associative word predictors, and if I also seem to have something like a semi-automatic associative word predictor running in my own head, then why do I usually make more sense than early language models? (…taking as an assumption that I do, in fact, make more sense… we are not going to test this hypothesis too harshly…)

So, taking my own sense-making as a given, what’s the mechanism that helps me audit my thoughts and keeps me from being a general pile of hallucinations?

That question led me to dual-process theory, which argues that human cognition often involves an interaction between two kinds of processing: one that is fast, intuitive, associative, and somewhat automatic, and another that is slower, more deliberate, and more controlled.

That framework looked much more promising because it shifted the comparison away from “which mental module is this like?” and toward “what happens when fast generation is not enough?” In other words, the interesting question was not just whether LLMs resemble some part of human cognition, but whether they resemble the kind of cognition that produces plausible outputs before those outputs have been fully checked, filtered, or corrected. It also potentially offered a mechanism for fixing that defect, which now could be very academically interesting.
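As a purely illustrative sketch of that “generate first, check second” idea, here is a toy version in code. It is not a claim about how brains or LLMs actually work, and it is not the method from my project. The association strengths, the fact table, and the function names (`fast_propose`, `slow_check`, `answer`) are all hypothetical.

```python
# Hypothetical association strengths: what "comes to mind" first for a cue.
ASSOCIATIONS = {
    "capital of australia": [("Sydney", 0.6), ("Canberra", 0.3), ("Melbourne", 0.1)],
}

# Hypothetical store of checked facts that the slow process can consult.
KNOWN_FACTS = {"capital of australia": "Canberra"}


def fast_propose(cue):
    """Fast process: return candidates, strongest association first, unchecked."""
    ranked = sorted(ASSOCIATIONS.get(cue, []), key=lambda pair: pair[1], reverse=True)
    return [word for word, _strength in ranked]


def slow_check(cue, candidate):
    """Slow process: accept a candidate only if it matches a checked fact."""
    return KNOWN_FACTS.get(cue) == candidate


def answer(cue, deliberate=True):
    """Blurt out the top association, or audit candidates before answering."""
    candidates = fast_propose(cue)
    if not candidates:
        return None
    if not deliberate:
        return candidates[0]  # plausible, but never checked
    for candidate in candidates:
        if slow_check(cue, candidate):
            return candidate
    return None  # nothing survived the check


print(answer("capital of australia", deliberate=False))
print(answer("capital of australia", deliberate=True))
```

With the made-up numbers above, the unchecked path returns the most strongly associated answer even though it is wrong, while the checked path walks down the list until a candidate passes the audit.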

And that, in short, was the genesis of the term project that I’ll be covering in the next few posts.

Close-up on the distorted clock from the image above

Image: This image was generated by taking the photos of irises from my garden (see last post) and having Ideogram generate a text description of them. Effectively, this finds the closest the model can get to describing the image, which lets me try to make images as similar as possible while using the model directly. I had originally planned to take this and then have Ideogram add in a clock to make my point. However, it looks like Ideogram has some sort of very specific ‘get clocks right’ method which plants this sort of ugly clock-face on things, so I couldn’t get it to work.

Instead, I ended up taking the description, modifying it slightly, feeding it into Midjourney to get something very similar, and using that. The exact prompt I ended up using was: “A garden bed of blooming bearded irises in various shades of purple and pink. The flowers are arranged in a diagonal line from the bottom left to the top right of the image. The irises have tall, upright green stems with broad leaves at their base. The flowers range from deep violet to light lavender, with some having white accents. In front of the irises is a golden clock on a pole like a lamp-post facing forward and easy to read. Behind the irises is a well-maintained green shrub border. The garden bed is edged with dark mulch and has a paved pathway visible on the left side of the frame. The grass is a vibrant green and is freshly cut. The lighting in the image is bright and natural, creating clear definition of the flower petals and plant textures. The composition shows multiple flower spikes at different heights, creating a layered effect in the garden display. --ar 3:2”