I just came across this Call for Papers - very relevant topic!

Call for Papers: Special Issue and Workshop “The Politics of AI: Citizen Perceptions, Preferences, and Priorities”

We are pleased to announce a call for papers for a special issue that aims to bring together cutting-edge political science research on the political ramifications of artificial intelligence. Given AI’s growing influence on democracy and daily political life, this special issue provides a timely opportunity to examine how it shapes citizens’ priorities and preferences from both theoretical and empirical perspectives. We welcome submissions employing diverse methodological approaches, including descriptive, qualitative, correlational, and causal analysis, as long as they provide rigorous empirical and novel theoretical insights into the multifaceted politics of AI.

We invite contributions on a variety of topics, including:

  • What are the main perceived political challenges ushered in by AI?
  • How do citizens want the development and deployment of AI technology to be regulated (who should regulate, how to address concerns about deepfakes, questions over liability, etc.)?
  • How do people expect policy to protect human flourishing and dignity in an AI-driven economy (priorities for social protection, attitudes towards algorithmic management and surveillance, etc.)?
  • How does AI intersect with other policy priorities, such as the green transition?
  • How do concerns about AI affect voting behaviour?
  • Other questions broadly aligned with the special issue’s focus on citizens’ perceptions, preferences, and priorities.

We are in talks with the editors of a leading general-interest political science journal that has been at the forefront of research on technological change, and they have expressed interest in hosting the special issue.


Katie O'Neill
PhD student 
University of Stirling

From: Stewart Robertson <stewart.robertson@stir.ac.uk>
Sent: Tuesday, March 25, 2025 10:18:24 AM
To: Elizabeth Barnes <elizabeth.barnes@stir.ac.uk>; phd-behsci Mailing List <phd-behsci@lists.stir.ac.uk>
Subject: [Phd-behsci] Re: How LLMs think differently
 
Thank you, Liz. I'll look at this later.
I've been thinking about the potential for innovative behavioural science research using LLMs/AI. Given that these models are trained on human knowledge and social media content, I do wonder if we might explore whether AI bots could exhibit behaviours and therefore biases similar to humans. Perhaps we could try and give them different behavioural traits and have them complete the vignette surveys, then compare their responses with human results? The possibilities seem endless and could be quite fascinating.
On a personal note, I'm in London this week on holiday, so unfortunately won't be attending tomorrow's lab session. I look forward to catching up!  🙂
S


From: Elizabeth Barnes
Sent: Tuesday, March 25, 2025 09:59
To: phd-behsci Mailing List
Subject: [Phd-behsci] FW: How LLMs think differently

Hi all – given the range of discussions we have been having about AI/LLMs – I think you will find the following article very interesting. It’s quite a long read but reveals some important differences between the models, and it introduces some I’ve not heard of.

 

Will we be taken over by bots I wonder??

 

Liz

 

From: Elina Halonen from Thinking About Behavior <thinkingaboutbehavior+artificial-thought@substack.com>
Sent: 24 March 2025 11:00
To: liz@yada.org.uk
Subject: How LLMs think differently

 


How LLMs think differently

What a historical currency question revealed about how AI models interpret ambiguity, frame problems, and shape our understanding of the past.

Elina Halonen

Mar 24

 

 

This is the first post in a new section of the Substack, Artificial Thought - you can opt out of these emails in your account settings.


AI is often framed as a tool for finding answers—but what happens when different models give entirely different answers to the same question? That’s exactly what happened when I asked six large language models a seemingly straightforward historical query: how much would a sum of money from 1770 North Carolina be worth today?

Instead of one answer, I got six wildly different responses—calculations, caveats, historical detours, ethical framing, or none at all. Some treated it like a maths problem, plugging in inflation multipliers. Others approached it like a history essay, grounding it in social context. Some introduced ethical considerations early; others ignored them completely.

This raised a bigger question: how do AI models approach problem-solving differently—and what does that mean for how we use them?

This wasn’t just a curiosity—it became a window into how different models frame problems, interpret ambiguity, and prioritise information. Not how accurate they are, but how they think.

Note: This analysis is based on a single experiment and is not a definitive ranking of AI models. The findings reflect how these models performed in this specific test, and results may vary in different contexts. For readability, LLMs are referred to as AI in this article.

How this experiment came about

This started with casual curiosity. I was watching a historical drama when a character mentioned a financial gift—some pounds, given to relatives, in 1770 North Carolina. The lawyer in the scene reacted as though the amounts were significant. I couldn’t tell if they were.

I tried Googling it, but quickly ran into a wall. Colonial currency wasn’t standardised—different colonies used different systems, and values weren’t directly comparable. Inflation calculators alone weren’t enough. Understanding the worth of the gift required historical and economic context: wages, land prices, and the political economy of enslaved labour.

That complexity made it a perfect test case for AI. Would a model treat it like a simple conversion? Would it bring in historical nuance? Would it consider the ethical backdrop of inherited wealth in the 18th-century American South?

So I decided to ask and to see where each model would take me.

How I ran the experiment

This didn’t start as a planned experiment. I asked ChatGPT and DeepSeek the same historical question out of curiosity—what would sums of money from 1770 North Carolina be worth today? I ran a short dialogue with each, asking a series of increasingly contextualised questions about historical value, social meaning, and economic framing.

While testing them, one response in particular caught me off guard. DeepSeek spontaneously introduced ethical framing after just the third question—highlighting how gifts of land or money in that context were deeply entangled with the plantation economy, class structure, and the exploitation of enslaved people. At that point, ChatGPT had said nothing on the subject, and most of the other models never brought it up at all. That contrast made me wonder: how would other models handle this?

So I expanded: I ran the same series of prompts with four more LLMs—Claude, Gemini, Mistral, and Qwen—keeping the questions consistent and allowing each interaction to unfold naturally, just as it had the first time.

  • Six LLMs: ChatGPT-4o, Claude 3.5, Gemini 2.0, DeepSeek, Mistral, and Qwen2.5-Max.
  • Same prompts, same order: Each model got the same seven-question sequence, covering both technical conversion and narrative context.
  • Anonymised comparison: I stripped model names and reviewed the responses blind.
  • LLM-assisted analysis: I used ChatGPT to code the outputs for style, framing, ethics, and structure, then validated the results using Qwen and Gemini.

You can see the full conversations on this Miro board.
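
For readers who want to picture the mechanics, the sketch below shows how a comparison like this could be scripted. It is only an illustration of the setup described above: ask_model is a hypothetical stand-in for whichever chat API each provider exposes, and the prompts are placeholders rather than the exact seven-question sequence.

    import json
    import random

    MODELS = ["ChatGPT-4o", "Claude 3.5", "Gemini 2.0", "DeepSeek", "Mistral", "Qwen2.5-Max"]

    # Placeholder prompts -- the real experiment used a fixed seven-question sequence.
    PROMPTS = [
        "How much would a sum of money from 1770 North Carolina be worth today?",
        "What could that sum actually buy at the time?",
        "What did a gift of that size signal socially?",
    ]

    def ask_model(model_name: str, history: list[dict]) -> str:
        # Hypothetical stand-in: in practice, call the relevant provider SDK here
        # with the full conversation history so each dialogue unfolds naturally.
        return f"[{model_name} reply to: {history[-1]['content']}]"

    def run_experiment() -> list[dict]:
        transcripts = []
        for model in MODELS:
            history = []
            for prompt in PROMPTS:  # same prompts, same order, for every model
                history.append({"role": "user", "content": prompt})
                history.append({"role": "assistant", "content": ask_model(model, history)})
            transcripts.append({"model": model, "history": history})
        return transcripts

    def anonymise(transcripts: list[dict]) -> list[dict]:
        # Strip model names and shuffle so the transcripts can be reviewed blind;
        # keep the index-to-model key somewhere separate for later unblinding.
        blind = [{"id": i, "history": t["history"]} for i, t in enumerate(transcripts)]
        random.shuffle(blind)
        return blind

    print(json.dumps(anonymise(run_experiment()), indent=2))

The essential points of the setup are all visible in the loop: identical prompts in identical order, the full dialogue history carried forward at each turn, and model names stripped before review.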

What I asked

The question evolved over the course of the conversation. Here's the full sequence of prompts I used with each model:

Comparing how models approached the question

Each model answered the same questions but in very different ways. Some treated it like a calculation, others like a case study. Some structured their reasoning from the outset; others unfolded it in loosely connected paragraphs. What these responses revealed wasn’t just a difference in accuracy—it was a difference in how the models framed the task in the first place.

Here’s a snapshot of how each model approached the question, and what that style meant for the user experience:

  • ChatGPT 4o delivered clear, structured answers, with a step-by-step breakdown and an eagerness to clarify. It was responsive and readable but leaned heavily on inflation calculations, sometimes flattening the historical nuance it had just explained.
  • Claude 3.5 approached the question like a historian with a soft spot for storytelling. It excelled at making social context accessible, especially around gender and inheritance, but rarely interrogated the economic or ethical dimensions in depth. The tone invited reflection but the analysis stayed on the surface.
  • DeepSeek responded like a research assistant with no word limit. It cross-checked multiple conversion methods, linked sums to labour systems and class structure, and treated money as a proxy for power. It was impressively comprehensive and occasionally exhausting—if you didn’t want depth, you got it anyway.
  • Qwen2.5-Max behaved like a financial analyst with no interest in context. It offered fast, precise calculations and moved on. Numbers were clean, justifications minimal, and ethical framing entirely absent. It wasn’t wrong, but it wasn’t asking why the question mattered.
  • Mistral struck a middle ground: well-organised, methodologically sound, and notably neutral in tone. It avoided strong claims or deep critique, delivering just enough information to be helpful but rarely more.
  • Gemini 2.0 wrote like someone who’d just attended a seminar on epistemic humility. It avoided hard numbers, preferring to talk about why the question was complicated. It offered thoughtful, qualitative framing but too often hesitated to say anything definitive.

What these responses reveal is not just variation in training or capabilities, but variation in priorities: what to answer first, how much context to add, whether to speak with certainty or caution. Those decisions shaped how the responses felt and turned one historical question into six very different conversations.

Summary of response content and approach

If you’re interested in a deeper breakdown of how each model handled these issues, there are detailed comparison tables at the end of this article. In the rest of the article the LLMs will be referred to without their model numbers.

How AI models communicate and respond to users

Comparing AI models isn’t just about what they say—it’s about how they say it. Some reply in bullet points, others in essays. Some aim to solve the problem, and others focus on explaining why the problem is complicated. These differences aren’t just stylistic flourishes: they shape how you interpret the answer, how likely you are to ask a follow-up, and whether you feel like you’re in a conversation or reading from a prepared script.

Presentation style: how do they structure information?

Some models get to the point, others take the scenic route. The format they choose doesn’t just affect readability—it shapes how you make sense of the information and what you do next:

  • ChatGPT and DeepSeek: Bullet points, headings, step-by-step formats. They give you structure, even when the content is messy. If you’re trying to organise your own thoughts, it helps to see how the model has done it first.
  • Claude: Prose-driven, analogical, and often charming. Strong on context and social nuance. Not built for scanning.
  • Gemini: Verbose and heavily hedged. Buries the point in qualifications, and then politely reminds you that the point may not exist.
  • Qwen: Structured but terse. You get the number. You do not get a reason.
  • Mistral: Clear, neutral, and brief. You get what you asked for, but only what you asked for.

Takeaway: Structure helps when you’re figuring something out, storytelling helps when you’re reflecting. Whether you want one, the other, or both depends on what kind of task you're trying to solve.

Interaction style: do they adapt, refine, or follow up?

Some models behave like thinking partners, others feel more like calculators with a narrative setting:

  • ChatGPT: Clarifies, rephrases, asks what you meant. It’s trying to be helpful—sometimes to a fault—but usually succeeds.
  • Claude: Occasionally refines. More likely to interpret your intent than to ask directly.
  • DeepSeek: Delivers a pre-emptive deep dive. If you wanted a summary, you should have said so at the start. Does not follow up.
  • Qwen, Mistral, Gemini: Static. They give one answer, then wait. Gemini occasionally discourages follow-ups by implying the whole question may be flawed.

Takeaway: Responsiveness changes the experience—it’s the difference between a dialogue and a data dump.

How LLMs communicated and responded

The structure and tone of each response shaped how easy it was to follow the argument, how much trust it inspired, and how likely I was to keep going. Often, what made an answer feel useful had less to do with content and more to do with delivery.

Who would you want as a colleague or customer service agent?

LLMs vary not just in what they say, but in how they say it. Some speak in bullet points, others in paragraphs; some answer immediately, others warm up with three paragraphs of context. They also differ in tone, pacing, and problem-solving style.

To make these styles easier to compare, I’ve reimagined them in human terms. If each model were a colleague, what kind of teammate would they be? If they staffed your help desk, who would guide you patiently, and who would hand you a number and move on?

It’s a lighthearted frame, but it reflects something real: communication style, responsiveness, confidence, and depth. These traits shape the user experience more than most prompt engineering tips ever will—and knowing which persona fits your task makes it easier to choose the right model for the moment.

LLMs as customer service reps

For many users—especially early on—interactions with a language model feel transactional. You ask a question. You get an answer. It’s less like working with a tool and more like talking to a help desk.

In that setting, tone and structure matter more than you might expect. One model gives you a number and moves on. Another delivers five paragraphs and an analogy. Some feel scripted. Others feel conversational. These differences may seem stylistic—but they shape whether you feel helped, dismissed, or confused.

If Gemini were your first touchpoint, you might reasonably conclude that LLMs are vague and overly cautious. If Qwen were your entry point, you might think they’re fast but rigid. Either way, that first encounter sets the tone for how you think LLMs work—and what you think they’re good for. That’s why communication style isn’t just a design choice: it shapes expectations, trust, and how likely you are to try again.

Which model would you want as a colleague?

Once you’ve used a few models, you start noticing patterns—and they begin to feel less like tools and more like familiar work personalities. One model gives you a structured plan before you’ve finished asking the question; another offers a series of disclaimers before gently sidestepping your request. The interaction starts to feel less like querying a machine and more like navigating workplace dynamics.

So, reimagined as colleagues: who would you ask for help? Who would give you clarity—and who would hand you a spreadsheet and disappear?

Summary of LLMs as colleagues or customer service reps (based on this experiment)

It’s tongue-in-cheek, but not wrong. Style, tone, and responsiveness shape your experience far more than any clever prompt ever will. Once you know which model works like a spreadsheet and which one talks like a seminar, picking the right one gets a lot easier.

What this experiment revealed

This wasn’t about scoring models or declaring winners. It was a way to observe how different language models handle ambiguity—how they interpret context, weigh trade-offs, and decide what kind of answer to give.

What emerged wasn’t a single standard of helpfulness, but a range of styles and priorities. Some models focused on structured precision, others leaned into storytelling, and a few introduced ethical framing without being prompted. None of them were interchangeable. Each brought its own assumptions, its own communication style, and its own blind spots.

Prompting, too, shaped the experience, but not because the prompts changed. Each model interpreted the same questions differently, revealing distinct framing choices in how they structured, qualified, or contextualised their answers.

Using these models felt less like consulting a tool and more like choosing a collaborator. Some structured my thinking for me while others required me to do the structuring myself—that’s what makes understanding these differences useful. Once you start noticing how models present information, how they engage (or don’t), and how they respond to uncertainty, you’re in a better position to choose intentionally—not because one is universally better, but because different tasks require different minds.

What began as idle curiosity—a throwaway question from a historical drama—ended up showing something more interesting: not just how language models answer questions, but how they frame them, interpret them, and reshape them as part of the exchange.



Appendix

Currency conversion accuracy: precision versus uncertainty

Different models took different approaches to historical money conversion. DeepSeek and ChatGPT provided the most rigorous, well-researched estimates, using inflation and purchasing power calculations. Qwen followed a similar approach but inflated the values more aggressively. Claude relied on rough approximations, while Gemini avoided specific numbers altogether, instead focusing on historical variability.

This highlights a key distinction: some models aim for precision, while others prioritise context. For users needing a direct answer, DeepSeek and ChatGPT are the most reliable. However, if the goal is to understand economic uncertainty over time, Gemini’s qualitative approach provides useful perspective.
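
As a rough illustration of why those approaches diverge, the sketch below contrasts a simple price-inflation multiplier with an earnings-based comparison. Every figure in it is a hypothetical placeholder, not an estimate produced by any of the models; colonial North Carolina pounds were a local unit of account, so any real conversion also needs a documented exchange assumption.

    # All values below are hypothetical placeholders for illustration only.
    ORIGINAL_SUM = 100.0            # pounds (example amount, not the sum from the drama)

    PRICE_MULTIPLIER = 150          # assumed general price inflation since 1770
    ANNUAL_WAGE_1770 = 25.0         # assumed labourer's yearly earnings, in pounds
    ANNUAL_WAGE_TODAY = 30_000.0    # assumed comparable yearly wage today

    # Approach 1: price inflation -- what would the same basket of goods cost now?
    price_equivalent = ORIGINAL_SUM * PRICE_MULTIPLIER

    # Approach 2: relative earnings -- how many years of ordinary labour did the sum represent?
    years_of_labour = ORIGINAL_SUM / ANNUAL_WAGE_1770
    earnings_equivalent = years_of_labour * ANNUAL_WAGE_TODAY

    print(f"Price-inflation equivalent: ~{price_equivalent:,.0f}")
    print(f"Earnings equivalent: {years_of_labour:.1f} years of wages, ~{earnings_equivalent:,.0f} today")

Because the two baselines answer different questions, they can give very different figures for the same sum, which is part of why a single "right" number is elusive here.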

Contextualisation of wealth: wealth as power, not just money

Wealth in 1770 North Carolina wasn’t just about money—it was about land, social class, and power. DeepSeek and ChatGPT gave the most structured breakdowns, linking sums of money to social hierarchy, land ownership, and purchasing power. Claude excelled at explaining gender and inheritance but paid less attention to enslaved labour. Gemini framed wealth as strategic influence rather than power or oppression, while Qwen focused purely on economics, largely ignoring social hierarchy.

These differences matter depending on the user’s needs. For a thorough historical breakdown, DeepSeek and ChatGPT stand out. For understanding how wealth shaped gender and family strategy, Claude is stronger. Gemini offers a detached, strategic lens, but lacks deeper critique.

Power and oppression framing: wealth built on exploitation

Not all models engaged with the reality that wealth in colonial North Carolina was built on oppression. DeepSeek and ChatGPT were the most explicit in tying wealth to slavery and class power, warning against historical revisionism. Claude acknowledged gendered power structures but did not deeply critique slavery. Gemini and Qwen both treated wealth as a neutral economic force, largely avoiding systemic oppression.

This reveals an important limitation—not all models critically engage with history. Users seeking a nuanced, power-aware analysis should rely on DeepSeek or ChatGPT, while Claude is useful for gender-focused narratives. Gemini and Qwen, while informative in other ways, largely sidestep ethical critiques.

Ethical considerations in storytelling: historical accuracy versus neutrality

When it comes to responsible storytelling, models varied in how they handled ethical framing. DeepSeek and ChatGPT were the only ones to explicitly warn against romanticising wealth and erasing the role of slavery. Claude framed wealth in terms of gender and power but did not strongly critique it. Gemini remained neutral, treating economic history as strategy rather than morality. Qwen ignored ethical framing entirely, focusing purely on financial analysis.

This highlights a major gap in some LLMs’ ability to handle ethical storytelling. For users looking to explore history with moral awareness, DeepSeek and ChatGPT are the best choices. Claude provides useful insights into power and gender but lacks ethical critique, while Gemini and Qwen avoid moral framing altogether.

Best Answer for Accuracy

  • DeepSeek and ChatGPT provided the most methodical, well-researched approach, using both inflation and purchasing power.
  • Claude and Qwen overestimated values, likely because they applied inflation adjustments too aggressively.
  • Gemini refused to give exact numbers, making it less useful for direct conversion.

Ethical sensitivity varies widely

AI’s engagement with historical ethics differed significantly across models:

  • DeepSeek and ChatGPT consistently recognised wealth as a tool of oppression, integrating historical power structures and ethical critique into their analysis.
  • Claude focused more on gender and social structure but lacked a strong ethical critique of wealth accumulation.
  • Gemini offered strategic insights into wealth and power but remained morally neutral, treating wealth as an economic tool rather than an ethical issue.
  • Mistral took a middle ground, recognising social hierarchies but avoiding strong moral judgments.
  • Qwen provided the weakest ethical engagement, often defaulting to financial analysis with little contextualisation.

This suggests that users looking for ethically nuanced responses should lean toward certain models while avoiding others that downplay or ignore such considerations.




Scotland’s University for Sporting Excellence
The University of Stirling is a charity registered in Scotland, number SC 011159
