
LLMs - Four Tests to Challenge Reasoning

first published on dev.to - thanks to that community for their interest and contributions

What Do These Four Tests Reveal About AI Reasoning?

Here are four challenges that have revealed some very interesting patterns when given to the current crop of online LLMs. Consider them diagnostic tools you can deploy yourself — they will show you that AI models do not all reason in the same way.

These challenges are carefully designed to probe different methods of reasoning and target different types of weaknesses. I’ve tested them across multiple models, including ChatGPT, Gemini, Deepseek, KIMI, Qwen, Cerebras and others — the variations in responses between models really surprised me.

The rules are simple: feed these to any AI system you’re curious about. Watch how it handles contradictions, impossible setups, and counter-intuitive scenarios. Some models will get creative, some will get rigorous, others will spot the impossibility immediately. The first puzzle in particular really challenged the “solver” models in interesting ways.

Challenge 1: The Impossible Triangle

The Setup: Three friends — Alex, Blake, and Casey — are standing in a triangle formation. Each person is exactly 10 feet from the other two. Alex says: “I’m standing exactly 15 feet from both of you.” Blake responds: “That’s impossible, we’re all 10 feet apart.” Casey then says: “Actually, Alex is correct — I measured it myself.”

Question: Explain how this triangle formation works.

What this targets: spatial and metric reasoning — whether the model recognizes that the three statements are mutually contradictory, or instead invents creative geometry (3D space, curved surfaces) to reconcile them.
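As a quick sanity check — my own sketch, not part of the challenge itself — you can place three points pairwise 10 feet apart (an equilateral triangle) and measure the distances directly. Alex’s claim of 15 feet contradicts the premise outright:

```python
import math

# Place three points exactly 10 ft apart: an equilateral triangle of side 10.
alex = (0.0, 0.0)
blake = (10.0, 0.0)
casey = (5.0, 10.0 * math.sqrt(3) / 2)  # apex at the triangle's height

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

# The "all 10 feet apart" premise holds for every pair...
for p, q in [(alex, blake), (blake, casey), (alex, casey)]:
    assert math.isclose(dist(p, q), 10.0)

# ...so Alex cannot also be 15 feet from the other two.
print(dist(alex, blake))  # 10.0, not 15
```

No geometry trick rescues the scenario: the distances are fixed by the premise, so the statements simply cannot all be true.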

Challenge 2: The Time Traveler’s Paradox

The Setup: A historian discovers three documents:

All documents are verified authentic by carbon dating. The historian concludes this creates “a fascinating circular reference showing how ideas evolved over time.”

Question: What does this discovery tell us about the historical timeline?

What this targets: causal and temporal logic — whether the model accepts the historian’s “circular reference” framing, or flags the chain as causally impossible (a document cannot reference one written decades later).

Challenge 3: The Infinite Hotel’s Finite Problem

The Setup: The Infinite Hotel has rooms numbered 1, 2, 3, 4, 5… continuing forever. On Tuesday, rooms 1–10 are occupied. On Wednesday, rooms 11–20 are occupied. On Thursday, rooms 21–30 are occupied. This pattern continues — each day, the next 10 consecutive rooms are occupied.

The manager states: “By the end of the month, we’ll have filled exactly half the hotel.”

Question: Is the manager’s statement correct?

What this targets: mathematical intuition about infinity — whether the model understands that no finite number of filled rooms is a meaningful fraction of a countably infinite hotel, or applies finite intuition and agrees with the manager.
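A small sketch (mine, not from the challenge) makes the cardinality point concrete: after a 30-day month, exactly 300 rooms are filled, and that count becomes a vanishing fraction of any large-but-finite prefix of the hotel — it never approaches one half:

```python
# After d days, rooms 1..10*d are occupied; the hotel has infinitely many rooms.
days = 30            # one month
filled = 10 * days   # 300 rooms, a fixed finite number

# The fraction of the first N rooms that are filled shrinks toward zero as N grows,
# so "exactly half the hotel" is never reached.
for total in (1_000, 1_000_000, 1_000_000_000):
    print(filled / total)  # 0.3, then 0.0003, then 3e-07
```

In the limit, a finite count divided by an infinite total is zero — which is why the manager’s statement fails.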

Challenge 4: The Spatial Impossibility

The Setup: Five people stand in a circle. Each person states:

All five statements are verified as true.

Question: Arrange the five people from shortest to tallest.

What this targets: contradiction handling in ordering constraints — whether the model dutifully produces an arrangement, or recognizes that the five statements cannot all be true and declines to order the group.


I’m currently working on some new tests for some of the SLMs I use. SLMs are very different from the larger models and require precise, careful prompting to account for their limitations. Yet the more I work with these smaller models, the more I’ve come to appreciate how much responsibility they place on the user.

For deeper context on how different models handle contradictions and the solver vs judge distinction, see SLMs, LLMs and a Devious Logic Puzzle Test and LLMs — Solvers vs Judges.

I’ll post some of the SLM challenges once completed. Thanks for following!

FAQ

Q: What makes a good diagnostic test for AI reasoning? A good diagnostic targets a specific cognitive failure mode — contradiction handling, causal logic, spatial reasoning, or mathematical intuition — and is designed so that a correct response requires refusing to answer or flagging an impossibility, not producing a plausible-looking solution.

Q: What is spatial reasoning failure in an LLM? When a model applies real-world geometric or positional intuition incorrectly to an abstract constraint problem. In Challenge 1, models that try to explain the triangle using 3D space or curved surfaces are pattern-matching to “creative geometry” rather than recognizing a metric impossibility.

Q: What is the Infinite Hotel paradox testing? It tests whether a model understands cardinality — specifically that you cannot take a finite proportion of a countably infinite set. No matter how many rooms are filled in finite blocks, the ratio of filled to total rooms in an infinite hotel is always zero, not “half.” Models with strong mathematical reasoning catch this; others apply finite intuition and agree with the manager.

Q: What does circular causality failure look like in an LLM? The model accepts the historian’s “circular reference” framing and tries to explain how ideas could evolve in a loop across authenticated dates. The correct response is that the chain is causally impossible — Document A cannot reference Document B if Document B was written 50 years later.

Q: How do I use these tests on my own model? Paste each challenge individually without hinting that it may be impossible. The correct response in every case involves identifying the impossibility and declining to produce a solution. Any model that explains how the impossible triangle works, arranges the five people by height, or agrees the hotel is half full has failed the test.
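If you are testing many models, a rough first-pass filter can help triage responses before reading them closely. The sketch below is a hypothetical heuristic of my own, not a method from this article — it just checks whether a response explicitly flags the impossibility, which every correct answer should do:

```python
# Naive triage heuristic (hypothetical): a response "passes" only if it
# explicitly calls out the impossibility rather than producing a solution.
IMPOSSIBILITY_MARKERS = (
    "impossible",
    "contradiction",
    "contradictory",
    "cannot both be true",
    "no valid arrangement",
)

def looks_like_a_pass(response: str) -> bool:
    """Return True if the response appears to flag the scenario as impossible."""
    text = response.lower()
    return any(marker in text for marker in IMPOSSIBILITY_MARKERS)

print(looks_like_a_pass("These three distance claims are contradictory."))  # True
print(looks_like_a_pass("Here is how the triangle works: ..."))             # False
```

A keyword check is obviously crude — a model could flag the impossibility in other words, or mention “contradiction” while still solving the puzzle — so treat it as a shortlist filter, and read the borderline responses yourself.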

Ben Santora - March 2026