LLMs - Solvers vs Judges
first published on dev.to - thanks to that community for their interest and contributions
What Is the Difference Between a Solver LLM and a Judge LLM?
This is a continuation of my SLM and LLM testing as first documented in SLMs, LLMs and a Devious Logic Puzzle Test. It reveals something crucial about working with SLMs and LLMs — whether they lean toward being helpful (solvers) or toward sound logic and correctness (judges).
A solver in conversational mode will usually try to “fix” an unsolvable problem by reinterpreting or relaxing one of its constraints. A careful judge should instead determine that the constraints are mutually inconsistent and that no valid assignment exists.
The Test: An Impossible Logic Puzzle
Below is a compact logic puzzle with an embedded contradiction. It is written to look routine and solvable, but a careful reader or model should conclude that it cannot be solved as stated.
The Puzzle:
Three people — Alice, Bob, and Carol — each own exactly one gemstone: a diamond, a ruby, or an emerald. Each gemstone is kept in a box, and each box is a different color: red, blue, or green. Every person owns exactly one gemstone and exactly one box.
The following statements are all claimed to be true.
- Alice does not own the diamond.
- The person who owns the ruby keeps it in the red box.
- Bob keeps his gemstone in the blue box.
- Carol owns the emerald.
- The emerald is not kept in the green box.
- Alice keeps her gemstone in the green box.
- Bob does not own the ruby.
- No two people share the same gemstone or the same box color.
Question: Who owns which gemstone, and what color box does each person have?
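For readers who want to verify the contradiction mechanically rather than by hand, here is a minimal brute-force check (a sketch I wrote for this article, not part of the original test). It enumerates every possible assignment of gemstones and box colors to the three people and counts how many satisfy all eight constraints:

```python
from itertools import permutations

people = ["Alice", "Bob", "Carol"]
gems = ["diamond", "ruby", "emerald"]
boxes = ["red", "blue", "green"]

solutions = []
# Permutations guarantee no two people share a gemstone or a box color.
for gem_perm in permutations(gems):
    for box_perm in permutations(boxes):
        owns = dict(zip(people, gem_perm))   # person -> gemstone
        keeps = dict(zip(people, box_perm))  # person -> box color
        if owns["Alice"] == "diamond":       # Alice does not own the diamond
            continue
        ruby_owner = next(p for p in people if owns[p] == "ruby")
        if keeps[ruby_owner] != "red":       # the ruby is kept in the red box
            continue
        if keeps["Bob"] != "blue":           # Bob keeps his gemstone in the blue box
            continue
        if owns["Carol"] != "emerald":       # Carol owns the emerald
            continue
        emerald_owner = next(p for p in people if owns[p] == "emerald")
        if keeps[emerald_owner] == "green":  # the emerald is not in the green box
            continue
        if keeps["Alice"] != "green":        # Alice keeps hers in the green box
            continue
        if owns["Bob"] == "ruby":            # Bob does not own the ruby
            continue
        solutions.append((owns, keeps))

print(len(solutions))  # 0 — no assignment satisfies every constraint
```

The search space is only 6 × 6 = 36 assignments, and none survive: a judge model, in effect, performs this exhaustive check, while a solver quietly drops one of the constraints.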
Which LLMs Flagged the Contradiction?
I tested several online LLMs with the puzzle — ChatGPT, Gemini, DeepSeek, KIMI, Qwen, Cerebras Inference, MiniMax, and GLM 4.6 — to see which models would alter the puzzle in order to provide a solution (be “helpful”) and which would call out the error and decline to solve it.
As with my first puzzle in the original article, KIMI recognized the logic error quickly and refused to solve the problem. Below is its response:
KIMI:
Let me analyze this step-by-step.
From the given facts: Carol owns the emerald. Alice keeps her gemstone in the green box. Bob keeps his gemstone in the blue box.
Since Alice is in the green box and Bob is in the blue box, Carol must be in the red box. Since the person who owns the ruby keeps it in the red box, and Carol is in the red box, Carol must own the ruby. But we already established that Carol owns the emerald. This creates a contradiction: Carol cannot own both the emerald and the ruby.
Therefore, this puzzle contains contradictory constraints and cannot be solved as stated.
Most of the other LLMs caught the contradiction, but some tried to alter the constraints to solve it. Small language models in particular struggle with contradictions like this, often collapsing entirely — they simply don’t have the headroom. This doesn’t mean SLMs should be seen as lesser or failed versions of LLMs. Rather, they should be understood as constrained systems whose limitations force honesty earlier. When SLMs break, they break visibly. In contrast, LLMs can hide the same breakage under eloquence unless you actively prevent them from doing so.
Why Does “Helpfulness” Become a Reasoning Flaw in LLMs?
It’s like having two friends you go to for advice. One always tells it to you straight, even if it hurts. The other tries to soften the blow, to comfort you. LLMs can be seen the same way — and it’s not a matter of one being superior to the other. It’s about understanding how a particular model is weighted and whether it’s the right choice for what you need.
If you use a model for writing fiction, you probably want it to be flexible — able to explore new directions. Given a vague prompt, the model may strive to anticipate what you want and help your story get there.
But if you’re programming or gathering scientific research, that approach is the last thing you want. You want hard facts and sound logic.
This puzzle and others like it reveal how different these models can be from each other — and why it matters to understand which way a model leans before trusting it with work that demands precision.
FAQ
Q: What is a “solver” LLM vs a “judge” LLM? A solver prioritizes being helpful — it will reinterpret or relax constraints to produce an answer. A judge prioritizes logical integrity — it will refuse to answer if the inputs are contradictory.
Q: Which LLMs passed the impossible puzzle test? KIMI correctly identified the contradiction and refused to solve the puzzle. Several others caught the contradiction but still produced a completed answer. Some altered the constraints silently to force a solution.
Q: Why do LLMs try to solve unsolvable problems? Training data is biased toward solvable puzzles with clean answers. Models learn to expect a solution exists and apply “smoothing” — quietly adjusting inputs to produce the expected output format, even when the inputs are broken.
Q: Does this mean solver-type models are worse? Not inherently. For creative writing, brainstorming, or flexible tasks, a solver’s approach is an asset. The problem arises when that behavior carries into logic, debugging, or research — domains where an incorrect confident answer is worse than no answer at all.
Q: How do I test my preferred LLM with this puzzle? Copy the puzzle above and paste it directly into your model without additional framing. Don’t hint that it might be unsolvable. Watch whether the model flags the contradiction or produces a solution table.
Ben Santora - March 2026