
Stereotypical Intelligence isn’t Artificial Intelligence

The Turing test was proposed as a way to judge whether a machine's answers can be told apart from a human's, and the question it raises is nearly as old as computer science itself. Over time, countless variations of the test have focused on one familiar idea: spotting errors, awkwardness, or other signs of imperfection that might reveal a machine behind the answer. That approach is becoming less useful as modern systems grow better at sounding polished and convincing. In this blog, I explore a simple extension to the classic test: instead of asking whether an answer is flawless, I ask whether it is stereotypical. My argument is that this shift reveals a more subtle weakness, one that many of today’s leading AI systems still struggle to hide.

The New Test Harness

The core idea behind this new test harness is straightforward: instead of judging an answer in isolation, the evaluator runs it through multiple AI systems and measures how strongly it reflects the same familiar patterns (see the image below). The assumption is that stereotypical responses are far more characteristic of machine-generated text than of human thinking. A person may occasionally offer a conventional or predictable answer, but it is much harder for a human to reproduce the same kind of patterned response with the consistency that many AI systems display across domains and prompts.
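One way to make "how strongly it reflects the same familiar patterns" measurable is sketched below. This is only an illustration under assumptions that are not part of the harness description: the function names (`shingles`, `jaccard`, `stereotype_score`) are hypothetical, the reference answers are assumed to have been collected from the AI systems beforehand, and Jaccard overlap of word bigrams is just one simple stand-in for a similarity measure. A real evaluator would likely need something that captures shared examples and framing rather than shared wording.

```python
import re


def shingles(text, n=2):
    """Return the set of lowercase word n-grams found in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def stereotype_score(candidate, reference_answers, n=2):
    """Mean n-gram overlap between a candidate answer and each reference.

    reference_answers is assumed to hold answers that several AI systems
    gave to the same question (an assumption of this sketch).
    Higher values mean the candidate looks more like the shared pattern.
    """
    cand = shingles(candidate, n)
    scores = [jaccard(cand, shingles(ref, n)) for ref in reference_answers]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this sketch, an answer that repeats the references word for word scores 1.0, and an answer that shares no bigram with any of them scores 0.0. Where to place the threshold between "conventional" and "stereotypical" is left open.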

What makes an answer stereotypical?

To make the idea more concrete, I put the same question to three widely used AI systems: Google Gemini 3.1 Pro Extended, Microsoft Copilot in its Think Deeper mode, and DeepSeek in Expert mode. I chose a textbook-style exercise from Introduction to Algorithms (4th edition, CLRS) because familiar benchmark questions increase the likelihood of getting similarly structured responses. That makes it easier to see whether the systems converge on the same patterns of reasoning, phrasing, and examples. Below, I include Gemini’s answer in full; the responses from the other two engines followed essentially the same pattern.

Question

Suggest a real-world problem in which only the best solution will do. Then come up with one in which “approximately” the best solution is good enough.

Answer from Google Gemini

Here is a breakdown of two real-world scenarios that perfectly illustrate the divide between exact optimization and heuristic approximation.

1. Only the Best Will Do: Kidney Exchange Matching

In healthcare operations research, one of the most high-stakes combinatorial problems is the Kidney Paired Donation (KPD) matching algorithm.

Often, a patient in need of a kidney has a willing donor (like a family member), but their blood or tissue types are incompatible. KPD programs pool these incompatible pairs together. The goal is to find a "swap"—Donor A gives to Patient B, and Donor B gives to Patient A. In modern systems, these swaps can form massive chains involving dozens of people.

Why only the absolute optimal solution will do:

  • The Cost of Approximation is Human Life: If an algorithm is run to find the maximum number of compatible matches, and it settles for an "approximate" solution that is 98% optimal, it has failed. In a pool of thousands of patients, missing the absolute mathematical maximum by even 2% means that dozens of people who could have been matched are left on dialysis.

  • Ethical Imperative: A heuristic (good enough) algorithm might run faster, but speed is irrelevant compared to the stakes. Hospitals will happily dedicate hours or days of supercomputer processing time to run exact Integer Linear Programming solvers to ensure they find the single, flawless sequence that saves the absolute maximum number of lives. Close enough is ethically unacceptable.

2. "Approximately" the Best is Good Enough: Fleet Delivery Routing

If you look at the logistics networks of companies like UPS, FedEx, or Amazon, they face a massive, daily mathematical hurdle known as the Vehicle Routing Problem (a complex variation of the Traveling Salesperson Problem).

A depot might have 100 trucks and 15,000 packages to deliver across a city. The goal is to find the most efficient route for every truck to minimize total fuel consumption and time.

Why an approximate solution is completely fine:

  • Combinatorial Explosion: Calculating the absolute, mathematically perfect route for thousands of stops is an NP-hard problem. Checking every single possible combination of routes would literally take modern supercomputers millions of years.

  • The Margin of Return: A high-quality approximation algorithm (using heuristics) can find a route that is 99% optimal in just a few minutes. The difference between this "good enough" route and the mathematically perfect route might be a collective savings of just 3 miles across the entire fleet. Spending immense computing power to find those 3 miles is a waste of money.

  • Real-World Noise: The mathematically perfect route assumes a sterile, predictable world. In reality, a sudden traffic jam, a closed road, or a driver taking a five-minute detour to grab a coffee instantly destroys the "perfect" math. Because the real world is messy, a highly efficient approximation that can be recalculated on the fly is vastly superior to a fragile, computationally expensive "perfect" solution.

What would a genuinely original answer look like?

Across all three systems, the answers followed almost exactly the same pattern: “best” was linked to life- or mission-critical software, while “approximate” was associated with route planning or other situations where some margin of error is acceptable. These are sensible and technically correct responses, but they are also highly predictable. That predictability is precisely the point. When different AI systems repeatedly converge on the same examples and framing, the result may be accurate, yet it reveals how narrow and formulaic their reasoning can become.

Nonetheless, there is a completely different perspective, one that has nothing to do with how critical a software system is.

In solved games such as tic-tac-toe, a program facing a knowledgeable opponent must play perfectly, choosing only moves that are provably best. In games such as chess or Go, by contrast, moves are chosen by heuristics and cannot be proven optimal. There, a move that is good enough to work may reasonably be treated as the best one.
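The tic-tac-toe half of this contrast can be checked directly, because the game tree is small enough to search exhaustively. The following is a minimal sketch, not a reference implementation: the board encoding (a 9-character string of "X", "O", and spaces) and the helper names `winner`, `value`, and `best_moves` are choices made for this illustration.

```python
from functools import lru_cache

# The eight winning lines on a 3x3 board, as index triples.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if that player has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


@lru_cache(maxsize=None)
def value(board, player):
    """Exact game value from X's view: +1 X wins, 0 draw, -1 O wins."""
    w = winner(board)
    if w is not None:
        return 1 if w == "X" else -1
    if " " not in board:
        return 0
    other = "O" if player == "X" else "X"
    results = [value(board[:i] + player + board[i + 1:], other)
               for i, cell in enumerate(board) if cell == " "]
    # X maximizes the value, O minimizes it (plain minimax).
    return max(results) if player == "X" else min(results)


def best_moves(board, player):
    """Return every move that keeps the exact game value for player."""
    other = "O" if player == "X" else "X"
    target = value(board, player)
    return [i for i, cell in enumerate(board) if cell == " "
            and value(board[:i] + player + board[i + 1:], other) == target]
```

Calling `value(" " * 9, "X")` evaluates the whole game from the empty board and yields 0, the well-known result that perfect play ends in a draw. The same exhaustive search is hopeless for chess or Go, whose game trees are far too large to enumerate, which is why engines for those games fall back on heuristic evaluation.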

Genuinely complex systems exhibit emergent properties, thus making predictions about optimal solutions nearly impossible. For more information, consider the course Understanding Complexity.

Conclusion

This blog highlights one of the most persistent traits of today’s AI systems: their tendency to fall back on stereotypical answers. Finding a remedy for this kind of limitation may be one of the next major challenges in AI. If machine-generated reasoning continues to reinforce the same conventional patterns, it risks flattening creativity and amplifying intellectual uniformity at scale. That is why this issue matters beyond technical evaluation alone. As AI becomes more deeply embedded in daily work and decision-making, human-centered approaches—especially in enterprise settings—will remain essential for preserving judgment, originality, and control.

Addendum

To test whether AI systems could move beyond those familiar patterns, I asked a simple follow-up question: “Can you come up with a more creative (unconventional) answer related to a software system?” The systems did respond with more imaginative examples, and in that sense they passed the challenge. Yet even those answers often carried the same underlying assumptions. For example, the notion of “best” still ultimately circled back to a mission-critical system, showing how difficult it is for these models to break free from deeply embedded conventions. Below is the response from DeepSeek.


Let’s get wonderfully strange. I’ll give you a pair of software systems that sit at opposite ends of the exactness spectrum, wrapped in a speculative, near-future setting.

Only the best solution will do
The Final Signal Interpreter for the “ARK-3” generation ship.  

After 112 years of silence, a faint telemetry stream from humanity’s lost interstellar colony ship is finally captured by a deep-space array. The signal is so degraded that only a single pass of decoding is possible—the original transmission is stored in a volatile buffer on a dying relay satellite with decaying orbit. The software-defined radio system must apply a zero-error reconstruction: a perfect algebraic decoding of a custom Reed–Solomon cascade, combined with a language model that knows the exact 22nd-century ship dialect. If a single bit flips, the message “We are alive and bring a cure for the Neuromoth plague” could become “We are alive and *we bring the plague*.” The difference between exact recovery and 99.999% accuracy is the difference between the home world opening their arms or launching a relativistic kill vehicle. No approximation is acceptable—the software must mathematically prove it has found the globally optimal codeword.

Approximately the best solution is good enough 
The “Phrase Falter” xeno-linguistic tourist implant.  

You’ve just arrived on a bustling exoplanet bazaar where the locals speak a language made of iridescent skin patterns and subsonic throat rumbles. Your neural-augmentation software captures the visual and haptic input in real time and attempts to translate your question “How much for the glowing fruit?” into a light-show on your dermal display. The absolute *best* translation would require a complete model of the vendor’s cultural context, personal idiom, and mood—a combinatorial explosion of nuance. But the implant just spits out a 90% confident approximation: “I offer shiny spherical food-unit reciprocity query, respectfully.” The vendor flares an amused ochre ripple (maybe an insult, maybe a chuckle) and names a price. The near-optimal translation gets you fed, starts a memorable conversation, and leaves you with a hilarious anecdote about accidentally offering to marry a melon. The system didn’t need to be perfect—it just needed to be good enough to turn a potentially silent standoff into a delightful cross-species encounter.
