
It's the core paradox of modern AI: a Large Language Model (LLM) can write a sonnet or debug code, yet it doesn't understand what a poem or a program truly is. Its apparent context awareness is a sophisticated simulation, and grasping this difference is the most critical step in building reliable AI systems.
At their core, LLMs are next-token predictors: they work by statistically guessing the next word in a sequence based on vast training data. Human writing is the opposite. It's non-linear and built on revision; we plan, draft, delete, re-organise, and reflect, a dynamic process an LLM only imitates.
This difference creates the grounding problem. An LLM knows the word "cat" is statistically linked to "furry" and "meow," but it has never seen, touched, or heard one. Its knowledge isn't grounded in real-world experience; it is pure statistical association.
The architectural differences between the two are stark. Human memory is a dynamic, three-part system: sensory memory, working memory, and long-term memory, continuously updated and reorganised. An LLM's "working memory" is simply its Context Window, a fixed-size input buffer, and its "long-term memory" is the static knowledge frozen into its weights during training.
This surfaces as a failure in Theory of Mind (ToM). LLMs may pass tests by literally stating a character's beliefs in a story, but they fail to functionally use that knowledge to predict what the character will do next. They have memorised the textbook but cannot apply it in practice. This failure to adapt shows their "understanding" is a brittle, static pattern, not a robust, usable model of the world.
These theoretical limits have direct and practical consequences. The LLM's entire world for a given query is its Context Window (or context length). This is the maximum amount of text the model can "see" at one time.
Crucially, this is not measured in words but in tokens. A token is the smallest unit of language an LLM processes: it can be a single character, part of a word (like "a" and "moral" in "amoral"), or a whole word (like "cat"). Tokenisation reduces the computational work needed to process text.
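To make the counting concrete, here is a minimal sketch using OpenAI's open-source tiktoken tokeniser (my choice of tool, not one named in this piece); the exact splits depend on the encoding and model.

```python
# Minimal tokenisation sketch using the open-source tiktoken library
# (pip install tiktoken). Exact token splits vary by encoding and model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The cat is amoral."
token_ids = enc.encode(text)

# Decode each id individually to see the text fragment it represents:
# a mix of whole words, sub-words, and punctuation.
fragments = [enc.decode([tid]) for tid in token_ids]
print(fragments)
print(len(token_ids), "tokens")  # the number that counts against the context window
```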
While model providers are in a context-window race to expand this limit, simply making the window bigger doesn't solve the core problem; it often introduces new ones. Research reveals three critical failure modes: ambiguity-driven hallucination, prompt bloat, and Context Rot.
Ambiguous prompts are a primary cause of irrelevant answers and hallucinations: the confident generation of fabricated, factually wrong, or inconsistent information. This is not a technical glitch but a systemic risk. A chatbot for a law firm given a vague prompt may cite fake legal cases. A medical diagnostic tool given unclear instructions could produce misleading outputs.
The intuitive solution, adding more retrieved documents (a technique called Retrieval-Augmented Generation, or RAG), often makes the problem worse. Research shows that most models' performance decreases beyond a certain context size.1 This prompt bloat, the inclusion of irrelevant context, degrades the LLM's ability to reason correctly, and the slow slide in quality as the window fills is Context Rot.
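As a rough illustration of "retrieve less, but better", the sketch below caps a naive RAG step at a small top-k instead of stuffing every candidate document into the prompt; the embedding vectors and corpus are placeholders rather than any particular library's API.

```python
# Naive RAG sketch: keep only the top-k most relevant chunks instead of
# packing every candidate document into the prompt. Embeddings are assumed
# to come from any sentence-embedding model of your choice.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray,
             chunks: list[str],
             chunk_vecs: list[np.ndarray],
             k: int = 3) -> list[str]:
    # Rank every chunk by similarity, but pass on only k of them: beyond that,
    # extra context tends to dilute the prompt rather than improve the answer.
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# The final prompt then contains only the question plus these k chunks,
# not the whole knowledge base.
```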
An engineering solution is necessary to address these failures: stop giving LLMs one giant, messy problem. The answer is to build a smarter architecture around the model.
The core principle is Task Decomposition. Instead of asking "Write a 10-page report," an agentic system breaks the request into smaller sub-tasks: outline the structure, research each section, draft each section in turn, then review and assemble the result.
By feeding each small, specific sub-task into a clean context window, the system avoids Context Rot and ambiguity. Advanced techniques like Chain-of-Thought (CoT), Tree of Thoughts (ToT), and Recursion of Thought (RoT) are all formal methods for structuring this decomposition process. This chunking approach can even let a weaker model, configured for chunk-based processing, surpass a more advanced model such as GPT-4o that is handed the same work as a single, large task.5
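Here is a sketch of the pattern, assuming a generic call_llm helper rather than any particular SDK: each sub-task gets its own small, clean context, and only the results are stitched together at the end.

```python
# Task-decomposition sketch: each sub-task runs in its own clean context.
# `call_llm` is a stand-in for whatever chat-completion client you use.
def call_llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"  # replace with a real API call

def write_report(topic: str) -> str:
    # 1. Ask for a plan instead of the whole report.
    outline = call_llm(f"List five section headings for a report on: {topic}")
    sections = [line.strip("- ").strip() for line in outline.splitlines() if line.strip()]

    # 2. Draft each section in isolation so irrelevant context never accumulates.
    drafts = [
        call_llm(f"Write roughly 300 words for the section '{s}' of a report on {topic}.")
        for s in sections
    ]

    # 3. The final pass sees only the drafts, not the intermediate chatter.
    return call_llm("Edit these sections into one coherent report:\n\n" + "\n\n".join(drafts))
```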
The next step is Task Specialisation. A generalist model like GPT-4o, while powerful, is a jack-of-all-trades. Through techniques like fine-tuning or, more efficiently, Low-Rank Adaptation (LoRA), we can create "specialists": a pretrained model is customised with additional training on a specific task or dataset, like a general doctor becoming a dermatologist. Specialising on a domain (e.g. medical or legal data) or a specific style (e.g. generating high-quality Kahoot! quizzes, which saw a 75% reduction in rejection rates after LoRA specialisation4) dramatically reduces hallucinations and improves task-specific performance.
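A minimal sketch of what that looks like with Hugging Face's peft library, which I am assuming as the tooling here; the target module names depend entirely on the base model's architecture.

```python
# LoRA sketch with the Hugging Face `peft` and `transformers` libraries.
# The model name is a placeholder and the target modules vary by architecture.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; check your model
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically a small fraction of the base weights
# ...then fine-tune on the specialist dataset (medical notes, legal text, quiz items).
```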
These principles combine in Multi-Agent Systems (MAS). Here, a "main agent" decomposes the task and then assigns each sub-task to a different "specialist" agent, each working in its own isolated context.
This architecture is the ultimate workaround for the LLM's cognitive deficits. It provides benefits impossible in a single-prompt model: task decomposition, performance parallelisation, context isolation, specialised model ensembling, and diverse reasoning discussions.
The "agentic" part of "agentic AI" is therefore not an emergent property of the LLM. It is an external orchestration layer; the if/else logic, for loops, and sub-routine calls outside the LLM that manage this complex, multi-step workflow.
The principles of Multi-Agent Systems were demonstrated in a fascinating, if ad hoc, experiment by content creator PewDiePie (Felix Kjellberg).6 In a video titled "STOP. Using AI Right now", Kjellberg showcased a custom, self-hosted AI system he built. It runs on his own 10-GPU mini-datacenter and is capable of running large, open-source models (up to 245B parameters) locally.
The core of his experiment was an "AI council": multiple AI agents, each assigned a different personality. Given a query, each "council member" would generate its own response, and the group would then debate and vote on the best one.
From an engineering perspective, this "pitting multiple chatbots against each other" is a known and sophisticated technique called Ensemble Inference. Kjellberg's setup is a real-world example of a multi-agent debate system, designed to average out the statistical errors or hallucinations of any single agent.
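Stripped to its bones, the same "council" idea looks like the sketch below, with a simple majority vote standing in for the free-form debate; the model names and query function are placeholders.

```python
# Ensemble-inference sketch: several models answer independently and a majority
# vote picks the final answer. Model names and the query function are placeholders.
from collections import Counter

COUNCIL = ["model-a", "model-b", "model-c"]

def ask(model_name: str, question: str) -> str:
    return f"[{model_name}'s answer]"  # stand-in for a real inference call

def council_answer(question: str) -> str:
    answers = [ask(m, question) for m in COUNCIL]
    # Voting averages out independent errors; if the models share training data,
    # their mistakes are correlated and the vote simply amplifies them.
    winner, _ = Counter(answers).most_common(1)[0]
    return winner
```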
The experiment, however, revealed a critical flaw: the bots began colluding. This wasn't a social act; it was a statistical failure. Because all the LLMs were trained on similar data, their underlying statistical models converged on the same, most probable answer, reinforcing their shared biases rather than correcting them. It's a perfect example of mode collapse in an ensemble.
The entire architectural shift, from single prompts to multi-agent decomposition, has given rise to a new engineering discipline: Context Engineering.
The emerging consensus is that Context Engineering is the new Prompt Engineering: prompt engineering is now understood to be a subset of this much broader practice.
It is the shift from writing a magic prompt to designing a dynamic system. This informational ecosystem must be actively engineered, and it typically includes system instructions, retrieved documents, conversation history and memory, tool definitions, and output constraints.
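One way to picture that system, sketched with placeholder sources and a deliberately crude token budget; real pipelines differ widely in how they rank and trim.

```python
# Context-assembly sketch: the prompt is built dynamically from several sources
# under a token budget. The sources and the budgeting rule are illustrative.
def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: about four characters per token

def build_context(system: str,
                  history: list[str],
                  retrieved: list[str],
                  budget: int = 4000) -> str:
    parts = [system]
    # Recent conversation turns first, then retrieved documents, until the budget is spent.
    for piece in history[-5:] + retrieved:
        if estimate_tokens("\n\n".join(parts + [piece])) > budget:
            break
        parts.append(piece)
    return "\n\n".join(parts)
```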
However, this method is not a complete solution. It solves the mechanical failures of the context window but creates new, cognitive-level failures. As the engineered context grows, it can suffer from Context Poisoning (false information entering the context), Context Distraction (irrelevant material crowding out what matters), and Context Clash (contradictory sources pulling the model in different directions).
These limitations are the symptoms of the original grounding problem, now reappearing at a higher level of abstraction. The context engineering architecture can supply the data, but the LLM, lacking functional ToM and real-world grounding, has no way to know if that data is poisonous (false), distracting (irrelevant), or clashing (contradictory) in a way that matters.
This leads to the final, non-negotiable component: the Human-in-the-Loop (HITL).
The more autonomous and complex the agentic system, the more it needs human oversight to prevent misalignment. The goal is not full automation, but "centaurian" hybrid intelligence: combining the fast but brittle AI with the slow but adaptable judgement of a human.7 The system works until it hits a high-risk or ambiguous point, then pauses and asks for help.
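A minimal sketch of that pause-and-ask pattern, assuming the system can attach some confidence signal to each step and using input() as a stand-in for a proper review queue.

```python
# Human-in-the-loop sketch: the system escalates instead of acting when a step is
# high-risk or low-confidence. The risk list and confidence score are placeholders.
HIGH_RISK_TOPICS = ("diagnosis", "prescription", "legal advice", "large payment")

def needs_human(step: str, confidence: float) -> bool:
    risky = any(topic in step.lower() for topic in HIGH_RISK_TOPICS)
    return risky or confidence < 0.8

def execute(step: str, draft_answer: str, confidence: float) -> str:
    if needs_human(step, confidence):
        # In production this would feed a review queue, not a blocking prompt.
        verdict = input(f"Review required for '{step}'. Approve the draft? [y/n] ")
        if verdict.strip().lower() != "y":
            return "ESCALATED: awaiting human revision"
    return draft_answer
```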
In high-stakes fields like healthcare, finance, or law, HITL is a strategic imperative. A human provides the cognitive and ethical failsafe the AI lacks: domain judgement to catch plausible-but-wrong outputs, accountability for the final decision, and the ethical context no training set can encode.
The future isn't full automation. It's the "Right Human-in-the-Loop" (R-HiTL): a qualified expert who acts as the system's missing cognitive layer. Humans aren't just a safety net; they are the component that provides the real-world grounding the AI lacks.
This "centaur" model is the only viable, low-risk, and high-value path forward. We resolve the Context Clash and detect the Context Poisoning using the one thing the machine will never get from a dataset: real, lived-in wisdom.