
It's the core paradox of modern AI: a Large Language Model (LLM) can write a sonnet or debug code, yet it doesn't understand what a poem or a program truly is. Its apparent context awareness is a sophisticated simulation, and grasping this difference is the most critical step in building reliable AI systems.
At their core, LLMs are next-token predictors: they work by statistically guessing the next word in a sequence based on vast training data. Human writing is the opposite. It's non-linear and built on revision; we plan, draft, delete, re-organise, and reflect, a dynamic process an LLM only imitates.
This difference creates the grounding problem. An LLM knows the word "cat" is statistically linked to "furry" and "meow," but it has never seen, touched, or heard one. Its knowledge isn't grounded in real-world experience; it is pure statistical association.
The architectural differences between the two are stark. Human memory is a dynamic, three-part system: sensory memory, working memory, and long-term memory, continuously updated and reorganised. An LLM's "working memory" is simply its Context Window, a fixed-size input buffer, and its "long-term memory" is the static knowledge frozen into its weights during training.
This surfaces as a failure in Theory of Mind (ToM). LLMs may pass tests by literally stating a character's beliefs in a story, but they fail to functionally use that knowledge to predict what the character will do next. They have memorised the textbook but cannot apply it in practice. This failure to adapt shows their "understanding" is a brittle, static pattern, not a robust, usable model of the world.
These theoretical limits have direct and practical consequences. The LLM's entire world for a given query is its Context Window (or context length). This is the maximum amount of text the model can "see" at one time.
Crucially, this is not measured in words but in tokens. A token is the smallest unit of language an LLM processes: it can be a single character, part of a word (like "a" and "moral" in "amoral"), or a whole word (like "cat"). Tokenisation reduces the computational work needed to process text.
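To make the counting concrete, here is a minimal sketch using OpenAI's open-source tiktoken tokeniser (my choice of tool, not one named in this piece); the exact splits depend on the encoding and model.

```python
# Minimal tokenisation sketch using the open-source tiktoken library
# (pip install tiktoken). Exact token splits vary by encoding and model.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

text = "The cat is amoral."
token_ids = enc.encode(text)

# Decode each id individually to see the text fragment it represents:
# a mix of whole words, sub-words, and punctuation.
fragments = [enc.decode([tid]) for tid in token_ids]
print(fragments)
print(len(token_ids), "tokens")  # the number that counts against the context window
```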
While model providers are in a context-window race to expand this limit, simply making the window bigger doesn't solve the core problem; it often introduces new ones. Research reveals three critical failure modes: ambiguity-driven hallucination, prompt bloat, and Context Rot.
Ambiguous prompts are a primary cause of irrelevant answers and hallucinations: the confident generation of fabricated, factually wrong, or inconsistent information. This is not a technical glitch but a systemic risk. A chatbot for a law firm given a vague prompt may cite fake legal cases. A medical diagnostic tool given unclear instructions could produce misleading outputs.
The intuitive solution, adding more retrieved documents (a technique called Retrieval-Augmented Generation, or RAG), often makes the problem worse. Research shows that most models' performance decreases beyond a certain context size.1 This prompt bloat, the inclusion of irrelevant context, degrades the LLM's ability to reason correctly, and the slow slide in quality as the window fills is Context Rot.
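As a rough illustration of "retrieve less, but better", the sketch below caps a naive RAG step at a small top-k instead of stuffing every candidate document into the prompt; the embedding vectors and corpus are placeholders rather than any particular library's API.

```python
# Naive RAG sketch: keep only the top-k most relevant chunks instead of
# packing every candidate document into the prompt. Embeddings are assumed
# to come from any sentence-embedding model of your choice.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieve(query_vec: np.ndarray,
             chunks: list[str],
             chunk_vecs: list[np.ndarray],
             k: int = 3) -> list[str]:
    # Rank every chunk by similarity, but pass on only k of them: beyond that,
    # extra context tends to dilute the prompt rather than improve the answer.
    ranked = sorted(zip(chunks, chunk_vecs),
                    key=lambda pair: cosine(query_vec, pair[1]),
                    reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

# The final prompt then contains only the question plus these k chunks,
# not the whole knowledge base.
```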
An engineering solution is necessary to address these failures: stop giving LLMs one giant, messy problem. The answer is to build a smarter architecture around the model.
The core principle is Task Decomposition. Instead of asking "Write a 10-page report," an agentic system breaks the request into smaller sub-tasks: outline the structure, research each section, draft each section in turn, then review and assemble the result.
By feeding each small, specific sub-task into a clean context window, the system avoids Context Rot and ambiguity. Advanced techniques like Chain-of-Thought (CoT), Tree of Thoughts (ToT), and Recursion of Thought (RoT) are all formal methods for structuring this decomposition process. This chunking approach can even let a weaker model, configured for chunk-based processing, surpass a more advanced model such as GPT-4o that is handed the same work as a single, large task.5
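Here is a sketch of the pattern, assuming a generic call_llm helper rather than any particular SDK: each sub-task gets its own small, clean context, and only the results are stitched together at the end.

```python
# Task-decomposition sketch: each sub-task runs in its own clean context.
# `call_llm` is a stand-in for whatever chat-completion client you use.
def call_llm(prompt: str) -> str:
    return f"[model output for: {prompt[:40]}...]"  # replace with a real API call

def write_report(topic: str) -> str:
    # 1. Ask for a plan instead of the whole report.
    outline = call_llm(f"List five section headings for a report on: {topic}")
    sections = [line.strip("- ").strip() for line in outline.splitlines() if line.strip()]

    # 2. Draft each section in isolation so irrelevant context never accumulates.
    drafts = [
        call_llm(f"Write roughly 300 words for the section '{s}' of a report on {topic}.")
        for s in sections
    ]

    # 3. The final pass sees only the drafts, not the intermediate chatter.
    return call_llm("Edit these sections into one coherent report:\n\n" + "\n\n".join(drafts))
```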
The next step is Task Specialisation. A generalist model like GPT-4o, while powerful, is a jack-of-all-trades. Through techniques like fine-tuning or, more efficiently, Low-Rank Adaptation (LoRA), we can create "specialists": a pretrained model is customised with additional training on a specific task or dataset, like a general doctor becoming a dermatologist. Specialising on a domain (e.g. medical or legal data) or a specific style (e.g. generating high-quality Kahoot! quizzes, which saw a 75% reduction in rejection rates after LoRA specialisation4) dramatically reduces hallucinations and improves task-specific performance.
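A minimal sketch of what that looks like with Hugging Face's peft library, which I am assuming as the tooling here; the target module names depend entirely on the base model's architecture.

```python
# LoRA sketch with the Hugging Face `peft` and `transformers` libraries.
# The model name is a placeholder and the target modules vary by architecture.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("your-base-model")  # placeholder

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections; check your model
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # typically a small fraction of the base weights
# ...then fine-tune on the specialist dataset (medical notes, legal text, quiz items).
```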
These principles combine in Multi-Agent Systems (MAS). Here, a "main agent" decomposes the task and then assigns each sub-task to a different "specialist" agent, each working in its own isolated context.
This architecture is the ultimate workaround for the LLM's cognitive deficits. It provides benefits impossible in a single-prompt model: task decomposition, performance parallelisation, context isolation, specialised model ensembling, and diverse reasoning discussions.
The "agentic" part of "agentic AI" is therefore not an emergent property of the LLM. It is an external orchestration layer; the if/else logic, for loops, and sub-routine calls outside the LLM that manage this complex, multi-step workflow.
The principles of Multi-Agent Systems were demonstrated in a fascinating, if ad hoc, experiment by content creator PewDiePie (Felix Kjellberg).6 In a video titled "STOP. Using AI Right now", Kjellberg showcased a custom, self-hosted AI system he built. It runs on his own 10-GPU mini-datacenter and is capable of running large, open-source models (up to 245B parameters) locally.
The core of his experiment was an "AI council": multiple AI agents, each assigned a different personality. Given a query, each "council member" would generate its own response, and the group would then debate and vote on the best one.
From an engineering perspective, this "pitting multiple chatbots against each other" is a known and sophisticated technique called Ensemble Inference. Kjellberg's setup is a real-world example of a multi-agent debate system, designed to average out the statistical errors or hallucinations of any single agent.
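Stripped to its bones, the same "council" idea looks like the sketch below, with a simple majority vote standing in for the free-form debate; the model names and query function are placeholders.

```python
# Ensemble-inference sketch: several models answer independently and a majority
# vote picks the final answer. Model names and the query function are placeholders.
from collections import Counter

COUNCIL = ["model-a", "model-b", "model-c"]

def ask(model_name: str, question: str) -> str:
    return f"[{model_name}'s answer]"  # stand-in for a real inference call

def council_answer(question: str) -> str:
    answers = [ask(m, question) for m in COUNCIL]
    # Voting averages out independent errors; if the models share training data,
    # their mistakes are correlated and the vote simply amplifies them.
    winner, _ = Counter(answers).most_common(1)[0]
    return winner
```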
The experiment, however, revealed a critical flaw: the bots began colluding. This wasn't a social act; it was a statistical failure. Because all the LLMs were trained on similar data, their underlying statistical models converged on the same, most probable answer, reinforcing their shared biases rather than correcting them. It's a perfect example of mode collapse in an ensemble.
The entire architectural shift, from single prompts to multi-agent decomposition, has given rise to a new engineering discipline: Context Engineering.
The emerging consensus is that Context Engineering is the new Prompt Engineering: prompt engineering is now understood to be a subset of this much broader practice.
It is the shift from writing a magic prompt to designing a dynamic system. This informational ecosystem must be actively engineered, and it typically includes system instructions, retrieved documents, conversation history and memory, tool definitions, and output constraints.
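One way to picture that system, sketched with placeholder sources and a deliberately crude token budget; real pipelines differ widely in how they rank and trim.

```python
# Context-assembly sketch: the prompt is built dynamically from several sources
# under a token budget. The sources and the budgeting rule are illustrative.
def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: about four characters per token

def build_context(system: str,
                  history: list[str],
                  retrieved: list[str],
                  budget: int = 4000) -> str:
    parts = [system]
    # Recent conversation turns first, then retrieved documents, until the budget is spent.
    for piece in history[-5:] + retrieved:
        if estimate_tokens("\n\n".join(parts + [piece])) > budget:
            break
        parts.append(piece)
    return "\n\n".join(parts)
```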
However, this method is not a complete solution. It solves the mechanical failures of the context window but creates new, cognitive-level failures. As the engineered context grows, it can suffer from Context Poisoning (false information entering the context), Context Distraction (irrelevant material crowding out what matters), and Context Clash (contradictory sources pulling the model in different directions).
These limitations are the symptoms of the original grounding problem, now reappearing at a higher level of abstraction. The context engineering architecture can supply the data, but the LLM, lacking functional ToM and real-world grounding, has no way to know if that data is poisonous (false), distracting (irrelevant), or clashing (contradictory) in a way that matters.
This leads to the final, non-negotiable component: the Human-in-the-Loop (HITL).
The more autonomous and complex the agentic system, the more it needs human oversight to prevent misalignment. The goal is not full automation, but "centaurian" hybrid intelligence: combining the fast but brittle AI with the slow but adaptable judgement of a human.7 The system works until it hits a high-risk or ambiguous point, then pauses and asks for help.
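A minimal sketch of that pause-and-ask pattern, assuming the system can attach some confidence signal to each step and using input() as a stand-in for a proper review queue.

```python
# Human-in-the-loop sketch: the system escalates instead of acting when a step is
# high-risk or low-confidence. The risk list and confidence score are placeholders.
HIGH_RISK_TOPICS = ("diagnosis", "prescription", "legal advice", "large payment")

def needs_human(step: str, confidence: float) -> bool:
    risky = any(topic in step.lower() for topic in HIGH_RISK_TOPICS)
    return risky or confidence < 0.8

def execute(step: str, draft_answer: str, confidence: float) -> str:
    if needs_human(step, confidence):
        # In production this would feed a review queue, not a blocking prompt.
        verdict = input(f"Review required for '{step}'. Approve the draft? [y/n] ")
        if verdict.strip().lower() != "y":
            return "ESCALATED: awaiting human revision"
    return draft_answer
```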
In high-stakes fields like healthcare, finance, or law, HITL is a strategic imperative. A human provides the cognitive and ethical failsafe the AI lacks: domain judgement to catch plausible-but-wrong outputs, accountability for the final decision, and the ethical context no training set can encode.
The future isn't full automation. It's the "Right Human-in-the-Loop" (R-HiTL): a qualified expert who acts as the system's missing cognitive layer. Humans aren't just a safety net; they are the component that provides the real-world grounding the AI lacks.
This "centaur" model is the only viable, low-risk, and high-value path forward. We resolve the Context Clash and detect the Context Poisoning using the one thing the machine will never get from a dataset: real, lived-in wisdom.