Coelho Mollo and Millière: The Vector Grounding Problem

Post on “The Vector Grounding Problem” for the Brains Blog

Dimitri Coelho Mollo & Raphaël Millière

We first preprinted “The Vector Grounding Problem” in April 2023, about four months after the release of ChatGPT. By that point, large language models (LLMs) had started to capture the attention of philosophers, but there was still very little published work on the topic. The preprint languished on arXiv for longer than we initially planned, but it lived a life of its own and generated some interesting discussions. Since then, the philosophical literature on LLMs has grown considerably, and LLMs themselves have evolved: they’ve become more capable, more multimodal, more often embedded in tool-using systems, and fine-tuning has become more and more central for their capabilities. We finally came around to finding the paper a good home last year, and we’re glad that it has now appeared in its final form (and in great company!) in this special issue of Philosophy and the Mind Sciences. The delay between the preprinting of the first version and the publication of the final version now gives us a chance to reflect on what we argued in the paper.

Stevan Harnad’s influential symbol grounding problem asked how the symbols that classical AI systems manipulate could acquire meaning, rather than merely inheriting it from human interpreters. LLMs don’t manipulate symbols in the traditional sense: they process sequences of tokens as high-dimensional vectors, and those vectors are transformed by learned algebraic operations. Still, Harnad’s worry returns in a new form: if an LLM is trained only on text, are its vectors ever about dogs, uranium, elections, or rainstorms? In other words: do some of the internal states and outputs of LLMs (and similar systems) represent anything outside patterns in language, despite being trained only on text, and for producing more text? This is what we call the vector grounding problem as a nod to Harnad’s classic paper.

Work on this question can get confusing in several ways. First, “meaning” is used to, well, mean different things: mental content, speaker meaning, conventional meaning, cognitive content, etc. In the paper, we focus on the contents of internal representations, and on the meaning of the outputs they causally contribute to producing. Making this clear also avoids a second potential confusion: having internal states with representational content does not entail having a mind, understanding language, or being conscious. Finally, “grounding” itself can be many things. In the paper, we distinguish between five different kinds of grounding. We identify one as the most fundamental, and thus most central to Harnad’s grounding problem: referential grounding, which captures how a representation hooks onto what it represents.

Our proposal, in line with mainstream theories of representation, is that referential grounding requires two broad ingredients. The first is a causal-informational relation: a state of the system must carry information about something in the world, perhaps through a long and indirect chain. The second is a suitable history of selection: the state must have been selected (through evolution, learning, or training) because its carrying that information helped the system’s persistence and/or reproduction.

How could LLMs satisfy these conditions? Start with the causal-informational side. LLMs are trained on human-produced text, and human language is shaped by perception, action, social coordination, and cultural history. Because text is produced by creatures who live in the world, it bears the world’s imprint. Training on text thus gives LLMs indirect causal-informational links to the things language is used to talk about. This point is sometimes missed because those links are mediated by human beings, but we constantly rely on indirect causal chains ourselves: we learn about quarks, ancient cities, and so on through testimony, diagrams, instruments, and books. LLMs’ complete dependence on such mediation, however, represents a substantial difference from the biological case.

When it comes to the selection requirement, our most straightforward argument appeals to fine-tuning. Modern chatbots are fine-tuned (further trained) to conform to specific norms such as helpfulness, harmlessness, and factual accuracy. When a model’s internal states are selected to produce more accurate answers, the relevant success conditions depend on how the world is. Internal states that help the model produce true answers persist through this further training because they carry information that matters to success in the task. In such cases, we argue, those states fulfill the second condition on referential grounding.

We also argue, more tentatively, that training on text prediction alone may sometimes be enough for referential grounding. This is most plausible in formally constrained domains. A model trained only to predict legal moves in a board game, for example, may develop internal states that track the board state because doing so helps it predict the next move. Mechanistic interpretability research on LLM-like systems trained on board games indicates that they can indeed encode board positions in ways that causally affect their outputs. In certain domains, prediction objectives may select for internal states that track the structure generating the data.
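
For readers unfamiliar with probing, here is a minimal sketch of the basic idea behind such interpretability work. It is a toy illustration under loud assumptions, not the method of any particular study: the "hidden states" are synthetic vectors we generate ourselves (not activations from a real model), one fixed direction is stipulated to encode whether a single board square is occupied, and the function names (`make_states`, `train_probe`, and so on) are our own. A simple linear probe is then trained to decode the square's status from those vectors.

```python
import random


def make_states(n, direction, rng, noise=0.5):
    """Synthetic stand-in for model activations (an assumption of this sketch).

    Each hidden state is the encoding direction, sign-flipped according to
    whether the square is occupied, plus Gaussian noise on every dimension.
    """
    samples = []
    for _ in range(n):
        occupied = rng.random() < 0.5
        sign = 1.0 if occupied else -1.0
        hidden = [sign * d + rng.gauss(0.0, noise) for d in direction]
        samples.append((hidden, occupied))
    return samples


def probe_predict(weights, bias, hidden):
    """Linear probe: predict 'occupied' when the weighted sum is positive."""
    return sum(w * h for w, h in zip(weights, hidden)) + bias > 0.0


def train_probe(samples, dim, epochs=20, lr=0.1):
    """Fit the probe with the perceptron rule (update only on mistakes)."""
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for hidden, occupied in samples:
            if probe_predict(weights, bias, hidden) != occupied:
                step = lr if occupied else -lr
                weights = [w + step * h for w, h in zip(weights, hidden)]
                bias += step
    return weights, bias


def accuracy(weights, bias, samples):
    """Fraction of samples whose occupancy the probe decodes correctly."""
    hits = sum(probe_predict(weights, bias, h) == y for h, y in samples)
    return hits / len(samples)


rng = random.Random(0)
DIM = 16  # hypothetical hidden-state width
direction = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
train = make_states(400, direction, rng)
held_out = make_states(200, direction, rng)
weights, bias = train_probe(train, DIM)
print(f"held-out probe accuracy: {accuracy(weights, bias, held_out):.2f}")
```

High held-out accuracy shows only that the information is linearly decodable from the states. It does not show that the system uses it: the claim that board encodings causally affect outputs rests on intervening on the states of an actual model and observing how its predictions change, which this sketch does not attempt.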

Our account has a surprising implication: multimodality and embodiment are neither necessary nor sufficient for referential grounding. A model that can take image pixels as input still needs the right learning history to represent more than patterns in pixel space. Likewise, a language model bolted onto a robotic body to translate natural language commands into low-level action commands doesn’t acquire new referential powers merely because another subsystem moves through the world. What matters is whether the system’s internal states have been selected for carrying information that guides successful behavior.

Since we wrote the first version of the paper, the most interesting cases of multimodal and embodied AI have shifted from models that merely receive pixels or issue commands through a separate controller to models whose training actually couples perception, action, and success. Recent “vision language action” models, such as Google’s Gemini Robotics and Physical Intelligence’s π-series, are trained on combinations of images, language, proprioceptive states, action trajectories, and high-level task annotations. In these systems, visual and bodily states aren’t simply extra inputs appended to a language model after the fact; at least some internal states are selected and stabilized because they help the system choose actions that make the world come out a certain way.

This motivates a more general question: what is the right metasemantics for artificial neural networks (ANNs)? We deliberately appealed to mainstream theories of representation, which were developed to account for biological (cognitive) systems. But it may turn out that the best metasemantics for ANNs is substantially different. What shape(s) such a metasemantics would take is far from clear, but any good metasemantics for ANNs should preserve the explanatory roles that make representation worth positing in the first place: it should distinguish genuine content from mere correlation, explain how error and misrepresentation are possible, tell us which internal states are the relevant vehicles, and help us predict what will happen when we intervene on those states.

New agentic systems also raise interesting questions. Many AI systems now combine one or more LLMs with retrieval, external memory, tool use, code execution, and sometimes action in a computer environment or in the physical world. In such systems, what is the bearer of representational content? The base model? The temporary state of the whole scaffold? A planning module? A tool-using loop stretched over time? We may need a more modular metasemantics for AI systems, one that lets different components acquire content through different histories, functions, and roles in the larger architecture.

Finally, there is another possibility we find especially worth taking seriously and that we only hinted at in the paper. If LLMs and related systems have internal states with content, that content need not line up neatly, or at all, with human concepts. Given their “alien” training histories, alien stabilization and selection pressures, and alien success conditions, their representational contents may also be partly or fully alien. A model may track features of the world that matter for prediction, reward, or control without carving things the way we do, the human origin of the training data notwithstanding. If it is true that LLMs have content, what kind(s) of content do they have?

2 Comments

  1. Meaning Is Not a Matter of Causal Connectivity
    Coelho Mollo and Millière ask whether the internal states of large language models can be referentially grounded. The question is misconceived. It presupposes that meaning and representation are in principle explicable through causal and functional relations, and that what remains is merely to identify the right ones. But this is precisely the philosophical commitment that needs to be argued for, not assumed. Understanding meaning is not a matter of causal connectivity, whether symbolic or vectorial. It is a matter of intentionality, and intentionality cannot be captured by causal relations alone.
    The distinction between symbols and vectors, which the paper treats as a conceptual advance, is irrelevant at this level. In both cases, meaning enters the system from outside, through attribution: by the humans who produce the training corpus, set the optimization targets, and interpret the outputs. Causal chains, selection histories, and fine-tuning may show that internal states correlate with features of the world. But correlation with the world is not the same as understanding. A thermometer correlates with temperature without understanding temperature. Adding causal complexity changes nothing in principle. Machines do not understand meaning, regardless of how elaborately their internal states are causally connected, because understanding is a phenomenon that no causal description exhausts. The vector grounding problem is not a harder version of the symbol grounding problem. It is the same mistake in a more sophisticated idiom.

  2. Wojciech Kryszak

    Dear Authors,

    Fully agreeing with Wolfgang Stegemann’s comment, I would like to add that the possibility you highlight in the last paragraph is just a special variant of an idea presented and defended by Donald David Hoffman. Leaving aside his general Conscious Realism theory, I do see strong resemblances between what you are worried about in this paragraph and some features of his Interface Theory of Perception. But this hinges upon a strong IF: If it is true that LLMs have content….

    Now back to the main concern of the first comment. I have felt a similar unease about the general attitude of such research, but, having no professional training in this area, I would not have been able to write about it so clearly. Nonetheless, I dare to hope that what I write below is not too naive and can be of some use in shedding more light on this basic unease.

    You write: “What matters is whether the system’s internal states have been selected for carrying information that guides successful behavior.”
    Well, the internal states of any PID controller are selected for the same reason, for guiding successful behaviour (“carrying information” is so tricky a concept that I would like to avoid it for now). Where is the difference? Both systems are fine-tuned by human beings. Is it so crucial that one system is shaped directly and with almost full understanding of the possible effects, while the other is fine-tuned (a huge exaggeration, I know, pace AI engineers) haphazardly? In both cases it is the human being who judges the success. As Wolfgang said: “Adding causal complexity changes nothing in principle”. So, do we need something other than human judgement?

    I would like to avoid excursions into the land of natural-like selection for artificial systems, which is too dangerous for a layperson like me, so let me introduce a Gedankenexperiment inspired by your example about chess, where you ask about the plausibility and probability of internal states representing chessboard states (even if indirectly).

    Let’s consider a kind of chess game, but with a much bigger chessboard (and maybe slightly modified rules). Let some LLM be given the rules of this game and a sequence of moves by both players. The question asked, however, is a very atypical one. Instead of asking about legal moves, we ask: What is the probability that the players will marry soon after this game? The same as for two random people playing chess (at the same level)? Higher? Lower?

    This is a very special sequence, and I am not able to produce it, but in principle it can be prepared. It is very unlikely that any game would be played this way if at least one player (black) were really aiming at checkmate, but this is unimportant. For the sake of argument, let’s pretend this just happened. The sequence of moves in this game, and this is le clou on which the whole experiment hinges, brings (some subgroup of the remaining) white pieces into a formation that, to the black player, looks like a heart sign (maybe with I and U on either side).

    I would be really impressed if any LLM-like system gave the correct answer (i.e. “Higher”). Of course, this is rather a test of the capacity for abstraction, not a test for having some representation of chessboard states (necessary for passing the test?).
    (Well, I have to admit to having been surprised by LLM capabilities many times before, so I would be more surprised if this concept, or a similar one, had not already been proposed somewhere on the English-speaking internet, perhaps hundreds of times. At least some variant of this funny story has already been published on my old blog, but in Polish and for a different purpose.)
    OK, enough chess-like fun for now.

    In the other paragraph I read: “multimodality and embodiment are neither necessary nor sufficient for referential grounding. A model that can take image pixels as input still needs the right learning history to represent more than patterns in pixel space. Likewise, a language model bolted onto a robotic body to translate natural language commands into low-level action commands doesn’t acquire new referential powers merely because another subsystem moves through the world.”
    This complaint is a bit obsolete now, but 3 years ago I would have asked: Agreed, but why don’t you consider both powers (embodiment and multimodality) coupled? Why can’t we have any feedback, backreaction, strange loop? Why is everything that even the simplest bug is equipped with for playing the game of life excluded? (I know, for the sake of argument, but I ask generally.)
    Now, reading the subsequent paragraphs, it seems to me that you are leaning towards admitting that such coupling to the environment could be a gamechanger, some day.

    At the end of the day, the problem of such research is purely philosophical, for we can (and should) ask: how is it that the potentials and chemical compounds that the biochemical systems inside our own skulls manipulate somehow acquire meaning? (How would you translate your research program to be meaningful for biological brains?)

    Best Regards
    Wojciech Kryszak
