By:
Will Bridewell, Naval Research Laboratory
Alistair M.C. Isaac, University of Edinburgh
(View all posts in this series here.)
In part 1, we motivated an apophatic methodology for the science of consciousness. The basic idea was to take a model’s success at reproducing some set of consciousness-relevant phenomena as negative evidence: evidence that these phenomena are jointly insufficient for producing consciousness. We argued that this methodology allows quantifiable progress in consciousness science without committing to any particular metaphysics of consciousness.
There are other occasions for reasoning apophatically, however, such as when the target phenomenon is difficult to define or demarcate. In these cases, we might take the attitude of Potter Stewart who, as a US Supreme Court justice in a landmark First Amendment case, famously wrote of hard-core pornography, “I know it when I see it.”
Progress in artificial intelligence (AI) is driven by Stewartian convictions.
In the 1950s, optimism emerged for a future of machines able to do all the things that people can, only better and without complaining. By the 1960s, the dream was to replace not only manual labor, but intellectual labor as well. Machines, after all, could reason. A brief glance through Computers & Thought (1963), the first widely read collection of articles on AI, reveals programs that play chess, prove theorems, answer questions written in natural language, and recognize hand-written characters. And let’s not forget “GPS, a Program that Simulates Human Thought,” which teaches us that “the free behavior of a reasonably intelligent human can be understood as the product of a complex but finite and determinate set of laws.”
As the field developed, however, AI came to re-evaluate these early attempts at reproducing “intelligence.” The recognition that tasks taken to be the hallmarks of intelligence can be solved through brute-force computation tempered the community’s understanding of intelligence itself. Beating the world’s best chess player or proving mathematical theorems cannot be sufficient evidence for intelligence, since these feats can be accomplished through mere mechanical search by computers that are otherwise inept. As the marks of intelligence became more refined, the goal seemed to move farther away.
The overarching effect of these difficulties was to fractionate AI into specialized subdisciplines. Interpreting handwriting and categorizing images became pattern recognition. Answering questions over a fixed domain became natural language processing and database access. Problem solving gave way to automated planning. The strategy driving these shifts was to limit the scope of AI in ways that enabled benchmarks of success to be clearly specified and technological progress to be positively measured.
Today, the rise of large language models (LLMs) like OpenAI’s GPT-4 and Google’s PaLM 2 is steering the discipline back toward general intelligence and Stewart’s dictum. At the most basic level of description, LLMs repeatedly predict the most likely next word of text, yet they display a remarkable capacity to generate fluid, coherent, and relevant responses to user input. More impressively, the most capable LLMs appear to have made considerable inroads on the “commonsense” problem in AI, producing consistently plausible, if not always factually accurate, content. Are these systems intelligent? How would we know?
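To make “repeatedly predict the most likely next word” concrete, here is a minimal, purely illustrative sketch of the autoregressive loop such models run. The vocabulary and probability function below are hypothetical stand-ins for what an LLM learns from data; real systems like GPT-4 operate over subword tokens and typically sample from the predicted distribution rather than always taking the single best word.

```python
# Purely illustrative: greedy next-word generation.
# `vocabulary` (a list of words) and `next_word_probability(context, word)`
# are hypothetical stand-ins for a trained language model.

def generate(prompt, vocabulary, next_word_probability, max_new_words=50):
    words = prompt.split()
    for _ in range(max_new_words):
        # Choose the word the model rates as most likely to come next.
        best = max(vocabulary, key=lambda w: next_word_probability(words, w))
        words.append(best)
    return " ".join(words)
```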
An apophatic approach would assess the intelligence-relevant behavior that an LLM can reproduce, declare these behaviors insufficient evidence for intelligence simpliciter, and identify new intelligence-relevant behaviors that should be incorporated in the next iteration of the model.
We see something like this pattern in “Sparks of Artificial General Intelligence,”[1] an unreviewed report on studies of GPT-4 by employees of Microsoft Research. While the surface rhetoric of the paper is positive, enthusiastically declaring that GPT-4 “attains a form of general intelligence,” the authors ultimately stress its “lack of planning, working memory, ability to backtrack, and reasoning abilities.” Crucially, the upshot of the paper is to revise the definition of intelligence to incorporate intuitively relevant capacities manifestly absent from GPT-4’s behavior.
Schematically, the “Sparks” analysis takes the following form:
- The authors begin with a vague definition of intelligence: “broad capabilities…, including reasoning, planning, and the ability to learn from experience…at or above human-level.”
- Instead of a formal evaluation, the authors pursue a “subjective and informal approach,” confident that, whatever intelligence is, they, like Justice Stewart, will know it when they see it.
- After an exhaustive examination of GPT-4, the authors conclude that it is not really intelligent (framed with a positive twist by the “sparks” qualifier).
- They then identify characteristics of intelligence that are not part of the working definition, such as the ability to recognize fabrications or inconsistent reasoning.
Compared with the apophatic method, there are clear similarities. In particular, success at satisfying an existing criterion for intelligence is not deemed adequate for intelligence itself. Rather, such success invites the modelers to formally specify further, intuitively intelligence-relevant behaviors and to demand that future models instantiate these as well.
One crucial difference between the science of consciousness and AI is the latter field’s belief that “intelligence” can be operationally defined; that it is, at heart, computational. In contrast, operational definitions of consciousness face a “zombie” objection: any consciousness-indicative behavior can (supposedly) be conceptualized as consciousness-free. For good or ill, the AI community holds out hope for a well-defined pattern of behavior that is “zombie proof,” such that any agent exhibiting it is necessarily and inarguably intelligent.
Each historic success in AI has shown that our definition of intelligence is inadequate, but by constantly shifting the goal posts for success, the field seemingly makes progress. Will the finish line one day be crossed? Whether or not a consensus answer is ever reached, one thing is clear: it is the naysayers who drive AI models forward into increasing sophistication through the systematic identification of intelligent behavior still found lacking in each successive model.