5. Joint Attention

When I started work on The Shared World, the innocent plan was to write a straightforward philosophical treatment of joint attention. But it became apparent soon enough that a comprehensive account of the topic required thinking about so much else – demonstrative reference, communication in both its bodily and linguistic forms, knowledge, experience, and ultimately space. So the whole book can be read as an attempt to devise a theory of joint attention. I subscribe to a “thick” view (Racine 2011), on which the capacity to attend to aspects of the environment with others amounts to more than gaze-following; it has a vital role to play in a more general view of the human mind that places its social dimension at the heart of our perceptual and cognitive capacities.

The significance of joint attention for human development has been recognised for decades. It has been thought to facilitate intentional communication (Bates, Camaioni, and Volterra 1975), underwrite the cognitive mastery of meaning through the sharing of experiences (Trevarthen 1980), and hold the key to human cognitive uniqueness (Tomasello 1999). It has also played a crucial role in autism research (Baron-Cohen 1999, Hobson and Hobson 2011). And yet we do not, as far as I know, have anything approaching a well worked-out account of joint attention. You can distinguish between at least four broad takes on the notion, not all of which are necessarily mutually incompatible. There is what I call a “subject-based view,” on which you suppose that the individual perceiver’s access to her own mental life provides the resources that make it possible to attend to third objects with others; simulation-based accounts of joint attention are paradigm instances of this view (Tomasello 1999). Then there is the idea that intersubjective attunements are at the heart of joint attention (Hobson and Hobson 2005, Reddy 2008). Further, you have attempts to spell out joint attention in terms of the possession of a general Theory of Mind (Baron-Cohen 1999); and, finally, the “object-based view” (Campbell 2011), on which knowing what object the other person is attending to is prior to questions about the particular perspective she is bringing to bear on the thing. 

The theory I develop in The Shared World builds on this last recommendation. It tries to account for joint attention by drawing on the notion of social space that I introduced in previous posts. The theory distinguishes between two capacities: there is, firstly, the bodily form of demonstrative communication that children begin to master around their first birthday and that is facilitated by a social spatial framework. And there is, secondly, the ability to linguistically communicate using demonstrative expressions about objects presented in that framework. The first one is possible without the latter (prelinguistic children and perhaps some nonhuman animals master it) and the latter presupposes the former; grouping them together under the label “joint attention” glosses over crucial differences. Joint interaction with third objects in social space is not, perhaps, best thought of as a kind of attention at all. It is more usefully conceived as a form of embodied perception whose objects are presented as offering affordances for joint activity. And the form of attention that underwrites linguistic demonstrative communication and that is made possible by a reflective appropriation of the social spatial framework is really exercised by individuals. It is a version of the kind of perceptual highlighting that makes its objects available for demonstrative thought (Campbell 2002). On my view, then, the term “joint attention,” despite having been enormously influential and useful in shaping the debate about the mind’s socio-cognitive faculties, strictly speaking turns out to be something of a misnomer. 

One question arising for my proposal is how it deals with cases in which the object of linguistic demonstrative communication is placed outside perceivers’ action space. Some recent work in psycholinguistics may shed light on the problem: Peeters, Hagoort, and Özyürek (2015), in an investigation of the spatial meaning of demonstrative terms, find that users of demonstratives identify objects by expressions indicating proximity (“This” rather than “That”) even though they are out of their reach, if placed between participants in a conversation. They suggest a “shared space account,” according to which interlocutors build up such a space throughout a conversation. If this is right, we can be enactivists about joint perception and still explain how we come to demonstratively communicate about objects we could not touch. 

Baron-Cohen, S. 1999. Mindblindness. Cambridge, MA: MIT Press.

Bates, E., L. Camaioni, and V. Volterra. 1975. “Performatives Prior to Speech.” Merrill-Palmer Quarterly 21:205-226.

Campbell, J. 2002. Reference and Consciousness. Oxford: Oxford University Press.

Campbell, J. 2011. “An Object-Dependent Perspective on Joint Attention.” In Joint Attention: New Developments in Psychology, Philosophy of Mind, and Social Neuroscience, edited by A. Seemann, 415-430. Cambridge, MA: MIT Press.

Hobson, P., and J. Hobson. 2005. “What Puts the Jointness into Joint Attention?” In Joint Attention: Communication and Other Minds, edited by N. Eilan, C. Hoerl, T. McCormack and J. Roessler, 185-204. Oxford: Oxford University Press.

Hobson, P., and J. Hobson. 2011. “Joint Attention or Joint Engagement? Insights from Autism.” In Joint Attention: New Developments in Psychology, Philosophy of Mind, and Social Neuroscience, edited by A. Seemann, 115-135. Cambridge, MA: MIT Press.

Peeters, D., P. Hagoort, and A. Özyürek. 2015. “Electrophysiological Evidence for the Role of Shared Space in Online Comprehension of Spatial Demonstratives.” Cognition 136:64-84.

Racine, T. 2011. “Getting Beyond Rich and Lean Views of Joint Attention.” In Joint Attention: New Developments in Psychology, Philosophy of Mind, and Social Neuroscience, edited by A. Seemann, 21-42. Cambridge, MA: MIT Press.

Reddy, V. 2008. How Infants Know Minds. Cambridge, MA: Harvard University Press.

Tomasello, M. 1999. The Cultural Origins of Human Cognition. Cambridge, MA: Harvard University Press.

Trevarthen, C. 1980. “The Foundations of Intersubjectivity: Development of Interpersonal and Cooperative Understanding in Infants.” In The Social Foundations of Language and Thought: Essays in Honor of J.S. Bruner, edited by D. Olson, 316-342. New York: Norton.


  1. Tad Zawidzki

    Hi! I really like this project, and particularly enjoyed the last post. I wonder, regarding your concluding question, whether you know of any research involving virtual social spaces, like multi-player video games. Would these count as outside participants’ action spaces, or would they be virtual action spaces with participants’ egocentric action potentialities somehow projected onto them? I first noticed this phenomenon when my kids had friends over a number of years ago, and they were all interacting in the same “Minecraft” virtual space on different devices, talking about virtual objects of common reference, even though they were on different devices and shared no physical space. It struck me as a really interesting phenomenon, and a potentially powerful tool for empirical studies. The reason is that every move, interaction, and communicative act can be exhaustively, automatically logged, creating a near-perfect source of data. All that’s needed is the applied-theoretical imagination to design experiments. Anyways, have you heard of anyone pursuing empirical questions about joint reference, attention, etc., using such tools?

  2. Axel Seemann

    Hello Tad, very glad you could do something with these posts! There are interesting questions about mind reading and -shaping that arise when you take my suggestions seriously. I don’t explore those in the book at any length (except suggesting that bare-bones joint perception does not require mindreading at all) but hope to do so soon.

    To your question: I’ve come across the odd attempt to simulate joint attention in virtual space, often for clinical purposes (here’s one I found insightful: https://ieeexplore.ieee.org/document/6851182). I’m not familiar with any attempts to think through the capacity for (joint) reference in those terms (though you’d think that sort of study has to exist, for the reasons you mention). One methodological problem for such a study is, I think, that it would seem quite difficult to experimentally capture the difference between gaze following and (joint) attention. A difficulty that arises particularly for a study of the joint case is how you account for the (on my view) absolutely vital notion of perceptual common knowledge – that which makes the environment shared. You can easily think of a scenario in VR where scenes are simulated to two players so that each believes (and believes that the other believes, etc.) that they are attending to the same scene. What additional step (if any) does it take to get from there to the joint case? That’s a vital question, and one (it seems) that a VR simulation that helps us think about joint reference has to address. If anyone reading this knows about studies of this kind, please get in touch!

  3. Axel Seemann

    Oh, and to your actual question (sorry…): in my framework, social space is constituted by two (or more) agents treating a variety of locations as centres of perception and agency. If it’s just me socially triangulating, I am not operating in social space; I could not be operating in such a space on my own, since I could then not act jointly on its objects and demonstratively justify my perceptual claims about them to others. That’s why communication, with its success condition of a spatial kind of perceptual common knowledge, matters so much. The single agent who (if my speculations about the multilocation hypothesis happen to be right) is integrating the kinds of sensory information required for motor action at a location not occupied by herself may well (falsely) believe she is operating in social space; but if there is no corresponding triangulation carried out by a co-operator she could not demonstratively communicate about the thing, and that’s what it takes to be a social agent in my sense.

    It’s an interesting question how to think about the location that a single agent treats as another’s viewpoint. As far as non-reflective action space goes, I think the best answer is that it is simply a failed attempt to establish co-operation in social space – it’s not strictly speaking an exercise in ordinary, run-of-the-mill triangulation, the sort you can carry out in allocentric space, but it’s not successful social triangulation either. But you can certainly treat a location not currently occupied by another agent as a centre of perspective once you have reflectively appropriated the social spatial framework and thus enjoy a conception of space, in which (along my broadly Evansian lines) a variety of places in objective space can be treated as centres of egocentric space.

    • Tad Zawidzki

      Hi Axel. Thanks for the reply. I guess what struck me most about these multi-player virtual social spaces is that participants use referring expressions in the same way as they would interacting in a physically real, common social space. E.g., in the “Minecraft” interactions I observed among my kids and their friends, they would cooperate on joint projects, e.g., building some structure out of virtual Minecraft “blocks”, and would ask each other, e.g., to “help me bring some blocks from over there”. But “over there” doesn’t refer to any physical location. Each participant is on her own device, viewing a different visual representation, on her own screen, of the “same” virtual object location, to which they all refer using demonstrative pronouns. It’s kind of weird. They’re not looking at the same physical *token* of the object location, since each is viewing a different icon on her own screen. But neither are they referring to an object location *type*: in the virtual social space in which their avatars are interacting to help each other on a cooperative building project, the object location is a *virtual token* around which they’re all coordinating, using demonstrative communicative acts. They’re somehow projecting the cognitive resources they use to coordinate in real, physical, public, social spaces, onto this virtual public social space.

      • Axel Seemann

        Ah I see now – yes, interesting. One thing I don’t know is whether their respective simulated spaces each represent the other actor(s) in them, or whether it’s just a representation of objects grouped around the player and there is an assumption amongst players that each is acting with simulations of the same environment. If the latter, you might think that they are each operating in their own representation of action space in which they use demonstrative terms in the same way in which you’d use them in ordinary individual action and thought, and it’s just the assumption that their spaces represent the same environment that makes them suppose that demonstrative communication works here. If the former (so they each see the other’s avatar moving about and acting in their space, and the avatar is responsive to what they do and say themselves), you could think that they are each operating with a simulation of social space. Because what is being simulated is social space (and because there is still an assumption that each player is working with a simulation of the same shared space, though from different perspectives), the players think that they can demonstratively pick out objects at the same locations in these simulated spaces (“same” relative to procedures of triangulation carried out in both/all spaces) in communicatively successful ways. And because the assumption is true, the speaker’s use of the demonstrative actually picks out the intended referent and the hearer recognises the speaker’s referential intention. What’s different from operating in actual social space is that the players could not have common knowledge of where these referents are – they could, at best, only have the kind of knowledge that is regressive in the way that troubled Schiffer and other people. 
        Perhaps it is actually more accurate to say that they only have (more or less accidentally) true perceptual beliefs about what each player knows, since there is no perceptual knowledge they could have (not even individual knowledge) about what the other sees. And of course the communication can only work because there is not only an assumption of sameness of simulated social space but because there actually are other players outside the simulation. Not sure this is quite right, but it’s a first shot – thanks for this, it’s a great example for thinking through social space.

        • Tad Zawidzki

          Yeah, I thought you’d like it, given your last post. Typically, in the cases I have in mind, avatars are “visible” to each other… I see why it’s troublesome to claim they have perceptual access to the simulated virtual objects/locations. But they act as if their avatars do. It’s a simulation of perceptual access to a social space relativized to a virtual social space defined by avatars and objects/locations on which they can jointly act. I’m thinking you don’t need an infinite regress of mental state attributions to get this off the ground. You just need to deploy the same resources you use in real social spaces in some kind of pretend/simulated way… Perhaps it’s akin to an example Dennett often uses regarding workers who learn to handle hazardous materials using remote-controlled robot arms. After a while, the robot arm feels like an extension of one’s own arm, and one claims to be able to feel the heft of an object held by the robot hand. Somehow normal, physical perceptuo-motor resources and associated phenomenology are projected onto an artificial appendage one can control seamlessly. Perhaps something similar happens with the cognitive resources employed in social spaces when one interacts as an avatar with other avatars in a virtual social space…

Comments are closed.
