by Andrew Glennerster and James Stazicker
(Psychology and Philosophy, University of Reading)
In both the neuroscience and the philosophy of spatial perception, it is standard to assume that humans represent a perceived scene in either an egocentric or a world-based 3D coordinate frame, and a great deal of work in both disciplines trades on this assumption in one way or another. Work in Andrew’s virtual reality lab at Reading presents some striking challenges to the assumption. For example:
Subjects presented with an expanding virtual scene made precise (i.e. highly repeatable) pairwise comparisons of the distances to objects. However, there was no consistent depth ordering of the objects that could explain this performance (e.g. A>B>D yet also A<C<D, which together imply both A>D and A<D). No single placement of the objects in a 3D coordinate frame can explain the pattern of judgements, so this behavioural evidence undermines the assumption that subjects represent objects’ locations in a 3D coordinate frame.
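To see why no single depth ordering can accommodate judgements like these, here is a small sketch (the letters follow the example above; the data are purely illustrative, not the lab’s) that searches every possible ordering of the four objects and finds none consistent with all four pairwise comparisons:

```python
from itertools import permutations

# Pairwise distance judgements of the kind in the example above
# (illustrative only): (x, y) means "object x was judged farther than object y".
judgements = [("A", "B"), ("B", "D"), ("C", "A"), ("D", "C")]

objects = ["A", "B", "C", "D"]
consistent_orderings = [
    order for order in permutations(objects)          # order runs farthest to nearest
    if all(order.index(far) < order.index(near) for far, near in judgements)
]
print(consistent_orderings)  # [] -- no single depth ordering fits all four judgements
```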
We explore an alternative type of representation that avoids 3D coordinates other than the retinotopic frame in early visual cortex (where disparity corresponds to depth). Our alternative is closer to the approaches that are being developed in computer vision using reinforcement learning, where learned ‘policies’ generate actions that move an agent from one sensory context to another in a way that receives reward, without ever constructing a 3D coordinate frame.
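As a rough sketch of this reinforcement-learning style of approach (a minimal assumed illustration, not any particular computer-vision system): states are discrete sensory contexts (‘views’), actions are movements between them, and nothing in the learned policy refers to 3D coordinates.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning over 'views' (assumed illustration): the agent
# learns which movement to make in each sensory context to reach a rewarded
# context, without ever constructing 3D coordinates.
transitions = {  # hypothetical environment: (view, action) -> next view
    ("view_A", "step_forward"): "view_B",
    ("view_A", "turn_left"): "view_C",
    ("view_B", "turn_left"): "view_goal",
    ("view_C", "step_forward"): "view_goal",
}
actions = ["step_forward", "turn_left"]
q_values = defaultdict(float)
alpha, gamma = 0.1, 0.9

for _ in range(2000):                                         # episodes of random exploration
    state = "view_A"
    for _ in range(10):
        action = random.choice(actions)
        next_state = transitions.get((state, action), state)  # undefined moves leave the view unchanged
        reward = 1.0 if next_state == "view_goal" else 0.0
        best_next = max(q_values[(next_state, a)] for a in actions)
        q_values[(state, action)] += alpha * (reward + gamma * best_next - q_values[(state, action)])
        state = next_state
        if state == "view_goal":
            break

# The learned policy at view_A: the movement with the highest learned value.
print(max(actions, key=lambda a: q_values[("view_A", a)]))
```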
Our joint project, supported by an AHRC Research Network grant, develops Andrew’s work on the neuroscience of spatial perception, exploring its connections with computer vision and philosophy. For detailed discussion see our preprint article (comments welcome): https://philsci-archive.pitt.edu/13494/. Here we would just like to tell you about some key ideas. Our proposal has broad implications in neuroscience and philosophy, and is relevant to current trends in machine learning for autonomous vehicles and robots acting in a 3D world.
From a neuroscientific perspective, a key advantage of our proposal is that it relies on well-understood mechanisms identified over 40 years ago. In general, from the level of a single neuron up to large neural networks, neurons compare a set of incoming firing rates to a set of synaptic weights stored in the neuron (or neurons), and the output depends on whether the input is sufficiently close to the stored weights. For example, in standard models of the cerebellum, incoming firing rates signal a sensory context which determines the next motor output. In our proposal, this well-understood kind of mechanism is recruited for perception too: perception of the shape of an object or the spatial layout of a scene is, at a neural level, nothing more than a sequence of sensory contexts (different views) joined by actions (head and eye movements), with no construction of 3D coordinate frames required.
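A toy sketch of this kind of mechanism may help (the numbers, action names and the use of cosine similarity are all assumptions for illustration, not the cerebellar models themselves): each stored weight pattern is paired with a motor output, and the incoming firing rates select the output whose stored pattern they match most closely.

```python
import numpy as np

# Toy sketch: stored weight patterns, each associated with a motor output.
# The incoming firing rates trigger the output of the closest-matching pattern.
rng = np.random.default_rng(0)
n_inputs = 10_000

stored_weights = {                              # hypothetical learned patterns
    "turn_head_left": rng.standard_normal(n_inputs),
    "turn_head_right": rng.standard_normal(n_inputs),
    "fixate_object": rng.standard_normal(n_inputs),
}

def next_action(firing_rates, threshold=0.5):
    """Compare the incoming firing rates to each stored pattern (cosine
    similarity stands in for the neural comparison) and return the action
    of the best match, provided the match is close enough."""
    best_action, best_match = None, -1.0
    for action, weights in stored_weights.items():
        match = np.dot(firing_rates, weights) / (
            np.linalg.norm(firing_rates) * np.linalg.norm(weights))
        if match > best_match:
            best_action, best_match = action, match
    return best_action if best_match >= threshold else None

# A noisy version of a stored context still selects the associated action.
noisy_context = stored_weights["fixate_object"] + 0.1 * rng.standard_normal(n_inputs)
print(next_action(noisy_context))  # fixate_object
```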
By contrast, the assumption that the brain generates 3D coordinate-frame representations requires complex transformations between different frames (retinotopic, head-centred, hand-centred, body-centred, world-centred), yet a neural mechanism that could implement these transformations remains elusive.
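For contrast, here is the sort of operation the standard picture requires at just one of those steps, retinotopic to head-centred, sketched under assumed conventions (the axes, angles and numbers are illustrative). Chaining several such steps, each requiring an accurate, continuously updated estimate of eye, head and body posture, is what lacks a known neural implementation.

```python
import numpy as np

# Sketch of one step in the standard picture: re-expressing a retinotopic
# target direction in head-centred coordinates, given the current eye-in-head
# rotation. Axes (assumed): x rightward, y along the line of sight, z upward.
def rotation_about_z(angle_rad):
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])

retinal_direction = np.array([0.1, 1.0, 0.0])      # target slightly off the fovea
eye_in_head = rotation_about_z(np.radians(15.0))   # eye rotated 15 degrees in the orbit

head_centred = eye_in_head @ retinal_direction
print(head_centred)
# Reaching body- or world-centred coordinates would require chaining further
# rotations and translations (head-in-body, body-in-world).
```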
Our proposal also provides an alternative to the assumption, common in philosophy, that the contents of perceptual experience include 3D coordinate frames. Our alternative approach to spatial perceptual experience is ‘embodied’, in the sense that experience of a scene’s layout consists in a sequence of sensory contexts joined by bodily movements. But we argue that the mechanism we describe can nonetheless be understood as a form of representation. Moreover, even though the system of representation we describe is action-based, we argue that it is a system of genuinely perceptual representation.
One reason why we think this is a system of genuinely perceptual representation is that the sequence of movements and contexts we describe is systematically sensitive to distal properties of a scene that are constant through the subject’s movements. For example, take a scene in which three objects are relatively close to you. The distance to these objects explains why, as you move your head to look at them from different directions, the angular separation between the objects’ projections on the retina changes substantially (if all three objects were distant there would be little change). Your resulting sequence of movements and sensory contexts is systematically sensitive to object distance. This, we argue, is one reason to think that the system represents object distance and other distal properties that are constant through the subject’s movements.
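The geometry can be seen quickly in a sketch (with two objects rather than three, and illustrative distances): the same sideways head movement changes the angular separation between nearby objects by roughly a couple of degrees, but leaves the separation between distant objects almost unchanged.

```python
import numpy as np

# Illustrative geometry (assumed distances): how the angular separation
# between two objects changes when the head translates 0.3 m sideways,
# for a nearby pair versus a distant pair.
def angular_separation(viewpoint, p1, p2):
    """Angle in degrees between the directions from `viewpoint` to p1 and p2."""
    d1, d2 = p1 - viewpoint, p2 - viewpoint
    cos_angle = np.dot(d1, d2) / (np.linalg.norm(d1) * np.linalg.norm(d2))
    return np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

head_positions = [np.array([0.0, 0.0]), np.array([0.3, 0.0])]  # before and after the movement

for depth in (1.0, 20.0):                        # objects 1 m away vs 20 m away
    left = np.array([-0.25, depth])              # two objects 0.5 m apart
    right = np.array([0.25, depth])
    before, after = (angular_separation(h, left, right) for h in head_positions)
    print(f"depth {depth:>4} m: {before:.2f} deg -> {after:.2f} deg")
```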
Philosophers have often argued that representation in an allocentric coordinate frame is crucial for objective representation—that is, for representation of properties of a scene which are independent of one’s actual and possible actions and experiences. Since we propose that spatial properties are represented insofar as they contribute to certain sequences of actions and sensory contexts, our proposal might seem ill placed to explain objective representation. However, we argue that in fact our proposal is as well placed here as a theory that postulates allocentric coordinate frames. Briefly, any account of how internal states constitute representations of distal space must appeal at some point to causal relations between distal space and internal states and/or to internal states’ contributions to action. Our more immediate appeal to action and its sensory consequences, as what implements spatial perception, makes the problem of explaining objective representation more vivid but not more pressing.
Our proposal has a related, novel aspect which connects it with recent reinforcement learning approaches to navigation in autonomous vehicles: in our proposal, a context includes motivational or task-dependent signals as well as sensory signals. In terms of vectors, sensory and motivational signals are concatenated to give rise to a higher dimensional description of the context than either sensory or motivational input alone. This means that the same sensory context does not always give rise to the same perception or action. Actions move the current sensory+motivational state to a new one, much as in recent reinforcement learning approaches. This is quite different from moving the representation of the observer from one 3D location to the next in an internal 3D model, which is more like the older ‘simultaneous localisation and mapping’ (SLAM) approach to 3D representation in computer vision. Neuroscientists’ discussion of ‘place cells’ and ‘grid cells’ has often assumed that the animal’s internal representation is of the latter kind. Our proposal may ultimately provide an alternative approach there too.
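In vector terms, the idea might be sketched like this (a minimal assumed illustration, with made-up dimensions, action names and weights): the policy operates on the concatenated sensory-plus-motivational context, so the same sensory input need not lead to the same action under different motivational signals.

```python
import numpy as np

# Sketch: a context built by concatenating sensory and motivational signals.
# The policy scores actions against this higher-dimensional vector, not
# against the sensory signal alone.
rng = np.random.default_rng(1)
actions = ["walk_to_tap", "walk_to_fridge", "stay_put"]

def make_context(sensory, motivational):
    return np.concatenate([sensory, motivational])

def policy(context, weights):
    """Score each action against the concatenated context and pick the best."""
    return actions[int(np.argmax(weights @ context))]

sensory = rng.random(100)                            # one and the same view of the scene
thirsty = np.array([1.0, 0.0])                       # two different motivational signals
hungry = np.array([0.0, 1.0])
weights = rng.standard_normal((len(actions), 102))   # stand-in for learned policy weights

print(policy(make_context(sensory, thirsty), weights))
print(policy(make_context(sensory, hungry), weights))
```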
One reaction to this description of the mechanism runs as follows: ‘This looks to me as if it could instead be described as: the incoming signal is compared to calibration marks, and the output depends on whether the input is sufficiently close to those calibration marks. Personally, I think the computation paradigm is a mistake, and a measurement paradigm might be more appropriate.’
In reply: a single neuron has about 10,000 synaptic weights (simplistically, each of these can be ‘on’ or ‘off’) and about 10,000 neural inputs (again, each one ‘on’ or ‘off’). Suppose that out of 10,000 synapses only 500 are ‘on’ and only 500 of the inputs are ‘on’ (firing). If all 500 firing inputs match up with all 500 ‘on’ synapses, the neuron will fire maximally. That is what we mean when we say the set of incoming firing rates is ‘close to’ the pattern of synaptic weights stored in the neuron. The same idea extends to large numbers of neurons (a network). This might be described as measurement (we would not like to say), but the important thing for the comparison we describe is that it takes place in a high-dimensional space (10,000 dimensions for one neuron, and a much higher-dimensional space for a network of neurons).
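The same toy numbers can be put directly into code (an assumed illustration of the comparison just described, not a model of any particular neuron):

```python
import numpy as np

# Toy comparison: 10,000 binary synaptic weights and 10,000 binary inputs,
# with 500 of each 'on'. The neuron's response grows with the number of
# firing inputs that land on 'on' synapses.
rng = np.random.default_rng(2)
n = 10_000
weights = np.zeros(n, dtype=int)
weights[rng.choice(n, 500, replace=False)] = 1

def response(inputs, weights):
    """Count how many firing inputs coincide with 'on' synapses;
    the response is maximal (500) when every firing input matches."""
    return int(np.sum(inputs & weights))

perfect_match = weights.copy()                       # all 500 firing inputs hit 'on' synapses
random_inputs = np.zeros(n, dtype=int)
random_inputs[rng.choice(n, 500, replace=False)] = 1

print(response(perfect_match, weights))   # 500: maximal response
print(response(random_inputs, weights))   # typically around 25: a chance-level match
```

On these toy numbers a chance overlap is around 25 out of 500 (500 × 500 / 10,000), which is the sense in which a genuine match stands out in such a high-dimensional space.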