In my final post I consider whether all our visual cues to scale function at the level of cognition rather than vision, and the kind of theory that ‘vision without scale’ would imply.
1. Physiological Cues
There are two aspects to 3D vision: shape and scale. So far we have discussed questions regarding shape; specifically, which cues contribute to the perception of 3D shape, and which cues merely contribute to its cognition? But the question of scale is just as important; namely, can we differentiate a small object up close from a large object far away?
It is at this stage that vision science typically shifts from optics (analysis of the retinal image) to physiology (analysis of muscular sensations). Since Kepler (1604) and Descartes (1637) the emphasis has been on two specific physiological cues to distance:
1. Vergence: The closer an object is, the more the eyes have to rotate towards each other in order to fixate upon it.
The general consensus is that vergence is a relatively strong cue to distance, at least in reaching space (with a change of 10cm in the vergence-specified distance of the target equating to an average perceived change of 8.6cm: see Mon-Williams & Tresilian, 1999), whilst accommodation is a relatively weak cue to distance in reaching space (with a change of 10cm in the accommodation-specified distance of the target equating to an average perceived change of 2.7cm: see Fisher & Ciuffreda, 1988). The orthodox position is therefore that the visual system uses vergence to scale the distance, size, and even the depth, of objects in the visual scene (Foley, 1980).
But the worry I explore in my book is that experiments on vergence as a distance cue typically rely on having subjects in a dark room, and then suddenly presenting them with objects as close as 20cm, which they then have to suddenly converge upon. Testing vergence in this way is going to introduce a number of confounding cues:
1. Diplopia: Initial double vision of the stimulus, which is known to affect distance judgements (Morrison & Whiteside, 1984).
2. Change in the Retinal Image: The subject is going to see the stimulus going from double to single as they converge upon it.
3. Proprioceptive sensation: Subjects will be directly aware of the physical sensation of their eyes suddenly rotating.
In my book I suggested that these confounding cues for vergence, as well as the equivalent confounding cues for accommodation (initial defocus blur in the stimulus before the eye has had a chance to accommodate), might be functioning as cognitive cues to distance, enabling subjects to judge distance based on (a) the maximum distance in the apparatus, minus (b) some distance depending on the amount of diplopia or blur initially seen.
Since vergence is the most well-established distance cue in vision science, this was the most controversial claim in my book. But since the publication of the book I’ve had the chance to evaluate this hypothesis experimentally, and what we find is very interesting:
1. I replicated Mon-Williams & Tresilian (1999) whilst removing diplopia, and the 12 subjects I tested fell into two categories: 10 subjects with virtually no benefit from vergence (average perceived change of 0.75cm for every 10cm change in the vergence-specified distance) and 2 subjects with a very high correlation (average perceived change of 9.87cm for every 10cm change in the vergence-specified distance).
2. The two high performing subjects seemed to be responding to vergence / accommodation conflict in the apparatus (the fact that their eyes were focused at one distance, but crossed at a different distance: https://www.wired.com/2015/08/obscure-neuroscience-problem-thats-plaguing-vr/), so I ran a second experiment with this conflict largely eradicated, and the overall average performance was very modest (1.6cm for every 10cm change in the vergence-specified distance), and even this modest performance doesn’t appear to be visual (instead, subjects reported they could feel their eyes tensing when the target was at close distances).
For those that are interested, this work is summarised in a poster: https://linton.vision/download/vr-poster.pdf (Frontiers in Virtual Reality, University of Rochester, June 2018).
2. Do We See Scale?
But if these experiments show that vergence and accommodation are ineffective cues to distance, then what else could be providing the distance information that the visual system needs to scale the scene? Both al-Haytham (c.1021) and Helmholtz (1866) appealed to the size of familiar objects, but Descartes (1637) argued that the size of familiar objects only functions as a cognitive cue to distance, and since Gogel (1969) this has generally been recognised to be the case. As Gogel explains:
‘The distance reports made under these conditions probably should be considered to be inferential rather than perceptual, i.e. more determined by the tendency to report that an object is distant because its perceived size is small than to report the distance actually perceived.’
Which leads to the following thought: What if all of our distance cues are either (a) ineffective (vergence, accommodation), (b) merely relative (angular size, diplopia), or (c) merely cognitive (familiar size)? In what sense could it really be claimed that we ‘see’ scale, rather than merely cognitively infer it post-perceptually?
3. Vision Without Scale
But how should we make sense of the idea of vision without scale? A good starting place is Emmert’s Law: Emmert (1881) observed that if we see a flash, experience an afterimage, and then look at a real-world scene, the after-image is seen as being at the distance we fixate: the afterimage looks like a small patch up close when we fixate on our hands, and a large patch far away when we fixate on the horizon.
But under my account Emmert’s Law gets things the wrong way around: Instead of (1) seeing objects in the scene at various fixed distances, and then the plane of fixation (and therefore the afterimage) jumping back and forth in distance with fixation, we (2) always see the plane of fixation (and therefore the afterimage) as being at the same fixed phenomenal depth, and it is the objects in the scene that jump back and forth relative to this fixed phenomenal depth as we change our fixation. Such an account is equally consistent with Emmert’s observation.
But it would be a category mistake to try and attribute a single fixed physical distance to the fixed phenomenal depth of the fixation plane. Instead, the physical distance we attribute to the fixation plane is constantly changing on the basis of a number of cognitive cues: (1) the size of familiar objects, (2) even the size of unfamiliar objects (i.e. how much of the visual field they take up: see Tresilian, Mon-Williams, & Kelly, 1999), (3) the amount of blur in the visual field (as employed in tilt-shift miniaturisation: https://commons.wikimedia.org/wiki/File:Jodhpur_tilt_shift.jpg) and also (4) how flat the scene looks (recalling that binocular disparity falls off with distance).
But does this theory have any intuitive appeal? I think so. When we watch a movie, the movie is always at the same fixed phenomenal depth: the screen never appears to move towards us or away from us. And yet the fact that the movie is always at the same fixed phenomenal depth in no way impedes our enjoyment of the film: the movie can transition from wide-panning shots of a cityscape to close-ups of a face without this transition being jarring. Attributing a variety of different scales to scenes viewed at a single fixed phenomenal depth is therefore something that seems natural to humans.
Where it does become a problem is if we have to integrate vision with other senses (e.g. touch) that do have a sense of scale. This might help to explain why it is relatively easy to break the link between tactile location and visual location in the rubber hand illusion.
I want to thank John Schwenkler for giving me this opportunity to explore visual space and the perception / cognition divide. I hope these posts have convinced you that only by redrawing the boundary between perception and cognition can we hope to resolve the tensions and inconsistencies in visual space.
al-Haytham, Ibn. (c.1021). Kitab Al-Manazir (Book of Optics). In A. I. Sabra (Ed.), The Optics of Ibn al-Haytham (1983). Kuwait: National Council for Culture, Arts and Letters.
Descartes, R. (1637). ‘Optics.’ In Cottingham, Stoothoff, & Murdoch (eds.), The Philosophical Writings of Descartes, Vol.1 (Cambridge: Cambridge University Press, 1985).
Emmert, E. (1881). ‘Größenverhältnisse der Nachbilder.’ Klinische Monatsblätter für Augenheilkunde und für augenärztliche Fortbildung, 19, 443-450.
Fisher, S. K., & Ciuffreda, K.J. (1988). ‘Accommodation and apparent distance.’ Perception, 17(5), 609-21.
Foley, J. M. (1980). ‘Binocular distance perception.’ Psychological Review, 87(5), 411-434.
Gogel, W. C. (1969). ‘The effect of object familiarity on the perception of size and distance.’ Quarterly Journal of Experimental Psychology, 21(3), 239-47.
Helmholtz, H. von (1866). Handbook of Physiological Optics, Vol.3 (Leipzig: Leopold Voss).
Kepler, J. (1604). Optics (Santa Fe, NM: Green Lion Books, 2000).
Mon-Williams, M., & Tresilian, J. R. (1999). ‘Some recent studies on the extraretinal contribution to distance perception.’ Perception, 28(2), 167-181.
Morrison, J. D., & Whiteside, T. C. (1984). ‘Binocular cues in the perception of distance of a point source of light.’ Perception, 13(5), 555–566.
Tresilian, J. R., Mon-Williams, M., & Kelly, B. M. (1999). ‘Increasing confidence in vergence as a cue to distance.’ Proceedings of the Royal Society B, 266(1414), 39-44.