During natural viewing, the eyes are never still. Even during fixation, miniature movements
of the eyes move the retinal image across tens of foveal photoreceptors. Most theories of
vision implicitly assume that the visual system ignores these movements and somehow
overcomes the resulting smearing. However, evidence has accumulated to indicate that
fixational eye movements cannot be ignored by the visual system if fine spatial details are
to be resolved. We argue that the only way the visual system can achieve its high
resolution given its fixational movements is by seeing via these movements.
Seeing via eye movements also eliminates the instability of the image that these
movements would otherwise induce. Here we present a hypothesis for vision, in which coarse
details are spatially encoded in gaze-related coordinates, and fine spatial details
are temporally encoded in relative retinal coordinates. The temporal encoding presented
here achieves its highest resolution by encoding along the elongated axes of simple-cell
receptive fields and not across these axes as suggested by spatial models of vision.
According to our hypothesis, fine details of shape are encoded by inter-receptor
temporal phases, texture by instantaneous intra-burst rates of individual receptors,
and motion by inter-burst temporal frequencies. We further describe the ability of the
visual system to read out the encoded information and recode it internally.
We show how the readout of retinal signals can be facilitated by neuronal
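As a toy illustration of the temporal encoding described above (a sketch with purely illustrative numbers, not taken from the text): when the retinal image drifts at a roughly constant velocity across a row of receptors, each receptor is activated when a luminance edge crosses it, so inter-receptor spacing is converted into inter-receptor temporal phase differences.

```python
def crossing_times(positions_um, edge_start_um, drift_um_per_s):
    """Time (s) at which an edge drifting at constant velocity
    crosses each receptor located at positions_um (micrometers)."""
    return [(x - edge_start_um) / drift_um_per_s for x in positions_um]

# Illustrative numbers: three receptors 2 um apart (order of foveal cone
# spacing), an edge starting 2 um to the left, drifting at 100 um/s.
times = crossing_times([0.0, 2.0, 4.0], edge_start_um=-2.0, drift_um_per_s=100.0)
deltas = [t2 - t1 for t1, t2 in zip(times, times[1:])]

# Each inter-receptor phase difference equals spacing / drift speed,
# so fine spatial offsets are recoverable from relative spike timing.
print(deltas)  # each difference is 2.0 / 100.0 = 0.02 s
```

The point of the sketch is only that, under drift, spatial structure finer than a single receptive field maps lawfully onto timing differences between receptors.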
phase-locked loops (NPLLs), which lock to the retinal jitter;
this locking enables recoding of motion information and temporal framing of shape and
texture processing. A possible implementation of this locking-and-recoding process by
specific thalamocortical loops is suggested. Overall, we propose that high-acuity
vision is based primarily on temporal mechanisms of the sort presented here, and
low-acuity vision primarily on spatial mechanisms.
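The locking behavior attributed to NPLLs can be caricatured by a standard first-order phase-locked loop (a minimal sketch under assumptions not in the text: the "retinal jitter" is reduced to a train of periodic event times, and the loop's only state is its internal period):

```python
def pll_track(events, period0, gain=0.5):
    """First-order phase-locked loop: predict the next input event,
    measure the phase error, and correct the internal period until
    the internal oscillator locks to the input event train."""
    period = period0
    pred = events[0] + period       # first predicted tick
    errs = []
    for ev in events[1:]:
        err = ev - pred             # phase error: + if input leads prediction
        errs.append(err)
        period += gain * err        # loop filter: nudge period toward input
        pred = ev + period          # re-anchor prediction on observed event
    return period, errs

# Periodic input at 0.1 s intervals; loop starts with a wrong period guess.
events = [i * 0.1 for i in range(20)]
period, errs = pll_track(events, period0=0.07, gain=0.5)
print(period)   # converges toward 0.1
print(errs[-1]) # phase error shrinks toward 0 as the loop locks
```

Once locked, the loop's corrected period is itself a readout of the input rate, which is the sense in which locking "recodes" temporal information; real NPLLs, as hypothesized in the text, would additionally have to cope with jittered rather than strictly periodic input.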