The spiking of dopamine neurons in animals, and apparently analogous BOLD signals at dopaminergic targets in humans, appear to report predictions of future reward. Prominent computational theories of these responses suggest that they both support and reflect trial-and-error learning about which actions have been successful, based on simple associations with past rewards. This is essentially a neural implementation of Thorndike's (1911) behaviorist "law of effect": the principle that reinforced behaviors should be repeated. However, it has long been known that organisms are not condemned merely to repeat previously successful actions: even rodents' decisions can, under some circumstances, reflect other sorts of knowledge about task structure and contingencies. The neural and computational bases for these additional effects, and their interaction with the putative reinforcement systems in the basal ganglia, are poorly understood.
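To make the reinforcement account concrete, the following is a minimal sketch, in Python, of the model-free temporal-difference update that such theories posit; the variable names and parameter values are illustrative assumptions, not details from the work described here.

```python
import numpy as np

# Minimal model-free (Thorndike-style) learner: a cached value table Q is
# updated from past reward alone, with no representation of task structure.
def td_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One Q-learning step. The prediction error `delta` is the quantity
    that dopamine spiking (and striatal BOLD) is theorized to report."""
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]  # reward prediction error
    Q[s, a] += alpha * delta                         # reinforce the taken action
    return delta
```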
Such interactions are of considerable practical importance because disorders of compulsion in humans, such as substance abuse, are thought to arise from runaway reinforcement processes unfettered by more deliberative influences.
I first discuss how such extra-reinforcement effects (e.g., planning novel routes using cognitive maps, or incorporating "counterfactual" feedback about foregone actions) can be accommodated within the framework of existing computational theories, via algorithms for "model-based reinforcement learning." Rather than learning about actions' past successes directly, such algorithms learn a representation of the task structure and use it to evaluate candidate actions by mentally simulating their consequences. This computational characterization permits reasoning about (and explaining empirical data concerning) the circumstances under which the brain might efficiently adopt either this strategy or the simpler reinforcement strategy. It also makes it possible to quantify and dissociate each strategy's effects on decision making and on associated neural signaling.
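As a sketch of what such mental simulation amounts to computationally, the following Python fragment evaluates a candidate action by finite-depth lookahead in a learned model; the transition and reward arrays, lookahead depth, and discount factor are assumptions chosen for illustration.

```python
import numpy as np

# Model-based evaluation: the agent learns a transition model T[s, a, s'] and
# reward function R[s, a], then scores candidate actions by simulating their
# consequences rather than consulting cached past successes.
def plan_value(T, R, s, a, depth=3, gamma=0.95):
    """Expected value of taking action a in state s, computed by
    recursive mental simulation to a fixed depth."""
    if depth == 0:
        return R[s, a]
    n_states, n_actions = R.shape
    future = sum(
        T[s, a, s2] * max(plan_value(T, R, s2, a2, depth - 1, gamma)
                          for a2 in range(n_actions))
        for s2 in range(n_states)
    )
    return R[s, a] + gamma * future

# e.g., with random placeholder model estimates:
rng = np.random.default_rng(0)
T = rng.dirichlet(np.ones(4), size=(4, 2))   # T[s, a] is a distribution over s'
R = rng.random((4, 2))                       # learned reward estimates
print(plan_value(T, R, s=0, a=1))
```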
Next, I discuss human fMRI experiments characterizing these influences in learning tasks. By fitting computational models to decision behavior and BOLD signals (an approach sketched below), we demonstrate that neither choices nor (putatively dopamine-related) BOLD signals in striatum can be explained by past reinforcement alone; instead, both reflect additional learning and reasoning about task structure and contingencies. That such influences are prominent even at the level of striatum challenges current models of the computations there and suggests that the system is a common target for many different sorts of learning. Additional experiments examine individual variation in the tendency to employ either system; the patterns of both spontaneous and experimentally induced variation suggest that the dominance of model-based decision influences over simpler reinforcement systems relies on cognitive control mechanisms previously studied in other areas of cognitive neuroscience. Finally, I present results showing that patients with several disorders involving compulsion make abnormally reinforcement-bound choices on our tasks, supporting a link between these neurocomputational learning mechanisms and pathological habits.
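To illustrate the kind of model fitting referred to above, here is a schematic Python likelihood for a hybrid learner whose choices mix model-based and model-free action values; the weighting parameter w and inverse temperature beta are hypothetical names for illustration, not the exact parameterization used in these experiments.

```python
import numpy as np

# Hybrid choice model: observed choices are modeled as a softmax over a
# weighted mixture of model-based (Q_mb) and model-free (Q_mf) values.
# Fitting w per subject indexes how strongly each system drives behavior
# (w near 1: model-based dominance; w near 0: pure reinforcement).
def choice_loglik(w, beta, Q_mb, Q_mf, choices):
    """Log-likelihood of choices (ints, one per trial) given trial-by-trial
    value arrays of shape (n_trials, n_actions)."""
    Q = w * Q_mb + (1.0 - w) * Q_mf    # hybrid action values
    logits = beta * Q                  # inverse-temperature scaling
    logp = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    return logp[np.arange(len(choices)), choices].sum()
```

In practice, parameters such as w and beta would be estimated by maximizing this likelihood (or a hierarchical variant of it) over each subject's choice data, so that the fitted weight can serve as an individual index of model-based versus reinforcement-driven control.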

