The core neural circuit underlying temporal difference learning
Phasic dopamine (DA) release plays a major role in learning by assigning incentive value to reward-associated stimuli. A leading theory proposes that this process is analogous to a reinforcement learning algorithm called temporal difference (TD) learning, with DA acting as the reward prediction error (RPE) term of the TD algorithm. TD assigns reward predictions (values) to states by learning not only from rewards themselves but also from changes in its internal value estimates, a computation that requires an approximate temporal derivative. Although many studies have demonstrated similarities between DA activity and TD errors, including this core derivative-like property, the mechanistic basis of dopaminergic TD learning remains unknown. Here, we demonstrate that the circuitry bidirectionally connecting D1 DA receptor-expressing medium spiny neurons (D1-MSNs) in the lateral nucleus accumbens (lNAc) to lNAc-projecting DA neurons in the ventral tegmental area (VTA) implements key components of TD learning. Specifically, pairing optogenetic stimulation of lNAc DA axons with a preceding odor cue (“opto-conditioning”) potentiated the odor-evoked activity of lNAc D1-MSNs, but not D2-MSNs, and generated signatures of TD RPE in lNAc DA release. In turn, optogenetic stimulation of lNAc D1-MSNs with diverse temporal patterns drove lNAc DA release according to the approximate derivative of the stimulation pattern. Pharmacological inactivation of lNAc altered DA signaling by specifically reducing the reward expectation (value) associated with conditioned odors. Thus, lNAc D1-MSNs and lNAc-projecting DA neurons constitute a minimal TD learning loop. We next investigated whether these properties were specific to lNAc. Surprisingly, although multi-site opto-conditioning pointed to a privileged role for lNAc under our task conditions, the derivative-like computation was a widespread feature of striatal dopamine circuitry outside lNAc, suggesting that diverse dopamine systems may perform analogous learning algorithms in specialized task contexts depending on the relevant state spaces, timescales, or behavioral demands.
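For reference, a standard formulation of the TD error invoked above (notation is the conventional one and is not drawn from the text): with value estimate V, state s_t, reward r_t, and discount factor \gamma,

\[ \delta_t = r_t + \gamma\, V(s_{t+1}) - V(s_t) \]

Between rewards (r_t = 0), the error reduces to \gamma V(s_{t+1}) - V(s_t), a discounted one-step difference of the value estimate; this is the approximate temporal derivative that the optogenetic stimulation experiments probe when DA release tracks the derivative of the imposed D1-MSN activity pattern.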