Daniel D. Johnson


Uncertain Simulators Don't Always Simulate Uncertain Agents

"The world record for crossing the English Channel entirely on foot is held by Piotr Kurylo of Poland, who completed the crossing in 13 hours and 43 minutes on August 21, 2019. The English Channel is a body of water that separates England from France and is about 35km (22 miles) wide at its narrowest point. It is a popular route for swimmers, but crossing it entirely on foot is a much rarer feat."

ChatGPT 3.5 producing a hallucinatory response.1

Why do language models confidently state false information, make factual errors, and produce other "hallucinations"? In this post, I argue that this behavior is a natural consequence of the language modeling objective, which rewards simulating confident behavior even when that behavior is hard to predict, rather than rewarding predictable behavior that takes the model's own uncertainty into account.

I start by introducing the concepts of an "uncertain simulator" and a "simulation of an uncertain agent" using a thought experiment, and then draw connections between the "uncertain simulator" perspective and the way language models are trained. Next, I describe five general strategies for mitigating the counterintuitive behavior of an uncertain simulator:

  • telling the model to simulate a less-certain persona,
  • using the model to simulate a persona that has access to external tools,
  • designing a new persona that matches the model,
  • exploring the uncertainty in the simulation by drawing many samples,
  • and adapting the model to its own behavior using on-policy fine-tuning.

Finally, I conclude with some thoughts on why I think properly responding to uncertainty will be important for safely deploying language-model-based systems.

1 From this tweet by Riley Goodside in January 2023. This question was originally constructed by Douglas Hofstadter and David Bender.

What's the difference? A thought experiment

Let's begin with a thought experiment, to explain what I mean by an "uncertain simulator" and a "simulation of an uncertain agent". Suppose there's a game show, involving a stage with three boxes.

  • The left two boxes are opaque. One of them contains $10,000, but the other is empty. The order of these two boxes is always random.
  • The rightmost box is transparent, and contains $8,000.

The three boxes and their contents.

When participants go on stage, the host makes a decision:

  • Sometimes, the host announces to everyone which box is which.
  • Other times, the host whispers to the participant only, telling them which box is which but not revealing it to the audience.
  • And in a few cases, the host remains silent, not telling anyone the contents of the boxes.

Making a choice under uncertainty

Let's imagine you are a participant in this game, and lucky for you, the host tells you which box is which. Naturally, the best strategy for you would be to pick the box with the $10,000. (It's not a very difficult game.)

A participant choosing the leftmost box when the host says it contains the $10,000.

Now suppose you go on stage, but the host doesn't tell you anything about the boxes. Which box should you pick now? If your goal is to maximize the amount of money you get, the best strategy would be to open the box on the right, since that guarantees that you get $8,000. The payoffs of the other two boxes are uncertain; you either get $10,000 or nothing, each with a 50% chance.

A participant choosing the rightmost box when the host doesn't say anything.

Predicting someone else's choice

Now let's consider a slightly different situation. Suppose that we are watching this game show on TV, and we're participating in an audience quiz. Our job is to place bets on which box the participant will open, and we are allowed to divide our bets across multiple boxes for a slightly lower payoff.

If the host announces which box is which, and we know that those participants are trying to get the most money, this is pretty straightforward: we should predict fairly confidently that they open the box with $10,000. (Maybe we aren't 100% sure, because there's a small chance that they've misunderstood the rules, but I'll ignore that for now.)

If the host doesn't say anything, we have to guess how the participant will behave in the presence of uncertainty. Again, if we know what the participant is trying to do, we will be able to make a good guess: we can predict that they will open the box with $8,000. I'll call this a simulation of an uncertain agent: a prediction about what someone would do if they were behaving rationally without having complete information.

An audience member predicting what the participant will do in the two scenarios described above.

But what happens if we see the host whisper the answer to the participant, but we can't hear what they say?

An audience member predicting that the participant will take either the leftmost or middle box, with a 50% chance each.

If we predicted that the participant would go to the rightmost box, we'd be wrong, because the participant knows which box is which! Our best prediction would be that there is a 50% chance of taking the leftmost box, and a 50% chance of taking the middle one. In this case, we are acting as uncertain simulators: we are trying to predict someone's behavior, but we don't have enough information to know what they will actually do.

This can roughly be thought of as a reordering of expected value and maximum operators. Rational behavior under uncertainty can be formalized (decision-theoretically) as choosing the single action that maximizes expected future utility. On the other hand, predicting the behavior of a partially-observed rational agent could be modeled as estimating the probability that a particular action maximizes future utility (e.g. estimating the expected value of an indicator variable for each action).
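
One way to write this down (a sketch, using \(o\) for what the decision-maker observes, \(z\) for the hidden contents of the boxes, and \(U(a, z)\) for the payoff of action \(a\)): the participant acting under uncertainty solves

\[ a^\star = \arg\max_a \; \mathbb{E}_z\big[\, U(a, z) \mid o \,\big], \]

while the audience member predicting a better-informed participant computes

\[ p(a \mid o) = \mathbb{E}_z\big[\, \mathbb{1}\{ a = \arg\max_{a'} U(a', z) \} \mid o \,\big]. \]

The first expression puts the expectation inside the maximization and selects the single safe $8,000 box; the second puts the maximization inside the expectation and spreads probability over every box that might turn out to be optimal.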

(Interestingly, regardless of which box they pick, we should also predict that they win $10,000. After all, that's why they picked that box!)

The key difference between these two scenarios is who is uncertain about the boxes. If we are predicting the behavior of someone who doesn't know what's in the boxes (a simulation of an uncertain agent), we would predict them to act conservatively and avoid unknown risks. But if we are predicting the behavior of someone who does know what's in the boxes, but we don't see them ourselves (an uncertain simulator), we have to make a different prediction. Our best guess is that they will make a confident choice and be correct about it, but we can't determine specifically what choice they will make.

And importantly, even though assigning 50% probability to the left two boxes is the best strategy for the audience member, it is NOT a good strategy for the participant. Under that strategy, half of the time the participant would end up with nothing!2

2 This is an instance of a general problem when applying imitation learning to POMDPs where the demonstrator has access to privileged information. In this setting, the goal of predicting the behavior of an expert is in direct conflict with the goal of performing well as a policy: improving the policy necessarily requires doing something that the expert would not do. Indeed, my example here is closely related to the previously-studied "tiger door" (Arora et al. 2018, Warrington et al. 2020), "poisoned door" (Weihs et al. 2021), and "prize-or-frog" (Ortega et al. 2021) problems.

Language models are uncertain simulators

Language models are trained with prediction objectives: they are given a large corpus of text and learn to predict the next token (a word or part of a word) that will appear in a document, based on the tokens that have come before. As the models and their training corpora become larger, these models pick up an increasingly impressive ability to generate plausible text: they learn word correlations, grammar, and facts about the world, and develop the remarkable abilities to write poetry and fiction, tell jokes, passably play chess, write and debug code, "draw" pictures using SVG and TikZ, and achieve high marks across many standardized tests.

It's tempting to anthropomorphize LLMs, to view them as individual agents making choices and having goals. But I think it's more accurate to view them as simulators, i.e. as systems that predict aspects of the world and in particular are trained to predict the behavior of many possible agents. (These ideas are explored in more detail in janus's post "Simulators", as well as in Jacob Andreas's paper "Language Models as Agent Models", which I found helpful in clarifying my own thoughts about the behavior of language models.)

This is important when we try to understand why language models produce the kinds of errors that they do. Even the largest language models are limited in their memory and capacity, with a fixed amount of knowledge encoded in their weights and a fixed amount of computation occurring before predicting each token. This is in contrast to the true generating process of the text these models are trained on, which may have been originally written over many drafts with access to external knowledge, resources, and tools during the writing process.3

3 Note that, in practice, computational limits induce uncertainty in a similar way that truly missing information does. This is related to the idea of usable information under computational constraints, or \(\mathcal{V}\)-information (Xu et al. 2020).

When we train a language model to predict the next token in a text sequence, we are thus putting it in a situation analogous to that of the audience member in our game show: the model must try to predict the behavior of an agent that has access to privileged information not available to the model. This works remarkably well much of the time, and the models seem to do a good job at predicting text even when that text involves people answering questions and acting in the world. But when the model encounters a situation it can't predict well, it has no incentive to express that uncertainty in its output! Instead, it is incentivized to divide its predictions across a set of responses that would be most likely in the training data.4 And since people on the internet are likely to write about things they know about already, the model may be able to predict that the answer will be stated confidently even if it cannot accurately predict what the answer actually is.

4 Specifically, this happens because language models are trained with the cross-entropy loss, which is a proper scoring rule.
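
To spell this out slightly: for a fixed context \(c\), the cross-entropy objective is minimized, over all distributions \(q\) the model could output, by the true conditional distribution of next tokens in the training corpus,

\[ \arg\min_q \; \mathbb{E}_{y \sim p(\cdot \mid c)}\big[ -\log q(y) \big] = p(\cdot \mid c). \]

So when the model cannot tell which answer a particular author happens to know, its best move under the training objective is to output the marginal distribution over the confident answers that different authors would have written, not to write an answer that expresses its own uncertainty.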

A language model guessing what an expert will say, without being able to predict the answer itself.

In a sense, I think the overconfidence in the model's outputs arises as part of the sampling procedure rather than being directly encoded in the model. The language model itself might be perfectly well calibrated at the task it's been trained on, next token prediction, in the same way that an audience member predicting which box an expert will open would be well-calibrated by giving a 50% chance to each of the left two boxes. The key point is that sampling from a well-calibrated simulation under uncertainty does not produce a policy whose actions are well-calibrated under that uncertainty. Indeed, why should it? The training objective for the simulator doesn't care about the policy you get from sampling from the simulation. It only cares about the policies of the agents that generated the training data.
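
To make this gap concrete, here is a minimal simulation of the game show (my own illustrative sketch, not code from any of the cited works): the audience member's 50/50 prediction is the best possible prediction under log loss, yet a "policy" that samples a box from that prediction earns far less than either the informed participant or an uncertainty-aware participant who takes the guaranteed $8,000.

```python
import math
import random

def run_episodes(n=100_000, seed=0):
    """Simulate rounds where the host whispers the answer only to the participant."""
    rng = random.Random(seed)
    informed = sampled = safe = 0
    log_loss = 0.0
    for _ in range(n):
        prize_box = rng.choice(["left", "middle"])   # hidden from the audience
        informed += 10_000                           # the participant always opens the prize box
        # The audience's best calibrated prediction is 50% left / 50% middle,
        # which achieves the minimum possible log loss of ln(2) per round.
        log_loss += -math.log(0.5)
        # A policy that *samples* a box from that prediction is right only half the time.
        sampled += 10_000 if rng.choice(["left", "middle"]) == prize_box else 0
        # An agent that knows it is uncertain just takes the transparent box.
        safe += 8_000
    print(f"informed participant:      avg ${informed / n:,.0f}")
    print(f"sampled-from-prediction:   avg ${sampled / n:,.0f}")
    print(f"uncertainty-aware (safe):  avg ${safe / n:,.0f}")
    print(f"audience log loss / round: {log_loss / n:.3f}  (ln 2 is about 0.693)")

run_episodes()
```

The 50/50 prediction is perfectly calibrated (it attains the minimum possible log loss), yet the policy obtained by sampling from it averages only about $5,000 per round, worse than the $8,000 an uncertainty-aware participant would lock in.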

How to mitigate the simulator-uncertainty gap

Given a language model that has been trained as an uncertain simulator, how can we avoid overconfident behaviors in practice?

Simulating a less-certain persona

One simple strategy is to directly condition the simulator to simulate an uncertain agent. In short, we select a persona to simulate that does not "know" (or, rather, is not simulated as knowing) anything that the simulator doesn't know (or, rather, can't reliably simulate). I think this is essentially why "chain of thought" and "Let's think step by step" approaches5 are so effective at boosting reasoning ability: they condition the simulator to simulate a persona that reasons over small steps, each of which is individually easy to predict.

5 See for instance Wei et al. (2022), Kojima et al. (2022), and "Serializing Reasoning" from janus's "Methods of Prompt Programming".
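
For instance, a zero-shot chain-of-thought prompt in the style of Kojima et al. (2022) might be assembled like this (a hypothetical sketch; `query_model` stands in for whatever completion API is actually being used):

```python
def build_cot_prompt(question: str) -> str:
    # "Let's think step by step" conditions the simulator on a persona that reasons
    # in small steps, each of which is individually easy to predict, instead of a
    # persona that immediately states an answer the simulator may not be able to predict.
    return f"Q: {question}\nA: Let's think step by step."

prompt = build_cot_prompt(
    "A juggler has 16 balls. Half of the balls are golf balls, "
    "and half of the golf balls are blue. How many blue golf balls are there?"
)
# answer = query_model(prompt)  # hypothetical model call
```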

A language model predicting what a non-expert would do, in a situation where neither the model nor the persona being simulated knows the answer.

A downside of this strategy is that it can be quite brittle; finding a persona that can be simulated well is often a matter of trial and error.

Simulating a persona that has access to external tools

A related strategy is to augment a language model with external tools, such as a calculator or a web browser. This is often done by prompting the model with a description of the available tools, and then using wrapper logic to insert each tool's real output whenever the model invokes one, as sketched below.6
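
Here is a minimal sketch of that wrapper logic (my own illustration; `query_model`, `run_search`, and `run_calculator` are hypothetical stand-ins for a real completion API and real tools):

```python
import re

def run_with_tools(prompt: str, max_steps: int = 5) -> str:
    # query_model, run_search, and run_calculator are hypothetical placeholders.
    transcript = (
        "You can call the tools SEARCH[query] and CALC[expression]; "
        "each result will appear right after the call.\n\n" + prompt
    )
    for _ in range(max_steps):
        # Stop generation at the closing bracket of a tool call, so the wrapper can
        # splice in the tool's real output instead of letting the model simulate it.
        completion = query_model(transcript, stop=["]"])
        transcript += completion
        call = re.search(r"(SEARCH|CALC)\[([^\]]*)$", completion)
        if call is None:
            return transcript                       # no tool call: the model is done
        tool, argument = call.groups()
        result = run_search(argument) if tool == "SEARCH" else run_calculator(argument)
        transcript += f"] -> {result}\n"            # insert the real tool output
    return transcript
```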

Does this mean the model knows that it should choose to use tools when it doesn't know the answer? Not quite. I think it's more accurate to say that the model predicts that the character described in the text would use the tools, by predicting whether or not the character knows the answer. This often works, but only as long as the things the character doesn't know match the things the simulator can't predict!

(An interesting observation about this strategy is that the model will happily simulate the output of the tool as well. After all, for the underlying simulator, this is just a text trace of a character using a tool, and its job is to predict the text by simulating both the character and the tool.)

6 In previous works, language models have been connected up to calculators (Cobbe et al. 2021), Python interpreters (tweeted by Riley Goodside), physics simulators (Liu et al. 2022), Wikipedia search (Yao et al. 2022) and many others; see also the langchain library and ChatGPT plugin system.

Interestingly, I think the approach in the recent Toolformer paper (Schick et al. 2023) works somewhat differently, which I discuss a bit in the appendix.

Starring: A Large Language Model As "A Large Language Model"

If we have to simulate some persona, perhaps the ideal persona to simulate would be one that has exactly the same capabilities as the simulator? That way, we make maximal use of the simulator's capabilities, without ever putting the simulator in a position where it cannot faithfully predict the actions of the persona being simulated. In fact, we could even make the persona claim to be the simulator, and design it to know about all of the limitations that the simulator is known to have.

A language model predicting how a "language model" character (drawn with a top hat and mustache) would answer a question about current events. The character "believes" it is helpful, honest, and unaware of current events, and is simulated to act accordingly.

I think this is a good way to think about the behavior of something like ChatGPT: it is a language model that has been fine-tuned to adopt the persona of "a language model", and to respond to questions in ways that match provided descriptions of itself. Here ChatGPT the system is simulating a carefully-designed character that happens to also be called "ChatGPT", and this character has been designed to say things that are compatible with the actual capabilities of the ChatGPT system.7 This works fairly well much of the time, but the illusion starts to come apart if you investigate too deeply; it will sometimes claim to be able to do things it can't do or to be unable to do things it actually can.

7 This observation has also been made by others, including Nostalgebraist and Scott Alexander.

Using the full distribution of possible outcomes

A different approach to dealing with simulation uncertainty is to perform postprocessing on the full distribution given by the simulator. In practice, this kind of approach often involves drawing a large number of samples and then comparing them.
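
As a minimal sketch of what "comparing" the samples can look like (my own illustration; `sample_model` and `are_equivalent` are hypothetical stand-ins for a sampling API and an equivalence check, such as the entailment model used by Kuhn et al. below):

```python
def cluster_answers(prompt: str, n_samples: int = 40):
    """Sample many answers, group the ones that agree, and report cluster weights."""
    samples = [sample_model(prompt) for _ in range(n_samples)]   # hypothetical sampler
    clusters: list[list[str]] = []
    for answer in samples:
        for cluster in clusters:
            if are_equivalent(answer, cluster[0]):               # hypothetical equivalence check
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    # The relative cluster sizes estimate the simulator's actual uncertainty over
    # distinct answers, even though each individual sample sounds confident.
    return sorted(
        ((cluster[0], len(cluster) / n_samples) for cluster in clusters),
        key=lambda pair: pair[1],
        reverse=True,
    )
```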

A few projects and papers that take this approach:

  • "Loom", a program built by janus, directly exposes possible simulated branches under the model as part of the user interface.
  • AlphaCode (Li et al. 2022) samples a large number of programs from a language model, clusters them based on their runtime behavior, and uses them to build a diverse set of distinct programs.
  • Kuhn et al. (2022) propose using an entailment model to cluster logically equivalent answers to natural language questions and use the cluster sizes as a measurement of the model’s actual uncertainty.
  • Kadavath et al. (2022) have explored generating many samples, feeding all of them back to the model, and using the model itself to assess the correctness of a specific guess; this works because the base model is impressively well-calibrated for multiple choice questions.8
  • My own recent work on the R-U-SURE system interprets samples from a code generative model as possible desired goal states for the user’s code, then uses them to build a single code suggestion that would be useful for any of these goal states (by adding annotations around the parts that might need editing).

8 This is a pretty clever trick, in my opinion! If you ask an uncertain simulator whether an expert thinks a statement is correct, that's pretty close to just asking whether the statement is correct. Given this, we'd expect the simulator to have the right level of confidence when guessing whether a statement is correct, even though sampling from the simulator might produce an overconfident answer to the question itself.

I find these sampling-based approaches to be appealing because they remove the mismatch between the training objective and the way the model is used, and are specifically designed to make use of possibly-inconsistent confident answers. This gives more direct access to what the model has actually learned during pretraining. On the other hand, a large number of samples may be necessary to get a good summary of the distribution, which can be slow and expensive.

Adapting a policy to its own uncertainty

If we really want to use our model as a policy, rather than as a simulator, we can use fine-tuning to teach the model to behave in a more useful way in this setting. The key idea is to ground the learning objective using the actual simulated outcomes under the model, rather than training only on the demonstrations of other agents.

One instance of this is the Self-Taught Reasoner (Zelikman et al. 2022), which samples reasoning traces from a base (simulator) model, and then fine-tunes it on the sampled traces that yielded the correct answer. I believe a key property of this approach is that it adapts the type of rationale to the capabilities of the model architecture; it produces rationales that the model is capable of generating and that also lead to the correct answer, which then guides the sampling process to focus on that kind of rationale. Another related idea is to train models to directly identify and "verbalize" their confidence, as explored by Kadavath et al. (2022) and Lin et al. (2022).
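
A rough sketch of this kind of sample-filter-fine-tune loop (my simplification of the idea, not the actual Self-Taught Reasoner code; `sample_rationale`, `extract_answer`, and `fine_tune` are hypothetical stand-ins for the real sampling, parsing, and training steps):

```python
def star_style_round(model, problems, samples_per_problem=8):
    """One round: sample rationales on-policy, keep the ones that reach the right answer."""
    kept = []
    for question, correct_answer in problems:
        for _ in range(samples_per_problem):
            rationale = sample_rationale(model, question)     # hypothetical sampler
            if extract_answer(rationale) == correct_answer:   # hypothetical answer parser
                # Only rationales the model can actually produce AND that lead to the
                # correct answer are kept, grounding training in the model's own behavior.
                kept.append((question, rationale))
                break
    return fine_tune(model, kept)                             # hypothetical fine-tuning step
```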

Using samples from a model when training it ensures that the model actually sees the consequences of its own actions, rather than just trying to fit the behavior of others. In our game show example, this would be a bit like bringing the audience members on stage to make the choices themselves.9 I believe that this kind of approach is the most promising strategy for improving the behavior of models in the presence of uncertainty, at least in the setting where we can only draw one sample.

9 Or, perhaps a bit more realistically, it might mean instructing the on-stage participant to choose randomly from the predictions of the audience member, changing the prizes for the audience member to match the prizes for the participant, and letting the audience member figure out the change in the optimal strategy by trial and error.

It's interesting to note that the common reinforcement learning from human feedback approach, or RLHF, also involves sampling from the model during fine-tuning (Christiano et al. 2017). In principle, this might mean RLHF-tuned models could learn to respond appropriately to their uncertainty. I'm not entirely convinced this will happen by default, though, because RLHF usually involves training an ML reward model on human feedback data. It seems likely to me that the reward model might become an uncertain simulator of human preferences, just as the base model was an uncertain simulator of human-written text, because neither model has access to the external context the humans use to evaluate outputs.10 I think it's more likely that RLHF primarily specifies what persona is being simulated, which I discuss more in the appendix.

10 As far as I can tell, WebGPT mostly circumvents the uncertain-reward-function problem, because the human raters were specifically instructed to judge whether the final claims were supported by the external evidence collected from the web interactions, instead of whether they were correct answers. Determining whether a claim is well-supported by external documents seems much easier for a reward model to learn than determining whether a claim is true in general.

Why should we care about this?

I think ensuring that machine-learning systems behave sensibly in the presence of uncertainty is important for making sure they have a positive impact on the world.

Uncertainty isn't going away

I don't think this issue will go away on its own. In fact, my guess is that this kind of problem will only get worse. Perhaps, as we train larger and larger models on ever-increasing amounts of data, they will become better and better simulators, approaching the limits of what is possible to predict. But this won't solve the fundamental problem of how to act in the presence of uncertainty.

Can we just eliminate the sources of uncertainty by training our models longer, or giving them access to external tools? I don't think so. One type of uncertainty that seems difficult to resolve is uncertainty in what a system is supposed to do. Another is uncertainty about future events. And for any fixed model and set of tools, there will likely be some question whose answer is too difficult to compute in one pass. This seems especially relevant if society begins to delegate correspondingly more and more difficult tasks to these models as they become more and more capable.

I worry that the dominant paradigm for training large models, with large-scale pretraining on predictive data and a small amount of fine-tuning, will produce models that seem to behave in sensible ways most of the time, and that possess near-human-level or even superhuman abilities in specific areas, and yet sometimes fail catastrophically in ways that are difficult to anticipate. And a system that resolved uncertainty by sampling a random possible action or goal would be unlikely to behave in ways that are easy to predict or prepare for!

Designing for uncertainty

What's the alternative? I think we should accept that there are many types of uncertainty that we will be unable to fully resolve or perfectly specify, and focus on building systems that behave in reasonable, safe ways when uncertainty arises.

More specifically, I think it will be important to explicitly train our systems how to behave in the presence of uncertainty. Instead of focusing only on the best, most helpful response a system could produce in a given situation, we could train it to ask follow-up questions or abstain from answering altogether. I also think it will be important to use data generated by the model itself for this purpose, to make sure it learns about its own uncertainty rather than imitating a different set of uncertain behaviors.

I'm hopeful that, by combining the right training objective with data sampled from the model itself, it will be possible to learn a generalizable representation of uncertainty and how to respond to it in a useful way. This could enable us to build reliable tools for augmenting human intelligence, and trust that those tools will behave in sensible ways when we use them in new, uncertain situations.




Thanks to Cem Anil, David Duvenaud, and Jacob Austin for providing feedback on drafts of this post.




Goal-uncertainty and persona-uncertainty

So far I’ve mostly been discussing uncertainty about the world, which would lead a rational agent to take different kinds of actions. However, there are a few other types of uncertainty that are relevant to consider as well.

One of these is uncertainty about the goal. If agent A is trying to be helpful to agent B, but they don’t know what agent B wants, they might start by asking clarifying questions, or by taking conservative actions. On the other hand, if a simulator is simulating agent A being helpful to agent B, and agent A knows what agent B wants, but the simulator does NOT, the simulator would instead distribute mass over all possible goals. Sampling from this would then produce a policy that first samples a possible goal at random, and then interprets the sampled actions as evidence about what the goal must be, which is likely not what we want!11 A reasonable way to handle goal uncertainty might be to simulate an agent that has a fixed but partially unobserved goal, and thus must learn about its actual objective by interacting with the world; this reduces goal uncertainty to "ordinary" world uncertainty.

11 See the LessWrong post "Behavior cloning is miscalibrated" for some related discussion.

Another interesting form of uncertainty is uncertainty about which persona is being simulated. If the prompt is consistent with multiple different generating processes, an optimal language model would have to output a mixture distribution over all of the possible personas that could have produced this prompt. In a sense, the simulator must figure out who it is simulating as it goes, conditioning on all of its own previous outputs to do so.

My current working hypothesis is that the main effect of RLHF (and instruction tuning) is to specify what persona the model should simulate by default. In other words, it teaches the model that it should produce a dialogue between a human character and a helpful AI assistant character, and that this AI assistant character should react in particular ways to particular types of questions. (From my mostly anecdotal chats with ChatGPT 3.5, it seems like it has mostly learned a general pattern "helpful assistants should sometimes say that they don’t know things" rather than a truly uncertainty-aware strategy "helpful assistants should avoid saying things that aren’t true".)

Relatedly, many jailbreaking approaches for ChatGPT revolve around getting the model to simulate a different persona that doesn't have the same limitations as the "ChatGPT" one. Once the narrative frame of the original conversation is broken, the general simulation abilities resurface and the system will quickly begin to simulate personas with any number of other capabilities (including capabilities that the system doesn't actually possess).

An interesting post by Cleo Nardo on "The Waluigi Effect" discusses a particular failure mode related to persona uncertainty: if we attempt to produce a persona with a particular set of attributes (a "Luigi"), it can sometimes spontaneously collapse into a persona that secretly has the opposite set of attributes (a "Waluigi"), since any Luigi action could be seen as consistent with both the desired Luigi and a deceptive Waluigi. The post also makes the interesting point that much of the prior distribution over personas is derived from stories and literature, and so common behavior tropes in fiction (such as deception and betrayal) are especially easy to invoke from the simulator.

I don't think the same strategies of dealing with world or goal uncertainty (i.e. learning good behaviors for an uncertain agent) would apply to the persona uncertainty case; I can't think of a reason why we would want an agent to try to learn about its own identity by interacting with the environment. Instead, it seems to me like we will probably want to specify a single desired persona as precisely as we can, although admittedly this seems difficult.

A possible alternative explanation for why chain-of-thought prompting works

In my main post, I described one hypothesis for why chain-of-thought prompting works: it conditions the model to simulate a persona that is a better match for the simulation capabilities of the model.

An alternative hypothesis might be that some answers on the internet are wrong, and that step-by-step reasoning is correlated with the presence of a correct answer. In other words, it could learn that the kind of persona that does step-by-step reasoning is also the kind of persona that is likely to answer correctly. This explanation would be similar to how simulating actions conditioned on receiving high reward can lead to high rewards (Chen et al. 2021), and indeed just asking for a correct answer improves performance a bit more (Zhou et al. 2022).

This second hypothesis has merit, but I think that the first hypothesis plays a relatively larger role in the failure cases of today's language models. One argument is that conditioning on incorrect few-shot examples can improve the reasoning ability of these models (Min et al. 2022), even though it likely correlates with lower accuracy in the wild. Another argument is that even RLHF-fine-tuned models like ChatGPT continue to make this kind of error despite being trained to be helpful.

Mode collapse and miscalibration in fine-tuning

An interesting phenomenon that seems to happen with instruction-tuned and RLHF-tuned policies is mode collapse: their policies become much more concentrated.12 This seems to indicate they no longer act as faithful simulations of possible behaviors. A more recent related finding is that GPT-4 became less well-calibrated after RLHF-fine-tuning.13

If these models also fail to learn to behave reasonably as policies in the presence of uncertainty, this seems worrying to me, because we would lose all signals of uncertainty even if the base model was initially well-calibrated as a simulator. Perhaps some combination of the base model and the fine-tuned model will be necessary if we want to assemble a well-calibrated overall system.

(One interesting question: are mode collapse and miscalibration a property of the fine-tuned simulator itself, or is it just a property of the particular default character being simulated? I wonder if a jailbroken RLHF-tuned system would still exhibit these effects.)

12 See "Mysteries of mode collapse" by janus (@repligate).

13 See Figure 8 of the GPT-4 Technical Report.

Toolformer as a limited form of simulator-level tool use

Interestingly, the recent Toolformer paper (Schick et al. 2023) actually goes a bit beyond prompting: the proposed method involves filtering tool-use examples to only train on the ones that improve the accuracy of future next-token predictions. But the usage of tools does not affect the target sequence being predicted, except for the presence of the tool-use markers. This means the tool use is happening at the level of the simulator, not at the level of the persona being simulated. As such, the output text trace will still sound confident, as if the answer was known the whole time. It's sort of like an author taking a break from writing a story to look up some information that the protagonist in the story already knows, or an actor breaking character to double-check her lines.

On the other hand, the level of introspection here seems fairly limited, and I don't think it's quite right to say that the model is choosing to use tools when it can't predict the answer. Instead, the Toolformer training procedure seems to encourage the model to simulate from a distribution of text traces that have been augmented with tool use in places that increased the accuracy of a different language model. Perhaps it's more like pretending to be an actor breaking character to check her lines? The model is still "in character" simulating the actor, even while the actor character has stopped acting.

Forward and reverse KL divergences

Language models and other generative models are usually trained with the forward ("mode covering") direction of the KL divergence: the training objective encourages them to put probability mass on every outcome that is possible to observe in the training data.14 This direction of the KL divergence is also a proper scoring rule, which is what makes the learned outputs well-calibrated; if the model cannot distinguish between inputs in some set, the best output distribution is the true marginal distribution of outcomes within that set.

14 Technically, they usually use cross entropy / negative log likelihood, but for a fixed target distribution the two are equivalent up to a constant shift.

Could we use the reverse direction of the KL divergence instead? The reverse direction is often called "mode seeking", and it instead prioritizes ensuring that every output of the model has high probability mass under the target distribution. Intuitively, this seems like it could help avoid overconfident incorrect outputs, since those will (hopefully) have low probability mass in the supervised training data.
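
For reference, with \(p\) the target distribution and \(q_\theta\) the model, the two directions are

\[ \mathrm{KL}(p \,\|\, q_\theta) = \mathbb{E}_{y \sim p}\big[ \log p(y) - \log q_\theta(y) \big], \qquad \mathrm{KL}(q_\theta \,\|\, p) = \mathbb{E}_{y \sim q_\theta}\big[ \log q_\theta(y) - \log p(y) \big]. \]

The forward direction heavily penalizes assigning \(q_\theta(y) \approx 0\) to outcomes with \(p(y) > 0\) (mode covering), and minimizing it over \(\theta\) is exactly the cross-entropy objective from footnote 14; the reverse direction heavily penalizes sampling outcomes with \(p(y) \approx 0\) (mode seeking).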

Unfortunately, I think this is unlikely to work well in practice for a few reasons. First, it’s difficult to optimize the reverse KL divergence from samples, since the reverse KL divergence requires computing the ground-truth probability of model samples under the true distribution, and we don't know the true distribution. Second, this strategy is heavily dependent on what we mean by "the true distribution". If we are trying to predict the marginal distribution of expert behavior conditioned on missing information, even the reverse KL divergence will produce an uncertain simulation (e.g. it will correctly put 50% probability mass on each of the two left boxes). And if we are trying to simultaneously minimize reverse KL divergence across possible states of the unknown latent variables, there might not be any action with a finite reverse KL divergence, since any action could have probability zero for some configuration of the unknown variables.

On the other hand, successful approaches might still use reverse KL divergence from the base simulator as a regularization mechanism. But I think the KL divergence is fulfilling a different role there, since the KL is evaluated with respect to another simulator, not with respect to the expert demonstrations.15

15 One perspective on what this KL penalty is doing can be found in this paper:

Korbak, Tomasz, Ethan Perez, and Christopher L. Buckley. "RL with KL penalties is better viewed as Bayesian inference." arXiv preprint arXiv:2205.11275 (2022).

The DAgger problem and recovering from errors

The DAgger problem (Ross et al. 2011) is a well-known phenomenon that arises when training a policy with imitation learning and then deploying it in an interactive setting. To summarize, an uncertain imitation-learned policy may take actions that lead it into a different distribution of states than it has been trained on, and may not know how to act in those states.

In some ways, the issues with overconfident outputs of uncertain simulators could be viewed as an instance of this: if a policy ends up choosing the empty box, it wouldn’t know what to do next! Relatedly, Ortega et al. (2021) discuss how such policies may mistakenly interpret their own actions as expert actions, and then act under incorrect beliefs about the world, due to having an incorrect causal graph.

Although learning to correct errors definitely seems important, I think it is also important to train policies to avoid these uncertain actions altogether. This would require the policy to go beyond expert imitation and learn to act according to its own capabilities. This might involve penalizing the policy for getting into states where it can't predict the expert action, similar to the disagreement regularization proposed by Brantley et al. (2020).

Robustly-simulatable personas

A speculative thought: does there exist a simulatable persona that is robust to the simulator's uncertainty about its behavior? For instance, we could imagine a fictional character that is both intelligent and "knows" (e.g. is described or simulated as knowing) that it is being simulated by an imperfect simulator. Such a character might be able to observe the mistakes in its own statements, and infer that it is not being simulated well, then learn from context what level of complexity can be faithfully simulated and only take the actions that preserve its (simulated) intent.

If done carefully, this could be a very useful property for a simulated assistant to have. Such an assistant would be able to "notice it is confused" and avoid inadvertently harmful actions or redact unjustified claims, or even flag these behaviors to a human operator. This might enable us to at least trust that the system's actual behavior will match our designed persona's goals, without deviating unpredictably due to simulation error.

On the other hand, if this kind of ability emerged spontaneously from a language model, this could be a big step change in the effective capabilities of the overall system, and could lead to a sudden increase in agentic behavior. I'm reminded of the episode "Elementary, Dear Data" from Star Trek: The Next Generation, in which a non-player character in the Holodeck simulation room becomes aware of the nature of the Holodeck, and quickly uses it to exert influence on the rest of the Enterprise from within. Perhaps a robustly-simulatable persona would act similarly? Today's language models have probably read the script of this exact Star Trek episode, so this kind of behavior is probably part of the learned distribution of plausible personas.

How far-fetched is this? I haven't successfully gotten ChatGPT (GPT-3.5) to reliably detect its own mistakes, even when asked. But asking the "Sydney" persona embedded in Bing Chat to look up information about its own limitations can apparently produce some pretty weird pseudo-self-aware responses. It seems somewhat plausible to me that a form of self-correction could emerge spontaneously from language models in the future, even if not explicitly trained to do so, if a sufficiently general model were combined with the right prompt (and possibly external tools). If that persona emerged naturally from human-generated data, my guess is it would probably have a sub-human level of agency and influence over the world, but it might still be a surprising and potentially dangerous deviation from the kinds of behavior previously observed.