Elon Musk’s ‘TruthGPT’ is complicated, says OpenAI co-founder


Hallucinations are one of the central problems of large language models. An OpenAI co-founder explains why a TruthGPT is so difficult.

Elon Musk’s X.AI wants to build “TruthGPT”, an honest language model. The label refers not only to classic cases of hallucination, where systems like ChatGPT generate false outputs, but also to reports that ChatGPT favors certain political beliefs.

While the latter could be solved by giving users more control over language models, hallucinations remain a central problem that OpenAI, Google, and, in the future, Musk’s AI company will have to deal with.

In his talk, “RL and Truthfulness – Towards TruthGPT,” OpenAI co-founder and researcher John Schulman discussed these challenges and how they might be addressed.


What causes hallucinations in ChatGPT?

According to Schulman, hallucinations can be roughly divided into two types: (1) “pattern completion behavior,” in which the language model fails to express its own uncertainty, fails to question a premise in a prompt, or continues a mistake it made earlier, and (2) cases in which the model simply guesses wrong.

Since the language model stores a kind of knowledge graph of facts from its training data within its own network, fine-tuning can be understood as learning a function that operates on this knowledge graph and outputs token predictions. For example, a fine-tuning dataset might contain the question “What is the genre of Star Wars?” with the answer “Sci-Fi”. If this information is already in the original training data, i.e. it is part of the knowledge graph, the model does not learn new information; it learns a behavior: outputting correct answers. This kind of fine-tuning is also called “behavior cloning”.

The problem: if, for example, the question “What was the name of the spin-off movie about Han Solo?” appears in the fine-tuning dataset, but the answer “Solo” is not part of the original training data – and thus not part of the knowledge graph – the network learns to answer even though it does not know the answer. Fine-tuning on answers that are correct but absent from the knowledge graph therefore teaches the network to make up answers, i.e. to hallucinate. Conversely, training on incorrect answers can teach the network to withhold information it actually has.
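To make the three cases concrete, here is a toy sketch in Python. It assumes, purely for illustration, that the model's knowledge can be modeled as a simple question-to-answer dictionary (`MODEL_KNOWLEDGE`, `Example`, and `classify` are hypothetical names, not part of Schulman's talk or any library). Each fine-tuning example is labeled by what it would teach according to the argument above.

```python
from dataclasses import dataclass

# Hypothetical stand-in for the facts the pretrained model already "knows".
MODEL_KNOWLEDGE = {
    "What is the genre of Star Wars?": "Sci-Fi",
}


@dataclass
class Example:
    question: str
    answer: str


def classify(example: Example, knowledge: dict) -> str:
    """Label what a fine-tuning example teaches, following the argument above."""
    known = knowledge.get(example.question)
    if known is None:
        # The label may be true, but the model cannot derive it: it learns to guess.
        return "teaches guessing"
    if known == example.answer:
        # The label matches existing knowledge: pure behavior cloning.
        return "behavior cloning"
    # The label contradicts existing knowledge: the model learns to suppress what it knows.
    return "teaches withholding"


examples = [
    Example("What is the genre of Star Wars?", "Sci-Fi"),
    Example("What was the name of the spin-off movie about Han Solo?", "Solo"),
    Example("What is the genre of Star Wars?", "Western"),
]
for ex in examples:
    print(classify(ex, MODEL_KNOWLEDGE))
# behavior cloning
# teaches guessing
# teaches withholding
```

A real model's knowledge is of course not a lookup table, which is exactly why the next problem arises: nobody building the dataset can inspect it directly.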

Ideally, therefore, behavior cloning should always be based on what the network already knows, but this knowledge is usually unknown to the human workers who create or evaluate the datasets, e.g. for instruction tuning. According to Schulman, the same problem arises when other models generate the fine-tuning data, as in the Alpaca recipe. A smaller network with a smaller knowledge graph learns from ChatGPT’s output not only to give answers and follow instructions but also, he predicts, to hallucinate more often.
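A back-of-the-envelope way to see Schulman's prediction, sketched under loudly hypothetical assumptions: treat each model's knowledge as a set of fact identifiers and assume the distillation dataset covers every fact the teacher knows. Every teacher answer the student cannot ground in its own knowledge is then, from the student's point of view, a correct-but-unknown label that teaches guessing. The function `ungrounded_fraction` and the fact sets are illustrative only.

```python
# Hypothetical fact sets; real models do not expose their knowledge this way.
teacher_known = {"star_wars_genre", "han_solo_spinoff", "capital_of_france", "tallest_mountain"}
student_known = {"star_wars_genre", "capital_of_france"}


def ungrounded_fraction(teacher: set, student: set) -> float:
    """Share of teacher-written answers the student cannot derive from its own knowledge."""
    if not teacher:
        return 0.0
    ungrounded = teacher - student  # facts the student would be trained to "know" blindly
    return len(ungrounded) / len(teacher)


print(ungrounded_fraction(teacher_known, student_known))  # 0.5
```

The smaller the student's knowledge relative to the teacher's, the larger this fraction, which matches the intuition that smaller distilled models are pushed harder toward making things up.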

How OpenAI aims to combat hallucinations

The good news is that, at least for simple questions, language models seem to be able to estimate whether they know an answer – and could theoretically express their uncertainty. So, Schulman says, a fine-tuning data set needs to include examples in which uncertainty is communicated, a premise is challenged, or an error is admitted. These behaviors could then be taught to the model through behavior cloning.
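One way such a dataset could be assembled is sketched below. It assumes a hypothetical per-question confidence score in [0, 1] that reflects whether the model knows the answer, plus an arbitrary threshold; neither the scoring method nor the exact phrasings come from Schulman's talk. The target answer is kept when the model is confident and replaced by an expression of uncertainty otherwise.

```python
def build_target(answer: str, confidence: float, threshold: float = 0.5) -> str:
    """Pick a fine-tuning target that communicates the model's uncertainty.

    `confidence` is a hypothetical estimate of whether the model knows `answer`;
    `threshold` is an arbitrary illustrative cutoff.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= threshold:
        # Confident: clone the plain answer.
        return answer
    if confidence >= threshold / 2:
        # Partly confident: answer, but hedge.
        return f"I'm not sure, but I think it is {answer}."
    # Not confident: teach the model to admit it does not know.
    return "I don't know."


print(build_target("Sci-Fi", 0.9))  # Sci-Fi
print(build_target("Solo", 0.3))    # I'm not sure, but I think it is Solo.
print(build_target("Solo", 0.1))    # I don't know.
```

The tiered phrasing is one design choice among many; the point is only that the targets vary with what the model knows, rather than always asserting an answer.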

