A new method significantly increases the truthfulness of large language models and shows that the models know more than they reveal.
Researchers at Harvard University have developed a technique called Inference-Time Intervention (ITI) to improve the truthfulness, or factuality, of large language models – and create a “Honest LLaMA”, as it’s called on GitHub. The work is motivated by the fact that ChatGPT and other chatbots provide correct information in some contexts, but hallucinate in others – so the facts are there, but sometimes lost in the model’s inference.
The team uses linear probes to identify sections in the neural network that have high accuracy in factuality tests using parts of the TruthfulQA benchmark. Once the team has identified these sections in some of the transformer’s attention heads, ITI shifts model activations along these attention heads during text generation.
ITI significantly increases Alpaca’s truthfulness
The researchers show that with ITI, the accuracy of the open-source Alpaca model in the TruthfulQA benchmark increases from 32.5 to 65.1 percent, with similar jumps for Vicuna and LLaMA. However, too large a shift in model activations can also have negative consequences: The model denies answers and thus becomes less useful. This trade-off between factuality and helpfulness can be balanced by adjusting the intervention strength of ITI.
ITI has some overlap with reinforcement learning, where human feedback can also increase factuality. However, RLHF can also encourage misleading behavior as the model tries to match human expectations. ITI does not have this problem and is also minimally invasive, requiring little training data and computational power, the researchers say.
Studies of large language models could lead to a better understanding of “truth”
The team now wants to understand how the method can be generalized to other datasets in a real-world chat setting, and develop a deeper understanding of the trade-off between factuality and helpfulness. In addition, it may be possible in the future to learn the manually identified network segments in a self-supervised manner to make the method more scalable.
Finally, the researchers point out that the topic could also make a wider contribution: “From a scientific perspective, it would be interesting to better understand the multidimensional geometry of representations of complex attributes such as ‘truth’.”
The code and more information are available on GitHub.