Data professionals are increasingly concerned about AI hallucination, particularly in education. AI hallucination occurs when a large language model (LLM) produces nonsensical or inaccurate output because it perceives patterns or objects that are nonexistent or imperceptible to humans.
Hallucinations often occur in generative AI chatbots or computer vision tools. Hallucinations not only provide incorrect information but can also replicate and perpetuate societal biases against marginalized populations based on race, gender, or sexual orientation. Incorrect outputs and reinforced biases can have detrimental effects on learning.
Over the past few weeks, I have examined various approaches to addressing hallucinations, which led to the following techniques for reducing or rectifying hallucinations in AI-generated responses.
I’ve divided my recommendations into two parts: ways to reduce the likelihood of receiving a hallucination, and ways to identify and address potential hallucinations in AI-generated responses.
Reducing the Rate of Hallucinations
Prompt Engineering Suggestions
There are a few ways to reduce the likelihood of receiving a hallucination. The first is prompt engineering: when prompting an AI to generate a response, some simple techniques are worth considering, as the prompt’s wording will often determine the generated response.
Some things to consider (a combined prompt example follows this list):
- Provide explicit instructions and request the AI to verify its information. For example, if the AI outputs “arithmetic” when prompted with “Provide 5 synonyms for math that start with the letter c,” instead try this: “Provide 5 synonyms for math that start with the letter c and note whether each word provided begins with the letter c.”
- The “chain of thought” technique. Ask the model to break a problem down into smaller chunks and explain the step-by-step process it uses to deliver a final solution.
- Specify that no answer is better than an incorrect answer. When AI is unable to locate a correct answer, it can create answers that are close to addressing the prompt without meeting all criteria specified. Instructing the model that no answer is preferred over an incorrect or partial answer can reduce the likelihood of incorrect responses.
- Provide examples of correct answers in your prompt. Giving the AI examples can help guide it to the information that is being requested.
- Provide full or additional context. Models are limited by the information and data they are given. Pasting relevant material into the prompt, such as text from a webpage, document, or transcript, gives the model additional context to draw on in its response.
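To make these suggestions concrete, here is a minimal sketch of a single prompt that combines several of the techniques above (explicit instructions with self-verification, preferring no answer over a wrong one, and an example of a correct answer). It assumes the OpenAI Python client; the model name and prompt wording are illustrative, and any chat-based LLM API could be substituted.

```python
# Minimal sketch combining several prompt-engineering techniques.
# Assumes the OpenAI Python client (`pip install openai`) and an API key
# in the environment; the model name and wording are illustrative.
from openai import OpenAI

client = OpenAI()

prompt = (
    "Provide 5 synonyms for math that start with the letter c.\n"
    # Explicit instruction + self-verification
    "For each word, explain step by step why it is a synonym for math "
    "and confirm that it begins with the letter c.\n"
    # Prefer no answer over an incorrect or partial answer
    "If you cannot find 5 words that meet both criteria, say so instead "
    "of guessing.\n"
    # Example of a correct answer to guide the model
    "Example of an acceptable entry: 'calculation - begins with c, refers "
    "to mathematical computation.'"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
    temperature=0,        # lower randomness for factual tasks
)
print(response.choices[0].message.content)
```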
RAG Approach
Another approach is an end-to-end Retrieval-Augmented Generation (RAG) technique, which combines pre-trained models with informative latent documents (e.g., Wikipedia) that guide the generated responses, greatly decreasing hallucinations and producing more factual and specific output.
This technique uses a general-purpose fine-tuning recipe in which pre-trained components are combined into a probabilistic model trained end-to-end by marginalizing over latent documents with a top-K approximation. The architecture includes parametric memory (knowledge embedded within the model’s parameters) and non-parametric memory (an explicit store of retrievable documents), both pre-trained and pre-loaded with extensive knowledge, allowing that knowledge to be accessed without additional training. Document retrieval and generation are trained jointly end-to-end, which yields enhanced performance on various knowledge-intensive tasks and demonstrates the efficacy of combining parametric and non-parametric memory in generation models.
I found two of the tasks explored in this publication particularly interesting; both produced responses that were more factual, specific, and diverse than the Bidirectional and Auto-Regressive Transformer (BART) baseline, a state-of-the-art parametric-only model. The first task interested me because the question posed could not be answered using Wikipedia, the documentation source, alone; the RAG technique instead relies on parametric knowledge to generate reasonable responses without access to the specific pieces of information required for the reference answer. In this task, the RAG models were qualitatively found to hallucinate less and to generate factually correct text more often than BART. The second task was Jeopardy question generation, in which a Jeopardy answer (e.g., “The Divine Comedy”) was given as input to generate a Jeopardy question (e.g., “This 14th-century work is divided into 3 sections: ‘Inferno’, ‘Purgatorio’, & ‘Paradiso’”). These Jeopardy-style questions can be challenging for AI to generate because they often contain two separate pieces of information. The RAG technique may perform better at this task than other models because it can generate responses that combine content from several documents.
This RAG technique can be a particularly viable option when additional data or documentation is at hand to better inform the models about the prompt topic(s).
The end-to-end RAG code is open-sourced and can be found here.
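As a usage illustration, the sketch below loads one of the published RAG checkpoints through the Hugging Face Transformers library, following its documented pattern. The checkpoint name and the dummy retrieval index are assumptions made to keep the example lightweight; a full Wikipedia index would be used in practice.

```python
# Minimal sketch of running a pre-trained end-to-end RAG checkpoint via
# Hugging Face Transformers. Requires `pip install transformers datasets
# faiss-cpu torch`. The dummy dataset stands in for the full Wikipedia index.
from transformers import AutoTokenizer, RagRetriever, RagSequenceForGeneration

tokenizer = AutoTokenizer.from_pretrained("facebook/rag-sequence-nq")
retriever = RagRetriever.from_pretrained(
    "facebook/rag-sequence-nq", index_name="exact", use_dummy_dataset=True
)
model = RagSequenceForGeneration.from_pretrained(
    "facebook/rag-sequence-nq", retriever=retriever
)

# The retriever fetches supporting documents; the generator conditions on them.
inputs = tokenizer("who wrote the divine comedy", return_tensors="pt")
generated = model.generate(input_ids=inputs["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```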
Identifying and Rectifying Hallucinations
Once AI has generated responses, I suggest three potential approaches to fixing hallucinations: Self-consistency, Chain of Verification (CoVe), and Real-Time Verification and Rectification (EVER). These methods seek to identify hallucinations and derive the correct response. Self-consistency uses quality checks and repeated response generation to identify the correct response, while CoVe and EVER assume a language model can check and correct its own work if prompted. CoVe and EVER prompt the LLM to verify whether each generated statement that needs checking is correct, then generate a final, corrected response that accounts for inconsistencies between the original response and the verified statements.
Self-consistency
Self-consistency is a thorough technique by Pardos and Bhandari in which multiple responses are generated by the AI for each question or prompt. These are then evaluated by humans to ensure quality. Similar answers are grouped together, with the largest group deemed the correct option.
This error mitigation technique reduces the frequency of incorrect statements by prompting the model many times and assessing the generated responses. Rather than taking the first immediate answer, self-consistency gathers multiple response samples to identify the most consistent answer. Self-consistency is applied via 3 steps (a code sketch follows these steps):
- ChatGPT was used to generate 10 responses for each prompt.
- Each set of 10 responses was quality-checked on a 3-point criterion by six undergraduate students. The 3-point screening ensured that 1) the correct answer was given in the response, 2) the work shown was correct, and 3) no inappropriate language was used.
- Responses with the same answer were grouped. The answer with the greatest number of responses was deemed the most correct. If two groups contained the same number of responses, a random number generator selected which group of answers to use.
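As a rough illustration of steps 1 and 3 (the human quality check in step 2 is omitted), here is a minimal Python sketch. The `generate` and `extract_answer` callables are hypothetical stand-ins for an LLM call and an answer-extraction step, not functions from the original study.

```python
# Minimal sketch of self-consistency: sample several responses, group them
# by final answer, and return a response from the largest group.
import random
from collections import defaultdict

def self_consistent_answer(prompt, generate, extract_answer, n_samples=10):
    # Step 1: sample multiple responses for the same prompt.
    responses = [generate(prompt) for _ in range(n_samples)]

    # Step 2 (human quality screening) is omitted in this sketch; in the
    # original study each set of responses was checked by human raters.

    # Step 3: group responses by their final answer and keep the largest group.
    groups = defaultdict(list)
    for r in responses:
        groups[extract_answer(r)].append(r)

    best_size = max(len(v) for v in groups.values())
    top_groups = [v for v in groups.values() if len(v) == best_size]
    # Ties between equally large groups are broken at random.
    return random.choice(top_groups)[0]
```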
Pardos and Bhandari found that ChatGPT 3.5 produced errors or failed quality checks on 32 percent of Algebra and Statistics prompts. This error rate dropped to nearly 0 percent for Algebra problems and 13 percent for Statistics problems after applying the self-consistency technique. The publication also notes that ChatGPT 4 was recently reported to have a 27 percent error rate on AP Calculus, Physics, and Chemistry questions, very close to ChatGPT 3.5’s 32 percent error rate.
Chain of Verification (CoVe)
The CoVe method assumes the language model can generate and execute a plan to verify itself and check its work when suitably prompted. CoVe consists of four steps:
- Generate a Baseline Response using the LLM, given a prompt.
- Plan Verifications. Generate a list of verification questions to help the LLM self-analyze whether there are mistakes in the original response, given the prompt and baseline response.
- Execute Verifications. Ask the LLM to answer the planned verification questions. After each response, check the answer against the original response for inconsistencies or mistakes.
- Generate a Final Verified Response that accounts for the discovered inconsistencies.
Step 3, Execute Verifications, can have several variants. The suggested factored variant does not include the original baseline response in the verification prompts and is therefore not prone to copying or repeating it; all verification questions are answered independently as separate prompts.
By breaking verification down into simpler questions with the CoVe method, models answer the verification questions with higher accuracy than they answer the original query. Controlling the model’s attention via the factored approach in Step 3, so that it cannot attend to its previous answers, helps prevent the model from copying the same hallucinations.
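Putting the four steps together, here is a minimal sketch of CoVe with the factored verification variant. The `llm` and `parse_questions` callables are hypothetical stand-ins, and the prompts are illustrative rather than the paper’s exact templates.

```python
# Minimal sketch of the four CoVe steps with factored verification.
# `llm(prompt)` returns a string; `parse_questions(text)` splits a numbered
# list of questions. Both are hypothetical stand-ins.
def chain_of_verification(query, llm, parse_questions):
    # 1. Generate the baseline response.
    baseline = llm(query)

    # 2. Plan verification questions for the baseline response.
    plan = llm(
        f"Question: {query}\nAnswer: {baseline}\n"
        "List verification questions that would check each fact in the answer."
    )
    questions = parse_questions(plan)

    # 3. Execute verifications (factored variant): answer each question as an
    #    independent prompt, without showing the baseline, so the model cannot
    #    simply repeat its earlier hallucinations.
    verifications = [(q, llm(q)) for q in questions]

    # 4. Generate the final verified response, accounting for any
    #    inconsistencies between the baseline and the verification answers.
    evidence = "\n".join(f"Q: {q}\nA: {a}" for q, a in verifications)
    return llm(
        f"Original question: {query}\nDraft answer: {baseline}\n"
        f"Verification Q&A:\n{evidence}\n"
        "Revise the draft answer, correcting anything inconsistent with the "
        "verification answers."
    )
```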
Real-Time Verification and Rectification (EVER)
Similar to the CoVe method, the EVER pipeline identifies and rectifies hallucinations through validation prompts. To verify an LLM’s response to a prompt, multiple Yes/No validation questions are asked in parallel. Through chain-of-thought prompting, the LLM is instructed to assign each validation question one of three flags: True, False, or Not Enough Information. If at least one validation question is not flagged True, EVER rectifies the corresponding sentence based on the evidence gathered. If the hallucination is intrinsic (i.e., the generated output contradicts the source content), the revision is based on the evidence from the previous step. If the hallucination is extrinsic (i.e., the output is neither supported nor refuted by the evidence), the sentence is rewritten using feedback that pinpoints the issue, with the retrieved evidence as a reference.
EVER effectively addresses both intrinsic and extrinsic hallucinations while reducing the propagation of errors that may occur in sequential text generation.
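A rough sketch of this per-sentence verify-then-rectify logic is below. The `llm`, `retrieve_evidence`, and `validation_questions` helpers are hypothetical stand-ins, and the prompts are illustrative rather than the paper’s exact templates.

```python
# Minimal sketch of an EVER-style sentence check: flag each validation
# question True / False / Not Enough Information, then rectify the sentence
# depending on whether the hallucination is intrinsic or extrinsic.
def verify_and_rectify(sentence, llm, retrieve_evidence, validation_questions):
    evidence = retrieve_evidence(sentence)

    flags = []
    for q in validation_questions(sentence):
        # Chain-of-thought prompt that ends with exactly one of three flags.
        verdict = llm(
            f"Evidence:\n{evidence}\n\nQuestion: {q}\n"
            "Think step by step, then answer with exactly one of: "
            "True, False, Not Enough Information."
        )
        # A real implementation would parse the verdict more robustly.
        flags.append(verdict.strip())

    if all(f == "True" for f in flags):
        return sentence  # no hallucination detected

    if any(f == "False" for f in flags):
        # Intrinsic hallucination: the sentence contradicts the evidence,
        # so rewrite it directly from that evidence.
        return llm(
            "Rewrite this sentence so it agrees with the evidence.\n"
            f"Evidence:\n{evidence}\nSentence: {sentence}"
        )

    # Extrinsic hallucination: unsupported claims; rewrite using feedback
    # that pinpoints the issue, with the retrieved evidence as a reference.
    return llm(
        "Some claims in this sentence could not be verified.\n"
        f"Evidence:\n{evidence}\nSentence: {sentence}\n"
        "Rewrite it, removing or qualifying the unverified claims."
    )
```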
Conclusion
As we move forward with the integration of LLMs and AI chatbots into our daily lives, it is crucial to diligently assess and rectify AI-generated hallucinations. These inaccuracies and societal biases can often go unnoticed, yet their impact can be significant. The responsibility lies in our hands to apply mindful prompt engineering and robust techniques such as the end-to-end RAG method, Self-consistency, CoVe, and EVER. These approaches not only reduce the frequency of hallucinations but also help ensure that we work with accurate and reliable information. By actively engaging with these methods, we can mitigate the risk of perpetuating misinformation from AI-generated content. As AI continues to evolve, we must remain vigilant in overseeing how these technologies are used.