Retrieval Augmented Generation (RAG) & Hallucinations

What is Retrieval Augmented Generation (RAG)?

Retrieval augmented generation is a technique that combines the capabilities of a large language model (LLM) with a knowledge retrieval system to enhance the LLM’s output. LLMs are usually pre-trained using a vast amount of publicly available data. As a result, their ability to generate responses about knowledge outside the training dataset is limited. RAG solves this problem by retrieving relevant information from external knowledge bases and passing this information to the LLM as an additional factual grounding to generate more accurate and relevant responses.

How does RAG work?

Typically LLMs take the prompts from a user and create a response based on information it was previously trained on or what it already knows. With RAG an information retrieval component is introduced as a new layer that utilizes the user’s prompt to first pull information from a new data source. The user’s prompt is augmented with the relevant information and passed on to the LLM. The LLM uses the new knowledge and its pre-trained knowledge to generate better responses. The entire process is summarized in the diagram below.

How does RAG work

An overview of the process is provided in the following sections.

1. Create external data
The data outside the LLM’s original training data set is often referred to as external data. It could be obtained from various data sources such as databases, document repositories, file storage, or APIs. The pulled data can be in different formats as well like files, long-form text, or database records. Ideally, all the data is converted into numerical representations and stored in vector databases using another technique called embedding language models. This first step is crucial to build a relevant knowledge library that the LLMs can understand.

2. Retrieve relevant information
The next step is to perform a relevancy search. Let’s take an example where an employee in an organization prompts its smart chatbot for “How many annual leaves do I have?”. The system will convert the user query to a vector representation and perform match queries with the previously created knowledge library to find relevant data. Ideally, this would return documents such as the annual leave policy document and the relevant employee’s past leave records since these are highly relevant to what the employee has input.

3. Augment prompt
Finally, the system augments the user’s prompt by adding the relevant retrieved-context using prompt engineering techniques. The augment prompt is passed on to the LLM and the final response is returned as output to the user. The augmented prompts enable effective communication with the LLM and help generate more accurate answers to user queries.

4. Update external data
One concern that might arise is what happens if the external data becomes stale. Automated processes should be in place to asynchronously update documents and their relevant embedding representations to ensure up-to-date information is available for retrieval. This is considered a common challenge in data analytics and different strategies can be employed to manage and update data specific to business needs.


Hallucination in AI refers to the phenomenon where the model generates outputs that are nonsensical, irrelevant, or factually incorrect. AI models are only as effective as the data they’ve been trained on. LLM outputs can be non-deterministic and based on probabilities and not on actual knowledge or consciousness. When LLMs such as ChatGPT are prompted on niche topics that are outside their knowledge base they often make up nonsensical answers.

LLMs understand the statistical relationship between words but they may not always understand what the words actually mean. Hallucinations in generative AI applications can be particularly problematic, especially in applications where high accuracy is crucial such as content creation for the healthcare sector, as they can undermine the reliability and trustworthiness output generated by the LLM.

Listed below are some of the leading causes of hallucinations,

  • Model overfitting: The model might overinterpret its training data and focus on unnecessary noise and outliers within the data leading to low-quality pattern recognition.
  • Data quality: Mislabeled, miscategorized, poor-quality data can lead to bias, errors, and hallucinations. As an example, if a photograph of a car is mislabeled as a truck, then the model could potentially return an answer later that states cars are great for trade workers who need to haul or tow large loads across long distances.
  • Data sparsity: Lack of fresh and relevant data to train the model can also cause hallucinations as the models might tend to fill knowledge gaps on their own. Which can lead to answers that are not factually grounded.

How does RAG impact hallucinations?

In an application that implements RAG, the generative component uses the retrieved information passed as the context to construct the response. This ensures that the output is grounded in factual data and relevant context. This reduces the chances of producing hallucinations.

By relying on external sources of information, RAG minimizes the risk of the AI model overinterpreting its training data and addresses gaps in scenarios where the training data is sparse. While RAG significantly reduces the incidence of hallucinations in generative AI models, it is not entirely immune to them.

RAG is most effective when it comes to “knowledge-intensive” scenarios, for example, if a user needs to find who won the Super Bowl last year the document that answers this question is very likely to contain similar keywords as the prompt (“Super Bowl”, “year”), making it relatively easier to relevance match. But with “reasoning-intensive” tasks such as math or coding, it’s harder to find relevant documents that can solve these problems since it’s practically difficult for the search query to accommodate the necessary concepts required to answer the request. Here are some more scenarios in which RAG might still generate hallucinations,

  • Inaccurate/incomplete retrieval: If the retrieval component fails to search for relevant information, the LLM may still produce outputs based on the pre-trained data, which can create hallucinations. This can occur when the knowledge base being searched is not comprehensive enough, or the retrieval algorithm is ineffective.
  • Lack of quality data sources: When the retrieval component retrieves inaccurate, outdated, or misleading information from unreliable sources, the generative model may incorporate this incorrect data into its responses, resulting in hallucinations.
  • Poor prompting/integration of retrieved information:The integration process with the generative component needs to be correctly implemented to ensure that the retrieved information is utilized properly. The new prompt generated with the additional context must be concise and effective to get good results from the LLM.
  • Ambiguous queries: When the input query is imprecise or lacks sufficient detail, the retrieval component may struggle to find the most relevant information, leading to generative outputs that fill in the gaps with hallucinated content.

Retrieval Augmented Generation (RAG) & Hallucinations

  • Reduce Risk
  • Simplify Compliance
  • Gain Visibility
  • Version Comparison

Reducing RAG Hallucinations

Addressing hallucinations within RAG requires a multifaceted approach. Following are some strategies to reduce the hallucinations in RAG systems

Enhance the accuracy of retrieval:

  • Refinement of Query Processing: Implement advanced natural language processing techniques such as semantic similarity, hybrid search techniques, named entity recognition, similarity scoring, etc., to better understand user queries, leading to more accurate retrieval from knowledge bases.
  • Expansion of Knowledge Bases: Ensure the knowledge bases are extensive and regularly updated with the latest data. Including diverse and high-quality sources helps provide comprehensive and relevant information.
  • Relevance Scoring: Utilize sophisticated relevance scoring algorithms to rank the retrieved documents or pieces of information, ensuring the most pertinent information is used in the generative process.
  • Improving Quality of Data Sources: Prioritize credible and reliable sources in the retrieval process. This involves filtering out low-quality or unreliable sources and focusing on peer-reviewed, authoritative, and up-to-date information.

Optimizing Prompting Techniques:

  • Contextual Prompts: Design prompts that effectively integrate the retrieved information with the pre-trained knowledge of the LLM. This involves crafting prompts that clearly instruct the model on how to utilize the additional context.
  • Iterative Refinement: Continuously refine and test prompts to improve the coherence and relevance of the generative outputs. This might include using feedback loops where human reviewers assess and suggest improvements to the prompts.

Feedback and Learning Systems:

  • Human-in-the-Loop: Incorporate human reviewers in the loop to assess and correct outputs, providing feedback that helps the system learn and improve over time.
  • Continuous Learning: Utilize machine learning techniques that enable the system to learn from feedback and improve its performance iteratively, reducing the likelihood of future hallucinations.

Future of RAG

RAG represents a significant advancement in the field of AI, combining the strengths of LLMs with sophisticated retrieval mechanisms to produce more accurate and relevant answers. RAG has the ability to transform how humans engage with and use AI for information retrieval and decision-making by addressing hallucination concerns, improving data source quality, and continually improving the system through feedback and learning. This will ultimately lead to increased trustworthiness and reliability of AI systems, paving the way for innovative applications in a variety of disciplines.