RAG is a core component of LLM applications. The idea is to index your data in a vector database and then use the LLM to generate responses based on the indexed data.
The concept seems simple, but the implementation can be complex. I have recently been researching Dify - a popular LLM application platform - and found that its RAG engine is a good case study for understanding how to implement a comprehensive RAG system.
(Dify architecture, from its website)
In this post, I will walk through the source code of Dify's RAG engine and look at the thought process behind its RAG solution.
We can generally break a RAG system down into two parts: indexing and retrieval.
Document indexing
1. Process overview
Document indexing is the process of embedding the document content into a vector space. It is essentially an ETL (Extract, Transform, Load) process with a vector database as the target.
If we look at “indexing_runner.py”, the entry point of the indexing process, we can see it consists of four main stages:
- Extract: Reads content from various sources (uploaded files, Notion, websites)
- Transform: Processes and cleans the text
- Load Segments: Splits content into segments and saves to database
- Index: Creates searchable indexes
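A rough sketch of these stages in plain Python might look like the following. The function names, the fixed-size splitting, and the local-file source are mine for illustration only, not Dify's actual code:

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    doc_id: str
    position: int
    content: str


def extract(path: str) -> str:
    """Extract: read raw text from a source (here simply a local file)."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def transform(text: str) -> str:
    """Transform: minimal cleaning - collapse repeated whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def load_segments(doc_id: str, text: str, chunk_size: int = 500) -> list[Segment]:
    """Load: split the cleaned text into fixed-size segments to be persisted."""
    return [
        Segment(doc_id, i, text[start:start + chunk_size])
        for i, start in enumerate(range(0, len(text), chunk_size))
    ]


def index_document(doc_id: str, path: str) -> list[Segment]:
    segments = load_segments(doc_id, transform(extract(path)))
    # The final stage hands these segments to a vector and/or keyword indexer,
    # which is covered in the sections below.
    return segments
```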
2. Index types
Normally, RAG systems use vector/embedding indexes for semantic search. However, Dify also supports a keyword index for basic search.
For the keyword index, Dify by default uses Jieba’s TF-IDF algorithm to extract keywords. It stores the keyword index in two-tier storage:
- Redis: Used as a cache layer for fast access
- Database: Used as permanent storage
Redis is also used as a locking mechanism to keep multiple indexing processes from running simultaneously.
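Here is a minimal sketch of that keyword-indexing flow, assuming a local Redis instance and using a JSON file as a stand-in for the database table. The key names and TTL are made up; `jieba.analyse.extract_tags` is Jieba's real TF-IDF keyword extraction API:

```python
import json

import jieba.analyse
import redis

r = redis.Redis()


def extract_keywords(text: str, top_k: int = 10) -> list[str]:
    # Jieba's TF-IDF-based keyword extraction.
    return jieba.analyse.extract_tags(text, topK=top_k)


def build_keyword_index(dataset_id: str, segments: dict[str, str]) -> dict:
    # The Redis lock keeps multiple indexing runs for the same dataset
    # from racing each other.
    with r.lock(f"keyword_index_lock:{dataset_id}", timeout=600):
        index = {seg_id: extract_keywords(text) for seg_id, text in segments.items()}
        # Cache layer: Redis, with a TTL for fast access.
        r.setex(f"keyword_index:{dataset_id}", 3600, json.dumps(index, ensure_ascii=False))
        # Permanent storage: in Dify this is a database table; a JSON file
        # stands in for it here.
        with open(f"keyword_index_{dataset_id}.json", "w", encoding="utf-8") as f:
            json.dump(index, f, ensure_ascii=False)
        return index
```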
3. Cleaning up
Data cleaning is a crucial step in the indexing process. It ensures that the data is in a format that can be easily processed and indexed.
The essential cleaning steps, defined in “clean_process.py”, include:
- Removing extra whitespace/newlines
- Stripping URLs and emails that don’t add semantic value
- Removing special characters that could interfere with processing
- Normalizing text format (like fixing broken Unicode characters)
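As a rough illustration, the cleaning boils down to a handful of text transformations like the ones below. These regular expressions are my own simplification, not the exact rules in “clean_process.py”:

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    # Normalize Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Strip URLs and email addresses that add little semantic value.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", " ", text)
    # Drop control characters that could confuse downstream tokenizers.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    # Collapse runs of whitespace/newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("See   https://example.com  or mail\u00a0me@example.com\n\nplease"))
```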
4. Chunking
Chunking is the process of splitting a document into smaller segments. It is a crucial step in the indexing process, as it allows the system to search for relevant information more efficiently. An interesting part of RAG in Dify is that, besides the default chunking method, you can choose parent-child chunking (“parent_child_index_processor.py”), in which the child chunks are used for retrieval and the parent chunk is returned as context (see the sketch after this list). It has several benefits:
- Reference the parent document when needed
- Understand where a chunk fits in the larger document
- Maintain document hierarchy
- Can store metadata at parent level instead of duplicating across all chunks
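Here is a simplified sketch of the idea, using naive fixed-size splitting purely for illustration rather than Dify's actual splitting rules:

```python
from dataclasses import dataclass, field


@dataclass
class ChildChunk:
    parent_id: str
    content: str


@dataclass
class ParentChunk:
    parent_id: str
    content: str
    children: list[ChildChunk] = field(default_factory=list)


def build_parent_child_chunks(doc: str, parent_size: int = 2000, child_size: int = 400) -> list[ParentChunk]:
    parents = []
    for p_start in range(0, len(doc), parent_size):
        parent_text = doc[p_start:p_start + parent_size]
        parent = ParentChunk(parent_id=f"p{p_start}", content=parent_text)
        # Child chunks are what get embedded and searched; the parent is what
        # gets handed to the LLM as context.
        parent.children = [
            ChildChunk(parent_id=parent.parent_id, content=parent_text[c:c + child_size])
            for c in range(0, len(parent_text), child_size)
        ]
        parents.append(parent)
    return parents


def retrieve_context(matched_child: ChildChunk, parents: list[ParentChunk]) -> str:
    # After a child chunk matches the query, return its parent as the context.
    return next(p.content for p in parents if p.parent_id == matched_child.parent_id)
```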
5. Vector storage
This part is relatively straightforward. Dify first uses a vectorizer to convert the text into embedding vectors. The vectorizer could be a local embedding model or a remote embedding service like OpenAI. It then stores the vectors in the vector database, where they can be used for semantic search.
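A toy sketch of the embed-and-store step is below. The hash-like `embed` function and the in-memory dictionary are placeholders for a real embedding model (or service) and a real vector database:

```python
import math

# In-memory stand-in for a vector database: id -> (embedding, original text).
VECTOR_STORE: dict[str, tuple[list[float], str]] = {}


def embed(text: str) -> list[float]:
    # Placeholder embedder: a real system would call a local embedding model
    # or a remote service (e.g. OpenAI) here.
    vec = [0.0] * 64
    for i, ch in enumerate(text.lower()):
        vec[i % 64] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def upsert_segment(seg_id: str, text: str) -> None:
    # Store the normalized embedding alongside the raw text so the segment
    # itself can be returned as context after a match.
    VECTOR_STORE[seg_id] = (embed(text), text)
```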
6. Tracking
It is important to surface the indexing progress to the user. Dify tracks the document status through states like “splitting”, “indexing”, and “completed”, so the frontend can show the progress of the indexing process. Besides that, it can estimate the indexing requirements before processing, which helps the user understand the cost of indexing.
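For example, a small status model plus a rough cost estimate might look like this; the “waiting”/“error” states, the 4-characters-per-token heuristic, and the price are my assumptions, not Dify's actual values:

```python
from dataclasses import dataclass
from enum import Enum


class IndexingStatus(str, Enum):
    WAITING = "waiting"
    SPLITTING = "splitting"
    INDEXING = "indexing"
    COMPLETED = "completed"
    ERROR = "error"


@dataclass
class DocumentRecord:
    doc_id: str
    indexing_status: str = IndexingStatus.WAITING.value


def estimate_embedding_cost(text: str, price_per_1k_tokens: float = 0.0001) -> float:
    # Very rough estimate (~4 characters per token) shown to the user before
    # indexing starts, so the cost is not a surprise.
    tokens = max(1, len(text) // 4)
    return tokens / 1000 * price_per_1k_tokens


def advance(doc: DocumentRecord, status: IndexingStatus) -> None:
    # In Dify the status lives in the database and the frontend polls it to
    # render a progress indicator; here we just update the record in memory.
    doc.indexing_status = status.value
```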
Retrieval pipeline
The retrieval pipeline, which you can investigate in “dataset_retrieval.py”, is the process of searching the indexed data for relevant information and doing essential post-processing before returning the result. Dify has made several efforts to improve retrieval quality.
1. Initial retrieval
Initial retrieval is the process of searching the indexed data for relevant information. Although RAG is usually thought of as semantic search, in practice semantic search alone is not always accurate. So Dify offers several options: vector search, full-text (keyword) search, and hybrid search. Under the hood, it uses the vector database, the SQL database, or both.
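As an illustration of the hybrid case, here is a sketch that merges the two result sets with a weighted score. `vector_search` and `full_text_search` are stubs standing in for the real vector-database and SQL queries, and the weighting scheme is my simplification:

```python
def vector_search(query: str, k: int) -> dict[str, float]:
    # Stub: a real implementation queries the vector database.
    return {"seg-1": 0.92, "seg-2": 0.55}


def full_text_search(query: str, k: int) -> dict[str, float]:
    # Stub: a real implementation runs a keyword/full-text query against SQL.
    return {"seg-2": 0.80, "seg-3": 0.40}


def hybrid_search(query: str, top_k: int = 4, semantic_weight: float = 0.7) -> list[tuple[str, float]]:
    # Fetch more candidates than needed from both sides, then fuse the
    # (assumed already normalized) scores with a simple weighted sum.
    semantic = vector_search(query, top_k * 2)
    keyword = full_text_search(query, top_k * 2)
    merged = {
        seg_id: semantic_weight * semantic.get(seg_id, 0.0)
        + (1 - semantic_weight) * keyword.get(seg_id, 0.0)
        for seg_id in set(semantic) | set(keyword)
    }
    # top_k caps how many segments are passed on to the next stage.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```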
2. Top-k setting
For large documents, the initial retrieval may return too many results, so you may need to limit the number of results to improve retrieval quality. Finding the best k value is largely a matter of tuning by experiment.
3. Reranking
Reranking is a crucial step in modern RAG pipelines that helps improve retrieval accuracy by re-scoring and reordering initially retrieved documents. While initial retrieval (like vector similarity search) is optimized for speed and broad coverage, reranking focuses on precision.
In particular, reranking takes an initial set of potentially relevant documents and re-sorts them based on their actual relevance to the query. This is typically more accurate than the initial retrieval step because:
- It can perform a more detailed comparison between the query and each document
- It can consider semantic meaning rather than just keyword matching (in the case of keyword search)
- It assigns relevance scores that can be used for filtering or prioritizing results
There are several reranking options; Cohere’s reranker is a popular one.
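For a flavor of what reranking looks like in code, here is a sketch using an open cross-encoder model from the sentence-transformers library. The model here is just a public example, not what Dify itself uses:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, documents: list[str], top_n: int = 3) -> list[tuple[str, float]]:
    # Score every (query, document) pair directly, then keep the best ones.
    scores = reranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_n]]
```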
4. Result output
The last step is to prompt the LLM to generate the response based on the retrieved documents. With all the context from the previous steps, Dify uses the ReAct (Reasoning + Acting) pattern to interact with the LLM and generate the most relevant response.
Specifically, it works like this:
- Thinks about the problem
- Takes an action (selects a dataset)
- Observes the result
- Continues this cycle until reaching a final answer
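A bare-bones sketch of that loop is below; the prompt format, the `llm` callable, and the `tools` dict are all hypothetical stand-ins, not Dify's actual prompts or APIs:

```python
def react_loop(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    # `llm(prompt) -> str` and the `tools` dict (e.g. {"search_dataset": fn})
    # are hypothetical; this only illustrates the think -> act -> observe cycle.
    scratchpad = ""
    for _ in range(max_steps):
        # Think: ask the model to reason and either pick a tool or answer.
        reply = llm(
            f"Question: {question}\n{scratchpad}\n"
            "Respond with either 'Action: <tool> <input>' or 'Final Answer: <answer>'."
        )
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        # Act: parse the chosen action (e.g. which dataset to query) and run it.
        _, tool_name, tool_input = reply.split(maxsplit=2)
        # Observe: feed the tool's result back into the next round of reasoning.
        observation = tools[tool_name](tool_input)
        scratchpad += f"{reply}\nObservation: {observation}\n"
    return "No final answer reached within the step limit."
```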
Final thoughts
This is merely a concise overview of the RAG engine in Dify. In my opinion, although we now have extremely powerful LLMs to assist us in understanding nearly everything, the complexity of a RAG system still lies mostly in data ETL and information retrieval. These aren't new ideas, so I strongly suspect that, at least for now, we don't need to put in a great deal of new engineering effort here.
As for the future, I am particularly interested in how multimodal RAG will roll out as LLMs continue to evolve so quickly. This is really an exciting era.