RAG is a core component of LLM applications. The idea is to index your data in a vector database and then use the LLM to generate responses based on the indexed data.
The concept seems simple, but the implementation can be complex. I have recently been researching Dify - a popular LLM application platform - and found that its RAG engine is a good case study for understanding how to implement a comprehensive RAG system.
(Dify architecture, from its website)
In this post, I will walk through the source code of Dify's RAG engine and look at the thought process behind its RAG solution.
We can generally break a RAG system down into two parts: indexing and retrieval.
Document indexing
1. Process overview
Document indexing is the process of embedding the document content into a vector space. It is essentially an ETL (Extract, Transform, Load) process with a vector database as the target.
If we look at “indexing_runner.py”, the entry point of the indexing process, we can see it consists of four main stages:
- Extract: Reads content from various sources (uploaded files, Notion, websites)
- Transform: Processes and cleans the text
- Load Segments: Splits content into segments and saves to database
- Index: Creates searchable indexes
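A rough sketch of these stages in plain Python might look like the following. The function names, the fixed-size splitting, and the local-file source are mine for illustration only, not Dify's actual code:

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    doc_id: str
    position: int
    content: str


def extract(path: str) -> str:
    """Extract: read raw text from a source (here simply a local file)."""
    with open(path, encoding="utf-8") as f:
        return f.read()


def transform(text: str) -> str:
    """Transform: minimal cleaning - collapse repeated whitespace."""
    return re.sub(r"\s+", " ", text).strip()


def load_segments(doc_id: str, text: str, chunk_size: int = 500) -> list[Segment]:
    """Load: split the cleaned text into fixed-size segments to be persisted."""
    return [
        Segment(doc_id, i, text[start:start + chunk_size])
        for i, start in enumerate(range(0, len(text), chunk_size))
    ]


def index_document(doc_id: str, path: str) -> list[Segment]:
    segments = load_segments(doc_id, transform(extract(path)))
    # The final stage hands these segments to a vector and/or keyword indexer,
    # which is covered in the sections below.
    return segments
```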
2. Index types
Normally, RAG systems use vector/embedding indexes for semantic search. However, Dify also supports a keyword index for basic search.
For the keyword index, Dify by default uses Jieba’s TF-IDF algorithm to extract keywords. It stores the keyword index in two-tier storage:
- Redis: Used as a cache layer for fast access
- Database: Used as permanent storage
Redis is also used as a locking mechanism to keep multiple indexing processes from running simultaneously.
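Here is a minimal sketch of that keyword-indexing flow, assuming a local Redis instance and using a JSON file as a stand-in for the database table. The key names and TTL are made up; `jieba.analyse.extract_tags` is Jieba's real TF-IDF keyword extraction API:

```python
import json

import jieba.analyse
import redis

r = redis.Redis()


def extract_keywords(text: str, top_k: int = 10) -> list[str]:
    # Jieba's TF-IDF-based keyword extraction.
    return jieba.analyse.extract_tags(text, topK=top_k)


def build_keyword_index(dataset_id: str, segments: dict[str, str]) -> dict:
    # The Redis lock keeps multiple indexing runs for the same dataset
    # from racing each other.
    with r.lock(f"keyword_index_lock:{dataset_id}", timeout=600):
        index = {seg_id: extract_keywords(text) for seg_id, text in segments.items()}
        # Cache layer: Redis, with a TTL for fast access.
        r.setex(f"keyword_index:{dataset_id}", 3600, json.dumps(index, ensure_ascii=False))
        # Permanent storage: in Dify this is a database table; a JSON file
        # stands in for it here.
        with open(f"keyword_index_{dataset_id}.json", "w", encoding="utf-8") as f:
            json.dump(index, f, ensure_ascii=False)
        return index
```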
3. Cleaning up
Data cleaning is a crucial step in the indexing process. It ensures that the data is in a format that can be easily processed and indexed.
The essential cleaning steps, defined in “clean_process.py”, include:
- Removing extra whitespace/newlines
- Stripping URLs and emails that don’t add semantic value
- Removing special characters that could interfere with processing
- Normalizing text format (like fixing broken Unicode characters)
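As a rough illustration, the cleaning boils down to a handful of text transformations like the ones below. These regular expressions are my own simplification, not the exact rules in “clean_process.py”:

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    # Normalize Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Strip URLs and email addresses that add little semantic value.
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b", " ", text)
    # Drop control characters that could confuse downstream tokenizers.
    text = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", "", text)
    # Collapse runs of whitespace/newlines into single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


print(clean_text("See   https://example.com  or mail\u00a0me@example.com\n\nplease"))
```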
4. Chunking
Chunking is the process of splitting a document into smaller segments. It is a crucial step in the indexing process, as it allows the system to search for relevant information more efficiently. An interesting part of RAG in Dify is that, besides the default chunking method, you can choose parent-child chunking (“parent_child_index_processor.py”), in which the child chunks are used for retrieval and the parent chunk is returned as context (see the sketch after this list). It has several benefits:
- Reference the parent document when needed
- Understand where a chunk fits in the larger document
- Maintain document hierarchy
- Can store metadata at parent level instead of duplicating across all chunks
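Here is a simplified sketch of the idea, using naive fixed-size splitting purely for illustration rather than Dify's actual splitting rules:

```python
from dataclasses import dataclass, field


@dataclass
class ChildChunk:
    parent_id: str
    content: str


@dataclass
class ParentChunk:
    parent_id: str
    content: str
    children: list[ChildChunk] = field(default_factory=list)


def build_parent_child_chunks(doc: str, parent_size: int = 2000, child_size: int = 400) -> list[ParentChunk]:
    parents = []
    for p_start in range(0, len(doc), parent_size):
        parent_text = doc[p_start:p_start + parent_size]
        parent = ParentChunk(parent_id=f"p{p_start}", content=parent_text)
        # Child chunks are what get embedded and searched; the parent is what
        # gets handed to the LLM as context.
        parent.children = [
            ChildChunk(parent_id=parent.parent_id, content=parent_text[c:c + child_size])
            for c in range(0, len(parent_text), child_size)
        ]
        parents.append(parent)
    return parents


def retrieve_context(matched_child: ChildChunk, parents: list[ParentChunk]) -> str:
    # After a child chunk matches the query, return its parent as the context.
    return next(p.content for p in parents if p.parent_id == matched_child.parent_id)
```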
5. Vector storage
This part is relatively straightforward. Dify first uses a vectorizer to convert the text into embedding vectors. The vectorizer could be a local embedding model or a remote embedding service like OpenAI. It then stores the vectors in the vector database, where they can be used for semantic search.
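A toy sketch of the embed-and-store step is below. The hash-like `embed` function and the in-memory dictionary are placeholders for a real embedding model (or service) and a real vector database:

```python
import math

# In-memory stand-in for a vector database: id -> (embedding, original text).
VECTOR_STORE: dict[str, tuple[list[float], str]] = {}


def embed(text: str) -> list[float]:
    # Placeholder embedder: a real system would call a local embedding model
    # or a remote service (e.g. OpenAI) here.
    vec = [0.0] * 64
    for i, ch in enumerate(text.lower()):
        vec[i % 64] += ord(ch)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def upsert_segment(seg_id: str, text: str) -> None:
    # Store the normalized embedding alongside the raw text so the segment
    # itself can be returned as context after a match.
    VECTOR_STORE[seg_id] = (embed(text), text)
```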
6. Tracking
It is important to surface the indexing progress to the user. Dify tracks the document status through states like “splitting”, “indexing”, and “completed”, so the frontend can show the progress of the indexing process. Besides that, it can estimate the indexing requirements before processing, which helps the user understand the cost of indexing.
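For example, a small status model plus a rough cost estimate might look like this; the “waiting”/“error” states, the 4-characters-per-token heuristic, and the price are my assumptions, not Dify's actual values:

```python
from dataclasses import dataclass
from enum import Enum


class IndexingStatus(str, Enum):
    WAITING = "waiting"
    SPLITTING = "splitting"
    INDEXING = "indexing"
    COMPLETED = "completed"
    ERROR = "error"


@dataclass
class DocumentRecord:
    doc_id: str
    indexing_status: str = IndexingStatus.WAITING.value


def estimate_embedding_cost(text: str, price_per_1k_tokens: float = 0.0001) -> float:
    # Very rough estimate (~4 characters per token) shown to the user before
    # indexing starts, so the cost is not a surprise.
    tokens = max(1, len(text) // 4)
    return tokens / 1000 * price_per_1k_tokens


def advance(doc: DocumentRecord, status: IndexingStatus) -> None:
    # In Dify the status lives in the database and the frontend polls it to
    # render a progress indicator; here we just update the record in memory.
    doc.indexing_status = status.value
```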
Retrieval pipeline
The retrieval pipeline, which you can investigate in “dataset_retrieval.py”, is the process of searching the indexed data for relevant information and doing essential post-processing before returning the result. Dify has made several efforts to improve retrieval quality.
1. Initial retrieval
Initial retrieval is the process of searching the indexed data for relevant information. Although RAG is usually thought of as semantic search, in practice semantic search alone is not always accurate. So Dify offers several options: vector search, full-text (keyword) search, and hybrid search. Under the hood, it uses the vector database, the SQL database, or both.
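As an illustration of the hybrid case, here is a sketch that merges the two result sets with a weighted score. `vector_search` and `full_text_search` are stubs standing in for the real vector-database and SQL queries, and the weighting scheme is my simplification:

```python
def vector_search(query: str, k: int) -> dict[str, float]:
    # Stub: a real implementation queries the vector database.
    return {"seg-1": 0.92, "seg-2": 0.55}


def full_text_search(query: str, k: int) -> dict[str, float]:
    # Stub: a real implementation runs a keyword/full-text query against SQL.
    return {"seg-2": 0.80, "seg-3": 0.40}


def hybrid_search(query: str, top_k: int = 4, semantic_weight: float = 0.7) -> list[tuple[str, float]]:
    # Fetch more candidates than needed from both sides, then fuse the
    # (assumed already normalized) scores with a simple weighted sum.
    semantic = vector_search(query, top_k * 2)
    keyword = full_text_search(query, top_k * 2)
    merged = {
        seg_id: semantic_weight * semantic.get(seg_id, 0.0)
        + (1 - semantic_weight) * keyword.get(seg_id, 0.0)
        for seg_id in set(semantic) | set(keyword)
    }
    # top_k caps how many segments are passed on to the next stage.
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
```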
2. Top-k setting
For large documents, the initial retrieval may return too many results, so you may need to limit the number of results to improve retrieval quality. Finding the best k value is largely a matter of tuning by experiment.
3. Reranking
Reranking is a crucial step in modern RAG pipelines that helps improve retrieval accuracy by re-scoring and reordering initially retrieved documents. While initial retrieval (like vector similarity search) is optimized for speed and broad coverage, reranking focuses on precision.
In particular, reranking takes an initial set of potentially relevant documents and re-sorts them based on their actual relevance to the query. This is typically more accurate than the initial retrieval step because:
- It can perform a more detailed comparison between the query and each document
- It can consider semantic meaning rather than just keyword matching (in the case of keyword search)
- It assigns relevance scores that can be used for filtering or prioritizing results
There are several reranking options; Cohere’s reranker is a popular one.
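For a flavor of what reranking looks like in code, here is a sketch using an open cross-encoder model from the sentence-transformers library. The model here is just a public example, not what Dify itself uses:

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")


def rerank(query: str, documents: list[str], top_n: int = 3) -> list[tuple[str, float]]:
    # Score every (query, document) pair directly, then keep the best ones.
    scores = reranker.predict([(query, doc) for doc in documents])
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    return [(doc, float(score)) for doc, score in ranked[:top_n]]
```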
4. Result output
The last step is to prompt the LLM to generate the response based on the retrieved documents. With all the context from the previous steps, Dify uses the ReAct (Reasoning + Acting) pattern to interact with the LLM and generate the most relevant response.
Specifically, it works like this:
- Thinks about the problem
- Takes an action (selects a dataset)
- Observes the result
- Continues this cycle until reaching a final answer
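A bare-bones sketch of that loop is below; the prompt format, the `llm` callable, and the `tools` dict are all hypothetical stand-ins, not Dify's actual prompts or APIs:

```python
def react_loop(question: str, llm, tools: dict, max_steps: int = 5) -> str:
    # `llm(prompt) -> str` and the `tools` dict (e.g. {"search_dataset": fn})
    # are hypothetical; this only illustrates the think -> act -> observe cycle.
    scratchpad = ""
    for _ in range(max_steps):
        # Think: ask the model to reason and either pick a tool or answer.
        reply = llm(
            f"Question: {question}\n{scratchpad}\n"
            "Respond with either 'Action: <tool> <input>' or 'Final Answer: <answer>'."
        )
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        # Act: parse the chosen action (e.g. which dataset to query) and run it.
        _, tool_name, tool_input = reply.split(maxsplit=2)
        # Observe: feed the tool's result back into the next round of reasoning.
        observation = tools[tool_name](tool_input)
        scratchpad += f"{reply}\nObservation: {observation}\n"
    return "No final answer reached within the step limit."
```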
Final thoughts
This is merely a concise overview of the RAG engine in Dify. In my opinion, although we now have extremely powerful LLMs to assist us in understanding nearly everything, the complexity of a RAG system still lies mostly in data ETL and information retrieval. These aren't new ideas, so I strongly suspect that, at least for now, we don't need to put in a great deal of new engineering effort here.
As for the future, I am particularly interested in how multimodal RAG will roll out as LLMs continue to evolve so quickly. This is really an exciting era.