Using DSPy with Qdrant to Build Advanced RAG Pipelines

A no-prompt approach to building RAG pipelines

Image source: the DSPy GitHub repository: https://github.com/stanfordnlp/dspy

What Is DSPy

DSPy is a framework developed by researchers at Stanford NLP that focuses on programming over prompting when building applications with large language models (LLMs). Unlike frameworks such as LangChain or LlamaIndex, which rely on hand-crafted prompts, DSPy emphasizes a programming model for designing pipelines.

By prioritizing programming, DSPy lets users adapt their pipelines more efficiently when components change (e.g., the LLM or the type of query). In other words, DSPy streamlines working with language models by moving away from manual prompt engineering toward a more systematic, automatically optimizing approach.
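To make the contrast with prompt-first frameworks concrete, here is a minimal sketch of DSPy's declarative style. It assumes a language model has already been configured (e.g., via dspy.settings.configure), and the signature string and question are purely illustrative:

import dspy

# Instead of writing a prompt, declare a signature: inputs -> outputs.
# DSPy builds (and can later optimize) the underlying prompt for you.
qa = dspy.Predict("question -> answer")

result = qa(question="What does a vector database store?")
print(result.answer)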

In this article, we’ll design a RAG pipeline using the DSPy framework and the Gemma-2b LLM. Along the way, we’ll demonstrate DSPy’s usefulness by using its out-of-the-box modules for prompt engineering.

We’ll also power our pipeline with Qdrant, a vector similarity search engine and vector database that provides a production-ready service with a convenient API to store, search, and manage points (vectors) along with an additional payload.

Let’s get started.

RAG Pipeline Using DSPy

First, install the dependencies. Create a requirements.txt file with the following contents:

'qdrant-client[fastembed]'
datasets
dspy-ai
transformers
torch
accelerate

Then install them:

pip install -r requirements.txt

We’ll be using the E-Commerce Customer Support Conversations dataset from Hugging Face, which contains 1,000 rows of customer service conversations.

First load the dataset:

from datasets import load_dataset
dataset = load_dataset('NebulaByte/E-Commerce_Customer_Support_Conversations')

Convert it into a format that can be uploaded to the vector index:

df_pandas = dataset['train'].to_pandas()
documents = df_pandas['conversation'].to_list()

Create a list of IDs for the vectors:

ids = list(range(1,len(documents)+1))

Launch a local Qdrant instance by running the following Docker command in your terminal:

docker run -p 6333:6333 -p 6334:6334 \
   -v $(pwd)/qdrant_storage:/qdrant/storage:z \
   qdrant/qdrant

Initialize a client object.

from qdrant_client import QdrantClient
client = QdrantClient("localhost", port=6333)

Insert the customer service data into the vector database. Deleting the collection first ensures a clean start if a collection with the same name already exists.

# Remove any existing collection, then upload the documents with their IDs
client.delete_collection(collection_name="customer_service")
client.add(collection_name="customer_service",
           documents=documents,
           ids=ids)
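
As a quick, optional sanity check, you can confirm the upload by counting the points in the collection (count() is part of the qdrant-client API):

# The collection should now contain one point per document
print(client.count(collection_name="customer_service"))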

Make sure you have logged in to your Hugging Face account.

!huggingface-cli login --token 'YOUR_HF_TOKEN'

Load the LLM.

import dspy

# Configure language model
llm = dspy.HFModel(model='google/gemma-2b')

Next, wrap the Qdrant collection in a retriever model and configure DSPy to use both the LLM and the retriever.

from dspy.retrieve.qdrant_rm import QdrantRM

qdrant_retriever_model = QdrantRM("customer_service", client, k=10)

dspy.settings.configure(lm=llm, rm=qdrant_retriever_model)

Note that, by default, embeddings are generated with FastEmbed’s default model.

If you want to change the model:

# client.set_model("sentence-transformers/all-MiniLM-L6-v2")

List of supported models: https://qdrant.github.io/fastembed/examples/Supported_Models
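
Before wiring everything into a full RAG program, you can sanity-check retrieval on its own. The snippet below is a small illustrative example (the query string is made up); dspy.Retrieve uses the retriever model configured above:

# Fetch the top-3 passages for a sample query from the configured retriever
retrieve = dspy.Retrieve(k=3)
top_passages = retrieve("My camera stopped working after delivery").passages

for passage in top_passages:
    print(passage)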

Now, define the RAG program itself as a DSPy module:

class RAG(dspy.Module):
    def __init__(self, num_passages=3):
        super().__init__()

        # Retrieve the top-k passages from Qdrant for a given query
        self.retrieve = dspy.Retrieve(k=num_passages)
        # Answer the question from the retrieved context using chain-of-thought prompting
        self.generate_answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        # Fetch supporting passages, then generate an answer grounded in them
        context = self.retrieve(question).passages
        prediction = self.generate_answer(context=context, question=question)
        return dspy.Prediction(context=context, answer=prediction.answer)

In DSPy, you can define a program that represents the architecture and flow of information in your RAG pipeline. This is similar to defining a neural network architecture in frameworks like PyTorch.

Here’s how we structure our DSPy program in the code above:

1. In the __init__() method:

- Define the modules you will use in your RAG pipeline. In this case, there are two:

  - A Retrieve module, responsible for retrieving additional context from the vector database based on the input query.

  - A ChainOfThought module, which prompts Gemma (the language model) with a chain-of-thought technique such as “Let’s think step by step.”

2. In the forward() method:

- Define the flow of information among the modules defined in __init__().

- Specify how the input query is processed by the Retrieve module to obtain additional context.

- Pass the retrieved context, along with the original query, to the ChainOfThought module, which generates the response.

By structuring your DSPy program this way, you create a clear, modular representation of your RAG pipeline. The __init__() method defines the necessary modules, while the forward() method specifies the flow of data and the order in which the modules are executed.

This modular approach makes the pipeline easier to understand, modify, and extend: you can swap modules or add new ones as needed, and the flow of information remains clear and structured.

In short, DSPy provides a high-level, intuitive way to define the architecture and behavior of your RAG pipeline, abstracting away low-level details so you can focus on the overall structure and flow of your program.

With that done, let’s test out our RAG.

uncompiled_rag = RAG()

example_query = "Tell me about the instances when the customer's camera broke"

response = uncompiled_rag(example_query)

print(response.answer)
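
Because DSPy constructs the prompt for you, it is instructive to look at what was actually sent to Gemma. DSPy's language model wrappers keep a history of calls, so the following should print the generated chain-of-thought prompt and the completion (assuming the HFModel wrapper records history like DSPy's other LM clients):

# Print the most recent prompt/completion pair that DSPy generated
llm.inspect_history(n=1)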

To Summarize

This blog post provided a concise introduction to building a basic Retrieval-Augmented Generation (RAG) pipeline using DSPy, Gemma (a language model hosted on Hugging Face), and the Qdrant vector database. We demonstrated how to integrate these components so that the language model's answers are grounded in additional context retrieved from the vector database.

However, it’s important to note that this is a simplified example, and there are several ways to further enhance and optimize the RAG pipeline. Here are a few suggestions:

1. Fine-tuning the language model: Instead of using an off-the-shelf language model like Gemma, you can fine-tune the model on a specific domain or task-related dataset. This can help the model generate more relevant and accurate responses tailored to your specific use case.

2. Try a different embedding model.

3. Try this pipeline on a different dataset for a different use case.
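
Another natural next step is to compile the pipeline. So far we have only run the uncompiled program (hence the name uncompiled_rag). DSPy's optimizers, called teleprompters, can bootstrap few-shot demonstrations for the modules automatically. The sketch below is illustrative only: the training examples and the metric are hypothetical placeholders, not part of the original pipeline.

from dspy.teleprompt import BootstrapFewShot

# Hypothetical, hand-written training examples (placeholders, not drawn from the dataset)
trainset = [
    dspy.Example(
        question="What should a customer do if their camera arrives damaged?",
        answer="Contact customer support to arrange a replacement or a refund.",
    ).with_inputs("question"),
    dspy.Example(
        question="How can a customer track a delayed order?",
        answer="Share the order ID with the support agent, who can check the shipment status.",
    ).with_inputs("question"),
]

# A toy metric: accept any non-empty answer (replace with a real quality check)
def validate_answer(example, pred, trace=None):
    return len(pred.answer.strip()) > 0

teleprompter = BootstrapFewShot(metric=validate_answer)
compiled_rag = teleprompter.compile(RAG(), trainset=trainset)

response = compiled_rag(example_query)
print(response.answer)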

GitHub

The complete code for this article is available at: https://github.com/vardhanam/qdrant_dspy