Hindi Language RAG Pipeline with Mistral and Qdrant
This article was originally published on Medium at https://medium.com/@vardhanam.daga/hindi-language-rag-pipeline-with-mistral-and-qdrant-d2bcf162b6e5
Why a Hindi-Language RAG Pipeline?
In my previous blogs, I've explored setting up Qdrant-powered RAG pipelines with various LLMs and different kinds of datasets. But I have never ventured beyond English-language data. In this blog post, I want to take on an interesting challenge: developing a RAG pipeline for a Hindi text corpus. There are hardly any demonstrations out there that show how to vectorize and retrieve Hindi text for an LLM-based application.
The development of Hindi-language Large Language Model (LLM) applications is crucial for several reasons. First, it democratizes access to technology and information for the vast number of Hindi speakers across the globe. Hindi is one of the most spoken languages in the world, especially in India, so building LLM applications in Hindi brings the benefits of artificial intelligence and machine learning to a much larger population.
Second, it supports the preservation and growth of the Hindi language by integrating it into modern technology, encouraging its use in digital communication and content creation. This is essential for maintaining the language's cultural heritage while allowing it to evolve with technological advancements.
Additionally, Hindi LLM applications can significantly improve the experience of regional-language speakers by providing more accurate and contextually relevant responses in natural language processing tasks such as voice recognition, text analysis, and automated customer service. This improves user engagement and opens up new avenues for businesses and educational institutions to serve the Hindi-speaking market. Finally, fostering innovation in Hindi language technology stimulates research and development in computational linguistics within India, promoting technological advancement that is inclusive of the country's linguistic diversity.
I’ll use the Hindi model by FastText to create sentence embeddings for our data. FastText is an open-source, free, lightweight library that allows users to learn text representations and text classifiers. I’ll use the Hindi Aesthetic Corpus dataset, which contains about 1000 text files of Hindi content. This dataset is composed of novels and short stories written in the Hindi language, gathered from various online sources. It includes content from http://hindisamay.com, an electronic library managed by the Mahatma Gandhi International Hindi University, Wardha; http://premchand.co.in, dedicated to the celebrated Hindi novelist Premchand; and the digital library of the Bhandarkar Oriental Research Institute (borilib.com). In preparation for the analysis, the text was segmented into sentences, and all special characters, English words, and Latin numerals were removed.
Metadata details:
- Total number of unique words: 145,508
- Total number of unique lemmas: 118,266
The collection comprises 978 works, including novels, short stories, and non-fiction texts. Metadata for 164 of these works could not be located.
First, I'll load and split the corpus into smaller chunks using LangChain. Then I'll vectorize these chunks and upsert them into our vector database, Qdrant, after which I'll hook it up to our LLM to provide context for question-answering.
I will be using Mistral-7B as our LLM. My first choice was OpenHathi, the Hindi-language LLM developed by Sarvam AI, but upon initial exploration I found its performance below par: it could not generate coherent responses to my queries. That is why I decided to go ahead with Mistral instead of OpenHathi.
Step-by-Step Guide to Building a Hindi-Language RAG Pipeline with Mistral and Qdrant
Let’s get started.
Install the following libraries:
pip install langchain langchain-community transformers qdrant-client accelerate torch bitsandbytes pandas
Import DirectoryLoader and TextLoader from LangChain, and use a RecursiveCharacterTextSplitter to break the text into smaller chunks, to account for the limited context length of LLMs.
from langchain_community.document_loaders import DirectoryLoader
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split on recursive character boundaries: 1000-character chunks with a
# small overlap so context is preserved across chunk edges.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)

# Load every text file in the 'hindi-corpus' directory and split it.
loader = DirectoryLoader('hindi-corpus', loader_cls=TextLoader)
docs = loader.load_and_split(text_splitter=text_splitter)
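Optionally, a quick sanity check confirms the corpus loaded and split as expected (page_content and metadata are LangChain's standard Document fields):
# How many chunks did we get, and what does one look like?
print(len(docs))
print(docs[0].metadata)
print(docs[0].page_content[:200])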
Create a Pandas DataFrame from our docs, with columns for the page content and metadata; we'll add the id and embeddings columns in the next steps.
import pandas as pd

# Collect each chunk's text and metadata into rows.
data = []
for doc in docs:
    row_data = {
        "page_content": doc.page_content,
        "metadata": doc.metadata
    }
    data.append(row_data)

df = pd.DataFrame(data)

# Replace newlines with spaces so each chunk is a single line of text.
df['page_content'] = df['page_content'].replace('\\n', ' ', regex=True)
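Optionally, you can verify the cleanup step; after the replace, no chunk should contain a newline:
# Confirm the newline replacement worked and inspect the DataFrame shape.
print(df['page_content'].str.contains('\n').sum())  # expected: 0
print(df.shape)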
Install fasttext by following the official FastText installation guide, then download the pre-trained Hindi model, wiki.hi.bin, from the FastText website (https://fasttext.cc).
import fasttext as ft

# Loading the pre-trained model for Hindi.
embed_model = ft.load_model('wiki.hi.bin')
Generate sentence embeddings.
df['embeddings'] = df['page_content'].apply(lambda x: (embed_model.get_sentence_vector(x)).tolist())
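FastText's wiki models produce 300-dimensional vectors. A quick check confirms this; the dimension matters when we create the Qdrant collection below:
# Each sentence embedding should have 300 dimensions.
print(len(df['embeddings'].iloc[0]))  # expected: 300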
Add the id column.
df['id'] = range(1, len(df) + 1)
Create a payload to be inserted alongside each vector. The payload is additional information that can be pulled from the vector DB during similarity search; in this case, the page content and metadata of each chunk, which we'll later feed to the LLM as context.
payload = df[['page_content', 'metadata']].to_dict(orient='records')
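Each payload record is a plain dictionary. Inspecting the first one shows what will be stored alongside each vector (the exact source path depends on your directory layout):
# Example shape: {'page_content': '...', 'metadata': {'source': 'hindi-corpus/...'}}
print(payload[0])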
Initialize an in-memory instance of Qdrant. This is quite fast, since all the data lives directly in RAM.
from qdrant_client import QdrantClient
client = QdrantClient(location=':memory:')
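An in-memory instance is ephemeral, so the collection vanishes when the process exits. If you want the index to persist, the same client can point at local on-disk storage or a standalone Qdrant server instead; a minimal sketch, assuming a server on the default port:
# Local on-disk storage, no server required:
# client = QdrantClient(path='./qdrant_data')

# Or a standalone Qdrant server (e.g. started via Docker):
# client = QdrantClient(url='http://localhost:6333')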
Create a vector collection. Set the vector size to 300, since that is the embedding dimension FastText produces.
from qdrant_client.http import models

# Drop any existing collection with this name, then create it fresh.
client.delete_collection(collection_name="hindi_collection")
client.create_collection(
    collection_name="hindi_collection",
    vectors_config=models.VectorParams(size=300, distance=models.Distance.COSINE),
)
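Optionally, confirm the collection was created with the expected configuration:
# The collection should report a 300-dimensional cosine vector config.
info = client.get_collection(collection_name="hindi_collection")
print(info.config.params.vectors)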
Now bulk upload the data into the Vector DB.
client.upsert(
    collection_name="hindi_collection",
    points=models.Batch(
        ids=df['id'].to_list(),
        payloads=payload,
        vectors=df['embeddings'].to_list(),
    ),
)
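A count of the stored points should now match the number of rows in the DataFrame:
# Verify that every chunk made it into the collection.
print(client.count(collection_name="hindi_collection").count)
print(len(df))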
Load a 4-bit quantized version of the Mistral-7B model to reduce memory usage and speed up inference.
from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Prepare the config for quantizing the model to 4 bits (NF4, double quantization).
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load the tokenizer and the quantized Mistral model.
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Wrap the model in Hugging Face's text-generation pipeline. The object is
# named 'pipe' so it does not shadow the imported pipeline() function.
# top_k=1 with a near-zero temperature makes these defaults effectively
# greedy; the helper function below overrides them per call.
pipe = pipeline(
    "text-generation",
    model=model_4bit,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_new_tokens=5000,
    do_sample=True,
    top_k=1,
    temperature=0.01,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)
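Before wiring up retrieval, it's worth smoke-testing the pipeline on its own. The generation settings above can be overridden per call; the prompt here is just an arbitrary Hindi greeting ("Hello, who are you?"):
# Quick smoke test: the model should produce coherent text.
out = pipe("[INST] नमस्ते, आप कौन हैं? [/INST]", max_new_tokens=50)
print(out[0]['generated_text'])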
Then create a helper function that takes a question as input and generates the answer. It first builds context for the LLM by running a similarity search for the question against the vector DB, then feeds the retrieved chunks into a prompt template for the LLM.
def generate_text(question):
    # Search for the most relevant chunks in the 'hindi_collection'.
    hits = client.search(
        collection_name="hindi_collection",
        query_vector=embed_model.get_sentence_vector(question).tolist(),
        limit=10,
    )

    # Concatenate the retrieved chunks into a single context string.
    context = ''
    for hit in hits:
        context += hit.payload['page_content'] + '\n'

    # Construct the prompt in Mistral's instruction format. The Hindi
    # instruction translates to: "You are a respected assistant. Your job
    # is to answer questions from the context given below."
    prompt = f"""<s>[INST] आप एक सम्मानीय सहायक हैं। आपका काम नीचे दिए गए संदर्भ से प्रश्नों का उत्तर देना है।
संदर्भ: {context}
प्रश्न: {question} [/INST] </s>
"""

    # Generate the answer with the quantized Mistral pipeline.
    sequences = pipe(
        prompt,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
    )
    return sequences[0]['generated_text']
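One caveat: by default, the Hugging Face text-generation pipeline returns the prompt followed by the completion, so the returned string includes the echoed context. If you only want the model's answer, you can pass the pipeline's standard return_full_text argument inside generate_text:
    sequences = pipe(
        prompt,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
        return_full_text=False,  # return only the generated answer, not the prompt
    )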
Let's take a look at a sample output. I'll ask our RAG pipeline to dig up content about Mahatma Gandhi, known for his policy of non-violence in politics. The query below asks, in Hindi, "Tell me about Gandhi in detail." The model responds in Hindi, drawing on the retrieved corpus to describe Gandhi's role in the independence movement, his view that communal tension was a product of British rule, and themes from his book Hind Swaraj.
generate_text("मुझे गांधी के बारे में विस्तार से बताएं।")
महात्म गांधी जी बघोर-शाह राजनेता थे, जो भारत के स्वतंत्रता आंदोलन के लिए प्रासंगिक रूप से जगर्खणे के साथ-साथ हिंदू-मुस्लिम तनाव को खत्म करने के लिए काम करते थे। उन्हें ब्रिटिश राज्य के साथ-साथ हिंदू-मुस्लिम संतुलन के साथ रहने का कारण था, जिसमें वे सांप्रदायिक तनाव को हम मिलकर खत्म करेंगे। उन्होंने भारतीय राष्ट्रीय कांग्रेस में प्रभावी शख्सियत के रूप उभरने के आरंभित दिनों तक गांधी जी ने यह राय रखी, कि सांप्रदायिक तनाव ब्रिटिश शासन की देन है। परंतु, उन्होंने कभी भी ब्रिटिशों को किसी भी हालत में जाना होगा। हिंदू-मुस्लिम तनाव को हम मिलकर खत्म करेंगे और भारतीयों की गुलामी हिंदू-मुस्लिम तनाव की तुलना में ज्यादा बड़ी समस्या है।
गांधी जी का जीवन के साथ-साथ उनके विचारों को भी प्रतिक्रियावादी होने की बात है। उन्हें परिकर्तन होने के लिए वह सारी हिंद स्वराज में पाई जाती थी, जो उनके श्रद्धा काम करती रही थी। उन्हें सांप्रदायिक तनाव को हम मिलकर खत्म करने के लिए जीवन का पूरा निरीक्षण के लिए चढ़ावेला था। उन्हें अपनी विचारों को विश्वास होने के लिए भी आवश्यक था, जो उनके साथ-साथ रहने वाले स्वतंत्र स्वराज का विरोध और पश्चिमी सभ्यता के तीनों के बारे में भी विवेचित थे।
गांधी जी का किताब "हिंद स्वराज" में वह उस समय की सituation को विविध रूप से वर्णन करते हैं, जिसमें वह सांप्रदायिक तनाव के बरकरार रहने की स्थिति में ब्रिटिशों को कैसे बाहर किया जा सकता है दिया है। उन्हें सांप्रदायिक तनाव के हिंदू-मुस्लिम तनाव की तुलना में ज्यादा बड़ी समस्या है दिया गया है, और वह उस समस्या को हम मिलकर खत्म करेंगे। उन्होंने बहुमूल्य विचारों को जानने के लिए उनके साथ-साथ रहने वाले शाहिरें तथा पुस्तकों से सुपूर्ण प्राकृतिक और पुरातात्मक ज्ञान प्राप्त किया था।
गांधी जी का जीवन के अन्य अहिंसा के कार्य की साथ-साथ उनके विचारों को प्रतिक्रियावादी होने की बात है। उनके विचारों को विश्वास होने के लिए वह अपनी कविताएँ सेवकों से सामग्री के लिए चुनाव के साथ भी विविध रूप से प्रक्रिया करने थे। उन्हें अपनी विचारों को विश्वास होने के लिए अपने मानसिकता को भी पहले से समझने के लिए चुनाव के साथ भी उत्सुक थे। उनके विचारों को विश्वास होने के लिए वह अपने घर वालों से भी अपने अधिकारों को अपने हीना पहुँचाते थे। उन्हें साथियों के साथ रहने के साथ-साथ अपनी स्वतंत्रता को पाएंगे जानने के लिए वह अपनी जीवन का पूरा निरीक्षण के लिए चढ़ावेला था।
गांधी जी की जीवन का एक धनुष था, जिससे वह हिंदू-मुस्लिम तनाव को हम मिलकर खत्म करने के लिए सभी प्रकार को प्रयास करने थे। उन्होंने प्रतिद्विधी रूप से सांप्रदायिक तनाव को हम मिलकर खत्म करने के लिए अपनी जीवन का पूरा निरीक्षण के लिए चढ़ावेला था। उन्हें अपनी विचारों को विश्वास होने के लिए अपनी मानसिकता को भी पहले से समझने के लिए चुनाव के साथ भी उत्सुक थे। वह अपनी कविताएँ सेवकों से सामग्री के लिए चुनाव के साथ भी विविध रूप से प्रक्रिया करते थे। उनके विचारों को विश्वास होने के लिए वह अपने घर वालों से भी अपने अधिकारों को अपने हीना पहुँचाते थे। वह साथियों के साथ रहने के साथ-साथ अपनी स्वतंत्रता को पाएंगे। उन्होंने सांप्रदायिक तनाव के हिंदू-मुस्लिम तनाव की तुलना में ज्यादा बड़ी समस्या को हम मिलकर खत्म करने के लिए अपनी जीवन का पूरा निरीक्षण के लिए चढ़ावेला था।
Conclusion
Voila! It was so easy to set up a Hindi Language RAG pipeline! By taking inspiration from the code above, you can create your very own RAG pipelines attuned to a language of your choice. Thanks for reading, and let me know if you have any questions by commenting below!