Building a Multi-Tenant Qdrant-Powered LLM Application

Over the last year, we have seen LLM-powered AI applications emerge, with semantic search as a key component. Semantic search is powerful because, unlike traditional keyword-based search, it understands the meaning of a query and can retrieve results that match that meaning, not just its specific words. This is achieved by mapping paragraphs, sentences, or documents into a vector space, and then storing the resulting vectors in a vector database or vector search engine.

Under the hood, text encoders compute embeddings for each document in a corpus, and semantic search retrieves the texts whose meaning best matches a search query. The retrieved text is then used as context for an LLM, allowing it to respond accurately to user queries.
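
To make this concrete, here is a minimal, self-contained sketch of the idea (the corpus sentences and query are purely illustrative, not part of the application we build later). It uses the same all-MiniLM-L6-v2 encoder that appears later in this article:

from sentence_transformers import SentenceTransformer, util

# Encode a tiny corpus and a query into the same vector space
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
corpus = [
    "The weather in Boston is cold and rainy today.",
    "Qdrant is a vector search engine written in Rust.",
    "Commencement speeches often focus on failure and resilience.",
]
corpus_embeddings = model.encode(corpus)

query = "Which database can I use for similarity search?"
query_embedding = model.encode(query)

# Cosine similarity ranks documents by how close their meaning is to the query
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
best = scores.argmax().item()
print(corpus[best])  # the Qdrant sentence, despite sharing few keywords with the query

In a real application, the corpus embeddings live in a vector database rather than in memory, which is exactly where Qdrant comes in.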

In other words, semantic search has become a pivotal component in building AI applications. However, when developers start building a production-grade AI application, a key architectural question comes up: what is the right way to store user-specific vector data? Should you create a separate collection for each user? Or is there a more efficient way to achieve this?

In this deep dive, we address this question of multi-tenancy. Multi-tenancy in a web application refers to the architectural approach where multiple users access and interact with their own datasets from within the same data stack.

Understanding Multi-Tenancy in Vector Stores

As we discussed above, a typical RAG pipeline combines a vector search engine (or vector database) for storing context with an LLM to handle the interaction. This means the developer needs to store user-specific data within the vector database.

Let’s understand this with a real-world scenario: assume you are a developer building an AI platform that serves multiple companies, each of which has millions of users. Each company has its own users, and each user has their own documents.

To achieve this, many developers make the mistake of creating isolation through a separate collection for each company or user. However, this is usually unnecessary and can be prohibitively expensive.

The right approach is to use the multi-tenancy features of a vector store like Qdrant: a single collection serves the entire LLM stack, and the data is partitioned by a unique ID stored in each point’s payload.

This is extremely powerful because it allows the application to scale much better, without the resource overhead of maintaining many collections. Let’s understand how this works. We will use Qdrant as our vector search engine.

Qdrant’s Multi-Tenancy Capabilities

Qdrant comes with built-in support that makes multi-tenancy easy to implement: you simply attach a unique ID to each point’s payload. This helps developers design and deploy large-scale LLM applications for multiple users in a straightforward manner.

Here’s how it works:

from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

client.upsert(
    collection_name="{collection_name}",
    points=[
        models.PointStruct(
            id=1,
            payload={"group_id": "user_1"},
            vector=[0.9, 0.1, 0.1],
        ),
        models.PointStruct(
            id=2,
            payload={"group_id": "user_1"},
            vector=[0.1, 0.9, 0.1],
        ),
        models.PointStruct(
            id=3,
            payload={"group_id": "user_2"},
            vector=[0.1, 0.1, 0.9],
        ),
    ],
)

Using the above call, you keep a single collection but partition its points across different users via the “group_id” payload field.

And, in order to retrieve data specific to a user, you can use a filter based on the “group_id” in the following manner:

from qdrant_client import QdrantClient, models

client = QdrantClient("localhost", port=6333)

client.search(
    collection_name="{collection_name}",
    query_filter=models.Filter(
        must=[
            models.FieldCondition(
                key="group_id",
                match=models.MatchValue(
                    value="user_1",
                ),
            )
        ]
    ),
    query_vector=[0.1, 0.1, 0.9],
    limit=10,
)

Fairly simple and easy to understand, as is also explained in the Qdrant documentation.

Now, when you construct your group_id, you can use various strategies depending on your stack. Here are some ideas:

– For simple stacks serving a single company with multiple users: use an ID unique to each user.

– For serving multiple companies, each with their own users: construct an ID that combines the company ID and the user ID (see the sketch below).

And so on. This simple strategy means you use the same collection, but harness it for one or more companies, each with one or more users.
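
As a quick sketch of the second strategy (the helper name and the "company:user" format below are just one possible convention, not a Qdrant requirement), constructing such a group_id could be as simple as:

def make_group_id(company_id: str, user_id: str) -> str:
    # Combine tenant identifiers into a single payload value,
    # e.g. "acme_corp:user_42"
    return f"{company_id}:{user_id}"

point_payload = {"group_id": make_group_id("acme_corp", "user_42")}

At query time, you filter on exactly the same composed value, so the only requirement is that the format is consistent and unique per tenant.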

We will demonstrate this with a RAG application stack, so you can see how multi-user setups can be built.

This will help you understand how multi-tenancy integrates with the LLM question-answering pipeline. We will also create an API endpoint to give you a hint of how you can do this at scale using your own backend, such as Django, Laravel, or others.

Building a Multi-Tenant AI Application Using Qdrant and Mistral-7B

Let’s get started with designing a simple multi-tenant LLM application using Qdrant and the LLM Mistral-7B. We will create an API endpoint for our application using Flask, so that users can query their respective documents using simple curl commands.

Let’s take 3 users: user_1, user_2, and user_3.

Each has a separate document that needs to be inserted into the vector database and then queried in a question-answer format with the LLM.

We’ll take Harvard University commencement speeches from the Harvard Gazette.

  • User_1 selects Mark Zuckerberg’s speech,

  • User_2 goes for J.K. Rowling’s speech, and

  • User_3 wants to upload Steven Spielberg’s speech.

This is, of course, an assumption — that all our users are interested in speeches by celebrities. However, I digress. 🙂

Fire up Colab, your laptop, or wherever you run your Jupyter notebooks. Next, let’s install all the dependencies needed to build the application.

!pip install transformers qdrant-client sentence-transformers accelerate tqdm pypdf pandas bitsandbytes langchain langchain-community flask

Next, load these documents using the PyPDFLoader.

from langchain_community.document_loaders import PyPDFLoader
loader_zuck = PyPDFLoader('/home/vardhanam/zuck_harvard.pdf')
loader_rowling = PyPDFLoader('/home/vardhanam/jk_rowling.pdf')
loader_spielberg = PyPDFLoader('/home/vardhanam/steven_spielberg.pdf')

After that, we’ll split the documents into smaller chunks so that we can provide the LLM with only the relevant context. This is necessary because an LLM’s context length is limited, so the entire document cannot be passed in one go.

from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=20,
    length_function=len,
    is_separator_regex=False,
)
docs_zuck = loader_zuck.load_and_split(text_splitter=text_splitter)
docs_rowling = loader_rowling.load_and_split(text_splitter=text_splitter)
docs_spielberg = loader_spielberg.load_and_split(text_splitter=text_splitter)

Let’s create a helper function to make a Pandas dataframe out of our documents. Using a Pandas dataframe makes it easy to perform various operations on our data, as we will see shortly.

import pandas as pd

def create_dataframe(docs, user_id):
    data = []
    for doc in docs:
        row_data = {
            "group_id": user_id,
            "page_content": doc.page_content,
            "metadata": doc.metadata
        }
        data.append(row_data)
    df = pd.DataFrame(data)
    df['page_content'] = df['page_content'].replace('\\n', ' ', regex=True)
    return df

Creating dataframes for the different documents and concatenating them into a single dataframe:

df_zuck = create_dataframe(docs_zuck, 'user_1')
df_rowling = create_dataframe(docs_rowling, 'user_2')
df_spielberg = create_dataframe(docs_spielberg, 'user_3')
result_df = pd.concat([df_zuck, df_rowling, df_spielberg], ignore_index=True)

Adding an id column to the dataframe:

result_df['id'] = range(1, len(result_df) + 1)

Let’s load an embedding model to vectorize the content of the documents. We’ll be using the ‘sentence-transformers/all-MiniLM-L6-v2’ model, which maps text into a 384-dimensional vector space and works well for general-purpose text embeddings.

from sentence_transformers import SentenceTransformer
embed_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

Next, apply this to the page_content column of the dataframe and then save the embeddings in a separate column.

result_df['embeddings'] = result_df['page_content'].apply(lambda x: embed_model.encode([x])[0])

This is what our final dataframe looks like:

result_df

Now run Qdrant on your local server with the following command:

docker run -p 6333:6333 -p 6334:6334 \
 -v $(pwd)/qdrant_storage:/qdrant/storage:z \
 qdrant/qdrant

Create an instance of Qdrant client:

from qdrant_client import QdrantClient
client = QdrantClient("localhost", port=6333)

Create a new collection called “speech_collection” in it.

from qdrant_client.http import models

# Remove any existing collection with the same name, then create it afresh
client.delete_collection(collection_name="speech_collection")
client.create_collection(
    collection_name="speech_collection",
    vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
)

Now create a payload, which is basically metadata information that will be inserted into the Vector DB along with the vectors themselves.

payload = result_df[['group_id', 'page_content', 'metadata']].to_dict(orient='records')

Now we are ready to upsert the vectors with their payload.

client.upsert(
    collection_name="speech_collection",
    points=models.Batch(
        ids=result_df['id'].to_list(),
        payloads=payload,
        vectors=result_df['embeddings'].to_list(),
    ),
)

Check if the vectors have been inserted:

client.scroll(collection_name="speech_collection", limit=100)

Now we’ll quantize and load our Mistral-7B model. Quantization reduces the precision of the model’s weights, which costs a little accuracy but significantly lowers memory requirements and typically speeds up inference.

from transformers import BitsAndBytesConfig
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# preparing config for quantizing the model into 4 bits
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# load the tokenizer and the quantized Mistral model
model_id = "mistralai/Mistral-7B-Instruct-v0.2"
model_4bit = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# build a HuggingFace text-generation pipeline around the quantized model
llm_pipeline = pipeline(
    "text-generation",
    model=model_4bit,
    tokenizer=tokenizer,
    use_cache=True,
    device_map="auto",
    max_new_tokens=5000,
    do_sample=True,
    top_k=1,
    temperature=0.01,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
    pad_token_id=tokenizer.eos_token_id,
)

We’ll create a helper function that takes in the username (e.g. ‘user_1’) and the query question, and returns the generated output from the model.

We’ll build context for the model by retrieving documents from the vector DB that are similar to the question. This is known as similarity search: the vector database looks for material that resembles the content of the question, and that material is then passed to the LLM for refinement and analysis.

This way users can get contextually relevant answers to their queries.

def generate_text(user, question):
    # Searching for relevant hits in the 'speech_collection',
    # restricted to the requesting user's group_id
    hits = client.search(
        collection_name="speech_collection",
        query_filter=models.Filter(
            must=[
                models.FieldCondition(
                    key='group_id',
                    match=models.MatchValue(
                        value=user,
                    ),
                )
            ]
        ),
        query_vector=embed_model.encode(question).tolist(),
        limit=10,
    )
    # Creating context from the hits
    context = ''
    for hit in hits:
        context += hit.payload['page_content'] + '\n'
    # Constructing the prompt
    prompt = f"""<s>[INST] You are a helpful, respectful and honest assistant.
    Your task is to answer questions from context below.
    {context}
    {question} [/INST] </s>
    """
    # Generating text using the Mistral pipeline
    sequences = llm_pipeline(
        prompt,
        do_sample=True,
        temperature=0.7,
        top_k=50,
        top_p=0.95,
        num_return_sequences=1,
    )
    return sequences[0]['generated_text']

Once this is done, we can create an API endpoint using Flask.

In an actual production setup, developers would use a frontend framework such as React, Vue, or Svelte to send API requests to the Flask endpoint. Here, we will skip the frontend stack and showcase the API request and response cycle through curl.

This endpoint lets clients send a username and a question and receive the answer the model generates for them. In a real frontend stack, they would then use this response to render or update the DOM accordingly.

So, below is the Flask API endpoint. We have kept the API payload very simple:

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/generate_text', methods=['POST'])
def generate_text_api():
    try:
        user = request.json['user']
        question = request.json['question']
        # Call your generate_text function
        generated_text = generate_text(user, question)
        # Print the generated text in the terminal
        print("Generated Text:", generated_text)
        return jsonify({"generated_text": generated_text}), 200
    except Exception as e:
        return jsonify({"error": str(e)}), 500

Now, we’re all set. Let’s run our Flask app.

app.run(port=5000)

We’ll show example queries for each user.

Let’s start with user_1, who wants to know what Mark Zuckerberg’s viewpoint is on failures.

We can do this by sending a curl request to the Flask endpoint with the right API payload:

curl -X POST -H "Content-Type: application/json" -d '{"user":"user_1", "question":"What does Zuckerberg have to say about failure?"}' http://127.0.0.1:5000/generate_text

Output:

Mark Zuckerberg emphasizes the importance of failure and the freedom to fail in order to create historic enterprises. He uses his own experiences of building various projects, some of which didn’t succeed, as examples. He also mentions that today’s society is over-indexed on rewarding success and doesn’t do enough to make it easy for everyone to take lots of shots. He believes that there’s something wrong with the system when some people have the freedom to turn their ideas into historic enterprises while others can’t afford to pay off their loans or even start a business. He encourages the graduates to be idealistic and prepare to be misunderstood, as well as to not let fear of making mistakes keep them from starting. He calls for public works projects, such as stopping climate change and involving millions of people in the effort, to define their generation.

Remember that we used the same Qdrant collection to store the documents of user_1, user_2, and user_3.

Now, let’s try the same with user_2.

User_2 wants to know J.K. Rowling’s advice for nurturing the imagination.

curl -X POST -H "Content-Type: application/json" -d '{"user":"user_2", "question":"How is one to train their imagination? What does J K Rowling suggest?"}' http://127.0.0.1:5000/generate_text

Output:

According to J.K. Rowling, one way to train one’s imagination is by empathizing with humans whose experiences we have never shared. She emphasizes the importance of using our imagination to understand and connect with others, rather than just to manipulate or control. Rowling also encourages us to be open to new experiences and to exercise our imaginations actively, rather than remaining comfortably within the bounds of our own experience. She believes that by using our imaginations in this way, we can not only enrich our own lives but also make a positive impact on the world by empathizing with and helping those who are less fortunate.

Rowling also suggests that failure can be a powerful catalyst for imagination and growth. By facing and overcoming failure, we can gain new perspectives and develop the resilience and determination needed to succeed in the future. Ultimately, Rowling encourages us to use our imaginations to expand our horizons, connect with others, and make a positive difference in the world.

As we can see, it responded with the right answer. Now, let’s try the same with user_3, who was interested in Steven Spielberg’s talk.

User_3 has a simple query: “Which films does Spielberg talk about in his commencement speech?”

curl -X POST -H "Content-Type: application/json" -d '{"user":"user_3", "question":"Which films does Spielberg talk about in his speech?"}' http://127.0.0.1:5000/generate_text

Output:

1. “It’s a Wonderful Life”
2. “1941”
3. “The Color Purple”
4. “Jurassic Park”
5. “Star Wars: The Force Awakens”
6. “Indiana Jones”

He mentions these films in the context of character-defining moments and the impact they had on him.

As you can see, our Qdrant-powered LLM application gives lucid and detailed answers to the queries of three different users, each against their own documents. In this way we have established a smooth multi-tenancy setup in our web application. Taking inspiration from the code above, you can build your own application and deploy it on a server for multiple users at scale.

Where Should You Use Separate Collections in Qdrant?

Now, while harnessing a single collection with payload-based partitioning is going to work for most use cases, there are situations where creating separate collections makes more sense.

The rule of thumb is this: whenever you have a small number of users but need strict isolation between them, you can use a separate collection per user. However, as we discussed before, this comes at a cost. First, it leads to resource overhead. Second, you would need to ensure that one overloaded user doesn’t affect the performance of the others.
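
If you do go down that route, the setup is the same create_collection call we used earlier, just issued once per tenant. The per-tenant naming scheme below is only an illustrative assumption, not a Qdrant requirement:

from qdrant_client import QdrantClient
from qdrant_client.http import models

client = QdrantClient("localhost", port=6333)

def create_tenant_collection(tenant_id: str):
    # One dedicated collection per tenant, e.g. "tenant_acme_corp"
    client.create_collection(
        collection_name=f"tenant_{tenant_id}",
        vectors_config=models.VectorParams(size=384, distance=models.Distance.COSINE),
    )

Bear in mind that each collection carries its own index and memory footprint, which is exactly the overhead the payload-based approach avoids.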

Hope you enjoyed reading this article! If you did, it would be great if you can put in a few likes and maybe leave a comment or two. Thank you.