How to Build an Enterprise Chatbot Using Qdrant and Llama 2

The entire code for the blog post can be found in the GitHub link here: https://github.com/vardhanam/enterprise_chatbot_qdrant/tree/main

Originally posted on medium: https://medium.com/@vardhanam.daga/how-to-build-an-enterprise-chatbot-using-qdrant-and-llama-2-d2af666942a4

The integration of enterprise chatbots with access to internal company data represents a pivotal advancement in streamlining business operations and enhancing employee productivity. By leveraging AI to navigate and retrieve information from vast internal databases, these chatbots can significantly reduce the time employees spend searching for information, whether it’s sales figures, inventory levels, or project status updates.

This immediate access to relevant data not only accelerates decision-making processes but also fosters a more agile and informed workforce capable of responding quickly to changing business conditions. Moreover, by automating routine inquiries and tasks, such chatbots allow employees to focus on more strategic and creative tasks, thereby boosting overall productivity and innovation within the company.

In an era where data is a critical asset, enterprise chatbots that can efficiently mine and manage this information are becoming an essential tool for companies aiming to maintain a competitive edge.

How Are Enterprise Chatbots Designed?

AI chatbots are built using a sophisticated combination of large language models (LLMs) and vector databases, harnessing the power of advanced artificial intelligence to understand and respond to user queries with high accuracy.

LLMs, such as GPT, Llama 2, and Mistral, are trained on vast datasets to comprehend and generate human-like text, enabling chatbots to process natural language queries and engage in conversations that feel intuitive to users.

To enhance their responsiveness and relevance, these chatbots utilize vector databases, which efficiently store and retrieve high-dimensional data vectors representing text. This setup allows for the quick matching of user queries with the most relevant information or responses by measuring the similarity between vectors.

Together, LLMs and vector databases form the backbone of AI chatbots, enabling them to deliver fast, accurate, and contextually aware interactions, transforming how businesses and consumers communicate.
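
To make the idea concrete, here is a minimal, self-contained sketch of similarity matching, using the same sentence-transformers embedding model our app uses later. The documents and query are made up for illustration:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

documents = [
    "Q3 sales grew 12% over the previous quarter.",
    "The warehouse currently holds 4,200 units of SKU-113.",
]
query = "How did sales perform last quarter?"

# Encode the documents and the query into high-dimensional vectors
doc_vectors = model.encode(documents)
query_vector = model.encode(query)

# Cosine similarity between the query and each document;
# the highest score is the best match (here, the sales figure)
scores = util.cos_sim(query_vector, doc_vectors)
print(scores)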

Key Tools Used for Our Enterprise Chatbot

In this blog, we shall design a Streamlit-based enterprise chatbot powered by Llama 2 and Qdrant. Qdrant is an open-source vector database optimized for similarity search in high-dimensional data, supporting real-time updates and advanced filtering for dynamic AI applications.
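
As a quick illustration of what Qdrant offers, here is a minimal sketch using the qdrant-client library directly. The collection name, vectors, and payloads are invented for the example, and the exact client API may vary slightly between versions:

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

client = QdrantClient(":memory:")  # in-memory instance, as in our app

client.create_collection(
    collection_name="demo",
    vectors_config=VectorParams(size=4, distance=Distance.COSINE),
)

# Real-time updates: upsert points at any time
client.upsert(
    collection_name="demo",
    points=[
        PointStruct(id=1, vector=[0.1, 0.9, 0.0, 0.2], payload={"dept": "sales"}),
        PointStruct(id=2, vector=[0.8, 0.1, 0.3, 0.0], payload={"dept": "ops"}),
    ],
)

# Similarity search, here with an advanced filter on payload fields
hits = client.search(
    collection_name="demo",
    query_vector=[0.1, 0.8, 0.1, 0.1],
    query_filter=Filter(must=[FieldCondition(key="dept", match=MatchValue(value="sales"))]),
    limit=1,
)
print(hits[0].payload)  # {'dept': 'sales'}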

Moreover, using Streamlit’s authentication modules, we shall also bake a user authentication widget into our UI. This way, only verified users from within the enterprise will have access to the chatbot, keeping the app secure from unauthorized access and ensuring that sensitive company information remains confidential.

Step-by-Step Implementation of the Code

Here’s the entire code for the enterprise chatbot. You can paste it into a file (e.g., app.py) and then run it with the command streamlit run app.py.
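
Before running it, make sure the imported libraries are installed. A suggested (unpinned) dependency list, assuming a CUDA-capable machine for the quantized model, would look roughly like this:

pip install streamlit streamlit-authenticator transformers torch accelerate bitsandbytes langchain langchain-community qdrant-client sentence-transformers pypdf pyyaml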

from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline
)

import torch
import streamlit as st

from langchain.llms import HuggingFacePipeline

import os

from langchain_community.document_loaders import DirectoryLoader, PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.vectorstores import Qdrant
from langchain.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import ChatPromptTemplate, PromptTemplate
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser


import streamlit_authenticator as stauth
import yaml
from yaml.loader import SafeLoader

# Directory where uploaded PDFs are stored; change this to match your setup
UPLOAD_DIR = '/home/vardhanam/enterprise_chatbot/uploaded_pdfs'

def save_uploaded_file(uploaded_file):
    try:
        # Create the upload directory if it doesn't exist
        os.makedirs(UPLOAD_DIR, exist_ok=True)

        # Save the file to disk
        with open(os.path.join(UPLOAD_DIR, uploaded_file.name), 'wb') as f:
            f.write(uploaded_file.getbuffer())

        return True

    except Exception as e:
        # If there's an error, print the exception and report failure
        print(e)
        return False

def generate_response(query):
    return chain.invoke(query)


@st.cache_resource
def load_llm():

    # Load the Llama 2 tokenizer
    model_name = 'NousResearch/Llama-2-7b-chat-hf'
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.padding_side = "right"

    # Activate 4-bit precision base model loading
    use_4bit = True
    # Compute dtype for 4-bit base models
    bnb_4bit_compute_dtype = "float16"
    # Quantization type (fp4 or nf4)
    bnb_4bit_quant_type = "nf4"
    # Activate nested quantization for 4-bit base models (double quantization)
    use_nested_quant = False

    # Set up the quantization config
    compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=use_4bit,
        bnb_4bit_quant_type=bnb_4bit_quant_type,
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=use_nested_quant,
    )

    # Check GPU compatibility with bfloat16
    if compute_dtype == torch.float16 and use_4bit:
        major, _ = torch.cuda.get_device_capability()
        if major >= 8:
            print("=" * 80)
            print("Your GPU supports bfloat16: accelerate training with bf16=True")
            print("=" * 80)

    # Load the quantized model
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
    )

    # Build an LLM text-generation pipeline
    text_generation_pipeline = pipeline(
        model=model,
        tokenizer=tokenizer,
        task="text-generation",
        temperature=0.2,
        repetition_penalty=1.1,
        return_full_text=True,
        max_new_tokens=1000,
    )

    llm = HuggingFacePipeline(pipeline=text_generation_pipeline)

    return llm


@st.cache_resource()
def process_document(folder_name):

    # Split the documents into overlapping chunks
    global text_splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=20,
        length_function=len,
        is_separator_regex=False,
    )
    loader = DirectoryLoader(folder_name, loader_cls=PyPDFLoader)
    docs = loader.load_and_split(text_splitter=text_splitter)

    # Load the embeddings model
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-mpnet-base-v2"
    )

    # Build an in-memory Qdrant vector store from the document chunks
    global qdrant_vectorstore
    qdrant_vectorstore = Qdrant.from_documents(
        docs,
        embeddings,
        location=":memory:",
        collection_name="depp_heard_transcripts",
    )

    qdrant_retriever = qdrant_vectorstore.as_retriever()

    template = """Answer the question based only on the following context:
    {context}

    Question: {question}
    """
    prompt = ChatPromptTemplate.from_template(template)

    # Compose the retrieval-augmented generation chain
    global chain
    chain = (
        {"context": qdrant_retriever, "question": RunnablePassthrough()}
        | prompt
        | llm
        | StrOutputParser()
    )

    return chain



with st.spinner("Loading llm"):
    llm = load_llm()


with st.spinner("Creating Vector DB"):
    chain = process_document(UPLOAD_DIR)


# Load the authentication config; change the path to match your setup
with open('/home/vardhanam/enterprise_chatbot/config.yaml') as file:
    config = yaml.load(file, Loader=SafeLoader)

authenticator = stauth.Authenticate(
    config['credentials'],
    config['cookie']['name'],
    config['cookie']['key'],
    config['cookie']['expiry_days'],
    config['preauthorized']
)

authenticator.login()


if st.session_state["authentication_status"]:
    authenticator.logout()
    st.write(f'Welcome *{st.session_state["name"]}*')
    # Streamlit app starts here
    st.title('Document Processing App')

    with st.form("Upload Form", clear_on_submit=True):

        # Use st.file_uploader to upload multiple files
        uploaded_files = st.file_uploader("Upload Document PDF files:", type='pdf', accept_multiple_files=True)

        submitted = st.form_submit_button("Submit")

        if submitted:
            # If files were uploaded, iterate over the list of uploaded files
            if uploaded_files is not None:
                for uploaded_file in uploaded_files:
                    # Save each uploaded file to disk
                    if save_uploaded_file(uploaded_file):
                        st.success(f"'{uploaded_file.name}' saved successfully!")

                    else:
                        st.error(f"Failed to save '{uploaded_file.name}'")
                with st.spinner("Refreshing Vector DB"):
                    process_document.clear()
                    chain = process_document(UPLOAD_DIR)
                    uploaded_files = None


    # Initialize chat history
    if "messages" not in st.session_state:
        st.session_state.messages = []

    # Display chat messages from history on app rerun
    for message in st.session_state.messages:
        with st.chat_message(message["role"]):
            st.markdown(message["content"])

    # Accept user input
    if prompt := st.chat_input("What would you like to know?"):
        # Add user message to chat history
        st.session_state.messages.append({"role": "user", "content": prompt})
        # Display user message in chat message container
        with st.chat_message("user"):
            st.markdown(prompt)

        with st.chat_message("assistant"):
            with st.spinner("Analyzing Query"):
                stream = generate_response(prompt)
                st.markdown(stream)

        st.session_state.messages.append({"role": "assistant", "content": stream})

elif st.session_state["authentication_status"] is False:
    st.error('Username/password is incorrect')

elif st.session_state["authentication_status"] is None:
    st.warning('Please enter your username and password')

Here’s what the config.yaml file (where you store the usernames and passwords of authorized users) looks like. You can tweak it for your use case.

credentials:
  usernames:
    vardhanam:
      email: vardhanam@superteams.ai
      name: Vardhanam Daga
      password: vardhanam # To be replaced with hashed password
    soum:
      email: soum@superteams.ai
      name: Soum Paul
      password: soum # To be replaced with hashed password
    debasri:
      email: debasri@superteams.ai
      name: Debasri Rakshit
      password: debasri # To be replaced with hashed password
    akriti:
      email: akriti@superteams.ai
      name: Akriti Upadhyay
      password: akriti # To be replaced with hashed password
cookie:
  expiry_days: 30
  key: random_signature_key # Must be string
  name: random_cookie_name
preauthorized:
  emails:
  - melsby@gmail.com
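
The passwords above are placeholders: streamlit-authenticator expects hashed passwords in config.yaml. A quick way to generate them is shown below; this uses the Hasher interface from the 0.2.x/0.3.x releases of the library, and the API may differ in other versions:

import streamlit_authenticator as stauth

# Hash the plaintext passwords, then paste each resulting hash into the
# corresponding password field in config.yaml
hashed_passwords = stauth.Hasher(['vardhanam', 'soum', 'debasri', 'akriti']).generate()
print(hashed_passwords)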

Let’s break down the code of our app step by step.

1. Import Libraries: Essential libraries and modules from transformers, torch, streamlit, streamlit_authenticator, and yaml are imported, alongside components for document loading, text splitting, vector storage, embedding, and prompt management from the langchain library.

2. Upload Directory Setup: A variable UPLOAD_DIR is defined to specify the directory where uploaded PDF documents are stored.

3. File Upload Function: The save_uploaded_file function saves uploaded PDF files to the specified directory, creating it first if it doesn't exist.

4. Generate Response Function: The generate_response function invokes the retrieval chain to generate responses to user queries.

5. Load Language Model with Caching: The @st.cache_resource decorator is applied to the load_llm function, which loads the Llama 2 model and configures it for efficiency with 4-bit quantization. This decorator ensures that the loaded model is cached, reducing load times for subsequent invocations.

6. Document Processing with Caching: Similarly, @st.cache_resource is used for the process_document function, which performs document loading, splitting, embedding, and updating of the Qdrant vector store. Caching the results of this computationally intensive process improves the application’s responsiveness (a minimal sketch of the caching pattern follows this list).

7. Streamlit Authentication: Utilizes streamlit_authenticator to set up a secure login mechanism based on credentials stored in a config.yaml file.

8. Streamlit Interface Setup: The user interface is created using Streamlit, starting with login verification. If authentication is successful, the user is greeted and presented with a form for uploading PDF files.

9. File Processing: Upon file submission, uploaded files are saved, and the document processing chain is refreshed to include the new data.

10. Chat Interface: A simple chat interface allows the user to submit queries, which are processed by the generate_response function, and the responses are displayed in the chat.

11. Session Management: The application manages user sessions to handle authentication states and chat histories, ensuring a seamless user experience and secure access control.
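
Here is the minimal sketch of the caching pattern referenced in point 6, showing how @st.cache_resource and .clear() interact; the cached resource itself is a stand-in:

import streamlit as st

@st.cache_resource
def expensive_resource():
    print("building...")  # runs only when the cache is empty
    return {"ready": True}

resource = expensive_resource()  # built once, then reused across reruns
expensive_resource.clear()       # drop the cache, e.g. after new uploads
resource = expensive_resource()  # rebuilt on the next call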

Screenshots from Our App

The login screen

File uploading section

The chat interface

(I uploaded some legal documents; the files are available in the GitHub repo.)

Closing Words

Voila! We have reached the end of this blog post. You are now ready to build your very own enterprise chatbot and deploy it for your company's employees. Let me know if you have any questions in the comments below. I hope you enjoyed reading this blog as much as I enjoyed working on it.