Build a Customer Service Chatbot Using Flask, Llama 2, LangChain and Qdrant

Introduction

Customer service chatbots play a pivotal role in transforming the customer experience across diverse industries. These intelligent virtual assistants are employed in a myriad of use cases to streamline and enhance customer support. They excel in handling routine tasks, such as automated FAQs, providing instant responses to common queries and offering real-time updates on order status and tracking. Chatbots are instrumental in automating appointment scheduling, making reservations, and even facilitating product recommendations based on user preferences. Additionally, they prove invaluable in resolving technical issues through guided troubleshooting, collecting valuable customer feedback, and easing the onboarding process for new users. Industries ranging from e-commerce to healthcare leverage chatbots for tasks such as billing inquiries, language translation, and lead generation. The versatility of customer service chatbots extends to internal support, where they assist employees with HR-related queries. While providing swift and efficient responses, these chatbots significantly contribute to optimizing operational processes and fostering positive customer interactions.

In this article, we’ll build a Customer Service Chatbot that is powered by Flask, Qdrant, LangChain, and Llama 2. As an example we’ll supply the chatbot with a document containing Google’s Code of Conduct, but you can use your own company’s documentation as a context for the chatbot.

Components

Flask: Flask is an eminent web framework in the Python programming community, renowned for its simplicity and elegance. As a micro-framework, it is minimalistic yet powerful, offering developers a solid foundation for building a variety of web applications without imposing any specific tools or libraries. This flexibility allows Flask to be lightweight and straightforward to learn, making it an excellent choice for both beginners and experienced developers. It leverages the Jinja2 template engine for dynamic content rendering, ensuring a seamless integration of Python code with HTML. Flask’s built-in development server and debugger facilitate efficient development and troubleshooting. Moreover, its capacity for easy extension with numerous available plugins supports more complex requirements, like database integration, form validation, and authentication. Flask’s robustness and versatility, combined with a strong community and comprehensive documentation, have cemented its status as a go-to framework for web development in Python.
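
To get a feel for how little boilerplate Flask requires, here is a minimal, self-contained example (the /health route and its message are hypothetical and not part of the chatbot we build below):

from flask import Flask, jsonify

app = Flask(__name__)

@app.route('/health', methods=['GET'])
def health():
    # A simple JSON endpoint; Flask handles the routing and serialization
    return jsonify({"status": "ok"}), 200

if __name__ == '__main__':
    app.run(port=5000)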

Qdrant: Qdrant is an open-source vector search engine designed to facilitate efficient and scalable similarity search in high-dimensional vector spaces. It is particularly tailored for machine learning applications, such as image or natural language processing, where embedding vectors are commonly used to represent complex data. Qdrant stands out for its performance optimization and ease of use, providing a robust system for storing, managing, and querying large volumes of vector data. It supports various distance metrics, enabling precise and relevant search results for different types of data and use cases. Additionally, Qdrant offers features like filtering and full-text search, allowing for more complex and refined queries. Its architecture is designed to be horizontally scalable, making it well-suited for handling big data scenarios. Qdrant’s user-friendly API and compatibility with popular programming languages like Python further enhance its accessibility to developers and data scientists. As a result, Qdrant is becoming increasingly popular in the field of AI and data-driven applications, where efficient and accurate vector search is crucial.
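
As a quick illustration (a minimal sketch, separate from the final app; the sample texts and query are made up), here is how an in-memory Qdrant collection can be built and searched through LangChain:

from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2')

# Build an in-memory collection from a few example texts
db = Qdrant.from_texts(
    ["Refunds are processed within 5 business days.",
     "Support is available 24/7 via live chat."],
    embeddings,
    location=":memory:",
)

# Retrieve the stored text most similar to the query
print(db.similarity_search("How long do refunds take?", k=1))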

LangChain: LangChain is an innovative open-source library designed to augment the capabilities of large language models (LLMs) in creating applications that require complex, multi-step reasoning or knowledge retrieval. It primarily focuses on enhancing the ability of LLMs to interface with external knowledge sources, such as databases and search engines, thereby extending their problem-solving and informational capabilities beyond their intrinsic knowledge. LangChain’s architecture is modular, allowing for the integration of various components and tools to tailor the LLM’s performance to specific applications. This modularity also facilitates experimentation with different strategies for knowledge retrieval and reasoning, making it a versatile tool for developers and researchers in the field of AI and natural language processing. By leveraging LangChain, developers can create more sophisticated and intelligent applications that combine the nuanced understanding of human language inherent in LLMs with vast, dynamic external knowledge bases. This combination opens up new possibilities for AI applications in areas such as automated research, complex decision-making, and personalized content generation.
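
As a small example of this modularity, a basic LangChain chain wraps an LLM with a prompt template. The sketch below uses LangChain's FakeListLLM stand-in so it runs without a GPU; the prompt and canned response are made up for illustration:

from langchain.llms.fake import FakeListLLM
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# A stand-in LLM that returns canned answers, so the sketch runs anywhere
llm = FakeListLLM(responses=["Our support team is available 24/7."])

prompt = PromptTemplate(
    input_variables=["question"],
    template="Answer the customer's question concisely:\n{question}",
)

chain = LLMChain(llm=llm, prompt=prompt)
print(chain.run("What are your support hours?"))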

Llama 2: Llama 2, a powerhouse language model developed by Meta and released in partnership with Microsoft, stands as a giant in the world of AI. Trained on a vast ocean of internet data, it possesses the remarkable ability to converse, generate creative text formats, and answer your questions in an informative way. This open-source behemoth, freely available for research and commercial use, marks a significant leap forward in AI accessibility. With its immense potential and commitment to responsible development, Llama 2 paves the way for a future where AI empowers human creativity and understanding.

Using these four components, we'll implement a Retrieval-Augmented Generation (RAG) pipeline in our customer service chatbot.

Setting Up the Environment

Install the dependencies by creating a requirements.txt file with the following contents:

Flask
Flask-Session
sentence-transformers
langchain
transformers
scipy
trl
bitsandbytes
peft
accelerate
torch
datasets
qdrant-client
pypdf

Then install them with:

pip install -U -r requirements.txt

Create a file called app.py. This will contain the backend Flask code and all the logic for building the RAG pipeline. First, I'll paste the entire working code; then I'll describe each section one by one in case you want to understand what the code does.

from flask import Flask, request, session, jsonify
from werkzeug.utils import secure_filename
import os
import uuid
import torch
import transformers
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    pipeline
)
from langchain.text_splitter import CharacterTextSplitter
from langchain.document_transformers import Html2TextTransformer
from langchain.embeddings.huggingface import HuggingFaceEmbeddings
from langchain.vectorstores import Qdrant
from langchain.document_loaders import PyPDFLoader
from langchain.llms import HuggingFacePipeline
from langchain.chains import LLMChain, RetrievalQA

app = Flask(__name__)
app.config['UPLOAD_FOLDER'] = 'uploads/'
ALLOWED_EXTENSIONS = {'pdf'}

if not os.path.exists(app.config['UPLOAD_FOLDER']):
    os.makedirs(app.config['UPLOAD_FOLDER'])

#Loading the Llama-2 Model
model_name='meta-llama/Llama-2-7b-chat-hf'
model_config = transformers.AutoConfig.from_pretrained(
    model_name,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

#################################################################
# bitsandbytes parameters
#################################################################

# Activate 4-bit precision base model loading
use_4bit = True
# Compute dtype for 4-bit base models
bnb_4bit_compute_dtype = "float16"
# Quantization type (fp4 or nf4)
bnb_4bit_quant_type = "nf4"
# Activate nested quantization for 4-bit base models (double quantization)
use_nested_quant = False

#################################################################
# Set up quantization config
#################################################################
compute_dtype = getattr(torch, bnb_4bit_compute_dtype)

bnb_config = BitsAndBytesConfig(
    load_in_4bit=use_4bit,
    bnb_4bit_quant_type=bnb_4bit_quant_type,
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=use_nested_quant,
)
# Check GPU compatibility with bfloat16
if compute_dtype == torch.float16 and use_4bit:
    major, _ = torch.cuda.get_device_capability()
    if major >= 8:
        print("=" * 80)
        print("Your GPU supports bfloat16: accelerate training with bf16=True")
        print("=" * 80)

#################################################################
# Load the pre-trained model with the quantization config
#################################################################
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
)

# Building a LLM QNA chain
text_generation_pipeline = transformers.pipeline(
    model=model,
    tokenizer=tokenizer,
    task="text-generation",
    temperature=0.2,
    repetition_penalty=1.1,
    return_full_text=True,
    max_new_tokens=300,
)

llama_llm = HuggingFacePipeline(pipeline=text_generation_pipeline)
file_id = None
retrieval_chain = None

def allowed_file(filename):
    return '.' in filename and \
           filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

@app.route('/upload', methods=['POST'])
def upload_file():
    global file_id
    if 'file' not in request.files:
        return 'No file part', 400
    file = request.files['file']
    if file.filename == '':
        return 'No selected file', 400
    if file and allowed_file(file.filename):
        filename = secure_filename(file.filename)
        file_id = str(uuid.uuid4())
        filepath = os.path.join(app.config['UPLOAD_FOLDER'], file_id)
        file.save(filepath)

        # Process the PDF and build the retrieval chain
        process_pdf(filepath)

        return 'File uploaded & processed successfully. You can begin querying now', 200
    return 'Invalid file type. Only PDF files are allowed', 400

def process_pdf(filepath):
    global retrieval_chain
    # Load and split the document
    loader = PyPDFLoader(filepath)
    docs = loader.load_and_split()

    # Chunk the text
    text_splitter = CharacterTextSplitter(chunk_size=500, chunk_overlap=100)
    chunked_documents = text_splitter.split_documents(docs)

    # Load the chunked documents into an in-memory Qdrant index
    db = Qdrant.from_documents(
        chunked_documents,
        HuggingFaceEmbeddings(model_name='sentence-transformers/all-mpnet-base-v2'),
        location=":memory:",
    )
    retriever = db.as_retriever()
    retrieval_chain = RetrievalQA.from_llm(llm=llama_llm, retriever=retriever)

@app.route('/query', methods=['POST'])
def query():
    global retrieval_chain
    if retrieval_chain is None:
        return 'Please upload a PDF file first', 400
    data = request.json
    query = data.get('query')

    return jsonify(retrieval_chain.run(query)), 200

if __name__ == '__main__':
    app.run(debug=True)

1.

from flask import Flask, request, session, jsonify
……
from langchain.chains import LLMChain, RetrievalQA

First, here’s a list of all the import statements needed to initialize our project.

2.

app = Flask(__name__)
….
os.makedirs(app.config['UPLOAD_FOLDER'])

We initialize our Flask app and create the upload directory if it doesn't already exist.

3.

#Loading the Llama-2 Model
model_name='meta-llama/Llama-2-7b-chat-hf'
………
quantization_config=bnb_config,)

In this part of the code we load the Llama 2 model and its tokenizer, and configure 4-bit quantization via bitsandbytes so the model runs faster and fits in less GPU memory.
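
If you want to verify how much memory the 4-bit model actually occupies, transformers provides a footprint helper you can call after loading (an optional check, not part of the app itself):

# Optional: print the approximate memory used by the quantized model, in GB
print(f"Model memory footprint: {model.get_memory_footprint() / 1e9:.2f} GB")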

4.

# Building a LLM QNA chain
text_generation_pipeline = transformers.pipeline(
……retrieval_chain = None

Next, a text-generation pipeline is created and wrapped as a LangChain LLM, and some global variables are initialized, which we'll later use to hold state while the Flask server is running.

5.

def allowed_file(filename):
…………
filename.rsplit('.', 1)[1].lower() in ALLOWED_EXTENSIONS

A helper function that checks whether the uploaded file has an allowed extension (in our case, PDF).

6.

@app.route('/upload', methods=['POST'])
def upload_file():
…………

return 'File uploaded & processed successfully. You can begin querying now', 200

This is where we create the API endpoint for uploading the PDF file. The uploaded file is saved under a unique ID, and the process_pdf helper is called to prepare the document for the vector store.

7.

def process_pdf(filepath):
……… retrieval_chain = RetrievalQA.from_llm(llm=llama_llm, retriever=retriever)

We chunk the document into smaller parts, embed the chunks with Hugging Face embeddings and upsert them into our Qdrant vector store, and then create a RetrievalQA chain, which is assigned to a global variable so it can be reused across requests.
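
To sanity-check what the retriever actually pulls out of Qdrant before it reaches the LLM, you can query it directly (a debugging sketch using the retriever created in process_pdf; the question is just an example):

# Inspect the chunks the retriever would hand to the LLM for a given question
docs = retriever.get_relevant_documents("What is the policy on conflicts of interest?")
for doc in docs:
    print(doc.page_content[:200], "\n---")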

8.

@app.route('/query', methods=['POST'])
def query():
………
return jsonify(retrieval_chain.run(query)), 200

Finally, we create the API endpoint that accepts queries and returns the answers produced by the retrieval_chain.

Launching the app and interacting with our server (chatbot)

Save the app.py file and, in the terminal, launch it with the following command:

python app.py

You now have a Flask server (the chatbot) up and running on port 5000 of localhost.

To upload your PDF file, run the following command in a new terminal. For this article, we'll upload Google's Code of Conduct into our chatbot.

curl -X POST -F 'file=@/path/to/your/file.pdf' http://localhost:5000/upload

If the file is uploaded and processed successfully, you should see the message: 'File uploaded & processed successfully. You can begin querying now'.

To send queries, use the following command:

curl -X POST -H "Content-Type: application/json" -d '{"query":"your query text here"}' http://localhost:5000/query

Example queries

curl -X POST -H "Content-Type: application/json" -d '{"query":"Summarize the code of conduct document for me"}' http://localhost:5000/query

curl -X POST -H "Content-Type: application/json" -d '{"query":"What is Google'\''s stance on gender discrimination?"}' http://localhost:5000/query

curl -X POST -H "Content-Type: application/json" -d '{"query":"What is the policy on outside employment?"}' http://localhost:5000/query
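
If you prefer Python to curl, the same interaction can be scripted with the requests library (a minimal client sketch; the file path is a placeholder for your own document):

import requests

BASE_URL = "http://localhost:5000"

# Upload the PDF (replace the path with your own document)
with open("code_of_conduct.pdf", "rb") as f:
    resp = requests.post(f"{BASE_URL}/upload", files={"file": f})
print(resp.text)

# Ask a question about the uploaded document
resp = requests.post(f"{BASE_URL}/query", json={"query": "Summarize the code of conduct document for me"})
print(resp.json())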

Conclusion

In conclusion, the provided code sets up a Flask server to deploy a chatbot based on the RAG (Retrieval-Augmented Generation) pipeline. The server allows users to upload PDF documents, which are processed and split into chunks using LangChain. The text chunks are then embedded into a vector store using Hugging Face embeddings and indexed with Qdrant. The RAG pipeline, powered by the Llama 2 model, is utilized for text generation and retrieval. Users can interact with the chatbot by sending queries to the server, receiving responses based on the information stored in the vector store. The code demonstrates a comprehensive integration of various libraries and technologies to create a functional conversational AI system.