Interest in domain-specific retrieval-augmented generation (RAG) chatbots has surged in recent months. A major challenge in developing them is ensuring they answer questions about the domain accurately.
This problem can be approached in two ways: (1) add the necessary context to the prompt with every request, or (2) train the model on the domain knowledge beforehand (for example, by fine-tuning). Each method has its own limitations. When relying on prompts, we are constrained by the model's context window, the amount of information it can process at a time. Documents therefore have to be broken into smaller, manageable chunks, while keeping each chunk coherent so that the model does not misunderstand or misinterpret the information.
In this blog post, we’ll explore three strategies to tackle the challenge:
- Maximizing chunk size for larger context windows
- Finding optimal chunking and prompt engineering techniques when the context is limited
- Fine-tuning the model for deep domain knowledge
Building the RAG Chatbot
To create a proof of concept, we built a sample chatbot, a simple question-answer system, using a few domain-specific documents. The steps we took to optimize it are outlined below.
Chunking
The documents were split into smaller, manageable chunks. For this, we used the RecursiveCharacterTextSplitter from LangChain, a popular Python framework for building applications powered by language models. Splitting documents into chunks makes it easier for the chatbot to focus on specific information when answering a query.
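Here is a minimal sketch of that step. The chunk size and overlap are illustrative values rather than the ones we eventually settled on, and the import path can vary slightly between LangChain versions.

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the raw document text into overlapping chunks so each piece
# stays coherent on its own.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,    # target characters per chunk (illustrative)
    chunk_overlap=100,  # overlap so ideas aren't cut off mid-thought
)

with open("domain_document.txt", encoding="utf-8") as f:
    text = f.read()

chunks = splitter.split_text(text)
print(f"Produced {len(chunks)} chunks")
```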
Adding Context to Chunks
To make sure each chunk is clear and useful on its own, we added extra information, such as the document title and summary. This was to ensure that the LLM understands the context even if it’s working with just one chunk at a time.
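A simple way to do this is to prepend the title and summary to each chunk. The helper below is a hypothetical sketch with made-up document details, not our exact implementation.

```python
# Prepend document-level context so each chunk is self-explanatory.
def add_context(chunk: str, title: str, summary: str) -> str:
    return f"Document: {title}\nSummary: {summary}\n\n{chunk}"

chunks = ["...chunk text produced by the splitter..."]  # from the previous step
enriched_chunks = [
    add_context(c, title="Product Manual", summary="Setup and maintenance guide.")
    for c in chunks
]
```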
Using Metadata for Filtering
Additional metadata—such as page numbers, maximum pages, or category names—was added to the document chunks. This enabled more targeted filtering during searches based on the project requirements.
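In practice the metadata is just one dictionary per chunk; the field names below are examples rather than a fixed schema.

```python
# One metadata dictionary per chunk; these fields enable filtered searches later.
enriched_chunks = ["...enriched chunk text..."]  # from the previous step
metadatas = [
    {
        "source": "product_manual.pdf",
        "page": i + 1,
        "max_pages": len(enriched_chunks),
        "category": "maintenance",
    }
    for i in range(len(enriched_chunks))
]
```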
Vectorizing and Storing Chunks
Next, the chunks were converted into vectors, numerical representations that let the system compare content by semantic similarity. These vectors were then stored in ChromaDB, a popular vector database, so the chatbot could quickly retrieve relevant chunks when we posed a query.
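The sketch below assumes OpenAI embeddings and a persistent local ChromaDB store; the collection name, store path, and embedding model are placeholders, and any embedding function Chroma supports would work.

```python
import chromadb
from chromadb.utils import embedding_functions

# Embed the chunks with OpenAI and store them (with metadata) in ChromaDB.
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_OPENAI_API_KEY",
    model_name="text-embedding-3-small",
)
client = chromadb.PersistentClient(path="./chroma_store")
collection = client.get_or_create_collection(
    name="product_docs", embedding_function=openai_ef
)

enriched_chunks = ["...enriched chunk text..."]  # from the earlier steps
metadatas = [{"source": "product_manual.pdf", "page": 1, "category": "maintenance"}]

collection.add(
    ids=[f"chunk-{i}" for i in range(len(enriched_chunks))],
    documents=enriched_chunks,
    metadatas=metadatas,
)
```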
Retrieving Relevant Information
To enable similarity search, queries were converted to vectors. Additional filters were specified to refine this search to only look at chunks from a particular document collection. The search returned the top matches, along with their IDs and any metadata.
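Continuing from the collection created above, ChromaDB embeds the query text with the same embedding function, and a `where` clause applies the metadata filter. The query and filter values here are illustrative.

```python
# Retrieve the top matches for a query, restricted by metadata.
# `collection` is the ChromaDB collection created in the previous step.
results = collection.query(
    query_texts=["How do I reset the device?"],
    n_results=4,
    where={"category": "maintenance"},  # only search this document collection
)

for chunk_id, doc, meta in zip(
    results["ids"][0], results["documents"][0], results["metadatas"][0]
):
    print(chunk_id, meta.get("page"), doc[:80])
```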
JSON Mode in API
Models like GPT-4 Turbo and GPT-3.5 Turbo support JSON mode through OpenAI’s Chat Completions API. We set response_format: { "type": "json_object" } as an API parameter and instructed the model in the system message to respond in JSON, so the chatbot generates outputs that other applications can process.
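Here is a rough sketch of such a call. The JSON keys ("answer", "source_page") and the placeholder context are our own illustrative choices, not a required schema.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# response_format enables JSON mode; the prompt itself must also mention JSON.
completion = client.chat.completions.create(
    model="gpt-4-turbo",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": 'Answer from the provided context. Respond in JSON with '
                    'the keys "answer" and "source_page".'},
        {"role": "user",
         "content": "Context:\n...retrieved chunks...\n\n"
                    "Question: How do I reset the device?"},
    ],
)
print(completion.choices[0].message.content)
```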
Now, let’s dive into the three strategies we tested to tackle the hurdles in context retrieval.
Strategy 1: Maximum Chunk Size for Larger Context Window
To test this strategy out, we used GPT-4 Turbo, which was one of the advanced models with a large context window at the time of our project. This allowed us to include entire documents or substantial sections in each chunk and obtain excellent accuracy in our tests. However, leveraging such an extensive context window also meant higher costs. (If you are exploring cost-effective options, GPT-4o mini is a viable model as it can process extensive context. This lighter model can support complex or lengthy documents without incurring the high fees associated with the latest GPT-4o or the older GPT-4 Turbo.)
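A practical sanity check for this strategy is counting tokens before sending a whole document. The snippet below is a simple sketch using tiktoken with the cl100k_base encoding used by GPT-4-class models; the file name is a placeholder.

```python
import tiktoken

# Count tokens to confirm the full document (plus prompt) fits the 128k window.
enc = tiktoken.get_encoding("cl100k_base")
with open("domain_document.txt", encoding="utf-8") as f:
    text = f.read()

doc_tokens = len(enc.encode(text))
print(f"Document uses {doc_tokens:,} of the 128,000-token context window")
```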
Strategy 2: Optimal Chunking and Prompt Engineering
For this, we chose GPT-3.5 Turbo, a low-cost model with a smaller context window. After testing various chunk sizes through trial and error, we identified an optimal size that captures meaningful sections or topics, enabling accurate information retrieval even within a limited context window.
We found that the chunks retrieved through vector similarity search took up minimal context space. This left room for advanced techniques like few-shot prompting and chain-of-thought (CoT) prompting to enhance response quality. In few-shot prompting, examples (or "shots") of the desired output are included in the prompt. In CoT prompting, prompts are structured to guide the model through each step of a complex problem, helping it simulate human-like reasoning. This approach helped maximize value from a budget-friendly LLM without sacrificing accuracy.
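A rough sketch of how the two techniques can be combined in a single prompt is shown below; the example Q&A pairs and the placeholder context are invented for illustration.

```python
from openai import OpenAI

client = OpenAI()

# Few-shot examples that also demonstrate step-by-step (chain-of-thought) reasoning.
few_shot = """\
Q: What is the warranty period?
A: The context's warranty section says coverage lasts 24 months from purchase. Answer: 24 months.

Q: Which firmware supports remote reset?
A: The release notes list remote reset under version 3.2 and later. Answer: 3.2 or later.
"""

retrieved_chunks = "...chunks returned by the similarity search..."

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "Answer only from the provided context. Reason step by step, "
                    "then give the final answer.\n\nExamples:\n" + few_shot},
        {"role": "user",
         "content": f"Context:\n{retrieved_chunks}\n\nQ: How do I reset the device?"},
    ],
)
print(response.choices[0].message.content)
```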
Strategy 3: Fine-Tuned Model
In this approach, we fine-tuned the GPT-3.5 Turbo model on a specialized dataset. First, we gathered and formatted the data as specified by OpenAI. Once the data was ready, we fine-tuned the model, tested its performance iteratively, and made the necessary adjustments to improve accuracy.
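The workflow looks roughly like this: training examples are written as chat-format JSONL, uploaded, and a fine-tuning job is started. The file name and example content are placeholders, and the base model name may need to be a specific snapshot depending on what OpenAI currently supports for fine-tuning.

```python
import json
from openai import OpenAI

client = OpenAI()

# Training examples in the chat format OpenAI expects for fine-tuning.
examples = [
    {"messages": [
        {"role": "system", "content": "You are a product support assistant."},
        {"role": "user", "content": "How do I reset the device?"},
        {"role": "assistant",
         "content": "Hold the reset button for 10 seconds until the LED blinks twice."},
    ]},
    # ...more domain-specific examples
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Upload the data and start the fine-tuning job.
uploaded = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=uploaded.id, model="gpt-3.5-turbo")
print(job.id, job.status)
```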
Unlike general few-shot prompting, where we provide examples within the prompt, fine-tuning reduced the need for extensive prompts. However, this approach came with extra costs, including $8.00 per million tokens for training. (Note: prices may vary depending on the model’s current rates.) Yet, it allowed us to maximize model efficiency and tailor it for domain-specific tasks.
Recommendations
- If budget is not a concern and you're working with large documents, you can opt for a model with a large context window and set a maximum chunk size. This involves including the entire document in as few chunks as possible. For instance, GPT-4 Turbo and GPT-4o, with their 128,000-token context windows, are suitable for handling large documents efficiently.
- For a balanced approach, experiment with determining the optimal chunk size. Choose a model that offers a good balance of cost and performance based on your context window needs. Utilize advanced techniques such as few-shot prompting and chain-of-thought to enhance the model's ability to handle complex tasks with minimal examples.
- If the other two strategies fall short, you could consider fine-tuning a pre-trained model with a custom dataset. Keep in mind that fine-tuning adds extra costs beyond the usual model usage.
Conclusion
Optimizing chunk size and managing context are essential for harnessing the potential of GPT models in RAG chatbots. The strategies we explored in this blog post (adjusting chunk size, applying prompt engineering techniques, and fine-tuning the model) each play a critical role in balancing performance, cost, and response quality. By carefully weighing the trade-offs between these limitations and the available resources, we can optimize the context that GPT models use to deliver precise answers.