At GroundX, we often say GPT and other large language models (LLMs) are Harvard professors of the open Internet and first graders of your private data. They’ve never seen your data before and have a lot of trouble understanding it without significant intervention.
It’s this lack of understanding that creates hallucinations with your data.
LLMs hallucinate when they run out of facts and don’t realize it. The problem is greatly exacerbated when building retrieval augmented generation (RAG) applications, where the LLM is used for reasoning and linguistic functions but the core answers to a user's questions come from your own private data stores.
There are several mechanical reasons why this happens and concrete steps you can take to fix them, which I’ll explain below.
One path is to use our GroundX APIs, the only full-stack solution for grounded generation available today: a platform that addresses each of these issues and grounds every LLM response in the facts of your private data.
We built this stack out of our own need for accuracy.
We’ve been building language model AIs for more than 15 years. We pioneered many of IBM Watson’s consumer applications, built the Weather Channel’s forecasting AI that served 2M people a day through Alexa, Siri and Facebook Messenger and have been working alongside OpenAI since 2021.
We were developer #20 in OpenAI’s private beta program and have spent the last several years of blood and tears pushing LLMs to provide truthful answers with private data. The EyeLevel platform and our GroundX APIs are the result. We built the tools we needed and realized others could benefit from them too.
Whether you decide to use our tools or build your own, it’s helpful to understand why LLMs fail and what can be done to fix it.
1. Messy documents
The first issue is a classic GIGO problem (Garbage In/Garbage Out). LLMs need to be fed simple text. Formatting elements like tables, headers, tables of contents, columns and many other structures confuse them. Information inside graphics won’t be understood either. Current OCR algorithms mostly don’t solve the problem.
Legal transcripts are a nightmare for LLMs. Corporate filings, heavy with financial tables, are pure noise. Try feeding decision tree documents, common in contact centers, to an LLM. It won't be pretty. Even documents that seem simple to the human eye, for example a magazine article with multiple columns, tend to choke both OCR processes and LLMs.
Some of this might be improved by multimodal capabilities that OpenAI and others are promising. That’s the ability to ingest and interpret images, videos, audio and so on. But those systems aren’t in-market at scale yet and the extent of their future capabilities remains unclear.
To handle this issue, we built a parsing engine that transforms documents into simpler formats for LLMs. We have a library of parsers based on our past work and a scripting language called GroundScript that makes it fairly straightforward to build new ones.
When clients show us new document types or we start to work in a new industry like legal or air travel, we build new parsers and our library improves for others who want to use our APIs.
GroundScript is also an open platform. Anyone can write a parser for their own projects or to help others.
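To make the parsing idea concrete, here is a minimal sketch of the kind of transformation a parser performs, in this case flattening a simple table into one declarative sentence per row so the LLM sees plain text instead of layout. The function name and output format are illustrative assumptions, not GroundX's actual GroundScript API:

```python
def flatten_table(headers, rows):
    """Turn tabular data into one plain-text sentence per row."""
    sentences = []
    for row in rows:
        # Pair each cell with its column header so no context is lost
        # when the table structure is thrown away.
        pairs = [f"{h} is {v}" for h, v in zip(headers, row)]
        sentences.append("; ".join(pairs) + ".")
    return "\n".join(sentences)

print(flatten_table(
    ["Company", "Quarter", "Revenue"],
    [["Acme", "Q1 2023", "$1.2M"],
     ["Acme", "Q2 2023", "$1.5M"]],
))
```

Each row becomes a self-contained sentence ("Company is Acme; Quarter is Q1 2023; Revenue is $1.2M."), which survives chunking far better than raw table layout does.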
2. Context-free chunks

LLMs can only “think” in small blocks of text. So what we (and our competitors) do when ingesting documents or databases is chunk the content into small, paragraph-sized blocks.
However, when you do this you can quickly lose context of what that chunk is about.
Imagine I pull a random paragraph from a book and ask you to tell me what it’s about. You likely won’t know who is talking, what they are talking about or where they are. You won’t know if this is fact or fiction. You won’t even know the content is from a book. Perhaps it’s a medical file, a legal document or a children’s school book.
LLMs are faced with the same dilemma and this context problem is one of the reasons a pure vector approach actually introduces hallucinations into your applications.
To solve this, when we ingest content we run it through a proprietary AI pipeline that classifies, labels and clusters the data as it’s chunking.
That means we extract the relevant context of what the chunks are about and wrap each chunk in metadata that describes it. Each chunk gets wrapped with extra info like what doc is this, who is talking, what are the important issues it’s referencing and so on.
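As a rough sketch of this idea (the field names here are assumptions for illustration; GroundX's actual pipeline is proprietary), each chunk travels with a metadata wrapper describing its source and context:

```python
def make_chunks(text, size=400):
    """Naive fixed-size chunker; a real pipeline splits on semantic boundaries."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def wrap_chunk(chunk, doc_title, doc_type, speakers, topics):
    """Attach the context a bare chunk loses: source, type, speakers, topics."""
    return {
        "text": chunk,
        "meta": {
            "document": doc_title,
            "type": doc_type,      # e.g. "deposition transcript"
            "speakers": speakers,  # who is talking in this chunk
            "topics": topics,      # key issues the chunk references
        },
    }
```

With the wrapper attached, a chunk pulled up months later still answers the "who is talking, about what, from which document" questions on its own.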
Then we store each chunk in our database with its metadata wrapper and corresponding vector embeddings. It's difficult for competitors to do this step because they are all using a type of database called a vector store.
Vector databases are very good for clustering, which is the process of numerically describing how close ideas are to one another and then storing that nearness as a multidimensional number.
For example, the word apple is near the words orange and fruit in one dimension and near computer in another. It is somewhat near steak (another kind of food) and very far from car, rainbow and machine gun.
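That nearness can be shown with cosine similarity over toy three-dimensional vectors. Real embeddings have hundreds or thousands of dimensions; the axes and numbers below are invented purely for illustration:

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction, near 0.0 unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Invented axes: fruit-ness, food-ness, tech-ness.
apple = [0.9, 0.8, 0.1]
orange = [0.95, 0.85, 0.0]
laptop = [0.0, 0.05, 0.9]
```

Here `cosine(apple, orange)` comes out far higher than `cosine(apple, laptop)`, which is exactly the kind of nearness a vector store indexes on.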
This is one of the important methods LLMs like GPT use to understand words and phrases. So it's natural that most developers would use vector databases to store new corporate data. It’s the first thing we tried three years ago. In fact it's what OpenAI and Microsoft tell you to do.
But what they don’t tell you is vector stores can’t hold the metadata that the text chunks need for context later on. Vectors are important, but insufficient to solve the problem.
We use a traditional SQL database structure instead which allows GroundX to store the text chunks, metadata and vectors together. This leads to a much more performant semantic search of your content when a user asks a question.
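A minimal sketch of that layout, using SQLite with JSON-encoded columns. The schema is an assumption for illustration, not GroundX's actual design:

```python
import json
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE chunks (
        id INTEGER PRIMARY KEY,
        text TEXT NOT NULL,       -- the chunk itself
        meta TEXT NOT NULL,       -- JSON metadata wrapper
        embedding TEXT NOT NULL   -- JSON-encoded vector
    )
""")
db.execute(
    "INSERT INTO chunks (text, meta, embedding) VALUES (?, ?, ?)",
    (
        "Revenue rose 12% in Q2.",
        json.dumps({"document": "Acme 10-Q", "topics": ["revenue"]}),
        json.dumps([0.12, 0.87, 0.05]),
    ),
)
# One row now carries text, context and vector together.
row = db.execute("SELECT text, meta, embedding FROM chunks").fetchone()
```

The design choice is that a single retrieval returns everything the LLM needs, rather than a bare vector hit that has to be re-joined against context stored elsewhere.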
3. Incomplete Search and Retrieval
Highly performant search and retrieval is at the core of any retrieval augmented generation (RAG) application. Whether you use LangChain, LlamaIndex or some other orchestration layer to build your apps, the concept is the same.
When a user asks a question, your application searches a database of private data for text chunks that likely contain the answer. You then pass the question and those answer blocks to an LLM and prompt it to generate its answer using only the text in the blocks you sent.
The key to doing this well is nailing the search and producing answer blocks that the LLM can use. Pure vector approaches to this are typically incomplete and can return context free text blocks that confuse the LLM, causing hallucination.
With GroundX, search and retrieval is fully aligned with the ingest process, meaning the same contextual entities we’ve generated upon ingest (remember the people, places, things, ideas, and so on that we turn into metadata) are returned with search results.
You get the text blocks, the contextual metadata and the vectors. You also get a proprietary score for each block, which ranks them by relevance to the question.
You can then decide whether to send all or some to the LLM for completion.
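That decision might look like a simple threshold filter. The score field name and values below are assumptions for illustration, not GroundX's actual API:

```python
def select_blocks(results, threshold=0.75, max_blocks=5):
    """Keep only blocks whose relevance score clears the threshold,
    best-scoring first, capped at max_blocks."""
    keep = [r for r in results if r["score"] >= threshold]
    keep.sort(key=lambda r: r["score"], reverse=True)
    return keep[:max_blocks]
```

Tightening the threshold trades recall for precision: fewer, more relevant blocks generally mean less material for the LLM to misread.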
4. Human Refinement
Even after these steps, LLMs sometimes don’t say what you think they should. They aren’t perfect out of the box. To improve certain responses, it's valuable to have a human in the loop, especially as you’re ingesting new data.
OpenAI built a massive capacity for this. Human refinement, or Reinforcement Learning from Human Feedback (RLHF) as it’s formally called, is a large part of why GPT-4 is so much better than GPT-3.
That’s great for OpenAI, but not helpful to companies ingesting new data. They need their own human refinement tools.
To solve this problem, we built a human refinement tool that lets you fire hundreds of questions at the application, audit every question/answer pair and in a single click edit the content the LLM is using to respond if it’s not quite right.
Unfortunately, this feature is only available today in our no-code tools offered at www.eyelevel.ai, but it will be coming soon to the GroundX API stack.
5. Hallucination Blocker
If you do all of the steps above, your responses are going to improve dramatically. But mistakes can still sneak through and a human can’t be there 100% of the time to refine every response.
We believe what’s needed is a final check on responses that come from an LLM. We recommend a system that scores every LLM response for fidelity to the private data you have fed it. If the response doesn’t meet a threshold for accuracy, block it.
To be clear, what you’re scoring here isn’t the “Truth” with a capital T. Your application isn’t a philosophy professor. But if the LLM returns a completion with people, places, things and ideas that are not contained in your original answer blocks, you should be able to detect that.
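A toy version of that detection, using capitalized words as a crude stand-in for named entities. A real scorer would use NER and semantic matching; this only illustrates the principle:

```python
import re

def unsupported_entities(completion, answer_blocks):
    """Return capitalized 'entities' in the completion that never appear
    in the source blocks -- candidates for hallucination."""
    source_text = " ".join(answer_blocks)
    source = set(re.findall(r"\b[A-Z][a-z]+\b", source_text))
    claimed = set(re.findall(r"\b[A-Z][a-z]+\b", completion))
    return claimed - source

blocks = ["Acme reported revenue of $1.5M in Q2, led by CEO Jane Doe."]
```

Against these blocks, a completion like "Acme merged with Globex in Q2." surfaces "Globex" as unsupported and could be blocked, while "Acme revenue was $1.5M in Q2." surfaces nothing.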
We built a proprietary scoring algorithm that, like the human refinement tool, currently lives in our no-code suite at EyeLevel.ai, but we’re considering releasing an API version for GroundX as well.
What we’ve found is that controlling hallucinations and pushing LLMs toward more accurate responses requires stacking several techniques on top of one another, as shown in the chart below.
The most common stack for building RAG applications with LLMs today (LLM/Vector DB/Orchestrator) handles some of these functions but not all and requires significant development time to perfect.
GroundX APIs provide a one-stop stack for grounded generation, saving developers thousands of hours of blood, sweat and LLM tears.