
Is your data more or less important in the age of LLMs?


Bing Image Creator "A llama carrying a stack of books in a backpack towards the harvard library, digital art"

I remember the first time I heard the term "Big Data". It was 2014: I was a senior in college working on my degree in electrical engineering, and a company had come to campus for recruiting. They told our class how important it was to know how to manipulate datasets that were terabytes or even petabytes in size. At the time I had basically zero clue what they were talking about, but today working with data at that scale is a routine part of AI and ML workloads.

One of the many discussions happening around Large Language Models (LLMs) is whether they eliminate a company's "data moat". Kenn So's blog post offers a good definition: "A data moat is a strategic advantage a company gains by accumulating unique data that competitors cannot easily replicate, allowing it to deliver better products to customers."

How ChatGPT & LLMs changed the perception of data moats

Before ChatGPT was released, it was generally understood that the company with the largest quantity of high-quality data would be able to train the best model. For example, Gmail would have the best email auto-complete system because it has the most email data, Amazon would have the best product recommendation system because it has the most information about what people buy, and Spotify would recommend the best songs because it has the most data about what we listen to.

After ChatGPT was released (and with the wave of LLMs that followed it, like GPT-4, Claude, Mixtral, Llama, and Gemini), it became apparent that these LLMs could already perform a variety of simple tasks and had knowledge that made them useful without the need to explicitly provide the information (often referred to as emergent behavior). For instance, StackOverflow.com used to be the go-to place when anyone needed help solving a computer-science-related problem. Few could compete with the amount of contributed information stored on that site, giving it a distinct advantage. One impact of ChatGPT and other coding assistants (like GitHub Copilot) was that Stack Overflow traffic plummeted. The recent collaboration between OpenAI and Stack Overflow further points to the conclusion that Stack Overflow realized that, in order to survive, it would need to collaborate more closely with leaders in the AI space.

What now?

For companies that have a large store of proprietary data, the emergent capabilities of LLMs can initially be quite intimidating. However, I think that in many ways, after the dust settles, we may find that data moats remain intact and private proprietary data is still a massive advantage. This comes down to two main reasons.

Evaluation

ChatGPT/Gemini/Llama may seem intelligent, but can they really perform competitively against models trained on the large amount of specific data and domain knowledge a company may have? The first thing that makes this a tricky question to answer is that (as I discuss at length in my post about automation bias) in many ways an LLM is optimized to make you believe it, regardless of whether what it's saying is actually true. In other words, to accurately evaluate an LLM, it's not sufficient to "eyeball" it or manually look at a few examples and decide whether it's good enough. For one thing, LLMs generally output text using a sampling algorithm, which means that if you ask the same question twice you may get two different answers.
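To make that last point concrete, here is a minimal sketch (using the Hugging Face transformers library, with a small open model and generation settings as stand-ins) of how sampling-based decoding can return different completions for the same prompt:

```python
# Minimal sketch: the same prompt, sampled twice, can produce different text.
# The model ("gpt2") and generation settings are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The biggest advantage of proprietary data is", return_tensors="pt")

for run in range(2):
    # do_sample=True draws tokens from the model's distribution instead of
    # always taking the most likely token, so each run can differ.
    output = model.generate(**inputs, do_sample=True, temperature=0.9, max_new_tokens=30)
    print(f"run {run + 1}:", tokenizer.decode(output[0], skip_special_tokens=True))
```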

If a human evaluation of an LLM isn't sufficient, the next approach is some sort of automatic evaluation. There are lots of public datasets commonly used for evaluation purposes (MMLU is the first one that comes to mind). However, because most companies training LLMs don't release the datasets used during training, there's no conclusive way to prove that the datasets used for evaluation weren't intentionally or accidentally included during training. For example, even though an LLM may pass a medical exam (USMLE), how do we know that it actually learned the material and didn't just memorize the content? If it memorized the content, the benchmark may only be reporting an LLM's ability to memorize training data, and not necessarily how it would perform if required to answer a medical question that wasn't in its training data.

This should make the value of a proprietary dataset obvious: if the data is internal and never posted on the public internet, then at least the first time you run the data through an LLM you can be confident that whatever results you get aren't the result of memorization. If you're testing a proprietary model like GPT-4/Gemini/Claude, there's then the question of whether you can actually test that proprietary data against the model a second time: it involves trusting that OpenAI/Anthropic/Google didn't save that data when you ran the test the first time, and that they didn't use that data to train the model a bit more so that it was better when you tested it the second time. I'll leave it to the lawyers to determine how real of a scenario that might be, but it's certainly technically possible.
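As a rough illustration of what "running your data through an LLM" looks like in practice, here is a minimal sketch of an evaluation loop over a private, held-out multiple-choice set. The file name, record format, and the ask_llm helper are hypothetical placeholders; wire in whichever model client you actually use:

```python
# Minimal sketch: score an LLM on a private evaluation set it has never seen.
# "private_eval_set.jsonl" and ask_llm() are hypothetical stand-ins.
import json

def ask_llm(question: str, choices: list[str]) -> str:
    """Return the model's chosen answer letter (e.g. 'A'). Plug in your own client here."""
    raise NotImplementedError

with open("private_eval_set.jsonl") as f:  # internal data, never posted publicly
    items = [json.loads(line) for line in f]

correct = 0
for item in items:
    prediction = ask_llm(item["question"], item["choices"])
    correct += int(prediction == item["answer"])

print(f"accuracy: {correct / len(items):.1%}")
```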

Tl;dr: proprietary data can help you evaluate whether any given LLM is actually as good at a task as it may appear, and it helps you rank which LLM is best for your use case.

Fine-Tuning

After using proprietary data to evaluate an existing LLM, that data can then serve as a traditional source of training data to further train the LLM. You may find that an LLM like GPT-4 performs better than having no data at all, but worse than a custom model trained on the large amount of proprietary data you have available. In one of my first blog posts, I trained an LLM to autocomplete emails. Even though ChatGPT may have been able to help me draft a response to my 11th grade math teacher, without some sort of additional training or custom prompting it would have no way to know that I like to sign my emails as "Thanks, \nNate" instead of "Have a nice day, \nNathan". I know that's a toy example, but it illustrates the point that even though an LLM is usually "better than nothing", using extra data that covers all the nuances of your use case is still instrumental in building an excellent product.
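For a sense of what that further training step can look like, here is a minimal fine-tuning sketch using the Hugging Face transformers Trainer on a file of proprietary emails. This is not the setup from my earlier post; the base model, file name, and hyperparameters are illustrative assumptions:

```python
# Minimal sketch: fine-tune a small causal LM on proprietary email text.
# "emails.txt" (one email per line) and all hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "gpt2"  # stand-in for whatever base model you choose
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "emails.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="email-lm", num_train_epochs=3,
                           per_device_train_batch_size=4),
    train_dataset=tokenized,
    # mlm=False means a standard next-token (causal) language modeling objective
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```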

Why does this matter?

The main takeaway is that although an LLM may give the appearance of high performance, it's important to validate it with data that you're positive was never part of the LLM's training data, and also to evaluate at what point an LLM's performance plateaus without the benefit of additional training on your own data.