Large context vs RAG
My two cents on the old debate about which approach developers should focus on
The recent launch of Llama 4, with its huge context window of 10 million tokens, rekindled the old discussion of RAG vs long context, and whether RAG is still needed.
It has been one of the hottest topics in AI recently, especially since models started going above 16k tokens of context.
“Just put everything in context” is the most common thing I hear from the long-context camp. On the other side, we have people invested in 20+ years of improvements in RecSys who are perhaps skeptical of brute-force extraction of meaning from long contexts by LLMs.
If we look at the history of ML, we’ve seen again and again that the answer has been “make a bigger model”. And given the great improvements of the past few years, it’s really hard to argue against models being able to interpret their context and extract meaning.
But in this article I’ll try to go over some of the limitations of the long-context approach, how to reason about them, and why we might not be there yet:
Bad context retrieval
This is by far the biggest downside of “fill the context“.
It has been shown again and again that models do not handle long context well. Maybe it is a limitation of the attention mechanism, but it is clear that the more detail you put in the context, the less likely it is that the model will recall any particular detail.
What is a good amount of info to put in context?
That depends a lot on the model. Llama 3 starts dropping in accuracy after 8k tokens, while OpenAI’s GPT-4o starts dropping after 32k.
A safe range for context is roughly 0 to 8k tokens.
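As a rough illustration, here is a minimal sketch of using such a budget to decide between stuffing a whole document into the prompt and falling back to retrieval. The 8k figure, the cl100k_base tokenizer, and the `retriever` callable are assumptions for the example, not something any particular vendor prescribes.

```python
# Minimal sketch: pick between "stuff it all in" and retrieval based on a token
# budget. The 8k budget, the cl100k_base encoding and `retriever` are assumptions.
import tiktoken

CONTEXT_BUDGET = 8_000  # tokens we assume the model handles reliably

def fits_in_budget(text: str, budget: int = CONTEXT_BUDGET) -> bool:
    encoder = tiktoken.get_encoding("cl100k_base")
    return len(encoder.encode(text)) <= budget

def build_prompt(question: str, document: str, retriever) -> str:
    if fits_in_budget(document):
        # Small enough: put the whole document in context.
        context = document
    else:
        # Too large: retrieve only the chunks relevant to the question.
        context = "\n\n".join(retriever(question, k=5))
    return f"Context:\n{context}\n\nQuestion: {question}"
```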
What to put in context
But even if the model recalled its context perfectly, what do you put in that context?
Some proponents say “Well, just put the whole book in there“.
Sure, but what’s the use case? Chat with a single book?
“No, but the user can choose what book to put in context“.
Ok, a bit more useful, but how does the user do that? Manually selecting a book from a list?
“No, they can just mention which book they want to talk to and then ask the question”.
Sounds good, but that is query routing, and that is the beginning of a beautiful RAG system.
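For illustration, here is a minimal sketch of that routing step, assuming a hypothetical `llm` callable that returns the model’s text completion and a `books` dictionary mapping titles to full text; none of this is a specific vendor API.

```python
# Minimal sketch of query routing: ask a model which book the question is about,
# then load only that book into context. `llm` and `books` are hypothetical.
from typing import Callable, Dict

def route_and_answer(question: str, books: Dict[str, str],
                     llm: Callable[[str], str]) -> str:
    titles = "\n".join(f"- {title}" for title in books)
    routing_prompt = (
        "Which of these books is the user asking about? "
        f"Reply with the exact title only.\n{titles}\n\nQuestion: {question}"
    )
    chosen = llm(routing_prompt).strip()
    book_text = books.get(chosen, "")  # fall back to empty context if routing fails
    answer_prompt = f"Book: {chosen}\n\n{book_text}\n\nQuestion: {question}"
    return llm(answer_prompt)
```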
Putting everything in context relies on the assumption that we know what “everything“ means.
Let’s take the “chat with a code base“ use case. With a strong enough model we can put the whole code base in context, and ask questions. Sounds very useful.
But what about its dependencies? Are we going to assume the model perfectly knows the documentation for the specific version of a library that we’re using in the project?
If we want to make sure the model has the latest docs, we can save and load those as well. Putting aside that this is the beginning of a RAG system, we then have questions about more generic knowledge. What if the model knows Python 3.7 but not 3.12? We could give it a boost by adding some extra docs to the context, but that, again, is a RAG system.
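As a sketch of where that leads, the docs end up in an index keyed by library and version, and the lookup becomes a retrieval step; every name here is hypothetical.

```python
# Hypothetical sketch: keep documentation chunks indexed by (library, version)
# so the model only sees docs matching the project's pinned dependencies.
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

docs_index: Dict[Tuple[str, str], List[str]] = defaultdict(list)

def add_docs(library: str, version: str, chunks: List[str]) -> None:
    docs_index[(library, version)].extend(chunks)

def docs_for_project(pinned: Dict[str, str], query: str,
                     search: Callable[[List[str], str], List[str]]) -> List[str]:
    """Collect doc chunks only for the exact versions pinned in the project."""
    results: List[str] = []
    for library, version in pinned.items():
        results.extend(search(docs_index[(library, version)], query))
    return results
```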
The idea of just putting everything in context is alluring, but way too simplistic to actually work in practice. A useful system has multiple data sources that can’t just be fully retrieved and stuffed into a context for the LLM to choose from.
But even if models had perfect recall and almost infinite context, it would still not make sense, because of another limitation: economics.
Cost + Speed
If we put a whole book in context, then every time a user asks a question the model needs to “read” the whole book again. That is very wasteful from both a time and a cost point of view.
As developers, we pay per input token, so stuffing a lot of text into the context can get really expensive, really fast. For example, to use the full 1-million-token context of Gemini 2.5 Pro you would pay around $2.50 in input tokens. On every user question.
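To make the math concrete, here is a back-of-the-envelope estimate; the $2.50-per-million-input-tokens price is an assumption based on published long-context pricing at the time of writing and will change.

```python
# Back-of-the-envelope cost of re-sending the full context on every question.
# The price per million input tokens is an assumption and will change.
PRICE_PER_M_INPUT_TOKENS = 2.50   # USD, assumed long-context input price
CONTEXT_TOKENS = 1_000_000        # "put everything in" on every request
QUESTIONS_PER_DAY = 1_000

cost_per_question = CONTEXT_TOKENS / 1_000_000 * PRICE_PER_M_INPUT_TOKENS
print(f"Per question: ${cost_per_question:.2f}")                  # $2.50
print(f"Per day: ${cost_per_question * QUESTIONS_PER_DAY:,.2f}")  # $2,500.00
```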
Now fortunately there is a solution for this: context caching.
This is where model providers offer the option to cache parts of the prompt so that, on subsequent requests, they don’t have to be recomputed. And this can come with huge savings.
The problem is that not all providers offer this option; it is very model-dependent and sometimes even region-dependent. Context caching also tends to have a very short lifespan, usually around 5 minutes.
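For concreteness, here is roughly what opting into caching looks like with Anthropic’s prompt-caching API as documented at the time of writing; other providers expose it differently (or apply it automatically), so treat this as an illustration rather than a recipe.

```python
# Sketch of provider-side prompt caching, based on Anthropic's prompt-caching
# API as documented at the time of writing; details differ across providers.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

book_text = open("book.txt").read()  # the large, stable part of the prompt
question = "How does the protagonist change in the final chapter?"

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": book_text,
            # Mark the big, stable prefix as cacheable; subsequent requests that
            # share this exact prefix can reuse it for a limited time window.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": question}],
)
print(response.content[0].text)
```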
What if I host my own model?
In that case, yes, developers have a lot more control over the model’s behavior and how context caching works. Putting aside that hosting your own models comes with its own huge challenges, caching is still a memory-intensive operation: the cache needs to be stored (most likely in GPU memory or RAM) for every session/user, and that adds up quickly.
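To get a feel for the numbers, here is a rough estimate of the KV-cache memory for a single session, assuming a Llama-3-8B-like architecture (32 layers, 8 KV heads via grouped-query attention, head dimension 128) served in fp16; real serving stacks and quantization schemes will shift these figures.

```python
# Rough KV-cache memory estimate per session. The model dimensions assume a
# Llama-3-8B-like architecture served in fp16; real deployments vary.
N_LAYERS = 32
N_KV_HEADS = 8
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # fp16

def kv_cache_bytes(context_tokens: int) -> int:
    # 2x for keys and values, per layer, per KV head, per head dimension.
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE * context_tokens

print(f"{kv_cache_bytes(1) / 1024:.0f} KiB per token")                        # ~128 KiB
print(f"{kv_cache_bytes(100_000) / 1e9:.1f} GB for one 100k-token session")   # ~13.1 GB
print(f"{10 * kv_cache_bytes(100_000) / 1e9:.0f} GB for 10 concurrent ones")  # ~131 GB
```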
So is RAG the answer?
No. I don’t think it’s helpful to deal in absolutes, and going either 100% on long context or 100% on RAG isn’t ideal. I think both technologies have their advantages, but RAG seems to offer more capabilities and flexibility for the time being.
I think a good way to plan for this is to start from the best practices of IR and delegate them to the model one by one until we see a decrease in quality. With that system in place, as models get better, we can move more and more responsibility onto the model.
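As a sketch of what that could look like in practice (every name here is hypothetical), each classic IR stage sits behind a flag, and you hand stages over to the model one at a time while watching your evaluation metrics:

```python
# Hypothetical sketch: a retrieval pipeline where each classic IR stage can be
# delegated to the model independently. Flip one flag at a time and compare
# answer quality on an eval set before flipping the next.
from dataclasses import dataclass

@dataclass
class PipelineConfig:
    llm_query_rewrite: bool = False  # model rewrites/expands the query
    llm_routing: bool = False        # model picks the data source
    llm_rerank: bool = False         # model reranks retrieved chunks
    stuff_full_docs: bool = False    # skip chunking, pass whole documents

def answer(question: str, cfg: PipelineConfig, ir, llm) -> str:
    query = llm.rewrite(question) if cfg.llm_query_rewrite else question
    source = llm.route(query) if cfg.llm_routing else ir.default_source()
    docs = ir.fetch_all(source) if cfg.stuff_full_docs else ir.search(source, query)
    if cfg.llm_rerank:
        docs = llm.rerank(query, docs)
    return llm.generate(query, docs)
```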
I understand the complexity of building something that is more than a RAG demo. The simplicity of just stuffing the context is alluring, but in terms of model capability we are not there yet. As much as we would like models to just handle everything, they simply can’t, at least not yet.