Advanced RAG techniques


Advanced improvements to RAG

Note

RAG intertwines with the general topic of model evaluation, and is adjacent to such topics as synthetic data and Challenges with RAG

  • You will most likely have to chunk your context data into smaller pieces. The chunking strategy can have a huge impact on RAG performance.
    • small chunks β‡’ limited context β‡’ incomplete answers
    • large chunks β‡’ noise in data β‡’ poor recall
    • Chunk by characters, by sentences, or by semantic meaning, using a dedicated model or an LLM call
    • semantic chunking: detect where a change of topic happens (e.g. a drop in embedding similarity between adjacent sentences)
    • Consider inference latency and the maximum sequence length the embedding model was trained on
    • Decide whether chunks should overlap or not
    • Use small chunks at the embedding stage and a larger size at inference time, e.g. by appending adjacent chunks before feeding the retrieved text to the LLM (see the small-to-big sketch after this list)
  • fine-tuning to make models output citations
    • Start with small batches, measure performance, and increase data volume until you reach your desired accuracy level.
    • shuffle the order of retrieved sources to prevent position bias
      • unless sources are sorted by relevance (the model assumes that the 1st chunk is the most relevant)
      • newer models with large context windows are less prone to the Lost in the Middle effect and have improved recall across the whole context window
  • re-ranking of retrieved candidates (see the cross-encoder sketch after this list)
  • query expansion and enhancement
    • Another LLM-call module can be added to rewrite and expand the initial user query: adding synonyms, rephrasing, complementing it with an initial LLM output generated without RAG context, etc. (see the query-expansion sketch after this list)
  • In addition to dense embedding models, there are also historically older sparse representation methods. These can and should be used alongside vector search, resulting in hybrid search ^f44082
    • encoding is supervised (e.g. SPLADE) or unsupervised (e.g. BM25, TF-IDF)
    • search is accelerated with top-k retrieval algorithms like WAND, MaxScore, Block-Max WAND, and more
  • Using hybrid search (at least full-text + vector search) is standard for RAG, but it requires combining several scores into one, e.g. via Reciprocal Rank Fusion (see the sketch after this list) ^6fd281
  • metadata filtering reduces the search space and hence improves retrieval quality and reduces computational burden
    • dates, freshness, source authority (e.g. for health datasets), business-relevant tags
    • categories: use named entity recognition models such as GLiNER (see the GLiNER sketch after this list)
    • if there is no metadata, one can ask an LLM to generate it
  • Shuffling context chunks creates randomness in outputs, which is comparable to increasing the diversity of the downstream output (an alternative to tuning the softmax temperature) - e.g. previously purchased items can be provided in random order to make a recommendation engine's output more creative
  • One can generate summaries of documents (or questions for each chunk/document) and embed that information too
  • create search tools specialized for your use cases, rather than for data types. The question is not "am I searching for semantic or structured data?" but "which tool would be the best to use for this specific search?" ^c819e0
    • e.g. a generic document search that searches everything, a contact search for finding people, a Request for Information search that takes specific RFI codes
    • Evaluate the tool selection capability separately
    • Make the model write a plan of all the tools it might want to use for a given query. Possibly present the plan for user approval; acceptance rates then create valuable training data.
    • The naming of tools significantly impacts how models use them; naming a tool grep rather than something else can affect how efficiently it is used.
  • formatting ^9d73c5
  • multi-agent vs single-agent systems
    • communication overhead if agents are NOT read-only: you need to align on who modifies what
    • if all agents are read-only, the work splits cleanly: for instance, when searching for information about a person, one agent may search professional sources, one personal life, another something else
    • a benefit of multi-agent setups is token efficiency, especially if there are more relevant tokens than one agent can consume in its context
      • The performance just increases with the number of tokens each sub-agent is able to consume. If you have 10 sub-agents, you can use more tokens, and your research quality is better
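
A minimal sketch of the small-to-big idea from the chunking bullet above: embed and retrieve small chunks, then expand each hit with its neighbors before passing it to the LLM. The function names and chunk data here are my own illustration, not any particular library's API.

```python
# Small-to-big retrieval: index small chunks, but hand the LLM
# a larger window of adjacent chunks around each hit.

def expand_with_neighbors(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """Return the hit chunk merged with `window` neighbors on each side."""
    start = max(0, hit_index - window)
    end = min(len(chunks), hit_index + window + 1)
    return " ".join(chunks[start:end])

def build_context(chunks: list[str], hit_indices: list[int], window: int = 1) -> str:
    # Deduplicate overlapping windows while preserving order.
    seen, parts = set(), []
    for i in hit_indices:
        expanded = expand_with_neighbors(chunks, i, window)
        if expanded not in seen:
            seen.add(expanded)
            parts.append(expanded)
    return "\n\n".join(parts)

chunks = ["Intro.", "Setup steps.", "Config details.", "Troubleshooting.", "FAQ."]
hit_indices = [2]  # e.g. indices returned by your vector search
print(build_context(chunks, hit_indices))  # Setup steps. Config details. Troubleshooting.
```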
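For the re-ranking bullet: one common option (not the only one) is to re-score retrieved candidates with a cross-encoder, which reads the query and document jointly. A sketch using the sentence-transformers library; the checkpoint is just a popular public one, swap in whatever fits.

```python
# Re-rank retrieved candidates with a cross-encoder, which scores
# (query, document) pairs jointly and is usually more accurate than
# the bi-encoder used for the initial retrieval.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    scores = reranker.predict([(query, doc) for doc in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_k]]
```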
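For query expansion: a sketch of the extra LLM call that rewrites the user query before retrieval. It uses the OpenAI client as an example; the model name and prompt wording are assumptions, any capable chat model works.

```python
# Rewrite/expand the user query with an extra LLM call before retrieval.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def expand_query(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model works
        messages=[
            {"role": "system",
             "content": "Rewrite the search query: fix typos, add synonyms "
                        "and alternative phrasings. Return only the rewritten query."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

# expanded = expand_query("laptop wont turn on")
```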
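For combining several scores in hybrid search: Reciprocal Rank Fusion is a common, scale-free way to merge ranked lists from, say, BM25 and vector search, since it uses ranks rather than raw scores. A minimal sketch; k=60 is the constant from the original RRF paper.

```python
# Reciprocal Rank Fusion: merge ranked lists without normalizing scores.
# Each document's fused score is the sum over lists of 1 / (k + rank).
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # full-text ranking
vector_hits = ["doc1", "doc5", "doc3"]  # vector-search ranking
print(rrf([bm25_hits, vector_hits]))    # doc1 and doc3 rise to the top
```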
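For metadata generation with NER: a sketch using the gliner library mentioned above. The checkpoint name and label set are assumptions; GLiNER is zero-shot, so the labels can be arbitrary business-relevant tags.

```python
# Zero-shot NER with GLiNER to generate metadata tags for chunks.
from gliner import GLiNER

model = GLiNER.from_pretrained("urchade/gliner_base")  # assumption: public checkpoint

def extract_metadata(chunk: str, labels: list[str]) -> dict[str, list[str]]:
    entities = model.predict_entities(chunk, labels, threshold=0.5)
    tags: dict[str, list[str]] = {label: [] for label in labels}
    for ent in entities:
        tags[ent["label"]].append(ent["text"])
    return tags

labels = ["person", "organization", "date", "product"]
# extract_metadata("Acme Corp released WidgetX on 2024-03-01.", labels)
```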

Not RAG-specific

  • Off-the-shelf bi-encoder (embedding) models can be fine-tuned like any other model, but this is rarely done in practice, as there is usually lower-hanging fruit

Other

Resources


```dataview
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"
```