Advanced RAG techniques
scroll down to Resources
Advanced improvements to RAG
Note
RAG intertwines with the general topic of model evaluation and is adjacent to such topics as synthetic data and Challenges with RAG
- Most likely you will have to chunk your context data into smaller pieces. The chunking strategy can have a huge impact on RAG performance.
- small chunks → limited context → incomplete answers
- large chunks → noise in data → poor recall
- Chunk by characters, sentences, or semantic meaning, using a dedicated model or an LLM call
- semantic chunking: detect where a change of topic happens
- Consider inference latency and the number of tokens the embedding model was trained on
- Overlapping or not?
- Use small chunks at the embedding stage and a larger size at inference, e.g. by appending adjacent chunks before feeding them to the LLM (see the sketch below)
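A minimal chunking sketch, assuming character-based splitting with overlap and a hypothetical `expand_with_neighbors` helper for appending adjacent chunks at inference time; all sizes are illustrative, not recommendations.

```python
# Character-based chunking with overlap (sizes are illustrative assumptions).
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character-based chunks."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
    return chunks

def expand_with_neighbors(chunks: list[str], hit_index: int, window: int = 1) -> str:
    """At inference time, append adjacent chunks around a retrieved hit
    so the LLM sees more context than was embedded."""
    lo = max(0, hit_index - window)
    hi = min(len(chunks), hit_index + window + 1)
    return "\n".join(chunks[lo:hi])
```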
- fine-tuning to make models output citations/references
- Start with small batches, measure performance, and increase data volume until you reach your desired accuracy level.
- shuffle the order of retrieved sources to prevent position bias
- unless sources are sorted by relevance (the model assumes that the 1st chunk is the most relevant)
- newer models with large context windows are less prone to the Lost in the Middle effect and have improved recall across the whole context window
- re-ranking
- see Re-ranking
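A minimal re-ranking sketch, assuming a cross-encoder from the sentence-transformers library; the model name and top_k are placeholder choices.

```python
# Re-rank retrieved chunks with a cross-encoder (model name is a placeholder).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    """Score (query, chunk) pairs and keep the highest-scoring chunks."""
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```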
- query expansion and enhancement
- An additional LLM-call module can rewrite and expand the initial user query: adding synonyms, rephrasing it, complementing it with an initial LLM output (generated without RAG context), etc. (see the sketch below)
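A hedged sketch of such a query-rewriting call using the OpenAI client; the model name and prompt wording are assumptions.

```python
# Rewrite/expand the user query with an extra LLM call before retrieval.
# The model name and prompt are placeholder assumptions.
from openai import OpenAI

client = OpenAI()

def expand_query(query: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system",
             "content": "Rewrite the search query: fix typos, add synonyms and "
                        "a short rephrasing. Return only the rewritten query."},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```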
- In addition to dense embedding models, there are also (historically older) sparse representation methods. These can and should be used alongside vector search, resulting in hybrid search ^f44082
- Using hybrid search (at least full-text + vector search) is standard for RAG, but it requires combining several scores into one (see the sketch below) ^6fd281
- use weighted average
- take several top-results from each search module
- use Reciprocal Rank Fusion, mean average precision, NDCG, etc.
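A minimal Reciprocal Rank Fusion sketch for merging ranked lists from the different search modules; k = 60 is a commonly used default, but it is a tunable assumption.

```python
# Reciprocal Rank Fusion: merge several ranked result lists into one ranking.
# k dampens the influence of lower-ranked hits; 60 is a common default.
from collections import defaultdict

def reciprocal_rank_fusion(result_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for results in result_lists:                  # e.g. [full_text_ids, vector_ids]
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([full_text_top_ids, vector_top_ids])
```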
- metadata filtering reduces the search space, hence improves retrieval and reduces the computational burden
- dates, freshness, source authority (e.g. for health datasets), business-relevant tags
- categories: use named entity recognition models such as GliNER
- if there is no metadata, one can ask an LLM to generate it (see the sketch below)
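A sketch of pre-filtering chunks on metadata before vector search; the field names, tags, and threshold values are illustrative assumptions.

```python
# Pre-filter candidate chunks by metadata before running vector search.
# Field names ("published", "tag") and thresholds are illustrative assumptions.
from datetime import date

def metadata_filter(chunks: list[dict], newer_than: date, allowed_tags: set[str]) -> list[dict]:
    """Keep only chunks that are fresh enough and carry an allowed tag."""
    return [
        c for c in chunks
        if c["published"] >= newer_than and c["tag"] in allowed_tags
    ]

# Run vector search only over:
# metadata_filter(all_chunks, date(2024, 1, 1), {"health", "finance"})
```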
- Shuffling context chunks creates randomness in outputs, which increases diversity of the downstream output (an alternative to raising the softmax temperature), e.g. previously purchased items are provided in random order to make a recommendation engine's output more creative
- One can generate summaries of documents (or questions for each chunk/document) and embed that information too (see the sketch below)
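A sketch of indexing both the raw chunk and an LLM-generated summary or question for it; `summarize` stands in for any LLM call, and the embedding model name is a placeholder assumption.

```python
# Embed both the raw chunk and a generated summary/question for it, so either
# representation can match the user query. `summarize` is any LLM call;
# the embedding model name is a placeholder assumption.
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")

def index_chunk(chunk: str, summarize) -> list[dict]:
    summary = summarize(chunk)  # short summary or a question the chunk answers
    return [
        {"text": chunk, "embedding": embedder.encode(chunk), "kind": "chunk"},
        {"text": chunk, "embedding": embedder.encode(summary), "kind": "summary"},
    ]
```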
- create search tools specialized for your use cases rather than for data types. The question is not "am I searching for semantic or structured data?" but "which tool would be the best to use for this specific search?" (see the sketch below) ^c819e0
- Generic document search that searches everything, Contact search for finding people, Request for Information search that takes specific RFI codes.
- Evaluate the tool selection capability separately
- Make the model write a plan of all the tools it might want to use for a given query. Possibly present the plan for user approval; this creates valuable training data based on acceptance rates.
- The naming of tools significantly impacts how models use them; naming a tool grep versus something else can affect efficiency.
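A sketch of what exposing several use-case-specific search tools could look like, using the common JSON-schema function-tool convention; the tool names, descriptions, and parameters are illustrative assumptions.

```python
# Use-case-specific search tools instead of one generic search.
# Names, descriptions, and parameters are illustrative assumptions.
search_tools = [
    {
        "name": "document_search",
        "description": "Generic search over all indexed documents.",
        "parameters": {"type": "object",
                       "properties": {"query": {"type": "string"}},
                       "required": ["query"]},
    },
    {
        "name": "contact_search",
        "description": "Find people by name, role, or organization.",
        "parameters": {"type": "object",
                       "properties": {"name": {"type": "string"}},
                       "required": ["name"]},
    },
    {
        "name": "rfi_search",
        "description": "Look up a Request for Information by its RFI code.",
        "parameters": {"type": "object",
                       "properties": {"rfi_code": {"type": "string"}},
                       "required": ["rfi_code"]},
    },
]
```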
- formatting ^9d73c5
- Does Prompt Formatting Have Any Impact on LLM Performance?
- check which format (Markdown, JSON, XML) works best for your application; there are also discussions about token efficiency (see the sketch below)
- spaces between tokens in Markdown tables (like "| data |" instead of "|data|") affect how the model processes the information.
- The Impact of Document Formats on Embedding Performance and RAG Effectiveness in Tax Law Application
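A quick way to compare token efficiency of the same record in different formats, assuming tiktoken as the tokenizer; in practice, use the tokenizer of your target model.

```python
# Compare how many tokens the same record costs in different formats.
# The encoding name is an assumption; use your target model's tokenizer.
import json
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
record = {"name": "Alice", "role": "engineer", "team": "platform"}

variants = {
    "json": json.dumps(record),
    "markdown_tight": "|name|role|team|\n|Alice|engineer|platform|",
    "markdown_spaced": "| name | role | team |\n| Alice | engineer | platform |",
}
for label, text in variants.items():
    print(label, len(enc.encode(text)))
```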
- multi-agent vs single-agent systems
- communication overhead if agents are NOT read-only: they need to align on who modifies what
- if all agents are read-only they can work in parallel; for instance, when searching for information about a person, one may search professional sources, one personal life, another something else
- a benefit of multi-agent setups is token efficiency, especially if there are more tokens than a single agent can consume in its context
- Performance increases with the number of tokens each sub-agent is able to consume: with 10 sub-agents you can use more tokens overall, and research quality is better
Not RAG-specific
- Off-the-shelf bi-encoder embedding models can be fine-tuned like any other model, but in practice it is rarely done, as there is much lower-hanging fruit
Other
- AutoML tool for RAG - auto-configuring your RAG
- Contextual Retrieval / Anthropic
- Query Classification / Routing - save resources by pre-defining when the query doesn't need external context and can be answered directly or using chat history.
- Multi-modal RAG, in case your queries need access to images, tables, video, etc. Then you need a multi-modal embedding model too.
- Self-RAG, Iterative RAG
- Hierarchical Index Retrieval - first search for a relevant book, then chapter, etc.
- Graph-RAG
- Chain-of-Note
- Contextual Document Embeddings
Resources
- GitHub - NirDiamant/RAG_Techniques: This repository showcases various advanced techniques for Retrieval-Augmented Generation (RAG) systems
- Yet another RAG system - implementation details and lessons learned : r/LocalLLaMA
- AI Engineering - Chip Huyen
Links to this File
table file.inlinks, file.outlinks from [[]] and !outgoing([[]]) AND -"Changelog"