synthetic data generation for RAG evaluation

Steps

  • Chunk filtering
    • pre-filter documents/chunks by relevance to users (probability of being queried)
    • Aligned LLM-as-a-judge (a minimal alignment sketch follows this list)
      • manually label a small portion of the documents as relevant or irrelevant
      • iterate on the LLM-as-a-judge criteria until its judgments align with the labeled data
      • use the aligned judge to label the rest of the data
    • use context, tags, metadata, and date ranges as additional filters
  • Contextual chunk rewriting (optional)
    • expensive if run on every chunk
    • identify the chunks that actually need added context, such as tables, images, … (a cheap triage heuristic is sketched after this list)
  • Query generation
    • generate questions from documents
    • use few-shot examples and context to create realistic queries, in both content and formulation/format (a few-shot prompt is sketched after this list)
      • "what is the purpose of X in Y" is too clean and too easy to search; a real user is more likely to ask something like "X is not working"
    • have domain experts review and validate the generated queries
  • Ranking generation from questions and chunks
    • craft an LLM-as-a-judge prompt whose automatic rankings match the quality of your own manual ranking (a graded-judge sketch follows this list)
  • Summarization of ingested documents (optional) ^ea0ca7
    • the cost-efficiency benefit shrinks for modern models with large context windows
    • consider a separate search-summaries tool and use summarized chunks as a supplement to the raw data
    • design summarization prompts with the needs of the use case in mind
      • good for financial reports; if numbers are crucial, make sure the model retains them or sums them up correctly (a crude retention check is sketched after this list)
      • good for multimedia content without text captions
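
A minimal sketch of the aligned LLM-as-a-judge step in Python: the hand labels act as a yardstick, and the prompt is iterated until the judge's agreement score on the labeled set is acceptable. `call_llm`, `JUDGE_PROMPT`, and the `Chunk` type are hypothetical placeholders, not any particular library's API.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    id: str
    text: str

# Hypothetical judging criteria; iterate on this wording until the
# agreement score below is high enough on the hand-labeled set.
JUDGE_PROMPT = """You label document chunks for a search index.
Answer RELEVANT if a user is plausibly going to ask about this content,
IRRELEVANT otherwise (boilerplate, legal footers, changelogs).

Chunk:
{chunk}

Answer with exactly one word: RELEVANT or IRRELEVANT."""

def call_llm(prompt: str) -> str:
    """Placeholder: swap in your provider's completion call."""
    raise NotImplementedError

def judge(chunk: Chunk) -> bool:
    reply = call_llm(JUDGE_PROMPT.format(chunk=chunk.text))
    return reply.strip().upper().startswith("RELEVANT")

def agreement(hand_labeled: list[tuple[Chunk, bool]]) -> float:
    """Fraction of hand labels the judge reproduces; once this is
    acceptable, run `judge` over the rest of the corpus."""
    hits = sum(judge(chunk) == label for chunk, label in hand_labeled)
    return hits / len(hand_labeled)
```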
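
Because contextual rewriting is expensive on every chunk, a cheap pre-pass can flag only the chunks that plausibly need added context. The signals below (pipe-heavy text as a table proxy, markdown/HTML image markers) are rough assumptions, not a definitive detector.

```python
def needs_context(chunk_text: str) -> bool:
    """Crude triage: only chunks flagged here go to the LLM rewriting pass."""
    looks_tabular = chunk_text.count("|") > 8 or "\t" in chunk_text
    has_image_ref = "![" in chunk_text or "<img" in chunk_text.lower()
    return looks_tabular or has_image_ref
```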
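
For query generation, one way to steer the model away from "too clean" questions is a few-shot prompt that contrasts a documentation-style question with a realistic one. The prompt wording and the sync-agent example are invented for illustration; `call_llm` is the same placeholder as in the alignment sketch above.

```python
FEW_SHOT_PROMPT = """Write one question a real user would type into search
when they need the information in this chunk. Real users write terse,
problem-driven queries, not textbook questions.

Example chunk: "The sync agent retries failed uploads three times with
exponential backoff before surfacing an error."
Too clean: "What is the retry policy of the sync agent?"
Realistic: "sync keeps failing, does it retry by itself?"

Chunk:
{chunk}

Realistic question:"""

def call_llm(prompt: str) -> str:  # placeholder, as above
    raise NotImplementedError

def generate_query(chunk_text: str) -> str:
    return call_llm(FEW_SHOT_PROMPT.format(chunk=chunk_text)).strip()
```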
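
For ranking generation, a graded judge prompt (0-3 relevance) turns per-pair judgments into a per-question ranking; check its scores against a handful of manually ranked questions before trusting it. The prompt and the scale are assumptions, not a fixed recipe.

```python
RANK_PROMPT = """Rate how well the chunk answers the question.
3 = fully answers, 2 = partially answers, 1 = tangentially related, 0 = unrelated.

Question: {question}

Chunk:
{chunk}

Reply with a single digit:"""

def call_llm(prompt: str) -> str:  # placeholder, as above
    raise NotImplementedError

def relevance(question: str, chunk_text: str) -> int:
    reply = call_llm(RANK_PROMPT.format(question=question, chunk=chunk_text))
    return int(reply.strip()[0])

def rank_chunks(question: str, chunks: list[str]) -> list[tuple[str, int]]:
    # graded labels double as a ranking: sort candidate chunks by judged score
    scored = [(chunk, relevance(question, chunk)) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```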
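
For the financial-report caveat, a crude post-check can flag summaries that dropped figures. Exact string matching is an assumption: legitimate aggregation (e.g. summed quarterly numbers) will also be flagged, which is acceptable if flagged summaries simply go to review.

```python
import re

NUMBER = re.compile(r"\d[\d,.]*%?")

def dropped_numbers(original: str, summary: str) -> set[str]:
    """Figures present in the source chunk but missing from its summary;
    an empty set means every number survived verbatim."""
    return set(NUMBER.findall(original)) - set(NUMBER.findall(summary))
```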
