BM25
Note
Core differences vs TF-IDF
- In raw TF-IDF, Term-Frequency (TF) increases the score linearly, but users perceive relevance with diminishing returns. The 20th mention of a term is less important than the 1st ⇒ Score should saturate
    - Achieved with the ==$k_1$ parameter== transforming the TF ⇒ $\frac{TF \cdot (k_1 + 1)}{TF + k_1}$
        - lower $k_1$ saturates faster, with 0 meaning instant saturation, i.e. Term-Frequency is ignored
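- Same for Document Frequency (DF): a term found in up to ~10 documents is rare, after that it is just a common term ⇒ IDF should flatten as well
    - IDF in TF-IDF transforms $\log\frac{N}{DF}$ ⇒ $\log\left(1 + \frac{N - DF + 0.5}{DF + 0.5}\right)$, with $N$ the number of documents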
- A short document (like a tweet) mentioning a term once is more relevant than a 1000-page book mentioning the same term 10 times.
    - BM25 scales to length (relative to the average document length) with the ==$b$ parameter== from 0 to 1
        - higher $b$ gives more influence to the length and weights relatively short documents more; $b = 0$ turns length scaling off
- Combining multiple fields has several problems
    - double-counting of Term-Frequency: a term saturates separately in each field, so the per-field scores add up
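
To make the three adjustments above concrete, here is a minimal Python sketch. It assumes the common Lucene-style formulation of BM25 (IDF with the `+ 1` inside the log). The function names are hypothetical, and the defaults `k1 = 1.2`, `b = 0.75` are just widely used starting values, not something this note prescribes.

```python
import math


def tf_saturation(tf: float, k1: float = 1.2) -> float:
    """Saturating TF component (no length scaling); approaches k1 + 1 as tf grows."""
    if tf <= 0:
        return 0.0
    return tf * (k1 + 1) / (tf + k1)


def idf_bm25(n_docs: int, df: int) -> float:
    """Assumed Lucene-style IDF: stays positive even for very common terms."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


def bm25_term_score(
    tf: float,
    df: int,
    n_docs: int,
    doc_len: float,
    avg_doc_len: float,
    k1: float = 1.2,
    b: float = 0.75,
) -> float:
    """Score of a single term in a single document."""
    if tf <= 0:
        return 0.0
    # b = 0 -> length is ignored, b = 1 -> full scaling by doc_len / avg_doc_len
    length_norm = 1 - b + b * doc_len / avg_doc_len
    tf_part = tf * (k1 + 1) / (tf + k1 * length_norm)
    return idf_bm25(n_docs, df) * tf_part


if __name__ == "__main__":
    # Diminishing returns: each extra mention adds less than the previous one.
    for tf in (1, 2, 5, 20):
        print(tf, round(tf_saturation(tf), 3))

    # Tweet (1 mention, 20 tokens) vs. book (10 mentions, 300000 tokens).
    tweet = bm25_term_score(tf=1, df=10, n_docs=1000, doc_len=20, avg_doc_len=1000)
    book = bm25_term_score(tf=10, df=10, n_docs=1000, doc_len=300_000, avg_doc_len=1000)
    print(tweet > book)
```

With `k1 = 0` the TF part is 1 for any term that occurs at all, which is the "instant saturation" case; with `b = 0` two documents with the same TF score the same regardless of length.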
Resources
- BM25F from scratch
- Cheat at Search Essentials: BM25 + Lexical
- Remember our good friend BM25 that empowers every RAG hybrid search? We all know it is powerful but …