BM25


Note

Core differences vs TF-IDF

In raw TF-IDF, Term-Frequency (TF) increases the score linearly, but users perceive relevance with diminishing returns: the 20th mention of a term matters less than the 1st ⇒ the score should saturate
- Achieved with the ==$k_1$ parameter==, which transforms $TF$ ⇒ $TF / (TF + k_1)$
  - lower $k_1$ saturates faster; $k_1 = 0$ means instant saturation, i.e. Term-Frequency has no influence
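
The saturation effect can be sketched in a few lines of Python. This is a minimal illustration of the simplified form $TF / (TF + k_1)$ used in this note; the function name `saturated_tf` and the sample values are assumptions made for the example. Full BM25 implementations typically also multiply by $(k_1 + 1)$ and fold in length normalization.

```python
def saturated_tf(tf: float, k1: float) -> float:
    """Simplified BM25 term-frequency saturation: TF / (TF + k1)."""
    if tf <= 0:
        # A term that does not occur contributes nothing (also avoids 0/0 when k1 == 0).
        return 0.0
    return tf / (tf + k1)


if __name__ == "__main__":
    # Compare how quickly the score flattens for a few illustrative k1 values.
    for k1 in (0.0, 1.2, 2.0):
        row = [round(saturated_tf(tf, k1), 3) for tf in (1, 2, 5, 20)]
        print(f"k1={k1}: {row}")
```

With $k_1 = 0$ every non-zero TF maps to 1, so only the presence of the term counts. With larger $k_1$ the gain from each additional mention shrinks more slowly.
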
The same applies to Document Frequency (DF): a term is rare for roughly its first 10 occurrences; beyond that it is just a common term, so its weight should flatten out
- $1/DF$ in TF-IDF transforms ⇒ $\log(1 + (N_{docs} - DF + 0.5) / (DF + 0.5))$
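
As a rough sketch, the IDF term above can be computed directly. The function name `bm25_idf` and the corpus size of 1000 documents are assumptions chosen for illustration, not values from the note.

```python
import math


def bm25_idf(n_docs: int, df: int) -> float:
    """IDF as written above: log(1 + (N_docs - DF + 0.5) / (DF + 0.5))."""
    return math.log(1 + (n_docs - df + 0.5) / (df + 0.5))


if __name__ == "__main__":
    n_docs = 1000  # hypothetical corpus size
    for df in (1, 10, 100, 1000):
        print(f"DF={df:>4}: idf={bm25_idf(n_docs, df):.3f}")
```

The `1 +` inside the logarithm keeps the weight positive even when a term appears in every document, and the weight drops quickly over the first few documents before flattening.
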
A short document (like a tweet) mentioning a term once is more relevant than a 1000-page book mentioning the same term 10 times.
- BM25 scales the score by document length with the ==$b$ parameter==, from 0 to 1
  - higher $b$ gives document length more influence, favoring relatively short documents; $b = 0$ disables length normalization
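
The tweet-versus-book example can be checked with a small sketch of the commonly cited BM25 term-frequency component, where document length relative to the average length rescales $k_1$. The function name `bm25_tf`, the defaults `k1=1.2` and `b=0.75` (widely used defaults, assumed here), and the document lengths are all illustrative assumptions.

```python
def bm25_tf(tf: float, doc_len: float, avg_doc_len: float,
            k1: float = 1.2, b: float = 0.75) -> float:
    """Length-normalized TF component: TF*(k1+1) / (TF + k1*(1 - b + b*len/avg_len))."""
    if tf <= 0:
        return 0.0
    length_norm = 1 - b + b * doc_len / avg_doc_len
    return tf * (k1 + 1) / (tf + k1 * length_norm)


if __name__ == "__main__":
    avg_len = 500  # hypothetical average document length in tokens
    for b in (0.0, 0.75, 1.0):
        tweet = bm25_tf(tf=1, doc_len=20, avg_doc_len=avg_len, b=b)
        book = bm25_tf(tf=10, doc_len=300_000, avg_doc_len=avg_len, b=b)
        print(f"b={b}: tweet={tweet:.3f} book={book:.3f}")
```

With `b=0` length is ignored and the book's 10 mentions win. As `b` grows, the tweet's single mention outweighs them.
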
Combining multiple fields (e.g. title and body) has several problems
- double-counting of Term-Frequency
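
One way to read the double-counting problem, sketched under the assumption that each field is scored separately with the simplified saturation $TF / (TF + k_1)$ and the results are summed: every field gets its own saturation curve, so the combined score can exceed the ceiling that a single saturated term would have. The field names and counts below are hypothetical.

```python
def saturated_tf(tf: float, k1: float = 1.2) -> float:
    """Simplified saturation TF / (TF + k1); k1=1.2 is an assumed default."""
    return tf / (tf + k1) if tf > 0 else 0.0


if __name__ == "__main__":
    title_tf, body_tf = 3, 3  # hypothetical per-field term counts

    # Scoring each field separately and summing: saturation is applied twice.
    per_field_sum = saturated_tf(title_tf) + saturated_tf(body_tf)

    # Pooling the counts first and saturating once keeps the score below 1.
    pooled = saturated_tf(title_tf + body_tf)

    print(f"sum of per-field scores: {per_field_sum:.3f}")
    print(f"saturate pooled counts:  {pooled:.3f}")
```

Summing per-field scores lets the same term pass the single-field ceiling of 1, while pooling the counts before saturating preserves the diminishing-returns behavior.
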
