BM25


Note

Core differences vs TF-IDF

  • In raw TF-IDF, term frequency (TF) increases the score linearly, but users perceive relevance with diminishing returns: the 20th mention of a term matters less than the 1st, so the score should saturate.
    • Achieved with the ==$k_1$ parameter==, which transforms the TF into a saturating curve
      • a lower $k_1$ saturates faster, with 0 meaning instant saturation, i.e. the TF has no effect beyond the term being present
  • Same for document frequency (DF): a term found in only the first 10 or so documents is rare, after that it is just a common term
    • the $\log$ in TF-IDF transforms the DF this way
  • A short document (like a tweet) mentioning a term once is more relevant than a 1000-page book mentioning the same term 10 times.
    • BM25 scales the TF by document length with the ==$b$ parameter==, from 0 to 1
      • a higher $b$ gives the length more influence and weights relatively short documents more; 0 disables length normalization
  • Combining multiple fields has several problems
    • e.g. double-counting of term frequency
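
The saturation and length-scaling points above can be checked with a minimal sketch of the per-term BM25 score. This assumes the common Okapi form with a Lucene-style IDF and the widely used defaults $k_1 = 1.2$ and $b = 0.75$. The function name `bm25_term_score` and all numbers in the usage lines are made up for illustration and are not taken from any particular library.

```python
import math


def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Contribution of one query term to one document's BM25 score.

    Hypothetical helper for illustration only.
    tf: occurrences of the term in the document
    df: number of documents containing the term
    """
    if tf == 0:
        return 0.0  # term absent: no contribution (also avoids 0/0 when k1 == 0)
    # Log-damped IDF: rare terms score high, common terms approach 0.
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    # b blends between no length scaling (b=0) and full scaling (b=1).
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Saturating TF: grows with tf but is bounded by k1 + 1.
    tf_part = tf * (k1 + 1) / (tf + k1 * length_norm)
    return idf * tf_part


# Illustrative corpus statistics (assumed values).
stats = dict(df=50, n_docs=10_000, avg_doc_len=1_000)

# Diminishing returns: the 20th mention adds less than the 2nd.
gain_early = bm25_term_score(2, doc_len=1_000, **stats) - bm25_term_score(1, doc_len=1_000, **stats)
gain_late = bm25_term_score(20, doc_len=1_000, **stats) - bm25_term_score(19, doc_len=1_000, **stats)

# Tweet with one mention vs. a very long book with ten mentions.
tweet = bm25_term_score(1, doc_len=20, **stats)
book = bm25_term_score(10, doc_len=300_000, **stats)

print(gain_early > gain_late, tweet > book)
```

With `k1=0` the TF part collapses to 1 for any present term, and with `b=0` two documents with the same TF score the same regardless of length, which matches the two parameter notes above.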

Resources

