log probs

scroll ↓ to Resources

Note

  • can be used for classification confidence estimation or early detection of data drift
  • log_probs are conditional on the model’s training and the context it sees. They reflect how likely the model was to generate a token given patterns learned from its training data, not an objective real-world probability. This makes them a useful signal of model uncertainty, but they shouldn’t be confused with the actual chances of an outcome (although the very broad training data of LLMs gives them sufficient accuracy for the vast majority of use cases)
  • Important to remember that single-token log probabilities are not class probabilities because:
    • the probability distribution is over the entire vocabulary, not a fixed set of class labels, so the class probabilities won’t sum to 1
    • tokenization causes common issues (a label can be split across several tokens, so one token’s probability doesn’t cover the whole label)
    • they are context-dependent (as above)

Classification confidence with log_probs

  • The right way to implement classification confidence evaluation with log_probs:
    • Define a fixed enum of possible class labels
    • Compute sequence log probabilities for each complete class label
    • Renormalize these probabilities across only your defined classes
      • P(class_i | input) = exp(log_prob_i) / sum(exp(log_prob_j) for all j in classes)
    • Apply calibration using holdout validation data to ensure the probabilities reflect true confidence
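The renormalization step above can be sketched in Python. This is a minimal sketch: the log-probability values are illustrative placeholders, and obtaining them from a real model (e.g. by summing token log_probs for each complete label) is assumed to happen beforehand.

```python
import math

# Assumption: sequence log-probabilities already computed for each class label.
# The numbers below are illustrative, not real model output.
class_log_probs = {
    "positive": -0.3,
    "negative": -2.1,
    "neutral": -3.5,
}

def renormalize(log_probs: dict) -> dict:
    # Softmax over only the defined classes:
    # P(class_i | input) = exp(log_prob_i) / sum(exp(log_prob_j) for all j)
    # Subtracting the max log-prob first avoids overflow/underflow.
    m = max(log_probs.values())
    exp_scores = {c: math.exp(lp - m) for c, lp in log_probs.items()}
    total = sum(exp_scores.values())
    return {c: s / total for c, s in exp_scores.items()}

probs = renormalize(class_log_probs)
# Unlike raw vocabulary-wide probabilities, these now sum to 1 over the enum.
```

Note that the output of `renormalize` is still uncalibrated; the calibration step against holdout data happens on top of these renormalized scores.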

Data and model drift detection

  • accuracy is a step function: by the time the max log-prob token changes, drift detection is already late
    • continuous metrics are better, since they detect changes in the distribution even while the argmax stays the same
  • entropy
    • When a model becomes less certain about its predictions, the entropy of its output distribution increases, even if the top prediction remains the same
  • Kullback-Leibler divergence
    • Maintain a set of stable test prompts that remain consistent, run each prompt N times, calculate statistics, establish a baseline
    • Run these prompts daily through the model
    • Use token-wise KL divergence to compare today’s softmax output against the baseline.
    • Flag when KL divergence exceeds a threshold ⇒ because the prompts are fixed, any change must come from the model, so model drift is distinguished from potential data drift
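Both continuous metrics above can be computed in a few lines of Python. This is a minimal sketch: the baseline and "today" distributions and the threshold value are illustrative assumptions, not tuned values.

```python
import math

def entropy(p):
    # Shannon entropy of a probability distribution; rises as the
    # model becomes less certain, even if the argmax is unchanged.
    return -sum(x * math.log(x) for x in p if x > 0)

def kl_divergence(p, q, eps=1e-12):
    # Token-wise KL(P || Q) between baseline P and today's Q.
    # eps guards against zero probabilities in the comparison run.
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0)

# Illustrative softmax outputs over the same token positions:
baseline = [0.7, 0.2, 0.1]
today = [0.5, 0.3, 0.2]  # argmax unchanged, but the distribution has shifted

THRESHOLD = 0.05  # assumption: in practice, tune on baseline run-to-run variance
drifted = kl_divergence(baseline, today) > THRESHOLD
```

In this example accuracy-style monitoring would see nothing (the top token is the same), while both entropy and KL divergence register the shift.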

Resources


table file.inlinks, file.outlinks from [[]] and !outgoing([[]])  AND -"Changelog"