data labeling


Tabular data
Image data
Text data
Video data
Audio data
Resources

Tabular data

Use https://www.snorkel.org/ to create labeling heuristics
semi-supervised learning: when only a small part of the dataset is labeled, we can generate labels for the unlabeled data with a model trained on that small labeled set using supervised learning. This is called pseudo-labeling.
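
A minimal sketch of the pseudo-labeling loop described above, assuming scikit-learn is available. The synthetic dataset, the `LogisticRegression` model, and the 0.9 confidence threshold are illustrative assumptions, not part of the original note.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic tabular data: pretend only the first 50 rows have labels.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_labeled, y_labeled = X[:50], y[:50]
X_unlabeled = X[50:]

# 1. Train a supervised model on the small labeled set.
model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# 2. Predict class probabilities for the unlabeled rows.
proba = model.predict_proba(X_unlabeled)
confidence = proba.max(axis=1)
pseudo_labels = model.classes_[proba.argmax(axis=1)]

# 3. Keep only confident predictions (the 0.9 threshold is an assumption).
keep = confidence >= 0.9

# 4. Retrain on the labeled rows plus the confident pseudo-labeled rows.
X_train = np.vstack([X_labeled, X_unlabeled[keep]])
y_train = np.concatenate([y_labeled, pseudo_labels[keep]])
final_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print(f"pseudo-labeled {keep.sum()} of {len(X_unlabeled)} unlabeled rows")
```

Filtering by confidence is one common way to limit the noise that wrong pseudo-labels feed back into training; the right threshold depends on the data.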
unsupervised learning using clustering methods such as K-means
- Metrics
  - inertia: the sum of the squared distances from each point to its assigned cluster centroid. If the points in a cluster are close to each other, they are similar and the cluster is a good one, so we aim for low inertia.
  - Dunn’s index: the ratio of the smallest distance between clusters to the largest distance within a cluster. Higher is better.
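
Both metrics can be computed directly from their definitions. The sketch below uses NumPy and a hand-made four-point dataset. The function names are hypothetical, and the Dunn index shown is one common variant (minimum distance between points of different clusters over maximum pairwise distance within a cluster).

```python
import numpy as np

def inertia(X, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

def dunn_index(X, labels):
    """Smallest between-cluster distance / largest within-cluster diameter."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    same = labels[:, None] == labels[None, :]
    min_between = dists[~same].min()
    max_within = dists[same].max()  # assumes some cluster has 2+ distinct points
    return float(min_between / max_within)

# Two tight, well-separated clusters.
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])

print(inertia(X, labels, centroids))  # 1.0
print(dunn_index(X, labels))          # 10.0
```

In scikit-learn, a fitted `KMeans` exposes the same inertia quantity as its `inertia_` attribute.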

Image data

weak supervision
- create a lot of noisy labels for each data point based on various heuristics and average them
- for images the label can be 0 or 1 depending on whether each object of interest is present
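
A small sketch of the averaging step, assuming each heuristic votes 1 (object present), 0 (absent), or abstains. The `ABSTAIN` value, the vote matrix, and the 0.5 cut-off are made up for illustration.

```python
import numpy as np

ABSTAIN = -1  # assumption: a heuristic may decline to vote

# Rows are images, columns are heuristics; 1 = object present, 0 = absent.
votes = np.array([
    [1, 1, 0],
    [0, ABSTAIN, 0],
    [1, ABSTAIN, 1],
    [0, 1, 0],
])

voted = votes != ABSTAIN
# Average only the heuristics that voted to get a soft label in [0, 1].
# Assumes every image has at least one vote, otherwise this divides by zero.
soft_labels = np.where(voted, votes, 0).sum(axis=1) / voted.sum(axis=1)
hard_labels = (soft_labels >= 0.5).astype(int)

print(soft_labels)
print(hard_labels)  # [1 0 1 0]
```

A plain average treats every heuristic as equally reliable; weighting the votes by an estimate of each heuristic's accuracy is a natural refinement.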
transfer learning
focal loss
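
For reference, a NumPy sketch of binary focal loss, which scales the usual cross-entropy term by `(1 - p_t) ** gamma` so that confidently correct examples contribute less. The function name is hypothetical, and the defaults `gamma=2.0`, `alpha=0.25` are commonly used values, taken here as assumptions.

```python
import numpy as np

def binary_focal_loss(y_true, p_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip probabilities so log() never sees exactly 0 or 1.
    p = np.clip(np.asarray(p_pred, dtype=float), eps, 1.0 - eps)
    # p_t is the probability the model assigned to the true class.
    p_t = np.where(y_true == 1, p, 1.0 - p)
    alpha_t = np.where(y_true == 1, alpha, 1.0 - alpha)
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))

easy = binary_focal_loss([1, 0], [0.95, 0.05])  # confident and correct
hard = binary_focal_loss([1, 0], [0.30, 0.70])  # mostly wrong
print(easy, hard)
```

With `gamma=0` the modulating factor disappears and the loss reduces to alpha-weighted cross-entropy.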

Text data

zero-shot learning
- topic generation, summarization
embedding + clustering (instead of classification head)
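
A minimal sketch of the embed-then-cluster idea, assuming scikit-learn. TF-IDF vectors stand in for a learned embedding model here, and the example texts and the choice of two clusters are invented for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the invoice payment failed on my credit card",
    "refund the payment charged to my card",
    "card payment declined and invoice unpaid",
    "cannot login because my password reset email never arrived",
    "password reset link expired and login fails",
    "login page rejects my new password",
]

# 1. Embed each text (TF-IDF is a stand-in for a sentence-embedding model).
embeddings = TfidfVectorizer(stop_words="english").fit_transform(texts)

# 2. Cluster the embeddings; each cluster id acts as a provisional label.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)

for cluster_id, text in zip(cluster_ids, texts):
    print(cluster_id, text)
```

The cluster ids carry no meaning by themselves, so a reviewer still has to inspect a few examples per cluster and give each one a name.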
Q: What makes a good custom interface for reviewing LLM outputs? – Hamel’s Blog

Video data

Audio data

Resources
