data labeling
scroll ↓ to Resources
Contents
Tabular data
- Use https://www.snorkel.org/ to create labeling heuristics
- semi-supervised learning: when only a small part of the dataset is labeled, train a supervised model on that labeled subset and use its predictions as labels for the unlabeled data. This is called pseudo-labeling.
- unsupervised learning using clustering methods such as K-means
- Metrics
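- inertia: the sum of squared distances from each point to the centroid of its cluster. Lower inertia means the points in each cluster sit closer to their centroid, so the clusters are tighter. Inertia tends to shrink as the number of clusters grows, so a small value alone does not prove the clustering is good.
- Dunn’s index: the ratio of the smallest distance between clusters to the largest distance within a cluster; higher is better.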
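A minimal sketch of the two metrics above, assuming NumPy is available. The toy points and the helper names `inertia` and `dunn_index` are made up for illustration. Dunn's index has several variants; this one uses the closest pair of points across clusters as the separation and the farthest pair within a cluster as the diameter.

```python
import numpy as np

def inertia(X, labels):
    """Sum of squared distances from each point to its cluster's mean."""
    total = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        centroid = members.mean(axis=0)
        total += ((members - centroid) ** 2).sum()
    return float(total)

def dunn_index(X, labels):
    """Smallest between-cluster distance / largest within-cluster distance."""
    clusters = [X[labels == k] for k in np.unique(labels)]

    def pairwise(a, b):
        # Euclidean distance between every row of a and every row of b.
        return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)

    max_diameter = max(pairwise(c, c).max() for c in clusters)
    min_separation = min(
        pairwise(clusters[i], clusters[j]).min()
        for i in range(len(clusters))
        for j in range(i + 1, len(clusters))
    )
    return float(min_separation / max_diameter)

# Toy data (assumption): two tight, well-separated clusters.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])

print(inertia(X, labels))     # 4.0
print(dunn_index(X, labels))  # 5.0
```

Low inertia together with a high Dunn's index suggests compact, well-separated clusters.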
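The pseudo-labeling idea from the list above could look roughly like this, assuming scikit-learn and NumPy are installed. The synthetic dataset, the 50-row labeled split, the logistic regression model, and the 0.9 confidence threshold are all illustrative assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic tabular data standing in for a real dataset (assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Pretend only the first 50 rows came with labels.
X_labeled, y_labeled = X[:50], y[:50]
X_unlabeled = X[50:]

# 1. Train a supervised model on the small labeled subset.
model = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)

# 2. Predict class probabilities for the unlabeled rows.
proba = model.predict_proba(X_unlabeled)
confidence = proba.max(axis=1)
pseudo_labels = model.classes_[proba.argmax(axis=1)]

# 3. Keep only confident predictions (0.9 is an arbitrary threshold).
keep = confidence >= 0.9

# 4. Retrain on the labeled rows plus the confidently pseudo-labeled rows.
X_aug = np.vstack([X_labeled, X_unlabeled[keep]])
y_aug = np.concatenate([y_labeled, pseudo_labels[keep]])
model_aug = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

print(f"pseudo-labeled {keep.sum()} of {len(X_unlabeled)} unlabeled rows")
```

Filtering by confidence limits how many wrong pseudo-labels feed back into training. scikit-learn also ships `sklearn.semi_supervised.SelfTrainingClassifier`, which wraps a similar loop.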
Image data
- weak supervision
- create many noisy labels for each data point from various heuristics, then combine them (for example by averaging)
- for images, each label can be 0 or 1 depending on whether a given object of interest is present
- transfer learning
- focal loss
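The weak supervision bullets above (many noisy 0/1 labels per data point, averaged) can be sketched in plain Python. The image names, the three heuristics behind the votes, and the 0.5 threshold are hypothetical. This is simple averaging for illustration, not Snorkel's API.

```python
def combine_votes(votes, threshold=0.5):
    """Average the 0/1 votes for one data point, ignoring abstentions (None).

    Returns (soft_label, hard_label), or (None, None) if nobody voted.
    """
    cast = [v for v in votes if v is not None]
    if not cast:
        return None, None  # every heuristic abstained
    soft = sum(cast) / len(cast)
    return soft, int(soft >= threshold)

# Each row holds the votes of three hypothetical heuristics on one image:
# 1 = object of interest present, 0 = absent, None = heuristic abstains.
votes_per_image = {
    "img_001": [1, 1, 0],
    "img_002": [0, None, 0],
    "img_003": [None, None, None],
}

for name, votes in votes_per_image.items():
    print(name, combine_votes(votes))
```

The soft label (the average) can serve as a confidence score. Data points where every heuristic abstains stay unlabeled.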
Text data
- zero-shot learning
- topic generation, summarization
- embedding + clustering (instead of classification head)
- Q: What makes a good custom interface for reviewing LLM outputs? – Hamel’s Blog
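A rough sketch of the "embedding + clustering" item above, assuming scikit-learn is installed. TF-IDF vectors stand in for learned text embeddings here, which is a simplifying assumption. The example texts and the choice of two clusters are made up.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy unlabeled texts (assumption): two refund requests, two crash reports.
texts = [
    "refund my order please",
    "I want a refund for this order",
    "the app crashes on startup",
    "app crashes when I open it",
]

# TF-IDF vectors stand in for embeddings from a language model (assumption).
embeddings = TfidfVectorizer().fit_transform(texts)

# Group the vectors; each cluster id is a candidate label to review.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(embeddings)

for text, cid in zip(texts, cluster_ids):
    print(cid, text)
```

A reviewer can then name each cluster once instead of labeling every text individually.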
Video data
Audio data
Resources
- Data labeling in Label Studio with GPT-4: ML Backend integration / Habr
- Data Labeling in Machine Learning with Python - Vijaya Kumar Suda