
Visual Layer's Mislabel Detection Sets New SOTA in Industry & Academia

Today we're sharing head-to-head experiments comparing VL's mislabel detection features to baseline algorithms from highly-cited academic papers and companies offering mislabel detection capabilities.

Guy Singer
January 30, 2025 • 5 min read

Detecting mislabeled samples in image datasets is a notoriously difficult task, and one that matters greatly for training computer vision models. That’s why at Visual Layer Research, we’ve developed a state-of-the-art mislabel detection algorithm designed to detect label errors with unprecedented precision and to generalize across different data distributions and domains. We call this algorithm LabelRank, hinting at its ability to quantitatively rank the quality of an image or object label.

tl;dr - results summary

In our head-to-head evaluations against seven varieties of well-regarded mislabel detectors, LabelRank consistently outperformed all other offerings. The full experimental setup and methodology are detailed later in this post, but for those in a hurry, the results are summarized in the table below. Note that every algorithm that relies on precomputed embeddings was given the same image embeddings (CLIP ViT-B/32).

Method                        | 5% Mislabeled (AUROC) | 30% Mislabeled (AUROC)
Visual Layer (LabelRank)      | 0.990347              | 0.982041
SEMD                          | 0.989948              | 0.943281
CleanLab (Confident Learning) | 0.987626              | 0.952348
Voxel51 (RER)                 | 0.980253              | 0.923429
SimiFeat (voting)             | 0.988240              | 0.968336
SimiFeat (ranking)            | 0.972453              | 0.939736
SelfClean (Euclidean)         | 0.960443              | 0.871432
SelfClean (Cosine)            | 0.959172              | 0.870916

Background

Label noise in computer vision datasets can have detrimental effects on model training: a noisier dataset must be larger to reach the same post-training accuracy that a cleaner dataset would deliver.¹ Unfortunately, such label noise appears to be pervasive in many of the open-source and academic datasets used to train vision models,² and manually reviewing every label is prohibitively expensive. To reduce the cost of label correction, we need methods that surface potentially mislabeled samples with high recall and without sacrificing precision. Because this is an important challenge in data-driven machine learning, it has attracted significant research, and a variety of approaches exist in the literature.³ Even so, there remains significant room for improvement, and over the past few months Visual Layer Research has pursued exactly such improvements.

Evaluation Methodology

To evaluate LabelRank, we compared its performance against other well-regarded mislabel detection algorithms, using the Caltech101 dataset⁴ as our benchmark. The experimental framework was designed to assess performance under different types and degrees of label noise, utilizing CLIP ViT-B/32 as the embedding model for all evaluations.
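
For reference, below is a minimal sketch of how CLIP ViT-B/32 image embeddings can be computed with the Hugging Face transformers library. The model identifier, batching, and normalization details shown here are illustrative assumptions, not necessarily the exact pipeline behind our evaluation harness.

```python
# Minimal sketch: CLIP ViT-B/32 image embeddings via Hugging Face transformers.
# Details (model id, batch size, L2 normalization) are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed_images(image_paths, batch_size=32):
    """Return L2-normalized CLIP embeddings for a list of image file paths."""
    chunks = []
    for start in range(0, len(image_paths), batch_size):
        batch = [Image.open(p).convert("RGB") for p in image_paths[start:start + batch_size]]
        inputs = processor(images=batch, return_tensors="pt")
        with torch.no_grad():
            feats = model.get_image_features(**inputs)
        # Normalize so that dot products equal cosine similarities.
        chunks.append(feats / feats.norm(dim=-1, keepdim=True))
    return torch.cat(chunks)
```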

We benchmarked LabelRank against:

  • RER, from Voxel51’s published algorithm: Class-wise Autoencoders Measure Classification Difficulty and Detect Label Mistakes⁵
  • Confident Learning, from CleanLab’s published algorithm: Confident Learning: Estimating Uncertainty in Dataset Labels⁶ (see the usage sketch after this list)
  • SimiFeat, from: Detecting Corrupted Labels Without Training a Model to Predict⁷
  • SelfClean, from: Intrinsic Self-Supervision for Data Quality Audits⁸
  • SEMD, from: An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets⁹
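
As an illustration of how one of these baselines can be run, the sketch below applies Confident Learning through the open-source cleanlab package. The k-nearest-neighbors probability estimator over the shared CLIP embeddings is an assumption made for this example, not necessarily the configuration used in our evaluation harness.

```python
# Illustrative sketch: the Confident Learning baseline via the cleanlab package.
# The k-NN probability estimator over CLIP embeddings is an assumption for this
# example, not necessarily the configuration used in our evaluation harness.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from cleanlab.filter import find_label_issues

def confident_learning_flags(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Flag likely label issues. `labels` are integer class ids in {0, ..., K-1}."""
    # Out-of-sample predicted probabilities, as cleanlab recommends.
    knn = KNeighborsClassifier(n_neighbors=10)
    pred_probs = cross_val_predict(knn, embeddings, labels, cv=5, method="predict_proba")
    # Boolean mask: True where the given label is likely wrong.
    return find_label_issues(labels=labels, pred_probs=pred_probs)
```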

Dataset Preparation and Mislabel Seeding

We seeded mislabels at two different noise levels, 5% and 30% of samples, simulating annotation-error patterns that commonly occur in real-world scenarios. Label errors are introduced in isolation, based on embedding similarity. Given a dataset D with n samples, where each sample i has a ground-truth label yᵢ, the process follows these steps:

1. For each sample xᵢ, we compute its embedding eᵢ using CLIP ViT-B/32.

2. A similarity matrix S is constructed from the cosine similarities between all embeddings.

3. For each class c, we identify candidate mislabels by selecting samples that are most similar to samples from different classes, subject to the constraint:

$$\forall i: |\{j \mid y_j \neq y_i \wedge j \in \mathcal{N}_k(i)\}| \leq \alpha|D_c|$$

where Nₖ(i) represents the k-nearest neighbors of sample i, Dc is the set of samples in class c, and α is the maximum fraction of samples per class that can be mislabeled (set to 0.4 in our experiments).

This seeding methodology maintains strict control over the total percentage of introduced mislabels (p) while ensuring:

$$\frac{|\{i \mid y_i \neq y'_i\}|}{|D|} = p \pm \varepsilon$$

where y’ᵢ represents the potentially corrupted label and ε is the maximum allowed deviation (set to 0.25).
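
A condensed sketch of this seeding procedure is shown below. It assumes the L2-normalized CLIP embeddings from the previous section; the greedy selection order and tie-breaking details are simplifications for illustration rather than our exact implementation.

```python
# Condensed sketch of embedding-similarity-based mislabel seeding.
# Assumes L2-normalized embeddings; greedy selection details are simplified.
import numpy as np

def seed_mislabels(embeddings: np.ndarray, labels: np.ndarray,
                   p: float = 0.05, alpha: float = 0.4) -> np.ndarray:
    """Return a copy of `labels` with roughly a fraction p flipped."""
    n = len(labels)
    sims = embeddings @ embeddings.T              # cosine similarities
    np.fill_diagonal(sims, -np.inf)

    # For each sample, find its most similar sample carrying a *different* label.
    cross_class = labels[None, :] != labels[:, None]
    cross_sims = np.where(cross_class, sims, -np.inf)
    best_other = cross_sims.argmax(axis=1)        # nearest cross-class neighbor
    confusability = cross_sims.max(axis=1)        # how easily the sample is confused

    corrupted = labels.copy()
    budget = int(round(p * n))                    # total mislabels to introduce
    cap = {c: int(alpha * np.sum(labels == c)) for c in np.unique(labels)}
    flipped = {c: 0 for c in cap}

    # Flip the most confusable samples first, respecting the per-class cap.
    for i in np.argsort(-confusability):
        if budget == 0:
            break
        c = labels[i]
        if flipped[c] >= cap[c]:
            continue
        corrupted[i] = labels[best_other[i]]      # adopt the confusing class's label
        flipped[c] += 1
        budget -= 1
    return corrupted
```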

This setup enabled us to systematically evaluate LabelRank against other mislabel detection algorithms under different noise patterns and intensities.

Results

LabelRank successfully detects the vast majority (>99% in the 5% condition) of mislabeled samples, even in situations where the seeded mislabel is exceedingly similar to the ground truth label. Some examples of exceptional difficulty are provided below:

Our evaluation demonstrated that LabelRank consistently outperforms existing state-of-the-art mislabel detection methods across both benchmark configurations. The results are quantified using the area under the receiver operating characteristic curve (AUROC), a comprehensive metric that evaluates detection performance across all possible detection thresholds. The full results are provided in the table below:

Method                        | 5% Mislabeled (AUROC) | 30% Mislabeled (AUROC)
Visual Layer (LabelRank)      | 0.990347              | 0.982041
SEMD                          | 0.989948              | 0.943281
CleanLab (Confident Learning) | 0.987626              | 0.952348
Voxel51 (RER)                 | 0.980253              | 0.923429
SimiFeat (voting)             | 0.988240              | 0.968336
SimiFeat (ranking)            | 0.972453              | 0.939736
SelfClean (Euclidean)         | 0.960443              | 0.871432
SelfClean (Cosine)            | 0.959172              | 0.870916

LabelRank achieved AUROC scores of 0.990 and 0.982 for the 5% and 30% noise levels respectively, surpassing the next-best-performing methods, SEMD (in the 5% condition) and SimiFeat (in the 30% condition). The performance gap, which is particularly pronounced in the high-noise regime (30%), indicates LabelRank’s superior robustness in detecting errors even in heavily mislabeled datasets.
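
For completeness, scoring any detector with this metric only requires a per-sample mislabel score and the ground-truth corruption mask. A minimal sketch using scikit-learn is shown below; the variable names are assumptions for illustration.

```python
# Minimal sketch: AUROC evaluation of a mislabel detector with scikit-learn.
# `mislabel_scores` (higher = more suspicious) and `is_corrupted` (True where a
# label was seeded as wrong) are assumed inputs for illustration.
import numpy as np
from sklearn.metrics import roc_auc_score

def detector_auroc(mislabel_scores: np.ndarray, is_corrupted: np.ndarray) -> float:
    """Area under the ROC curve, aggregated over all detection thresholds."""
    return roc_auc_score(is_corrupted.astype(int), mislabel_scores)
```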

Conclusion

LabelRank is available immediately as part of the suite of Quality Analysis features in the Visual Layer Platform. Visual Layer's customers are already applying these capabilities across the defense, e-commerce, manufacturing, and biomedical industries, for applications that span model training, defect inspection, and intelligence analysis. In the near future, we expect to release additional dataset evaluations beyond natural scenes, and to open-source the evaluation datasets for others to use in their benchmarks.

References

[1] Rolnick, D. (2017). Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.

[2] Northcutt, C. G., Athalye, A., & Mueller, J. (2021). Pervasive label errors in test sets destabilize machine learning benchmarks. arXiv preprint arXiv:2103.14749.

[3, 9] Srikanth, M., Irvin, J., Hill, B. W., Godoy, F., Sabane, I., & Ng, A. Y. (2023). An Empirical Study of Automated Mislabel Detection in Real World Vision Datasets. arXiv preprint arXiv:2312.02200.

[4] Li, F.-F., Andreeto, M., Ranzato, M., & Perona, P. (2022). Caltech 101 (1.0) [Data set]. CaltechDATA. https://doi.org/10.22002/D1.20086

[5] Marks, J., Griffin, B. A., & Corso, J. J. (2024). Class-wise Autoencoders Measure Classification Difficulty And Detect Label Mistakes. arXiv preprint arXiv:2412.02596.

[6] Northcutt, C., Jiang, L., & Chuang, I. (2021). Confident learning: Estimating uncertainty in dataset labels. Journal of Artificial Intelligence Research, 70, 1373–1411.

[7] Zhu, Z., Dong, Z., & Liu, Y. (2022, June). Detecting corrupted labels without training a model to predict. In International Conference on Machine Learning (pp. 27412–27427). PMLR.

[8] Gröger, F., Lionetti, S., Gottfrois, P., Gonzalez-Jimenez, A., Amruthalingam, L., Groh, M., … & Pouly, M. Intrinsic Self-Supervision for Data Quality Audits. In The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track.
