
Visual Layer’s Enriched Datasets Are Now Available on Hugging Face


Guy Singer · September 20, 2024 · 3 min read

At Visual Layer, we specialize in empowering AI and data teams to seamlessly organize, explore, enrich, and extract valuable insights from massive collections of unstructured image and video data. Our platform includes advanced tools for streamlining data curation pipelines and enhancing model performance, with features such as smart clustering, quality analysis, semantic search, and visual search. One key aspect of our technology is the ability to enrich visual datasets with object tags, captions, bounding boxes, and other quality insights. Today, we're releasing free enriched versions of a selection of well-known academic datasets, adding valuable information to help ML researchers and practitioners improve their models.

Enriched ImageNet

Our ImageNet-1K-VL-Enriched dataset adds a layer of information to the original ImageNet-1K dataset. With enriched captions, bounding boxes, and label issues, it opens up new possibilities for a variety of machine learning tasks, from image retrieval to visual question answering.

Dataset Overview

This enriched version of ImageNet-1K consists of six columns:

  • image_id: The original image filename.
  • image: Image data in PIL Image format.
  • label: Original label from the ImageNet-1K dataset.
  • label_bbox_enriched: Bounding box information, including coordinates, confidence scores, and labels generated using object detection models.
  • caption_enriched: Captions generated using the BLIP2 captioning model.
  • issues: Identified quality issues (e.g., duplicates, mislabels, outliers).

How to Use the Dataset

The enriched ImageNet dataset is available via the Hugging Face Datasets library:

import datasets
ds = datasets.load_dataset("visual-layer/imagenet-1k-vl-enriched")
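
Each record then exposes the enriched columns described above. A minimal sketch of inspecting one record, continuing from the ds object above (the "train" split and index are illustrative):

# Inspect the enriched fields of a single record
sample = ds["train"][0]
print(sample["label"])                # original ImageNet-1K label
print(sample["caption_enriched"])     # BLIP2-generated caption
print(sample["label_bbox_enriched"])  # detector boxes, labels, and confidence scores
print(sample["issues"])               # flagged quality issues, e.g. duplicates or mislabels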

You can also interactively explore this dataset with our visualization platform. Check it out here. It’s free, and no sign-up is required.

Enriching the COCO 2014 Dataset

The COCO-2014-VL-Enriched dataset is our enhanced version of the popular COCO 2014 dataset, now featuring quality insights. This version introduces a new level of dataset curation by identifying and highlighting issues like duplicates, mislabeling, outliers, and suboptimal images (e.g., dark, blurry, or overly bright).

Dataset Overview

The enriched COCO 2014 dataset includes four columns:

  • image_id: The original image filename from the COCO dataset.
  • image: Image data in PIL Image format.
  • label_bbox: Original bounding box annotations, along with enriched information such as confidence scores and labels generated using object detection models.
  • issues: Identified quality issues, such as duplicate, mislabeled, dark, blurry, bright, and outlier images.

How to Use the Dataset

You can easily access this dataset using the Hugging Face Datasets library:

import datasets
ds = datasets.load_dataset("visual-layer/coco-2014-vl-enriched")
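
With the issues column, you can prune flagged images before training. A hedged sketch, continuing from the ds object above and assuming clean records have an empty issues field (inspect a few records to confirm the exact schema):

# Keep only images with no flagged quality issues
clean = ds["train"].filter(lambda rec: not rec["issues"])
print(f"kept {len(clean)} of {len(ds['train'])} images")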

This dataset is also available on our free, no-sign-up-required visualization platform. Check it out here.

Additional Enriched Datasets

In addition to COCO 2014 and ImageNet-1K, we've enriched several other widely used datasets and made them available on Hugging Face.

Explore the Datasets

Visual Layer offers a free, no-sign-up-required platform for interactively visualizing these datasets. You can explore each dataset, identify quality issues, and get hands-on experience with our enrichment capabilities. Take a look here: https://app.visual-layer.com/datasets


Introduction to Image Captioning


Image captioning is the process of using a deep learning model to describe the content of an image. Most captioning architectures use an encoder-decoder framework, where a convolutional neural network (CNN) encodes the visual features of an image and a recurrent neural network (RNN) decodes the features into a descriptive text sequence.
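
To make the encoder-decoder idea concrete, here is a minimal captioning sketch using the Hugging Face transformers pipeline. The ViT-GPT2 checkpoint swaps the classic CNN encoder and RNN decoder for transformer counterparts, but the encode-then-decode structure is the same; the image path is illustrative:

from transformers import pipeline

# ViT encodes the image; GPT-2 decodes the features into a caption
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
result = captioner("photo.jpg")  # illustrative image path or URL
print(result[0]["generated_text"])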

VQA

Visual Question Answering (VQA) is the process of asking a question about the contents of an image and outputting an answer. VQA uses similar architectures to image captioning, except that a text input is also encoded into the same vector space as the image input.
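
A comparable sketch for VQA with transformers; the ViLT checkpoint, image path, and question are illustrative:

from transformers import pipeline

# The question text is encoded alongside the image features
vqa = pipeline("visual-question-answering", model="dandelin/vilt-b32-finetuned-vqa")
answers = vqa(image="photo.jpg", question="How many dogs are in the picture?")
print(answers[0]["answer"], answers[0]["score"])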

Image captioning and VQA are used in a wide array of applications, from image search and accessibility tooling to content moderation.

Why Captioning With fastdup?

Image captioning can be a computationally expensive task, requiring many processor-hours. Recent experiments have shown that the free fastdup tool can be used to reduce dataset size without losing training accuracy. By generating captions and VQA answers with fastdup, you can save expensive compute hours by filtering out duplicate data and unnecessary inputs.


Getting Started With Captioning in fastdup

To start generating captions with fastdup, you’ll first need to install and import fastdup in your computing environment.
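
For example (directory paths are placeholders):

# Install first: pip install fastdup
import fastdup

# Point fastdup at your image folder and run the initial analysis
fd = fastdup.create(work_dir="work_dir", input_dir="images/")
fd.run()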


Processor Selection and Batching

The captioning method in fastdup lets you select either a GPU or CPU for computation and choose your preferred batch size. By default, CPU computation is selected and the batch size is set to 8. For GPUs with high RAM (40 GB), a batch size of 256 enables captioning in under 0.05 seconds per image.

To select a model, processing device, and batch size, use the following syntax. If no parameters are entered, the fd.caption() method defaults to ViT-GPT2, CPU processing, and a batch size of 8.
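
A sketch of that call, continuing from the fd object above (parameter names follow the fastdup docs but may vary by version):

# Defaults: model_name='automatic' (ViT-GPT2), device='cpu', batch_size=8
fd.caption(model_name='automatic', device='gpu', batch_size=256)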


To learn more about fastdup, check out these hands-on tutorials:
  • ⚡ Quickstart: Learn how to install fastdup, load a dataset, and analyze it for potential issues such as duplicates/near-duplicates, broken images, outliers, and dark/bright/blurry images, and view visually similar image clusters. If you're new, start here!
  • 🧹 Clean Image Folder: Learn how to analyze and clean a folder of images from potential issues and export a list of problematic files for further action. If you have an unorganized folder of images, this is a good place to start.
  • 🖼 Analyze Image Classification Dataset: Learn how to load a labeled image classification dataset and analyze it for potential issues. If you have a labeled ImageNet-style folder structure, have a go!
  • 🎁 Analyze Object Detection Dataset: Learn how to load bounding box annotations for object detection and analyze them for potential issues. If you have a COCO-style labeled object detection dataset, give this example a try.