
The Importance of Annotated Datasets for AI and Machine Learning

Do you wonder what data annotation is and what it is good for? Do you consider using annotated datasets for your AI and Machine Learning projects? Would you like to learn more about data annotation, the benefits it can bring to your project, and how to integrate it into your Machine Learning workflow?

Machine Learning increasingly becomes a key component in businesses’ everyday offerings and operations, and the performance of these models depends on the quality of data they work with. This dependence highlights the importance of datasets in Machine Learning and the methods by which we collect them.

Often, the data we work with already comes with good quality labels. For example, when predicting stock prices from past values, price acts both as a target label and an input feature.

However, we do not always have labels in our data or with the required quality. Labels can be noisy, limited, and biased (e.g., user-added tags and categories) or entirely missing (e.g., object detection).

To acquire labels or improve labeling quality, we can conduct data annotation. In data annotation, we label or relabel our data with the help of human annotators supported by annotation tools and algorithms. This makes model training possible, improves data quality, or improves model performance. In other use cases, we can use data annotation to decide about uncertain predictions (with probabilities close to the decision threshold) or to validate our model.

Data annotation is a widely used practice. Many AI tools and services of big tech companies like Microsoft’s Bing Search or Facebook rely on human-annotated datasets. Producing tools for data annotation is a growing industry forecasted to surpass $13 billion in market size by 2030.

An overview of Facebook’s Human-AI loop (Halo) annotation process (Source)

If you annotate data, you should do it well. Data annotation can be a complex, slow, and expensive process requiring evaluation and quality assessment. If you need to annotate data regularly, you may want to make it an integral component of your Machine Learning workflow. Fortunately, there are methodologies that you can use to make annotation effective and less error-prone.

In this article, you will learn about data annotation, its types, and how involving humans in the process can benefit your Machine Learning model. We will also share a few guidelines you can use to think about your annotation project.

Data annotation often enlists the help of human annotators but can also use algorithms or a combination of the two. In this article, we will focus on human annotation and highlight options where you can use them both.


Data Annotation for AI and Machine Learning

Data Annotation (sometimes called “Data Labeling”) refers to the active labeling of Machine Learning model training datasets. This often means adding target labels but can also stand for adding feature values or metadata. In some contexts, people may also refer to the validation of model predictions by humans as data annotation as it requires data (re)labeling by annotators.

An example of data annotation in NLP: Names and phrases labeled based on their meaning (Source)

Depending on the context, people also refer to this activity as “tagging,” “categorizing,” or “transcribing.” However, in this context, all these terms mean that the annotation extends the data with information used in the modeling process.

Below are the main use-cases of data annotation:

  • Generate labels. There are cases when annotation is the only way to record target labels or features. For example, training a model classifying cats and dogs requires an image dataset containing explicit “cat” and “dog” labels. We need annotators to label these samples.
  • Generate features. Annotated data can highlight relationships in our model that it would not recognize automatically from noisy real-world data.
  • Improve label quality. Relabel noisy, limited, inaccurate, or biased labels.
  • Validate model performance. Compare model generated and human-annotated labels as part of a Human-in-the-Loop Machine Learning approach and review uncertain predictions.
  • Convert unsupervised to supervised. Transform unsupervised or one-class supervised problems into supervised ones (e.g., anomaly detection).

Human-annotated Data in Machine Learning

Human-annotated data is data whose annotations come primarily from humans.

Humans can recognize and understand things Machine Learning models cannot. It is not always clear what these things are since there is a great diversity of models, humans, and business problems. Here are a few things humans might recognize better than models within a specific context:

  • Subjectivity and intent;
  • Uncertainty, ambiguous concepts, and irregular categories;
  • Contexts relevant to the business problem and whether a data point is “meaningful” within that context;
  • “Unrealistic” predictions. Models are often opaque; human validators can spot implausible predictions, link results to their context, and thereby increase trust in the data and modeling process.

Compliance with regulation may also require the involvement of a human validator in the Machine Learning workflow.

How and at which step you rely on human or automatic annotation is a problem-specific question.

In semi-automated annotation approaches, you combine Machine Learning techniques and manual labeling approaches. For example, you can use models to reduce data annotation time. Or, you can interactively propagate samples for annotation based on classification confidence.
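As a concrete illustration of this idea, the sketch below routes samples between a model and human annotators based on prediction confidence. The function name `route_for_annotation`, the threshold value, and the sample ids are all illustrative, not a prescribed API.

```python
# Hypothetical semi-automated labeling loop: the model's label is accepted
# for high-confidence samples, while low-confidence samples are queued for
# human annotators.

def route_for_annotation(probabilities, threshold=0.8):
    """Split sample ids into auto-labeled and human-review queues.

    `probabilities` maps a sample id to the model's top-class probability.
    """
    auto_labeled, needs_human = [], []
    for sample_id, confidence in probabilities.items():
        if confidence >= threshold:
            auto_labeled.append(sample_id)   # accept the model's label
        else:
            needs_human.append(sample_id)    # queue for manual annotation
    return auto_labeled, needs_human

# Example: only the two uncertain samples reach the annotators.
scores = {"img_1": 0.97, "img_2": 0.55, "img_3": 0.91, "img_4": 0.62}
auto, manual = route_for_annotation(scores)
```

In practice, the threshold is a tuning knob: raising it sends more work to humans but reduces the risk of accepting wrong machine labels.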

Types of Data Annotation

We can distinguish between different types of data annotation problems and methods based on the type of data annotated or whether data annotation is internal or external to the organization.

Data Annotation Categories According to Data Type

Data type-based categorization is relatively straightforward since they follow common data types used in Machine Learning:

  • Text
  • Image
  • Video
  • Audio

These data types represent data formats humans perceive relatively directly. It is less common to employ humans to annotate tabular, network, or time-series data because human annotators usually have fewer advantages in these areas.

The different data formats require different annotation methods. For example, to produce quality computer vision datasets, you can choose between different types of image annotation techniques.

Annotation projects tend to work along the following stages regardless of the data format they use:

  1. Recognize entities within the data and distinguish them from each other.
  2. Identify the elements’ metadata properties. In some cases, this is optional, as the main task is to identify the entities themselves, like in object recognition on images.
  3. Store the element’s metadata properties in a specific form.
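The three stages above can be sketched as a single structured record: an entity, its location, and its metadata, serialized to a storable form. The schema below is illustrative, not a standard annotation format.

```python
# One possible way to persist an annotation: a small record holding the
# recognized entity (stage 1), its metadata properties (stage 2), and a
# JSON serialization as the specific storage form (stage 3).
import json
from dataclasses import dataclass, asdict, field

@dataclass
class Annotation:
    entity: str                                    # e.g. a phrase in a text
    span: tuple                                    # location within the sample
    metadata: dict = field(default_factory=dict)   # e.g. category, sentiment

ann = Annotation(entity="Acme Corp", span=(17, 26),
                 metadata={"type": "ORG", "sentiment": "neutral"})
record = json.dumps(asdict(ann))                   # stage 3: persist
```

Keeping entity, location, and metadata in one record makes it straightforward to convert annotations into model training labels later.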

The following table summarizes the annotation data types with entities and metadata they work with.

| Annotation Data Type | Entities | Metadata | Examples |
|---|---|---|---|
| Text | keywords, phrases, sentences, paragraphs, sections, named entities, queries, documents | topic, semantic role, sentiment, intent, grammatical role, relationship to other entities | product categorization, content and audience tagging, content moderation |
| Image | object boundaries | object category, semantic segmentation, relationship with surroundings | facial recognition, medical diagnostics |
| Video | frames, video segments, moving objects | object category, object behavior, event | autonomous cars, robotic vision |
| Audio | parts of speech, audio segments, audio sources, background noise | pronunciation, intonation, sonic entity, dialect, demographics, sentiment | speech recognition, call processing systems |

Internal and External Data Annotation

Another categorization relies on whether annotation happens inside or outside of an organization.

In the internal case, data annotation can be part of an internal training data creation or model validation. Most of the resources we discuss here describe such situations.

In the external case, an organization uses external resources to label its data. There are different sources to do this, like annotation contests, or professional annotation services.

There are cases where specific components of the data annotation workflow are ‘internal’ and others are ‘external’. For example, we can hire external annotators, who will work within our internal annotation workflow.

Guidelines for Data Annotation in Machine Learning

To do data annotation well, you need to consider it as part of your Machine Learning workflow and build it out as a combination of annotators, algorithms, and software components.

Two big questions of your data annotation project are how to use your limited annotation resources effectively and how to assess the quality of your annotations.

There are different techniques to address these issues. In this section, we will discuss two of them:

  • Active learning. Ways of sampling data for annotation
  • Quality assessment. Validating annotation performance

Active Learning: Sampling Data for Annotation

Active Learning is a family of methods for selecting which data samples to annotate next.

When you combine human annotation with Machine Learning models, it is critical to decide which part of your data needs to be annotated by humans. Your time and finances are limited, so you need to be selective.

Different types of active learning can help you select only the relevant samples for annotation and save time and costs. Here are three popular ones:

  • Random Sampling
  • Uncertainty Sampling
  • Diversity Sampling

Random Sampling

Random sampling is the simplest type of active learning. It can act as a good baseline against which you can compare the other strategies.

However, obtaining a truly random sample is not always easy because of how the data arrives, and random sampling can overlook issues that other methods actively look for.
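A random-sampling baseline can be as simple as the sketch below: draw a fixed-size batch of unlabeled ids with a seeded generator so the selection is reproducible across runs. The function name and budget are illustrative.

```python
# Random sampling as an annotation baseline: draw `budget` distinct ids
# from the unlabeled pool with a seeded RNG for reproducibility.
import random

def random_sample(unlabeled_ids, budget, seed=42):
    rng = random.Random(seed)
    pool = list(unlabeled_ids)
    return rng.sample(pool, k=min(budget, len(pool)))

batch = random_sample(range(1000), budget=10)
```

Because the seed is fixed, rerunning the selection yields the same batch, which makes comparisons against smarter strategies fair.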

Uncertainty Sampling

In uncertainty sampling, you select unlabeled samples nearest to the model’s decision boundary.

The value of this method is that these samples have the highest chance of being wrongly classified, so manually annotating them can correct their possible errors.

A possible issue with uncertainty sampling is that labels selected by it might belong to the same problem space and concentrate only on one specific side of the decision boundary.

Uncertainty sampling is more useful with models that effectively estimate their prediction uncertainty. With other types (e.g., Deep Neural Networks) we can use other confidence estimation methods to improve uncertainty estimation, such as ones offered by monitoring and model validation tools.

Uncertainty sampling in active learning (Source)
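One common way to implement uncertainty sampling is the smallest-margin criterion sketched below: rank samples by the gap between their two most likely classes and send the smallest-margin (most uncertain) ones to annotators. `probs` is assumed to come from any model exposing per-class probabilities; the names are illustrative.

```python
# Smallest-margin uncertainty sampling: a margin near 0 means the model is
# torn between its top two classes, i.e. the sample sits near the decision
# boundary and is a good candidate for human annotation.

def smallest_margin(probs, budget):
    """Return ids of the `budget` samples with the smallest margin
    between the two most likely classes."""
    def margin(p):
        top_two = sorted(p, reverse=True)[:2]
        return top_two[0] - top_two[1]
    ranked = sorted(probs, key=lambda sid: margin(probs[sid]))
    return ranked[:budget]

probs = {
    "a": [0.98, 0.01, 0.01],   # confident prediction
    "b": [0.51, 0.48, 0.01],   # near the decision boundary
    "c": [0.40, 0.35, 0.25],   # uncertain
}
to_annotate = smallest_margin(probs, budget=2)
```

Other uncertainty criteria (least confidence, entropy) follow the same pattern with a different scoring function.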

Diversity Sampling

Diversity sampling entails annotating samples whose feature values are underrepresented or even absent in your model training data. Other names for this tool are “Anomaly or Outlier Detection,” “Representative Sampling,” or “Stratified Sampling.”

The main benefit of this tool is teaching your model to consider information it might otherwise ignore because of its low occurrence in the training dataset.

We can use diversity sampling to prevent performance loss due to data drift. Data drift occurs when our model starts receiving data with a high proportion of samples from regions it previously saw rarely and predicted poorly. We can identify and annotate such underrepresented sample regions with diversity sampling, improving predictive power on them and limiting the drift’s effect.

Diversity sampling in active learning (Source)
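A simple way to realize diversity sampling is greedy farthest-point selection, sketched below: repeatedly pick the unlabeled point farthest from everything already selected, so underrepresented regions of feature space get annotated first. The toy 2-D features and names are illustrative; a real project would typically run this on model embeddings.

```python
# Greedy farthest-point diversity sampling over toy 2-D features.

def farthest_point_sample(features, labeled_ids, budget):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    selected = list(labeled_ids)
    candidates = [i for i in features if i not in selected]
    picks = []
    for _ in range(budget):
        # Pick the candidate whose nearest selected point is farthest away.
        best = max(candidates,
                   key=lambda i: min(dist(features[i], features[j])
                                     for j in selected))
        picks.append(best)
        selected.append(best)
        candidates.remove(best)
    return picks

features = {"a": (0, 0), "b": (0.1, 0.1), "c": (6, 6), "d": (5, 5)}
picks = farthest_point_sample(features, labeled_ids=["a"], budget=2)
```

Note how the near-duplicate of an already-labeled point ("b") is skipped in favor of the distant cluster, which is exactly the behavior diversity sampling aims for.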

How to Assess the Quality of Your Data Annotation Project

Your annotators can make mistakes and you need to introduce checks and validation points to catch them systematically.

Here are a few aspects that can help you improve your annotation performance:

  • Expertise. Experienced annotators and subject experts can provide high-quality information and do final reviews.
  • Teams. Sometimes, more than one human is needed to increase annotation accuracy and reach a “consensus” about relevancy.
  • Diversification. Insights from team members with different backgrounds, skills, and levels of expertise that can complement each other well and prevent systematic bias.
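The “consensus” idea above can be sketched as a majority vote over redundant annotations, escalating samples where annotators disagree too much. This is a simplified illustration; production teams often use chance-corrected agreement scores such as Cohen’s or Fleiss’ kappa instead of raw agreement.

```python
# Majority-vote consensus over redundant annotations: accept the majority
# label per sample and flag samples whose agreement falls below a threshold.
from collections import Counter

def consensus(labels_per_sample, min_agreement=0.66):
    """labels_per_sample: {sample_id: [label from each annotator]}"""
    results, needs_review = {}, []
    for sid, labels in labels_per_sample.items():
        label, votes = Counter(labels).most_common(1)[0]
        results[sid] = label
        if votes / len(labels) < min_agreement:
            needs_review.append(sid)   # no clear consensus, escalate
    return results, needs_review

votes = {"s1": ["cat", "cat", "dog"], "s2": ["cat", "dog", "bird"]}
labels, review = consensus(votes)
```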

Bloomberg’s Global Data department collected best practices for managing data annotation projects where they distinguish four main ways to assess data annotation quality. This table summarizes these quality assessment methods with their benefits and constraints:

| Method | Description | Benefits | Constraints |
|---|---|---|---|
| “Gold” task | Prepare work items to compare directly with annotation “answer keys” | Quick feedback with measurable results | Applicable only for “objective” answer types, requires preparation work |
| Annotation Redundancy with Targeted QA | Run multiple annotations and conduct QA on disagreeing results | No need for preparation, highlights anomalies | Longer feedback loops, increased annotation time |
| Annotation Redundancy with Debrief | Run multiple annotations and discuss how annotators applied guidelines | No need for preparation, can assess subjective data with a wide range of possible answers | Longer feedback loops, increased annotation time, debrief takes a long time |
| Random QA | Sample randomly for quality assessment | Allows reviewing large amounts of annotations, does not require preparation or follow-up discussions | Does not prioritize likely errors |
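The “gold task” method from the table can be sketched as scoring each annotator against a prepared answer key and flagging anyone below an accuracy threshold. The field names and threshold below are illustrative.

```python
# Gold-task QA: compare each annotator's answers on prepared gold items
# against the answer key, and flag annotators below a minimum accuracy.

def score_against_gold(submissions, answer_key, min_accuracy=0.8):
    """submissions: {annotator: {sample_id: label}}"""
    scores, flagged = {}, []
    for annotator, answers in submissions.items():
        correct = sum(1 for sid, label in answers.items()
                      if answer_key.get(sid) == label)
        accuracy = correct / len(answer_key)
        scores[annotator] = accuracy
        if accuracy < min_accuracy:
            flagged.append(annotator)
    return scores, flagged

gold = {"g1": "spam", "g2": "ham", "g3": "spam"}
subs = {"ann_a": {"g1": "spam", "g2": "ham", "g3": "spam"},
        "ann_b": {"g1": "spam", "g2": "spam", "g3": "ham"}}
scores, flagged = score_against_gold(subs, gold)
```

As the table notes, this only works for objective answer types, since the answer key must be unambiguous.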

Integrate Data Annotation Into Your Model Validation Solution

In this article, you have learned what data annotation is and how it can benefit your Machine Learning model.

Annotation can provide you with a labeled dataset, improve your data quality, or validate your model. It can also help your Machine Learning model fight bias and learn relationships it could not otherwise extract from the available data.

Even if you have well-trained models, you have to look out for data drift and concept drift affecting their performance. You can use data annotation to do that by rechecking your models as part of your continuous model validation process.

Integrating your annotation processes with a wider MLOps solution like Deepchecks is especially useful because annotation itself benefits from following best practices and quality assessment (e.g., drift detection and improved uncertainty estimation).

Check out our blog to learn how to automate your Machine Learning model validation, testing, and monitoring, or click here to see how Deepchecks can help you with that.
