Do you wonder what data annotation is and what it is good for? Do you consider using annotated datasets for your AI and Machine Learning projects? Would you like to learn more about data annotation, the benefits it can bring to your project, and how to integrate it into your machine learning workflow?
Machine learning is increasingly becoming a component of businesses’ everyday offerings and operations, and the performance of these models depends on the quality of the data they work with. This dependence highlights the importance of datasets in machine learning and the methods by which we collect them.
Often, the data we work with already comes with good-quality labels. For example, when predicting stock prices from past values, the price acts both as a target label and an input feature.
However, we do not always have labels in our data or with the required quality. Labels can be noisy, limited, and biased (e.g., user-added tags and categories) or entirely missing (e.g., object detection).
To acquire labels or improve labeling quality, we can conduct data annotation. In data annotation, we label or relabel our data with the help of human annotators supported by annotation tools and algorithms. This makes model training possible, improves data quality, or improves model performance. In other use cases, we can use data annotation to decide about uncertain predictions (with probabilities close to 0.5) or to validate our model.
Data annotation is a widely used practice. Many AI tools and services of big tech companies like Microsoft’s Bing Search or Facebook rely on human-annotated datasets. Producing tools for data annotation is a growing industry forecasted to surpass $13 billion in market size by 2030.
An overview of Facebook’s Human-AI loop (Halo) annotation process (Source)
If you annotate data, you should do it well. Data annotation can be a complex, slow, and expensive process requiring evaluation and quality assessment. If you need to annotate data regularly, you may want to make it an integral component of your machine learning workflow. Fortunately, there are methodologies that you can use to make annotation effective and less error-prone.
In this article, you will learn about data annotation, its types, and how involving humans in the process can benefit your machine learning model. We will also share a few guidelines you can use to think about your annotation project.
Data annotation often happens with the help of human annotators but can also use algorithms or a combination of the two. In this article, we mostly focus on human annotation and highlight options where you can use them both.
What is Data Annotation for AI and Machine Learning?
Data annotation (sometimes called “data labeling”) refers to the active labeling of machine learning model training datasets. Most often, this means adding target labels, but it can also mean adding feature values or metadata. In some contexts, people may also refer to the human validation of model predictions as data annotation, since it requires data (re)labeling by annotators.
An example of data annotation in NLP: Names and phrases labeled based on their meaning (Source)
Depending on the context, people may also refer to this activity as ‘tagging’, ‘categorizing’, or ‘transcribing’. In this context, however, all these terms mean that the annotation extends the data with information used in the modeling process.
The following are the main use cases of data annotation:
- Generate labels: There are cases when annotation is the only way to record target labels or features. For example, training a model classifying cats and dogs requires an image dataset containing explicit ‘cat’ and ‘dog’ labels. We need annotators to label these samples.
- Generate features: Annotated data can highlight relationships in our model that it would not recognize automatically from noisy real-world data.
- Improve label quality: Relabel noisy, limited, inaccurate, or biased labels.
- Validate model performance: Compare model generated and human-annotated labels as part of a Human-in-the-Loop machine learning approach and review uncertain predictions.
- Convert unsupervised into supervised: Transform an unsupervised or one-class supervised problem into a supervised one (e.g., in anomaly detection).
What is Human-Annotated Data in Machine Learning?
Human-annotated data is data for which humans are the primary source of annotations.
Humans can recognize and understand things machine learning models cannot, at least for now. It is not always clear what these things are, as there is a great diversity of models, humans, and business problems. Here are a few things humans might recognize better than models within a specific context:
- Subjectivity and intent
- Uncertainty, ambiguous concepts, and irregular categories
- The context relevant to the business problem and whether a data point is ‘meaningful’ within that context
Beyond these strengths, validating model predictions by humans can increase trust in our data and modeling process: models are often opaque, while humans can recognize ‘unrealistic’ predictions and link results to their context.
Compliance with regulation may also require the involvement of a human validator in the machine learning workflow.
How and at which step you rely on human or automatic annotation is a problem-specific question.
In semi-automated annotation approaches, you combine machine learning techniques and manual labeling. For example, you can use models to reduce data annotation time, or you can interactively select samples for annotation based on classification confidence.
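A minimal sketch of the confidence-based variant of this idea: a model's predicted probabilities route each sample either to an auto-label queue or to human annotators. The `predict_proba` callable and the 0.9 threshold are illustrative assumptions, not recommendations.

```python
# Sketch of confidence-based routing for semi-automated annotation.
# `predict_proba` stands in for any trained classifier that returns
# class probabilities; the 0.9 threshold is an illustrative choice.

def route_samples(samples, predict_proba, threshold=0.9):
    """Split samples into auto-labeled and human-annotation queues."""
    auto_labeled, needs_human = [], []
    for sample in samples:
        probs = predict_proba(sample)          # e.g. {"cat": 0.95, "dog": 0.05}
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence >= threshold:
            auto_labeled.append((sample, label))   # trust the model's label
        else:
            needs_human.append(sample)             # send to annotators
    return auto_labeled, needs_human

# Toy usage with a fake binary model:
fake_model = lambda s: {"cat": s, "dog": 1 - s}
auto, human = route_samples([0.99, 0.55, 0.05], fake_model)
# auto  → [(0.99, 'cat'), (0.05, 'dog')]
# human → [0.55]
```

In practice, the threshold trades annotation cost against label quality, and it is worth tuning it on a validation set.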
Types of Data Annotation
We can distinguish between different types of data annotation problems and methods based on the type of data annotated or whether data annotation is internal or external to the organization.
Data Annotation Categories Based on Data Type
The data type-based categorization is relatively straightforward, as the categories follow common data formats used in machine learning: text, image, video, and audio.
These data types represent data formats humans perceive relatively directly. It is less common to employ humans to annotate tabular, network, or time-series data as human annotators usually have fewer advantages in these areas.
The different data formats require different annotation methods. For example, to produce quality computer vision datasets, you can choose between different types of image annotation techniques.
Annotation projects tend to work along the following stages regardless of the data format they use:
- Recognize entities within the data and distinguish them from each other.
- Identify the elements’ metadata properties. In some cases, this is optional, as the main task is to identify the entities themselves, like in object recognition on images.
- Store the element’s metadata properties in a specific form.
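To make the third stage concrete, here is a sketch of storing image annotations in a COCO-style JSON structure, a widely used format for object detection datasets. The file name, categories, and box coordinates are made-up illustrative values.

```python
import json

# A minimal COCO-style annotation record for one image: the entities are
# bounding boxes, the metadata is the object category. All field values
# below are illustrative.
annotation = {
    "images": [{"id": 1, "file_name": "street.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "car"}, {"id": 2, "name": "pedestrian"}],
    "annotations": [
        {"id": 1, "image_id": 1, "category_id": 1, "bbox": [34, 120, 200, 90]},   # [x, y, w, h]
        {"id": 2, "image_id": 1, "category_id": 2, "bbox": [410, 200, 40, 110]},
    ],
}

# Persist the annotations so training pipelines can load them later.
with open("annotations.json", "w") as f:
    json.dump(annotation, f, indent=2)
```

Keeping entities (`annotations`) separate from their containers (`images`) and their metadata vocabulary (`categories`) makes the format easy to extend and validate.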
The following table summarizes the annotation data types with entities and metadata they work with.
| Annotation Data Type | Entities | Metadata | Examples |
|---|---|---|---|
| Text | keywords, phrases, sentences, paragraphs, sections, named entities, queries, documents | topic, semantic role, sentiment, intent, grammatical role, relationship to other entities | product categorization, content and audience tagging, content moderation |
| Image | object boundaries | object category, semantic segmentation, relationship with surroundings | facial recognition, medical diagnostics |
| Video | frames, video segments, moving objects | object category, object behavior, event | autonomous cars, robotic vision |
| Audio | parts of speech, audio segments, audio sources, background noise | pronunciation, intonation, sonic entity, dialect, demographics, sentiment | speech recognition, call processing systems |
Internal and External Data Annotation
Another categorization relies on whether annotation happens inside or outside of an organization.
In the internal case, data annotation can be part of an internal training data creation or model validation. Most of the resources we discuss here describe such situations.
There are cases where specific components of the data annotation workflow are ‘internal’ and others are ‘external’. For example, we can hire external annotators, who will work within our internal annotation workflow.
Guidelines for Data Annotation in Machine Learning
To do data annotation well, you need to consider it as part of your machine learning workflow and build it out as a combination of annotators, algorithms, and software components.
Two big questions of your data annotation project are how to use your limited annotation resources effectively and how to assess the quality of your annotations.
There are different techniques to address these issues. In this section, we will discuss the two:
- Active learning: Ways of sampling data for annotation
- Quality assessment: Validating annotation performance
Active Learning: Sampling Data for Annotation
Active learning is a family of methods for selecting which data samples to annotate.
When you combine human annotation with machine learning models, a critical issue you need to decide about is what part of your data to annotate by human annotators. You have limited time and finances to spend on data annotation, so you need to be selective.
Different types of active learning can help you select only the relevant samples for annotation and save time and costs. Here are three popular ones:
- Random sampling
- Uncertainty sampling
- Diversity sampling
Random sampling is the simplest type of active learning. It can act as a good baseline against which you can compare the other strategies.
However, obtaining a truly random sample is not always easy given how the data arrives, and random sampling can overlook issues that the other methods actively look for.
In uncertainty sampling, you select unlabeled samples nearest to the model’s decision boundary.
The value of this method is that these samples have the highest chance of being wrongly classified, so manually annotating them can correct their possible errors.
A possible issue with uncertainty sampling is that the selected samples might come from the same region of the problem space and concentrate on only one side of the decision boundary.
Uncertainty sampling is most useful with models that effectively estimate their prediction uncertainty. With models that do not (e.g., deep neural networks), we can apply separate confidence estimation methods to improve the uncertainty estimates, such as those offered by monitoring and model validation tools.
Uncertainty sampling in active learning (Source)
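A minimal sketch of least-confidence uncertainty sampling for a binary classifier: rank unlabeled samples by how close the model's top-class probability is to 0.5, and send the k least confident ones to annotators. The sample names and probabilities are made up for illustration.

```python
# Least-confidence uncertainty sampling: pick the k unlabeled samples
# whose top-class probability is lowest, i.e. closest to the decision
# boundary. `probs` are illustrative binary-classification outputs.

def uncertainty_sample(unlabeled, probs, k):
    """Return the k samples whose top-class probability is lowest."""
    scored = sorted(zip(unlabeled, probs), key=lambda sp: max(sp[1], 1 - sp[1]))
    return [sample for sample, _ in scored[:k]]

pool = ["a", "b", "c", "d"]
p_positive = [0.51, 0.97, 0.48, 0.10]   # model P(class = 1) for each sample
print(uncertainty_sample(pool, p_positive, 2))   # → ['a', 'c']
```

For multi-class problems, the same idea generalizes to margin sampling (difference between the top two class probabilities) or entropy-based scoring.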
Diversity sampling entails annotating samples whose feature values are underrepresented or even absent in your model training data. Related techniques include anomaly or outlier detection, representative sampling, and stratified sampling.
The main benefit of this tool is teaching your model to consider information it might otherwise ignore because of its low occurrence in the training dataset.
We can use diversity sampling to prevent performance loss due to data drift. Data drift occurs when our model starts receiving data with a high proportion of samples from regions it previously predicted poorly. We can identify and annotate such underrepresented sample regions with diversity sampling, improving predictive power on them and limiting the drift’s effect.
Diversity sampling in active learning (Source)
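One simple way to realize diversity sampling is to pick the unlabeled points farthest from any already-labeled point, i.e., the regions the training data covers worst. The sketch below uses 1-D values and absolute distance purely for illustration; real features would be vectors with an appropriate metric.

```python
# Diversity/outlier sampling sketch: select the unlabeled points whose
# distance to their nearest labeled point is largest. 1-D values keep
# the example simple; all numbers are illustrative.

def diversity_sample(labeled, unlabeled, k):
    """Return the k unlabeled points farthest from any labeled point."""
    def nearest_dist(x):
        return min(abs(x - y) for y in labeled)
    return sorted(unlabeled, key=nearest_dist, reverse=True)[:k]

labeled_values = [0.0, 1.0, 2.0]          # regions the training set covers
pool = [0.1, 0.9, 5.0, 2.4]               # candidate samples to annotate
print(diversity_sample(labeled_values, pool, 2))   # → [5.0, 2.4]
```

Clustering the unlabeled pool and sampling from clusters with little training coverage is a common, more scalable variant of the same idea.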
How to Assess the Quality of Your Data Annotation Project
Your annotators can make mistakes, and you need to introduce checks and validation points to catch them systematically.
Here are a few aspects that can help you improve your annotation performance:
- Expertise: Experienced annotators and subject experts can provide high-quality information and do final reviews.
- Teams: Sometimes, more than one human is needed to increase annotation accuracy and reach a ‘consensus’ about relevancy.
- Diversification: Insights from team members with different backgrounds, skills, and levels of expertise can complement each other well and prevent systematic bias.
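When several annotators label the same item, a simple way to reach a ‘consensus’ is majority voting, with the vote share serving as a rough agreement score. This is a minimal sketch with made-up labels; real projects often use chance-corrected metrics such as Cohen’s kappa instead of raw agreement.

```python
from collections import Counter

# Consensus by majority vote across redundant annotations, plus a
# per-item agreement rate. Labels below are illustrative.

def consensus(annotations):
    """annotations: list of label lists, one inner list per annotated item."""
    results = []
    for labels in annotations:
        (label, votes), = Counter(labels).most_common(1)
        results.append((label, votes / len(labels)))  # (winner, agreement share)
    return results

votes = [["cat", "cat", "dog"], ["dog", "dog", "dog"]]
print(consensus(votes))
```

Items with a low agreement share are natural candidates for review by an expert annotator.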
Bloomberg’s Global Data department collected best practices for managing data annotation projects where they distinguish four main ways to assess data annotation quality. The following table summarizes these quality assessment methods with their benefits and constraints.
| Method | Description | Benefits | Constraints |
|---|---|---|---|
| “Gold” task | prepare work items to compare directly with annotation “answer keys” | quick feedback with measurable results | applicable only for “objective” answer types, requires preparation work |
| Annotation Redundancy with Targeted QA | run multiple annotations and conduct QA on disagreeing results | no need for preparation, highlights anomalies | longer feedback loops, increased annotation time |
| Annotation Redundancy with Debrief | run multiple annotations and discuss how annotators applied guidelines | no need for preparation, can assess subjective data with a wide range of possible answers | longer feedback loops, increased annotation time, debrief takes a long time |
| Random QA | sample randomly for quality assessment | allows reviewing large amounts of annotations, does not require preparation or follow-up discussions | does not prioritize likely errors |
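The “gold” task method above boils down to scoring each annotator against a small pre-annotated answer key. A minimal sketch, with hypothetical item names and labels:

```python
# "Gold" task quality check sketch: score an annotator against an
# answer key of pre-annotated items. Item IDs and labels are made up.

def gold_accuracy(answer_key, annotator_answers):
    """Fraction of gold items the annotator labeled like the answer key."""
    correct = sum(
        1 for item, gold in answer_key.items()
        if annotator_answers.get(item) == gold
    )
    return correct / len(answer_key)

answer_key = {"img_01": "cat", "img_02": "dog", "img_03": "cat"}
annotator = {"img_01": "cat", "img_02": "dog", "img_03": "dog"}
print(gold_accuracy(answer_key, annotator))
```

Gold items work best when mixed invisibly into the regular annotation queue, so annotators treat them like any other work item.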
Integrate Data Annotation Into Your Model Validation Solution
In this article, you have learned what data annotation is and how it can benefit your machine learning model.
Annotation can provide you with a labeled dataset, improve your data quality, or validate your model. It can also help your machine learning model fight bias and learn about relationships it could not infer from the available data alone.
Even if you have well-trained models, you have to look out for data drift and concept drift affecting their performance. You can use data annotation to do that by rechecking your models as part of your continuous model validation process.
Integrating your annotation processes with a wider MLOps solution like Deepchecks is especially useful, because annotation itself benefits from following best practices and quality assessment (e.g., drift detection and improved uncertainty estimation).