Introduction
Machine learning has revolutionized the way we approach problem-solving, and its potential for tackling real-world challenges is undeniable. One of the biggest obstacles, however, is obtaining labeled data to train models: labels are critical for training accurate models, but collecting them is expensive and time-consuming. This is where active learning comes in. By selecting the most informative examples for labeling, it lets models learn more from less labeled data.
Active learning has gained significant interest in recent years because it offers a cost-effective and efficient way to train models with a minimal amount of labeled data. Its main categories are stream-based selective sampling, pool-based sampling, and membership query synthesis, each of which offers unique benefits and suits different situations.
This blog post explains what active learning is, how it is used in practice, its main types, and its applications. We hope to show how active learning can help overcome the challenge of obtaining labeled data and lead to more accurate and efficient machine learning models.

Figure: Illustration of an active learning workflow.
Active Learning Utilization
The most common and straightforward approach applies active learning to an unlabeled dataset in several steps. First, a small subsample of the data is labeled manually, chosen either by random sampling or by a scoring method such as uncertainty sampling or diversity sampling. This labeled data is used to train an initial model and to identify the regions of the input space that need further labeling for improved accuracy. The model then predicts the class of each remaining unlabeled data point, and each point is assigned a priority score based on that prediction, again using a scoring method such as uncertainty sampling or diversity sampling.
Based on these scores, the highest-priority points are sent for human labeling, and a new model is trained on the expanded labeled dataset. As the new model processes the remaining unlabeled points, the prioritization scores are updated, so the labeling strategy sharpens as the model improves. This cycle of labeling, training, and prediction repeats until the model reaches the desired level of accuracy.
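To make the loop concrete, here is a minimal sketch using scikit-learn. The synthetic dataset, the logistic regression model, the least-confidence score, and the batch size of 10 per round are illustrative choices rather than a prescribed recipe.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a dataset; in practice only X is known up front,
# and labels come from human annotators.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=20, replace=False)   # small random seed set
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)

model = LogisticRegression(max_iter=1000)
for _ in range(10):                                    # labeling rounds
    model.fit(X[labeled], y[labeled])                  # train on current labels

    # Score the unlabeled points with least-confidence uncertainty sampling.
    proba = model.predict_proba(X[unlabeled])
    uncertainty = 1.0 - proba.max(axis=1)

    # "Annotate" the 10 most uncertain points (here y plays the oracle).
    query = unlabeled[np.argsort(uncertainty)[-10:]]
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)
```

Each round retrains on everything labeled so far and re-scores the remainder, which is exactly the label-train-predict cycle described above.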
Active Learning Types
The three main types of active learning are stream-based selective sampling, pool-based sampling, and membership query synthesis.
Stream-based selective sampling
Stream-based selective sampling is a type of active learning where the model selects data points for labeling as they arrive in a stream. It is particularly useful for datasets that are too large to be stored in memory. This approach is used in scenarios where data arrives continuously, and the model needs to make predictions in real time. In stream-based selective sampling, the model selects the most informative data points based on the uncertainty of its prediction.
This type of sampling is suited for applications that require immediate model deployment, such as online recommendation systems or fraud detection. This category of active learning is also useful for scenarios where the data distribution changes over time, as it allows the model to adapt to these changes in real time.
For example, in sentiment analysis of social media data, stream-based selective sampling can analyze tweets in real time, selecting for labeling the tweets whose sentiment the model is least certain about.
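As a rough sketch of this setting, the snippet below scores each arriving example once and requests a label only when the model's confidence falls below a threshold, updating the model online via scikit-learn's partial_fit. The synthetic stream, the 0.6 confidence threshold, and the request_label stand-in for a human annotator are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
classes = np.array([0, 1])

# Warm-start the model on a tiny labeled seed set (synthetic data).
X_seed = rng.normal(size=(20, 5))
y_seed = rng.integers(0, 2, size=20)
model = SGDClassifier(loss="log_loss")  # log loss enables predict_proba
model.partial_fit(X_seed, y_seed, classes=classes)

def request_label(x):
    """Hypothetical oracle; in practice this asks a human annotator."""
    return int(x.sum() > 0)

# Process the stream one example at a time.
for _ in range(1000):
    x = rng.normal(size=(1, 5))                # next item from the stream
    confidence = model.predict_proba(x).max()  # model's top-class confidence
    if confidence < 0.6:                       # uncertain: worth a label
        y = request_label(x[0])
        model.partial_fit(x, [y])              # update the model online
```

Because each point is seen once and then discarded, nothing needs to be stored beyond the model itself, which is what makes this approach viable for streams too large to hold in memory.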
Stream-based selective sampling has its own set of challenges, including managing class imbalance and coping with label noise.
Pool-based sampling
Pool-based sampling is a type of active learning where the model selects data points from a pool of unlabeled data. This approach is useful when a large amount of unlabeled data is available and the model must choose the most informative points to label. The selection is guided by criteria such as uncertainty sampling, query-by-committee, and Bayesian active learning: uncertainty sampling picks the points with the highest prediction uncertainty, query-by-committee picks the points on which a committee of models disagrees most, and Bayesian active learning picks the points expected to yield the greatest reduction in model uncertainty.
This sort of active learning can also be used to address class imbalance in the dataset by selecting examples from underrepresented classes. Additionally, it allows domain-specific knowledge to enter the training process by letting human experts select examples with specific features of interest. For example, in image classification, pool-based sampling can be used to select the most informative images for labeling from a large pool of unlabeled images.
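The sketch below illustrates the query-by-committee criterion mentioned above: a small committee of scikit-learn classifiers votes on every point in the pool, and the points with the highest vote entropy (the most disagreement) are queried. The random data, the choice of committee members, and the batch size of 10 are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(50, 10))      # small labeled set (synthetic)
y_labeled = rng.integers(0, 2, size=50)
X_pool = rng.normal(size=(500, 10))        # the unlabeled pool

# A committee of diverse models, each trained on the same labeled data.
committee = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(n_estimators=50, random_state=0),
]
votes = np.stack([m.fit(X_labeled, y_labeled).predict(X_pool)
                  for m in committee])     # shape: (n_models, n_pool)

def vote_entropy(votes):
    """Vote entropy per pool point: higher means more disagreement."""
    entropies = np.zeros(votes.shape[1])
    for c in np.unique(votes):
        frac = (votes == c).mean(axis=0)           # fraction voting for c
        frac = np.clip(frac, 1e-12, 1.0)           # avoid log(0)
        entropies += -frac * np.log(frac)
    return entropies

query_idx = np.argsort(vote_entropy(votes))[-10:]  # 10 most contested points
X_query = X_pool[query_idx]                        # send these for labeling
```

Swapping the scoring function is all it takes to switch criteria; the least-confidence score from the earlier sketch would slot in the same way.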
The challenge with this approach is that the size of the pool can be a limiting factor, and it is difficult to determine the optimal size. If the pool is too small, the model may not be able to learn enough from the selected samples; if it is too large, scoring every candidate at each iteration becomes computationally expensive.
Membership query synthesis
Membership query synthesis is a type of active learning in which the model generates queries, i.e., synthetic examples, for the user to label. This approach is helpful when the model is unsure of the correct labeling in some region of the input space and needs human intervention. The model generates queries based on criteria such as uncertainty sampling or query-by-committee; the user labels them, and the model adds the newly labeled points to its training data.
This method can be particularly useful when the dataset is small or when the cost of labeling is high. It allows the creation of examples that may not exist in the dataset but are informative for the model's training. It can also be used to address class imbalance by generating examples from underrepresented classes. For example, in speech recognition, membership query synthesis can generate queries for the user to label based on the model's uncertainty in transcribing speech.
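Below is a minimal sketch of the idea: the learner synthesizes candidate inputs by interpolating between existing examples, scores them by prediction uncertainty, and sends the most ambiguous ones to a human for labeling. The data, the synthesize_queries generator, and the oracle stand-in are all hypothetical, chosen only to make the loop concrete.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hidden concept the oracle knows

model = LogisticRegression().fit(X, y)

def synthesize_queries(n):
    """Generate new candidates by interpolating random pairs of points."""
    i, j = rng.integers(0, len(X), size=(2, n))
    alpha = rng.random((n, 1))
    return alpha * X[i] + (1 - alpha) * X[j]

def oracle(x):
    """Hypothetical stand-in for a human labeler."""
    return int(x[0] + x[1] > 0)

candidates = synthesize_queries(200)
proba = model.predict_proba(candidates)
uncertainty = 1.0 - proba.max(axis=1)               # least-confidence score
queries = candidates[np.argsort(uncertainty)[-5:]]  # most ambiguous synthetics
new_labels = [oracle(q) for q in queries]           # ask the human to label
```

The key difference from pool-based sampling is that the queried points never existed in the dataset; the learner manufactures them where it is most uncertain.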
An issue with this method is that the quality of the queries generated by the model can be highly dependent on the model architecture and training data.
Conclusion
Active learning has emerged as a promising approach to the challenge of limited labeled data in machine learning. It makes more efficient use of resources by selecting the most informative data points for labeling, cutting the cost of data annotation, and it enhances model performance by letting models learn from the most relevant examples. While challenges remain, ongoing research continues to address them and advance the field. With the growing demand for machine learning, active learning will remain a valuable tool for organizations looking to improve their models.