A Game to Understand Active Learning in Machine Learning

This interactive game teaches you about Active Learning, a machine learning technique where the algorithm actively selects the most informative data points for labeling.

Game Setup and Goal

You'll see a 2D plot with points and two important lines:

Purple line: The current model's decision boundary, fitted on the labeled points

Gray dashed line: The ground truth decision boundary

Your goal: Make the purple line match the gray line by labeling as few points as possible.

How to Play

  1. Look at the gray dashed line. This shows where the correct split between red and blue points should be.
  2. The game will highlight a point with a green star.
  3. Label this point red or blue based on which side of the gray line it's on.
  4. Watch the purple line move. It should get closer to the gray line.
  5. Repeat until the purple line closely matches the gray line.
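The loop above is a standard uncertainty-sampling active learning loop. Here is a minimal sketch of it in Python, not the game's actual code: the pool of points, the ground-truth line x + y = 0, and the use of logistic regression for the "purple line" are all assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))        # pool of unlabeled 2D points
truth = (X[:, 0] + X[:, 1] > 0).astype(int)  # "gray dashed line": x + y = 0

# Seed with one point from each class so the model can be fitted.
labeled = [int(np.argmax(truth == 0)), int(np.argmax(truth == 1))]

for step in range(10):
    # Fit the "purple line" on the currently labeled points.
    model = LogisticRegression().fit(X[labeled], truth[labeled])
    # "Green star": the unlabeled point the model is least sure about,
    # i.e. whose predicted probability is closest to 0.5.
    probs = model.predict_proba(X)[:, 1]
    candidates = [i for i in range(len(X)) if i not in labeled]
    star = min(candidates, key=lambda i: abs(probs[i] - 0.5))
    labeled.append(star)  # "label" it by reading off the ground truth

print(model.score(X, truth))  # accuracy of the purple line on the full pool
```

With only a dozen labels, the fitted boundary typically sits close to the true one, because each queried point was chosen where it constrains the boundary most.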

Strategy and Learning Objectives

Understand Uncertainty Sampling

Notice how the game often selects points close to the current decision boundary. These points have the highest uncertainty and provide the most information to the model.

Observe Rapid Improvement

In the early stages, you'll see significant shifts in the decision boundary with each new label. This demonstrates how Active Learning can quickly improve model performance with minimal labeled data.

Experience Diminishing Returns

As you label more points, you'll notice smaller changes in the boundary. This illustrates the efficiency of Active Learning: it front-loads the most impactful data points.

What is Active Learning in Machine Learning?

Active Learning is a subfield of machine learning where the algorithm can interactively query a user (or some other information source) to label new data points.

Unlike traditional supervised learning, where the algorithm learns from a static, pre-labeled dataset, active learning algorithms are designed to be more efficient with their training data.

They aim to achieve high accuracy using as few labeled training instances as possible, thereby minimizing the cost of obtaining labeled data.

The key idea behind active learning is to reduce the labeling effort by selecting the most informative instances for labeling, rather than relying on random sampling or pre-defined sampling strategies.

Strategies to Select the Next Point

There are several strategies used in active learning to select the next point for labeling:

  1. Uncertainty Sampling: This is the most common strategy. The algorithm chooses to label the instance about which it is most uncertain. For example, in a binary classification problem, this might be the instance closest to the decision boundary.
  2. Query by Committee: This approach uses multiple models and chooses the instance where these models disagree the most.
  3. Expected Model Change: This strategy selects the instance that would cause the greatest change to the current model if we knew its label.
  4. Expected Error Reduction: This approach chooses the instance that would most reduce the model's generalization error.
  5. Diversity Sampling: This strategy aims to select a diverse set of instances, avoiding redundancy in the labeled set.
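The first two strategies can be sketched as small selection functions. This is a toy illustration, not a library API; the probability and committee-vote numbers are made up.

```python
import numpy as np

def uncertainty_sampling(proba, pool):
    """Pick the pool index whose predicted P(class=1) is closest to 0.5."""
    return min(pool, key=lambda i: abs(proba[i] - 0.5))

def query_by_committee(committee_votes, pool):
    """Pick the pool index where committee members disagree the most
    (here: the highest variance of the members' 0/1 votes)."""
    disagreement = np.var(committee_votes, axis=0)
    return max(pool, key=lambda i: disagreement[i])

# Toy example: three candidate points.
proba = np.array([0.9, 0.52, 0.1])  # one model's P(class=1) per point
votes = np.array([[1, 1, 0],        # three committee members' votes per point
                  [1, 0, 0],
                  [1, 1, 0]])
print(uncertainty_sampling(proba, [0, 1, 2]))  # point 1 is nearest 0.5
print(query_by_committee(votes, [0, 1, 2]))    # members disagree on point 1
```

Both strategies agree here, which is common: points where a single model is uncertain are often the same points where an ensemble disagrees.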

Limitations

While active learning can be very effective, it also has limitations: because the model itself chooses which points get labeled, the labeled set can become biased toward the model's current (and possibly wrong) view of the data, and diverse or rare regions of the input space may never be queried.

In practice, I like to mix roughly 80% active learning with 20% random sampling of unlabeled points, so the model is still exposed to a diverse set of data while retaining most of the efficiency of active learning.
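That 80/20 mix can be sketched as a single query function. The uncertainty scores below are made up for illustration, and the 0.8 threshold is the fraction described above.

```python
import random

def choose_query(pool, uncertainty, rng, active_fraction=0.8):
    """With probability `active_fraction`, query the most uncertain point;
    otherwise fall back to a uniformly random point for diversity."""
    if rng.random() < active_fraction:
        return max(pool, key=lambda i: uncertainty[i])
    return rng.choice(pool)

rng = random.Random(0)
uncertainty = {0: 0.1, 1: 0.9, 2: 0.4}
picks = [choose_query([0, 1, 2], uncertainty, rng) for _ in range(1000)]
print(picks.count(1) / len(picks))  # fraction of queries hitting the most uncertain point
```

Over many queries, the most uncertain point dominates, but every point keeps a nonzero chance of being labeled, which is exactly the diversity safeguard the 20% random share provides.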

Despite these limitations, active learning remains a powerful tool in many machine learning applications, particularly when labeling costs are high and unlabeled data is plentiful.

Created by Mario Filho with the help of Claude 3.5 Sonnet.