For more details about active learning, we recommend reading Settles, B. 2009. Active Learning Literature Survey.
In many real-world problems, unlabeled data are abundant whereas labeled data are scarce. Label acquisition is usually expensive because it involves human experts, so it is important to train an accurate prediction model with a small number of labeled instances. Active learning aims to reduce the human effort spent annotating examples in a machine learning system by querying only the most valuable instances for their class assignments, and it has been successfully applied to various real tasks.
In a regular active learning process, example querying and model updating are performed iteratively. Here is an example of active learning:
At the beginning, there may be a small number of instances in the labeled training set L. In each iteration, the querying algorithm selects a batch of unlabeled data that are considered valuable. Then the oracle (annotator) provides supervised information for them according to their knowledge. The newly labeled instances are added to the labeled set L, and the model is updated after each query. These steps are repeated until a stopping criterion is met (e.g., a limited number of queries or a limited cost budget).
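The loop described above can be sketched in plain Python. This is only an illustration under stated assumptions, not ALiPy code: the model is a toy nearest-centroid classifier, the query strategy is a simple margin-based uncertainty rule, the oracle is a lookup into the true labels, and every function name (`train_centroids`, `margin`, `active_learning`) is hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def train_centroids(X, labels, labeled):
    """Fit a toy nearest-centroid model using the labeled indexes only."""
    sums, counts = {}, {}
    for i in labeled:
        c = labels[i]
        acc = sums.setdefault(c, [0.0] * len(X[i]))
        for d, v in enumerate(X[i]):
            acc[d] += v
        counts[c] = counts.get(c, 0) + 1
    return {c: [v / counts[c] for v in acc] for c, acc in sums.items()}

def margin(model, x):
    """Gap between the two nearest centroids; a small gap means uncertain."""
    d = sorted(sq_dist(center, x) for center in model.values())
    return d[1] - d[0] if len(d) > 1 else 0.0

def active_learning(X, oracle, labeled, unlabeled, budget, batch_size=1):
    labeled, unlabeled = list(labeled), list(unlabeled)
    labels = {i: oracle[i] for i in labeled}       # labels known so far
    model = train_centroids(X, labels, labeled)
    queried = 0
    while queried < budget and unlabeled:          # stopping criterion
        k = min(batch_size, budget - queried)
        # Query strategy: pick the k most uncertain unlabeled instances.
        batch = sorted(unlabeled, key=lambda i: margin(model, X[i]))[:k]
        for i in batch:
            labels[i] = oracle[i]                  # the oracle provides the label
            unlabeled.remove(i)
            labeled.append(i)                      # add to the labeled set L
        queried += len(batch)
        model = train_centroids(X, labels, labeled)  # update the model
    return model, labeled, unlabeled

# Synthetic two-class data, generated only for this illustration.
random.seed(0)
X = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(20)] + \
    [[random.gauss(5, 1), random.gauss(5, 1)] for _ in range(20)]
y_true = [0] * 20 + [1] * 20
init_labeled = [0, 20]
init_unlabeled = [i for i in range(40) if i not in init_labeled]
model, labeled, unlabeled = active_learning(
    X, y_true, init_labeled, init_unlabeled, budget=6, batch_size=2)
```

With a budget of 6 queries and a batch size of 2, the loop runs three iterations and moves six indexes from the unlabeled set to the labeled set.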
One of the research directions in active learning is the query strategy. The learner selects some unlabeled data according to a specific query strategy and queries their labels from the oracle. In the active learning literature, different strategies evaluate how useful an instance is for improving the model from different aspects.
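As one concrete family of such strategies, uncertainty sampling scores an instance by how unsure the current model is about it. The sketch below shows three common uncertainty measures (least confident, margin, and entropy) computed from predicted class probabilities. The function names and the probability values are assumptions made for illustration and are not part of ALiPy.

```python
import math

def least_confident(probs):
    """1 - P(most likely class); larger means more uncertain."""
    return 1.0 - max(probs)

def margin_score(probs):
    """Gap between the two most likely classes; smaller means more uncertain."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def entropy(probs):
    """Shannon entropy of the prediction; larger means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical predicted class probabilities for three unlabeled instances.
probs = {0: [0.9, 0.1], 1: [0.55, 0.45], 2: [0.7, 0.3]}

# Query the instance whose prediction has the highest entropy.
most_uncertain = max(probs, key=lambda i: entropy(probs[i]))
```

Here all three measures agree that instance 1 is the most uncertain. On multi-class problems they can rank instances differently.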
So how do we evaluate the performance of different algorithms?
Active learning algorithms are generally evaluated by constructing learning curves, which plot the evaluation measure of interest (e.g., accuracy) as a function of the number of new instance queries that are labeled and added to L. Different query strategies select different data for querying and thus produce different learning curves.
Normally, the data partition is repeated several times to evaluate different strategies in an experiment, and the average learning curves are compared to ensure the reliability of the results. Here is an example of learning curves of different query strategies, where the data partition is repeated 10 times:
If the learning curve of one strategy dominates that of another for most or all of the points along the curves, we can conclude that the former query strategy is superior to the other approach (e.g., QBC vs. Random).
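Averaging the curves and checking dominance can be sketched as follows. The accuracy values are made up purely to demonstrate the computation and are not experimental results. The helper names `average_curve` and `dominance_ratio` are hypothetical.

```python
def average_curve(curves):
    """Point-wise mean of several learning curves of equal length."""
    n = len(curves)
    return [sum(values) / n for values in zip(*curves)]

def dominance_ratio(curve_a, curve_b):
    """Fraction of query points where curve_a is strictly above curve_b."""
    wins = sum(a > b for a, b in zip(curve_a, curve_b))
    return wins / len(curve_a)

# Made-up accuracies after each query, one inner list per data partition.
strategy_runs = [[0.5, 0.75, 0.875], [0.5, 0.625, 0.875]]
random_runs = [[0.5, 0.5, 0.75], [0.5, 0.625, 0.75]]

strategy_avg = average_curve(strategy_runs)
random_avg = average_curve(random_runs)
ratio = dominance_ratio(strategy_avg, random_avg)
```

A ratio close to 1 means the first averaged curve lies above the second at almost every query point. In practice more partitions (such as the 10 mentioned above) give a more reliable average.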
In this section, we give some tips for newcomers programming an active learning experiment.
Intermediate results are the useful information produced after each query, such as the selected instances, the model performance, the model parameters, etc. You can use these results to:
- Evaluate the active learning algorithm
- Reproduce any past query for analysis without re-running the whole program
- Re-calculate other types of measurements without re-running the whole program
Saving the intermediate results is especially important when your algorithm
is computationally expensive, e.g., when it takes several days to complete the
whole active learning process, because unexpected situations can always occur:
- Interruption of the power supply
- Minor bugs that cause the program to quit halfway
- Operating mistakes that cause the program to quit halfway
If you do not save the intermediate results, you may have to re-run your whole
program when one of the above accidents happens, even though the interruption
does not affect the results of past queries.
Luckily, ALiPy provides the StateIO tool class to let users achieve this goal easily. You can read more about this in the next tutorial.
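The idea of saving after every query and resuming after an interruption can be illustrated with the standard library alone. The sketch below is not the StateIO API. The file layout, the function names (`save_state`, `load_state`, `run`), and the placeholder values for the selected indexes and the performance are all assumptions made for this example.

```python
import json
import os
import tempfile

def save_state(path, state):
    """Write to a temporary file first, then rename, so that a crash
    in the middle of saving never leaves a half-written state file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)

def load_state(path):
    """Return the saved state, or an empty one if nothing was saved yet."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"queries": []}

def run(path, total_queries, crash_at=None):
    state = load_state(path)
    # Resume right after the last query that was saved.
    for it in range(len(state["queries"]), total_queries):
        if it == crash_at:
            raise RuntimeError("simulated interruption")
        selected = [it * 2]             # placeholder for the queried indexes
        performance = 0.5 + 0.05 * it   # placeholder for the model's score
        state["queries"].append(
            {"iteration": it, "selected": selected, "performance": performance})
        save_state(path, state)         # persist after every query
    return state

path = os.path.join(tempfile.mkdtemp(), "al_state.json")
try:
    run(path, total_queries=5, crash_at=3)   # interrupted before query 3
except RuntimeError:
    pass
resumed = run(path, total_queries=5)         # continues from query 3
```

The second call does not repeat queries 0 to 2. It reads them from the file and only performs the remaining ones.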
In active learning, data querying is repeated until the stopping criterion is reached, which means that instances are continuously transferred from the unlabeled set to the labeled set.
To implement the above process, the most direct way is to split the original dataset into different parts. However, mistakes or unexpected bugs may not be discovered easily this way.
Furthermore, it is memory expensive, since the data matrix is duplicated across different functions, modules, or threads.
Instead of manipulating the data matrix directly, using the indexes of instances is a safer way. For example, if the 3rd instance is selected in this iteration, you can record its index 2 (counting from 0) instead of its feature vector. When the feature matrix is needed, you can get the labeled set by indexing the original data matrix with the index vector. There are some benefits in doing this:
- Memory efficient
- Easy to check validity
- Avoids unnecessary bugs
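The index-based approach can be sketched with plain Python lists. This is a minimal illustration and does not use any ALiPy class. The data values and the helper `move_to_labeled` are hypothetical.

```python
# The original data matrix is created once and never split or copied.
X = [[0.1, 1.2], [0.4, 0.9], [3.1, 2.8], [2.9, 3.3], [0.2, 1.0]]
labeled_idx = [0, 3]
unlabeled_idx = [1, 2, 4]

def move_to_labeled(selected, labeled_idx, unlabeled_idx):
    """Transfer queried indexes to the labeled set, validating them first."""
    if len(set(selected)) != len(selected):
        raise ValueError("duplicate indexes in the selection")
    for i in selected:
        if i not in unlabeled_idx:
            raise ValueError(f"index {i} is not in the unlabeled set")
    for i in selected:
        unlabeled_idx.remove(i)
        labeled_idx.append(i)

# The 3rd instance is selected in this iteration: record its index 2.
move_to_labeled([2], labeled_idx, unlabeled_idx)

# Build the labeled feature matrix only when it is actually needed.
X_labeled = [X[i] for i in labeled_idx]
```

Because only small integer lists change, a wrong selection (for example, querying an instance that is already labeled) is caught immediately by the validity check instead of silently corrupting the data.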
ALiPy provides a tool class to manage the index set with some necessary validity checking.
See how to use alipy.IndexCollection to manage the indexes of your labeled and unlabeled sets.
The active learning runs for different folds of an experiment are usually independent, and can be run in parallel in multiple threads for computational efficiency.
ALiPy provides the aceThreading tools to help you achieve this goal easily. Learn more about ALiPy in the next tutorial.
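Independent of aceThreading, the idea of running folds in parallel can be sketched with Python's standard `concurrent.futures` module. The function `run_fold` is a hypothetical stand-in for one complete active learning run, and its accuracy values are random placeholders, not results.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_fold(fold_id, n_queries=5):
    """Stand-in for one complete, independent active learning run."""
    rng = random.Random(fold_id)   # per-fold generator: no state shared between threads
    accuracy, curve = 0.5, []
    for _ in range(n_queries):
        accuracy = min(1.0, accuracy + rng.uniform(0.0, 0.1))  # placeholder score
        curve.append(accuracy)
    return fold_id, curve

# Each fold runs in its own worker thread; results are keyed by fold id.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(run_fold, range(10)))
```

Since the folds share no mutable state, the results do not depend on the order in which the threads finish. For CPU-bound pure-Python work, standard CPython threads are limited by the global interpreter lock, and `ProcessPoolExecutor` offers the same interface with separate processes.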
Copyright © 2018, alipy developers (BSD 3 License).