In this tutorial, we will present a simple example to customize your active learning experiment with the tools in alipy.
Since some users have less experience in implementing experiments, we will first introduce the unified framework of an active learning experiment and then the corresponding tools in alipy. A full introduction to each class can be found in the advanced guideline.
As illustrated in the following figure, under normal conditions the feature matrix X and the corresponding label matrix y are needed for subsequent operations.
However, if it is not easy to obtain an explicit feature matrix (e.g., for an image dataset), you can still implement your experiment in alipy, because alipy only operates on the indexes of the instances.
Secondly, you should split your data into training and testing sets for the experiment. The data partition should be repeated randomly several times. In active learning, you should further split your training set into an initially labeled set and an unlabeled pool for querying. Note that the initially labeled set is usually small in most active learning settings.
Then, you can start the querying process for each fold of experiment and record their results. In each querying iteration, a subset of unlabeled data will be queried and added to the labeled set; after that, the model will be re-trained based on the updated labeled set and tested to evaluate the query.
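The querying process of one fold can be sketched in plain Python as follows. This is only an illustration of the loop described above, not alipy code: `run_fold`, `toy_select` and `toy_evaluate` are hypothetical names, and the "performance" here is a toy value rather than a real test score.

```python
def run_fold(label_ind, unlab_ind, select, evaluate, num_queries):
    """Run one fold of pool-based active learning on lists of indexes."""
    labeled = list(label_ind)      # copy so the caller's lists stay unchanged
    unlabeled = list(unlab_ind)
    curve = []
    for _ in range(num_queries):
        if not unlabeled:          # nothing left to query
            break
        picked = select(labeled, unlabeled)
        for idx in picked:
            unlabeled.remove(idx)  # move the queried index ...
            labeled.append(idx)    # ... into the labeled set
        # Re-training and testing would happen inside `evaluate`
        curve.append(evaluate(labeled))
    return labeled, unlabeled, curve


# Toy stand-ins (hypothetical): query the smallest unlabeled index and report
# the labeled fraction of a 10-instance training set as the "performance".
def toy_select(labeled, unlabeled):
    return [min(unlabeled)]


def toy_evaluate(labeled):
    return len(labeled) / 10


labeled, unlabeled, curve = run_fold([0, 1], [2, 3, 4, 5], toy_select, toy_evaluate, 3)
print(labeled)    # [0, 1, 2, 3, 4]
print(unlabeled)  # [5]
print(curve)      # [0.3, 0.4, 0.5]
```

In a real experiment, `select` would be a query strategy and `evaluate` would re-train the model on the labeled indexes and score it on the test set.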
After all folds are finished, the learning curve of this query strategy can be obtained by averaging the performance curve of each fold.
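As a minimal sketch of that averaging step, assuming each fold produced a list of performances with one value per query (the helper name `average_curves` is hypothetical, and aligning the folds on the shortest curve is a choice made for this sketch only):

```python
from statistics import mean, pstdev


def average_curves(fold_curves):
    """Average per-fold performance curves point by point."""
    n_points = min(len(c) for c in fold_curves)  # align on the shortest curve
    columns = [[c[i] for c in fold_curves] for i in range(n_points)]
    avg = [mean(col) for col in columns]
    std = [pstdev(col) for col in columns]       # spread across folds
    return avg, std


folds = [[0.5, 0.75, 1.0], [0.25, 0.75, 0.5]]
avg, std = average_curves(folds)
print(avg)  # [0.375, 0.75, 0.75]
```

The per-point standard deviation is what a shaded band around a learning curve would typically show.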
The tool classes provided by alipy cover as many components in the above figure as possible. Note that each independent module can be replaced by your own implementation (without inheriting), because the modules in alipy do not influence each other and thus can be substituted freely.
Some commonly used tools are:
* Use alipy.data_manipulate to preprocess and split your data sets for experiments.
* Use alipy.query_strategy to invoke traditional and state-of-the-art methods.
* Use alipy.index.IndexCollection to manage your labeled and unlabeled indexes.
* Use alipy.metric to calculate your model performance.
* Use alipy.experiment.stopping_criteria to get some example stopping criteria.
* Use alipy.experiment.experiment_analyser to gather, process and visualize your experiment results.
The rest of the tutorial is organized as follows. We first present a complete example of implementing the experiment with alipy below for experienced users. Then, we explain the code along with an introduction to the commonly used methods in the above tools. An introduction to additional tools and supported variant settings can be found in the advanced guidelines.
import copy
from sklearn.datasets import load_iris
from alipy import ToolBox

X, y = load_iris(return_X_y=True)
alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')

# Split data
alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

# Use the default Logistic Regression classifier
model = alibox.get_default_model()

# The cost budget is 50 times querying
stopping_criterion = alibox.get_stopping_criterion('num_of_queries', 50)

# Use pre-defined strategy
uncertainStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceUncertainty')
unc_result = []

for round in range(10):
    # Get the data split of one fold experiment
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # Get intermediate results saver for one fold experiment
    saver = alibox.get_stateio(round)

    while not stopping_criterion.is_stop():
        # Select a subset of Uind according to the query strategy
        # Passing any sklearn model with a predict_proba method is OK
        select_ind = uncertainStrategy.select(label_ind, unlab_ind, model=model, batch_size=1)
        # or pass your proba predict result
        # prob_pred = model.predict_proba(X[unlab_ind])
        # select_ind = uncertainStrategy.select_by_prediction_mat(unlabel_index=unlab_ind, predict=prob_pred, batch_size=1)

        label_ind.update(select_ind)
        unlab_ind.difference_update(select_ind)

        # Update model and calc performance according to the model you are using
        model.fit(X=X[label_ind.index, :], y=y[label_ind.index])
        pred = model.predict(X[test_idx, :])
        accuracy = alibox.calc_performance_metric(y_true=y[test_idx],
                                                  y_pred=pred,
                                                  performance_metric='accuracy_score')

        # Save intermediate results to file
        st = alibox.State(select_index=select_ind, performance=accuracy)
        saver.add_state(st)
        saver.save()

        # Passing the current progress to stopping criterion object
        stopping_criterion.update_information(saver)
    # Reset the progress in stopping criterion object
    stopping_criterion.reset()
    unc_result.append(copy.deepcopy(saver))

analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')
analyser.add_method(method_name='uncertainty', method_results=unc_result)
print(analyser)
analyser.plot_learning_curves(title='Example of AL', std_area=True)
When using alipy, instead of importing each module independently, there is a more convenient way: create a ToolBox object and specify the query type of your experiment. To query all the labels of an instance, for example:
from alipy import ToolBox
alibox = ToolBox(X=X, y=y, query_type='AllLabels')
Once initialized, the ToolBox object gives you access to all available tools without passing redundant parameters.
IndexCollection is a list-like container to manage your labeled and unlabeled indexes.
You can create an IndexCollection object easily by passing a collection of indexes, such as a list. (Note that other data types will be treated as a single element.)
a = [1, 2, 3]
a_ind = alibox.IndexCollection(a)
# Or create by importing the module
from alipy.index import IndexCollection
a_ind = IndexCollection(a)
This class checks the validity of index operations automatically, e.g., adding a repeated element, deleting a nonexistent element, type consistency, etc.
Commonly used methods of IndexCollection are:
* index: to get the list type of the indexes for matrix indexing.
* update(): to add a batch of indexes to the IndexCollection object.
* difference_update(): to remove a batch of indexes from the IndexCollection object.
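To make this bookkeeping concrete, here is a small stand-in written in plain Python. It is not alipy's implementation: `SimpleIndexCollection` is a hypothetical class that only mimics the three operations used in the example code, and silently skipping repeated or missing elements is a simplification made for this sketch.

```python
class SimpleIndexCollection:
    """Hypothetical stand-in for an index container; not alipy's code."""

    def __init__(self, indexes):
        self._items = []
        self.update(indexes)

    @property
    def index(self):
        # A plain list copy, suitable for matrix indexing such as X[ind.index]
        return list(self._items)

    def update(self, indexes):
        for i in indexes:
            if i not in self._items:   # skip repeated elements
                self._items.append(i)
        return self

    def difference_update(self, indexes):
        for i in indexes:
            if i in self._items:       # skip elements that are not present
                self._items.remove(i)
        return self


label_ind = SimpleIndexCollection([0, 1])
unlab_ind = SimpleIndexCollection([2, 3, 4])
queried = [3]
label_ind.update(queried)             # the queried index becomes labeled
unlab_ind.difference_update(queried)  # and leaves the unlabeled pool
print(label_ind.index)  # [0, 1, 3]
print(unlab_ind.index)  # [2, 4]
```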
Here we only introduce the split methods of the ToolBox object. To split data independently, you can read about the data manipulate module in the advanced guideline.
There are two ways to split the data with the ToolBox object.
1. You can use split_AL() to split the data by specifying some options:
alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)
The above code randomly splits the dataset into training, testing, labeled and unlabeled sets 10 times. By default, it ensures that each initially labeled set contains at least one instance of each class. The split results are stored inside the object; you can get one fold of the split by:
train_0, test_0, label_0, unlabel_0 = alibox.get_split(round=0)
train_1, test_1, label_1, unlabel_1 = alibox.get_split(round=1)
Note that the labeled and unlabeled indexes returned by get_split() have already been converted into IndexCollection objects (e.g., label_0, unlabel_0).
The whole split setting is also returned by split_AL(), so you may use it elsewhere. Each returned value holds the indexes for all splits:
train_idx, test_idx, label_idx, unlabel_idx = alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)
2. You can also use your own split function and set the indexes of train_idx, test_idx, label_idx and unlabel_idx when initializing the ToolBox object. (Note that, in each split, the labeled and unlabeled sets should be subsets of the training set.) Each parameter should hold the indexes for all splits:
import alipy
train_idx, test_idx, label_idx, unlabel_idx = my_own_split_fun(X, y)
alibox = alipy.ToolBox(X=X, y=y, query_type='AllLabels',
                       train_idx=train_idx, test_idx=test_idx,
                       label_idx=label_idx, unlabel_idx=unlabel_idx)
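A custom split function might look like the sketch below. It is written under stated assumptions rather than taken from alipy: the keyword arguments and the seed are invented for illustration, the initial label rate is applied to the training set, and per-class coverage of the initially labeled set is not enforced here.

```python
import random


def my_own_split_fun(X, y, test_ratio=0.3, initial_label_rate=0.1,
                     split_count=10, seed=0):
    """Return train/test/labeled/unlabeled index lists, one entry per split."""
    rng = random.Random(seed)
    n = len(y)
    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    for _ in range(split_count):
        perm = list(range(n))
        rng.shuffle(perm)                    # a fresh random partition per split
        n_test = round(n * test_ratio)
        test, train = perm[:n_test], perm[n_test:]
        n_init = max(1, round(len(train) * initial_label_rate))
        label_idx.append(train[:n_init])     # initially labeled set
        unlabel_idx.append(train[n_init:])   # unlabeled pool
        train_idx.append(train)
        test_idx.append(test)
    return train_idx, test_idx, label_idx, unlabel_idx


X = [[i] for i in range(20)]
y = [i % 2 for i in range(20)]
train_idx, test_idx, label_idx, unlabel_idx = my_own_split_fun(X, y, split_count=3)
print(len(train_idx), len(train_idx[0]), len(test_idx[0]))  # 3 14 6
```

In every split the labeled and unlabeled lists together make up exactly the training list, which is the subset condition stated above.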
One of the core algorithms in active learning may be the query strategy.
ALiPy currently provides several classical and state-of-the-art strategies, and more will be added in later updates. The implemented strategies can be found in the alipy overview.
You can get a query strategy object from alipy.ToolBox object by only providing the strategy name:
uncertainStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceUncertainty')
Once initialized, you can select data by providing the labeled and unlabeled indexes and the batch size.
Assuming that you are using IndexCollection to manage your indexes, that the labeled index container is Lind and the unlabeled one is Uind, a pre-defined strategy may be used like this (passing a list is also OK):
select_ind = uncertainStrategy.select(label_index=Lind, unlabel_index=Uind, batch_size=1)
Some strategies (e.g., Uncertainty, QBC, etc.) need the prediction model for evaluating the unlabeled data. Since alipy is model independent, we provide several solutions for such methods and introduce them in the advanced tutorial for query strategy.
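To illustrate why such strategies need model predictions, here is a minimal sketch of one common uncertainty measure, least confidence, computed from a probability matrix. The function name `select_least_confident` is hypothetical, and this is not necessarily the measure that alipy's uncertainty strategy uses by default.

```python
def select_least_confident(unlabel_index, predict, batch_size=1):
    """Pick the unlabeled indexes whose top class probability is lowest."""
    # predict[i] is the class-probability row for unlabel_index[i]
    confidence = [max(row) for row in predict]
    order = sorted(range(len(unlabel_index)), key=lambda i: confidence[i])
    return [unlabel_index[i] for i in order[:batch_size]]


unlab = [10, 11, 12]
proba = [[0.9, 0.1], [0.55, 0.45], [0.7, 0.3]]
print(select_least_confident(unlab, proba, batch_size=2))  # [11, 12]
```

Instance 11 is selected first because the model's most likely class has only probability 0.55 there, which makes it the least certain prediction in the pool.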
ALiPy is a model-independent active learning toolbox, so training and testing the model are implemented by users.
However, we provide various performance calculation functions for regression and for multi-class and multi-label classification.
Available functions include:
'accuracy_score', 'roc_auc_score', 'get_fps_tps_thresholds', 'hamming_loss', 'one_error', 'coverage_error', 'label_ranking_loss', 'label_ranking_average_precision_score'
There are two ways to use them:
1. Import the alipy.metric module and invoke the tool functions:
from alipy.metric import accuracy_score
acc = accuracy_score(y_true=y, y_pred=model.predict(X))
2. Use the calc_performance_metric() method of the ToolBox object:
acc = alibox.calc_performance_metric(y_true=y, y_pred=model.predict(X), performance_metric='accuracy_score')
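For readers unfamiliar with these measures, the standard definitions of two of them can be written in a few lines of plain Python. These helpers (`accuracy`, `hamming`) are illustrative only and are not alipy's functions; they assume equal-length inputs and, for the Hamming loss, binary label rows.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming(y_true, y_pred):
    """Fraction of individual label entries that differ (multi-label case)."""
    pairs = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    return sum(t != p for t, p in pairs) / len(pairs)


print(accuracy([0, 1, 1, 0], [0, 1, 0, 0]))         # 0.75
print(hamming([[1, 0], [0, 1]], [[1, 1], [0, 1]]))  # 0.25
```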
StateIO is a class to save and load your intermediate results.
This object implements several crucial functions:
- Save intermediate results to files
- Recover workspace (label set and unlabel set) at any iteration
- Recover program from the breakpoint in case the program exits unexpectedly
- Print the active learning progress: current_iteration, current_mean_performance, current_cost, etc.
It is strongly recommended to use this tool class to manage your intermediate results, because many other components in alipy support the StateIO object directly (e.g., the stopping criterion and the experiment analyser).
If you are going to use those tool classes too, it can save some time on processing the data types.
You can get a StateIO object from the ToolBox object by simply providing the fold number (the saving path will be inherited from the ToolBox object):
saver = alibox.get_stateio(round=0)
When adding a query to the StateIO object, you must use a State object, which is a dict-like container that stores the necessary information about one query (the state of the current iteration), such as cost, performance, selected indexes, and so on.
You need to set the queried indexes and performance when initializing a State object; cost and queried_label are optional:
st = alibox.State(select_index=select_ind, performance=accuracy, cost=cost, queried_label=queried_label)
You can also add other entries as needed.
After you put all useful information into a State object, add the state to the StateIO object with add_state(), and use the save() method to save the intermediate results to file:
saver.add_state(st)
saver.save()
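The idea behind saving one state per query and recovering the workspace later can be sketched as follows. `SimpleStateLog` is a hypothetical stand-in written for this tutorial, not alipy's StateIO; the JSON file layout and the `recover_workspace` method name are assumptions of the sketch.

```python
import json
import os
import tempfile


class SimpleStateLog:
    """Hypothetical stand-in for an intermediate-results saver."""

    def __init__(self, init_label, init_unlabel, path):
        self.init_label = list(init_label)
        self.init_unlabel = list(init_unlabel)
        self.path = path
        self.states = []

    def add_state(self, select_index, performance):
        # One dict per query, mirroring the idea of a State object
        self.states.append({"select_index": list(select_index),
                            "performance": performance})

    def save(self):
        with open(self.path, "w") as f:
            json.dump({"init_label": self.init_label,
                       "init_unlabel": self.init_unlabel,
                       "states": self.states}, f)

    def recover_workspace(self, iteration):
        """Rebuild the labeled/unlabeled sets as they were after `iteration` queries."""
        label, unlabel = list(self.init_label), list(self.init_unlabel)
        for st in self.states[:iteration]:
            for idx in st["select_index"]:
                unlabel.remove(idx)
                label.append(idx)
        return label, unlabel


path = os.path.join(tempfile.mkdtemp(), "fold_0.json")
log = SimpleStateLog([0], [1, 2, 3], path)
log.add_state([2], 0.6)
log.add_state([1], 0.7)
log.save()
print(log.recover_workspace(1))  # ([0, 2], [1, 3])
```

Because only the initial sets and the selected indexes of each query are stored, the workspace at any iteration can be rebuilt by replaying the queries in order.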
alipy implements some commonly used stopping criteria:
* No unlabeled samples available (default)
* Preset number of queries is reached
* Preset limitation of cost is reached
* Preset percent of unlabeled pool is labeled
* Preset running time (CPU time) is reached
To use the above criteria, you should get a stopping criterion object by:
stopping_criterion = alibox.get_stopping_criterion(stopping_criteria='num_of_queries', value=50)
The legal stopping_criteria can be one of
[None, 'num_of_queries', 'cost_limit', 'percent_of_unlabel', 'time_limit']
which correspond to the above 5 criteria. The value is the preset budget.
Once the stopping condition is set, you can use is_stop() to check if the condition is met.
Note that you should update the stopping_criterion object by providing a StateIO object; it will read the necessary information from it and update the current state.
Once the stopping condition is met, you should reset the object before reusing it. Otherwise, is_stop() will always return True.
while not stopping_criterion.is_stop():
    # ... Query some examples and update the StateIO object
    # Use the StateIO object to update stopping_criterion object
    stopping_criterion.update_information(saver)
# The condition is met and break the loop.
# Reset the object for another fold.
stopping_criterion.reset()
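Why the reset matters can be seen in a toy version of a query-count criterion. `QueryBudget` below is a hypothetical class that only mirrors the three method names used above; unlike alipy's object, it is updated from a plain list rather than a StateIO object.

```python
class QueryBudget:
    """Hypothetical 'num_of_queries' style criterion; not alipy's class."""

    def __init__(self, value):
        self.value = value
        self._done = 0

    def is_stop(self):
        return self._done >= self.value

    def update_information(self, states):
        # `states` is any sequence with one entry per finished query
        self._done = len(states)

    def reset(self):
        self._done = 0


budget = QueryBudget(value=2)
queries_per_fold = []
for fold in range(2):
    states = []
    while not budget.is_stop():
        states.append("query")          # stands in for one querying iteration
        budget.update_information(states)
    queries_per_fold.append(len(states))
    budget.reset()  # without this, the next fold would never enter the loop
print(queries_per_fold)  # [2, 2]
```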
ExperimentAnalyser is a tool class to gather, process and visualize your experiment results.
When initializing, you need to specify the x_axis type of your result data, which should be 'num_of_queries' if your result data is aligned by number of queries, or 'cost' if you are performing a cost-sensitive experiment.
analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')
# Or import the module
from alipy.experiment import ExperimentAnalyser
analyser = ExperimentAnalyser(x_axis='num_of_queries')
The first thing you need to do is to add the results of all compared methods to the analyser with add_method().
The analyser object accepts 3 types of result data for 2 different active learning settings ('num_of_queries', 'cost'). Normally, the results should be a list containing k elements, each of which represents the result of one fold. Legal result objects include:
- StateIO object.
- A list containing n performances for n queries.
- A list containing n tuples of 2 elements, in which the first element is the x_axis value (e.g., iteration, accumulative_cost) and the second element is the y_axis value (e.g., the performance).
In our example code, it is a list of k StateIO objects.
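The two list-based formats carry the same information, as the following sketch shows by converting either one into (x, y) points. `to_xy_pairs` is a hypothetical helper, not part of alipy, and numbering the queries from 1 is an assumption of this sketch.

```python
def to_xy_pairs(fold_result):
    """Turn one fold's result into a list of (x, y) points."""
    pairs = []
    for i, item in enumerate(fold_result, start=1):
        if isinstance(item, (tuple, list)) and len(item) == 2:
            pairs.append((item[0], item[1]))  # already (x_axis, performance)
        else:
            pairs.append((i, item))           # x is the query count
    return pairs


print(to_xy_pairs([0.5, 0.6]))                # [(1, 0.5), (2, 0.6)]
print(to_xy_pairs([(3.0, 0.5), (7.5, 0.6)]))  # [(3.0, 0.5), (7.5, 0.6)]
```

A list of plain performances fits the 'num_of_queries' setting, while explicit (x, y) tuples are what a cost-sensitive experiment would need, since the cost of each query can differ.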
Finally, you can show the learning curves by invoking plot_learning_curves().
Copyright © 2018, alipy developers (BSD 3 License).