In this tutorial, we present a simple example of customizing your active learning experiment with the tools in alipy.
Considering that some users have less experience in experiment implementation, we first introduce the unified framework of an active learning experiment, and then present the corresponding tools in alipy. The full introduction to each class can be found in the advanced guideline.
As illustrated in the following figure, in normal conditions, the feature matrix X with shape [n_samples, n_features] and the corresponding label matrix with shape [n_samples] or [n_samples, n_labels] are needed for subsequent operations.
However, even if it is not easy to obtain an explicit feature matrix (e.g., for an image dataset), you can still implement your experiment in alipy, because alipy only operates on the indexes of the instances.
Secondly, you should split your data into training/testing sets for the experiment. The data partition should be repeated randomly several times. In active learning, you should further split your training set into an initially labeled set and an unlabeled pool for querying. Note that the initially labeled set is usually small in most active learning settings.
Then, you can start the querying process for each fold of the experiment and record the results. In each querying iteration, a subset of unlabeled data is queried and added to the labeled set; after that, the model is re-trained on the updated labeled set and tested to evaluate the query.
After all folds are finished, the learning curve of this query strategy can be obtained by averaging the performance curves of the folds.
The tool classes provided by alipy cover as many components in the above figure as possible. Note that each independent module can be replaced by your own implementation (without inheriting), because the modules in alipy do not influence each other and thus can be substituted freely.
Some commonly used tools are:
* Using alipy.data_manipulate to preprocess and split your data sets for experiments.
* Using alipy.query_strategy to invoke traditional and state-of-the-art methods.
* Using alipy.index.IndexCollection to manage your labeled indexes and unlabeled indexes.
* Using alipy.metric to calculate your model performances.
* Using alipy.experiment.state and alipy.experiment.state_io to save the intermediate results after each query and recover the program from the breakpoints.
* Using alipy.experiment.stopping_criteria to get some example stopping criteria.
* Using alipy.experiment.experiment_analyser to gather, process and visualize your experiment results.
The rest of the tutorial is organized as follows. We first present a complete example of implementing the experiment with alipy below for experienced users. Then, we explain the code along with an introduction to the commonly used methods in the above tools. Introductions to additional tools and supported variant settings can be found in the advanced guidelines.
import copy
from sklearn.datasets import load_iris
from alipy import ToolBox

X, y = load_iris(return_X_y=True)
alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')

# Split data
alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

# Use the default Logistic Regression classifier
model = alibox.get_default_model()

# The cost budget is 50 times querying
stopping_criterion = alibox.get_stopping_criterion('num_of_queries', 50)

# Use pre-defined strategy
uncertainStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceUncertainty')
unc_result = []

for round in range(10):
    # Get the data split of one fold experiment
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # Get intermediate results saver for one fold experiment
    saver = alibox.get_stateio(round)

    # Set initial performance point
    model.fit(X=X[label_ind.index, :], y=y[label_ind.index])
    pred = model.predict(X[test_idx, :])
    accuracy = alibox.calc_performance_metric(y_true=y[test_idx],
                                              y_pred=pred,
                                              performance_metric='accuracy_score')
    saver.set_initial_point(accuracy)

    while not stopping_criterion.is_stop():
        # Select a subset of the unlabeled set according to the query strategy
        # Passing any sklearn model with a predict_proba method is ok
        select_ind = uncertainStrategy.select(label_ind, unlab_ind, model=model, batch_size=1)
        # Or pass your probabilistic prediction result directly:
        # prob_pred = model.predict_proba(X[unlab_ind.index])
        # select_ind = uncertainStrategy.select_by_prediction_mat(unlabel_index=unlab_ind, predict=prob_pred, batch_size=1)
        label_ind.update(select_ind)
        unlab_ind.difference_update(select_ind)

        # Update the model and calc performance according to the model you are using
        model.fit(X=X[label_ind.index, :], y=y[label_ind.index])
        pred = model.predict(X[test_idx, :])
        accuracy = alibox.calc_performance_metric(y_true=y[test_idx],
                                                  y_pred=pred,
                                                  performance_metric='accuracy_score')

        # Save intermediate results to file
        st = alibox.State(select_index=select_ind, performance=accuracy)
        saver.add_state(st)
        saver.save()

        # Pass the current progress to the stopping criterion object
        stopping_criterion.update_information(saver)
    # Reset the progress in the stopping criterion object
    stopping_criterion.reset()
    unc_result.append(copy.deepcopy(saver))

analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')
analyser.add_method(method_name='uncertainty', method_results=unc_result)
print(analyser)
analyser.plot_learning_curves(title='Example of AL', std_area=True)
When using alipy, instead of importing each module independently, there is a more convenient way: create a ToolBox object and specify the query type of your experiment (here, querying all the labels of an instance), for example:
from alipy import ToolBox
alibox = ToolBox(X=X, y=y, query_type='AllLabels')
Once initialized, you can get all the available tools from the ToolBox object without passing redundant parameters.
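For example, the tools used in the complete example above are all obtained from the single alibox object:
model = alibox.get_default_model()
saver = alibox.get_stateio(round=0)
stopping_criterion = alibox.get_stopping_criterion('num_of_queries', 50)
analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')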
alipy.index.IndexCollection is a list-like container to manage your labeled and unlabeled indexes.
You can create an IndexCollection object easily by passing a list or numpy.ndarray object.
(Note that other data types will be treated as a single element.)
a = [1,2,3]
a_ind = alibox.IndexCollection(a)
# Or create by importing the module
from alipy.index import IndexCollection
a_ind = IndexCollection(a)
This class will automatically check the validity of index operations, e.g., adding duplicate elements, deleting nonexistent elements, keeping the index type consistent, etc.
Commonly used methods of IndexCollection are (a short sketch follows this list):
- Using a_ind.index to get a plain list of the indexes for matrix indexing.
- Using a_ind.update() to add a batch of indexes to the IndexCollection object.
- Using a_ind.difference_update() to remove a batch of indexes from the IndexCollection object.
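For instance, a minimal sketch of these operations (the printed result is illustrative):
a_ind = alibox.IndexCollection([1, 2, 3])
a_ind.update([4, 5])             # add a batch of indexes
a_ind.difference_update([1])     # remove a batch of indexes
print(a_ind.index)               # a plain list such as [2, 3, 4, 5] for matrix indexing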
Here we only introduce the split methods of the toolbox object. To split data independently, you can read about the data manipulate module in the advanced guideline.
There are two ways to split the data with the toolbox object.
1. You can use alibox.split_AL() to split the data by specifying some options:
alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)
The above code will split the dataset into training, testing, labeled and unlabeled sets randomly, 10 times. By default, it will enforce that each initially labeled set contains at least one instance of each class. The split results are stored inside the object; you can get one fold of the split by:
train_0, test_0, label_0, unlabel_0 = alibox.get_split(round=0)
train_1, test_1, label_1, unlabel_1 = alibox.get_split(round=1)
Note that the labeled and unlabeled indexes returned by alibox.get_split(round) have already been converted into IndexCollection objects (e.g., label_0, unlabel_0).
The whole split setting will also be returned by alibox.split_AL() itself, so you may use it elsewhere. Each returned value has the shape [n_split_count, n_indexes]:
train_idx, test_idx, label_idx, unlabel_idx = alibox.split_AL(test_ratio=0.3,
                                                              initial_label_rate=0.1,
                                                              split_count=10)
2. You can also use your own split function and set the indexes train_idx, test_idx, label_idx, unlabel_idx when initializing the ToolBox object. (Note that in each split, the labeled and unlabeled sets should be subsets of the training set.) Each parameter should have the shape [n_split_count, n_indexes]:
train_idx, test_idx, label_idx, unlabel_idx = my_own_split_fun(X, y)
alibox = alipy.ToolBox(X=X, y=y, query_type='AllLabels',
                       train_idx=train_idx, test_idx=test_idx,
                       label_idx=label_idx, unlabel_idx=unlabel_idx)
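For illustration, here is a minimal sketch of such a split function. Note that my_own_split_fun is a hypothetical example, and unlike alibox.split_AL(), this plain random partition does not enforce that every class appears in the initial labeled set:
import numpy as np

def my_own_split_fun(X, y, test_ratio=0.3, initial_label_rate=0.1, split_count=10):
    # Hypothetical split function: a plain random partition for each fold.
    n_samples = len(y)
    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    for _ in range(split_count):
        perm = np.random.permutation(n_samples)
        n_test = int(n_samples * test_ratio)
        test, train = perm[:n_test], perm[n_test:]
        n_label = max(1, int(len(train) * initial_label_rate))
        train_idx.append(train)
        test_idx.append(test)
        label_idx.append(train[:n_label])    # labeled set is a subset of the training set
        unlabel_idx.append(train[n_label:])  # the rest of the training set is the unlabeled pool
    return train_idx, test_idx, label_idx, unlabel_idx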
The query strategy is arguably one of the core components of active learning.
ALiPy provides several classical and state-of-the-art strategies for now, and more strategies will be added in later updates. The implemented strategies can be found in the alipy overview.
You can get a query strategy object from alipy.ToolBox object by only providing the strategy name:
uncertainStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceUncertainty')
For the other strategies, please import them directly. Note that the GraphDensity and QUIRE methods need additional parameters; please refer to the API reference.
Once initialized, you can select data by providing the labeled indexes, the unlabeled indexes and the batch size.
Assume that you are using alipy.IndexCollection to manage your indexes, with the labeled index container Lind and the unlabeled one Uind; the example usage of a pre-defined strategy may then look like this (providing a list type is also ok):
select_ind = uncertainStrategy.select(label_index=Lind,
                                      unlabel_index=Uind,
                                      batch_size=1)
Some strategies need the prediction model to evaluate the unlabeled data (e.g., Uncertainty, QBC, etc.). Since alipy is model independent, we provide several solutions for such methods and introduce them in the advanced tutorial for query strategy.
ALiPy is a model-independent active learning toolbox, so this part is implemented by the user.
However, we provide various performance calculation functions for regression as well as multi-class and multi-label classification.
Available functions include:
'accuracy_score', 'roc_auc_score', 'get_fps_tps_thresholds',
'hamming_loss', 'one_error', 'coverage_error',
'label_ranking_loss', 'label_ranking_average_precision_score'
There are two ways to use them:
1. Import the alipy.metric module and invoke the tool functions:
from alipy.metric import accuracy_score
acc = accuracy_score(y_true=y, y_pred=model.predict(X))
2. Use the calc_performance_metric() method of the ToolBox object:
acc = alibox.calc_performance_metric(y_true=y, y_pred=model.predict(X),
                                     performance_metric='accuracy_score')
The alipy.experiment.StateIO object saves and loads your intermediate results.
This object implements several crucial functions:
- Save intermediate results to files
- Recover the workspace (labeled set and unlabeled set) at any iteration
- Recover the program from a breakpoint in case the program exits unexpectedly
- Print the active learning progress: current_iteration, current_mean_performance, current_cost, etc.
It is strongly recommended to use this tool class to manage your intermediate results, because many other components in alipy support the StateIO object directly (e.g., Analyser, StoppingCriteria).
If you are going to use those tool classes too, it can save you some time on processing data types.
You can get a StateIO object from the ToolBox object by simply providing the fold number (the saving path will be inherited from the ToolBox object):
saver = alibox.get_stateio(round=0)
When adding a query to the StateIO object, you must use a State object, which is a dict-like container that saves the necessary information of one query (the state of the current iteration), such as the cost, performance, selected indexes, and so on.
You need to set the queried indexes and performance when initializing a State object; the cost and queried_label are optional:
st = alibox.State(select_index=select_ind, performance=accuracy,
                  cost=cost, queried_label=queried_label)
You can also add some other entries as you need:
st.add_element(key='my_entry', value=my_value)
After you have put all useful information into a State object, you should add the state to the StateIO object and use the save() method to write the intermediate results to file:
saver.add_state(st)
saver.save()
alipy implements some commonly used stopping criteria:
* No unlabeled samples available (default)
* Preset number of queries is reached
* Preset limitation of cost is reached
* Preset percent of unlabeled pool is labeled
* Preset running time (CPU time) is reached
To use the above criteria, you should get a stopping criterion object by:
stopping_criterion = alibox.get_stopping_criterion(stopping_criteria='num_of_queries', value=50)
The legal stopping_criteria values are [None, 'num_of_queries', 'cost_limit', 'percent_of_unlabel', 'time_limit'], which correspond to the above 5 criteria. The value is the preset budget.
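For example, to stop after a preset fraction of the unlabeled pool has been queried (assuming here that the value is given as a fraction; please check the API reference for the exact semantics):
stopping_criterion = alibox.get_stopping_criterion(stopping_criteria='percent_of_unlabel', value=0.1)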
Once the stopping condition is set, you can use stopping_criterion.is_stop() to judge whether the condition is met.
Note that you should update the stopping_criterion object by providing a StateIO object; it will read the necessary information from it and update its current state.
Once the stopping condition is met, you should reset the object before re-using it. Otherwise, it will always return True when invoking stopping_criterion.is_stop().
while not stopping_criterion.is_stop():
    # ... Query some examples and update the StateIO object
    # Use the StateIO object to update the stopping_criterion object
    stopping_criterion.update_information(saver)
# The condition is met and the loop is exited.
# Reset the object for another fold.
stopping_criterion.reset()
alipy.experiment.ExperimentAnalyser is a tool class to gather, process and visualize your experiment results.
When initializing, you need to specify the x_axis type of your result data, which should be 'num_of_queries' if your results are aligned by the number of queries, or 'cost' if you are performing a cost-sensitive experiment.
analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')
# Or import the module
from alipy.experiment import ExperimentAnalyser
analyser = ExperimentAnalyser(x_axis='num_of_queries')
The first thing you need to do is add the results of all compared methods to the Analyser.
The Analyser object accepts 3 types of result data for 2 different active learning settings ('num_of_queries', 'cost'). Normally, the results should be a list containing k elements, each representing the result of one fold of the experiment. Legal result objects include:
- A StateIO object.
- A list containing n performances for n queries.
- A list containing n tuples with 2 elements, in which the first element is the x_axis value (e.g., iteration, accumulative_cost) and the second element is the y_axis value (e.g., the performance).
In our example code, it is a list of k StateIO objects.
analyser.add_method(method_name='uncertainty', method_results=unc_result)
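Likewise, if each fold's result is just a plain list of performances, you can add it directly, e.g., for a hypothetical random baseline (the numbers below are purely illustrative):
random_result = [[0.60, 0.63, 0.67], [0.58, 0.64, 0.66]]  # k folds, n performances each
analyser.add_method(method_name='random', method_results=random_result)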
Finally, you can show the learning curves by invoking plot_learning_curves():
analyser.plot_learning_curves()