alipy.ToolBox
is a class that provides access to all available tool classes. Through a ToolBox object, you can get them without passing redundant parameters.
Instead of importing each module separately and initializing each class independently, it is more convenient to get them by initializing a single ToolBox object.
In this tutorial, we will first introduce how to initialize an
alipy.ToolBox
object.
Then, we present the tools you can get from that object.
When initializing a ToolBox object, you need to provide the feature and label matrices of your whole dataset, which many tools need for their own initialization. Note that the data matrices are stored by reference, so they will NOT use additional memory.
In addition, the query type should be given. The available query types are
['AllLabels', 'PartLabels', 'Features']
, which correspond to querying all labels of an instance, querying an
instance-label pair, and querying an instance-feature pair, respectively.
from sklearn.datasets import load_iris
from alipy import ToolBox
X, y = load_iris(return_X_y=True)
alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')
Finally, you can pass the
train_idx
,
test_idx
,
label_idx
,
unlabel_idx
optionally, in case you have your own data split. Otherwise, you can use the ToolBox object to create a random split.
train_idx, test_idx, label_idx, unlabel_idx = my_own_split_fun(X, y)
alibox = ToolBox(X=X, y=y, query_type='AllLabels',
                 train_idx=train_idx, test_idx=test_idx,
                 label_idx=label_idx, unlabel_idx=unlabel_idx)
ALiPy provides a Logistic Regression model with default parameters, implemented by sklearn. You can get the model object by:
lr_model = alibox.get_default_model()
To train and test the model, you can use
lr_model.fit(X, y)
pred = lr_model.predict(X)
# get probabilistic output
pred = lr_model.predict_proba(X)
To learn more about the model, please refer to the Logistic Regression documentation in sklearn.
There are two ways to split the data with a ToolBox object.
1. You can use
alibox.split_AL()
to split the data by specifying some options:
alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)
The above code randomly splits the dataset into training, testing, labeled and unlabeled sets 10 times. By default, it ensures that each initially labeled set contains at least one instance of each class. The split results are stored inside the object, and you can get one fold of the split by:
train_0, test_0, label_0, unlabel_0 = alibox.get_split(round=0)
train_1, test_1, label_1, unlabel_1 = alibox.get_split(round=1)
Note that the labeled and unlabeled indexes returned by
alibox.get_split(round)
have already been converted into IndexCollection objects (e.g., label_0, unlabel_0).
The whole split setting is also returned by split_AL(), so you can use it elsewhere.
Each returned value has the shape
[n_split_count, n_indexes]
train_idx, test_idx, label_idx, unlabel_idx = alibox.split_AL(test_ratio=0.3,
initial_label_rate=0.1,
split_count=10)
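The splitting behaviour described above can be sketched in plain Python for a single fold. This is only an illustration of the idea, not ALiPy's actual implementation: the function name random_al_split, its seed parameter and the demo labels are hypothetical, and the per-class guarantee is implemented here by seeding the labeled set with one training instance per class.

```python
import random
from collections import defaultdict


def random_al_split(y, test_ratio=0.3, initial_label_rate=0.1, seed=0):
    """Return (train, test, label, unlabel) index lists for one fold."""
    rng = random.Random(seed)
    indexes = list(range(len(y)))
    rng.shuffle(indexes)

    # Hold out the first part of the shuffled indexes as the test set.
    n_test = round(len(y) * test_ratio)
    test, train = indexes[:n_test], indexes[n_test:]

    # Seed the labeled set with one random training instance per class.
    by_class = defaultdict(list)
    for i in train:
        by_class[y[i]].append(i)
    label = [rng.choice(members) for members in by_class.values()]

    # Top up the labeled set until it reaches the requested rate.
    n_label = max(len(label), round(len(train) * initial_label_rate))
    chosen = set(label)
    remaining = [i for i in train if i not in chosen]
    label += remaining[: n_label - len(label)]

    # Everything in the training set that is not labeled is unlabeled.
    chosen = set(label)
    unlabel = [i for i in train if i not in chosen]
    return train, test, label, unlabel


# Hypothetical toy labels: three classes with ten instances each.
y_demo = [0] * 10 + [1] * 10 + [2] * 10
train, test, label, unlabel = random_al_split(y_demo)
```

Calling the function with different seeds would give the repeated folds that split_count controls in the ToolBox call.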
alipy.index.IndexCollection
is a tool class in alipy for index management.
You can create an instance by:
a = [1,2,3]
a_ind = alibox.IndexCollection(a)
The basic usage of IndexCollection is as follows:
- Use
a_ind.index
to get the indexes as a list for matrix indexing.
- Use
a_ind.update()
to add a batch of indexes to the IndexCollection object.
- Use
a_ind.difference_update()
to remove a batch of indexes from the IndexCollection object.
The detailed usage of IndexCollection can be found here.
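To make the three operations concrete, here is a minimal stand-in written in plain Python. The class SimpleIndexCollection is hypothetical and only mimics the behaviour described above (an ordered container without duplicates); it is not the real alipy.index.IndexCollection.

```python
class SimpleIndexCollection:
    """Hypothetical stand-in for an ordered, duplicate-free index container."""

    def __init__(self, indexes=()):
        self._items = []
        self.update(indexes)

    @property
    def index(self):
        # Return a copy as a plain list, suitable for matrix indexing.
        return list(self._items)

    def update(self, batch):
        # Add a batch of indexes, skipping any that are already present.
        for i in batch:
            if i not in self._items:
                self._items.append(i)
        return self

    def difference_update(self, batch):
        # Remove a batch of indexes; indexes that are absent are ignored.
        drop = set(batch)
        self._items = [i for i in self._items if i not in drop]
        return self


# Typical active learning bookkeeping: move queried indexes
# from the unlabeled pool to the labeled set.
label_ind = SimpleIndexCollection([1, 2, 3])
unlabel_ind = SimpleIndexCollection([4, 5, 6, 7])
selected = [5, 6]
label_ind.update(selected)
unlabel_ind.difference_update(selected)
```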
The ToolBox class provides initialization of a
clean oracle
. By default, the data matrix is not passed to the oracle
to save memory. If you need to query by feature vector, you can set
query_by_example=True
to achieve this.
clean_oracle = alibox.get_clean_oracle(query_by_example=False, cost_mat=None)
Normally, you query the oracle by providing a single index or a list of indexes. The returned label is the one supplied for that index when the oracle was initialized. The cost of each query is 1 by default; for cost-sensitive querying, you can instead set a cost matrix, which should have the same shape as the label matrix:
label, cost = clean_oracle.query_by_index([1])
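The querying behaviour described above can be illustrated with a small plain-Python stand-in. The class SimpleOracle and its return format (two parallel lists) are assumptions made for this sketch; consult the API reference for what ALiPy's oracle actually returns.

```python
class SimpleOracle:
    """Hypothetical noise-free oracle: returns the stored label and its cost."""

    def __init__(self, labels, cost=None):
        self._labels = list(labels)
        # Every query costs 1 unless a per-instance cost is supplied.
        if cost is None:
            self._cost = [1] * len(self._labels)
        else:
            self._cost = list(cost)

    def query_by_index(self, indexes):
        # Accept a single index or a list of indexes.
        if isinstance(indexes, int):
            indexes = [indexes]
        labels = [self._labels[i] for i in indexes]
        costs = [self._cost[i] for i in indexes]
        return labels, costs


demo_labels = ["cat", "dog", "bird"]
oracle = SimpleOracle(demo_labels)
label, cost = oracle.query_by_index([1])

# Cost-sensitive variant: one cost value per instance.
costly_oracle = SimpleOracle(demo_labels, cost=[2, 5, 1])
labels_2, cost_2 = costly_oracle.query_by_index([0, 1])
```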
To get a
repository
which is a tool to save the queried information,
you can invoke
get_repository(round, instance_flag=False)
:
alibox.get_repository(round=0, instance_flag=False)
The
round
parameter is the fold number of the current experiment.
To also save the feature vectors of the selected instances, you can set
instance_flag=True
.
alipy.experiment.StateIO
is a class for saving and loading your intermediate results.
This object implements several crucial functions:
- Save intermediate results to files
- Recover the workspace (label set and unlabel set) at any iteration
- Recover program from the breakpoint in case the program exits unexpectedly
- Print the active learning progress: current_iteration, current_mean_performance, current_cost, etc.
It is convenient to get a StateIO object through ToolBox. All you need to do is specify the round number.
saver = alibox.get_stateio(round=1)
The split setting will be passed by ToolBox object automatically, and the saving path will be inherited from it too.
When adding a query to the StateIO object, you must use a State object, a dict-like container that stores the necessary information about one query (the state of the current iteration), such as the cost, performance, selected indexes, and so on.
You need to set the queried indexes and the performance when initializing a State object; cost and queried_label are optional:
st = alibox.State(select_index=select_ind, performance=accuracy,
cost=cost, queried_label=queried_label)
You can also add some other entries as you need:
st.add_element(key='my_entry', value=my_value)
After you put all useful information into a State object,
you should add the state to the StateIO object, and use
save()
method to save the
intermediate results to file:
saver.add_state(st)
saver.save()
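Conceptually, the workflow above amounts to collecting one dict-like record per query and persisting the list. The following plain-Python sketch shows that idea with pickle; the file name and the example values are hypothetical, and the on-disk format used by StateIO itself may differ.

```python
import os
import pickle
import tempfile

# One record per query: required entries first, optional ones after.
state = {
    "select_index": [12],
    "performance": 0.81,
    "cost": [1],
    "queried_label": [0],
}
# Extra entries can be attached as needed.
state["my_entry"] = "anything picklable"

# The saver keeps the records in query order.
history = [state]

# Persist the intermediate results (hypothetical file name).
path = os.path.join(tempfile.mkdtemp(), "round_0_states.pkl")
with open(path, "wb") as f:
    pickle.dump(history, f)

# After an unexpected exit, the records can be loaded to resume.
with open(path, "rb") as f:
    restored = pickle.load(f)
```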
The query strategy is one of the core components of active learning.
ALiPy currently provides several classical and state-of-the-art strategies, and more will be added in later updates. The implemented strategies include Query-By-Committee (QBC), Uncertainty, Random, ExpectedErrorReduction, GraphDensity and QUIRE.
You can get a query strategy object from an alipy.ToolBox object by providing only the strategy name:
QBCStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceQBC')
The legal strategy names are
['QueryInstanceQBC', 'QueryInstanceUncertainty',
'QueryRandom', 'QureyExpectedErrorReduction', 'QueryInstanceGraphDensity', 'QueryInstanceQUIRE']
.
Note that the GraphDensity and QUIRE methods need additional parameters;
please refer to the API reference.
Once the strategy is initialized, you can select data by providing the labeled indexes, the unlabeled indexes and the batch size.
Assume that you are using
alipy.IndexCollection
to manage your indexes, that the labeled index container is
Lind
and that the unlabeled one is
Uind
.
A pre-defined strategy can then be used like this (plain lists are also accepted):
select_ind = QBCStrategy.select(label_index=Lind,
                                unlabel_index=Uind,
                                batch_size=1)
Some strategies (e.g., Uncertainty and QBC) need a prediction model to evaluate the unlabeled data. Since alipy is model independent, we provide several solutions for such methods and introduce them in the advanced guideline.
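To show why such strategies depend on a model, here is a minimal sketch of least-confident uncertainty sampling in plain Python. The function least_confident and the probability table are hypothetical; in practice the probabilities would come from a model's predict_proba output, and this is not the code of QueryInstanceUncertainty.

```python
def least_confident(proba, unlabel_index, batch_size=1):
    """Pick the unlabeled instances whose top predicted probability is lowest.

    proba maps an instance index to its list of class probabilities.
    """
    ranked = sorted(unlabel_index, key=lambda i: max(proba[i]))
    return ranked[:batch_size]


# Hypothetical model output for three unlabeled instances.
proba_demo = {
    4: [0.90, 0.10],  # confident prediction
    5: [0.55, 0.45],  # most uncertain
    6: [0.70, 0.30],
}
select_ind = least_confident(proba_demo, unlabel_index=[4, 5, 6], batch_size=2)
```

Here select_ind holds the two instances the model is least sure about, in order of increasing confidence.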
ALiPy provides various performance metric functions for regression, multi-class classification and multi-label classification.
The available functions include:
'accuracy_score', 'roc_auc_score', 'get_fps_tps_thresholds',
'hamming_loss', 'one_error', 'coverage_error',
'label_ranking_loss', 'label_ranking_average_precision_score'
To calculate the performance, you can specify a metric name and pass the ground truth and your
predicted labels to the
calc_performance_metric()
method.
Here is an example to use
calc_performance_metric()
method of ToolBox object:
acc = alibox.calc_performance_metric(y_true=y, y_pred=lr_model.predict(X),
                                     performance_metric='accuracy_score')
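For reference, two of the listed metrics are simple enough to write out by hand. The sketch below uses the standard definitions of accuracy and Hamming loss in plain Python; the function names and demo data are chosen for this example and are not ALiPy's implementations.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def hamming_loss(y_true, y_pred):
    """Fraction of individual labels that are wrong (multi-label setting).

    Each row of y_true and y_pred is one instance's binary label vector.
    """
    wrong = sum(
        t != p
        for row_true, row_pred in zip(y_true, y_pred)
        for t, p in zip(row_true, row_pred)
    )
    total = sum(len(row) for row in y_true)
    return wrong / total


# Multi-class example: three of four predictions are correct.
acc_demo = accuracy([0, 1, 2, 2], [0, 1, 1, 2])

# Multi-label example: two of six labels are wrong.
ham_demo = hamming_loss([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 1]])
```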
alipy implements some commonly used stopping criteria:
* No unlabeled samples available (default)
* Preset number of queries is reached
* Preset limitation of cost is reached
* Preset percent of unlabeled pool is labeled
* Preset running time (CPU time) is reached
To use the above criteria, you can get a stopping criterion object by
stopping_criterion = alibox.get_stopping_criterion(stopping_criteria='num_of_queries', value=50)
The legal values of stopping_criteria are
[None, 'num_of_queries', 'cost_limit', 'percent_of_unlabel', 'time_limit']
, which correspond to the five criteria above, in order. The value is the preset budget.
Once the stopping condition is set, you can use
stopping_criterion.is_stop()
to check
whether the condition is met.
Note that you should update the stopping_criterion object by providing a StateIO object;
it will read the necessary information from it and update the current state.
Once the stopping condition is met, you should reset the object before reusing it.
Otherwise, it will always return
True
when invoking
stopping_criterion.is_stop()
.
while not stopping_criterion.is_stop():
    # ... Query some examples and update the StateIO object
    # Use the StateIO object to update the stopping_criterion object
    stopping_criterion.update_information(saver)
# The condition is met, so the loop exits.
# Reset the object for another fold.
stopping_criterion.reset()
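The update, check and reset cycle can be seen in isolation with a small plain-Python stand-in for the 'num_of_queries' criterion. The class QueryBudget is hypothetical; unlike the real object, it is updated with a plain query count rather than a StateIO object.

```python
class QueryBudget:
    """Hypothetical stopping criterion: stop after a preset number of queries."""

    def __init__(self, value):
        self.value = value
        self._queries = 0

    def is_stop(self):
        return self._queries >= self.value

    def update_information(self, num_queries_so_far):
        # The real object reads this from a StateIO object instead.
        self._queries = num_queries_so_far

    def reset(self):
        self._queries = 0


budget = QueryBudget(value=3)
history = []
while not budget.is_stop():
    # Stand-in for querying an example and recording the result.
    history.append(f"query {len(history)}")
    budget.update_information(len(history))

# Without a reset, is_stop() keeps returning True.
stopped_before_reset = budget.is_stop()
budget.reset()
stopped_after_reset = budget.is_stop()
```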
alipy.experiment.Analyser
is a tool class for gathering, processing and
visualizing your experiment results.
To get an Analyser object through ToolBox, you need to specify the x_axis type of your result data, which should be 'num_of_queries' if your result data is aligned by number of queries, or 'cost' if you are performing a cost-sensitive experiment.
analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')
The first thing you need to do is add the results of all compared methods to the Analyser.
The Analyser object accepts 3 types of result data for 2 different active learning settings ('num_of_queries', 'cost'). Normally, the results should be a list containing k elements, each representing the result of one fold. Legal result objects include:
- StateIO object.
- A list containing n performances for n queries.
- A list containing n tuples of 2 elements, in which the first element is the x_axis value (e.g., iteration, accumulative_cost) and the second element is the y_axis value (e.g., the performance).
You can add the results of a method by
analyser.add_method(method_name='QBC', method_result=QBC_result)
Finally, you can show the learning curves by invoking
plot_learning_curves()
.
analyser.plot_learning_curves()
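A learning curve for the 'num_of_queries' setting is essentially the per-query performance averaged over folds. The plain-Python sketch below computes that average for the second result format (a list of performances per fold); the function average_curve and the demo numbers are hypothetical, and the Analyser's own processing may differ.

```python
from statistics import mean, stdev


def average_curve(fold_results):
    """Average per-query performance over folds.

    fold_results is a list with one list of performances per fold.
    Returns (query_number, mean, standard_deviation) tuples.
    """
    # Align the folds on the shortest one.
    n_queries = min(len(fold) for fold in fold_results)
    curve = []
    for q in range(n_queries):
        scores = [fold[q] for fold in fold_results]
        spread = stdev(scores) if len(scores) > 1 else 0.0
        curve.append((q + 1, mean(scores), spread))
    return curve


# Hypothetical results of one method: 2 folds, 3 queries each.
qbc_result = [[0.5, 0.6, 0.75], [0.7, 0.8, 0.85]]
curve = average_curve(qbc_result)
```

Each tuple in curve gives one point of the learning curve, with the standard deviation available for drawing an error band.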
alipy.utils.aceThreading is a class for running your k-fold experiments in parallel and printing the status of each thread.
To get an
aceThreading
object from alibox, you do not need to pass any additional parameters. The split setting and data matrices are passed automatically by reference.
acethread = alibox.get_ace_threading()
You can also set some options.
acethread = alibox.get_ace_threading(max_thread=5, refresh_interval=1, saving_path='.')
An introduction to the target function for parallel execution and to each option can be found here.
You can save and load the alibox object, which contains the data matrices and split setting, for later use (e.g., comparing different strategies, analysing results, etc.).
You can achieve this simply by
alibox.save()
alibox = ToolBox.load('./al_settings.pkl')
Copyright © 2018, alipy developers (BSD 3 License).