ToolBox

alipy.ToolBox

alipy.ToolBox is a class that provides access to all of the available tool classes. Through a ToolBox object, you can get them without passing the same parameters repeatedly.

Instead of importing each module separately and initializing each class independently, it is more convenient to get the tools from a single ToolBox object.

In this tutorial, we first introduce how to initialize an alipy.ToolBox object. Then we present the tools you can get from the object.

Initialize a ToolBox object

When initializing a ToolBox object, you need to provide the feature and label matrices of your whole dataset, which many of the tools need for their own initialization. Note that the data matrices are stored by reference, so they do NOT use additional memory.

In addition, you must specify the query type. The available query types are ['AllLabels', 'PartLabels', 'Features'], which correspond to querying all labels of an instance, querying an instance-label pair, and querying an instance-feature pair, respectively.

from sklearn.datasets import load_iris
from alipy import ToolBox

X, y = load_iris(return_X_y=True)
alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')

Finally, if you have your own data split, you can optionally pass train_idx, test_idx, label_idx, and unlabel_idx. Otherwise, you can use the ToolBox object to create a random split.

train_idx, test_idx, label_idx, unlabel_idx = my_own_split_fun(X, y)
alibox = ToolBox(X=X, y=y, query_type='AllLabels',
                 train_idx=train_idx, test_idx=test_idx,
                 label_idx=label_idx, unlabel_idx=unlabel_idx)

Get default model

ALiPy provides a Logistic Regression model with default parameters, implemented by sklearn. You can get the model object by:

lr_model = alibox.get_default_model()

To train and test the model, you can use:

lr_model.fit(X, y)
pred = lr_model.predict(X)
# get probabilistic output
proba = lr_model.predict_proba(X)

To learn more about the model, please refer to the LogisticRegression documentation in sklearn.

Split data

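There are two ways to set up the data split with a ToolBox object: pass your own split when initializing the object, as described above, or let the ToolBox object create a random split.

To create a random split, use alibox.split_AL() and specify some options: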

alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

The above code randomly splits the dataset into training, testing, labeled, and unlabeled sets 10 times. By default, it ensures that each initially labeled set contains at least one instance of each class. The split results are stored inside the object; you can get one fold of the split by:

train_0, test_0, label_0, unlabel_0 = alibox.get_split(round=0)
train_1, test_1, label_1, unlabel_1 = alibox.get_split(round=1)

Note that the labeled and unlabeled indexes returned by alibox.get_split(round) (e.g., label_0, unlabel_0) have already been converted into IndexCollection objects.

The whole split setting is also returned by split_AL(), so you can use it elsewhere. Each returned value has the shape [n_split_count, n_indexes]:

train_idx, test_idx, label_idx, unlabel_idx = alibox.split_AL(test_ratio=0.3,
                                                              initial_label_rate=0.1,
                                                              split_count=10)
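To make the shape of the returned split concrete, here is a minimal plain-Python sketch of a repeated random split. It is not ALiPy's implementation: the function name random_al_split, its seed argument, and the rounding rules are assumptions made for illustration. It only mirrors the behaviour described above (a random train/test partition per fold, and an initially labeled subset that covers every class).

```python
import random


def random_al_split(y, test_ratio=0.3, initial_label_rate=0.1,
                    split_count=10, seed=0):
    """Hypothetical helper: return four lists, each holding one index list per fold."""
    rng = random.Random(seed)
    n = len(y)
    train_idx, test_idx, label_idx, unlabel_idx = [], [], [], []
    for _ in range(split_count):
        perm = list(range(n))
        rng.shuffle(perm)
        n_test = round(n * test_ratio)
        test, train = perm[:n_test], perm[n_test:]
        n_label = max(1, round(len(train) * initial_label_rate))
        labeled = train[:n_label]
        # Add one example of every training class that is still missing.
        seen = {y[i] for i in labeled}
        for i in train[n_label:]:
            if y[i] not in seen:
                labeled.append(i)
                seen.add(y[i])
        labeled_set = set(labeled)
        unlabeled = [i for i in train if i not in labeled_set]
        train_idx.append(train)
        test_idx.append(test)
        label_idx.append(labeled)
        unlabel_idx.append(unlabeled)
    return train_idx, test_idx, label_idx, unlabel_idx


y = [0] * 20 + [1] * 20 + [2] * 20
train_idx, test_idx, label_idx, unlabel_idx = random_al_split(y, split_count=3)
print(len(train_idx), len(train_idx[0]), len(test_idx[0]))  # 3 42 18
```

Each of the four results has one row per fold, so train_idx[k], label_idx[k], and so on describe fold k.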

Create IndexCollection object

alipy.index.IndexCollection is a tool class in alipy used for index management.

You can create an IndexCollection object by:

a = [1,2,3]
a_ind = alibox.IndexCollection(a)

The basic usage of IndexCollection is as follows:

- Use a_ind.index to get the indexes as a list for matrix indexing.

- Use a_ind.update() to add a batch of indexes to the IndexCollection object.

- Use a_ind.difference_update() to remove a batch of indexes from the IndexCollection object.

See the IndexCollection documentation for detailed usage.
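The three operations listed above behave much like those of an ordered set. The following plain-Python stand-in sketches the typical bookkeeping in an active learning loop, where queried indexes move from the unlabeled pool to the labeled pool. The class name SimpleIndexCollection is hypothetical; it is not ALiPy's class and only imitates the three members named above.

```python
class SimpleIndexCollection:
    """Hypothetical stand-in that keeps unique indexes in insertion order."""

    def __init__(self, data=()):
        self._items = []
        self.update(data)

    @property
    def index(self):
        # Return a copy of the indexes as a list, suitable for matrix indexing.
        return list(self._items)

    def update(self, other):
        # Add a batch of indexes, skipping those already present.
        for i in other:
            if i not in self._items:
                self._items.append(i)
        return self

    def difference_update(self, other):
        # Remove a batch of indexes.
        drop = set(other)
        self._items = [i for i in self._items if i not in drop]
        return self


label_ind = SimpleIndexCollection([1, 2, 3])
unlabel_ind = SimpleIndexCollection([4, 5, 6, 7])

queried = [5, 7]
label_ind.update(queried)
unlabel_ind.difference_update(queried)
print(label_ind.index)    # [1, 2, 3, 5, 7]
print(unlabel_ind.index)  # [4, 6]
```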

Get Oracle & Repository object

The ToolBox class can initialize a clean oracle for you. To save memory, the data matrix is not passed to the oracle by default. If you need to query by feature vector, set query_by_example=True.

clean_oracle = alibox.get_clean_oracle(query_by_example=False, cost_mat=None)

Normally, you query the oracle by providing a single index or a list of indexes. The returned label is the corresponding label given when the oracle object was initialized. The cost is 1 by default if not specified at initialization; for cost-sensitive querying, you can set a cost matrix, which should have the same shape as the label matrix:

label, cost = clean_oracle.query_by_index([1])
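The query behaviour described above can be sketched in a few lines of plain Python. The class SketchOracle is hypothetical and is not ALiPy's oracle; it assumes one label and one cost per instance, stored in simple lists, and only illustrates the default cost of 1 and the effect of supplying costs.

```python
class SketchOracle:
    """Hypothetical oracle: returns stored labels and per-instance costs."""

    def __init__(self, labels, cost=None):
        self._labels = list(labels)
        # Every query costs 1 unless explicit costs are given.
        self._cost = list(cost) if cost is not None else [1] * len(self._labels)

    def query_by_index(self, indexes):
        if isinstance(indexes, int):
            indexes = [indexes]  # accept a single index as well as a list
        labels = [self._labels[i] for i in indexes]
        costs = [self._cost[i] for i in indexes]
        return labels, costs


oracle = SketchOracle(labels=['a', 'b', 'c'])
label, cost = oracle.query_by_index([1])
print(label, cost)  # ['b'] [1]

costly_oracle = SketchOracle(labels=['a', 'b', 'c'], cost=[2, 5, 1])
labels, costs = costly_oracle.query_by_index([0, 1])
print(labels, costs)  # ['a', 'b'] [2, 5]
```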

To get a repository, which is a tool for saving the queried information, you can invoke get_repository(round, instance_flag=False):

repo = alibox.get_repository(round=0, instance_flag=False)

The round parameter is the fold number of the current experiment. To also save the feature vectors of the selected instances, set instance_flag=True.

Get State & StateIO object

alipy.experiment.StateIO is a class for saving and loading your intermediate results. It implements several crucial functions:

- Save intermediate results to files

- Recover workspace (label set and unlabel set) at any iteration

- Recover program from the breakpoint in case the program exits unexpectedly

- Print the active learning progress: current_iteration, current_mean_performance, current_cost, etc.

It is convenient to get a StateIO object from a ToolBox object. All you need to do is specify the round number.

saver = alibox.get_stateio(round=1)

The split setting will be passed by ToolBox object automatically, and the saving path will be inherited from it too.

When adding a query to the StateIO object, you must use a State object, a dict-like container that stores the necessary information about one query (the state of the current iteration), such as the cost, performance, selected indexes, and so on.

You need to set the queried indexes and the performance when initializing a State object; cost and queried_label are optional:

st = alibox.State(select_index=select_ind, performance=accuracy,
                  cost=cost, queried_label=queried_label)

You can also add some other entries as you need:

st.add_element(key='my_entry', value=my_value)

After you have put all the useful information into a State object, add the state to the StateIO object and use the save() method to save the intermediate results to a file:

saver.add_state(st)
saver.save()
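The idea behind saving one state per query is that the workspace at any iteration can be rebuilt by replaying the saved queries. The sketch below shows this with plain dicts and pickle. It is not ALiPy's StateIO; the file name, the index values, and the performance values are made up for illustration.

```python
import os
import pickle
import tempfile

# Each state is a dict-like record of one query (hypothetical values).
states = []
for iteration, (select_ind, accuracy) in enumerate([([12], 0.61), ([47], 0.66)]):
    st = {'select_index': select_ind, 'performance': accuracy, 'cost': [1]}
    st['my_entry'] = iteration  # an extra, user-defined entry
    states.append(st)

# Write the states to disk so that a crash loses little work.
path = os.path.join(tempfile.mkdtemp(), 'states.pkl')
with open(path, 'wb') as f:
    pickle.dump(states, f)

# Recover the labeled set after the first query by replaying the saved states.
with open(path, 'rb') as f:
    restored = pickle.load(f)
initial_labeled = [3, 8]
labeled_after_first_query = initial_labeled + [
    i for st in restored[:1] for i in st['select_index']
]
print(labeled_after_first_query)  # [3, 8, 12]
```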

Get pre-defined QueryStrategy object

The query strategy is one of the core components of active learning.

ALiPy currently provides several classical and state-of-the-art strategies, and more will be added in later updates. The implemented strategies include Query-By-Committee (QBC), Uncertainty, Random, ExpectedErrorReduction, GraphDensity, and QUIRE.

You can get a query strategy object from alipy.ToolBox object by only providing the strategy name:

QBCStrategy = alibox.get_query_strategy(strategy_name='QueryInstanceQBC')

The legal strategy names are ['QueryInstanceQBC', 'QueryInstanceUncertainty', 'QueryRandom', 'QureyExpectedErrorReduction', 'QueryInstanceGraphDensity', 'QueryInstanceQUIRE']. Note that the GraphDensity and QUIRE methods need additional parameters; please refer to the API reference.

Once the strategy is initialized, you can select data by providing the labeled indexes, the unlabeled indexes, and the batch size.

Assume that you are using alipy.IndexCollection to manage your indexes, that the labeled index container is Lind, and that the unlabeled one is Uind. A pre-defined strategy can then be used as follows (passing plain lists is also fine):

select_ind = QBCStrategy.select(label_index=Lind,
                                unlabel_index=Uind,
                                batch_size=1)

Some strategies (e.g., Uncertainty, QBC) need a prediction model to evaluate the unlabeled data. Since alipy is model independent, we provide several solutions for such methods and introduce them in the advanced guideline.
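To show why such strategies depend on a model, here is a minimal sketch of least-confidence uncertainty sampling in plain Python. It is not ALiPy's QueryInstanceUncertainty; the function least_confident and the probability values are assumptions for illustration. The strategy only needs the model's class probabilities for the unlabeled instances.

```python
def least_confident(proba, unlabel_index, batch_size=1):
    """Pick the unlabeled indexes whose highest class probability is lowest."""
    ranked = sorted(unlabel_index, key=lambda i: max(proba[i]))
    return ranked[:batch_size]


# Hypothetical predicted class probabilities, keyed by instance index.
proba = {
    4: [0.90, 0.05, 0.05],
    5: [0.40, 0.35, 0.25],
    6: [0.60, 0.30, 0.10],
}
Uind = [4, 5, 6]

select_ind = least_confident(proba, Uind, batch_size=1)
print(select_ind)  # [5]
```

Instance 5 is selected because the model's most likely class for it has the lowest probability (0.40), so its prediction is the least certain.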

Calculate performance

ALiPy provides various performance calculation functions for regression, multi-class classification, and multi-label classification.

Available functions include:

'accuracy_score', 'roc_auc_score', 'get_fps_tps_thresholds', 
'hamming_loss', 'one_error', 'coverage_error',
'label_ranking_loss', 'label_ranking_average_precision_score'

To calculate the performance, specify a metric name and pass the ground truth and your predicted labels to the calc_performance_metric() method.

Here is an example of using the calc_performance_metric() method of a ToolBox object:

acc = alibox.calc_performance_metric(y_true=y, y_pred=lr_model.predict(X),
                                     performance_metric='accuracy_score')
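For reference, two of the listed metrics can be written out in plain Python. This is a sketch of the standard definitions of accuracy and Hamming loss, not ALiPy's code; the function names accuracy and hamming and the sample labels are chosen for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def hamming(y_true, y_pred):
    """Fraction of wrongly predicted entries in a multi-label matrix."""
    wrong = sum(t != p
                for row_t, row_p in zip(y_true, y_pred)
                for t, p in zip(row_t, row_p))
    total = sum(len(row) for row in y_true)
    return wrong / total


# Multi-class example: 3 of 4 predictions are correct.
print(accuracy([0, 1, 2, 2], [0, 1, 1, 2]))  # 0.75

# Multi-label example: 2 of 6 label entries are wrong.
loss = hamming([[1, 0, 1], [0, 1, 0]], [[1, 1, 1], [0, 1, 1]])
print(round(loss, 3))  # 0.333
```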

Get StoppingCriteria object

ALiPy implements some commonly used stopping criteria:

* No unlabeled samples available (default)

* Preset number of queries is reached

* Preset limitation of cost is reached

* Preset percent of unlabeled pool is labeled

* Preset running time (CPU time) is reached

To use the above criteria, you can get a stopping criterion object by:

stopping_criterion = alibox.get_stopping_criterion(stopping_criteria='num_of_queries', value=50)

The legal stopping_criteria value is one of [None, 'num_of_queries', 'cost_limit', 'percent_of_unlabel', 'time_limit'], which correspond to the 5 criteria above. The value parameter is the preset budget.

Once the stopping condition is set, you can use stopping_criterion.is_stop() to check whether the condition is met.

Note that you should update the stopping_criterion object by passing it a StateIO object; it reads the necessary information from the StateIO object and updates its current state. Once the stopping condition is met, you should reset the object before reusing it. Otherwise, stopping_criterion.is_stop() will always return True.

while not stopping_criterion.is_stop():
    #... Query some examples and update the StateIO object
    # Use the StateIO object to update stopping_criterion object
    stopping_criterion.update_information(saver)
# The condition is met and break the loop. 
# Reset the object for another fold.
stopping_criterion.reset()
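The update and reset cycle can be sketched with a small stand-in for the 'num_of_queries' criterion. The class QueryBudget is hypothetical and is not ALiPy's stopping criterion; in particular, its update_information() takes a plain query count instead of a StateIO object.

```python
class QueryBudget:
    """Hypothetical 'num_of_queries' criterion: stop after `value` queries."""

    def __init__(self, value):
        self.value = value
        self._count = 0

    def update_information(self, n_recorded_queries):
        # Refresh the internal state from the experiment's records.
        self._count = n_recorded_queries

    def is_stop(self):
        return self._count >= self.value

    def reset(self):
        self._count = 0


criterion = QueryBudget(value=3)
queries_per_fold = []
for fold in range(2):
    history = []  # stands in for the queries recorded in this fold
    while not criterion.is_stop():
        history.append('query')
        criterion.update_information(len(history))
    queries_per_fold.append(len(history))
    # Without this reset, is_stop() would stay True and the next fold
    # would not run a single query.
    criterion.reset()

print(queries_per_fold)  # [3, 3]
```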

Get ExperimentAnalyser

alipy.experiment.Analyser is a tool class for gathering, processing, and visualizing your experiment results.

To get an Analyser object from a ToolBox object, you need to specify the x_axis type of your result data: 'num_of_queries' if your result data is aligned by the number of queries, or 'cost' if you are performing a cost-sensitive experiment.

analyser = alibox.get_experiment_analyser(x_axis='num_of_queries')

The first thing you need to do is add the results of all compared methods to the Analyser.

The Analyser object accepts 3 types of result data for 2 different active learning settings ('num_of_queries', 'cost'). Normally, the results of one method should be a list of k elements, each of which represents the result of one fold. A legal result element is one of:

- StateIO object.

- A list containing n performance values for n queries.

- A list containing n tuples of 2 elements, in which the first element is the x_axis value (e.g., iteration, accumulative_cost) and the second element is the y_axis value (e.g., the performance).

You can add the results of a method by:

analyser.add_method(method_name='QBC', method_result=QBC_result)

Finally, you can show the learning curves by invoking plot_learning_curves().

analyser.plot_learning_curves()
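A learning curve for one method is typically the per-query average over the folds. The sketch below computes such a curve in plain Python for results given in the second legal form above (one list of n performances per fold). It is not ALiPy's analyser, and the performance values are made up.

```python
from statistics import mean, stdev

# Hypothetical result of one method: 3 folds, 3 queries per fold.
QBC_result = [
    [0.50, 0.62, 0.70],
    [0.54, 0.60, 0.74],
    [0.52, 0.64, 0.72],
]

# zip(*...) groups the values by query, so each column holds one value per fold.
mean_curve = [mean(column) for column in zip(*QBC_result)]
std_curve = [stdev(column) for column in zip(*QBC_result)]

print([round(m, 2) for m in mean_curve])  # [0.52, 0.62, 0.72]
```

The spread in std_curve is what a plot could draw as an error band around the mean curve.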

Get aceThreading object

alipy.utils.aceThreading is a class for running your k-fold experiments in parallel and printing the status of each thread.

To get an aceThreading object from alibox, you do not need to pass any additional parameters. The split setting and data matrices are passed automatically by reference.

acethread = alibox.get_ace_threading()

You can also set some options.

acethread = alibox.get_ace_threading(max_thread=5, refresh_interval=1, saving_path='.')

The target function to be run in parallel and each of the options are described in the aceThreading documentation.
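The general pattern of running one fold per thread can be sketched with the standard library. This is not aceThreading and does not use its target function signature; the function run_fold and its returned values are placeholders that stand in for a full active learning loop on one fold.

```python
from concurrent.futures import ThreadPoolExecutor


def run_fold(round_no):
    """Hypothetical target function: run one fold and return its result."""
    # A real target would query, retrain, and record performance here.
    performances = [0.5 + 0.1 * round_no]
    return round_no, performances


# Run 3 folds with at most 5 worker threads.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = dict(pool.map(run_fold, range(3)))

print(sorted(results))  # [0, 1, 2]
```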

Save & load ToolBox object

You can save and load the alibox object, which contains the data matrices and split setting, for later use (e.g., comparing different strategies, analysing results).

You can do this simply by:

alibox.save()
alibox = ToolBox.load('./al_settings.pkl')