Query strategy

alipy.query_strategy is a module that implements various classical and state-of-the-art query strategies for 7 different active learning settings. Please see the advanced guidelines for an introduction to each setting.

The implemented strategies fall into the following categories:

AL with Instance Selection: Uncertainty (SIGIR 1994), Graph Density (CVPR 2012), QUIRE (TPAMI 2014), SPAL (AAAI 2019), Query By Committee (ICML 1998), Random, BMDR (KDD 2013), LAL (NIPS 2017), Expected Error Reduction (ICML 2001)

AL for Multi-Label Data: AUDI (ICDM 2013), QUIRE (TPAMI 2014), Random, MMC (KDD 2009), Adaptive (IJCAI 2013)

AL by Querying Features: AFASMC (KDD 2018), Stability (ICDM 2013), Random

AL with Different Costs: HALC (IJCAI 2018), Random, Cost performance

AL with Noisy Oracles: CEAL (IJCAI 2017), IEthresh (KDD 2009), All, Random

AL with Novel Query Types: AURO (IJCAI 2015)

AL for Large Scale Tasks: Subsampling

Next, we introduce the usage common to all strategies. Note that different strategies take different parameters for initialization and selection, so please read their API references before use.

Usage

Initialize

All pre-defined query strategies in alipy (except Random) need the feature and label matrices of the whole data set at initialization. This lets a strategy select instances given only the indexes of the labeled and unlabeled sets. Note that the data matrices are held by reference, so they do NOT use additional memory.

To get a strategy object, import its class from alipy.query_strategy directly:
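The index-based design described above can be pictured with a toy class. This is a minimal sketch, not alipy's implementation: ToyStrategy and its trivial selection rule are hypothetical, and only plain Python lists are used. It shows a strategy keeping a reference to the data rather than a copy, and selecting purely from index lists.

```python
class ToyStrategy:
    """Hypothetical stand-in for a query strategy (not an alipy class)."""

    def __init__(self, X, y):
        # Plain assignment stores references; the data is not copied.
        self.X = X
        self.y = y

    def select(self, label_index, unlabel_index, batch_size=1):
        # Trivial rule for illustration: take the first unlabeled indexes.
        return list(unlabel_index)[:batch_size]


X = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
y = [0, 1, 0, 1]
strategy = ToyStrategy(X, y)

label_index = [0, 1]
unlabel_index = [2, 3]
selected = strategy.select(label_index, unlabel_index, batch_size=1)
```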

from alipy.query_strategy import (QueryInstanceQBC, QueryInstanceGraphDensity,
                                  QueryInstanceUncertainty, QueryRandom)
QBCStrategy = QueryInstanceQBC(X, y)
uncertainStrategy = QueryInstanceUncertainty(X, y)

One more important thing: each method has its own options at initialization. For example, in Uncertainty, the uncertainty metric can be one of ['least_confident', 'margin', 'entropy', 'distance_to_boundary']. In addition, GraphDensity and QUIRE need train_idx, the training indexes of one experiment fold, at initialization:

densityStrategy = QueryInstanceGraphDensity(X, y, train_idx=train_idx)

That is because these 2 methods construct a kernel matrix or a similar structure over the data, which could otherwise use information from the test set. Thus, you have to specify the indexes of the training set.

For the other methods, please refer to the API reference for a detailed introduction.
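To make this reasoning concrete, here is a minimal sketch of building a kernel matrix from the training rows only, so that held-out test rows never influence it. It is not how alipy builds its matrices internally: rbf_kernel_matrix, the gamma value, and the toy data are assumptions chosen for illustration.

```python
import math


def rbf_kernel_matrix(rows, gamma=1.0):
    """Pairwise RBF kernel exp(-gamma * ||a - b||^2) over the given rows."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [[math.exp(-gamma * sq_dist(a, b)) for b in rows] for a in rows]


# Toy data: the last row plays the role of a held-out test instance.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_idx = [0, 1, 2]

# Only the training rows take part in the kernel computation.
K_train = rbf_kernel_matrix([X[i] for i in train_idx])
```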
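As a rough guide to the uncertainty options listed earlier, the three probability-based metrics can be sketched in plain Python. The function names and example probabilities below are illustrative assumptions, not alipy code. The fourth option, distance_to_boundary, is left out because it typically relies on a model's decision function rather than on class probabilities.

```python
import math


def least_confident(probs):
    """1 minus the highest class probability; larger means more uncertain."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most probable classes; smaller means more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second


def entropy(probs):
    """Shannon entropy in nats; larger means more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)


confident = [0.9, 0.05, 0.05]   # the model strongly prefers one class
ambiguous = [0.4, 0.35, 0.25]   # the model is torn between classes
```

Under all three metrics, the ambiguous instance would be ranked as more informative than the confident one.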

Select

All strategies implement a select method. This method needs the indexes of the unlabeled pool and returns a subset of them, chosen according to the strategy's metric.

Some strategies need the prediction model for evaluating the unlabeled data in active selection. Since alipy is model independent, we provide several solutions for such methods:

1. Use select(label_index, unlabel_index, model=None, batch_size=1) if you are using an sklearn model.

model.fit(X[Lind.index], y[Lind.index])
select_ind = uncertainStrategy.select(Lind, Uind, batch_size=1, model=model)

It will use the given model object to get the necessary information from the unlabeled data. Note that in most cases the strategy needs probabilistic output, so please make sure your model has a predict_proba() method.

2. Use the default logistic regression model to choose the instances by passing None to the model parameter. (It will train a logistic regression model from the labeled set for evaluating the unlabeled data.)

select_ind = uncertainStrategy.select(Lind, Uind, batch_size=1, model=None)

3. Use select_by_prediction_mat() by providing the probabilistic prediction matrix of your own model; its shape is usually [n_samples, n_classes]. (Some strategies require other types of output; learn more in the API reference.)

predict_result = my_model.get_prediction_of_data(X[Uind.index])
select_ind = QBCStrategy.select_by_prediction_mat(unlabel_index=Uind,
                                                  predict=predict_result,
                                                  batch_size=1)
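To illustrate what selecting from a prediction matrix involves, the sketch below ranks unlabeled instances by the entropy of their predicted probabilities and maps the top rows back to pool indexes. It is an assumption-laden illustration rather than alipy's implementation: select_by_entropy is a hypothetical helper, and row i of predict is assumed to correspond to unlabel_index[i].

```python
import math


def select_by_entropy(unlabel_index, predict, batch_size=1):
    """Return the batch_size pool indexes whose predictions have highest entropy."""
    if len(unlabel_index) != len(predict):
        raise ValueError("predict must have one row per unlabeled index")
    scores = [-sum(p * math.log(p) for p in row if p > 0) for row in predict]
    # Rank row positions by score, then translate positions to pool indexes.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [unlabel_index[i] for i in ranked[:batch_size]]


unlabel_index = [12, 40, 7]
predict = [[0.95, 0.05],   # confident prediction
           [0.50, 0.50],   # maximally uncertain
           [0.70, 0.30]]   # moderately uncertain
selected = select_by_entropy(unlabel_index, predict, batch_size=2)
# selected == [40, 7]
```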

Note that not every strategy implements all of the above methods, and the parameter requirements differ as well. Please read the API reference before using each strategy.
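One defensive pattern, sketched here with hypothetical stand-in classes rather than real alipy or sklearn objects, is to check at runtime whether a model or strategy exposes the method you intend to call. The helper supports and all three classes are assumptions made for illustration.

```python
def supports(obj, method_name):
    """Return True if obj exposes a callable attribute with this name."""
    return callable(getattr(obj, method_name, None))


class HardLabelModel:
    """Hypothetical model that only predicts class labels."""

    def predict(self, X):
        return [0 for _ in X]


class ProbabilisticModel(HardLabelModel):
    """Hypothetical model that also exposes class probabilities."""

    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]


class IndexOnlyStrategy:
    """Hypothetical strategy offering select() but no matrix-based variant."""

    def select(self, label_index, unlabel_index, batch_size=1):
        return list(unlabel_index)[:batch_size]


can_use_proba = supports(ProbabilisticModel(), "predict_proba")
can_use_matrix = supports(IndexOnlyStrategy(), "select_by_prediction_mat")
```

A check like this only confirms that the method exists; the required parameters still have to be taken from the API reference.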

Implement your own strategy

If you are using a strategy you implemented yourself, the only requirement is that the indexes it selects form a subset of the unlabeled indexes.

select_ind = my_query(Uind, **kwargs)
# Use <= rather than <, so that selecting the whole pool is also accepted.
assert set(select_ind) <= set(Uind)
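As a minimal sketch of such a custom strategy, the function below samples at random from the unlabeled pool and then verifies the subset requirement. The batch_size and seed parameters are assumptions added for illustration, and Uind is a plain list here rather than an alipy index container.

```python
import random


def my_query(unlabel_index, batch_size=1, seed=None):
    """Hypothetical custom strategy: sample uniformly from the unlabeled pool."""
    rng = random.Random(seed)
    pool = list(unlabel_index)
    # Never ask for more instances than the pool contains.
    return rng.sample(pool, min(batch_size, len(pool)))


Uind = [3, 8, 15, 21]
select_ind = my_query(Uind, batch_size=2, seed=0)

# The only requirement: every selected index comes from the unlabeled pool.
assert set(select_ind) <= set(Uind)
```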