alipy.query_strategy is a module that implements classical and state-of-the-art query strategies for 7 different active learning settings. Please see the advanced guidelines for an introduction to each setting.
The implemented strategies fall into the following categories:
AL with Instance Selection: Uncertainty (SIGIR 1994), Graph Density (CVPR 2012), QUIRE (TPAMI 2014), SPAL (AAAI 2019), Query By Committee (ICML 1998), Random, BMDR (KDD 2013), LAL (NIPS 2017), Expected Error Reduction (ICML 2001)
AL for Multi-Label Data: AUDI (ICDM 2013), QUIRE (TPAMI 2014), Random, MMC (KDD 2009), Adaptive (IJCAI 2013)
AL by Querying Features: AFASMC (KDD 2018), Stability (ICDM 2013), Random
AL with Different Costs: HALC (IJCAI 2018), Random, Cost performance
AL with Noisy Oracles: CEAL (IJCAI 2017), IEthresh (KDD 2009), All, Random
AL with Novel Query Types: AURO (IJCAI 2015)
AL for Large Scale Tasks: Subsampling
Next, we introduce the common usage of all strategies. Note that different strategies take different parameters for initialization and selection, so please read the API reference before using them.
All the pre-defined query strategies in alipy (except Random) need the feature and label matrices of the whole data set at initialization. A strategy can then select instances given only the indexes of the labeled and unlabeled sets. Note that the data matrix is held by reference, so it does NOT use additional memory.
To get a strategy object, import it from alipy.query_strategy directly:
from alipy.query_strategy import (QueryInstanceQBC, QueryInstanceGraphDensity,
                                  QueryInstanceUncertainty, QueryRandom)
QBCStrategy = QueryInstanceQBC(X, y)
uncertainStrategy = QueryInstanceUncertainty(X, y)
One more important thing: each method has its own options at initialization. For example, for Uncertainty, the metric of uncertainty can be one of ['least_confident', 'margin', 'entropy', 'distance_to_boundary'].
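To make the first three metric names concrete, here is a minimal pure-Python sketch of how such scores are commonly computed from one row of class probabilities. The function names are illustrative only and this is not alipy's internal implementation; 'distance_to_boundary' is left out because it is usually derived from a model's decision values rather than from probabilities.

```python
import math

def least_confident(probs):
    """1 minus the probability of the most likely class; higher = more uncertain."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two most likely classes; smaller = more uncertain."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy in nats; higher = more uncertain."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Two hypothetical predictions over three classes.
confident = [0.9, 0.05, 0.05]
uncertain = [0.4, 0.35, 0.25]

for name, score in (("least_confident", least_confident),
                    ("margin", margin),
                    ("entropy", entropy)):
    print(name, round(score(confident), 3), round(score(uncertain), 3))
```

All three scores rank the second prediction as the more uncertain one, although margin does so with a smaller value while the other two use a larger one.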
GraphDensity and QUIRE also need train_idx, the training-set indexes of a single experiment fold, at initialization:
densityStrategy = QueryInstanceGraphDensity(X, y, train_idx=train_idx)
This is because these two methods construct a kernel matrix (or a similar structure) over the data, which could otherwise use information from the test set. You therefore have to specify the indexes of the training set.
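The following sketch illustrates the idea of restricting a kernel matrix to the training rows. It is a toy example under assumed names (rbf_kernel_on_subset and gamma are hypothetical) and does not reproduce how GraphDensity or QUIRE build their matrices.

```python
import math

def rbf_kernel_on_subset(X, idx, gamma=1.0):
    """Build an RBF kernel matrix using only the rows of X listed in idx."""
    rows = [X[i] for i in idx]

    def k(a, b):
        sq_dist = sum((u - v) ** 2 for u, v in zip(a, b))
        return math.exp(-gamma * sq_dist)

    return [[k(a, b) for b in rows] for a in rows]

X = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_idx = [0, 1, 2]          # row 3 is held out as test data
K = rbf_kernel_on_subset(X, train_idx)

# K is 3 x 3: the held-out row never enters the computation.
print(len(K), len(K[0]))
```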
For the other methods, please refer to the API reference for a detailed introduction.
All strategies implement a select method. This method needs the indexes of the unlabeled pool and returns a subset of them, chosen according to the strategy's own metric.
Some strategies need the prediction model for evaluating the unlabeled data in active selection. Since alipy is model independent, we provide several solutions for such methods:
1. Use select(label_index, unlabel_index, model=None, batch_size=1) if you are using a scikit-learn model.
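from sklearn.linear_model import LogisticRegression
model = LogisticRegression()  # example choice; most strategies need predict_proba()
model.fit(X[Lind.index], y[Lind.index])
select_ind = uncertainStrategy.select(Lind, Uind, batch_size=1, model=model)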
The strategy will use the given model object to get the necessary information from the unlabeled data. Note that in most cases the strategy needs probabilistic output, so please make sure your model has a predict_proba() method.
2. Use the default logistic regression model to choose the instances by passing None to the model parameter. (It will train a logistic regression model from the labeled set for evaluating the unlabeled data.)
select_ind = uncertainStrategy.select(Lind, Uind, batch_size=1, model=None)
3. Use select_by_prediction_mat() and provide the probabilistic prediction matrix of your own model, usually of shape [n_samples, n_classes]. (Some strategies require other types of output; learn more in the API reference.)
predict_result = my_model.get_prediction_of_data(X[Uind.index])
select_ind = QBCStrategy.select_by_prediction_mat(unlabel_index=Uind,
                                                  predict=predict_result,
                                                  batch_size=1)
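As a rough illustration of what selecting from a prediction matrix involves, the sketch below ranks unlabeled indexes by a least-confident score. The helper select_least_confident is hypothetical and is not alipy's select_by_prediction_mat(); it only assumes that row i of the matrix is the prediction for the i-th unlabeled index, as in the X[Uind.index] call above.

```python
def select_least_confident(unlabel_index, predict, batch_size=1):
    """Return the batch_size indexes whose top class probability is lowest."""
    if len(unlabel_index) != len(predict):
        raise ValueError("predict needs exactly one row per unlabeled index")
    # Pair each index with its row, then sort by the highest class probability.
    ranked = sorted(zip(unlabel_index, predict), key=lambda pair: max(pair[1]))
    return [idx for idx, _ in ranked[:batch_size]]

Uind = [4, 9, 11]              # indexes into the full data set
predict_result = [             # shape [n_samples, n_classes]
    [0.95, 0.05],
    [0.50, 0.50],
    [0.70, 0.30],
]
select_ind = select_least_confident(Uind, predict_result, batch_size=2)
print(select_ind)
```

The returned values are data-set indexes taken from Uind, not row positions in the matrix.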
Note that not every strategy implements all of the above methods, and their parameter requirements differ. Please read the API reference before using each strategy.
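Since method availability varies, one defensive option is to check for a method before calling it. The sketch below uses plain Python introspection with stand-in classes; supports, ProbabilisticModel and MarginOnlyModel are hypothetical names, not part of alipy or scikit-learn.

```python
def supports(obj, method_name):
    """True if obj exposes a callable attribute with the given name."""
    return callable(getattr(obj, method_name, None))

class ProbabilisticModel:
    """Stand-in for a model that can output class probabilities."""
    def predict_proba(self, X):
        return [[0.5, 0.5] for _ in X]

class MarginOnlyModel:
    """Stand-in for a model that only exposes decision values."""
    def decision_function(self, X):
        return [0.0 for _ in X]

for model in (ProbabilisticModel(), MarginOnlyModel()):
    print(type(model).__name__, supports(model, "predict_proba"))
```

The same check can be applied to a strategy object, for example with "select_by_prediction_mat" as the method name.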
If you are using a strategy you implemented yourself, the only requirement is that its selection is a subset of the unlabeled indexes.
select_ind = my_query(Uind, **kwargs)
assert set(select_ind) <= set(Uind)  # '<=' tests subset; '<' would reject selecting the whole pool
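A minimal custom strategy that satisfies this requirement might look like the sketch below. The function my_random_query and its seed parameter are illustrative assumptions, and a plain list stands in for the unlabeled index container.

```python
import random

def my_random_query(unlabel_index, batch_size=1, seed=None):
    """Pick batch_size distinct indexes uniformly at random from the unlabeled pool."""
    rng = random.Random(seed)
    pool = list(unlabel_index)
    # Never ask for more instances than the pool contains.
    return rng.sample(pool, min(batch_size, len(pool)))

Uind = [3, 7, 8, 12, 15]
select_ind = my_random_query(Uind, batch_size=2, seed=0)

# The selection must come from the unlabeled pool and contain no duplicates.
assert set(select_ind) <= set(Uind)
assert len(set(select_ind)) == len(select_ind)
```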
Copyright © 2018, alipy developers (BSD 3 License).