query_strategy.query_labels.QueryInstanceQBC

QueryInstanceQBC(X=None, y=None, method='query_by_bagging', disagreement='vote_entropy')

The Query-By-Committee (QBC) algorithm.

QBC minimizes the version space, which is the set of hypotheses that are consistent with the current labeled training data.

This class implements the query-by-bagging method, which uses the bagging in sklearn to construct the committee, so your model should be a sklearn model. If it is not, you can use the default logistic regression model by passing None as the model.

There are 3 ways to select instances in the data set.

1. Use select if you are using a sklearn model.

2. Use the default logistic regression model to choose the instances by passing None to the model parameter.

3. Use select_by_prediction_mat by providing the prediction matrix of each committee member. Each member's prediction matrix should have the shape [n_samples, n_classes] for probabilistic output or [n_samples] for class output.

References

[1] H.S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of the ACM Workshop on Computational Learning Theory, pages 287-294, 1992.

[2] N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the International Conference on Machine Learning (ICML), pages 1–9. Morgan Kaufmann, 1998.

Methods

__init__

__init__(self, X=None, y=None, method='query_by_bagging', disagreement='vote_entropy')

Parameters:

X: 2D array, optional (default=None): Feature matrix of the whole dataset. It is stored by reference, so it does not use additional memory.
y: array-like, optional (default=None): Label matrix of the whole dataset. It is stored by reference, so it does not use additional memory.
method: str, optional (default='query_by_bagging'): Method name. This class only implements query_by_bagging for now.
disagreement: str, optional (default='vote_entropy'): Method used to calculate the disagreement among committee members. Should be one of ['vote_entropy', 'KL_divergence'].

select

select(self, label_index, unlabel_index, model=None, batch_size=1, n_jobs=None)

Select indexes from the unlabel_index for querying.

Parameters:

label_index: {list, np.ndarray, IndexCollection}: The indexes of labeled samples.
unlabel_index: {list, np.ndarray, IndexCollection}: The indexes of unlabeled samples.
model: object, optional (default=None): Current classification model. It should have a 'predict_proba' method for probabilistic output.
If not provided, the LogisticRegression implemented by sklearn is used with default parameters.
batch_size: int, optional (default=1): Selection batch size.
n_jobs: int, optional (default=None): How many threads will be used to train the bagging committee.

Returns:

selected_idx: list: The selected indexes, which are a subset of unlabel_index.
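
The committee construction described for select can be illustrated with a short sketch. This is not the library's code: bagging_committee_predictions, n_members, seed, and the toy data are hypothetical names and values chosen for illustration. It assumes scikit-learn's BaggingClassifier is used to train the committee on the labeled rows, and that each fitted member then predicts class probabilities for the unlabeled rows.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression


def bagging_committee_predictions(X, y, label_index, unlabel_index,
                                  model=None, n_members=10, n_jobs=None, seed=0):
    """Hypothetical helper: one probability matrix per bagged committee member."""
    if model is None:
        # Mirrors the documented fallback: LogisticRegression with defaults.
        model = LogisticRegression()
    # The estimator is passed positionally because its keyword name differs
    # between scikit-learn releases.
    bagging = BaggingClassifier(model, n_estimators=n_members,
                                n_jobs=n_jobs, random_state=seed)
    bagging.fit(X[label_index], y[label_index])
    # Assumes every member was trained on all classes, so all members
    # return matrices of the same shape [n_samples, n_classes].
    return [member.predict_proba(X[unlabel_index])
            for member in bagging.estimators_]


# Toy data: two Gaussian blobs, 20 labeled rows and 80 unlabeled rows.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)),
               rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
label_index = list(range(0, 100, 5))
unlabel_index = [i for i in range(100) if i not in label_index]

matrices = bagging_committee_predictions(X, y, label_index, unlabel_index)
```

The resulting list has the layout that select_by_prediction_mat expects for its predict argument: one [n_samples, n_classes] matrix per committee member, with rows in the same order as unlabel_index.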

select_by_prediction_mat

select_by_prediction_mat(self, unlabel_index, predict, batch_size=1)

Select indexes from the unlabel_index for querying.

Parameters:

unlabel_index: {list, np.ndarray, IndexCollection}: The indexes of unlabeled samples. They should be in one-to-one
correspondence with the rows of the prediction matrix.
predict: list: The prediction matrix of each committee member.
Each member's prediction matrix should have the shape [n_samples, n_classes] for probabilistic output
or [n_samples] for class output.
batch_size: int, optional (default=1): Selection batch size.

Returns:

selected_idx: list: The selected indexes, which are a subset of unlabel_index.
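
Because unlabel_index must correspond one-to-one with the rows of the prediction matrices, the selection step can be pictured as ranking per-row disagreement scores and mapping the best rows back to their indexes. The sketch below assumes that higher scores mean more disagreement and that the batch is simply the top batch_size entries; select_top_scores is a hypothetical helper, not part of the documented API.

```python
import numpy as np


def select_top_scores(unlabel_index, scores, batch_size=1):
    """Return the batch_size entries of unlabel_index with the highest scores."""
    unlabel_index = np.asarray(unlabel_index)
    scores = np.asarray(scores, dtype=float)
    if len(unlabel_index) != len(scores):
        raise ValueError("unlabel_index and scores must have the same length")
    # Stable sort on the negated scores: highest first, ties keep input order.
    order = np.argsort(-scores, kind="stable")[:batch_size]
    return unlabel_index[order].tolist()


picked = select_top_scores([10, 11, 12, 13], [0.1, 0.9, 0.3, 0.9], batch_size=2)
```

Here row 1 and row 3 share the highest score, so the call returns the dataset indexes stored at those positions.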

calc_vote_entropy

calc_vote_entropy(cls, predict_matrices)

Calculate the vote entropy for measuring the level of disagreement in QBC.

References

[1] I. Dagan and S. Engelson. Committee-based sampling for training probabilistic classifiers. In Proceedings of the International Conference on Machine Learning (ICML), pages 150–157. Morgan Kaufmann, 1995.

Parameters:

predict_matrices: list: The prediction matrix of each committee member. Each member's prediction matrix should have the shape [n_samples, n_classes] for probabilistic output or [n_samples] for class output.

Returns:

score: list: Score for each instance. Shape [n_samples].
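
As an illustration of the measure, vote entropy is commonly defined per instance as -sum_c (V(c)/K) * log(V(c)/K), where V(c) is the number of committee members voting for class c and K is the committee size. The sketch below follows that textbook definition; it is not taken from the library's source, and the function name vote_entropy and the use of the natural logarithm are assumptions. Probabilistic matrices are reduced to hard votes with argmax, and all matrices are assumed to be of the same kind.

```python
import numpy as np


def vote_entropy(predict_matrices):
    """Vote entropy per instance from a list of committee predictions."""
    votes = []
    for mat in predict_matrices:
        mat = np.asarray(mat)
        # [n_samples, n_classes] -> hard vote per row; [n_samples] is used as is.
        votes.append(mat.argmax(axis=1) if mat.ndim == 2 else mat)
    votes = np.stack(votes, axis=1)            # shape [n_samples, n_members]
    n_members = votes.shape[1]
    classes = np.unique(votes)
    # counts[i, j]: number of members voting classes[j] for instance i.
    counts = np.stack([(votes == c).sum(axis=1) for c in classes], axis=1)
    frac = counts / n_members
    # Treat 0 * log(0) as 0 by replacing zero fractions with 1 inside the log.
    safe = np.where(frac > 0, frac, 1.0)
    return (-(frac * np.log(safe)).sum(axis=1)).tolist()


# Three members, two instances: unanimous on the first, split 1-2 on the second.
scores = vote_entropy([[0, 0], [0, 1], [0, 1]])
```

A unanimous committee gives a score of zero, and the score grows as the votes spread more evenly across classes.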

calc_avg_KL_divergence

calc_avg_KL_divergence(cls, predict_matrices)

Calculate the average Kullback-Leibler (KL) divergence for measuring the level of disagreement in QBC.

References

[1] A. McCallum and K. Nigam. Employing EM in pool-based active learning for text classification. In Proceedings of the International Conference on Machine Learning (ICML), pages 359–367. Morgan Kaufmann, 1998.

Parameters:

predict_matrices: list: The prediction matrix of each committee member. Each member's prediction matrix should have the shape [n_samples, n_classes] for probabilistic output or [n_samples] for class output.

Returns:

score: list: Score for each instance. Shape [n_samples].
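
As an illustration of the measure, the average KL divergence is commonly defined per instance as (1/K) * sum_m KL(P_m || P_C), where P_m is the class distribution predicted by member m and P_C is the consensus, taken as the mean of the members' distributions. The sketch below follows that textbook definition rather than the library's source. The name avg_kl_divergence and the eps guard are assumptions, and it only handles probabilistic [n_samples, n_classes] input.

```python
import numpy as np


def avg_kl_divergence(predict_matrices, eps=1e-12):
    """Average KL divergence to the consensus distribution, per instance."""
    probs = np.stack([np.asarray(m, dtype=float) for m in predict_matrices])
    if probs.ndim != 3:
        raise ValueError("expected a list of [n_samples, n_classes] matrices")
    consensus = probs.mean(axis=0)             # shape [n_samples, n_classes]
    # Ratio P_m / P_C; entries with P_m == 0 are set to 1 so they contribute 0.
    ratio = np.where(probs > 0, probs / np.maximum(consensus, eps), 1.0)
    kl = (probs * np.log(ratio)).sum(axis=2)   # shape [n_members, n_samples]
    return kl.mean(axis=0).tolist()


# Two members: they agree on the first instance and fully disagree on the second.
member_a = [[0.5, 0.5], [1.0, 0.0]]
member_b = [[0.5, 0.5], [0.0, 1.0]]
kl_scores = avg_kl_divergence([member_a, member_b])
```

Identical predictions give a score of zero, while two members that put all their mass on different classes give log(2) under the natural logarithm.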