query_strategy.query_labels.QueryInstanceUncertainty

QueryInstanceUncertainty(X=None, y=None, measure='entropy')

Uncertainty query strategy. The implemented uncertainty measures include:

1. margin sampling

2. least confident

3. entropy

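The above measures require the probabilistic output of the model. A fourth measure, 'distance_to_boundary', uses the model's 'decision_function' output instead and is only available in binary classification.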

There are three ways to select instances from the dataset:

1. Use select with your own sklearn model.

2. Use select with the default logistic regression model by passing None as the model parameter.

3. Use select_by_prediction_mat with the probabilistic prediction matrix of your own model, of shape [n_samples, n_classes].

Methods

__init__(self, X=None, y=None, measure='entropy')

Parameters:

X: 2D array, optional (default=None): Feature matrix of the whole dataset. It is held by reference, so no additional memory is used.
y: array-like, optional (default=None): Label matrix of the whole dataset. It is held by reference, so no additional memory is used.
measure: str, optional (default='entropy'): Measure used to calculate uncertainty. Must be one of
['least_confident', 'margin', 'entropy', 'distance_to_boundary'].
--'least_confident': x* = argmax 1 - P(y_hat|x), where y_hat = argmax P(yi|x).
--'margin': x* = argmin P(y_hat1|x) - P(y_hat2|x), where y_hat1 and y_hat2 are the first and second
most probable class labels under the model, respectively.
--'entropy': x* = argmax -sum(P(yi|x) log P(yi|x)).
--'distance_to_boundary': Only available in binary classification. x* = argmin |f(x)|;
the model must have a 'decision_function' method that returns a 1D array.
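The 'least_confident', 'margin' and 'distance_to_boundary' criteria can be sketched in plain Python as follows. This is only an illustration of the formulas, not the library's implementation: the helper names and the sample probabilities are hypothetical, and a small margin is treated as high uncertainty, which is how margin sampling is usually defined.

```python
def least_confident_scores(proba):
    """Score each row as 1 - max class probability (higher = more uncertain)."""
    return [1.0 - max(row) for row in proba]


def margin_scores(proba):
    """Score each row as P(top class) - P(runner-up) (lower = more uncertain)."""
    scores = []
    for row in proba:
        top, runner_up = sorted(row, reverse=True)[:2]
        scores.append(top - runner_up)
    return scores


def boundary_distances(decision_values):
    """Score each instance as |f(x)| (lower = closer to the decision boundary)."""
    return [abs(value) for value in decision_values]


# Hypothetical class probabilities for three instances over three classes.
proba = [
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # ambiguous prediction
    [0.60, 0.30, 0.10],
]
lc = least_confident_scores(proba)
mg = margin_scores(proba)

# Position of the most uncertain instance under each criterion.
most_uncertain_lc = max(range(len(proba)), key=lc.__getitem__)
most_uncertain_mg = min(range(len(proba)), key=mg.__getitem__)
```

In this toy data both criteria pick the second row, but they can disagree when there are more than two classes, because 'least_confident' looks only at the top probability while 'margin' also considers the runner-up.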

select(self, label_index, unlabel_index, model=None, batch_size=1)

Select indexes from unlabel_index for querying.

Parameters:

label_index: {list, np.ndarray, IndexCollection}: The indexes of labeled samples.
unlabel_index: {list, np.ndarray, IndexCollection}: The indexes of unlabeled samples.
model: object, optional (default=None): Current classification model. It must have a 'predict_proba' method for probabilistic output.
If not provided, sklearn's LogisticRegression with default parameters is used.
batch_size: int, optional (default=1): Selection batch size.

Returns:

selected_idx: list: The selected indexes, which are a subset of unlabel_index.

select_by_prediction_mat(self, unlabel_index, predict, batch_size=1)

Select indexes from unlabel_index for querying, using a precomputed prediction matrix.

Parameters:

unlabel_index: {list, np.ndarray, IndexCollection}: The indexes of unlabeled samples. Must be in one-to-one correspondence with the rows of the prediction matrix.
predict: 2d array, shape [n_samples, n_classes] or [n_samples]: The probabilistic prediction matrix for the unlabeled set.
batch_size: int, optional (default=1): Selection batch size.

Returns:

selected_idx: list: The selected indexes, which are a subset of unlabel_index.
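The one-to-one correspondence between unlabel_index and the rows of predict matters because scores are computed per row and then mapped back to dataset indexes. A minimal sketch of that mapping, assuming a least-confident score and using a hypothetical helper name (this is not the library's code), might look like this:

```python
import heapq


def pick_most_uncertain(unlabel_index, predict, batch_size=1):
    """Return the batch_size dataset indexes whose predictions are least confident.

    Row i of predict is assumed to be the prediction for unlabel_index[i].
    """
    if len(unlabel_index) != len(predict):
        raise ValueError("unlabel_index and predict must have the same length")
    # Least-confident score per row; another measure could be swapped in here.
    scores = [1.0 - max(row) for row in predict]
    # Positions of the highest scores, most uncertain first.
    top_positions = heapq.nlargest(
        batch_size, range(len(scores)), key=scores.__getitem__
    )
    # Translate row positions back into dataset indexes.
    return [unlabel_index[pos] for pos in top_positions]


# Hypothetical unlabeled pool: dataset indexes and matching probability rows.
unlabel_index = [12, 47, 3, 88]
predict = [
    [0.95, 0.05],
    [0.55, 0.45],
    [0.80, 0.20],
    [0.50, 0.50],
]
selected = pick_most_uncertain(unlabel_index, predict, batch_size=2)
```

The returned values are dataset indexes (here taken from unlabel_index), not row positions in predict, which is why the two inputs must be aligned.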

@classmethod
calc_entropy(cls, predict_proba)

Calculate the entropy of each instance.

Parameters:

predict_proba: array-like, shape [n_samples, n_classes]: Probability prediction for each instance.

Returns:

entropy: list: 1D list containing the entropy of each instance.
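As a rough illustration of the computation, the per-instance entropy -sum(P(yi|x) log P(yi|x)) can be written in plain Python as below. The function name is hypothetical, and the natural logarithm and the handling of zero probabilities are assumptions; the library's own choices may differ.

```python
import math


def entropy_per_instance(predict_proba):
    """Return the Shannon entropy (natural log) of each probability row."""
    entropies = []
    for row in predict_proba:
        # Skip zero probabilities, since p * log(p) tends to 0 as p -> 0.
        entropies.append(-sum(p * math.log(p) for p in row if p > 0))
    return entropies


# A certain prediction, a uniform binary one, and a uniform four-class one.
values = entropy_per_instance(
    [[1.0, 0.0], [0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
)
```

A uniform distribution over k classes gives the maximum entropy log(k), so the third row scores higher than the second even though both are maximally uncertain for their number of classes.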