alipy.query_strategy.query_labels.QueryInstanceLAL

The key idea of LAL is to train a regressor that predicts the expected error reduction for a candidate sample in a particular learning state.

The regressor is trained on 2D datasets and can score unseen data from real datasets. The method yields strategies that work well on real data from a wide range of domains.

In alipy, LAL uses pre-extracted data provided by the authors to train the regressor. If no accepted file is found, the data file is downloaded automatically. You can also download 'LAL-iterativetree-simulatedunbalanced-big.npz' and 'LAL-randomtree-simulatedunbalanced-big.npz' manually from https://github.com/ksenia-konyushkova/LAL and specify the directory that contains them for training.

The implementation follows the reference code at https://github.com/ksenia-konyushkova/LAL/ directly.

References

----------

[1] Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. 2017. Learning Active Learning from Data. In The 31st Conference on Neural Information Processing Systems (NIPS 2017), 4228-4238.

Methods

__init__

__init__(self, X, y, mode='LAL_iterative', data_path='.', cls_est=50, train_slt=True, **kwargs)

Parameters:

X: 2D array: Feature matrix of the whole dataset. It is stored by reference, so no additional memory is used.
y: array-like: Label matrix of the whole dataset. It is stored by reference, so no additional memory is used.
mode: str, optional (default='LAL_iterative'): The mode of data sampling. Must be one of 'LAL_iterative', 'LAL_independent'.
data_path: str, optional (default='.'): Path to the directory that stores the data file for training. The file name should be 'LAL-iterativetree-simulatedunbalanced-big.npz' or 'LAL-randomtree-simulatedunbalanced-big.npz'. If no accepted file is detected, the pre-extracted data file is downloaded to the given path.
cls_est: int, optional (default=50): The number of estimators in the random forest that computes the features for the selector.
train_slt: bool, optional (default=True): Whether to train a selector during initialization.
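
The file lookup described for data_path can be checked before the strategy is constructed. The following is a minimal sketch that uses only the standard library. The helper name find_lal_data_file and the pairing of each mode with one file name ('LAL_iterative' with the iterativetree file, 'LAL_independent' with the randomtree file) are assumptions made for illustration; they are not part of the alipy API.

```python
from pathlib import Path

# Assumed pairing of sampling mode and data file name (inferred from the
# file names listed above; not confirmed by the alipy documentation).
LAL_DATA_FILES = {
    "LAL_iterative": "LAL-iterativetree-simulatedunbalanced-big.npz",
    "LAL_independent": "LAL-randomtree-simulatedunbalanced-big.npz",
}


def find_lal_data_file(data_path=".", mode="LAL_iterative"):
    """Return the Path of the data file for `mode` in `data_path`, or None if absent."""
    if mode not in LAL_DATA_FILES:
        raise ValueError("mode must be one of %s" % sorted(LAL_DATA_FILES))
    candidate = Path(data_path) / LAL_DATA_FILES[mode]
    return candidate if candidate.is_file() else None
```

If the helper returns None, the constructor would be expected to download the file to data_path, as described for that parameter.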

download_data

download_data(self)

Download the training data used to train the regressor that scores unlabeled data.

train_selector_from_file

train_selector_from_file(self, file_path=None, reg_est=2000, reg_depth=40, feat=6)

Train a random forest as the instance selector. Note that if the forest parameters are too large for your machine, training will take a long time.

Parameters:

file_path: str, optional (default=None): The path to the specific data file.
reg_est: int, optional (default=2000): The number of estimators of the forest.
reg_depth: int, optional (default=40): The maximum depth of the trees in the forest.
feat: int, optional (default=6): The number of features the forest considers at each split.
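
Because the default forest (2000 estimators, depth 40) can be slow to train, one option is to skip training in the constructor and train a smaller forest afterwards. The following is a minimal sketch, assuming that train_slt=False defers all selector training and that a data file is already on disk. The function name build_light_lal and the reduced values 200 and 20 are illustrative choices, not recommendations from the authors.

```python
def build_light_lal(X, y, file_path, reg_est=200, reg_depth=20, feat=6):
    """Create a QueryInstanceLAL whose selector is a smaller forest than the default."""
    # Imported here so the sketch can be loaded even where alipy is not installed.
    from alipy.query_strategy.query_labels import QueryInstanceLAL

    # train_slt=False is assumed to skip selector training during construction.
    strategy = QueryInstanceLAL(X, y, mode="LAL_iterative", train_slt=False)
    # Train the selector explicitly with the reduced forest parameters.
    strategy.train_selector_from_file(
        file_path=file_path, reg_est=reg_est, reg_depth=reg_depth, feat=feat
    )
    return strategy
```

A smaller forest trains faster but may score candidates less accurately than the default settings.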

select

select(self, label_index, unlabel_index, batch_size=1, **kwargs)

Select indexes from the unlabel_index for querying.

Parameters:

label_index: {list, np.ndarray, IndexCollection}: The indexes of labeled samples.
unlabel_index: {list, np.ndarray, IndexCollection}: The indexes of unlabeled samples.
batch_size: int, optional (default=1): Selection batch size.

Returns:

selected_idx: list: The selected indexes, which are a subset of unlabel_index.
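
A typical use of select is inside a query loop that moves each selected index from the unlabeled pool to the labeled set. The following is a minimal sketch under stated assumptions: the function name lal_query_loop and its arguments are hypothetical, plain Python lists stand in for IndexCollection, and the step that obtains labels from an oracle is left as a comment.

```python
def lal_query_loop(X, y, initial_labeled, n_queries=5, data_path="."):
    """Run a few LAL queries, moving each selected index to the labeled set."""
    # Imported here so the sketch can be loaded even where alipy is not installed.
    from alipy.query_strategy.query_labels import QueryInstanceLAL

    label_index = list(initial_labeled)
    labeled = set(label_index)
    unlabel_index = [i for i in range(len(y)) if i not in labeled]

    strategy = QueryInstanceLAL(X, y, mode="LAL_iterative", data_path=data_path)
    for _ in range(n_queries):
        if not unlabel_index:
            break  # nothing left to query
        selected = strategy.select(label_index, unlabel_index, batch_size=1)
        # In a real experiment, the labels of `selected` would be requested here.
        for idx in selected:
            label_index.append(idx)
            unlabel_index.remove(idx)
    return label_index, unlabel_index
```

The loop relies only on the documented property that the returned indexes are a subset of unlabel_index.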