Large-scale active learning

Many effective active learning query strategies are expensive and cannot be applied to large-scale data: their time complexity grows rapidly with the size of the unlabeled pool because they require matrix inversion and other costly operations. Simple baseline methods, e.g., uncertainty and random sampling, are efficient but less effective. As a result, active learning on large-scale data remains an open problem.

One simple way to deal with large-scale data is to reuse the existing methods instead of proposing a new approach. Fortunately, alipy provides a solution to achieve this goal. Since the time complexity depends on the number of candidate instances, we can simply sample a subset of the whole unlabeled pool (e.g., 10% of all instances) and then use the existing methods to select instances for querying from that subset. Here we call this method subsampling, and the method that queries from the entire pool fullset.

One might expect subsampling to reduce performance. However, in our experiments, the learning curves of subsampling and fullset are very similar, while subsampling is much more efficient than fullset. This makes subsampling a quite practical method when we need an approximate result in a limited time.
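To see the trade-off concretely, here is a minimal sketch (not part of alipy's own examples) that times a single QBC query against the full unlabeled pool and against a 10% subsample. The synthetic dataset, pool sizes, and sampling rate are illustrative assumptions, not values from the library:

import time
from sklearn.datasets import make_classification
from alipy.index import IndexCollection
from alipy.query_strategy import QueryInstanceQBC

# A synthetic pool; all sizes here are arbitrary choices for illustration.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
label_ind = IndexCollection(list(range(50)))        # small initial labeled set
unlab_ind = IndexCollection(list(range(50, 5000)))  # remaining unlabeled pool

strategy = QueryInstanceQBC(X, y)

start = time.time()  # fullset: query against the whole unlabeled pool
strategy.select(label_ind, unlab_ind, model=None, batch_size=1)
print('fullset:     %.3fs' % (time.time() - start))

start = time.time()  # subsampling: query against a 10% random subset
strategy.select(label_ind, unlab_ind.random_sampling(rate=0.1), model=None, batch_size=1)
print('subsampling: %.3fs' % (time.time() - start))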

In alipy, we implement the random_sampling(self, rate=0.3) method for the alipy.index.IndexCollection class and its variants (e.g., MultiLabelIndexCollection). It returns a subset of the container drawn at the specified sampling rate.

>>> from alipy.index import IndexCollection
>>> a = [1,2,3]
>>> a_ind = IndexCollection(a)
>>> print(a_ind.random_sampling(rate=0.7))
[1, 3]
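Note that the subset is drawn at random, so the output above is only one possible result; repeated calls may return different indices.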

To query from the subset, you can simply pass the sub-sampled unlabeled set to the strategy's select method:

# query from a 20% random subset of the unlabeled pool
select_ind = QBCStrategy.select(label_ind, unlab_ind.random_sampling(rate=0.2), model=None, batch_size=1)
# move the queried indices from the unlabeled set to the labeled set
label_ind.update(select_ind)
unlab_ind.difference_update(select_ind)
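For context, the snippet above can be embedded in a full query loop as follows. This is a minimal sketch, assuming QBCStrategy is a QueryInstanceQBC instance; the synthetic dataset, initial split, 20% sampling rate, and budget of 10 queries are all illustrative choices:

from sklearn.datasets import make_classification
from alipy.index import IndexCollection
from alipy.query_strategy import QueryInstanceQBC

# A synthetic pool; sizes, rate, and budget are illustrative assumptions.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
label_ind = IndexCollection(list(range(50)))        # small initial labeled set
unlab_ind = IndexCollection(list(range(50, 5000)))  # remaining unlabeled pool

QBCStrategy = QueryInstanceQBC(X, y)

for _ in range(10):  # query budget (arbitrary)
    # select from a 20% random subset of the unlabeled pool
    select_ind = QBCStrategy.select(label_ind, unlab_ind.random_sampling(rate=0.2),
                                    model=None, batch_size=1)
    label_ind.update(select_ind)
    unlab_ind.difference_update(select_ind)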
