Missing features are a serious problem in many applications: they lower the quality of the training data and can significantly degrade learning performance. Since feature acquisition usually involves special devices or complex processes, it is expensive to acquire all feature values for the whole dataset. Some works therefore address this problem with active learning.
To support the feature querying setting, alipy also provides various tool classes.
In alipy, the feature matrix is treated as a special multi-label matrix, so the tools for feature querying are used much like those for the multi-label setting.
alipy only records the indexes of the data, so we first define the feature index.
A feature index has the same form as a multi-label index: a tuple with 2 elements, where the first element is the index of the instance and the second is the index (or indexes) of its features.
Some examples of valid feature indexes include:
queried_index = (1, [3,4]) # query the 4th and 5th features of the 2nd instance
queried_index = (1, [3])
queried_index = (1, 3)
queried_index = (1, (3))
queried_index = (1, (3,4))
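All of these forms describe the same kind of entry: the second element, whether a bare int or an iterable, expands element-wise into (instance, feature) pairs. A minimal sketch of that expansion in plain Python (the helper name `expand_feature_index` is ours, not part of alipy):

```python
def expand_feature_index(queried_index):
    """Expand an (instance, features) tuple into per-entry (row, col) pairs."""
    row, cols = queried_index
    # Note: (3) is just the int 3 in Python; only (3,) or [3] are containers.
    if not hasattr(cols, '__iter__'):
        cols = [cols]
    return [(row, c) for c in cols]

print(expand_feature_index((1, [3, 4])))  # [(1, 3), (1, 4)]
print(expand_feature_index((1, 3)))       # [(1, 3)]
```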
To split a dataset for feature querying, use the
alipy.data_manipulate.split_features
function. It splits the dataset into a training and a testing set, and masks some entries
of the training feature matrix as missing.
Note that the returned label and unlabel index sets contain the feature indexes defined above.
import numpy as np
from alipy.data_manipulate import split_features

X = np.random.rand(10, 2)  # 10 instances with 2 features
train, test, lab, unlab = split_features(feature_matrix=X, test_ratio=0.5,
                                         missing_rate=0.5, split_count=1)
The values in train, test, lab and unlab are:
[array([7, 5, 4, 3, 9])]
[array([2, 8, 1, 6, 0])]
[[(3, 0), (5, 0), (7, 1), (9, 0), (9, 1)]]
[[(7, 0), (5, 1), (4, 0), (4, 1), (3, 1)]]
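If you need a boolean observed-entry mask instead of index tuples (e.g. for your own completion code), the label set converts directly. A sketch in plain numpy, using the label indexes printed above as literals:

```python
import numpy as np

# The label set from the split above, written out as (instance, feature) pairs
label_pairs = [(3, 0), (5, 0), (7, 1), (9, 0), (9, 1)]

mask = np.zeros((10, 2), dtype=bool)   # same shape as the feature matrix X
rows, cols = zip(*label_pairs)
mask[list(rows), list(cols)] = True    # mark observed entries

print(int(mask.sum()))  # 5 observed entries
```

alipy's index collections provide the same conversion via get_matrix_mask, which is used later on this page.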
alipy.index.FeatureIndexCollection
is the tool class for feature index management.
It has the same usage and interface as
MultiLabelIndexCollection
Here is an example usage of this class.
>>> from alipy.index import FeatureIndexCollection
>>> fea_ind1 = FeatureIndexCollection([(0, 1), (0, 2), (0, (3, 4)), (1, (0, 1))], feature_size=5)
>>> fea_ind1
{(0, 1), (1, 1), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> fea_ind1.update((0, 0))
{(0, 1), (0, 0), (1, 1), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> fea_ind1.update([(1, 2), (1, (3, 4))])
{(0, 1), (1, 2), (0, 0), (1, 3), (1, 4), (1, 1), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> fea_ind1.difference_update([(0, [3, 4, 2])])
{(0, 1), (1, 2), (0, 0), (1, 3), (1, 4), (1, 1), (1, 0)}
alipy.oracle.OracleQueryFeatures
is the oracle class for feature querying.
It has the same usage and interface as
OracleQueryMultiLabel
.
You can initialize this class by providing the feature matrix.
from alipy.oracle import OracleQueryFeatures
oracle = OracleQueryFeatures(feature_mat=X)
When querying, provide a single valid feature index as defined above, or a list of them.
feature, cost = oracle.query_by_index((1, 2)) # query the 3rd feature of the 2nd instance
features, cost = oracle.query_by_index([(1, 2), (0, 1)])
Just as some instance-querying methods need to evaluate the unlabeled data, feature-querying methods need to evaluate the missing entries. One direct way is to use matrix completion methods.
ALiPy implements 2 completion algorithms, which are by-products of implementing the query strategies.
AFASMC_mc (KDD'18) : This method completes the matrix using the supervised information.
IterativeSVD_mc (KDD'18) : The classical iterative SVD method for matrix completion.
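The idea behind iterative SVD completion can be sketched in a few lines of numpy (an illustration only, not alipy's implementation): alternately take a low-rank truncated SVD of the current estimate and re-impose the observed entries.

```python
import numpy as np

def svd_complete(X, observed_mask, rank, n_iter=100):
    """Fill missing entries by alternating a rank-`rank` truncated SVD
    with re-imposing the observed entries."""
    X_hat = np.where(observed_mask, X, 0.0)   # start: missing entries = 0
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        X_hat = np.where(observed_mask, X, low_rank)  # keep observed fixed
    return X_hat

# Demo: hide a few entries of an exactly rank-1 matrix and recover them
rng = np.random.default_rng(0)
X_true = np.outer(rng.random(8), rng.random(5))   # rank-1, 8 x 5
mask = rng.random(X_true.shape) > 0.25            # ~75% observed
X_hat = svd_complete(X_true, mask, rank=1)
```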
ALiPy provides several existing algorithms for experimental comparison:
AFASMC (KDD 2018) : This method completes the matrix with the proposed method first, and then selects missing entries based on the variance of the completion results.
Stability (ICDM 2013) : This method uses SVD matrix completion algorithms with different rank parameters to construct a committee. Then, it selects missing entries based on the variance of the completion results.
Random : This method selects missing entries for querying randomly.
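The Random baseline is trivial to sketch: pick one unknown (instance, feature) entry uniformly at random each round (plain Python; the function name is ours, not alipy's):

```python
import random

def select_random_entry(unknown_entries, seed=None):
    """Pick one missing (instance, feature) entry uniformly at random,
    as a random feature-querying baseline does each round."""
    rng = random.Random(seed)
    # sort for a deterministic ordering before sampling from the set
    return rng.choice(sorted(unknown_entries))

unknown = {(0, 1), (2, 0), (2, 1)}
picked = select_random_entry(unknown, seed=0)
print(picked in unknown)  # True
```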
For their detailed usage, please refer to their API pages and the following example code.
import copy
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from alipy.data_manipulate.al_split import split_features
from alipy.query_strategy.query_features import QueryFeatureAFASMC, QueryFeatureRandom, QueryFeatureStability, \
AFASMC_mc, IterativeSVD_mc
from alipy.index import MultiLabelIndexCollection
from alipy.experiment.stopping_criteria import StoppingCriteria
from alipy.experiment import StateIO, State, ExperimentAnalyser
from alipy.metrics import accuracy_score
from alipy.index import map_whole_index_to_train
# load and split data
X, y = make_classification(n_samples=800, n_features=20, n_informative=2,
                           n_redundant=2, n_repeated=0, n_classes=2,
                           n_clusters_per_class=1, weights=None, flip_y=0.01,
                           hypercube=True, shift=0.0, scale=1.0, shuffle=True,
                           random_state=None)
tr, te, lab, unlab = split_features(feature_matrix=X, test_ratio=0.3,
                                    missing_rate=0.5, split_count=10,
                                    saving_path=None)
# Use the default Logistic Regression classifier
model = LogisticRegression()
# The budget is 50 queries
stopping_criterion = StoppingCriteria('num_of_queries', 50)
AFASMC_result = []
rand_result = []
Stable_result = []
# AFASMC
for i in range(5):  # run on the first 5 of the 10 splits
    train_idx = tr[i]
    test_idx = te[i]
    label_ind = MultiLabelIndexCollection(lab[i], label_size=X.shape[1])
    unlab_ind = MultiLabelIndexCollection(unlab[i], label_size=X.shape[1])
    saver = StateIO(i, train_idx, test_idx, label_ind, unlab_ind)
    strategy = QueryFeatureAFASMC(X=X, y=y, train_idx=train_idx)
    while not stopping_criterion.is_stop():
        # query (note: alipy spells this parameter 'unkonwn_entries')
        selected_feature = strategy.select(observed_entries=label_ind, unkonwn_entries=unlab_ind)
        # update index
        label_ind.update(selected_feature)
        unlab_ind.difference_update(selected_feature)
        # train/test
        lab_in_train = map_whole_index_to_train(train_idx, label_ind)
        X_mc = AFASMC_mc(X=X[train_idx], y=y[train_idx], omega=lab_in_train)
        model.fit(X_mc, y[train_idx])
        pred = model.predict(X[test_idx])
        perf = accuracy_score(y_true=y[test_idx], y_pred=pred)
        # save
        st = State(select_index=selected_feature, performance=perf)
        saver.add_state(st)
        # saver.save()
        stopping_criterion.update_information(saver)
    stopping_criterion.reset()
    AFASMC_result.append(copy.deepcopy(saver))
SVD_mc = IterativeSVD_mc(rank=4)  # completer shared by Stability and Random
# Stability
for i in range(5):
    train_idx = tr[i]
    test_idx = te[i]
    label_ind = MultiLabelIndexCollection(lab[i], label_size=X.shape[1])
    unlab_ind = MultiLabelIndexCollection(unlab[i], label_size=X.shape[1])
    saver = StateIO(i, train_idx, test_idx, label_ind, unlab_ind)
    strategy = QueryFeatureStability(X=X, y=y, train_idx=train_idx, rank_arr=[4, 6, 8])
    while not stopping_criterion.is_stop():
        # query
        selected_feature = strategy.select(observed_entries=label_ind, unkonwn_entries=unlab_ind)
        # update index
        label_ind.update(selected_feature)
        unlab_ind.difference_update(selected_feature)
        # train/test
        lab_in_train = map_whole_index_to_train(train_idx, label_ind)
        X_mc = SVD_mc.impute(X[train_idx],
                             observed_mask=lab_in_train.get_matrix_mask(mat_shape=(len(train_idx), X.shape[1]), sparse=False))
        model.fit(X_mc, y[train_idx])
        pred = model.predict(X[test_idx])
        perf = accuracy_score(y_true=y[test_idx], y_pred=pred)
        # save
        st = State(select_index=selected_feature, performance=perf)
        saver.add_state(st)
        stopping_criterion.update_information(saver)
    stopping_criterion.reset()
    Stable_result.append(copy.deepcopy(saver))
# Random
for i in range(5):
    train_idx = tr[i]
    test_idx = te[i]
    label_ind = MultiLabelIndexCollection(lab[i], label_size=X.shape[1])
    unlab_ind = MultiLabelIndexCollection(unlab[i], label_size=X.shape[1])
    saver = StateIO(i, train_idx, test_idx, label_ind, unlab_ind)
    strategy = QueryFeatureRandom()
    while not stopping_criterion.is_stop():
        # query
        selected_feature = strategy.select(observed_entries=label_ind, unkonwn_entries=unlab_ind)
        # update index
        label_ind.update(selected_feature)
        unlab_ind.difference_update(selected_feature)
        # train/test
        lab_in_train = map_whole_index_to_train(train_idx, label_ind)
        X_mc = SVD_mc.impute(X[train_idx],
                             observed_mask=lab_in_train.get_matrix_mask(mat_shape=(len(train_idx), X.shape[1]), sparse=False))
        model.fit(X_mc, y[train_idx])
        pred = model.predict(X[test_idx])
        perf = accuracy_score(y_true=y[test_idx], y_pred=pred)
        # save
        st = State(select_index=selected_feature, performance=perf)
        saver.add_state(st)
        stopping_criterion.update_information(saver)
    stopping_criterion.reset()
    rand_result.append(copy.deepcopy(saver))
analyser = ExperimentAnalyser()
analyser.add_method(method_results=AFASMC_result, method_name='AFASMC')
analyser.add_method(method_results=Stable_result, method_name='Stability')
analyser.add_method(method_results=rand_result, method_name='Random')
print(analyser)
analyser.plot_learning_curves()
Copyright © 2018, alipy developers (BSD 3 License).