Multi label in alipy

In the multi-label setting, an instance can be associated with multiple labels simultaneously.

In the active learning literature, there are two ways to query labels for multi-label datasets:

1. Query all labels of an instance.

2. Query an instance-label pair at a time.

In the first case, the implementation is the same as in the single-label setting.

In the second case, alipy provides a set of supporting tools, which the following sections introduce.

Multi label index

A multi-label index is defined as follows:

Each index is a tuple whose first element is the index of an instance and whose optional second element gives the indexes of labels. To query all labels of an instance, use a 1-element tuple: (example_index, ). To query specific labels, use a 2-element tuple: (example_index, [label_indexes]).

Some examples of valid multi-label indexes include:

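queried_index = (1, [3, 4])  # query the 4th and 5th labels of the 2nd instance
queried_index = (1, [3])     # query the 4th label of the 2nd instance
queried_index = (1, 3)       # same query, with the label index given as a bare int
queried_index = (1, (3))     # same query; (3) is just the int 3 in Python
queried_index = (1, (3, 4))  # label indexes may also be given as a tuple
queried_index = (1, )        # query all labels of the 2nd instance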
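
To make the equivalence between these forms concrete, here is a minimal sketch that expands any of them into explicit (instance, label) pairs. The helper name expand_multi_label_index and its n_labels parameter are hypothetical and used only for illustration; they are not part of alipy.

```python
def expand_multi_label_index(index, n_labels):
    """Expand one multi-label index into a sorted list of (instance, label) pairs.

    Hypothetical helper for illustration only; it is not an alipy function.
    """
    if len(index) == 1:
        # (instance, ) means "all labels of this instance"
        label_ids = range(n_labels)
    else:
        labels = index[1]
        # A bare int such as 3 stands for a single label index
        label_ids = list(labels) if hasattr(labels, "__iter__") else [labels]
    return sorted((index[0], int(j)) for j in label_ids)


print(expand_multi_label_index((1, [3, 4]), n_labels=5))  # [(1, 3), (1, 4)]
print(expand_multi_label_index((1, (3)), n_labels=5))     # [(1, 3)]
print(expand_multi_label_index((1, ), n_labels=3))        # [(1, 0), (1, 1), (1, 2)]
```

The list and tuple forms expand to the same pairs, which is why they can be used interchangeably when querying.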

Data split

To split a multi-label dataset, use the alipy.data_manipulate.split_multi_label function. It splits the dataset into a training set and a test set, and further divides the training set into a small, fully labeled set and a large unlabeled pool.

Note that the returned indexes of the labeled and unlabeled sets are multi-label indexes as defined above.

from alipy.data_manipulate import split_multi_label

mult_y = [[1, 1, 1], [0, 1, 1], [0, 1, 0]]  # 3 instances with 3 labels.
train_idx, test_idx, label_idx, unlabel_idx = split_multi_label(
    y=mult_y, split_count=1, all_class=False,
    test_ratio=0.3, initial_label_rate=0.5,
    saving_path=None
)

The values of train_idx, test_idx, label_idx, and unlabel_idx look like this (each is a list with one entry per split):

[array([0, 1])]
[array([2])]
[[(0,)]]
[[(1,)]]
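
As a quick sanity check on how these four values relate to each other, the sketch below copies the output shown above into plain Python lists (instead of NumPy arrays) and verifies the partition. It does not call alipy; it only inspects the printed values.

```python
# Values copied from the output above, with NumPy arrays written as plain lists.
train_idx = [[0, 1]]
test_idx = [[2]]
label_idx = [[(0,)]]
unlabel_idx = [[(1,)]]

fold = 0  # split_count=1, so there is a single split
train = set(train_idx[fold])
test = set(test_idx[fold])
# The first element of every multi-label index is the instance index.
labeled = {index[0] for index in label_idx[fold]}
unlabeled = {index[0] for index in unlabel_idx[fold]}

# Training and test instances do not overlap.
assert train.isdisjoint(test)
# The labeled set and the unlabeled pool together make up the training set.
assert labeled | unlabeled == train
assert labeled.isdisjoint(unlabeled)
print(sorted(labeled), sorted(unlabeled))  # [0] [1]
```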

MultiLabelIndexCollection

ALiPy provides another IndexCollection class for the multi-label setting, MultiLabelIndexCollection. Its interface is largely the same as that of the IndexCollection used in the single-label setting, with additional functions to support multi-label data:

1. Accept different types of indexes.

2. Accept a mask matrix.

3. Provide retrieval methods.

A full introduction to MultiLabelIndexCollection would take up a lot of space here, so we refer users to the introduction to MultiLabelIndexCollection page for more details.
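
The idea behind mask-matrix support and retrieval can be sketched in plain Python. The functions below (mask_to_pairs, labels_of_instance, instances_of_label) are hypothetical names chosen for this illustration and are not the methods of MultiLabelIndexCollection; see its own page for the real interface.

```python
def mask_to_pairs(mask):
    """Return the (instance, label) pair of every nonzero entry in a 2D mask."""
    return [(i, j)
            for i, row in enumerate(mask)
            for j, flag in enumerate(row) if flag]


def labels_of_instance(pairs, instance):
    """Retrieve the label indexes stored for one instance."""
    return sorted(j for i, j in pairs if i == instance)


def instances_of_label(pairs, label):
    """Retrieve the instance indexes stored for one label."""
    return sorted(i for i, j in pairs if j == label)


# Rows are instances, columns are labels; 1 marks a known (queried) entry.
mask = [[1, 0, 1],
        [0, 0, 0],
        [1, 1, 0]]
pairs = mask_to_pairs(mask)
print(pairs)                         # [(0, 0), (0, 2), (2, 0), (2, 1)]
print(labels_of_instance(pairs, 0))  # [0, 2]
print(instances_of_label(pairs, 0))  # [0, 2]
```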

Multi label oracle

ALiPy provides alipy.oracle.OracleQueryMultiLabel for querying instance-label pairs. It is initialized in the same way as Oracle.

from alipy.oracle import OracleQueryMultiLabel
oracle = OracleQueryMultiLabel(labels=mult_y)

When querying, provide either a single valid multi-label index or a list of them, as defined above.

label, cost = oracle.query_by_index((1, 2))  # query the 3rd label of the 2nd instance
labels, cost = oracle.query_by_index([(1, 2), (0, 1)])  # query two instance-label pairs at once
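
To illustrate what answering an instance-label pair query involves, here is a toy stand-in written in plain Python. ToyMultiLabelOracle is a hypothetical class for this sketch only: it handles just (instance, label) pairs, charges a unit cost per pair, and its return format is an assumption that may differ from OracleQueryMultiLabel.

```python
class ToyMultiLabelOracle:
    """Toy oracle that looks up single entries of a label matrix.

    Hypothetical illustration; not alipy's OracleQueryMultiLabel.
    """

    def __init__(self, labels):
        self._labels = [list(row) for row in labels]

    def query_by_index(self, indexes):
        # Accept either one (instance, label) tuple or a list of them.
        if isinstance(indexes, tuple):
            indexes = [indexes]
        answers = [self._labels[i][j] for i, j in indexes]
        costs = [1] * len(answers)  # assumed unit cost per queried pair
        return answers, costs


mult_y = [[1, 1, 1], [0, 1, 1], [0, 1, 0]]
oracle = ToyMultiLabelOracle(mult_y)
print(oracle.query_by_index((1, 2)))            # ([1], [1])
print(oracle.query_by_index([(1, 0), (0, 1)]))  # ([0, 1], [1, 1])
```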

Multi label metrics

The multi-label metrics available in alipy are accuracy_score, hamming_loss, one_error, coverage_error, label_ranking_loss, average_precision_score, label_ranking_average_precision_score, and micro_auc_score.

You can use them by importing them from the metrics module:

from alipy.metrics import hamming_loss
hl = hamming_loss(y_true=[[0, 1, 0]], y_pred=[[1, 1, 0]])
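
For reference, Hamming loss is the fraction of label entries that are predicted incorrectly. The sketch below recomputes it by hand for the same inputs; hamming_loss_by_hand is a hypothetical name, and the function is an independent illustration rather than alipy's implementation.

```python
def hamming_loss_by_hand(y_true, y_pred):
    """Fraction of (instance, label) entries where the prediction is wrong."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


# One of the three label entries differs, so the loss is 1/3.
print(hamming_loss_by_hand([[0, 1, 0]], [[1, 1, 0]]))  # 0.3333333333333333
```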

Multi label query strategies

ALiPy provides several existing algorithms for experimental comparison:

AUDI (ICDM 2013) : Select an instance-label pair based on uncertainty and diversity.

QUIRE (TPAMI 2014) : Select an instance-label pair based on the informativeness and representativeness.

MMC (KDD 2009) : Select an instance and query all of its labels, based on maximum loss reduction with maximal confidence.

Adaptive (IJCAI 2013) : Select an instance and query all of its labels, based on max-margin uncertainty and label cardinality inconsistency.

Random : Select instances or instance-label pairs randomly.

These methods are used in largely the same way as in the single-label setting. Note that the returned value is a list of multi-label indexes as defined above.
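
As a rough picture of the simplest case, random selection of instance-label pairs, consider the sketch below. It is written in plain Python with a hypothetical function select_random_pairs and plain sets in place of alipy's index collections; it is not the code of QueryMultiLabelRandom.

```python
import random


def select_random_pairs(unlabeled_pairs, batch_size=1, seed=None):
    """Pick up to batch_size (instance, label) pairs uniformly at random."""
    rng = random.Random(seed)
    candidates = sorted(unlabeled_pairs)  # fixed order keeps seeded runs repeatable
    return rng.sample(candidates, min(batch_size, len(candidates)))


# Toy pool: 3 instances x 2 labels, nothing labeled yet.
unlabeled = {(i, j) for i in range(3) for j in range(2)}
labeled = set()

selected = select_random_pairs(unlabeled, batch_size=2, seed=0)
# Move the selected pairs from the unlabeled pool to the labeled set.
labeled.update(selected)
unlabeled.difference_update(selected)
print(len(labeled), len(unlabeled))  # 2 4
```

The update and difference_update steps mirror how the labeled and unlabeled collections are maintained after each query.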

Multi label experiment examples

import copy
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder
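from alipy.query_strategy.multi_label import (
    LabelRankingModel, QueryMultiLabelAUDI, QueryMultiLabelQUIRE,
    QueryMultiLabelMMC, QueryMultiLabelAdaptive, QueryMultiLabelRandom,
)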
from alipy.index.multi_label_tools import get_Xy_in_multilabel
from alipy import ToolBox

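X, y = load_iris(return_X_y=True)
# One-hot encode the 3 iris classes to get a label matrix with one column per label
mlb = OneHotEncoder()
mult_y = mlb.fit_transform(y.reshape((-1,1)))
mult_y = np.asarray(mult_y.todense())
# Encode absent labels as -1 instead of 0
mult_y[mult_y == 0] = -1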

alibox = ToolBox(X=X, y=mult_y, query_type='PartLabels')
alibox.split_AL(test_ratio=0.2, initial_label_rate=0.05, all_class=False)

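def main_loop(alibox, round_idx, strategy):
    # round_idx selects one of the splits created by split_AL
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round_idx)
    # Get intermediate results saver for one fold experiment
    saver = alibox.get_stateio(round_idx)
    # base model
    model = LabelRankingModel()

    # A simple stopping criterion: keep querying while the labeled set has at most 120 entries
    while len(label_ind) <= 120: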
        # query and update
        select_labs = strategy.select(label_ind, unlab_ind)
        # use cost to record the amount of queried instance-label pairs
        if len(select_labs[0]) == 1:
            cost = mult_y.shape[1]
        else:
            cost = len(select_labs)
        label_ind.update(select_labs)
        unlab_ind.difference_update(select_labs)

        # train/test
        X_tr, y_tr, _ = get_Xy_in_multilabel(label_ind, X=X, y=mult_y)
        model.fit(X=X_tr, y=y_tr)
        pres, pred = model.predict(X[test_idx])

        perf = alibox.calc_performance_metric(y_true=mult_y[test_idx], y_pred=pred, performance_metric='hamming_loss')

        # save
        st = alibox.State(select_index=select_labs, performance=perf, cost=cost)
        saver.add_state(st)

    return copy.deepcopy(saver)

audi_result = []
quire_result = []
random_result = []
mmc_result = []
adaptive_result = []

for round_idx in range(5):
    # init strategies
    audi = QueryMultiLabelAUDI(X, mult_y)
    quire = QueryMultiLabelQUIRE(X, mult_y)
    mmc = QueryMultiLabelMMC(X, mult_y)
    adaptive = QueryMultiLabelAdaptive(X, mult_y)
    random_strategy = QueryMultiLabelRandom()

    # run every strategy on the same split so that the results are comparable
    audi_result.append(main_loop(alibox, round_idx, strategy=audi))
    quire_result.append(main_loop(alibox, round_idx, strategy=quire))
    mmc_result.append(main_loop(alibox, round_idx, strategy=mmc))
    adaptive_result.append(main_loop(alibox, round_idx, strategy=adaptive))
    random_result.append(main_loop(alibox, round_idx, strategy=random_strategy))

analyser = alibox.get_experiment_analyser(x_axis='cost')
analyser.add_method(method_name='AUDI', method_results=audi_result)
analyser.add_method(method_name='QUIRE', method_results=quire_result)
analyser.add_method(method_name='RANDOM', method_results=random_result)
analyser.add_method(method_name='MMC', method_results=mmc_result)
analyser.add_method(method_name='Adaptive', method_results=adaptive_result)
analyser.plot_learning_curves()
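
The cost recorded in main_loop counts queried instance-label pairs: a whole-instance query costs as many pairs as there are labels, while a pair query costs one per pair. The sketch below spells out that accounting in plain Python. query_cost is a hypothetical helper, and unlike the loop above it inspects every selected index rather than only the first one.

```python
def query_cost(selected, n_labels):
    """Count the instance-label pairs requested by a list of multi-label indexes."""
    cost = 0
    for index in selected:
        if len(index) == 1:
            # (instance, ) queries every label of the instance
            cost += n_labels
        else:
            labels = index[1]
            # a bare label index counts as one pair, a list or tuple as its length
            cost += len(labels) if hasattr(labels, "__len__") else 1
    return cost


print(query_cost([(4, )], n_labels=3))           # 3
print(query_cost([(4, 0), (7, 2)], n_labels=3))  # 2
print(query_cost([(4, [0, 1])], n_labels=3))     # 2
```

Plotting performance against this cost, as x_axis='cost' does above, puts strategies that query whole instances and strategies that query single pairs on the same scale.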