Deal with cost-effective dataset

In some applications, the cost of querying different labels can be different. This cost information should be considered in active selection.

In alipy, we provide the following tool classes to support your cost-effective experiments.

Oracle with cost information

To assign different costs to different labels, one way is to set the cost parameter when initializing an Oracle object.

from alipy.oracle import Oracle

labels = [0, 1, 0]
cost = [2, 1, 2]
oracle = Oracle(labels=labels, cost=cost)

In this way, you can get the corresponding cost when querying from the oracle.

labels, cost = oracle.query_by_index(indexes=[1])
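The returned cost can be accumulated to keep track of how much of your budget has been spent so far. Below is a minimal sketch of this bookkeeping; the total_budget value and the queried indexes are only illustrative:

total_budget = 10                 # illustrative overall budget
spent = 0
queried_labels = []
for ind in [0, 1, 2]:
    labels, cost = oracle.query_by_index(indexes=[ind])
    queried_labels.extend(labels)
    spent += sum(cost)            # cost is a list with one entry per queried index
    if spent >= total_budget:     # stop querying once the budget is exhausted
        break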

Save the cost after each query

Note that you should save the queried cost after each query in a cost-effective experiment for later analysis.

from alipy.experiment import State
st = State(select_index=select_ind, performance=accuracy, cost=cost)
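Each State is typically appended to a StateIO saver so that the whole query history (including the cost entries) can be passed to the analyser afterwards. A minimal sketch, assuming alibox is a ToolBox object and select_ind, accuracy, cost come from the current query (see the full example at the end of this page):

from alipy.experiment import State

saver = alibox.get_stateio(round)      # StateIO saver of one fold
st = State(select_index=select_ind, performance=accuracy, cost=cost)
saver.add_state(st)                    # append the result of this query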

You can use your own data structure to save the results. However, if you are going to use the analyser in alipy, your results should satisfy some constraints, which are introduced below.

Cost sensitive plotting

You can analyse the cost-effective results with alipy.experiment.ExperimentAnalyser.

In the cost-effective setting, your results should be in one of the following forms:

- A StateIO object which stores k State objects for k queries, each of which contains a cost entry.

- A list of n tuples with 2 elements, in which the first element is the x_axis value (e.g., the accumulative cost) and the second element is the y_axis value (e.g., the performance).

Here is an example:

random_result = [[(1, 0.6), (2, 0.7), (2, 0.8), (1, 0.9)],
                 [(1, 0.7), (1, 0.7), (1.5, 0.75), (2.5, 0.85)]]  # 2 folds, 4 queries for each fold.
uncertainty_result = [saver1, saver2]  # each State object in the saver must have the 'cost' entry.
from alipy.experiment import ExperimentAnalyser

analyser = ExperimentAnalyser(x_axis='cost')
analyser.add_method('random', random_result)
analyser.add_method('uncertainty', uncertainty_result)

Since the lengths of the results in the cost-sensitive setting can be different, an interpolation will be performed automatically to align the results of the compared methods. All you need to do is make sure that the budgets of the different methods are the same.

analyser.plot_learning_curves(title='Learning curves example', std_area=True)

Cost sensitive query strategies

ALiPy provides the following existing algorithms for comparison in experiments:

HALC (IJCAI 2018): This method designs a strategy for the hierarchical multi-label classification problem. It evaluates the informativeness of instances by incorporating the potential contribution of ancestor and descendant labels.

Cost performance: This method selects the instance-label pair with the highest cost performance.

Random: This method selects instance-label pairs randomly.


Note that these methods need the cost budget and the cost value of each unlabeled entry when selecting data. For their detailed usage, please refer to their API pages and the following example code.
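For a quick impression, here is a minimal sketch of a single selection step with QueryCostSensitivePerformance. It assumes X, y and the index collections label_ind, unlab_ind come from an alibox data split, exactly as in the full example below; the cost and budget values are illustrative:

from alipy.query_strategy.cost_sensitive import QueryCostSensitivePerformance

cost = [1, 3, 3, 7, 10]    # cost of querying each of the 5 classes
budget = 40                # cost budget of a single query batch
strategy = QueryCostSensitivePerformance(X, y)
# select instance-label pairs under the given per-query cost budget
select_ind = strategy.select(label_ind, unlab_ind, cost=cost, budget=budget)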

Cost sensitive experiment example

import numpy as np 
import copy

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

from alipy import ToolBox
from alipy.index.multi_label_tools import get_Xy_in_multilabel, check_index_multilabel
from alipy.query_strategy.cost_sensitive import QueryCostSensitiveHALC, QueryCostSensitivePerformance, QueryCostSensitiveRandom
from alipy.query_strategy.cost_sensitive import hierarchical_multilabel_mark
from alipy.metrics.performance import type_of_target

X, y = make_multilabel_classification(n_samples=2000, n_features=20, n_classes=5,
                                      n_labels=3, length=50, allow_unlabeled=True,
                                      sparse=False, return_indicator='dense',
                                      return_distributions=False,
                                      random_state=None)
y[y == 0] = -1

# the cost of each class
cost = [1, 3, 3, 7, 10]

# label_tree[i, j] = 1 if node_i is the parent of node_j, else 0
label_tree = np.zeros((5, 5), dtype=int)
label_tree[0, 1] = 1
label_tree[0, 2] = 1
label_tree[1, 3] = 1
label_tree[2, 4] = 1

alibox = ToolBox(X=X, y=y, query_type='PartLabels')

# Split data
alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1, split_count=10)

# Use a RandomForestClassifier as the base model
model = RandomForestClassifier()

# The cost budget of each query (batch)
budget = 40

# Stop the experiment when the accumulated cost reaches 500
stopping_criterion = alibox.get_stopping_criterion('cost_limit', 500)

performance_result = []
halc_result = []
random_result = []

def main_loop(alibox, strategy, round):
    # Get the data split of one fold experiment
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # Get intermediate results saver for one fold experiment
    saver = alibox.get_stateio(round)
    while not stopping_criterion.is_stop():
        # Select a subset of Uind according to the query strategy
        select_ind = strategy.select(label_ind, unlab_ind, cost=cost, budget=budget)
        # Complete the selected entries according to the label hierarchy defined by label_tree
        select_ind = hierarchical_multilabel_mark(select_ind, label_ind, label_tree, y)

        label_ind.update(select_ind)
        unlab_ind.difference_update(select_ind)
            
        # Update model and calc performance according to the model you are using
        X_tr, y_tr, _ = get_Xy_in_multilabel(label_ind, X=X, y=y)
        model.fit(X_tr, y_tr)
        pred = model.predict(X[test_idx, :])
        pred[pred == 0] = -1    # map 0 predictions to -1 to match the {-1, 1} label space

        performance = alibox.calc_performance_metric(y_true=y[test_idx], y_pred=pred,
                                                     performance_metric='hamming_loss')

        # Save intermediate results to file
        st = alibox.State(select_index=select_ind.index, performance=performance, cost=budget)
        saver.add_state(st)
        # Passing the current progress to stopping criterion object
        stopping_criterion.update_information(saver)
    # Reset the progress in stopping criterion object
    stopping_criterion.reset()
    return saver

for round in range(5):
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # Use the pre-defined strategies
    random = QueryCostSensitiveRandom(X, y)
    perf = QueryCostSensitivePerformance(X, y)
    halc = QueryCostSensitiveHALC(X, y, label_tree=label_tree)

    random_result.append(copy.deepcopy(main_loop(alibox, random, round)))
    performance_result.append(copy.deepcopy(main_loop(alibox, perf, round)))
    halc_result.append(copy.deepcopy(main_loop(alibox, halc, round)))

analyser = alibox.get_experiment_analyser(x_axis='cost')
analyser.add_method(method_name='random', method_results=random_result)
analyser.add_method(method_name='performance', method_results=performance_result)
analyser.add_method(method_name='HALC', method_results=halc_result)

print(analyser)
analyser.plot_learning_curves(title='Example of cost-sensitive', std_area=False)
