In reality, the labels given by human annotators are not always correct. On the one hand, some unavoidable noise comes from the instrumentation of the experimental setting. On the other hand, people can become distracted or fatigued over time, which introduces variability in the quality of their annotations.
ALiPy implements several strategies for the noisy-oracles setting. Some of them mainly evaluate the quality or expertise of each oracle, while the others try to obtain an accurate label for each instance whose labels are provided by several noisy oracles.
In the following, we first introduce the tools designed for noisy oracles and how to use this kind of strategy, and then present an example of a noisy-oracles experiment.
In ALiPy, it is very easy to construct a noisy oracle. As shown earlier, the values returned by an oracle are determined by the knowledge pool given at initialization, so you can initialize a noisy oracle with a corrupted label vector:
clean_oracle = Oracle(labels=[1, 0, 1])   # ground-truth labels
noisy_oracle = Oracle(labels=[0, 0, 0])   # corrupted label vector
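As a quick illustration (a minimal sketch using the query_by_index method shown later in this section), querying the same index from both oracles returns different labels:
# Query index 0 from each oracle; query_by_index returns the labels and the incurred cost
labels_clean, _ = clean_oracle.query_by_index(indexes=[0])   # returns the true label, 1
labels_noisy, _ = noisy_oracle.query_by_index(indexes=[0])   # returns the corrupted label, 0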
Sometimes, several oracles are available and work together, each with a different specialty.
ALiPy provides the alipy.oracle.Oracles class for this setting. You can add several oracle objects to this container:
from alipy.oracle import Oracles
oracles = Oracles()
oracles.add_oracle(oracle_name='Tom', oracle_object=clean_oracle)
oracles.add_oracle(oracle_name='Amy', oracle_object=noisy_oracle)
You can then query from a specific oracle by its name:
labels, cost = oracles.query_from(index_for_querying=[23], oracle_name='Tom')
This class also stores the query history of the oracles; you can obtain the full history for further use.
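For example, the accumulated history can be printed directly; the same call appears at the end of the example later in this section:
print(oracles.full_history())   # print the accumulated query history of all oracles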
alipy.oracle.ElementRepository is a class that stores the queried information (e.g., the queried labels, selected indexes, and cost) and serves as a supporting tool for the oracle.
This class is usually unnecessary for a clean oracle, because you can get the queried labels by indexing your ground-truth label matrix with the selected indexes. However, in some special settings (e.g., noisy oracles), users may want to store the queried information for further analysis.
Functions of the knowledge repository include:
1. Retrieving queried information without extra cost
2. Recording the query history
3. Getting the labeled set for model training
Here we introduce the basic usage of this class.
There are 2 categories of knowledge repository in ALiPy:
1. ElementRepository for fine-grained (element-wise) data
2. MatrixRepository for instance-wise data
You can get them by:
from alipy.oracle import ElementRepository, MatrixRepository
ele_rep = ElementRepository(labels=y[label_ind], indexes=label_ind, examples=X[label_ind])
mat_rep = MatrixRepository(labels=y[label_ind], indexes=label_ind, examples=X[label_ind])
Note that in ElementRepository, the labels and examples can be complex objects, while in MatrixRepository they must be arrays. Also, the examples parameter is optional; however, if you want to get the feature matrix of the labeled data, you must provide the feature matrix when initializing and at each query update.
Some commonly used methods are:
selected_ind = [23]
labels, cost = oracle.query_by_index(indexes=selected_ind)   # 'oracle' is an Oracle object as described above
ele_rep.update_query(labels=labels, indexes=selected_ind, cost=cost, examples=X[selected_ind])
labels, cost = ele_rep.retrieve_by_indexes(indexes=selected_ind) # retrieve the queried instance
X_lab, y_lab, ind_lab = ele_rep.get_training_data() # get the feature and label matrix of labeled set
You can also store the feature vectors of the queried instances by specifying the examples parameter when initializing and when updating queries.
ALiPy provides several existing algorithms for experimental comparison:
CEAL (IJCAI 2017) : This method selects an instance-labeler pair (x, a) and queries the label of x from a, where the selection of both the instance and the labeler is based on an evaluation function Q(x, a).
IEthresh (KDD 2009) : This method selects a batch of oracles to label the selected instance. It scores each oracle according to the difference between its labeling results and the majority-vote results.
All : A baseline method that selects the instance by uncertainty, queries from all oracles, and returns the majority-vote result.
Random : A baseline method that selects oracles randomly.
Note that these methods only return the index of the selected instance and the names of the selected oracles; they do not execute the query itself. For their detailed usage, please refer to their API pages and the following example code.
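To make this interface concrete, a single selection and majority-vote query (a minimal sketch using the same variable names as in the example below) looks roughly like this:
# The strategy only chooses an instance and the oracles to ask
select_ind, select_ora = ceal.select(label_ind, unlab_ind)
# The query itself is performed separately, e.g., by majority vote over the chosen oracles
vote_count, vote_result, cost = get_majority_vote(selected_instance=select_ind, oracles=oracles, names=select_ora)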
from alipy.toolbox import ToolBox
from alipy.oracle import Oracle, Oracles
from alipy.utils.misc import randperm
from alipy.query_strategy.noisy_oracles import QueryNoisyOraclesCEAL, QueryNoisyOraclesAll, \
QueryNoisyOraclesIEthresh, QueryNoisyOraclesRandom, get_majority_vote
from sklearn.datasets import make_classification
import copy
import numpy as np
X, y = make_classification(n_samples=800, n_features=20, n_informative=2, n_redundant=2,
n_repeated=0, n_classes=2, n_clusters_per_class=1, weights=None, flip_y=0.01,
hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')
# Split data
alibox.split_AL(test_ratio=0.3, initial_label_rate=0.15, split_count=10)
# Use the default Logistic Regression classifier
model = alibox.get_default_model()
# The stopping criterion: a cost budget of 30
stopping_criterion = alibox.get_stopping_criterion('cost_limit', 30)
# Initialize noisy oracles with different noise levels (flip 10%-50% of the labels for y1-y5)
n_samples = len(y)
y1 = y.copy()
y2 = y.copy()
y3 = y.copy()
y4 = y.copy()
y5 = y.copy()
perms = randperm(n_samples-1)
y1[perms[0:round(n_samples*0.1)]] = 1-y1[perms[0:round(n_samples*0.1)]]
perms = randperm(n_samples-1)
y2[perms[0:round(n_samples*0.2)]] = 1-y2[perms[0:round(n_samples*0.2)]]
perms = randperm(n_samples-1)
y3[perms[0:round(n_samples*0.3)]] = 1-y3[perms[0:round(n_samples*0.3)]]
perms = randperm(n_samples-1)
y4[perms[0:round(n_samples*0.4)]] = 1-y4[perms[0:round(n_samples*0.4)]]
perms = randperm(n_samples-1)
y5[perms[0:round(n_samples*0.5)]] = 1-y5[perms[0:round(n_samples*0.5)]]
oracle1 = Oracle(labels=y1, cost=np.zeros(y.shape)+1.2)
oracle2 = Oracle(labels=y2, cost=np.zeros(y.shape)+.8)
oracle3 = Oracle(labels=y3, cost=np.zeros(y.shape)+.5)
oracle4 = Oracle(labels=y4, cost=np.zeros(y.shape)+.4)
oracle5 = Oracle(labels=y5, cost=np.zeros(y.shape)+.3)
oracle6 = Oracle(labels=[0]*n_samples, cost=np.zeros(y.shape)+.3)   # always answers 0
oracle7 = Oracle(labels=[1]*n_samples, cost=np.zeros(y.shape)+.3)   # always answers 1
oracles = Oracles()
oracles.add_oracle(oracle_name='o1', oracle_object=oracle1)
oracles.add_oracle(oracle_name='o2', oracle_object=oracle2)
oracles.add_oracle(oracle_name='o3', oracle_object=oracle3)
oracles.add_oracle(oracle_name='o4', oracle_object=oracle4)
# oracles.add_oracle(oracle_name='o5', oracle_object=oracle5)
oracles.add_oracle(oracle_name='oa0', oracle_object=oracle6)
oracles.add_oracle(oracle_name='oa1', oracle_object=oracle7)
# Define the main active learning loop
def al_loop(strategy, alibox, round):
    # Get the data split of one fold experiment
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # Get intermediate results saver for one fold experiment
    saver = alibox.get_stateio(round)
    # Get repository to store noisy labels
    repo = alibox.get_repository(round)
    while not stopping_criterion.is_stop():
        # Query
        select_ind, select_ora = strategy.select(label_ind, unlab_ind)
        vote_count, vote_result, cost = get_majority_vote(selected_instance=select_ind, oracles=oracles, names=select_ora)
        repo.update_query(labels=vote_result, indexes=select_ind)
        # Update the labeled and unlabeled index sets
        label_ind.update(select_ind)
        unlab_ind.difference_update(select_ind)
        # Train and test
        _, y_lab, indexes_lab = repo.get_training_data()
        model.fit(X=X[indexes_lab], y=y_lab)
        pred = model.predict(X[test_idx])
        perf = alibox.calc_performance_metric(y_true=y[test_idx], y_pred=pred)
        # Save the intermediate result
        st = alibox.State(select_index=select_ind, performance=perf, cost=cost)
        saver.add_state(st)
        stopping_criterion.update_information(saver)
    stopping_criterion.reset()
    return saver
ceal_result = []
iet_result = []
all_result = []
rand_result = []
for round in range(5):
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # init strategies
    ceal = QueryNoisyOraclesCEAL(X, y, oracles=oracles, initial_labeled_indexes=label_ind)
    iet = QueryNoisyOraclesIEthresh(X=X, y=y, oracles=oracles, initial_labeled_indexes=label_ind)
    all = QueryNoisyOraclesAll(X=X, y=y, oracles=oracles)
    rand = QueryNoisyOraclesRandom(X=X, y=y, oracles=oracles)
    ceal_result.append(copy.deepcopy(al_loop(ceal, alibox, round)))
    iet_result.append(copy.deepcopy(al_loop(iet, alibox, round)))
    all_result.append(copy.deepcopy(al_loop(all, alibox, round)))
    rand_result.append(copy.deepcopy(al_loop(rand, alibox, round)))
print(oracles.full_history())
analyser = alibox.get_experiment_analyser(x_axis='cost')
analyser.add_method(method_results=ceal_result, method_name='ceal')
analyser.add_method(method_results=iet_result, method_name='iet')
analyser.add_method(method_results=all_result, method_name='all')
analyser.add_method(method_results=rand_result, method_name='rand')
analyser.plot_learning_curves()