In reality, the labels given by human annotators are not always correct. On the one hand, some unavoidable noise comes from the instrumentation of the experimental setting. On the other hand, people can become distracted or fatigued over time, which introduces variability in the quality of their annotations.
ALiPy implements several strategies for the noisy-oracles setting. Some of them mainly evaluate the quality or expertise of each oracle, while the others try to obtain an accurate label for each instance whose labels are provided by several noisy oracles.
In the following, we first introduce the tools designed for noisy oracles and how to use this kind of strategy, and then present an example of a noisy-oracles experiment.
In ALiPy, it is very easy to construct a noisy oracle. As shown earlier, the values returned by an oracle are determined by the knowledge pool given at initialization, so you can initialize a noisy oracle with a corrupted label vector:
clean_oracle = Oracle(labels=[1, 0, 1])   # ground-truth labels
noisy_oracle = Oracle(labels=[0, 0, 0])   # corrupted label vector
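As a quick illustration (a minimal sketch using the query_by_index method shown later in this section), querying the same index from both oracles returns different labels:
# Query index 0 from each oracle; query_by_index returns the labels and the incurred cost
labels_clean, _ = clean_oracle.query_by_index(indexes=[0])   # returns the true label, 1
labels_noisy, _ = noisy_oracle.query_by_index(indexes=[0])   # returns the corrupted label, 0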
Sometimes, several oracles are available and work together, each with a different specialty.
ALiPy provides the alipy.oracle.Oracles class for this setting. You can add several oracle objects to this container:
from alipy.oracle import Oracles
oracles = Oracles()
oracles.add_oracle(oracle_name='Tom', oracle_object=clean_oracle)
oracles.add_oracle(oracle_name='Amy', oracle_object=noisy_oracle)
You can then query from a specific oracle by its name:
labels, cost = oracles.query_from(index_for_querying=[23], oracle_name='Tom')
This class also stores the query history of the oracles; you can obtain the full history for further use.
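For example, the accumulated history can be printed directly; the same call appears at the end of the example later in this section:
print(oracles.full_history())   # print the accumulated query history of all oracles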
alipy.oracle.ElementRepository is a class that stores the queried information (e.g., the queried labels, selected indexes, and cost) and serves as a supporting tool for the oracle.
This class is usually unnecessary for a clean oracle, because you can get the queried labels by indexing your ground-truth label matrix with the selected indexes. However, in some special settings (e.g., noisy oracles), users may want to store the queried information for further analysis.
Functions of the knowledge repository include:
1. Retrieving queried information without extra cost
2. Recording the query history
3. Getting the labeled set for model training
Here we introduce the basic usage of this class.
There are 2 categories of knowledge repository in ALiPy:
1. ElementRepository for fine-grained (element-wise) data
2. MatrixRepository for instance-wise data
You can get them by:
from alipy.oracle import ElementRepository, MatrixRepository
ele_rep = ElementRepository(labels=y[label_ind], indexes=label_ind, examples=X[label_ind])
mat_rep = MatrixRepository(labels=y[label_ind], indexes=label_ind, examples=X[label_ind])
Note that in ElementRepository, the labels and examples can be complex objects, while in MatrixRepository they must be arrays. Also, the examples parameter is optional; however, if you want to get the feature matrix of the labeled data, you must provide the feature matrix when initializing and at each query update.
Some commonly used methods are:
selected_ind = [23]
labels, cost = oracle.query_by_index(indexes=selected_ind)   # 'oracle' is an Oracle object as described above
ele_rep.update_query(labels=labels, indexes=selected_ind, cost=cost, examples=X[selected_ind])
labels, cost = ele_rep.retrieve_by_indexes(indexes=selected_ind) # retrieve the queried instance
X_lab, y_lab, ind_lab = ele_rep.get_training_data() # get the feature and label matrix of labeled set
You can also store the feature vectors of the queried instances by specifying the examples parameter when initializing and when updating queries.
ALiPy provides several existing algorithms for experimental comparison:
CEAL (IJCAI 2017) : This method selects an instance-labeler pair (x, a) and queries the label of x from a, where the selection of both the instance and the labeler is based on an evaluation function Q(x, a).
IEthresh (KDD 2009) : This method selects a batch of oracles to label the selected instance. It scores each oracle according to the difference between its labeling results and the majority-vote results.
All : A baseline method that selects the instance by uncertainty, queries from all oracles, and returns the majority-vote result.
Random : A baseline method that selects oracles randomly.
Note that these methods only return the index of the selected instance and the names of the selected oracles; they do not execute the query itself. For their detailed usage, please refer to their API pages and the following example code.
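To make this interface concrete, a single selection and majority-vote query (a minimal sketch using the same variable names as in the example below) looks roughly like this:
# The strategy only chooses an instance and the oracles to ask
select_ind, select_ora = ceal.select(label_ind, unlab_ind)
# The query itself is performed separately, e.g., by majority vote over the chosen oracles
vote_count, vote_result, cost = get_majority_vote(selected_instance=select_ind, oracles=oracles, names=select_ora)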
from alipy.toolbox import ToolBox
from alipy.oracle import Oracle, Oracles
from alipy.utils.misc import randperm
from alipy.query_strategy.noisy_oracles import QueryNoisyOraclesCEAL, QueryNoisyOraclesAll, \
QueryNoisyOraclesIEthresh, QueryNoisyOraclesRandom, get_majority_vote
from sklearn.datasets import make_classification
import copy
import numpy as np
X, y = make_classification(n_samples=800, n_features=20, n_informative=2, n_redundant=2,
n_repeated=0, n_classes=2, n_clusters_per_class=1, weights=None, flip_y=0.01,
hypercube=True, shift=0.0, scale=1.0, shuffle=True, random_state=None)
alibox = ToolBox(X=X, y=y, query_type='AllLabels', saving_path='.')
# Split data
alibox.split_AL(test_ratio=0.3, initial_label_rate=0.15, split_count=10)
# Use the default Logistic Regression classifier
model = alibox.get_default_model()
# The stopping criterion: a cost budget of 30
stopping_criterion = alibox.get_stopping_criterion('cost_limit', 30)
# Initialize noisy oracles with different noise levels (flip 10%-50% of the labels for y1-y5)
n_samples = len(y)
y1 = y.copy()
y2 = y.copy()
y3 = y.copy()
y4 = y.copy()
y5 = y.copy()
perms = randperm(n_samples-1)
y1[perms[0:round(n_samples*0.1)]] = 1-y1[perms[0:round(n_samples*0.1)]]
perms = randperm(n_samples-1)
y2[perms[0:round(n_samples*0.2)]] = 1-y2[perms[0:round(n_samples*0.2)]]
perms = randperm(n_samples-1)
y3[perms[0:round(n_samples*0.3)]] = 1-y3[perms[0:round(n_samples*0.3)]]
perms = randperm(n_samples-1)
y4[perms[0:round(n_samples*0.4)]] = 1-y4[perms[0:round(n_samples*0.4)]]
perms = randperm(n_samples-1)
y5[perms[0:round(n_samples*0.5)]] = 1-y5[perms[0:round(n_samples*0.5)]]
oracle1 = Oracle(labels=y1, cost=np.zeros(y.shape)+1.2)
oracle2 = Oracle(labels=y2, cost=np.zeros(y.shape)+.8)
oracle3 = Oracle(labels=y3, cost=np.zeros(y.shape)+.5)
oracle4 = Oracle(labels=y4, cost=np.zeros(y.shape)+.4)
oracle5 = Oracle(labels=y5, cost=np.zeros(y.shape)+.3)
oracle6 = Oracle(labels=[0]*n_samples, cost=np.zeros(y.shape)+.3)   # always answers 0
oracle7 = Oracle(labels=[1]*n_samples, cost=np.zeros(y.shape)+.3)   # always answers 1
oracles = Oracles()
oracles.add_oracle(oracle_name='o1', oracle_object=oracle1)
oracles.add_oracle(oracle_name='o2', oracle_object=oracle2)
oracles.add_oracle(oracle_name='o3', oracle_object=oracle3)
oracles.add_oracle(oracle_name='o4', oracle_object=oracle4)
# oracles.add_oracle(oracle_name='o5', oracle_object=oracle5)
oracles.add_oracle(oracle_name='oa0', oracle_object=oracle6)
oracles.add_oracle(oracle_name='oa1', oracle_object=oracle7)
# Define the main active learning loop
def al_loop(strategy, alibox, round):
    # Get the data split of one fold experiment
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # Get intermediate results saver for one fold experiment
    saver = alibox.get_stateio(round)
    # Get repository to store noisy labels
    repo = alibox.get_repository(round)
    while not stopping_criterion.is_stop():
        # Query
        select_ind, select_ora = strategy.select(label_ind, unlab_ind)
        vote_count, vote_result, cost = get_majority_vote(selected_instance=select_ind, oracles=oracles, names=select_ora)
        repo.update_query(labels=vote_result, indexes=select_ind)
        # Update the labeled and unlabeled index sets
        label_ind.update(select_ind)
        unlab_ind.difference_update(select_ind)
        # Train and test
        _, y_lab, indexes_lab = repo.get_training_data()
        model.fit(X=X[indexes_lab], y=y_lab)
        pred = model.predict(X[test_idx])
        perf = alibox.calc_performance_metric(y_true=y[test_idx], y_pred=pred)
        # Save the intermediate result
        st = alibox.State(select_index=select_ind, performance=perf, cost=cost)
        saver.add_state(st)
        stopping_criterion.update_information(saver)
    stopping_criterion.reset()
    return saver
ceal_result = []
iet_result = []
all_result = []
rand_result = []
for round in range(5):
    train_idx, test_idx, label_ind, unlab_ind = alibox.get_split(round)
    # init strategies
    ceal = QueryNoisyOraclesCEAL(X, y, oracles=oracles, initial_labeled_indexes=label_ind)
    iet = QueryNoisyOraclesIEthresh(X=X, y=y, oracles=oracles, initial_labeled_indexes=label_ind)
    all = QueryNoisyOraclesAll(X=X, y=y, oracles=oracles)
    rand = QueryNoisyOraclesRandom(X=X, y=y, oracles=oracles)
    ceal_result.append(copy.deepcopy(al_loop(ceal, alibox, round)))
    iet_result.append(copy.deepcopy(al_loop(iet, alibox, round)))
    all_result.append(copy.deepcopy(al_loop(all, alibox, round)))
    rand_result.append(copy.deepcopy(al_loop(rand, alibox, round)))
print(oracles.full_history())
analyser = alibox.get_experiment_analyser(x_axis='cost')
analyser.add_method(method_results=ceal_result, method_name='ceal')
analyser.add_method(method_results=iet_result, method_name='iet')
analyser.add_method(method_results=all_result, method_name='all')
analyser.add_method(method_results=rand_result, method_name='rand')
analyser.plot_learning_curves()