Oracle

alipy.oracle is a module that implements the basic oracle definitions used in active learning experiments.

In active learning, the oracle (e.g., a human annotator) provides supervised information for instances at some cost.

In the rest of this guide, we first introduce the general usage of the oracle. Then we present several different types of oracles. Finally, we introduce the knowledge repository class, which is a supporting tool for the oracle.

General usage

In alipy, the oracle is a class for retrieving supervised information. When initializing it, you provide the oracle's knowledge pool, which is usually the labels of the instances. The learner can then query the oracle by instance or by index. An oracle returns the supervised information corresponding to the queried index and the cost incurred by the query.
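Conceptually, an oracle behaves like a lookup table from an index to a (label, cost) pair. The following is a minimal pure-Python sketch of that idea only; it is not ALiPy's implementation, and the class name ToyOracle is hypothetical. It assumes that missing indexes default to 0..n-1 and missing costs default to 1, as described later in this guide.

```python
class ToyOracle:
    """Hypothetical stand-in for an oracle: a lookup table from index to (label, cost)."""

    def __init__(self, labels, indexes=None, cost=None):
        if indexes is None:
            indexes = list(range(len(labels)))   # indexes default to 0..n-1
        if cost is None:
            cost = [1] * len(labels)             # unit cost when none is given
        if not (len(labels) == len(indexes) == len(cost)):
            raise ValueError("labels, indexes and cost must have the same length")
        self._pool = {i: (y, c) for i, y, c in zip(indexes, labels, cost)}

    def query_by_index(self, indexes):
        """Return (labels, cost) lists for one index or a list of indexes."""
        if not isinstance(indexes, (list, tuple)):
            indexes = [indexes]
        labels = [self._pool[i][0] for i in indexes]
        cost = [self._pool[i][1] for i in indexes]
        return labels, cost


toy = ToyOracle(labels=[0, 1, 0], indexes=[34, 56, 74], cost=[2, 1, 2])
toy_labels, toy_cost = toy.query_by_index([56, 74])
print(toy_labels, toy_cost)  # [1, 0] [1, 2]
```

Querying an index that is not in the pool raises a KeyError in this sketch, which mirrors the requirement that queried indexes belong to the oracle's knowledge pool.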

Initialize

The oracle in alipy can be initialized with a list of objects (e.g., labels) and the corresponding indexes of those objects:

from alipy.oracle import Oracle
indexes = [34, 56, 74]
labels = [0, 1, 0]
oracle = Oracle(labels=labels, indexes=indexes)

You can also provide only the labels; in that case the indexes are constructed automatically, starting from 0:

>>> oracle = Oracle(labels=labels)
>>> print(oracle.index_keys)
[0, 1, 2]

The oracle also accepts a cost parameter at initialization to specify a different cost for each query. The cost is a list with the same length as the labels, in one-to-one correspondence with them. If the cost is not provided, it is set to 1 for each label automatically.

cost = [2, 1, 2]
oracle = Oracle(labels=labels, indexes=indexes, cost=cost)

Finally, if you want to query the supervised information of an instance by its feature vector, you should also provide the corresponding feature matrix:

feature_mat = [[1,1,1], [2,2,2], [3,3,3]]
oracle = Oracle(labels=labels, indexes=indexes, cost=cost, examples=feature_mat)

Add knowledge

The oracle object accepts new entries after initialization. Note that if a parameter was given at initialization (e.g., the feature matrix), it must also be provided when adding entries.

Here is an example to add 2 entries to the oracle.

oracle.add_knowledge(labels=[1,0], indexes=[23,33], examples=[[4,4,4],[5,5,5]])

Query

There are two ways to query from the oracle.


1. Query by index. The queried_index must be a subset of the indexes known to the oracle, i.e. those given at initialization or added afterwards.

# query one or more instances at a time
queried_index = 23
labels, cost = oracle.query_by_index(indexes=queried_index)
queried_index = [23, 33]
labels, cost = oracle.query_by_index(indexes=queried_index)

The returned values are lists whose length equals the number of queried indexes.

2. Query by example. If you passed the feature matrix to the oracle at initialization, you can query with feature vectors, each of which must be a row of that feature matrix.

labels, cost = oracle.query_by_example(queried_examples=[1,1,1])
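One plausible way to support lookups by feature vector is to key a dictionary on an immutable copy of each row. The sketch below illustrates that idea in plain Python; it is an assumption made for illustration, not a description of how ALiPy stores examples, and the helper name lookup_by_example is hypothetical.

```python
feature_mat = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
example_labels = [0, 1, 0]

# Lists are not hashable, so each feature vector is converted to a tuple key.
label_of = {tuple(x): y for x, y in zip(feature_mat, example_labels)}


def lookup_by_example(example):
    """Hypothetical helper: return the label stored for an exact feature vector."""
    key = tuple(example)
    if key not in label_of:
        raise KeyError("unknown example: %r" % (example,))
    return label_of[key]


print(lookup_by_example([2, 2, 2]))  # 1
```

A lookup of this kind only matches exact feature vectors, and two instances with identical features would share one key, which is one reason querying by index is usually the safer choice.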

Oracles in different settings

There are several different types of oracles in alipy.


1. Noisy oracle

In reality, the labels given by humans are not always correct. If labels come from an empirical experiment (e.g., in biological, chemical, or clinical studies), one can usually expect some noise to result from the instrumentation or experimental setting. Even labels that come from human experts may not always be reliable, for several reasons. First, some instances are implicitly difficult for both people and machines. Second, people can become distracted or fatigued over time, which introduces variability in the quality of their annotations.

It is thus important to consider these factors when designing a query strategy.

In alipy, it is very easy to construct a noisy oracle. As shown above, the values returned by an oracle are determined by the knowledge pool given at initialization, so you can initialize a noisy oracle with a corrupted label vector.

clean_oracle = Oracle(labels=[1, 0, 1])
noisy_oracle = Oracle(labels=[0, 0, 0])
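A corrupted label vector can be produced in many ways. As a minimal sketch, assuming binary 0/1 labels and independent label flips, the hypothetical helper corrupt_labels below flips each label with a fixed probability using only the standard library. It is an illustration, not part of ALiPy.

```python
import random


def corrupt_labels(labels, flip_prob, seed=None):
    """Flip each binary (0/1) label independently with probability flip_prob."""
    rng = random.Random(seed)  # a seeded generator makes the corruption reproducible
    return [1 - y if rng.random() < flip_prob else y for y in labels]


clean = [1, 0, 1, 1, 0, 0, 1, 0]
noisy = corrupt_labels(clean, flip_prob=0.25, seed=0)
n_flipped = sum(a != b for a, b in zip(clean, noisy))
print(noisy, n_flipped)
```

The resulting list could then be passed as the labels argument when constructing the noisy oracle, while the clean list is kept for evaluation.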


2. Oracle for Multi-label

ALiPy provides alipy.oracle.OracleQueryMultiLabel for querying instance-label pairs. To initialize this class, you should provide the multi-label matrix with shape [n_samples, n_classes]. Please see multi label setting for more information about multi-label data.

from alipy.oracle import OracleQueryMultiLabel
oracle = OracleQueryMultiLabel(labels=mult_y)

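When querying, you need to provide a single valid multi-label index or a list of them. As the example below shows, a multi-label index is an (instance index, label index) tuple, with both positions counted from 0.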

label, cost = oracle.query_by_index((1, 2)) # query the 3rd label of 2nd instance
labels, cost = oracle.query_by_index([(1, 2), (0, 1)])
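To make the indexing convention concrete, the sketch below reads single entries out of a small label matrix using (instance index, label index) tuples. The matrix values and the helper name lookup_pair are made up for illustration, and plain nested lists stand in for the label matrix.

```python
# Hypothetical label matrix with shape [n_samples, n_classes] = [2, 3].
toy_mult_y = [
    [1, 0, 1],
    [0, 1, 1],
]


def lookup_pair(label_matrix, pair):
    """Return the label at (instance_index, label_index), both 0-based."""
    instance_index, label_index = pair
    return label_matrix[instance_index][label_index]


single = lookup_pair(toy_mult_y, (1, 2))                         # 3rd label of the 2nd instance
several = [lookup_pair(toy_mult_y, p) for p in [(1, 2), (0, 1)]]
print(single, several)  # 1 [1, 0]
```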

3. Multi oracles

Sometimes several oracles are available and work together, each with a different specialty.

ALiPy provides alipy.oracle.Oracles for this setting.

You can add several oracle objects to this container:

from alipy.oracle import Oracles
oracles = Oracles()
oracles.add_oracle(oracle_name='Tom', oracle_object=clean_oracle)
oracles.add_oracle(oracle_name='Amy', oracle_object=noisy_oracle)

And query from a certain oracle by its name:

labels, cost = oracles.query_from(index_for_querying=[23], oracle_name='Tom')

This class stores the query history of the oracles. You can obtain the full history for further use.
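The idea of a named container that records every query can be sketched in a few lines of plain Python. The class ToyOracles below is hypothetical and much simpler than alipy.oracle.Oracles: it assumes each oracle is just a dictionary from index to label, and it keeps the history as a list of tuples.

```python
class ToyOracles:
    """Hypothetical container: maps a name to a label table and logs every query."""

    def __init__(self):
        self._oracles = {}
        self.history = []  # (oracle_name, index, label) triples, in query order

    def add_oracle(self, oracle_name, label_table):
        self._oracles[oracle_name] = label_table

    def query_from(self, index_for_querying, oracle_name):
        table = self._oracles[oracle_name]
        labels = [table[i] for i in index_for_querying]
        for i, y in zip(index_for_querying, labels):
            self.history.append((oracle_name, i, y))
        return labels


toy_oracles = ToyOracles()
toy_oracles.add_oracle("Tom", {0: 1, 1: 0, 2: 1})  # clean annotator
toy_oracles.add_oracle("Amy", {0: 0, 1: 0, 2: 0})  # noisy annotator
toy_oracles.query_from([0, 2], oracle_name="Tom")
toy_oracles.query_from([0], oracle_name="Amy")
print(toy_oracles.history)  # [('Tom', 0, 1), ('Tom', 2, 1), ('Amy', 0, 0)]
```

Keeping the oracle name in each history entry makes it possible to compare annotators afterwards, for example by counting how often two of them disagree on the same index.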

Knowledge repository

alipy.oracle.ElementRepository is a supporting tool for the oracle: a class that stores the queried information (e.g., the queried labels, the selected indexes, and the cost).

This class is usually unnecessary for a clean oracle, because you can get the queried labels by indexing your ground-truth label matrix with the selected indexes. However, in some special settings (e.g., noisy oracles), users may want to store the queried information for further analysis.

Functions of knowledge repository include:

1. Retrieving queried information without cost

2. History recording

3. Getting the labeled set for model training

Here we introduce the basic usage of this class.


There are 2 categories of knowledge repository in alipy:

1. ElementRepository for fine-grained (element-wise) data

2. MatrixRepository for instance-wise data

You can get them by:

from alipy.oracle import ElementRepository, MatrixRepository
ele_rep = ElementRepository(labels=y[label_ind], indexes=label_ind, examples=X[label_ind])
mat_rep = MatrixRepository(labels=y[label_ind], indexes=label_ind, examples=X[label_ind])

Note that in ElementRepository the labels and examples can be complex objects, but in MatrixRepository they must be arrays. Also, the examples parameter is optional, but if you want to get the feature matrix of the labeled data, you must provide the feature matrix at initialization and in every query update.

Some commonly used methods are:

selected_ind = [23]
labels, cost = oracle.query_by_index(indexes=selected_ind)
ele_rep.update_query(labels=labels, indexes=selected_ind, cost=cost, examples=X[selected_ind])
labels, cost = ele_rep.retrieve_by_indexes(indexes=selected_ind)  # retrieve the queried instance
X_lab, y_lab, ind_lab = ele_rep.get_training_data()  # get the feature and label matrix of labeled set

You can also store the feature vectors of the queried instances by specifying the examples parameter when initializing and when updating a query.
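The three functions listed earlier (retrieval without cost, history recording, and building the labeled set) can be illustrated with a small dictionary-backed store. The class ToyRepository below is a hypothetical sketch whose method names mirror the calls shown above; its behaviour is an assumption made for illustration, not ALiPy's implementation.

```python
class ToyRepository:
    """Hypothetical store for queried results, keyed by instance index."""

    def __init__(self):
        self._entries = {}  # index -> (label, cost, example)

    def update_query(self, labels, indexes, cost=None, examples=None):
        cost = cost if cost is not None else [None] * len(indexes)
        examples = examples if examples is not None else [None] * len(indexes)
        for i, y, c, x in zip(indexes, labels, cost, examples):
            self._entries[i] = (y, c, x)

    def retrieve_by_indexes(self, indexes):
        """Look up stored labels and costs; no oracle is consulted."""
        return ([self._entries[i][0] for i in indexes],
                [self._entries[i][1] for i in indexes])

    def get_training_data(self):
        """Return (X_lab, y_lab, ind_lab) for everything stored so far."""
        ind_lab = sorted(self._entries)
        X_lab = [self._entries[i][2] for i in ind_lab]
        y_lab = [self._entries[i][0] for i in ind_lab]
        return X_lab, y_lab, ind_lab


rep = ToyRepository()
rep.update_query(labels=[1], indexes=[23], cost=[2], examples=[[4, 4, 4]])
rep.update_query(labels=[0], indexes=[7], cost=[1], examples=[[5, 5, 5]])
print(rep.retrieve_by_indexes([23]))  # ([1], [2])
print(rep.get_training_data())        # ([[5, 5, 5], [4, 4, 4]], [0, 1], [7, 23])
```

If examples is omitted in this sketch, the feature entries in X_lab are None, which reflects why the feature matrix has to be supplied on every update when labeled features are needed later.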

Copyright © 2018. All rights reserved.