alipy.oracle is a module that implements the basic definition of the oracle used in experiments.
The oracle (e.g., a human annotator) in active learning provides supervised information for instances at some cost.
In the following, we first introduce the general usage of the oracle. Then, several types of oracles are presented. Finally, the knowledge repository class, a supporting tool for the oracle, is introduced.
In alipy, the oracle is a class for retrieving supervised information. When initializing it, the knowledge pool of the oracle should be given, which is usually the labels of the instances. Then, the learner can query the oracle by instance or by index. The values returned by an oracle include the supervised information corresponding to the queried_index and the cost incurred by the query.
The oracle in alipy can be initialized with a list of objects (e.g., labels) and the corresponding indexes of the objects:
    from alipy.oracle import Oracle

    indexes = [34, 56, 74]
    labels = [0, 1, 0]
    oracle = Oracle(labels=labels, indexes=indexes)
You can also provide only the labels; the indexes will then be constructed automatically, starting from 0:
    >>> oracle = Oracle(labels=labels)
    >>> print(oracle.index_keys)
    [0, 1, 2]
The oracle also accepts a cost parameter at initialization to specify a different cost for each query. The cost is a list with the same length as the labels, in one-to-one correspondence with them. Note that if the cost is not provided, it is automatically set to 1 for each label.
    cost = [2, 1, 2]
    oracle = Oracle(labels=labels, indexes=indexes, cost=cost)
Finally, if you want to query the supervised information of an instance by its feature vector, you should also provide the corresponding feature matrix:
    feature_mat = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
    oracle = Oracle(labels=labels, indexes=indexes, cost=cost, examples=feature_mat)
The oracle object accepts new entries after initialization. Note that if a parameter was given at initialization (e.g., the feature matrix), it should also be provided when adding entries.
Here is an example that adds 2 entries to the oracle:
    oracle.add_knowledge(labels=[1, 0], indexes=[23, 33], examples=[[4, 4, 4], [5, 5, 5]])
There are two ways to query from the oracle.
1. Query by index. The queried_index should be a subset of the indexes known to the oracle (given when initializing it or added afterwards).
    # query one or more instances at a time
    queried_index = 23
    labels, cost = oracle.query_by_index(indexes=queried_index)
    queried_index = [23, 33]
    labels, cost = oracle.query_by_index(indexes=queried_index)
The returned values are lists with the same length as queried_index.
2. Query by example. If you passed the feature matrix to the oracle when initializing it, you can query with feature vectors, which should be rows of that feature matrix.
    labels, cost = oracle.query_by_example(queried_examples=[1, 1, 1])
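To make the query semantics described above concrete, here is a minimal pure-Python sketch of an oracle-like lookup. ToyOracle is a hypothetical stand-in written for illustration only; it is not ALiPy's implementation and only mirrors the behaviour stated in this section: indexes default to 0..n-1, each cost defaults to 1, and a query by index returns a list of labels and a list of costs.

```python
class ToyOracle:
    """Illustrative stand-in for an oracle (not ALiPy's implementation)."""

    def __init__(self, labels, indexes=None, cost=None):
        if indexes is None:
            indexes = list(range(len(labels)))   # indexes default to 0..n-1
        if cost is None:
            cost = [1] * len(labels)             # each query costs 1 by default
        if not (len(labels) == len(indexes) == len(cost)):
            raise ValueError("labels, indexes and cost must have the same length")
        # knowledge pool: index -> (label, cost)
        self._pool = {i: (y, c) for i, y, c in zip(indexes, labels, cost)}

    def query_by_index(self, indexes):
        # accept a single index or a list of indexes
        if not isinstance(indexes, (list, tuple)):
            indexes = [indexes]
        labels = [self._pool[i][0] for i in indexes]
        costs = [self._pool[i][1] for i in indexes]
        return labels, costs


toy = ToyOracle(labels=[0, 1, 0], indexes=[34, 56, 74], cost=[2, 1, 2])
labels, costs = toy.query_by_index([34, 56])
total_cost = sum(costs)  # cost spent so far, e.g. to compare against a budget
```

Summing the returned costs, as in the last line, is one simple way to track how much of a labeling budget has been spent.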
There are several different types of oracles in alipy.
1. Noisy oracle
In reality, the labels given by humans are not always correct. If labels come from an empirical experiment (e.g., in biological, chemical, or clinical studies), then one can usually expect some noise to result from the instrumentation or the experimental setting. Even if labels come from human experts, they may not always be reliable, for several reasons. First, some instances are inherently difficult for people and machines; second, people can become distracted or fatigued over time, introducing variability in the quality of their annotations.
It is thus important to consider these factors when designing a query strategy.
In alipy, it is very easy to construct a noisy oracle. As shown above, the values returned by an oracle are determined by the knowledge pool given at initialization. You can therefore use a corrupted label vector to initialize a noisy oracle.
    clean_oracle = Oracle(labels=[1, 0, 1])
    noisy_oracle = Oracle(labels=[0, 0, 0])
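One way to obtain such a corrupted label vector is to flip each label with a fixed probability. The sketch below assumes binary 0/1 labels; corrupt_labels and its noise_rate and seed parameters are hypothetical names introduced here for illustration and are not part of ALiPy.

```python
import random


def corrupt_labels(labels, noise_rate, seed=0):
    """Return a copy of binary (0/1) labels, each flipped with probability noise_rate."""
    rng = random.Random(seed)  # fixed seed keeps the corruption reproducible
    return [1 - y if rng.random() < noise_rate else y for y in labels]


clean_labels = [1, 0, 1, 1, 0, 0, 1, 0]
noisy_labels = corrupt_labels(clean_labels, noise_rate=0.25)

# Fraction of labels that actually differ from the ground truth.
observed_noise = sum(a != b for a, b in zip(clean_labels, noisy_labels)) / len(clean_labels)

# The corrupted vector would then be passed to the oracle, as in the text:
# noisy_oracle = Oracle(labels=noisy_labels)
```

On a small label vector the observed noise can differ noticeably from noise_rate, so it is worth checking it before running an experiment.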
2. Oracle for Multi-label
OracleQueryMultiLabel is implemented for instance-label pair querying. To initialize this
class, you should provide the multi-label matrix with the shape [n_samples, n_classes].
See the multi label setting for more information about multi label.
    from alipy.oracle import OracleQueryMultiLabel
    oracle = OracleQueryMultiLabel(labels=mult_y)
When querying, you need to provide a single valid multi label index, or a list of them, in the format defined above.
    label, cost = oracle.query_by_index((1, 2))  # query the 3rd label of the 2nd instance
    labels, cost = oracle.query_by_index([(1, 2), (0, 1)])
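The instance-label pair indexing used above can be pictured as a plain lookup into the label matrix. The following sketch only illustrates that indexing; query_pairs and toy_mult_y are hypothetical names, costs are not modelled, and this is not how OracleQueryMultiLabel is implemented.

```python
def query_pairs(label_matrix, pairs):
    """Look up labels for (instance_index, label_index) pairs in a label matrix."""
    if isinstance(pairs, tuple):   # a single pair such as (1, 2)
        pairs = [pairs]
    return [label_matrix[i][j] for i, j in pairs]


# Hypothetical label matrix with shape [n_samples, n_classes] = [2, 3].
toy_mult_y = [
    [1, 0, 0],
    [0, 1, 1],
]

single = query_pairs(toy_mult_y, (1, 2))             # 3rd label of the 2nd instance
several = query_pairs(toy_mult_y, [(1, 2), (0, 1)])  # two pairs in one query
```

Both positions in a pair are zero-based, which is why (1, 2) addresses the 3rd label of the 2nd instance.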
3. Multi oracles
Sometimes, several oracles are available and work together, each with a different specialty.
ALiPy provides the Oracles class for this setting.
You can add several oracle objects to this container:
    from alipy.oracle import Oracles
    oracles = Oracles()
    oracles.add_oracle(oracle_name='Tom', oracle_object=clean_oracle)
    oracles.add_oracle(oracle_name='Amy', oracle_object=noisy_oracle)
And query from a certain oracle by its name:
labels, cost = oracles.query_from(index_for_querying=, oracle_name='Tom')
This class stores the query history of the oracles. You can obtain the full history for further use.
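As a rough picture of a container that dispatches queries by oracle name and records a history, consider the sketch below. ToyOracles, its history attribute, and the record layout are assumptions made for illustration; they do not describe how ALiPy's Oracles class stores or exposes its history.

```python
class ToyOracles:
    """Illustrative container of named oracles that logs every query (not ALiPy's Oracles)."""

    def __init__(self):
        self._oracles = {}   # name -> dict mapping index to label
        self.history = []    # one (oracle_name, index, label) record per queried index

    def add_oracle(self, oracle_name, knowledge):
        self._oracles[oracle_name] = dict(knowledge)

    def query_from(self, indexes, oracle_name):
        pool = self._oracles[oracle_name]
        labels = [pool[i] for i in indexes]
        self.history.extend((oracle_name, i, y) for i, y in zip(indexes, labels))
        return labels


toy_oracles = ToyOracles()
toy_oracles.add_oracle("Tom", {0: 1, 1: 0, 2: 1})   # clean labels
toy_oracles.add_oracle("Amy", {0: 0, 1: 0, 2: 0})   # corrupted labels
tom_labels = toy_oracles.query_from([0, 2], oracle_name="Tom")
amy_labels = toy_oracles.query_from([0], oracle_name="Amy")
```

Keeping the oracle name in each history record makes it possible to compare, afterwards, how often different oracles disagreed on the same index.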
The knowledge repository is a class to store the queried information
(e.g., the queried labels, selected indexes, cost); it is a supporting
tool for the oracle.
This class is usually unnecessary for a clean oracle, because you can get the queried labels by indexing your ground-truth label matrix with the selected indexes. However, in some special settings (e.g., noisy oracles), users may want to store the queried information for further analysis.
Functions of knowledge repository include:
1. Retrieving queried information without cost
2. History recording
3. Getting the labeled set for training a model
Here we introduce the basic usage of this class.
There are 2 categories of knowledge repository in alipy:
ElementRepository for fine-grained (element-wise) data
MatrixRepository for instance-wise data
You can get them by:
    from alipy.oracle import ElementRepository, MatrixRepository
    ele_rep = ElementRepository(labels=y[label_ind], indexes=label_ind, examples=X[label_ind])
    mat_rep = MatrixRepository(labels=y[label_ind], indexes=label_ind, examples=X[label_ind])
Note that in ElementRepository, the labels and examples can be complex objects, whereas in
MatrixRepository they must be arrays. Also, the examples parameter is optional, but if you want to
get the feature matrix of the labeled data, you must provide the feature matrix at initialization
and in each query update.
Some commonly used methods are:
    selected_ind = 
    labels, cost = oracle.query_by_index(indexes=selected_ind)
    ele_rep.update_query(labels=labels, indexes=selected_ind, cost=cost, examples=X[selected_ind])
    labels, cost = ele_rep.retrieve_by_indexes(indexes=selected_ind)  # retrieve the queried instance
    X_lab, y_lab, ind_lab = ele_rep.get_training_data()  # get the feature and label matrix of labeled set
You can also store the feature vectors of the queried instances by specifying the
examples parameter when initializing and updating queries.
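The three functions of a knowledge repository listed earlier (retrieval without cost, history recording, and building the labeled set) can be sketched in a few lines of pure Python. ToyRepository is a hypothetical stand-in: its method names follow the calls shown in this section, but the internals and the sorted return order are assumptions, not ALiPy's implementation.

```python
class ToyRepository:
    """Illustrative store of queried information (not ALiPy's repository classes)."""

    def __init__(self):
        self._entries = {}   # index -> (label, cost, example)
        self.history = []    # indexes in the order they were queried

    def update_query(self, labels, indexes, cost, examples):
        for y, i, c, x in zip(labels, indexes, cost, examples):
            self._entries[i] = (y, c, x)
            self.history.append(i)

    def retrieve_by_indexes(self, indexes):
        # Reading stored answers does not call the oracle, so no new cost is incurred.
        labels = [self._entries[i][0] for i in indexes]
        costs = [self._entries[i][1] for i in indexes]
        return labels, costs

    def get_training_data(self):
        ind_lab = sorted(self._entries)
        X_lab = [self._entries[i][2] for i in ind_lab]
        y_lab = [self._entries[i][0] for i in ind_lab]
        return X_lab, y_lab, ind_lab


repo = ToyRepository()
repo.update_query(labels=[1, 0], indexes=[23, 33], cost=[1, 1], examples=[[4, 4, 4], [5, 5, 5]])
repo.update_query(labels=[0], indexes=[7], cost=[2], examples=[[1, 1, 1]])
X_lab, y_lab, ind_lab = repo.get_training_data()
```

Because the repository keeps whatever the oracle actually returned, it stays useful with a noisy oracle, where the stored labels may differ from the ground truth.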
Copyright © 2018. All rights reserved.