alipy.index is a module for managing the indexes of instances and labels in an experiment.
The tools in this module implement several necessary validity checks. They are invoked automatically whenever you add or delete elements in a container, in order to avoid simple but fatal mistakes.
In the rest of this guide, we first introduce the general usage of the IndexCollection class, and then present the index toolkit for the multi-label setting. Note that the multi-label tools are used differently: they also store label indexes, because querying instance-label pairs differs from the common setting.
For the feature querying setting, the FeatureIndexCollection class is largely the same as MultiLabelIndexCollection. Please see feature querying for more details.
alipy.index.IndexCollection is a set-like container for managing your labeled and unlabeled indexes. Unlike a plain set, however, it stores the indexes in order.
You can create an IndexCollection object by passing a list or numpy.ndarray object. (Note that any other data type will be treated as a single element.)
a = [1,2,3]
# alibox is assumed to be an existing ToolBox object
a_ind = alibox.IndexCollection(a)
# Or create by importing the module
from alipy.index import IndexCollection
a_ind = IndexCollection(a)
The most commonly used methods of IndexCollection are add(), discard(), update(), and difference_update(). The first two operate on a single index, while the other two update multiple indexes at once.
# add a single index; warns if the element is already present.
a_ind.add(4)
# discard a single index; warns if it does not exist.
a_ind.discard(4)
# add a batch of indexes.
a_ind.update([4,5])
# discard a batch of indexes.
a_ind.difference_update([1,2])
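To make the described semantics concrete without depending on ALiPy, the following is a minimal sketch of an ordered, set-like container with the same four methods. The class name OrderedIndexSet is hypothetical and this is not how IndexCollection is implemented; it only illustrates the behavior described above (ordered storage, a warning on a repeated add or a missing discard) and the typical step of moving queried indexes from the unlabeled pool to the labeled pool.

```python
import warnings


class OrderedIndexSet:
    """Hypothetical stand-in for an ordered, set-like index container."""

    def __init__(self, indexes=()):
        # A dict keeps insertion order and holds each key only once.
        self._items = dict.fromkeys(indexes)

    def add(self, index):
        if index in self._items:
            warnings.warn(f"index {index!r} is already in the collection")
        else:
            self._items[index] = None

    def discard(self, index):
        if index not in self._items:
            warnings.warn(f"index {index!r} is not in the collection")
        else:
            del self._items[index]

    def update(self, indexes):
        for index in indexes:
            self.add(index)

    def difference_update(self, indexes):
        for index in indexes:
            self.discard(index)

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


unlabeled = OrderedIndexSet([0, 1, 2, 3, 4])
labeled = OrderedIndexSet()

queried = [3, 1]                      # indexes picked by some query strategy
labeled.update(queried)               # move them into the labeled pool
unlabeled.difference_update(queried)  # and out of the unlabeled pool

print(list(labeled))    # [3, 1]
print(list(unlabeled))  # [0, 2, 4]
```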
The MultiLabelIndexCollection class stores indexes for the multi-label setting. We implement various tool functions for this setting to meet different needs. Because of the particularities of the multi-label setting, we define a multi-label index as follows:
Each index should be a tuple. The first element is the index of the instance, and the optional second element gives the indexes of the labels. To query all labels of an instance, use a 1-element tuple: (example_index, ). To query specific labels, use 2 elements: (example_index, [label_indexes]).
Some examples of valid multi-label indexes include:
queried_index = (1, [3,4]) # query the 4th, 5th labels of the 2nd instance
queried_index = (1, [3])
queried_index = (1, 3)
queried_index = (1, (3))
queried_index = (1, (3,4))
queried_index = (1, ) # query all labels
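The forms listed above all reduce to a set of (instance, label) pairs. The sketch below shows one way that expansion could work; expand_index is a hypothetical helper written for illustration, not an ALiPy function, and it assumes label indexes are plain Python integers.

```python
def expand_index(index, label_size):
    """Expand one multi-label index into a list of (instance, label) pairs.

    Hypothetical helper for illustration; not part of alipy.
    """
    instance = index[0]
    if len(index) == 1:
        # (instance,) means every label of that instance.
        labels = range(label_size)
    else:
        labels = index[1]
        if isinstance(labels, int):
            # A bare integer such as (1, 3) names a single label.
            labels = [labels]
    for label in labels:
        if not 0 <= label < label_size:
            raise ValueError(f"label {label} is outside [0, {label_size})")
    return [(instance, label) for label in labels]


print(expand_index((1, [3, 4]), label_size=5))  # [(1, 3), (1, 4)]
print(expand_index((1, 3), label_size=5))       # [(1, 3)]
print(expand_index((1,), label_size=3))         # [(1, 0), (1, 1), (1, 2)]
```

The range check mirrors the role that label_size plays in validity checking: an index that names a label outside the label space is rejected.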
Normally, you should pass a list of the indexes defined above to initialize a MultiLabelIndexCollection. However, initializing from a 1d array in MATLAB style or from an element mask matrix is also supported. Please see the following subsections for more details.
Initializing a MultiLabelIndexCollection requires an additional parameter, label_size, which specifies the size of your label space and is used for validity checking in the other operations. Otherwise, MultiLabelIndexCollection is used in the same way as IndexCollection.
>>> from alipy.index import MultiLabelIndexCollection
>>> multi_lab_ind1 = MultiLabelIndexCollection([(0, 1), (0, 2), (0, (3, 4)), (1, (0, 1))], label_size=5)
>>> print(multi_lab_ind1)
{(0, 1), (1, 1), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> multi_lab_ind1.update((0, 0))
{(0, 1), (0, 0), (1, 1), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> multi_lab_ind1.update([(1, 2), (1, (3, 4))])
{(0, 1), (1, 2), (0, 0), (1, 3), (1, 4), (1, 1), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> multi_lab_ind1.update([(2,)])
{(0, 1), (1, 2), (0, 0), (1, 3), (2, 2), (1, 4), (2, 1), (2, 0), (1, 1), (2, 3), (2, 4), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> multi_lab_ind1.difference_update([(0,)])
{(1, 2), (1, 3), (2, 2), (1, 4), (2, 1), (2, 0), (1, 1), (2, 3), (2, 4), (1, 0)}
MultiLabelIndexCollection supports initializing from, and transforming into, a 1d index array, which may be useful for MATLAB users. In MATLAB, the elements of a matrix are indexed in column-major order. Here is an example of the 1d indexes of a 3x4 matrix:
[[ 0 3 6 9]
[ 1 4 7 10]
[ 2 5 8 11]]
That is, the 1d indexes [1, 4, 11] correspond to (1, 0), (1, 1), (2, 3) in a MultiLabelIndexCollection container. The following example shows how MultiLabelIndexCollection handles a 1d array.
>>> b = [1, 4, 11]
>>> mi = MultiLabelIndexCollection.construct_by_1d_array(array=b, label_mat_shape=(3, 4))
>>> print(mi)
{(1, 0), (2, 3), (1, 1)}
>>> print('col major:', mi.get_onedim_index(order='F', ins_num=3))
col major: [1, 11, 4]
>>> print('row major:', mi.get_onedim_index(order='C'))
row major: [4, 11, 5]
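The arithmetic behind these conversions is short enough to write out. The following sketch uses hypothetical helper names (they are not ALiPy functions) and plain integers. Note that column-major indexing needs the number of rows, whereas row-major indexing needs the number of columns.

```python
def from_column_major(flat_index, n_rows):
    """Map a column-major 1d index to a (row, column) pair."""
    column, row = divmod(flat_index, n_rows)
    return row, column


def to_column_major(row, column, n_rows):
    """Map a (row, column) pair to its column-major 1d index."""
    return column * n_rows + row


def to_row_major(row, column, n_columns):
    """Map a (row, column) pair to its row-major 1d index."""
    return row * n_columns + column


# Same 3x4 matrix as in the example above.
pairs = [from_column_major(i, n_rows=3) for i in [1, 4, 11]]
print(pairs)                                                # [(1, 0), (1, 1), (2, 3)]
print([to_column_major(r, c, n_rows=3) for r, c in pairs])  # [1, 4, 11]
print([to_row_major(r, c, n_columns=4) for r, c in pairs])  # [4, 5, 11]
```

The results match the example up to ordering, since the container itself does not keep the pairs in the order of the input array.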
An element mask is a matrix with the same shape as the data array. Each entry is either 1 or 0, where 1 means the element is selected.
>>> import numpy as np
>>> mask = np.asarray([
...     [0, 1],
...     [1, 0],
...     [1, 0]
... ])  # 3 rows, 2 columns
>>> data = np.asarray([
...     [1, 2],
...     [3, 4],
...     [5, 6]
... ])
>>> data_with_mask = data * mask
>>> print(data_with_mask)
[[0 2]
[3 0]
[5 0]]
MultiLabelIndexCollection supports initializing from, and transforming into, an element mask, as the following examples show.
>>> import numpy as np
>>> mask = np.asarray([
...     [0, 1],
...     [1, 0],
...     [1, 0]
... ])  # 3 rows, 2 columns
>>> mi = MultiLabelIndexCollection.construct_by_element_mask(mask=mask)
>>> print(mi)
{(0, 1), (2, 0), (1, 0)}
You can also get the mask from a MultiLabelIndexCollection object:
>>> mi = MultiLabelIndexCollection([(0, 1), (2, 0), (1, 0)], label_size=2)
>>> print(mi.get_matrix_mask(mat_shape=(3, 2), sparse=False))
[[0 1]
[1 0]
[1 0]]
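The correspondence between a mask and a set of (instance, label) pairs can be sketched in a few lines of plain Python. The two helpers below are hypothetical and use nested lists instead of NumPy arrays; they only illustrate the mapping, not ALiPy's implementation.

```python
def pairs_from_mask(mask):
    """Collect the (row, column) position of every non-zero mask entry."""
    return [(i, j)
            for i, row in enumerate(mask)
            for j, value in enumerate(row)
            if value]


def mask_from_pairs(pairs, shape):
    """Build a 0/1 mask of the given shape with a 1 at each listed pair."""
    n_rows, n_columns = shape
    mask = [[0] * n_columns for _ in range(n_rows)]
    for i, j in pairs:
        mask[i][j] = 1
    return mask


mask = [[0, 1],
        [1, 0],
        [1, 0]]
pairs = pairs_from_mask(mask)
print(pairs)                                         # [(0, 1), (1, 0), (2, 0)]
print(mask_from_pairs(pairs, shape=(3, 2)) == mask)  # True
```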
Since the basic unit in MultiLabelIndexCollection is a single label, it is necessary to provide some retrieval methods. The following methods return different kinds of indexes from the container:
1. get_instance_index()
Get the indexes of the instances contained in this object. For a labeled set, this is equivalent to the indexes of the fully and partially labeled instances.
2. get_break_instances()
Return the indexes of the broken instances, i.e. those that have missing entries.
3. get_unbroken_instances()
Return the indexes of the unbroken instances, whose entries are all known.
>>> ind = [(1, ), (2, 3), (2, 4), (4, 6)]
>>> mi = MultiLabelIndexCollection(ind, label_size=7)
>>> print(mi.get_instance_index())
[1, 2, 4]
>>> print(mi.get_break_instances())
[4, 2]
>>> print(mi.get_unbroken_instances())
[1]
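The distinction between broken and unbroken instances comes down to counting how many labels of each instance are present. Here is a minimal sketch under that reading; split_instances is a hypothetical helper, not an ALiPy method, and it works on already expanded (instance, label) pairs.

```python
from collections import defaultdict


def split_instances(flat_pairs, label_size):
    """Return (broken, unbroken) instance indexes, each sorted ascending."""
    labels_by_instance = defaultdict(set)
    for instance, label in flat_pairs:
        labels_by_instance[instance].add(label)
    broken = sorted(i for i, labels in labels_by_instance.items()
                    if len(labels) < label_size)
    unbroken = sorted(i for i, labels in labels_by_instance.items()
                      if len(labels) == label_size)
    return broken, unbroken


label_size = 7
# (1,) expands to all 7 labels of instance 1; the rest are single entries.
flat_pairs = [(1, j) for j in range(label_size)] + [(2, 3), (2, 4), (4, 6)]
broken, unbroken = split_instances(flat_pairs, label_size)
print(broken)    # [2, 4]
print(unbroken)  # [1]
```

Unlike the output shown above, this sketch sorts its results, so the order of the broken instances may differ.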
To further support multi-label operations, we provide several utility functions for the multi-label setting.
1.
flattern_multilabel_index(index_arr, label_size=None, check_arr=True)
This function flattens a folded multi-label index. If label_size is not specified, the function attempts to infer it.
>>> from alipy.index import flattern_multilabel_index
>>> a_ind = [(1,), (2,[1,2])]
>>> flattern_multilabel_index(a_ind, label_size=3)
[(1, 0), (1, 1), (1, 2), (2, 1), (2, 2)]
2.
integrate_multilabel_index(index_arr, label_size=None, check_arr=True)
This function is the inverse of flattern_multilabel_index().
>>> from alipy.index import integrate_multilabel_index
>>> a_ind = [(1,0),(1,1),(1,2),(2,1),(2,2)]
>>> integrate_multilabel_index(a_ind, label_size=3)
[(1,), (2, [1, 2])]
3.
get_labelmatrix_in_multilabel(index, data_matrix, unknown_element=0)
This function indexes the data matrix by a list of valid multi-label indexes. Since not all labels of an instance are known, the unknown elements are filled with the value of the unknown_element parameter.
>>> from alipy.index import get_labelmatrix_in_multilabel
>>> data_matrix = [[1, 1], [2, 2]]
>>> a_ind = [(0, 1), (1, 1)]
>>> matrix_clip, index_arr = get_labelmatrix_in_multilabel(a_ind, data_matrix, unknown_element=-1)
>>> print(index_arr)
[0, 1]
>>> print(matrix_clip)
[[-1, 1], [-1, 2]]
4.
get_Xy_in_multilabel(index, X, y, unknown_element=0)
It works much like get_labelmatrix_in_multilabel, but it indexes both the feature matrix and the label matrix and returns the matching labeled dataset for model training.
>>> from alipy.index import get_Xy_in_multilabel
>>> X = [[1, 1], [2, 2]]
>>> y = [[3, 3], [4, 4]]
>>> a_ind = [(0, 1), (1, 1)]
>>> X_lab, y_lab = get_Xy_in_multilabel(a_ind, X, y, unknown_element=-1)
>>> print(X_lab)
[[1, 1], [2, 2]]
>>> print(y_lab)
[[-1, 3], [-1, 4]]
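For readers who want to see the logic behind these utilities, the following plain-Python sketch reproduces two of them on the example data above. The functions integrate and select_xy are hypothetical re-implementations written for illustration; they are not the ALiPy functions, skip the validity checks, and use nested lists rather than NumPy arrays.

```python
def integrate(flat_pairs, label_size):
    """Fold flat (instance, label) pairs back into compact indexes."""
    grouped = {}
    for instance, label in flat_pairs:
        grouped.setdefault(instance, set()).add(label)
    folded = []
    for instance, labels in grouped.items():
        if len(labels) == label_size:
            # Every label is present, so the 1-element form is enough.
            folded.append((instance,))
        else:
            folded.append((instance, sorted(labels)))
    return folded


def select_xy(flat_pairs, X, y, unknown_element=0):
    """Return the feature rows and partially known label rows that are indexed."""
    n_labels = len(y[0])
    label_rows = {}
    for instance, label in flat_pairs:
        # Start from a row of unknowns, then copy in each known label.
        row = label_rows.setdefault(instance, [unknown_element] * n_labels)
        row[label] = y[instance][label]
    instances = list(label_rows)
    X_lab = [list(X[i]) for i in instances]
    y_lab = [label_rows[i] for i in instances]
    return X_lab, y_lab


print(integrate([(1, 0), (1, 1), (1, 2), (2, 1), (2, 2)], label_size=3))
# [(1,), (2, [1, 2])]

X = [[1, 1], [2, 2]]
y = [[3, 3], [4, 4]]
X_lab, y_lab = select_xy([(0, 1), (1, 1)], X, y, unknown_element=-1)
print(X_lab)  # [[1, 1], [2, 2]]
print(y_lab)  # [[-1, 3], [-1, 4]]
```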