Index manager

alipy.index is a module for managing the indexes of instances and labels in an experiment.

The tools in this module implement several necessary validity checking methods. They will be invoked automatically when you are adding or deleting elements in the container in order to avoid some simple but fatal mistakes.

In the rest of this guide, we first introduce the general usage of the IndexCollection class, then present the index toolkit for the multi-label setting. Note that the multi-label tools are used differently: they also store label indexes to support instance-label pair querying, which the common setting does not need. For the feature querying setting, the FeatureIndexCollection class is largely the same as MultiLabelIndexCollection. Please see feature querying for more details.

IndexCollection

alipy.index.IndexCollection is a set-like container to manage your labeled and unlabeled indexes. Unlike a plain set, it stores the indexes in order.

Initialize

You can create an IndexCollection object by passing a list or a numpy.ndarray object. (Note that any other data type will be treated as a single element.)

a = [1,2,3]
# create through an existing ToolBox object (named alibox here)
a_ind = alibox.IndexCollection(a)
# Or create by importing the class from the module
from alipy.index import IndexCollection
a_ind = IndexCollection(a)

Usage

The commonly used methods of IndexCollection are add(), discard(), update() and difference_update(). The first two operate on a single index, while the other two update multiple indexes at once.

# add a single index, warn if there is a repeated element.
a_ind.add(4)
# discard a single index, warn if not exist.
a_ind.discard(4)
# add a batch of indexes.
a_ind.update([4,5])
# discard a batch of indexes.
a_ind.difference_update([1,2])
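
To make the behavior described above concrete, here is a minimal pure-Python sketch of an ordered, set-like container with the same four methods. The class name OrderedIndexSet and its to_list() helper are hypothetical stand-ins written for illustration; this is not ALiPy's implementation, only an approximation of the warn-on-duplicate and warn-on-missing semantics.

```python
import warnings


class OrderedIndexSet:
    """Hypothetical stand-in for illustration, NOT ALiPy's IndexCollection."""

    def __init__(self, items=()):
        # dict keys keep insertion order and cannot contain duplicates
        self._items = dict.fromkeys(items)

    def add(self, index):
        # single-index insert; warn instead of failing on a repeated element
        if index in self._items:
            warnings.warn("index %r already exists" % (index,))
        else:
            self._items[index] = None

    def discard(self, index):
        # single-index removal; warn if the element is not present
        if index not in self._items:
            warnings.warn("index %r does not exist" % (index,))
        else:
            del self._items[index]

    def update(self, indexes):
        # batch insert
        for i in indexes:
            self.add(i)

    def difference_update(self, indexes):
        # batch removal
        for i in indexes:
            self.discard(i)

    def to_list(self):
        return list(self._items)


# Typical bookkeeping: move queried indexes from the unlabeled to the labeled pool.
labeled = OrderedIndexSet([0, 1])
unlabeled = OrderedIndexSet([2, 3, 4, 5])
queried = [4, 2]
labeled.update(queried)
unlabeled.difference_update(queried)
print(labeled.to_list())    # [0, 1, 4, 2]
print(unlabeled.to_list())  # [3, 5]
```

The same two calls, update() on the labeled pool and difference_update() on the unlabeled pool, are the usual pattern after each query.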

MultiLabelIndexCollection

This class stores indexes in the multi-label setting. We implement various tool functions for this setting to adapt to different needs. Because of the particularity of the multi-label setting, we first define what a multi-label index is:

Each index should be a tuple with 2 elements. The first element represents the index of instance, while the second one represents the indexes of labels. If you want to query all labels of an instance, your index should only have 1 element: (example_index, ). Otherwise, set 2 elements (example_index, [label_indexes]) to query specific labels.

Some examples of valid multi-label indexes include:

queried_index = (1, [3,4])	# query the 4th, 5th labels of the 2nd instance
queried_index = (1, [3])
queried_index = (1, 3)
queried_index = (1, (3))
queried_index = (1, (3,4))
queried_index = (1, )   # query all labels
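
The definition above can be read as a rule for expanding one index into individual (instance, label) pairs. The sketch below shows that rule in plain Python. The helper name expand_index is hypothetical and is not part of ALiPy; the library's own function for this job is flattern_multilabel_index, described later.

```python
def expand_index(index, label_size):
    """Expand one multi-label index into (instance, label) pairs.

    Hypothetical helper for illustration only, not an ALiPy function.
    """
    if len(index) == 1:
        # (example_index,) means: all labels of that instance
        return [(index[0], j) for j in range(label_size)]
    instance, labels = index
    if isinstance(labels, int):
        # (1, 3) and (1, (3)) both carry a single label index
        labels = [labels]
    pairs = []
    for j in labels:
        if not 0 <= j < label_size:
            raise ValueError("label index %r is outside the label space" % (j,))
        pairs.append((instance, j))
    return pairs


print(expand_index((1, [3, 4]), label_size=5))  # [(1, 3), (1, 4)]
print(expand_index((1, 3), label_size=5))       # [(1, 3)]
print(expand_index((1,), label_size=3))         # [(1, 0), (1, 1), (1, 2)]
```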

Initialization & Usage

Normally, you should pass a list of indexes in the form defined above to initialize a MultiLabelIndexCollection. However, we also support initializing from a 1d array in MATLAB style, or from an element mask matrix. Please see the following subsections for more details.

The initialization of MultiLabelIndexCollection needs an additional parameter, label_size, which specifies the size of your label space. It is used for validity checking in the other operations.

The usage of MultiLabelIndexCollection is otherwise the same as that of IndexCollection.

>>> from alipy.index import MultiLabelIndexCollection
>>> multi_lab_ind1 = MultiLabelIndexCollection([(0, 1), (0, 2), (0, (3, 4)), (1, (0, 1))], label_size=5)
>>> print(multi_lab_ind1)
{(0, 1), (1, 1), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> multi_lab_ind1.update((0, 0))
{(0, 1), (0, 0), (1, 1), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> multi_lab_ind1.update([(1, 2), (1, (3, 4))])
{(0, 1), (1, 2), (0, 0), (1, 3), (1, 4), (1, 1), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> multi_lab_ind1.update([(2,)])
{(0, 1), (1, 2), (0, 0), (1, 3), (2, 2), (1, 4), (2, 1), (2, 0), (1, 1), (2, 3), (2, 4), (0, 4), (1, 0), (0, 2), (0, 3)}
>>> multi_lab_ind1.difference_update([(0,)])
{(1, 2), (1, 3), (2, 2), (1, 4), (2, 1), (2, 0), (1, 1), (2, 3), (2, 4), (1, 0)}

One-dimensional index (MATLAB style) support

MultiLabelIndexCollection supports initializing from, and transforming into, a 1d index array, which may be useful for MATLAB users. In MATLAB, the elements of a matrix are indexed in column-major order. Here is an example of the 1d indexes of a 3x4 matrix:

[[ 0  3  6  9]
 [ 1  4  7 10]
 [ 2  5  8 11]]

This means that the 1d indexes [1, 4, 11] correspond to (1, 0), (1, 1), (2, 3) in a MultiLabelIndexCollection container. Let's see an example of how MultiLabelIndexCollection deals with a 1d array.

>>> b = [1, 4, 11]
>>> mi = MultiLabelIndexCollection.construct_by_1d_array(array=b, label_mat_shape=(3, 4))
>>> print(mi)
{(1, 0), (2, 3), (1, 1)}
>>> print('col major:', mi.get_onedim_index(order='F', ins_num=3))
col major: [1, 11, 4]
>>> print('row major:', mi.get_onedim_index(order='C'))
row major: [4, 11, 5]
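
The arithmetic behind these conversions is short enough to write out. The following sketch uses three hypothetical helper functions (they are not ALiPy APIs) to map between 1d indexes and (row, column) pairs for a matrix with n_rows rows and n_cols columns.

```python
def colmajor_to_pair(k, n_rows):
    # column-major: walk down each column first, so the row is the remainder
    return (k % n_rows, k // n_rows)


def pair_to_colmajor(pair, n_rows):
    row, col = pair
    return col * n_rows + row


def pair_to_rowmajor(pair, n_cols):
    # row-major: walk along each row first
    row, col = pair
    return row * n_cols + col


n_rows, n_cols = 3, 4
pairs = [colmajor_to_pair(k, n_rows) for k in [1, 4, 11]]
print(pairs)                                         # [(1, 0), (1, 1), (2, 3)]
print([pair_to_colmajor(p, n_rows) for p in pairs])  # [1, 4, 11]
print([pair_to_rowmajor(p, n_cols) for p in pairs])  # [4, 5, 11]
```

The values match the library output above up to ordering, since the container prints its elements in set order rather than input order.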

Element mask support

Element mask is a matrix with the same shape as the data array. The value of each entry can only be 1 or 0, in which, 1 means the element is selected.

>>> import numpy as np
>>> mask = np.asarray([
    [0, 1], 
    [1, 0], 
    [1, 0]
]) # 3 rows, 2 columns
>>> data = np.asarray([
    [1, 2],
    [3, 4],
    [5, 6]
])
>>> data_with_mask = data * mask
>>> print(data_with_mask)
[[0 2]
 [3 0]
 [5 0]]

MultiLabelIndexCollection supports initializing from, and transforming into, an element mask. The following examples show both directions.

>>> import numpy as np
>>> mask = np.asarray([
    [0, 1], 
    [1, 0], 
    [1, 0]
]) # 3 rows, 2 columns
>>> mi = MultiLabelIndexCollection.construct_by_element_mask(mask=mask)
>>> print(mi)
{(0, 1), (2, 0), (1, 0)}

You can also get the mask from a MultiLabelIndexCollection object:

>>> mi = MultiLabelIndexCollection([(0, 1), (2, 0), (1, 0)], label_size=2)
>>> print(mi.get_matrix_mask(mat_shape=(3, 2), sparse=False))
[[0 1]
 [1 0]
 [1 0]]
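
As a rough picture of what these two conversions do, the sketch below performs them on nested Python lists. The functions mask_to_pairs and pairs_to_mask are hypothetical illustrations and are not part of ALiPy, whose own methods work on numpy arrays and can also return sparse matrices.

```python
def mask_to_pairs(mask):
    # every entry equal to 1 becomes one (row, column) pair
    return [(i, j)
            for i, row in enumerate(mask)
            for j, value in enumerate(row)
            if value == 1]


def pairs_to_mask(pairs, shape):
    n_rows, n_cols = shape
    # start from an all-zero matrix and switch on the selected entries
    mask = [[0] * n_cols for _ in range(n_rows)]
    for i, j in pairs:
        mask[i][j] = 1
    return mask


mask = [[0, 1],
        [1, 0],
        [1, 0]]
pairs = mask_to_pairs(mask)
print(pairs)                                 # [(0, 1), (1, 0), (2, 0)]
print(pairs_to_mask(pairs, (3, 2)) == mask)  # True
```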

Retrieving

Since the unit in MultiLabelIndexCollection is a single label, it is necessary to provide some retrieving methods. We provide the following methods to get the different kinds of indexes in the container:

1. get_instance_index()

Get the index of instances contained in this object. If it is a labeled set, it is equivalent to the indexes of fully and partially labeled instances.

2. get_break_instances()

Return the indexes of break instances, that is, instances which have missing entries.

3. get_unbroken_instances()

Return the indexes of unbroken instances whose entries are all known.

>>> ind = [(1, ), (2, 3), (2, 4), (4, 6)]
>>> mi = MultiLabelIndexCollection(ind, label_size=7)
>>> print(mi.get_instance_index())
[1, 2, 4]
>>> print(mi.get_break_instances())
[4, 2]
>>> print(mi.get_unbroken_instances())
[1]
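
The distinction between break and unbroken instances comes down to counting how many labels of each instance are stored. Below is a minimal sketch under that assumption. The function split_instances is hypothetical and not an ALiPy API, and it returns sorted lists, whereas the library output above may come in a different order.

```python
from collections import defaultdict


def split_instances(pairs, label_size):
    """Split instances into (break, unbroken) by counting their stored labels.

    Hypothetical helper for illustration only.
    """
    labels_of = defaultdict(set)
    for instance, label in pairs:
        labels_of[instance].add(label)
    # an instance is unbroken only when every label in the label space is stored
    unbroken = sorted(i for i, ls in labels_of.items() if len(ls) == label_size)
    broken = sorted(i for i, ls in labels_of.items() if len(ls) < label_size)
    return broken, unbroken


# flattened form of [(1, ), (2, 3), (2, 4), (4, 6)] with label_size=7
pairs = [(1, j) for j in range(7)] + [(2, 3), (2, 4), (4, 6)]
broken, unbroken = split_instances(pairs, label_size=7)
print(broken)    # [2, 4]
print(unbroken)  # [1]
```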

Multi-label tools

To further support the multi-label operation, we provide several useful functions for multi-label setting.

1. flattern_multilabel_index(index_arr, label_size=None, check_arr=True)

This function flattens a folded multi-label index into individual (instance, label) pairs. If label_size is not specified, the function attempts to infer it.

>>> from alipy.index import flattern_multilabel_index
>>> a_ind = [(1,), (2,[1,2])]
>>> flattern_multilabel_index(a_ind, label_size=3)
[(1,0),(1,1),(1,2),(2,1),(2,2)]

2. integrate_multilabel_index(index_arr, label_size=None, check_arr=True)

This function is the inverse of flattern_multilabel_index(): it folds individual (instance, label) pairs back into the compact form.

>>> from alipy.index import integrate_multilabel_index
>>> a_ind = [(1,0),(1,1),(1,2),(2,1),(2,2)]
>>> integrate_multilabel_index(a_ind, label_size=3)
[(1,), (2,[1,2])]

3. get_labelmatrix_in_multilabel(index, data_matrix, unknown_element=0)

This function indexes the data matrix by a list of valid multi-label indexes. Since not all labels of an instance are known, the unknown elements are filled with the value of the unknown_element parameter.

>>> from alipy.index import get_labelmatrix_in_multilabel
>>> data_matrix = [[1, 1], [2, 2]]
>>> a_ind = [(0, 1), (1, 1)]
>>> matrix_clip, index_arr = get_labelmatrix_in_multilabel(a_ind,
                                                           data_matrix,
                                                           unknown_element=-1)
>>> print(index_arr)
[0, 1]
>>> print(matrix_clip)
[[-1, 1], [-1, 2]]

4. get_Xy_in_multilabel(index, X, y, unknown_element=0)

It works in basically the same way as get_labelmatrix_in_multilabel, but it indexes both the feature matrix and the label matrix, and returns the matched labeled dataset for model training.

>>> from alipy.index import get_Xy_in_multilabel
>>> X = [[1, 1], [2, 2]]
>>> y = [[3, 3], [4, 4]]
>>> a_ind = [(0, 1), (1, 1)]
>>> X_lab, y_lab = get_Xy_in_multilabel(a_ind, X, y, unknown_element=-1)
>>> print(X_lab)
[[1,1],[2,2]]
>>> print(y_lab)
[[-1,3],[-1,4]]
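
To show how the unknown_element filling in the last two functions might work, here is a plain-Python sketch that selects the rows of X and y for the queried instances and masks every label that was not queried. The function select_xy is a hypothetical illustration under that assumption; it is not the ALiPy implementation and only accepts already flattened (instance, label) pairs.

```python
def select_xy(pairs, X, y, unknown_element=0):
    """Return feature rows and masked label rows for the queried pairs.

    Hypothetical helper for illustration only.
    """
    instances = sorted({i for i, _ in pairs})
    row_of = {inst: r for r, inst in enumerate(instances)}
    n_labels = len(y[0])
    # features are copied whole; labels start out as all unknown
    X_lab = [list(X[i]) for i in instances]
    y_lab = [[unknown_element] * n_labels for _ in instances]
    for i, j in pairs:
        # only queried entries receive their true label value
        y_lab[row_of[i]][j] = y[i][j]
    return X_lab, y_lab


X = [[1, 1], [2, 2]]
y = [[3, 3], [4, 4]]
X_lab, y_lab = select_xy([(0, 1), (1, 1)], X, y, unknown_element=-1)
print(X_lab)  # [[1, 1], [2, 2]]
print(y_lab)  # [[-1, 3], [-1, 4]]
```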