Data manipulate

alipy.data_manipulate is a module for preparing your data for experiments.

Functions include:

1. Split a dataset into training, testing, labeled, and unlabeled sets

- Split by instance

- Split by single label in a multi-label dataset (see multi-label for more details)

- Split features for feature querying (see feature querying for more details)

2. Feature scaling

- Min-Max scale

- Standard scale

The sections below describe how to use this module.

Splitting

To split the data in a custom way, you can import the split functions directly. There are four ways to split:

1. Split by instance

alipy.data_manipulate.split() splits your dataset by instance, which is probably the most commonly used method. Here the unit of the split is a single instance. For example:

# prepare data
import numpy as np
X = np.random.rand(10, 10)  # 10 instances with 10 features
y = [0] * 5 + [1] * 5

# split by instance
from alipy.data_manipulate import split
train, test, lab, unlab = split(X=X, y=y, test_ratio=0.5, initial_label_rate=0.5,
                                split_count=1, all_class=True, saving_path='.')

The values in the returned variables are:

train = [array([0, 5, 3, 1, 8])]
test = [array([4, 9, 6, 2, 7])]
lab = [array([0, 5])]
unlab = [array([3, 1, 8])]

Each returned value is a list of index arrays, one array per split. With split_count=1, as here, each list holds a single array. If split_count is k > 1, each list holds k arrays, one for each fold of a k-fold experiment.
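The index arrays are meant to be used for slicing the original data. The following is a minimal sketch of that step, assuming X and y are NumPy arrays and reusing the index values from the example output above (hard-coded here so the snippet runs without calling split):

```python
import numpy as np

X = np.random.rand(10, 10)           # 10 instances with 10 features
y = np.array([0] * 5 + [1] * 5)

# Index arrays copied from the example output above (one fold).
train = [np.array([0, 5, 3, 1, 8])]
test = [np.array([4, 9, 6, 2, 7])]
lab = [np.array([0, 5])]
unlab = [np.array([3, 1, 8])]

# Walk over the folds; with split_count=1 the loop body runs once.
for fold, (train_idx, test_idx, lab_idx, unlab_idx) in enumerate(
        zip(train, test, lab, unlab)):
    X_lab, y_lab = X[lab_idx], y[lab_idx]      # initially labeled data
    X_unlab = X[unlab_idx]                     # pool to query from
    X_test, y_test = X[test_idx], y[test_idx]  # held-out test data
    print(fold, X_lab.shape, X_unlab.shape, X_test.shape)
```

Because only indexes are stored, the same split can be reused with any representation of the data that is indexed in the same order.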

Note that you can set all_class=True to force each initially labeled set to contain at least one instance of every class. The saving_path parameter is the path where the split results are saved; pass None to disable saving.
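To make the roles of test_ratio, initial_label_rate and all_class concrete, here is a rough pure-Python sketch of one way such a split could be produced. It is an illustration only and not ALiPy's implementation: the function name sketch_split, the seed and max_tries parameters, the rounding, and the retry strategy are all assumptions.

```python
import random

def sketch_split(y, test_ratio=0.5, initial_label_rate=0.5,
                 all_class=True, seed=0, max_tries=1000):
    """Illustrative instance split (hypothetical helper, not ALiPy code)."""
    rng = random.Random(seed)
    n = len(y)
    classes = set(y)
    n_train = round((1 - test_ratio) * n)
    for _ in range(max_tries):
        order = list(range(n))
        rng.shuffle(order)                       # random permutation of indexes
        train, test = order[:n_train], order[n_train:]
        # The label rate is applied to the training set; keep at least one.
        n_lab = max(1, round(initial_label_rate * len(train)))
        lab, unlab = train[:n_lab], train[n_lab:]
        # With all_class, reshuffle until the labeled set covers every class.
        if not all_class or {y[i] for i in lab} == classes:
            return train, test, lab, unlab
    raise ValueError("no split found whose labeled set covers every class")

y = [0] * 5 + [1] * 5
train, test, lab, unlab = sketch_split(y)
print(train, test, lab, unlab)
```

The labeled and unlabeled sets always partition the training set, so the test indexes never leak into the pool that is queried.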

2. Split in multi-label dataset (Please see multi-label for more details)

3. Split feature for feature querying (Please see feature querying for more details)

4. Split with data shape

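The split functions can also be used on datasets that do not have an aligned feature matrix. Instead of the data itself, they accept the shape of the data matrix or, as in the example below, the instance indexes (here a list of image file names passed as instance_indexes):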

img_name = ['IMG_0001.jpg', 'IMG_0002.jpg', 'IMG_0003.jpg', 'IMG_0004.jpg', 'IMG_0005.jpg']
train, test, lab, unlab = split(instance_indexes=img_name, test_ratio=0.5, initial_label_rate=0.5,
                                split_count=1, all_class=False, saving_path=None)

The returned values look like this:

train = [array(['IMG_0005.jpg', 'IMG_0002.jpg'], dtype='<U12')]
test = [array(['IMG_0001.jpg', 'IMG_0004.jpg', 'IMG_0003.jpg'], dtype='<U12')]
lab = [array(['IMG_0005.jpg'], dtype='<U12')]
unlab = [array(['IMG_0002.jpg'], dtype='<U12')]

Split by alipy.ToolBox

A more convenient way to split your data is to use an alipy.ToolBox object.

import alipy
alibox = alipy.ToolBox(X=X, y=y, query_type='AllLabels')
train, test, lab, unlab = alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1,
                                          split_count=1)

The split type is determined by the query_type given when the ToolBox object is initialized.

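To get the split of a single fold, pass the fold index as round. The index must be smaller than split_count, so the second call below assumes split_AL was run with split_count of at least 2: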

train_0, test_0, label_0, unlabel_0 = alibox.get_split(round=0)
train_1, test_1, label_1, unlabel_1 = alibox.get_split(round=1)

Feature scaling

Min-Max scale

Transforms features by scaling each feature to a given range. With min, max = feature_range, the transformation is:

X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

X_scaled = X_std * (max - min) + min

Example usage:

from alipy.data_manipulate import minmax_scale
X = minmax_scale(X=X, feature_range=(0, 1))
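The formula above can be checked by hand with a few lines of NumPy. The sketch below is a standalone illustration, not the library's code: the name minmax_sketch is hypothetical, and the handling of constant columns (where max equals min) is an added assumption to avoid division by zero.

```python
import numpy as np

def minmax_sketch(X, feature_range=(0, 1)):
    """Column-wise min-max scaling following the formula above."""
    lo, hi = feature_range
    X = np.asarray(X, dtype=float)
    col_min = X.min(axis=0)
    span = X.max(axis=0) - col_min
    span[span == 0] = 1.0            # assumption: leave constant columns at lo
    X_std = (X - col_min) / span
    return X_std * (hi - lo) + lo

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 40.0]])
X_scaled = minmax_sketch(X_demo, feature_range=(0, 1))
print(X_scaled)
```

Each column is scaled independently, so the smallest value in every column maps to the lower bound of feature_range and the largest to the upper bound.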

Standard scale

Standardizes features by removing the mean and scaling to unit variance.

Example usage:

from alipy.data_manipulate import StandardScale
X = StandardScale(X=X)
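Standardization replaces each feature value x with (x - mean) / std, where mean and std are computed per column. The sketch below works through this with NumPy. It is an illustration under assumptions rather than the library's implementation: the name standardize_sketch is hypothetical, the population standard deviation (ddof=0) is assumed, and constant columns are left unscaled.

```python
import numpy as np

def standardize_sketch(X):
    """Column-wise standardization: zero mean, unit variance."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)              # assumption: population std (ddof=0)
    std[std == 0] = 1.0              # assumption: do not scale constant columns
    return (X - mean) / std

X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 40.0]])
X_standardized = standardize_sketch(X_demo)
print(X_standardized.mean(axis=0))   # close to 0 for every column
print(X_standardized.std(axis=0))    # close to 1 for every column
```

Unlike min-max scaling, the result is not bounded to a fixed range, which makes it less sensitive to a single extreme value stretching the scale.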