alipy.data_manipulate
is a module for preparing your data for experiments.
Functions include:
1. Split a dataset into training, testing, labeled, and unlabeled sets
- Split by instance
- Split by single label in a multi-label dataset (see multi-label for more details)
- Split by feature for feature querying (see feature querying for more details)
2. Feature scaling
- Min-Max scale
- Standard scale
The usage of this module is introduced below.
To split the data in a custom way, you can import the split functions directly. There are 4 different data split functions:
alipy.data_manipulate.split()
splits your dataset by instance, which is probably
the most commonly used method.
With this method, the unit of the split is a single instance.
Let's see an example:
# prepare data
import numpy as np
X = np.random.rand(10, 10) # 10 instances with 10 features
y = [0] * 5 + [1] * 5
# split by instance
from alipy.data_manipulate import split
train, test, lab, unlab = split(X=X, y=y, test_ratio=0.5, initial_label_rate=0.5,
                                split_count=1, all_class=True, saving_path='.')
The returned values look like this (the exact indexes vary between runs because the split is random):
train = [array([0, 5, 3, 1, 8])]
test = [array([4, 9, 6, 2, 7])]
lab = [array([0, 5])]
unlab = [array([3, 1, 8])]
Each returned value is a list of index arrays, one array per split. If
split_count
is k with k larger than 1, each returned value contains
k index arrays for k-fold experiments.
Note that you can set
all_class=True
to ensure that each initially labeled set
contains at least one instance of each class. The
saving_path
parameter is the path
where the split results are saved. Pass
None
to this parameter to disable saving.
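To make the roles of test_ratio, initial_label_rate and split_count concrete, here is a minimal NumPy sketch of a random split by instance. It is only an illustration, not alipy's implementation: the helper name sketch_split, the seed argument and the rounding rules are assumptions, and the all_class and saving_path behaviour is not reproduced.

```python
import numpy as np


def sketch_split(n_instances, test_ratio=0.3, initial_label_rate=0.05,
                 split_count=10, seed=0):
    """Hypothetical helper: random split by instance (not alipy's code)."""
    rng = np.random.default_rng(seed)
    n_test = int(round(n_instances * test_ratio))
    train, test, lab, unlab = [], [], [], []
    for _ in range(split_count):
        # Shuffle the instance indexes, then cut off the test part.
        perm = rng.permutation(n_instances)
        test_idx, train_idx = perm[:n_test], perm[n_test:]
        # Assumed rounding rule: round down, but keep at least one label.
        n_lab = max(1, int(len(train_idx) * initial_label_rate))
        train.append(train_idx)
        test.append(test_idx)
        lab.append(train_idx[:n_lab])      # initially labeled part of train
        unlab.append(train_idx[n_lab:])    # remaining unlabeled pool
    return train, test, lab, unlab


train, test, lab, unlab = sketch_split(10, test_ratio=0.5,
                                       initial_label_rate=0.5, split_count=3)
print(len(train), len(test[0]), len(lab[0]), len(unlab[0]))
```

Each of the four returned lists holds one index array per split, and within a split the labeled and unlabeled indexes together make up the training indexes.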
For datasets that do not have an aligned feature matrix, each split method also accepts the shape of the data matrix or, as in the following example, a list of instance indexes:
img_name = ['IMG_0001.jpg', 'IMG_0002.jpg', 'IMG_0003.jpg', 'IMG_0004.jpg', 'IMG_0005.jpg']
train, test, lab, unlab = split(instance_indexes=img_name, test_ratio=0.5, initial_label_rate=0.5,
                                split_count=1, all_class=False, saving_path=None)
The returned values look like this:
train = [array(['IMG_0005.jpg', 'IMG_0002.jpg'], dtype='<U12')]
test = [array(['IMG_0001.jpg', 'IMG_0004.jpg', 'IMG_0003.jpg'], dtype='<U12')]
lab = [array(['IMG_0005.jpg'], dtype='<U12')]
unlab = [array(['IMG_0002.jpg'], dtype='<U12')]
A more convenient way to split your data is to use an
alipy.ToolBox
object.
import alipy
alibox = alipy.ToolBox(X=X, y=y, query_type='AllLabels')
train, test, lab, unlab = alibox.split_AL(test_ratio=0.3, initial_label_rate=0.1,
                                          split_count=1)
The split type is determined by the
query_type
passed when initializing the
ToolBox object.
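To get the split of a single fold, pass its index to get_split. The index must be smaller than split_count, so the second call below requires at least two folds: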
train_0, test_0, label_0, unlabel_0 = alibox.get_split(round=0)
train_1, test_1, label_1, unlabel_1 = alibox.get_split(round=1)
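Once the index arrays for a fold are available, they can be used to slice the data directly with NumPy. The sketch below does not call alipy. It reuses the index arrays printed in the first example as fixed values, so the variable contents are illustrative only.

```python
import numpy as np

X = np.random.rand(10, 10)          # 10 instances with 10 features
y = np.array([0] * 5 + [1] * 5)

# Index arrays copied from the first example's output (one fold).
train = [np.array([0, 5, 3, 1, 8])]
test = [np.array([4, 9, 6, 2, 7])]
lab = [np.array([0, 5])]
unlab = [np.array([3, 1, 8])]

for fold in range(len(train)):
    # Select the rows that belong to each subset of this fold.
    X_lab, y_lab = X[lab[fold]], y[lab[fold]]
    X_unlab = X[unlab[fold]]
    X_test, y_test = X[test[fold]], y[test[fold]]
    print(fold, X_lab.shape, X_unlab.shape, X_test.shape, y_lab.tolist())
```

Here the labeled set contains one instance of each class, as expected when all_class=True.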
Min-max scaling transforms features by scaling each feature to a given range. The transformation is given by:
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
Here min and max are the lower and upper bounds of feature_range. Example usage:
from alipy.data_manipulate import minmax_scale
X = minmax_scale(X=X, feature_range=(0, 1))
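As a worked illustration of the two formulas above, the following sketch applies them with plain NumPy to a small matrix. The helper name minmax_sketch is hypothetical and stands in for the formulas only. It is not alipy's minmax_scale, and it assumes that no feature column is constant.

```python
import numpy as np


def minmax_sketch(X, feature_range=(0, 1)):
    """Hypothetical helper: apply the min-max formulas column by column."""
    X = np.asarray(X, dtype=float)
    lo, hi = feature_range
    col_min = X.min(axis=0)
    col_max = X.max(axis=0)
    # Assumes no constant feature; otherwise the denominator would be zero.
    X_std = (X - col_min) / (col_max - col_min)
    return X_std * (hi - lo) + lo


X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
scaled_01 = minmax_sketch(X_demo)             # each column becomes 0, 0.5, 1
scaled_pm1 = minmax_sketch(X_demo, (-1, 1))   # each column becomes -1, 0, 1
print(scaled_01)
print(scaled_pm1)
```

Both columns map to the same scaled values because scaling is done per feature, independently of the original units.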
Standard scaling standardizes features by removing the mean and scaling to unit variance.
Example usage:
from alipy.data_manipulate import StandardScale
X = StandardScale(X=X)
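The standardization described above can be sketched in plain NumPy as follows. The helper name standard_sketch is hypothetical, and the use of the population standard deviation (ddof=0) is an assumption. The sketch is not alipy's StandardScale and does not handle constant features.

```python
import numpy as np


def standard_sketch(X):
    """Hypothetical helper: subtract the column mean, divide by the column std."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    std = X.std(axis=0)   # population standard deviation (ddof=0), an assumption
    return (X - mean) / std


X_demo = np.array([[1.0, 10.0],
                   [2.0, 20.0],
                   [3.0, 30.0]])
Z = standard_sketch(X_demo)
col_means = Z.mean(axis=0)   # close to 0 for every feature
col_stds = Z.std(axis=0)     # close to 1 for every feature
print(col_means, col_stds)
```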
Copyright © 2018, alipy developers (BSD 3 License).