Features with Data Pipelines
Feature engineering can create hundreds or thousands of variables, each capturing specialized domain knowledge, so a methodical approach to developing the code for such features is important. Data pipelines encourage the development of flexible, clean, and performant code by:
compartmentalizing the internal logic of each feature, allowing features to be added and removed as desired,
controlling possible parameters for features in one place,
providing a uniform interface for composing data transformation logic.
Scikit-learn implements data pipelines as sequences of Transformer objects.
Data Transformation in Scikit-Learn
Features in Scikit-learn are generated using Transformers. These are classes that implement the following interface:
Transformer.set_params defines parameters needed for the internal logic of the feature.
Transformer.fit takes in data and determines any parameters from the data that are necessary for creating the feature, returning the ‘fit’ transformer.
Transformer.transform takes in data and returns the feature defined by the transformer.
Transformer.fit_transform first calls fit on the given data, then applies the transform method to the same data used to fit the transformer.
Example: The Binarizer transformer creates a binary feature from a quantitative attribute. For example, suppose purchases contains a list of dollar amounts of purchases from a person in a given year:
import pandas as pd

purchases = pd.DataFrame([[1.0], [3.0], [25.0], [50.0], [6.0], [101.0]], columns=['Amount'])
The Binarizer transformer can be used to create a binary feature large_purchase that is 1 if a purchase is above $20 and 0 otherwise:
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=20)
binarizer.transform(purchases)
array([[0.],
       [0.],
       [1.],
       [1.],
       [0.],
       [1.]])
This transformer is initialized with a ‘threshold’ parameter, then used to transform dollar amounts to binary values according to the threshold.
Remark: The logic of Binarizer depends only on the value of ‘Amount’ in a given observation. This transformer’s fit method doesn’t need to do anything, as it doesn’t use any properties of the data.
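Because transformers share the set_params method described above, the threshold can also be changed after construction. A small sketch, reusing the binarizer from before:

binarizer.set_params(threshold=50)   # only the $101 purchase now maps to 1
binarizer.transform(purchases)

binarizer.set_params(threshold=20)   # restore the original threshold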
Example: The MinMaxScaler linearly scales a quantitative attribute so that the resulting feature is between 0 and 1. That is, MinMaxScaler transforms a dataset X according to the formula:
(X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
For example, on the purchases data:
from sklearn.preprocessing import MinMaxScaler
mms = MinMaxScaler()
mms.fit(purchases)
mms.transform(purchases)
array([[0.  ],
       [0.02],
       [0.24],
       [0.49],
       [0.05],
       [1.  ]])
Remark: The fit method is required before transforming the data, as the MinMaxScaler must determine the minimum and maximum values of the dataset to apply the formula.
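After fitting, these extremes are stored on the scaler as the data_min_ and data_max_ attributes, so the formula can be verified by hand:

mms.data_min_, mms.data_max_   # (array([1.]), array([101.]))

# Applying the formula directly reproduces the scaled values above
(purchases - purchases.min()) / (purchases.max() - purchases.min())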
Custom Transformers
If a desired feature transformation isn’t already implemented in Scikit-Learn, it can still be implemented in a straightforward way.
If the custom feature transformation logic doesn’t require fitting parameters from data, the FunctionTransformer class implements a transformer from a given function:
Example: To create a Transformer that log-scales the purchases data, pass np.log to the FunctionTransformer constructor:
import numpy as np
from sklearn.preprocessing import FunctionTransformer

logscaler = FunctionTransformer(func=np.log, validate=False)
logscaler.transform(purchases)
     Amount
0  0.000000
1  1.098612
2  3.218876
3  3.912023
4  1.791759
5  4.615121
FunctionTransformer can also pass keyword arguments into the custom function. For example, to log-scale the purchases data in a different base, the base can be specified via the kw_args keyword in the FunctionTransformer constructor:
def log_base(arr, base):
    '''Apply log scaling to an array with the specified base.'''
    return np.log(arr) / np.log(base)

logscaler = FunctionTransformer(func=log_base, kw_args={'base': 10}, validate=False)
logscaler.transform(purchases)
     Amount
0  0.000000
1  0.477121
2  1.397940
3  1.698970
4  0.778151
5  2.004321
A custom transformer that requires fitting is implemented by inheriting the TransformerMixin base class.
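For instance, here is a minimal sketch of such a transformer: a hypothetical MeanCenterer that must learn the column means during fit before subtracting them in transform. Inheriting from BaseEstimator additionally supplies get_params and set_params:

from sklearn.base import BaseEstimator, TransformerMixin

class MeanCenterer(BaseEstimator, TransformerMixin):
    '''Center each column by subtracting its mean (learned during fit).'''

    def fit(self, X, y=None):
        # Learn the column means; fitted attributes conventionally
        # end in a trailing underscore
        self.means_ = np.asarray(X).mean(axis=0)
        return self  # fit returns the fitted transformer

    def transform(self, X):
        # Subtract the learned means from each column
        return np.asarray(X) - self.means_

TransformerMixin then provides fit_transform for free, so MeanCenterer().fit_transform(purchases) behaves like the built-in transformers above.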
Applying Transformations to Multiple Columns
By default, Scikit-Learn Transformers apply a given transformation to every input column separately. However, most datasets contain various column types that require different transformation logic.
rand = pd.DataFrame(np.random.randint(10, size=(7,3)), columns='a b c'.split())
rand
   a  b  c
0  3  1  8
1  7  6  1
2  4  2  6
3  6  4  3
4  3  8  3
5  3  3  6
6  6  1  3
binarizer = Binarizer(threshold=5)
binarizer.transform(rand)
array([[0, 0, 1],
       [1, 1, 0],
       [0, 0, 1],
       [1, 0, 0],
       [0, 1, 0],
       [0, 0, 1],
       [1, 0, 0]])
Passing a function that selects the specified columns by name requires passing validate=False to FunctionTransformer (allowing the function to act on objects other than numpy arrays):
def select(df, cols):
    return df[cols]

columnSelector = FunctionTransformer(func=select, validate=False, kw_args={'cols': ['a', 'b']})
columnSelector.transform(rand)
   a  b
0  3  1
1  7  6
2  4  2
3  6  4
4  3  8
5  3  3
6  6  1
Composing these two transformers applies the binarizer to only the first two columns:
selected = columnSelector.transform(rand)
out = binarizer.transform(selected)
out
array([[0, 0],
       [1, 1],
       [0, 0],
       [1, 0],
       [0, 1],
       [0, 0],
       [1, 0]])
Data Transformation Pipelines in Scikit-Learn
Composing many feature transformers by hand is tedious and error-prone. Scikit-Learn has a Pipeline class to manage the composition of multiple transformers.
A Pipeline object is instantiated with a sequence of named transformers:
translist = [('trans1', t1), ('trans2', t2),..., ('transN', tN)]
pl = Pipeline(translist)
Each transformer must be given a name, which improves readability and helps with debugging.
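These names also make each step retrievable from the pipeline afterwards, via the named_steps attribute:

pl.named_steps['trans1']   # returns the transformer t1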
The resulting pipeline is itself a transformer, with fit and transform methods. Calling pl.fit_transform(data) iteratively calls fit_transform on the transformers in the pipeline. fit_transform roughly executes the following logic:
out = data
for name, trans in translist:
    out = trans.fit_transform(out)
out
Similar logic applies to both the fit and transform methods.
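For instance, once the pipeline has been fit, pl.transform(data) roughly runs each fitted step in order:

out = data
for name, trans in translist:
    out = trans.transform(out)   # each step has already been fit
out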
Example: To combine the columnSelector and binarizer transformations into a pipeline, merely pass them as a list:
from sklearn.pipeline import Pipeline

translist = [
    ('selector', columnSelector),
    ('binarizer', binarizer)
]

pl = Pipeline(translist)
pl.fit_transform(rand)
array([[0, 0],
       [1, 1],
       [0, 0],
       [1, 0],
       [0, 1],
       [0, 0],
       [1, 0]])
Applying Separate Transformations to Subsets of Columns
So far, transformers and pipelines have only been used to compose one data transformation after another. Most realistic scenarios, however, involve applying separate transformations to different subsets of columns and assembling the resulting features into a single dataset.
Scikit-Learn handles this logic with the ColumnTransformer class, which separately applies transformers to subsets of columns, returning the resulting features as the columns of an array.
Example: Suppose the following features are derived from the dataset rand:
For columns ‘a’ and ‘c’, return 1 if a value is in the top half of the range of the column; otherwise return 0.
For columns ‘a’ and ‘b’, return 1 if a value is more than one standard deviation above the mean of the column; otherwise return 0.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
To approach this, create a pipeline for each feature transformation:
trans1 = Pipeline([
    ('minmax', MinMaxScaler()),
    ('greater_than_half', Binarizer(threshold=0.5))
])

trans2 = Pipeline([
    ('stdscale', StandardScaler()),
    ('greater_than_1std', Binarizer(threshold=1))
])
These transformations are then applied to separate subsets of columns by passing them into ColumnTransformer:
ct = ColumnTransformer(
    [
        ('top_half_of_range', trans1, ['a', 'c']),
        ('above_one_stdev', trans2, ['a', 'b'])
    ]
)
There are four resulting features, as each transformation is applied to two columns:
ct.fit_transform(rand.astype(float))
array([[0., 1., 0., 0.],
       [1., 0., 1., 0.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.],
       [0., 0., 0., 1.],
       [0., 1., 0., 0.],
       [1., 0., 0., 0.]])
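Since a ColumnTransformer is itself a transformer, it can in turn serve as a named step in a larger Pipeline; a brief sketch reusing the objects above:

# The feature generation becomes a single named step; further
# transformers (or a model) could be appended to the list
full_pl = Pipeline([('features', ct)])
full_pl.fit_transform(rand.astype(float))   # the same four binary features

Note that by default a ColumnTransformer drops any input columns not assigned to a transformer; passing remainder='passthrough' would keep them alongside the generated features.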