Automated feature engineering solves one of the biggest problems in applied machine learning by streamlining a critical, yet manually intensive step in the ML pipeline. However, even after feature engineering, handling non-numeric data for use by machine learning algorithms is unavoidable and presents its own set of unique challenges.

This article serves as both a guide to categorical encoding and an introduction to a Python library for automated categorical encoding, designed for easy integration into feature engineering workflows.

Categorical Encoding: The What/Why

In automated machine learning, the focus has traditionally been on selecting algorithms or tuning model parameters, but much of the work in applied machine learning lies in other steps of the process. For example, relevant predictor variables, called features, must first be extracted from raw data. This step, called feature engineering, is essential, yet it is also arguably one of the most manually intensive steps in the applied ML process.

Similarly, handling non-numeric data is a critical component of nearly every machine learning process. In applied machine learning, the two most common types of data are numeric data (such as age: 5, 17, 35) and categorical data (such as color: red, blue, green). Numeric data is often easier to handle than categorical data because machine learning algorithms operate on mathematical vectors and have no knowledge of the context behind the values.

In contrast, many machine learning models cannot work directly with categorical data. Categorical values do not have any intrinsic mathematical relations, so we must first do some work on the data before we can feed it to a machine learning model. The process of turning categorical data into usable, machine-learning ready, mathematical data is called categorical encoding.
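As a quick illustration of what encoding means in practice, here is a minimal sketch that uses pandas directly rather than the categorical-encoding library. The column names and values are made up for the example. It turns a color column into one 0/1 indicator column per category, which is the One-Hot approach covered later in this article.

```python
import pandas as pd

# A tiny table with one numeric and one categorical column (illustrative data).
df = pd.DataFrame({
    "age": [5, 17, 35],
    "color": ["red", "blue", "green"],
})

# One-hot encode the categorical column: one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["color"], dtype=int)

print(encoded)
```

After this step every column is numeric, so the table can be passed to a model that expects numeric vectors.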

Categorical Encoding API: The Where

To create an encoder that would be universally applicable to any machine learning pipeline, three distinct situations in which data would need to be categorically encoded must be considered: training data, testing data, and new incoming data upon which our trained ML model would perform predictions.

The categorical-encoding library is designed to handle all three situations in order to seamlessly integrate with its neighboring steps of feature engineering and machine learning model training. The flowchart below illustrates the typical workflow for using categorical-encoding in your machine learning pipeline.

[Figure: flowchart of the categorical-encoding workflow within a machine learning pipeline]
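The fit-once, reuse-everywhere pattern behind these three situations can be sketched with a toy encoder. Everything here is hypothetical: the TinyOrdinalEncoder class and the choice of -1 for unseen categories are illustrations, not part of categorical-encoding. The point is that the mapping is learned from training data only and then applied unchanged to test data and new data.

```python
import pandas as pd


class TinyOrdinalEncoder:
    """Hypothetical minimal encoder: learn a mapping on train data, reuse it elsewhere."""

    def fit(self, values):
        # Number categories from 1 in order of first appearance in the training data.
        self.mapping_ = {cat: i for i, cat in enumerate(pd.unique(values), start=1)}
        return self

    def transform(self, values):
        # Categories never seen during fit map to -1 instead of raising an error.
        return values.map(self.mapping_).fillna(-1).astype(int)


train = pd.Series(["US", "US", "AL"])
test = pd.Series(["AL", "US"])
new = pd.Series(["US", "FR"])  # "FR" was never seen during training

enc = TinyOrdinalEncoder().fit(train)
print(enc.transform(train).tolist())
print(enc.transform(test).tolist())
print(enc.transform(new).tolist())
```

Deciding up front how unseen categories are handled is the main design choice here, since new incoming data will eventually contain values the encoder was never fit on.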

In a standard machine learning pipeline, after performing feature engineering on your dataset, you are left with a data table, dubbed a feature matrix, with features relevant to the prediction problem at hand. At this point in the machine learning pipeline, we must first perform a train-test split on our data—we need train data to, well, train our model and test data to validate it.

Here’s what the basic setup for our code would look like. You can get the full code at this Jupyter notebook.

import categorical_encoding as ce
import featuretools as ft

from featuretools.tests.testing_utils import make_ecommerce_entityset
es = make_ecommerce_entityset()
f1 = ft.Feature(es["log"]["product_id"])
f2 = ft.Feature(es["log"]["purchased"])
f3 = ft.Feature(es["log"]["value"])
f4 = ft.Feature(es["log"]["countrycode"])

features = [f1, f2, f3, f4]
ids = [0, 1, 2, 3, 4, 5]
feature_matrix = ft.calculate_feature_matrix(features, es,
                                             instance_ids=ids)
print(feature_matrix)
    product_id  purchased  value countrycode
id                                          
0    coke zero       True    0.0          US
1    coke zero       True    5.0          US
2    coke zero       True   10.0          US
3          car       True   15.0          US
4          car       True   20.0          US
5   toothpaste       True    0.0          AL
train_data = feature_matrix.iloc[[0, 1, 4, 5]]
print(train_data)
    product_id  purchased  value countrycode
id                                          
0    coke zero       True    0.0          US
1    coke zero       True    5.0          US
4          car       True   20.0          US
5   toothpaste       True    0.0          AL
test_data = feature_matrix.iloc[[2, 3]]
print(test_data)
   product_id  purchased  value countrycode
id                                         
2   coke zero       True   10.0          US
3         car       True   15.0          US

We must handle these two groups separately, rather than encoding the entire feature matrix at once, because many encoders must be fit before they can transform data. For example, target encoders depend on the average target value of each category, so incorporating values from the test data into our encoding would lead to label leakage.

For that reason, after initializing our encoder with a method of our choice (see the When/How section below for how to decide which encoder to pick), we fit and transform on our train data, and we only transform, without fitting, on our test data. Encoding returns the encoded data and also produces a set of encoded features that we can reuse later.

enc = ce.Encoder(method='leave_one_out')

train_enc = enc.fit_transform(train_data, features, train_data['value'])

test_enc = enc.transform(test_data)
print(train_enc)

    PRODUCT_ID_leave_one_out  purchased  value  COUNTRYCODE_leave_one_out
id                                                                       
0                       5.00       True    0.0                      12.50
1                       0.00       True    5.0                      10.00
4                       6.25       True   20.0                       2.50
5                       6.25       True    0.0                       6.25
print(test_enc)
    PRODUCT_ID_leave_one_out  purchased  value  COUNTRYCODE_leave_one_out
id                                                                       
2                       2.50       True   10.0                   8.333333
3                       6.25       True   15.0                   8.333333
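To see where these numbers come from, the leave-one-out values can be recomputed by hand with pandas. This is a sketch under assumptions, not the library's implementation: the helper names loo_fit_transform and loo_transform are hypothetical, and the rule that a category with a single training row falls back to the global training mean is inferred from the output above rather than taken from the library's source.

```python
import pandas as pd

# The same train/test rows as above, rebuilt by hand.
train = pd.DataFrame(
    {
        "product_id": ["coke zero", "coke zero", "car", "toothpaste"],
        "countrycode": ["US", "US", "US", "AL"],
        "value": [0.0, 5.0, 20.0, 0.0],
    },
    index=pd.Index([0, 1, 4, 5], name="id"),
)
test = pd.DataFrame(
    {"product_id": ["coke zero", "car"], "countrycode": ["US", "US"]},
    index=pd.Index([2, 3], name="id"),
)


def loo_fit_transform(df, column, target):
    """Encode each training row with the mean target of the OTHER rows in its category."""
    global_mean = df[target].mean()
    grouped = df.groupby(column)[target]
    total = grouped.transform("sum")
    count = grouped.transform("count")
    loo = (total - df[target]) / (count - 1)
    # Assumption: singleton categories have no other rows, so use the global mean.
    return loo.where(count > 1, global_mean)


def loo_transform(train_df, new_df, column, target):
    """Encode rows outside the training set with the full training mean of their category."""
    global_mean = train_df[target].mean()
    stats = train_df.groupby(column)[target].agg(["mean", "count"])
    # Assumption: singleton and unseen categories both fall back to the global mean.
    means = stats["mean"].where(stats["count"] > 1, global_mean)
    return new_df[column].map(means).fillna(global_mean)


print(loo_fit_transform(train, "product_id", "value"))
print(loo_transform(train, test, "countrycode", "value"))
```

Note that the test rows never contribute their own value column to the encoding, which is exactly what keeps the target from leaking.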

In the final step of the machine learning pipeline, we receive new data for which our trained model must make predictions. For this, we take the features generated in the previous step and use Featuretools’ calculate_feature_matrix() function to create the encoded data directly, so we do not have to create a separate encoder and go through the entire encoding process again. The result is data that we can immediately feed to our machine learning model.

fm2 = ft.calculate_feature_matrix(features, es, instance_ids=[6,7])
print(fm2)
    product_id  purchased  value countrycode
id                                          
6   toothpaste       True    1.0          AL
7   toothpaste       True    2.0          AL
features_encoded = enc.get_features()

fm2_encoded = ft.calculate_feature_matrix(features_encoded, es, instance_ids=[6,7])

print(fm2_encoded)
    PRODUCT_ID_leave_one_out  purchased  value  COUNTRYCODE_leave_one_out
id                                                                       
6                       6.25       True    1.0                       6.25
7                       6.25       True    2.0                       6.25

Categorical Encoding: The When/How

The hardest part of categorical encoding can be finding the right encoding method; numerous research papers and studies analyze how encoding approaches perform on different datasets. By compiling the common factors shared by datasets that use the same encoding method, I’ve created the following flowchart as a guide for finding the method best suited to your data. Please note that this flowchart is only a starting point. Feel free to experiment with different encoding methods, even those not listed here, to see which one works best for your specific data and machine learning problem.

That being said, this flowchart reflects solid general guidelines when it comes to commonly used categorical encoding methods. In addition, if you want a more comprehensive guide to categorical encoding, here is a Jupyter notebook that covers each categorical encoding method and its appropriate situations of application in much greater detail.

[Figure: flowchart for choosing a categorical encoding method]

Categorical Encoding Methods

Encoding methods can be loosely grouped into several categories.

Classic Encoders

Classic encoders are the most straightforward and easiest to understand, making them very useful and popular among ML practitioners.

If you are not sure which encoding method to use, One-Hot encoding is almost always a good place to start. It is the go-to approach for categorical encoding because it is easy to use and understand, versatile, and accurate.

The classic encoders already included in the categorical-encoding library are Ordinal, One-Hot, Binary, and Hashing. Below are code examples showing how each one is used. If you wish to follow along, these examples are also located in the Jupyter notebook.

# Creates a new column for each unique value. 
enc_one_hot = ce.Encoder(method='one_hot')
fm_enc_one_hot = enc_one_hot.fit_transform(feature_matrix, features)
fm_enc_one_hot
    product_id = coke zero  product_id = car  product_id = toothpaste  purchased  value  countrycode = US  countrycode = AL
id
0                        1                 0                        0       True    0.0                 1                 0
1                        1                 0                        0       True    5.0                 1                 0
2                        1                 0                        0       True   10.0                 1                 0
3                        0                 1                        0       True   15.0                 1                 0
4                        0                 1                        0       True   20.0                 1                 0
5                        0                 0                        1       True    0.0                 0                 1
# Each unique string value is assigned a counting number specific to that value.
enc_ord = ce.Encoder(method='ordinal')
fm_enc_ord = enc_ord.fit_transform(feature_matrix, features)
fm_enc_ord
    PRODUCT_ID_ordinal  purchased  value  COUNTRYCODE_ordinal
id
0                    1       True    0.0                    1
1                    1       True    5.0                    1
2                    1       True   10.0                    1
3                    2       True   15.0                    1
4                    2       True   20.0                    1
5                    3       True    0.0                    2
# The categories' values are first Ordinal encoded,
# the resulting integers are converted to binary,
# then the resulting digits are split into columns.
enc_bin = ce.Encoder(method='binary')
fm_enc_bin = enc_bin.fit_transform(feature_matrix, features)
fm_enc_bin
    PRODUCT_ID_binary[0]  PRODUCT_ID_binary[1]  PRODUCT_ID_binary[2]  purchased  value  COUNTRYCODE_binary[0]  COUNTRYCODE_binary[1]
id
0                      0                     0                     1       True    0.0                      0                      1
1                      0                     0                     1       True    5.0                      0                      1
2                      0                     0                     1       True   10.0                      0                      1
3                      0                     1                     0       True   15.0                      0                      1
4                      0                     1                     0       True   20.0                      0                      1
5                      0                     1                     1       True    0.0                      1                      0
# Use a hashing algorithm to map category values to corresponding columns
enc_hash = ce.Encoder(method='hashing')
fm_enc_hash = enc_hash.fit_transform(feature_matrix, features)
fm_enc_hash
    PRODUCT_ID_hashing[0]  PRODUCT_ID_hashing[1]  PRODUCT_ID_hashing[2]  PRODUCT_ID_hashing[3]  PRODUCT_ID_hashing[4]  PRODUCT_ID_hashing[5]  PRODUCT_ID_hashing[6]  PRODUCT_ID_hashing[7]  purchased  value  COUNTRYCODE_hashing[0]  COUNTRYCODE_hashing[1]  COUNTRYCODE_hashing[2]  COUNTRYCODE_hashing[3]  COUNTRYCODE_hashing[4]  COUNTRYCODE_hashing[5]  COUNTRYCODE_hashing[6]  COUNTRYCODE_hashing[7]
id
0                       0                      0                      0                      0                      1                      0                      0                      0       True    0.0                       0                       0                       1                       0                       0                       0                       0                       0
1                       0                      0                      0                      0                      1                      0                      0                      0       True    5.0                       0                       0                       1                       0                       0                       0                       0                       0
2                       0                      0                      0                      0                      1                      0                      0                      0       True   10.0                       0                       0                       1                       0                       0                       0                       0                       0
3                       0                      1                      0                      0                      0                      0                      0                      0       True   15.0                       0                       0                       1                       0                       0                       0                       0                       0
4                       0                      1                      0                      0                      0                      0                      0                      0       True   20.0                       0                       0                       1                       0                       0                       0                       0                       0
5                       0                      0                      0                      1                      0                      0                      0                      0       True    0.0                       0                       1                       0                       0                       0                       0                       0                       0
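To make the Binary encoder's three steps (ordinal encode, convert to binary, split into digits) concrete, here is a minimal sketch in plain pandas. It is not the library's implementation, and the output column names are only modeled on the ones above. It also uses the fewest digits that fit the largest code, so it produces two product columns where the library output above shows three, the first of which is all zeros.

```python
import pandas as pd

products = pd.Series(
    ["coke zero", "coke zero", "coke zero", "car", "car", "toothpaste"],
    name="product_id",
)

# Step 1: ordinal encode, numbering categories from 1 in order of first appearance.
codes = pd.factorize(products)[0] + 1

# Step 2: find how many binary digits the largest code needs.
n_bits = int(codes.max()).bit_length()

# Step 3: split each code into its binary digits, most significant digit first.
columns = {
    f"PRODUCT_ID_binary[{i}]": (codes >> (n_bits - 1 - i)) & 1
    for i in range(n_bits)
}
encoded = pd.DataFrame(columns)

print(encoded)
```

Because the number of columns grows with the logarithm of the number of categories, this kind of encoding stays compact where One-Hot encoding would add one column per category.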

Bayesian Encoders

The most significant difference between Bayesian and Classic encoders is that Bayesian encoders use information from a dependent variable in addition to the categorical variable. Furthermore, they output only one column per encoded feature, which eliminates the high-dimensionality concerns that sometimes affect other encoders.
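The dimensionality point is easy to check with a small experiment. The sketch below uses made-up data and plain pandas, with a simple per-category target mean standing in for a Bayesian encoder, to compare how many columns each approach produces for a column with 200 distinct categories.

```python
import pandas as pd

# Hypothetical high-cardinality column: 1,000 rows spread over 200 distinct categories.
df = pd.DataFrame({
    "category": [f"cat_{i % 200}" for i in range(1000)],
    "target": [float(i % 7) for i in range(1000)],
})

# One-hot encoding adds one column per distinct category.
one_hot = pd.get_dummies(df["category"])

# A mean-of-target encoding stays a single column regardless of cardinality.
mean_encoded = df["category"].map(df.groupby("category")["target"].mean())

print(one_hot.shape[1], "one-hot columns vs", mean_encoded.to_frame().shape[1], "target-mean column")
```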

Aside from One-Hot encoding, LeaveOneOut encoding is another go-to categorical encoding method that I highly recommend.

Bayesian encoders that are already implemented in the categorical-encoding library are: Target and LeaveOneOut. I’ve included code examples for both of them below:

# Replaces each specific category value with a weighted average of the dependent variable.
enc_targ = ce.Encoder(method='target')
fm_enc_targ = enc_targ.fit_transform(feature_matrix, features, feature_matrix['value'])
fm_enc_targ
    PRODUCT_ID_target  purchased  value  COUNTRYCODE_target
id
0            5.397343       True    0.0            9.970023
1            5.397343       True    5.0            9.970023
2            5.397343       True   10.0            9.970023
3           15.034704       True   15.0            9.970023
4           15.034704       True   20.0            9.970023
5            8.333333       True    0.0            8.333333
# Identical to target except leaves own row out when calculating average
enc_leave = ce.Encoder(method='leave_one_out')
fm_enc_leave = enc_leave.fit_transform(feature_matrix, features, feature_matrix['value'])
fm_enc_leave
    PRODUCT_ID_leave_one_out  purchased  value  COUNTRYCODE_leave_one_out
id
0                   7.500000       True    0.0                  12.500000
1                   5.000000       True    5.0                  11.250000
2                   2.500000       True   10.0                  10.000000
3                  20.000000       True   15.0                   8.750000
4                  15.000000       True   20.0                   7.500000
5                   8.333333       True    0.0                   8.333333

You may find that in certain situations you prefer a different implementation of an encoder currently in the library (e.g., my Target encoding implementation uses a specific weighting for its mean calculation). If you wish to make modifications, such as using a different weighting to calculate averages, feel free to create a new encoder class, make your changes, and call your new class with the Encoder API.
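As an example of what such a weighting can look like, the sketch below blends each category's mean with the global mean using a sigmoid weight on the category's row count. This is an assumption on my part rather than a description of the library's code: the function name target_encode, the parameters min_samples_leaf and smoothing, and the single-row fallback are all illustrative, chosen because they happen to reproduce the Target output shown above.

```python
import numpy as np
import pandas as pd

# The six rows of the feature matrix used throughout this article.
fm = pd.DataFrame(
    {
        "product_id": ["coke zero"] * 3 + ["car"] * 2 + ["toothpaste"],
        "countrycode": ["US"] * 5 + ["AL"],
        "value": [0.0, 5.0, 10.0, 15.0, 20.0, 0.0],
    }
)


def target_encode(df, column, target, min_samples_leaf=1, smoothing=1.0):
    """Blend each category's mean target with the global mean (the prior)."""
    prior = df[target].mean()
    stats = df.groupby(column)[target].agg(["mean", "count"])
    # Sigmoid weight: categories with more rows trust their own mean more.
    weight = 1.0 / (1.0 + np.exp(-(stats["count"] - min_samples_leaf) / smoothing))
    blended = prior * (1.0 - weight) + stats["mean"] * weight
    # Assumption: a category seen only once falls back to the prior entirely.
    blended[stats["count"] == 1] = prior
    return df[column].map(blended)


print(target_encode(fm, "product_id", "value"))
print(target_encode(fm, "countrycode", "value"))
```

Raising smoothing or min_samples_leaf pulls small categories closer to the global mean, which is the kind of knob you might want to change in your own encoder class.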

Alternative Encoders

A third group of encoders is the contrast encoders, which work by finding mathematical patterns among categories. You may consider common contrast encoding methods such as Helmert, Sum, Backward Difference, or Polynomial encoding.
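To give a feel for what a contrast encoder produces, here is a minimal sketch of Sum (deviation) coding written with NumPy and pandas. It is not part of the categorical-encoding library; the function name, column names, and category order are illustrative assumptions. Each of the first k-1 categories gets its own indicator column, and the last category is coded as -1 in every column, so each column sums to zero across categories.

```python
import numpy as np
import pandas as pd


def sum_contrast_matrix(n_categories):
    """Return the (n x n-1) sum-coding matrix: identity rows, then a row of -1s."""
    identity = np.eye(n_categories - 1)
    last_row = -np.ones((1, n_categories - 1))
    return np.vstack([identity, last_row])


categories = ["coke zero", "car", "toothpaste"]
contrasts = pd.DataFrame(
    sum_contrast_matrix(len(categories)),
    index=categories,
    columns=["sum_0", "sum_1"],
)

# Encode a column by looking up each value's row in the contrast matrix.
products = pd.Series(["coke zero", "car", "toothpaste", "car"], name="product_id")
encoded = contrasts.loc[products].reset_index(drop=True)

print(encoded)
```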

Other Bayesian encoders also exist, such as James-Stein, M-Estimator, and Weight of Evidence. Always feel free to experiment with encoding methods to find the one that best suits your data. However, if you need a good place to start, look at the default Classic and Bayesian encoders already implemented in the library. One-Hot and LeaveOneOut encoding are the most popular encoders for good reason.

Summary

Categorical encoding is an essential step in feature engineering and machine learning, but handling categorical data can be tricky and time-consuming. Utilize the above guide to determine how to start encoding smarter and also how to apply the categorical-encoding library for use in any Featuretools machine learning pipeline. Always feel free to experiment with multiple categorical encoding methods—the API offers the flexibility for you to tweak your own version of common encoders that best fit your specific dataset.

Ultimately, categorical-encoding makes it easier for you to handle categorical data within your machine learning pipeline through seamless integration with feature engineering as well as model creation, training, and application.

Give the categorical-encoding library a try, and leave any feedback on GitHub! We always welcome new encoding method implementations. For additional information and instructions, check out the README and guides.