How pandas get_dummies works in Python? Best example


When working with categorical data in Python, encoding categorical variables is a crucial step for machine learning. One of the easiest ways to perform one-hot encoding in pandas is by using the handy function get_dummies(). In this article, I’ll dive deep into how pandas.get_dummies() works in Python and provide some of the best examples to illustrate its practical use.

Understanding One-Hot Encoding

Before jumping into get_dummies(), let’s quickly understand what one-hot encoding is. Most machine learning models cannot directly process text values like “red,” “blue,” or “green,” so we need to convert them into numerical representations.

One-hot encoding transforms categorical variables into a binary matrix, where each unique category becomes a column with values of 0 or 1. This method ensures that no ordinal relationships are mistakenly introduced into the data.
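To make the idea concrete, here is a minimal sketch that builds the binary matrix by hand, without get_dummies(). The Series name and the "Color_" column prefix are chosen only for illustration:

```python
import pandas as pd

colors = pd.Series(['Red', 'Blue', 'Green', 'Blue', 'Red'], name='Color')

# One column per unique category: 1 where the row matches it, 0 otherwise.
manual = pd.DataFrame({
    f'Color_{category}': (colors == category).astype(int)
    for category in sorted(colors.unique())
})

print(manual)
```

Every row contains exactly one 1, which is where the name "one-hot" comes from.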

Using pandas.get_dummies()

The get_dummies() function in pandas allows us to quickly perform one-hot encoding on categorical data. Here’s a simple example:

import pandas as pd

# Sample DataFrame
df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# Applying get_dummies
encoded_df = pd.get_dummies(df)

print(encoded_df)

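This produces the following output (pandas 2.0 and later print True and False in place of 1 and 0, because the dummy columns are boolean by default):

   Color_Blue  Color_Green  Color_Red
0           0            0          1
1           1            0          0
2           0            1          0
3           1            0          0
4           0            0          1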

Each unique value in the “Color” column has been transformed into a separate column with binary indicators.
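If you want integer 0/1 indicators regardless of the pandas version, the dtype parameter of get_dummies() can be set explicitly. A small sketch using the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red']
})

# dtype=int forces integer indicators instead of the default dtype.
encoded_int = pd.get_dummies(df, dtype=int)

print(encoded_int)
print(encoded_int.dtypes)
```

Integer columns are often more convenient when the result is passed on to numerical libraries, although boolean columns work in most cases too.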

Handling Multiple Categorical Columns

If a DataFrame contains multiple categorical columns, get_dummies() encodes all of them by default:

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['S', 'M', 'L', 'M', 'S']
})

encoded_df = pd.get_dummies(df)
print(encoded_df)

The generated DataFrame will have binary indicators for both “Color” and “Size.”
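To see exactly which columns are created, it can help to list them. The sketch below repeats the two-column example and passes dtype=int only so the indicators are shown as 0/1:

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['S', 'M', 'L', 'M', 'S']
})

encoded_df = pd.get_dummies(df, dtype=int)

# Dummy columns are grouped by source column, with categories in sorted order.
print(encoded_df.columns.tolist())
# ['Color_Blue', 'Color_Green', 'Color_Red', 'Size_L', 'Size_M', 'Size_S']
```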

Using the prefix and prefix_sep Parameters

By default, pandas adds the column name as a prefix, separated by an underscore. We can modify this using the prefix and prefix_sep parameters:

encoded_df = pd.get_dummies(df, prefix=['C', 'S'], prefix_sep='-')
print(encoded_df)

The columns are now named “C-Blue,” “C-Green,” “C-Red,” “S-L,” “S-M,” and “S-S.”
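When prefix is a list, its order has to match the order of the encoded columns. The prefix parameter also accepts a dictionary that maps column names to prefixes, which makes the pairing explicit. A short sketch with the same sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['S', 'M', 'L', 'M', 'S']
})

# Map each source column to its prefix instead of relying on list order.
encoded_df = pd.get_dummies(
    df,
    prefix={'Color': 'C', 'Size': 'S'},
    prefix_sep='-'
)

print(encoded_df.columns.tolist())
# ['C-Blue', 'C-Green', 'C-Red', 'S-L', 'S-M', 'S-S']
```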

Dropping One Dummy Variable to Avoid Multicollinearity

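In linear models that include an intercept, such as linear or logistic regression, keeping every dummy column makes the columns perfectly collinear, because the dummies of one variable always sum to 1. To avoid this, we can set drop_first=True, which drops the first category of each encoded column: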

encoded_df = pd.get_dummies(df, drop_first=True)
print(encoded_df)

No information is lost: a row that belongs to the dropped category is the one where all remaining dummy columns for that variable are 0.
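The effect is easiest to see by comparing column names. In this sketch, "Blue" and "L" are the first categories in sorted order, so their columns are the ones that disappear (dtype=int is used only to show 0/1 values):

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['S', 'M', 'L', 'M', 'S']
})

reduced = pd.get_dummies(df, drop_first=True, dtype=int)

print(reduced.columns.tolist())
# ['Color_Green', 'Color_Red', 'Size_M', 'Size_S']

# Row 1 is a "Blue" item: both remaining Color columns are 0.
print(reduced.loc[1, ['Color_Green', 'Color_Red']].tolist())
# [0, 0]
```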

Handling NaN Values

By default, get_dummies() ignores missing values: a row with NaN gets 0 in every dummy column. To give missing values their own indicator column, pass dummy_na=True. Alternatively, we can handle NaNs explicitly by filling them before encoding:

df = pd.DataFrame({'Color': ['Red', 'Blue', None, 'Green', 'Red']})
df.fillna('Unknown', inplace=True)

encoded_df = pd.get_dummies(df)
print(encoded_df)
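As an alternative to filling, get_dummies() has a dummy_na parameter that adds an indicator column for missing values. The sketch below compares the default behavior with dummy_na=True on the unfilled data. The Color_nan column name is generated by pandas, and dtype=int is used only to show 0/1 values:

```python
import pandas as pd

df = pd.DataFrame({'Color': ['Red', 'Blue', None, 'Green', 'Red']})

# Default: the missing value gets no column, so row 2 is all zeros.
default = pd.get_dummies(df, dtype=int)
print(default)

# dummy_na=True adds a separate indicator column for missing values.
with_na = pd.get_dummies(df, dummy_na=True, dtype=int)
print(with_na.columns.tolist())
# ['Color_Blue', 'Color_Green', 'Color_Red', 'Color_nan']
```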

Applying get_dummies() to Specific Columns

By default, get_dummies() transforms every column with an object, string, or category dtype and leaves numeric columns alone. We can specify which columns to encode by using the columns parameter:

encoded_df = pd.get_dummies(df, columns=['Color'])
print(encoded_df)

This ensures only the selected columns are transformed while keeping others unchanged.
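The difference is clearer with a DataFrame that has more than one column. In the sketch below, the Price column and its values are made up for illustration. Only Color is encoded, while Size stays as text and Price stays numeric:

```python
import pandas as pd

df = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Blue', 'Red'],
    'Size': ['S', 'M', 'L', 'M', 'S'],
    'Price': [10, 12, 15, 12, 10]  # hypothetical numeric column
})

# Encode only 'Color'; untouched columns come first in the result.
encoded_df = pd.get_dummies(df, columns=['Color'], dtype=int)

print(encoded_df.columns.tolist())
# ['Size', 'Price', 'Color_Blue', 'Color_Green', 'Color_Red']
```

The columns parameter is also the way to encode a numeric column that really represents categories, such as an integer code, since such columns are skipped by default.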

Conclusion

The pandas.get_dummies() function is an efficient way to perform one-hot encoding in Python. Whether you’re working with a single categorical column or multiple, it offers flexibility with options like drop_first and custom prefixes. Understanding how to use get_dummies() effectively will enhance your data preprocessing workflow, making it easier to feed data into machine learning models.
