
Handling missing data is an essential skill when working with pandas in Python. One of the most useful functions for filling in those troublesome NaN values is fillna(). In this article, I’ll walk you through how pandas fillna() works, with examples and explanations to help you use it effectively in your own data processing.
Understanding Missing Data in Pandas
Before diving into fillna(), it’s crucial to understand what missing values are in pandas. In most cases, pandas represents missing values using NaN (Not a Number), which stems from NumPy. A missing value can appear for various reasons, such as:
- Incomplete datasets
- Errors during data extraction
- Intentional placeholders
Let’s start by creating a simple DataFrame with missing values:
import pandas as pd
# Creating a sample dataset with NaN values
data = {'A': [1, 2, None, 4], 'B': [None, 2, 3, 4], 'C': [10, None, 30, 40]}
df = pd.DataFrame(data)
print(df)
This will output:
     A    B     C
0  1.0  NaN  10.0
1  2.0  2.0   NaN
2  NaN  3.0  30.0
3  4.0  4.0  40.0
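Before filling anything, it often helps to see how much data is actually missing. The following is a minimal sketch that rebuilds the sample DataFrame and counts its gaps with isna(); the variable names missing_mask, missing_per_column, and total_missing are illustrative only.

```python
import pandas as pd

# Same sample data as above; None becomes NaN in numeric columns
data = {'A': [1, 2, None, 4], 'B': [None, 2, 3, 4], 'C': [10, None, 30, 40]}
df = pd.DataFrame(data)

# isna() returns a boolean DataFrame marking each missing cell
missing_mask = df.isna()

# Summing the mask counts the missing values in each column
missing_per_column = missing_mask.sum()
print(missing_per_column)

# Summing again gives the total number of missing cells
total_missing = int(missing_per_column.sum())
print(total_missing)
```

Note that the columns come out as floats (1.0, 2.0, ...) because NaN is a floating-point value, so an integer column with gaps is stored as float64.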
Basic Usage of fillna()
The fillna() method replaces all occurrences of missing values with a specified value. Here’s the simplest example:
df_filled = df.fillna(0)
print(df_filled)
The result replaces all NaN values with 0:
     A    B     C
0  1.0  0.0  10.0
1  2.0  2.0   0.0
2  0.0  3.0  30.0
3  4.0  4.0  40.0
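One detail worth checking for yourself: by default fillna() returns a new DataFrame and leaves the original alone, which is why the result is assigned to df_filled. A small sketch that confirms this on the sample data (the remaining_* names are illustrative):

```python
import pandas as pd

data = {'A': [1, 2, None, 4], 'B': [None, 2, 3, 4], 'C': [10, None, 30, 40]}
df = pd.DataFrame(data)

# fillna() hands back a filled copy
df_filled = df.fillna(0)

# The original DataFrame still has its three NaN values
remaining_in_original = int(df.isna().sum().sum())

# The returned copy has none left
remaining_in_filled = int(df_filled.isna().sum().sum())

print(remaining_in_original, remaining_in_filled)
```

If you forget the assignment and just call df.fillna(0) on its own line, the filled result is discarded and df is unchanged.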
Filling Missing Values with Different Strategies
Filling with a Specific Value Per Column
Instead of filling all missing values with the same value, you can specify different values for each column using a dictionary:
df_filled = df.fillna({'A': 99, 'B': 50, 'C': 0})
print(df_filled)
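The dictionary does not have to cover every column. In the sketch below, only column A gets a fill value, so the gaps in B and C stay as they are; df_partial is just an illustrative name.

```python
import pandas as pd

data = {'A': [1, 2, None, 4], 'B': [None, 2, 3, 4], 'C': [10, None, 30, 40]}
df = pd.DataFrame(data)

# Only column 'A' is listed, so only its NaN is replaced
df_partial = df.fillna({'A': 99})
print(df_partial)

# Columns missing from the dictionary keep their NaN values
still_missing = df_partial.isna().sum()
print(still_missing)
```

This is handy when a sensible default exists for some columns but not for others.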
Forward and Backward Filling
You can propagate values forward (ffill) or backward (bfill) to fill missing data:
- ffill: Fills each missing value with the last known value above it.
- bfill: Fills each missing value with the next known value below it.
Older code passes method='ffill' or method='bfill' to fillna(), but that argument has been deprecated since pandas 2.1, so the dedicated methods are the safer choice:
df_ffill = df.ffill()
df_bfill = df.bfill()
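Forward filling has one catch: a gap in the very first row has nothing above it to copy, so it stays missing (and the same goes for a trailing gap with backward filling). The sketch below uses the DataFrame.ffill() and DataFrame.bfill() methods to show this on the sample data, then chains the two as one possible way to cover both ends; whether that is appropriate depends on your data.

```python
import pandas as pd

data = {'A': [1, 2, None, 4], 'B': [None, 2, 3, 4], 'C': [10, None, 30, 40]}
df = pd.DataFrame(data)

# Forward fill: each NaN takes the last known value above it
df_ffill = df.ffill()

# Column 'B' starts with NaN, so there is nothing to carry forward
leading_gap_remains = bool(pd.isna(df_ffill.loc[0, 'B']))
print(leading_gap_remains)

# Forward fill first, then backward fill whatever is left at the top
df_both = df.ffill().bfill()
print(df_both)
```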
Filling with Column Mean, Median, or Mode
Another smart way to handle missing values is by filling them with statistical measures such as the mean, median, or mode:
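# Fill each column's NaN values with that column's own mean or median
df_filled_mean = df.fillna(df.mean())
df_filled_median = df.fillna(df.median())
# Filling with mode (most frequent value); mode() can return several rows, so take the first
df_filled_mode = df.fillna(df.mode().iloc[0])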
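To see what these calls actually fill with, it can help to look at the statistics on their own. This sketch computes them for the sample DataFrame; the names means, medians, and modes are illustrative.

```python
import pandas as pd

data = {'A': [1, 2, None, 4], 'B': [None, 2, 3, 4], 'C': [10, None, 30, 40]}
df = pd.DataFrame(data)

# Each of these is a Series indexed by column name; NaN is skipped by default
means = df.mean()
medians = df.median()

# mode() returns a DataFrame because a column can have several modes;
# .iloc[0] keeps the first row, i.e. the smallest mode of each column
modes = df.mode().iloc[0]

print(means)
print(medians)
print(modes)

# fillna() matches the Series index against the column names
df_filled_mean = df.fillna(means)
```

In this tiny dataset every value occurs exactly once, so every value ties as a mode and .iloc[0] simply picks the smallest one in each column. On real data, check that the mode is meaningful before filling with it.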
Replacing Only a Specific Subset of Data
To replace missing values only within a specific row or column range, use slicing:
df.iloc[1:3, 0] = df.iloc[1:3, 0].fillna(55)
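The same idea works with column names instead of positions. The sketch below applies the row-range fill from above to a copy of the sample data, then fills only columns A and B on a second copy; df_rows and df_cols are illustrative names, and the copies are there only to keep the original df intact.

```python
import pandas as pd

data = {'A': [1, 2, None, 4], 'B': [None, 2, 3, 4], 'C': [10, None, 30, 40]}
df = pd.DataFrame(data)

# Positional slice: rows 1 and 2 of the first column only
df_rows = df.copy()
df_rows.iloc[1:3, 0] = df_rows.iloc[1:3, 0].fillna(55)

# Label-based subset: fill columns 'A' and 'B', leave 'C' untouched
df_cols = df.copy()
df_cols[['A', 'B']] = df_cols[['A', 'B']].fillna(0)

print(df_rows)
print(df_cols)
```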
Performance Considerations
When working with large datasets, performance matters. Let’s examine how efficient fillna() is:
| Method | Time Complexity | Best Use Case |
|---|---|---|
| Constant value | O(n) | Quick and simple replacements |
| Mean, median | O(n) | Numeric data with missing values |
| ffill, bfill | O(n) | Time-series or sequential data |
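Complexity classes only tell part of the story, so it is worth timing the strategies on data shaped like your own. Here is a rough sketch, assuming a single numeric column with about 10% of values missing; the row count, the seed, and the time_call helper are all made up for illustration, and the timings will vary by machine and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Hypothetical test data: one float column, roughly 10% NaN
rng = np.random.default_rng(0)
n_rows = 200_000
values = rng.random(n_rows)
values[rng.random(n_rows) < 0.1] = np.nan
big = pd.DataFrame({'x': values})


def time_call(label, func):
    """Run func once, print the elapsed wall-clock time, and return its result."""
    start = time.perf_counter()
    result = func()
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.4f} s")
    return result


filled_const = time_call('constant', lambda: big.fillna(0))
filled_mean = time_call('mean', lambda: big.fillna(big.mean()))
filled_ffill = time_call('ffill', lambda: big.ffill())
```

A single run like this is noisy; for more reliable numbers, repeat each call several times (for example with the timeit module) and compare the best or median time.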
Final Thoughts
Now you know exactly how pandas fillna() works in Python and the best examples to use in real-world scenarios. Whether you want to use a constant value, statistical methods, or forward/backward filling, pandas gives you the flexibility to handle missing data effortlessly. The best approach depends on your dataset and the problem you’re solving, so choose wisely!