How does pandas interpolate() work in Python? Best examples

When working with data in Python, we often encounter missing values. Whether you’re dealing with a time-series dataset or a simple table with gaps, filling in those missing values is crucial for accurate analysis. This is where the pandas interpolate() method comes into play. In this article, I’ll take a deep dive into how interpolation works in pandas, with examples to help you understand its potential.

What is `pandas.interpolate()`?

Interpolation is a mathematical technique used to estimate missing data points within a range of known values. In pandas, interpolate() is a method on Series and DataFrame objects (there is no top-level pandas.interpolate() function; the name is used here as shorthand). It fills NaN values using one of several strategies selected with the method argument.

Basic Usage of `pandas.interpolate()`

The simplest way to use interpolate() is by calling it on a pandas Series with missing values. Let’s take a look at a basic example:

import pandas as pd

# Creating a sample Series with missing values
data = pd.Series([1, 2, None, 4, None, 6])

# Applying interpolation
interpolated_data = data.interpolate()

print(interpolated_data)

Output:


0    1.0
1    2.0
2    3.0
3    4.0
4    5.0
5    6.0
dtype: float64

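By default, pandas uses linear interpolation, filling missing values by drawing a straight line between the known values on either side of a gap. The default method treats rows as equally spaced and ignores the index labels. Note that the result is float64, because NaN cannot be stored in an integer Series.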
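To make the "straight line" idea concrete, here is a minimal sketch that compares the default method with method="index" on an unevenly spaced numeric index, and checks the latter against the straight-line formula computed by hand. The Series and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd

# One gap between two known points, on an unevenly spaced numeric index
s = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# Default "linear": rows are treated as equally spaced, so the gap gets the midpoint
by_position = s.interpolate()

# "index": the numeric index values are used as x-coordinates
by_index = s.interpolate(method="index")

# Straight line through (x0, y0) and (x1, y1), evaluated at x, computed by hand
x0, y0, x1, y1, x = 0, 0.0, 10, 10.0, 1
by_hand = y0 + (x - x0) * (y1 - y0) / (x1 - x0)

print(by_position.loc[1])  # 5.0
print(by_index.loc[1])     # 1.0
print(by_hand)             # 1.0
```

If your index carries real positions (depth, distance, seconds), method="index" is usually closer to what you mean than the default.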

Interpolation Methods

Pandas supports multiple interpolation methods. Let’s explore the most useful ones:

Linear (default): Fills each gap along a straight line between its neighbors, treating rows as equally spaced and ignoring the index.
Polynomial & Spline: Fit a polynomial or spline through the known values. Both require the order argument and rely on SciPy.
Nearest: Replaces each missing value with the known value closest to it in the index. Relies on SciPy.
Pchip & Akima: Piecewise-cubic techniques that are particularly useful for non-linear data. Both rely on SciPy.
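A quick way to get a feel for the methods listed above is to run several of them on the same Series. The sketch below is only an illustration: the sample values are made up, and the try/except assumes that pandas raises ImportError when SciPy is missing, in which case the SciPy-backed methods are skipped.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 6.0])

results = {}
for method in ["linear", "nearest", "pchip"]:
    try:
        results[method] = s.interpolate(method=method)
    except ImportError:
        # "nearest" and "pchip" are delegated to SciPy; skip them if it is missing
        results[method] = None

for method, filled in results.items():
    if filled is None:
        print(f"{method}: skipped (SciPy not installed)")
    else:
        print(f"{method}: {filled.tolist()}")
```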

Using Different Interpolation Methods

Let’s see some examples of different interpolation techniques in action.

1. Polynomial Interpolation

When dealing with non-linear datasets, polynomial interpolation can be helpful. It requires the order argument and needs SciPy to be installed:

import numpy as np

data = pd.Series([1, np.nan, np.nan, 4, np.nan, 6])

# Polynomial interpolation
interpolated_data = data.interpolate(method="polynomial", order=2)

print(interpolated_data)

2. Nearest Interpolation

Nearest interpolation replaces each missing value with the closest available value in the index. It also relies on SciPy, and reuses the data Series from the previous example:

interpolated_data = data.interpolate(method="nearest")
print(interpolated_data)

3. Spline Interpolation

Spline interpolation is useful for smooth curve estimation. Like polynomial interpolation, it requires the order argument and SciPy:

interpolated_data = data.interpolate(method="spline", order=2)
print(interpolated_data)

Using `pandas.interpolate()` with a DataFrame

Interpolation can also be applied to DataFrames. By default, each column is interpolated independently, going down the rows. Here’s how:

df = pd.DataFrame({
    'A': [1, 2, None, 4, None, 6],
    'B': [None, 3, 4, None, 6, 8]
})

df_interpolated = df.interpolate()
print(df_interpolated)

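Output Table (the first value in column B stays NaN because, by default, interpolation only fills forward from the first known value):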

Index	A	B
0	1.0	NaN
1	2.0	3.0
2	3.0	4.0
3	4.0	5.0
4	5.0	6.0
5	6.0	8.0
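If the leading NaN in column B is a problem, the limit_direction parameter is worth a look. The sketch below reuses the same DataFrame and is only an illustration. Note that linear interpolation does not extrapolate, so a leading gap is filled with the first known value rather than with a projected one.

```python
import pandas as pd

df = pd.DataFrame({
    "A": [1, 2, None, 4, None, 6],
    "B": [None, 3, 4, None, 6, 8],
})

# Default direction is forward, so the leading NaN in B is left alone
forward_only = df.interpolate()

# limit_direction="both" also fills leading gaps; with the linear method
# the gap takes the first known value in the column (3.0 here)
both_ways = df.interpolate(limit_direction="both")

print(forward_only["B"].tolist())
print(both_ways["B"].tolist())
```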

Interpolation in Time Series Data

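For time-series data with a DatetimeIndex, method="time" weights each estimate by the actual time elapsed between observations rather than by row position. With evenly spaced daily data, as in this example, the result is the same as linear interpolation: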

date_rng = pd.date_range(start='2023-01-01', periods=6, freq='D')
time_series = pd.Series([1, np.nan, np.nan, 4, np.nan, 6], index=date_rng)

# Time-based interpolation
time_series_interpolated = time_series.interpolate(method="time")
print(time_series_interpolated)
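The difference between "linear" and "time" only shows up when the timestamps are unevenly spaced. Here is a small sketch with made-up dates and values: the gap sits one day after the first reading and three days before the next.

```python
import numpy as np
import pandas as pd

# Irregular timestamps: a one-day step, then a three-day jump
idx = pd.to_datetime(["2023-01-01", "2023-01-02", "2023-01-05"])
ts = pd.Series([1.0, np.nan, 5.0], index=idx)

linear_fill = ts.interpolate()             # ignores the dates, takes the midpoint
time_fill = ts.interpolate(method="time")  # weights by elapsed time

print(round(linear_fill.iloc[1], 6))  # 3.0
print(round(time_fill.iloc[1], 6))    # 2.0
```

The time-based estimate is 2.0 because 2023-01-02 is one quarter of the way from the first timestamp to the last, so the value moves one quarter of the way from 1 to 5.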

Key Takeaways

Using pandas.interpolate() is a powerful approach to handle missing values in Python. Here’s a quick summary of what we learned:

Interpolation helps estimate missing values based on known data points.
The default method is linear, but others like polynomial, spline, and nearest exist (these rely on SciPy).
It works with both Series and DataFrame objects.
It’s particularly useful in time-series data analysis.

Mastering pandas.interpolate() will enhance your data cleaning skills significantly. Whether filling gaps in a dataset or smoothing a time-series graph, it’s an essential tool in any Python data scientist’s toolkit.
