
If you work with data in Python, you’ve probably used the pandas library. And if you’ve had to aggregate data, there’s a very useful method that can make your life easier: pandas.DataFrame.agg(). But how exactly does it work? Let’s dive in.
What is pandas agg() in Python?
The agg() method in pandas applies one or more aggregation functions to a DataFrame or Series in a single call, which makes it a powerful tool for summarizing data.
Basic Usage of pandas agg()
Let’s start with a simple example. Suppose we have a DataFrame that contains sales data:
import pandas as pd

# Sample data
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 300, 400]
}
df = pd.DataFrame(data)
Now, let’s apply the agg() method to the “Sales” column:
result = df['Sales'].agg(['sum', 'mean', 'max'])
print(result)
Output:
sum     1400.000000
mean     233.333333
max      400.000000
dtype: float64
As you can see, agg() has computed several aggregations in one call and returned them as a Series labelled by function name.
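What agg() returns depends on what you pass in. As a small sketch (it reuses the sample Sales values from above; the variable names are only illustrative), a single function name gives back a scalar, while a list gives back a Series indexed by function name:

```python
import pandas as pd

sales = pd.Series([100, 200, 150, 250, 300, 400], name="Sales")

# A single function name returns one scalar value
total = sales.agg("sum")

# A list of names returns a Series labelled by function name
summary = sales.agg(["sum", "mean", "max"])

print(total)
print(summary.index.tolist())
```

This matters when you feed the result into later code: a list with one entry still returns a Series, not a scalar.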
Using Custom Functions with agg()
You can also pass custom functions to agg(). Here’s an example:
def range_func(x):
    # Spread between the largest and smallest value
    return x.max() - x.min()

result = df['Sales'].agg(['sum', 'mean', range_func])
print(result)
Output:
sum           1400.000000
mean           233.333333
range_func     300.000000
dtype: float64
In this case, we defined a function that calculates the range (max - min) and applied it using agg(). The function’s name becomes its label in the result.
Applying agg() on Multiple Columns
You can also use agg() on a DataFrame to apply different aggregation functions to different columns.
# Create a more complex DataFrame
data = {
    'Category': ['A', 'B', 'A', 'B', 'A', 'B'],
    'Sales': [100, 200, 150, 250, 300, 400],
    'Profit': [20, 50, 30, 80, 90, 120]
}
df = pd.DataFrame(data)

# Apply different functions to different columns
result = df.agg({
    'Sales': ['sum', 'mean'],
    'Profit': ['min', 'max']
})
print(result)
Output:
            Sales  Profit
sum   1400.000000     NaN
mean   233.333333     NaN
min           NaN    20.0
max           NaN   120.0
Here, we computed the sum and mean for the “Sales” column and the min and max for the “Profit” column. The NaN entries mark function and column combinations that were not requested.
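If you only need one statistic per column, a variation worth knowing is to map each column to a single function name instead of a list. The sketch below uses the same Sales and Profit values; the name totals is just illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [100, 200, 150, 250, 300, 400],
    "Profit": [20, 50, 30, 80, 90, 120],
})

# One function per column gives a Series with one value per column,
# so there is no NaN padding to deal with
totals = df.agg({"Sales": "sum", "Profit": "max"})
print(totals)
```

The result is a Series indexed by column name rather than a DataFrame indexed by function name.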
Using agg() with GroupBy
One of the best use cases for agg() is when working with groupby(). Let’s say we want to aggregate sales and profit data by category.
grouped = df.groupby('Category').agg({
    'Sales': ['sum', 'mean'],
    'Profit': ['sum', 'min', 'max']
})
print(grouped)
Output:
         Sales             Profit
           sum        mean    sum min  max
Category
A          550  183.333333    140  20   90
B          850  283.333333    250  50  120
This neatly summarizes sales and profit values per category, saving you from writing multiple aggregation steps manually.
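The dict form above produces two-level column labels. If flat, self-describing column names are easier to work with, pandas also supports named aggregation, where each keyword argument is an output column and its value is a (column, function) pair. A minimal sketch, with output names (total_sales, avg_sales, max_profit) chosen purely for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["A", "B", "A", "B", "A", "B"],
    "Sales": [100, 200, 150, 250, 300, 400],
    "Profit": [20, 50, 30, 80, 90, 120],
})

# Each keyword becomes an output column: name=(source column, function)
summary = df.groupby("Category").agg(
    total_sales=("Sales", "sum"),
    avg_sales=("Sales", "mean"),
    max_profit=("Profit", "max"),
)
print(summary)
```

Which form to prefer is a matter of taste: the dict form is shorter, while named aggregation avoids flattening the column labels afterwards.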
Comparing agg() to apply()
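Sometimes people confuse agg() with apply(). The difference is:
apply() passes each column (or each row, with axis=1) to a function and returns whatever that function produces, whether a scalar, a Series, or something else.
agg() applies one or more aggregation functions column-wise and accepts them by name, as a list, or as a per-column dict.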
For example, using apply() for the same task:
df[['Sales', 'Profit']].apply(lambda x: x.sum())
This sums both columns, but agg() is the more direct way to request several aggregations at once.
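To make the contrast concrete, here is a small sketch using the same Sales and Profit values. The share-of-total calculation is only an illustration of a function that does not reduce a column to one value:

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [100, 200, 150, 250, 300, 400],
    "Profit": [20, 50, 30, 80, 90, 120],
})

# agg() with a list reduces each column to one row per function
reduced = df.agg(["sum", "mean"])

# apply() with a non-reducing function keeps the original shape:
# here each value becomes its share of the column total
shares = df.apply(lambda col: col / col.sum())

print(reduced.shape)
print(shares.shape)
```

Reach for agg() when you want summary statistics, and for apply() when the function’s output is not a simple reduction.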
Performance Considerations
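Using agg() is mainly a convenience: instead of calling sum(), mean(), and max() separately and combining the results yourself, you get one labelled result from a single call. Each function is still computed in turn, so don’t expect a large speedup over the separate calls. What does matter is how you specify the functions: built-in names passed as strings (such as 'sum' or 'mean') let pandas use its optimized implementations, while custom Python functions are generally slower, especially with groupby().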
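Rather than rely on rules of thumb, it is easy to measure on your own data. The following is a rough timing sketch, not a benchmark: the synthetic data, the group count, and the helper names by_name and by_lambda are all assumptions made for illustration, and absolute numbers will vary by machine and pandas version.

```python
import timeit

import pandas as pd

# Synthetic data: 20,000 rows spread over 500 groups (illustrative sizes)
n_groups = 500
df = pd.DataFrame({
    "Category": [i % n_groups for i in range(20_000)],
    "Sales": [float(i % 97) for i in range(20_000)],
})
grouped = df.groupby("Category")["Sales"]


def by_name():
    # Built-in aggregation requested by name
    return grouped.agg("sum")


def by_lambda():
    # Same aggregation expressed as a custom Python function
    return grouped.agg(lambda s: s.sum())


t_name = timeit.timeit(by_name, number=5)
t_lambda = timeit.timeit(by_lambda, number=5)
print(f"'sum' by name: {t_name:.4f}s, lambda: {t_lambda:.4f}s")
```

Both calls return the same values, so any difference in the printed timings comes from how the aggregation is dispatched.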
Conclusion
The pandas.DataFrame.agg() method is a powerful tool in Python for summarizing and analyzing data. Whether you’re applying it to a single column, multiple columns, or grouped data, it helps you express aggregation tasks concisely.