
When working with data in pandas, one of the most powerful and frequently used tools is the groupby()
function. It allows us to group our data based on a given column (or multiple columns) and perform aggregate functions like sum, mean, count, and more. In this article, I will walk you through how pandas.groupby()
works in Python with the best examples.
Understanding pandas groupby in Python
At its core, groupby()
splits the data into groups based on a specific column (or columns) and then allows us to apply various operations on those groups. This technique follows the “split-apply-combine” approach:
- Split: The data is divided into groups based on a specific feature.
- Apply: A function is applied to each group (e.g., sum, mean, count, etc.).
- Combine: The results from each group are combined into a new DataFrame.
Basic example of pandas groupby()
Let’s start with a simple DataFrame and see how groupby works with an aggregate function like sum()
.
import pandas as pd
# Sample data
data = {
'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
'Values': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)
# Group by 'Category' and sum the 'Values'
grouped = df.groupby('Category')['Values'].sum()
print(grouped)
Output:
Category
A 90
B 60
C 60
Name: Values, dtype: int64
Here, pandas grouped the data by the Category
column and calculated the sum for each category.
Applying multiple aggregation functions
We can use multiple aggregation functions at once using the agg()
method.
# Applying multiple functions: sum and mean
grouped = df.groupby('Category')['Values'].agg(['sum', 'mean'])
print(grouped)
The output will look like this:
Category | Sum | Mean |
---|---|---|
A | 90 | 30.0 |
B | 60 | 30.0 |
C | 60 | 60.0 |
Grouping by multiple columns
We can also group by multiple columns if needed. Let’s modify our dataset to include another column:
# Sample data with an additional column
data = {
'Category': ['A', 'B', 'A', 'B', 'A', 'C'],
'SubCategory': ['X', 'X', 'Y', 'Y', 'X', 'Y'],
'Values': [10, 20, 30, 40, 50, 60]
}
df = pd.DataFrame(data)
# Group by multiple columns
grouped = df.groupby(['Category', 'SubCategory'])['Values'].sum()
print(grouped)
The output:
Category SubCategory
A X 60
Y 30
B X 20
Y 40
C Y 60
Name: Values, dtype: int64
Using groupby with transform()
The transform()
function helps us apply operations to grouped data while keeping the original structure.
# Calculate the mean per category but keep the original structure
df['MeanValue'] = df.groupby('Category')['Values'].transform('mean')
print(df)
Now, each row will have the mean value of its respective category.
Filtering groups based on conditions
Sometimes, we may want to filter groups based on a condition. For example, selecting only those groups where the sum of values is greater than 50.
# Filtering groups where sum of values is greater than 50
filtered = df.groupby('Category').filter(lambda x: x['Values'].sum() > 50)
print(filtered)
Common mistakes when using groupby()
While groupby()
is a powerful method, there are some common mistakes to avoid:
- Forgetting to use an aggregation function: Simply using
groupby()
without an aggregation function will not return useful results. - Using groupby without selecting a column: If no column is selected, the entire DataFrame is grouped, which may not be desirable.
- Incorrect use of reset_index(): When grouping data, the index changes. Using
.reset_index()
ensures the output remains a DataFrame.
Conclusion
In this article, I explained how pandas.groupby()
works in Python with different examples. Whether you’re doing basic aggregation, grouping by multiple columns, using transform()
, or filtering data, mastering groupby()
is essential for efficient data analysis.