
If you’ve ever worked with data in Python, you’ve probably come across pandas. One of its most essential functions for quick data analysis is describe(). But how exactly does it work, and why is it so useful? Let’s dive in!
Understanding DataFrame.describe()
The describe() function in pandas provides a quick summary of numerical (and sometimes categorical) data. When applied to a DataFrame, it generates statistics such as count, mean, standard deviation, min, max, and percentiles.
By default, it works on numerical columns, but it can also be used for other data types with the right arguments. Below is an example of how it works.
Basic Example of describe()
Let’s start by creating a simple DataFrame and applying describe() to see what we get.
import pandas as pd
# Creating a sample DataFrame
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000]
}
df = pd.DataFrame(data)
# Applying describe()
print(df.describe())
The output will look something like this:
Statistic | Age | Salary |
---|---|---|
count | 10.0 | 10.0 |
mean | 47.5 | 85000.0 |
std | 15.14 | 30276.50 |
min | 25.0 | 40000.0 |
25% | 36.25 | 62500.0 |
50% | 47.5 | 85000.0 |
75% | 58.75 | 107500.0 |
max | 70.0 | 130000.0 |
What Each Statistic Means
- count: Number of non-null values in each column.
- mean: The average value.
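- std: Sample standard deviation (pandas divides by n - 1), a measure of how spread out the values are.
- min: Minimum value.
- 25%, 50%, 75%: The 25th, 50th, and 75th percentiles, i.e. the three quartiles. The 50% row is the median.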
- max: Maximum value.
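To make the list above concrete, here is a minimal sketch that recomputes each row with the matching column-wise method and compares it with describe(). It reuses the article's sample data; the names `summary` and `manual` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    "Salary": [40000, 50000, 60000, 70000, 80000,
               90000, 100000, 110000, 120000, 130000],
})

summary = df.describe()

# Recompute every statistic with the corresponding DataFrame method.
manual = pd.DataFrame({
    "count": df.count(),        # non-null values per column
    "mean": df.mean(),
    "std": df.std(),            # sample standard deviation (ddof=1)
    "min": df.min(),
    "25%": df.quantile(0.25),   # linear interpolation between neighbours
    "50%": df.quantile(0.50),   # the median
    "75%": df.quantile(0.75),
    "max": df.max(),
}).T  # transpose so the statistics become rows, as in describe()

# Largest absolute difference between the two tables (expected to be ~0).
print((summary - manual).abs().max().max())
```

If the printed difference is effectively zero, describe() is doing nothing more than bundling these individual calls into one table.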
Using describe() on Categorical Data
If you have non-numeric data, you can still use the include parameter to generate useful summaries.
data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Department': ['HR', 'IT', 'IT', 'HR', 'IT', 'HR', 'IT', 'HR', 'IT', 'HR']
}
df_cat = pd.DataFrame(data)
# Using describe on categorical data
print(df_cat.describe(include='object'))
The output will look like this:
Statistic | Gender | Department |
---|---|---|
count | 10 | 10 |
unique | 2 | 2 |
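top | Male | IT |
freq | 5 | 5 |
Both columns in this sample are evenly split, with five of each value, so freq is 5 for both and the value reported as top is an arbitrary pick among the tied values. Here is what each row of the categorical summary means: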
- count: Number of non-null values.
- unique: Number of unique values.
- top: Most frequently occurring value.
- freq: Frequency of the most common value.
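These rows can be reproduced by hand, which helps when checking a surprising result. The sketch below assumes the same sample data as above and uses value_counts() to recover the figures for one column; the variable names `gender` and `counts` are illustrative only.

```python
import pandas as pd

df_cat = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male", "Male",
               "Female", "Male", "Female", "Male", "Female"],
    "Department": ["HR", "IT", "IT", "HR", "IT",
                   "HR", "IT", "HR", "IT", "HR"],
})

gender = df_cat["Gender"]
counts = gender.value_counts()  # group sizes, largest first

print(gender.count())    # non-null entries, the "count" row -> 10
print(gender.nunique())  # distinct values, the "unique" row -> 2
print(counts.iloc[0])    # size of the largest group, the "freq" row -> 5

# "top" is the label of the largest group. Male and Female are tied here,
# so which of the two is reported is not something to rely on.
print(counts.index[0])
```

When every column in a DataFrame is non-numeric, as here, calling describe() with no arguments also produces the categorical summary.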
Customizing describe() Output
You can also change which percentiles are displayed by using the percentiles parameter.
print(df.describe(percentiles=[0.1, 0.9]))
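This replaces the default 25% and 75% rows with 10% and 90%. Many pandas releases still include the 50% (median) row even when 0.5 is not in the list, so check the output on your installed version.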
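As a quick check, the following sketch prints the row labels produced by a custom percentiles list and compares two of the rows with Series.quantile(). It uses only the Age column from the earlier sample; whether a 50% row appears is left for you to observe, since that depends on the pandas version.

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]})

summary = df.describe(percentiles=[0.1, 0.9])

# Row labels of the summary; look for "10%" and "90%" in place of "25%" and "75%".
print(summary.index.tolist())

# Each percentile row matches Series.quantile with its default linear interpolation.
print(summary.loc["10%", "Age"], df["Age"].quantile(0.1))
print(summary.loc["90%", "Age"], df["Age"].quantile(0.9))
```

Percentiles are passed as fractions between 0 and 1, so 0.1 means the 10th percentile rather than 0.1%.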
Summary
The describe() function in pandas is incredibly useful for quickly summarizing data. Whether you’re working with numerical or categorical data, it provides a wealth of insights in just one line of code.
Key takeaways:
- By default, it works on numeric data.
- It provides statistics like mean, standard deviation, and percentiles.
- It can be customized using the include and percentiles parameters.
- It works for categorical data when explicitly specified.
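The takeaways above can be seen side by side on a DataFrame that mixes numeric and text columns. This is a small sketch with made-up data (`df_mixed` is a hypothetical frame, not from the earlier examples), comparing the default call with include='all'.

```python
import pandas as pd

# Hypothetical mixed-type data for illustration.
df_mixed = pd.DataFrame({
    "Age": [25, 30, 35, 40],
    "Department": ["HR", "IT", "IT", "IT"],
})

numeric_only = df_mixed.describe()             # default: numeric columns only
everything = df_mixed.describe(include="all")  # numeric and non-numeric together

print(numeric_only.columns.tolist())  # only the numeric column survives
print(everything.columns.tolist())    # both columns are summarized
print(everything)
```

With include='all', statistics that do not apply to a column (for example mean for Department, or unique for Age) show up as NaN in the combined table.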
If you want a quick overview of your dataset, describe() is one of the best tools available. Give it a try in your next data project!