How Does pandas describe() Work in Python? Best Example

If you’ve ever worked with data in Python, you’ve probably come across pandas. One of its most essential functions for quick data analysis is describe(). But how exactly does it work, and why is it so useful? Let’s dive in!

Understanding DataFrame.describe()

The describe() method in pandas returns a quick statistical summary of a DataFrame or Series. For numerical data it reports the count, mean, standard deviation, min, max, and percentiles. For non-numeric data it reports a different set of statistics, covered further down.

When a DataFrame mixes numeric and non-numeric columns, describe() summarizes only the numeric ones by default. The include parameter extends it to other data types. Below is an example of how it works.

Basic Example of describe()

Let’s start by creating a simple DataFrame and applying describe() to see what we get.

import pandas as pd

# Creating a sample DataFrame
data = {
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000]
}

df = pd.DataFrame(data)

# Applying describe()
print(df.describe())

The output will look something like this (values rounded to two decimals):

Statistic    Age      Salary
count        10.0     10.0
mean         47.5     85000.0
std          15.14    30276.50
min          25.0     40000.0
25%          36.25    62500.0
50%          47.5     85000.0
75%          58.75    107500.0
max          70.0     130000.0

What Each Statistic Means

  • count: Number of non-null values in each column.
  • mean: The arithmetic average of the values.
  • std: Sample standard deviation (divides by n - 1), measuring how spread out the values are.
  • min: Minimum value.
  • 25%, 50%, 75%: The 25th, 50th, and 75th percentiles, also known as the quartiles. The 50% row is the median.
  • max: Maximum value.
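To make these definitions concrete, here is a small sketch that recomputes the Age statistics one at a time with ordinary Series methods and places them next to the describe() result. It reuses the sample data from the basic example. The names summary, manual, and comparison are only illustrative.

```python
import pandas as pd

# Same sample data as in the basic example
df = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70],
    'Salary': [40000, 50000, 60000, 70000, 80000, 90000, 100000, 110000, 120000, 130000],
})

summary = df.describe()

# Recompute each Age statistic by hand
age = df['Age']
manual = pd.Series({
    'count': age.count(),        # non-null values only
    'mean': age.mean(),
    'std': age.std(),            # sample standard deviation (ddof=1 by default)
    'min': age.min(),
    '25%': age.quantile(0.25),   # linear interpolation between data points
    '50%': age.quantile(0.50),   # the median
    '75%': age.quantile(0.75),
    'max': age.max(),
})

# Side-by-side comparison of describe() and the manual values
comparison = pd.DataFrame({'describe': summary['Age'], 'manual': manual})
print(comparison)
```

The two columns should agree row by row, which shows that describe() is a convenient bundle of statistics you could also compute individually.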

Using describe() on Categorical Data

If you have non-numeric data, you can still use the include parameter to generate useful summaries.

data = {
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Department': ['HR', 'IT', 'IT', 'HR', 'IT', 'HR', 'IT', 'HR', 'IT', 'HR']
}

df_cat = pd.DataFrame(data)

# Using describe on categorical data
print(df_cat.describe(include='object'))

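The output will look something like this. Both columns are split 5 to 5 in this sample, so pandas picks one of the tied values for top:

Statistic    Gender    Department
count        10        10
unique       2         2
top          Male      HR
freq         5         5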

Here, we get a summary of categorical values:

  • count: Number of non-null values.
  • unique: Number of distinct values.
  • top: Most frequently occurring value.
  • freq: Number of times the top value occurs.
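As a rough cross-check, the same figures can be derived with count(), nunique(), and value_counts(). The sketch below rebuilds the sample frame and also lists every value tied for the highest count, since both columns in this data are evenly split. The variable names are only illustrative.

```python
import pandas as pd

df_cat = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Female'],
    'Department': ['HR', 'IT', 'IT', 'HR', 'IT', 'HR', 'IT', 'HR', 'IT', 'HR'],
})

for column in df_cat.columns:
    counts = df_cat[column].value_counts()    # occurrences of each value
    highest = counts.max()                    # this is what describe() reports as freq
    tied = sorted(counts[counts == highest].index)

    print(column)
    print('  count :', df_cat[column].count())
    print('  unique:', df_cat[column].nunique())
    print('  freq  :', highest)
    print('  values tied for top:', tied)
```

When more than one value shares the highest count, do not read too much into which one appears as top.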

Customizing describe() Output

You can also choose which percentiles are reported by using the percentiles parameter, which takes values between 0 and 1.

print(df.describe(percentiles=[0.1, 0.9]))

This replaces the default 25% and 75% rows with 10% and 90%. Depending on the pandas version, the 50% row (the median) may still appear even though it was not requested.
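A quick way to see exactly which rows you get is to print the index of the result. This is a minimal sketch using the Age column from the basic example. The variable name custom is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Age': [25, 30, 35, 40, 45, 50, 55, 60, 65, 70]})

# Ask for the 10th and 90th percentiles instead of the default quartiles
custom = df.describe(percentiles=[0.1, 0.9])
print(custom)

# The row labels show which percentiles were actually reported
print(list(custom.index))
```

Checking the labels this way avoids relying on assumptions about which percentiles a given pandas version adds on its own.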

Summary

The describe() method in pandas summarizes a dataset in one line of code. For numerical columns it reports the count, mean, spread, and percentiles. For categorical columns it reports the count, the number of unique values, and the most frequent value.

Key takeaways:

  1. By default, it summarizes only the numeric columns of a mixed DataFrame.
  2. It provides statistics like mean, standard deviation, and percentiles.
  3. It can be customized using the include and percentiles parameters.
  4. It summarizes categorical data when you request it through include.
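The takeaways above can be checked on a small mixed-type frame. The frame below is hypothetical and deliberately tiny. It compares the default call with include='all', which summarizes numeric and non-numeric columns together and fills statistics that do not apply with NaN.

```python
import pandas as pd

# Hypothetical mixed-type frame, made up for this illustration
df_mixed = pd.DataFrame({
    'Age': [25, 30, 35, 40],
    'Department': ['HR', 'IT', 'IT', 'IT'],
})

numeric_only = df_mixed.describe()              # default: numeric columns only
everything = df_mixed.describe(include='all')   # numeric and non-numeric together

print(numeric_only)
print(everything)
```

In the second result, rows such as mean are NaN for Department, and rows such as unique are NaN for Age.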

If you want a quick overview of your dataset, describe() is one of the best tools available. Give it a try in your next data project!
