
One of the most useful functions in the pandas
library is crosstab()
. It allows us to compute cross-tabulations, which are useful for summarizing categorical data, analyzing relationships between variables, and creating pivot tables. In this post, I’ll walk you through how pandas.crosstab()
works and provide the best examples to help you master it.
What is pandas.crosstab()?
The pandas.crosstab()
function is used to compute simple frequency tables or contingency tables. It’s particularly useful when analyzing categorical data, and it works similarly to pivot tables in Excel.
Here’s the basic syntax of pandas.crosstab()
:
import pandas as pd
pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', normalize=False, dropna=True, normalize_axis=None)
Now, let’s break down its parameters and see how to use them effectively.
Basic Example: Creating a Frequency Table
Suppose I have a dataset of employees containing their department and gender. I want to see how many employees belong to each department by gender.
import pandas as pd
data = {'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'Sales', 'IT'],
'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female']}
df = pd.DataFrame(data)
cross_tab = pd.crosstab(df['Department'], df['Gender'])
print(cross_tab)
This will output:
Gender | Female | Male |
---|---|---|
HR | 1 | 2 |
IT | 2 | 1 |
Sales | 1 | 1 |
This simple table shows how employees are distributed by gender across different departments.
Using the values
and aggfunc
Parameters
Sometimes, we don’t just want counts; we might want summed values from another column. Let’s say our dataset includes salaries, and we want to see the total salary distribution.
data = {'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'Sales', 'IT'],
'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female'],
'Salary': [5000, 6000, 5200, 5800, 5000, 4300, 4500, 6200]}
df = pd.DataFrame(data)
salary_distribution = pd.crosstab(df['Department'], df['Gender'], values=df['Salary'], aggfunc='sum')
print(salary_distribution)
This will result in a table where the values represent the total salary for each category.
Adding Margins (Row and Column Totals)
You can add row and column totals using the margins=True
option.
cross_tab_with_margins = pd.crosstab(df['Department'], df['Gender'], margins=True)
print(cross_tab_with_margins)
This will add an extra row and column labeled All
, showing the total counts for each category.
Normalizing Data (Percentage Calculations)
The normalize
parameter allows you to convert values into proportions instead of counts.
normalized_tab = pd.crosstab(df['Department'], df['Gender'], normalize=True)
print(normalized_tab)
This will return proportions instead of absolute counts, making it easier to interpret.
Multi-Index Cross Tabulation
Sometimes, you might want to analyze data based on multiple categorical variables at once. You can pass multiple columns as lists.
multi_tab = pd.crosstab([df['Department'], df['Gender']], df['Salary'])
print(multi_tab)
This allows us to create more detailed breakdowns of our dataset.
Conclusion
Now that you understand how pandas.crosstab()
works, you can efficiently analyze categorical data in Python. Whether you’re working with frequency distributions, aggregating values, or normalizing data, this function provides a flexible approach. Play around with the different parameters, and you’ll quickly see how powerful it is in real-world data analysis.