How pandas crosstab works in Python? Best example

How pandas crosstab works in Python? Best example
“`html

One of the most useful functions in the pandas library is crosstab(). It allows us to compute cross-tabulations, which are useful for summarizing categorical data, analyzing relationships between variables, and creating pivot tables. In this post, I’ll walk you through how pandas.crosstab() works and provide the best examples to help you master it.

What is pandas.crosstab()?

The pandas.crosstab() function is used to compute simple frequency tables or contingency tables. It’s particularly useful when analyzing categorical data, and it works similarly to pivot tables in Excel.

Here’s the basic syntax of pandas.crosstab():

import pandas as pd

pd.crosstab(index, columns, values=None, rownames=None, colnames=None, aggfunc=None, margins=False, margins_name='All', normalize=False, dropna=True, normalize_axis=None)

Now, let’s break down its parameters and see how to use them effectively.

Basic Example: Creating a Frequency Table

Suppose I have a dataset of employees containing their department and gender. I want to see how many employees belong to each department by gender.

import pandas as pd

data = {'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'Sales', 'IT'],
        'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female']}

df = pd.DataFrame(data)

cross_tab = pd.crosstab(df['Department'], df['Gender'])
print(cross_tab)

This will output:

Gender Female Male
HR 1 2
IT 2 1
Sales 1 1

This simple table shows how employees are distributed by gender across different departments.

Using the values and aggfunc Parameters

Sometimes, we don’t just want counts; we might want summed values from another column. Let’s say our dataset includes salaries, and we want to see the total salary distribution.

data = {'Department': ['HR', 'IT', 'HR', 'IT', 'HR', 'Sales', 'Sales', 'IT'],
        'Gender': ['Male', 'Female', 'Female', 'Male', 'Male', 'Female', 'Male', 'Female'],
        'Salary': [5000, 6000, 5200, 5800, 5000, 4300, 4500, 6200]}

df = pd.DataFrame(data)

salary_distribution = pd.crosstab(df['Department'], df['Gender'], values=df['Salary'], aggfunc='sum')
print(salary_distribution)

This will result in a table where the values represent the total salary for each category.

Adding Margins (Row and Column Totals)

You can add row and column totals using the margins=True option.

cross_tab_with_margins = pd.crosstab(df['Department'], df['Gender'], margins=True)
print(cross_tab_with_margins)

This will add an extra row and column labeled All, showing the total counts for each category.

Normalizing Data (Percentage Calculations)

The normalize parameter allows you to convert values into proportions instead of counts.

normalized_tab = pd.crosstab(df['Department'], df['Gender'], normalize=True)
print(normalized_tab)

This will return proportions instead of absolute counts, making it easier to interpret.

Multi-Index Cross Tabulation

Sometimes, you might want to analyze data based on multiple categorical variables at once. You can pass multiple columns as lists.

multi_tab = pd.crosstab([df['Department'], df['Gender']], df['Salary'])
print(multi_tab)

This allows us to create more detailed breakdowns of our dataset.

Conclusion

Now that you understand how pandas.crosstab() works, you can efficiently analyze categorical data in Python. Whether you’re working with frequency distributions, aggregating values, or normalizing data, this function provides a flexible approach. Play around with the different parameters, and you’ll quickly see how powerful it is in real-world data analysis.

“` Other interesting article: How pandas melt works in Python? Best example