How does pandas duplicated() work in Python? Best example


When working with data in Python, one of the most common tasks is handling duplicate values. The pandas DataFrame.duplicated() method helps identify duplicate rows in a DataFrame. But how exactly does it work? In this article, I’ll break it down step by step, with examples and best practices.

Understanding pandas.duplicated()

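The duplicated() method is called on a DataFrame, as in df.duplicated(); there is no top-level pandas.duplicated() function. It checks for duplicate rows and returns a boolean Series, indexed like the DataFrame, where:

  • True means that the row is a duplicate of an earlier row.
  • False means that the row is not flagged: it is either unique or the first occurrence of a repeated row.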

By default, duplicated() compares all columns when identifying duplicates, but you can restrict the comparison to specific columns with the subset parameter.

Basic Usage of pandas.duplicated()

Let’s start with a simple example:

import pandas as pd

# Creating a DataFrame
data = {'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
        'Age': [25, 30, 25, 40, 30]}

df = pd.DataFrame(data)

# Identifying duplicates
duplicates = df.duplicated()

print(df)
print(duplicates)

This will output:

    Name  Age
0  Alice   25
1    Bob   30
2  Alice   25
3  David   40
4    Bob   30

0    False
1    False
2     True
3    False
4     True
dtype: bool

As you can see, the second occurrences of “Alice, 25” and “Bob, 30” (rows 2 and 4) are marked as True, indicating that they duplicate earlier rows.
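Because the result is a boolean Series, it can be used directly as a mask. Here is a minimal sketch on the same sample data; the variable names dup_mask, dup_rows, and n_dups are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 25, 40, 30]})

# Boolean mask: True for rows that repeat an earlier row
dup_mask = df.duplicated()

# Keep only the rows flagged as duplicates (rows 2 and 4 here)
dup_rows = df[dup_mask]

# True counts as 1, so sum() gives the number of flagged rows
n_dups = int(dup_mask.sum())

print(dup_rows)
print(n_dups)
```

Counting the flagged rows first is a quick way to decide whether duplicates are worth a closer look before changing anything.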

Controlling Duplicate Detection with keep

The keep parameter decides which occurrence in each group of duplicates is left unmarked (False):

  • keep='first' (default) – Marks all duplicates after the first occurrence as True.
  • keep='last' – Marks all duplicates except the last occurrence as True.
  • keep=False – Marks all occurrences of duplicates as True.

Example of using keep=False:

duplicates_all = df.duplicated(keep=False)
print(duplicates_all)

Output:

0     True
1     True
2     True
3    False
4     True
dtype: bool

Checking Duplicates in Specific Columns

If you only want to check for duplicates in specific columns, pass a column name or a list of column names to subset:

name_duplicates = df.duplicated(subset=['Name'])
print(name_duplicates)

Output:

0    False
1    False
2     True
3    False
4     True
dtype: bool

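Here, only the Name column is compared, so the repeated “Alice” and “Bob” rows are flagged regardless of their Age values. With this sample data, the result happens to match the full-row check.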
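The difference between subset and the default becomes visible once another column varies. The sketch below adds a hypothetical City column, which is not part of the original example, and compares a full-row check with a check on Name and Age only.

```python
import pandas as pd

# 'City' is a made-up column added purely for illustration
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 25, 40, 30],
                   'City': ['Paris', 'Rome', 'Lyon', 'Oslo', 'Rome']})

# All columns: row 2 differs from row 0 in City, so it is not flagged
full_row = df.duplicated()

# Name and Age only: City is ignored, so row 2 is flagged again
name_age = df.duplicated(subset=['Name', 'Age'])

print(full_row.tolist())
print(name_age.tolist())
```

Which columns belong in subset depends on what identifies a record in your data, so treat the choice here as an assumption of the example.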

Comparison of Different keep Methods

Method         Description
keep='first'   The first occurrence is left unmarked; later repeats are marked True.
keep='last'    The last occurrence is left unmarked; earlier repeats are marked True.
keep=False     Every occurrence of a repeated row is marked True.

Removing Duplicates with drop_duplicates()

If you want to remove duplicates instead of just identifying them, use drop_duplicates():

df_no_duplicates = df.drop_duplicates()
print(df_no_duplicates)

This will return:

    Name  Age
0  Alice   25
1    Bob   30
3  David   40
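drop_duplicates() accepts the same subset and keep parameters as duplicated(), and with default settings it keeps exactly the rows that duplicated() leaves unmarked. A short sketch of that relationship, with illustrative variable names:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 25, 40, 30]})

# Two ways to keep the first occurrence of each row
deduped = df.drop_duplicates()
via_mask = df[~df.duplicated()]
print(deduped.equals(via_mask))

# Keep the last row for each Name instead of the first
last_per_name = df.drop_duplicates(subset=['Name'], keep='last')
print(last_per_name)

# Drop every row that has a duplicate, keeping only rows that occur once
only_unique = df.drop_duplicates(keep=False)
print(only_unique)
```

The original index labels are preserved in each result; call reset_index(drop=True) afterwards if you need a fresh 0-based index.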

Best Practices for Using pandas.duplicated()

  1. Inspect flagged rows before deleting them; unexpected duplicates may indicate data issues upstream.
  2. Use subset when only certain columns define what counts as a duplicate.
  3. Use keep=False when you want to see every row in a duplicate group, not just the repeats.
  4. Use duplicated() to review duplicates and drop_duplicates() to remove them.
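These practices can be combined into a small review step before cleaning. The helper below, report_duplicates, is a hypothetical function written for this article and not part of pandas; it is a sketch of one possible workflow.

```python
import pandas as pd


def report_duplicates(frame, subset=None):
    """Return (number of rows in duplicate groups, those rows sorted together)."""
    # keep=False flags every member of a duplicate group
    mask = frame.duplicated(subset=subset, keep=False)
    # Sort by the compared columns so matching rows sit next to each other
    sort_cols = list(subset) if subset is not None else list(frame.columns)
    groups = frame[mask].sort_values(by=sort_cols)
    return int(mask.sum()), groups


df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
                   'Age': [25, 30, 25, 40, 30]})

count, groups = report_duplicates(df)
print(count)
print(groups)

# After reviewing the groups, remove the repeats
cleaned = df.drop_duplicates()
print(cleaned)
```

Reviewing the grouped rows first makes it easier to tell genuine repeats from records that only look alike in the compared columns.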

Conclusion

The duplicated() method is an essential tool when dealing with data in Python. Whether you’re identifying duplicate entries, filtering them out, or understanding how they are distributed in your dataset, mastering this method will make your data processing workflow much more efficient.
