How pandas drop_duplicates works in Python: best examples

When working with data in Python, especially with the pandas library, it’s common to encounter duplicate rows in a DataFrame. Thankfully, the DataFrame.drop_duplicates() method makes it easy to remove them efficiently. In this article, I’ll walk you through how drop_duplicates() works, covering its parameters with practical examples.

Understanding pandas.drop_duplicates()

The drop_duplicates() method is used to remove duplicate rows from a DataFrame. By default, it keeps the first occurrence and removes the rest. It’s a powerful method when handling real-world datasets, where duplicate values can distort analysis.

Basic Syntax of drop_duplicates()

Here’s the basic syntax of the method:

DataFrame.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Now, let’s break down the key parameters:

  • subset: Specifies the columns to check for duplicates. If None (default), all columns are considered.
  • keep: Determines which duplicate to retain. Options:
    • 'first' (default) – keeps the first occurrence and drops the rest.
    • 'last' – keeps the last occurrence and removes earlier ones.
    • False – drops every occurrence of a duplicated row, keeping none of them.
  • inplace: If True, modifies the original DataFrame in place and returns None instead of a new DataFrame.
  • ignore_index: If True, resets the resulting index to 0, 1, …, n-1 after dropping duplicates.

Example: Removing Duplicates from a DataFrame

Let’s take a look at a simple example to see how drop_duplicates() works:

import pandas as pd

# Sample DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 40, 30],
    'City': ['New York', 'Chicago', 'New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)

# Dropping duplicate rows
df_unique = df.drop_duplicates()

print(df_unique)

Output:

    Name  Age         City
0  Alice   25     New York
1    Bob   30      Chicago
3  David   40  Los Angeles

As you can see, the duplicate rows have been removed, keeping only the first occurrences.

Using ‘subset’ to Drop Duplicates Based on Specific Columns

We can specify particular columns instead of checking all columns. Here’s an example:

# Dropping duplicates based only on the 'Name' column
df_unique_name = df.drop_duplicates(subset=['Name'])

print(df_unique_name)

Output:

    Name  Age         City
0  Alice   25     New York
1    Bob   30      Chicago
3  David   40  Los Angeles

Even though ‘Alice’ and ‘Bob’ appear multiple times, only the first occurrence of each name was kept, since we specified subset=['Name']. In this particular dataset the result happens to match the full-row deduplication, because each repeated name also has identical Age and City values; with subset, later rows would be dropped even if those other columns differed.
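
To make the effect of subset more visible, here is a minimal sketch. The DataFrame below is a hypothetical variant of the sample data in which the repeated name carries a different Age and City, so the rows are no longer full-row duplicates:

import pandas as pd

# Hypothetical variant: the second 'Alice' has a different Age and City
data2 = {
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 26],
    'City': ['New York', 'Chicago', 'Boston']
}

df2 = pd.DataFrame(data2)

# Full-row deduplication keeps all three rows: none are identical
print(df2.drop_duplicates())

# Deduplicating on 'Name' alone drops the second 'Alice' anyway
print(df2.drop_duplicates(subset=['Name']))

Here the subset call keeps only rows 0 and 1, even though row 2 differs in Age and City, because only the Name column is compared.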

Using ‘keep’ Parameter to Control Duplicate Removal

Let’s see the difference when using the 'last' option for the keep parameter.

# Keeping the last occurrence instead of the first
df_last = df.drop_duplicates(keep='last')

print(df_last)

Output:

    Name  Age         City
2  Alice   25     New York
3  David   40  Los Angeles
4    Bob   30      Chicago

Notice that the retained ‘Alice’ and ‘Bob’ rows are now the last occurrences from the original DataFrame (index labels 2 and 4); the surviving rows still appear in their original order.

Dropping All Duplicate Rows

If you want to remove every row that is duplicated, keeping only the rows that appear exactly once, set keep=False:

# Removing all duplicate occurrences
df_none = df.drop_duplicates(keep=False)

print(df_none)

Output:

    Name  Age         City
3  David   40  Los Angeles

All names that appeared more than once were completely removed, leaving only the unique records.

Using ‘inplace’ to Modify the Original DataFrame

By default, drop_duplicates() returns a new DataFrame instead of modifying the existing one. To change the original DataFrame in place, use inplace=True:

df.drop_duplicates(inplace=True)

After this operation, df itself has its duplicate rows removed. Note that with inplace=True the method returns None, so avoid writing df = df.drop_duplicates(inplace=True), which would replace df with None.
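
As a quick sanity check, here is a minimal sketch (reusing a shortened version of the sample data) showing that the mutation happens on df itself and that nothing useful is returned:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice'],
    'Age': [25, 30, 25],
    'City': ['New York', 'Chicago', 'New York']
})

result = df.drop_duplicates(inplace=True)

print(result)   # None: inplace=True returns nothing
print(len(df))  # 2: df itself lost the duplicate row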

Using ‘ignore_index’ to Reset the Index

When duplicates are removed, the surviving rows keep their original index labels, leaving gaps (notice the missing label 2 in the outputs above). If you want a fresh sequential index instead, set ignore_index=True:

df_reset = df.drop_duplicates(ignore_index=True)

This ensures the index is sequential after removing duplicates.
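
For completeness, here is a self-contained sketch with the same sample data; note the index now runs 0, 1, 2 instead of 0, 1, 3:

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'David', 'Bob'],
    'Age': [25, 30, 25, 40, 30],
    'City': ['New York', 'Chicago', 'New York', 'Los Angeles', 'Chicago']
})

print(df.drop_duplicates(ignore_index=True))

Output:

    Name  Age         City
0  Alice   25     New York
1    Bob   30      Chicago
2  David   40  Los Angeles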

Performance Considerations

Processing large datasets can be slow with drop_duplicates(), so here are some tips for optimization:

  • Use subset to limit the comparison to the columns that actually define a duplicate; a timing sketch follows this list.
  • Use inplace=True when you no longer need the original DataFrame; in current pandas this is mainly a convenience and does not reliably reduce memory use.
  • For very large datasets, consider drop-in replacements such as modin.pandas, which parallelize operations like this across cores.
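
As a rough illustration, here is a minimal timing sketch. The column names, the DataFrame size, and the choice of 'col0' as the deduplication key are made up for this example, and the absolute numbers will vary by machine:

import time

import numpy as np
import pandas as pd

# Hypothetical wide DataFrame: 1 million rows, 20 integer columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 100, size=(1_000_000, 20)),
                  columns=[f'col{i}' for i in range(20)])

start = time.perf_counter()
df.drop_duplicates()                 # compares every column of every row
print('all columns:', time.perf_counter() - start)

start = time.perf_counter()
df.drop_duplicates(subset=['col0'])  # compares a single column
print('subset only:', time.perf_counter() - start)

The subset call should generally finish faster, since pandas only has to hash one column instead of twenty.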

Summary of Key Parameters

Parameter      Description                                  Default
subset         Columns to check for duplicates              None (all columns)
keep           Which duplicate occurrence to retain         'first'
inplace        Modify the original DataFrame in place       False
ignore_index   Reset the index after dropping duplicates    False

Conclusion

Understanding how drop_duplicates() works in Python is crucial for efficient data preprocessing. Whether you’re working with small datasets or handling massive amounts of data, this method is a lifesaver. By leveraging different parameters like subset, keep, and inplace, you can refine your deduplication process based on your needs.
