How pandas unique works in Python? Best example

When working with data in Python, handling unique values is a common task. Thankfully, pandas provides a simple yet powerful function to extract unique values from a Series or Index: pandas.unique(). In this article, I’ll walk you through how this function works, where it’s useful, and provide an in-depth example to help solidify your understanding.

Understanding pandas.unique()

The pandas.unique() function is used to return the unique values from a given Series or Index object. It is widely used in data preprocessing, helping to eliminate duplicate values quickly and efficiently.

Here’s the basic syntax:

import pandas as pd

pd.unique(values)

The parameter values can be any of the following (a short example follows the list):

  • Pandas Series
  • Pandas Index
  • Numpy array
  • Python list

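To make that list concrete, here is a minimal sketch that calls pandas.unique() on each of these input types. The values and variable names are just illustrative, and the exact return type (NumPy array, Index, or pandas extension array) can vary with the input and the pandas version.

import numpy as np
import pandas as pd

# The same idea applied to each accepted input type
series = pd.Series([3, 1, 3, 2])
index = pd.Index(['x', 'y', 'x'])
array = np.array([1.5, 1.5, 2.5])
plain_list = ['a', 'b', 'a']

print(pd.unique(series))      # unique values 3, 1, 2 in first-appearance order
print(pd.unique(index))       # 'x', 'y' (an Index or array, depending on version)
print(pd.unique(array))       # 1.5, 2.5
print(pd.unique(plain_list))  # 'a', 'b'
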
How pandas.unique() Works

Internally, pandas.unique() scans the data once, using a hash table to keep only the first occurrence of each value and discard later duplicates. As a result, the order of first appearance is preserved in the output.
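As a quick illustration (a made-up Series just for this point), the result follows the order of first appearance rather than sorted order:

import pandas as pd

values = pd.Series(['banana', 'apple', 'banana', 'cherry', 'apple'])

print(pd.unique(values))
# ['banana' 'apple' 'cherry'] -- first-appearance order, not alphabetical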

Example: Finding Unique Values in a DataFrame Column

Let’s see a practical example of how pandas.unique() works in Python.

import pandas as pd

# Create a sample DataFrame
data = {'Category': ['A', 'B', 'A', 'C', 'B', 'D', 'C', 'E']}
df = pd.DataFrame(data)

# Find unique values in the 'Category' column
unique_values = pd.unique(df['Category'])

print(unique_values)

Output:

['A' 'B' 'C' 'D' 'E']

As you can see, pandas.unique() has returned an array of unique values, respecting the order in which they appeared in the original data.
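The same result is also available through the Series method unique(), which is equivalent for a single column; which form you use is largely a matter of preference.

# Equivalent call using the Series.unique() method on the same df
unique_values = df['Category'].unique()

print(unique_values)
# ['A' 'B' 'C' 'D' 'E']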

Performance Considerations

One of the key advantages of pandas.unique() is its efficiency. It is optimized for large datasets and is typically faster than building a Python set() when the input is a NumPy array or a pandas object.

To illustrate this, let’s compare pandas.unique() with Python’s set() in terms of performance:

import pandas as pd
import numpy as np
import time

# Generate a large dataset
large_data = np.random.randint(0, 10000, size=1000000)

# Measure performance of pandas.unique()
start = time.time()
pd.unique(large_data)
end = time.time()
print(f"pandas.unique() execution time: {end - start} seconds")

# Measure performance of set()
start = time.time()
set(large_data)
end = time.time()
print(f"set() execution time: {end - start} seconds")

For large datasets, pandas.unique() is typically faster because its hash-table logic runs in compiled code directly on the underlying NumPy array, whereas set() has to create a Python object for every element it inspects.
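If you want a measurement that is less sensitive to one-off noise than a single time.time() call, the standard-library timeit module can repeat each call; the sketch below assumes the large_data array from the snippet above is still in scope.

import timeit

# Run each call 5 times per measurement, repeat 3 times, keep the best run
pandas_time = min(timeit.repeat(lambda: pd.unique(large_data), number=5, repeat=3))
set_time = min(timeit.repeat(lambda: set(large_data), number=5, repeat=3))

print(f"pandas.unique(): {pandas_time:.3f} s for 5 calls")
print(f"set():           {set_time:.3f} s for 5 calls")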

Handling Different Data Types

pandas.unique() can handle various data types, including:

  • Integers
  • Floats
  • Strings
  • Boolean values
  • Datetime objects

Below is an example demonstrating how pandas.unique() operates on different data types:

df = pd.DataFrame({
    'integers': [1, 2, 2, 3, 4, 4, 5],
    'floats': [1.1, 2.2, 2.2, 3.3, 3.3, 3.3, 4.4],
    'strings': ['apple', 'banana', 'apple', 'orange', 'banana', 'apple', 'orange'],
    'booleans': [True, False, True, True, False, False, True]
})

print(pd.unique(df['integers']))
print(pd.unique(df['floats']))
print(pd.unique(df['strings']))
print(pd.unique(df['booleans']))

Each data type is handled independently, maintaining the order of first occurrence.
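The earlier list also mentions datetime objects, which the example above does not cover, so here is a small additional sketch; depending on your pandas version, the result may be a NumPy datetime64 array or a pandas DatetimeArray.

dates = pd.Series(pd.to_datetime([
    '2023-01-01', '2023-01-02', '2023-01-01', '2023-01-03'
]))

print(pd.unique(dates))
# Three unique timestamps in order of first appearance;
# the duplicated 2023-01-01 appears only once.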

Using pandas.unique() with Duplicates and NaN Values

pandas.unique() also works seamlessly with missing values (NaN). If the data contains NaN values, NaN is included in the unique results as well:

df = pd.DataFrame({'values': [1, 2, 2, 3, np.nan, 3, np.nan, 4]})
unique_vals = pd.unique(df['values'])

print(unique_vals)

Output:

[ 1.  2.  3. nan  4.]

Even though NaN appears twice in the data, it is kept only once in the result, just like any other repeated value.
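If you would rather exclude the missing value entirely, a common pattern is to drop NaN before taking the unique values; dropna() is a standard pandas Series method.

# Drop NaN first, then take the unique values of the same column
unique_non_missing = pd.unique(df['values'].dropna())

print(unique_non_missing)
# [1. 2. 3. 4.]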

Comparing pandas.unique() with drop_duplicates()

Sometimes, you might come across drop_duplicates(), which also helps in removing duplicate values in Pandas. However, there’s a key difference:

  • pandas.unique() works only on a single column or array.
  • drop_duplicates() can work on an entire DataFrame.

Example using drop_duplicates():

df = pd.DataFrame({'Category': ['A', 'B', 'A', 'C', 'B', 'D', 'C', 'E']})

# Using drop_duplicates instead of pandas.unique
unique_df = df.drop_duplicates()

print(unique_df)

While both methods remove duplicates, drop_duplicates() is more useful when working with structured data and considering multiple columns.
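To make the multi-column point concrete, here is a short sketch (with made-up data) using the subset parameter of drop_duplicates(), which lets you choose which columns count when deciding what is a duplicate:

df = pd.DataFrame({
    'Category': ['A', 'B', 'A', 'C'],
    'Value':    [1,   2,   1,   3]
})

# Rows are duplicates only if *all* columns match
print(df.drop_duplicates())

# Consider only the 'Category' column when looking for duplicates
print(df.drop_duplicates(subset=['Category']))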

Key Takeaways

  • pandas.unique() extracts unique values from a Pandas Series, Index, or NumPy array.
  • It maintains the order of appearance and is optimized for performance.
  • It handles NaN values efficiently.
  • It works well with various data types, including numbers, strings, booleans, and datetime objects.
  • For working on entire DataFrames, drop_duplicates() may be a better choice.

Understanding how to use pandas.unique() correctly will help you manage and analyze data effectively in Python.

Other interesting article: How pandas value_counts works in Python? Best example