How pandas nlargest works in Python? Best example

Understanding pandas.nlargest() in Python

When working with large datasets in Python, there are times when we need to extract the rows with the highest values from a DataFrame. That’s where nlargest() comes in. It is a DataFrame method that retrieves the n rows with the largest values in a given column. Let’s dive into how it works and explore the best way to use it.

What is pandas.nlargest()?

The nlargest() method in pandas is used to return the top n rows from a DataFrame based on the values of a specified column. It is particularly useful when dealing with numerical data where sorting and selecting the largest values is essential.

Basic Syntax of nlargest()

The syntax of nlargest() is straightforward. Here’s how you use it:

DataFrame.nlargest(n, columns, keep='first')

Let’s break down the parameters:

  • n – The number of rows to return.
  • columns – The column label, or list of column labels, to order by.
  • keep – Determines which rows are returned when several rows tie for the last place:
    • 'first' – Prioritizes the first occurrences (the default).
    • 'last' – Prioritizes the last occurrences.
    • 'all' – Keeps every row tied for the last place, even if that returns more than n rows.

Best Example of nlargest() in Action

Let’s see a simple example of how this method works in a real-world scenario.

import pandas as pd

# Sample data
data = {
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Monitor', 'Desktop'],
    'Sales': [500, 300, 900, 200, 700]
}

df = pd.DataFrame(data)

# Extract top 3 products with highest sales
top_sales = df.nlargest(3, 'Sales')

print(top_sales)

Output:

      Product  Sales
2  Smartphone    900
4     Desktop    700
0      Laptop    500

As we can see, the method returns the top 3 products with the highest sales in descending order.
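The numbers on the left of the output are the original row labels, which nlargest() preserves. The following is a minimal sketch, reusing the sample data above, of pulling out just the product names and renumbering the result. The variable names top_names and ranked are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Monitor', 'Desktop'],
    'Sales': [500, 300, 900, 200, 700]
})

top_sales = df.nlargest(3, 'Sales')

# The original row labels (2, 4, 0) are kept in the result
print(top_sales.index.tolist())

# Pull out just the product names, best seller first
top_names = top_sales['Product'].tolist()
print(top_names)

# Renumber the rows from 0 if the original labels are not needed
ranked = top_sales.reset_index(drop=True)
print(ranked)
```

Whether to keep the original labels depends on the use case: they are handy for looking the rows up again in df, while a fresh 0-based index reads more naturally as a ranking.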

Handling Multiple Columns with nlargest()

nlargest() also accepts a list of columns. If rows have the same value in the first column, the next column in the list is used to break the tie.

import pandas as pd

data = {
    'Product': ['Laptop', 'Tablet', 'Laptop', 'Monitor', 'Desktop'],
    'Sales': [500, 300, 500, 200, 700],
    'Ratings': [4.5, 4.8, 4.6, 4.3, 4.7]
}

df = pd.DataFrame(data)

# Extract top 3 products by Sales, then by Ratings
top_products = df.nlargest(3, ['Sales', 'Ratings'])

print(top_products)

Output:

   Product  Sales  Ratings
4  Desktop    700      4.7
2   Laptop    500      4.6
0   Laptop    500      4.5

When two Laptop entries have the same sales, the one with a higher rating appears first.
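In that output both Laptop rows fit within the top 3, so Ratings only affects their order. The second column matters more when the tie falls exactly at the cutoff. Here is a small sketch with the same data and n=2; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Laptop', 'Monitor', 'Desktop'],
    'Sales': [500, 300, 500, 200, 700],
    'Ratings': [4.5, 4.8, 4.6, 4.3, 4.7]
})

# Sales alone: the tie at 500 is resolved by position (keep='first'),
# so the Laptop row with label 0 takes the second slot
by_sales = df.nlargest(2, 'Sales')
print(by_sales)

# Sales, then Ratings: the higher-rated Laptop row (label 2) takes it instead
by_sales_then_ratings = df.nlargest(2, ['Sales', 'Ratings'])
print(by_sales_then_ratings)
```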

Performance Benefits of nlargest() vs Sorting

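One might wonder why not just sort the DataFrame and take the top n rows. The result is the same: the pandas documentation describes nlargest() as equivalent to sort_values(columns, ascending=False).head(n), but more performant. A full sort costs O(N log N) for N rows, whereas nlargest() does not need to order every row; it selects the n largest values and orders only those. The gain is most noticeable when n is small relative to the size of the DataFrame.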

# Less efficient sorting approach
df.sort_values(by='Sales', ascending=False).head(3)

# More efficient nlargest approach
df.nlargest(3, 'Sales')
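
Both calls select the same rows. The following is a quick check, as a sketch on randomly generated data (the frame name big, the row count, and the seed are arbitrary), that the two approaches agree, followed by a rough timing. The Sales values are made unique so that ties cannot cause the two results to differ.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data: 100,000 rows with unique Sales values (no ties)
rng = np.random.default_rng(0)
big = pd.DataFrame({'Sales': rng.permutation(100_000)})

sorted_top = big.sort_values(by='Sales', ascending=False).head(3)
nlargest_top = big.nlargest(3, 'Sales')

# Same rows, same order, same row labels
print(sorted_top.equals(nlargest_top))

# Rough timing only; the numbers depend on the machine and pandas version
t_sort = timeit.timeit(
    lambda: big.sort_values(by='Sales', ascending=False).head(3), number=20
)
t_top = timeit.timeit(lambda: big.nlargest(3, 'Sales'), number=20)
print(f'sort_values + head: {t_sort:.4f}s, nlargest: {t_top:.4f}s')
```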

Handling Duplicate Values with nlargest()

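nlargest() does not drop duplicate values. The keep parameter only matters when several rows tie for the last of the n places: by default (keep='first'), the rows that appear first in the DataFrame are chosen. We can change this behavior with the keep parameter.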

df.nlargest(3, 'Sales', keep='all')

With keep='all', every row tied with the nth largest value is returned, so the result can contain more than n rows. This is useful in situations where we don’t want to drop ties arbitrarily.
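
To make the three keep options concrete, here is a small sketch using the Sales figures from the multi-column example, where two rows tie at 500. Asking for the top 2 puts the tie exactly at the cutoff; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Laptop', 'Monitor', 'Desktop'],
    'Sales': [500, 300, 500, 200, 700]
})

# keep='first' (default): Desktop plus the earlier Laptop row (label 0)
first = df.nlargest(2, 'Sales', keep='first')
print(first)

# keep='last': Desktop plus the later Laptop row (label 2)
last = df.nlargest(2, 'Sales', keep='last')
print(last)

# keep='all': Desktop plus both tied Laptop rows, so 3 rows for n=2
all_ties = df.nlargest(2, 'Sales', keep='all')
print(all_ties)
```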

Table Summary of nlargest() Functionality

Functionality              Description
Extracting top n rows      Returns the n rows with the largest values in a column
Multiple column sorting    Resolves ties by comparing additional columns
Efficiency                 Typically faster than a full sort when n is small
Handling duplicates        keep controls which tied rows are returned

Final Thoughts

Understanding how nlargest() works in pandas is valuable for efficient data analysis. It extracts the rows with the highest values quickly, provides flexibility in handling ties, and is typically faster than sorting the whole DataFrame when only the top few rows are needed. Whether you’re working with sales figures, rankings, or scores, nlargest() is an excellent tool to keep in your pandas toolkit.

 
