
Understanding pandas.nlargest() in Python
When working with large datasets in Python, we often need to extract the rows with the highest values from a DataFrame. That’s where the pandas nlargest() method comes in. It retrieves the n rows with the largest values in a given column, without requiring us to sort the whole DataFrame first. Let’s dive into how it works and explore the best way to use it.
What is pandas.nlargest()?
The nlargest() method in pandas returns the top n rows of a DataFrame, ordered from largest to smallest by the values of a specified column. It is called on a DataFrame object (DataFrame.nlargest()) rather than on the pandas module itself, and it is particularly useful for numerical data where selecting the largest values is essential.
Basic Syntax of nlargest()
The syntax of nlargest() is straightforward. Here’s how you use it:
DataFrame.nlargest(n, columns, keep='first')
Let’s break down the parameters:
n – The number of rows to return.
columns – The column name, or list of column names, to order by.
keep – Determines how to handle ties among the values:
    'first' – Prioritizes the first occurrence of tied values (the default).
    'last' – Prioritizes the last occurrence.
    'all' – Keeps every row tied with the smallest selected value, even if that returns more than n rows.
Best Example of nlargest() in Action
Let’s see a simple example of how this method works in a real-world scenario.
import pandas as pd
# Sample data
data = {
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Monitor', 'Desktop'],
    'Sales': [500, 300, 900, 200, 700]
}
df = pd.DataFrame(data)
# Extract top 3 products with highest sales
top_sales = df.nlargest(3, 'Sales')
print(top_sales)
Output:
      Product  Sales
2  Smartphone    900
4     Desktop    700
0      Laptop    500
As we can see, the method returns the top 3 products with the highest sales in descending order.
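If only the values are needed rather than whole rows, a Series offers the same method. The short sketch below reuses the sample data from above; the variable names top_values and top_rows are purely illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Monitor', 'Desktop'],
    'Sales': [500, 300, 900, 200, 700],
})

# Series.nlargest() returns only the values, keeping the original index labels
top_values = df['Sales'].nlargest(3)
print(top_values)

# Those index labels can be used to look up the full rows again
top_rows = df.loc[top_values.index]
print(top_rows)
```

Note that the original index labels are preserved in both results, which makes it easy to trace each value back to its row.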
Handling Multiple Columns with nlargest()
We can also sort by multiple columns. If two rows have the same value in the primary column, the second column helps determine the order.
import pandas as pd
data = {
    'Product': ['Laptop', 'Tablet', 'Laptop', 'Monitor', 'Desktop'],
    'Sales': [500, 300, 500, 200, 700],
    'Ratings': [4.5, 4.8, 4.6, 4.3, 4.7]
}
df = pd.DataFrame(data)
# Extract top 3 products by Sales, then by Ratings
top_products = df.nlargest(3, ['Sales', 'Ratings'])
print(top_products)
Output:
   Product  Sales  Ratings
4  Desktop    700      4.7
2   Laptop    500      4.6
0   Laptop    500      4.5
When two Laptop entries have the same sales, the one with a higher rating appears first.
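The order of the column list matters, because the first column takes priority and later columns only break ties. As a small sketch using the same data, swapping the two columns selects a different set of rows (by_sales and by_ratings are illustrative names):

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Laptop', 'Monitor', 'Desktop'],
    'Sales': [500, 300, 500, 200, 700],
    'Ratings': [4.5, 4.8, 4.6, 4.3, 4.7],
})

# Sales first: Ratings is only used to order rows with equal Sales
by_sales = df.nlargest(3, ['Sales', 'Ratings'])

# Ratings first: the best-rated rows win, regardless of Sales
by_ratings = df.nlargest(3, ['Ratings', 'Sales'])

print(by_sales)
print(by_ratings)
```

With Ratings first, the Tablet (rating 4.8) leads the result even though its sales are lower than those of the first Laptop row, which no longer makes the top 3.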
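Performance Benefits of nlargest() vs Sorting
One might wonder why not just sort the DataFrame and take the top n rows. The answer is performance. A full sort of N rows takes O(N log N) time, whereas nlargest() only has to pick out the top n rows and order those, which is cheaper when n is much smaller than N. The pandas documentation describes nlargest() as equivalent to sort_values(columns, ascending=False).head(n) but more performant, so the difference matters most on large datasets.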
# Less efficient sorting approach
df.sort_values(by='Sales', ascending=False).head(3)
# More efficient nlargest approach
df.nlargest(3, 'Sales')
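To see how the two approaches compare on your own machine, a rough timing sketch like the one below can help. The DataFrame big, its size, and the number of repetitions are arbitrary assumptions for illustration, and the measured times will vary with hardware and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 200,000 random integer sales figures
rng = np.random.default_rng(0)
big = pd.DataFrame({'Sales': rng.integers(0, 1_000_000, size=200_000)})

by_sort = big.sort_values(by='Sales', ascending=False).head(3)
by_nlargest = big.nlargest(3, 'Sales')

# Both approaches select the same top values
assert by_sort['Sales'].tolist() == by_nlargest['Sales'].tolist()

# Time each approach over a few repetitions
sort_time = timeit.timeit(
    lambda: big.sort_values(by='Sales', ascending=False).head(3), number=5
)
nlargest_time = timeit.timeit(lambda: big.nlargest(3, 'Sales'), number=5)

print(f"sort_values + head: {sort_time:.4f} s")
print(f"nlargest:           {nlargest_time:.4f} s")
```

Measuring on data that resembles your real workload is the most reliable way to decide whether the switch is worthwhile.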
Handling Duplicate Values with nlargest()
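By default (keep='first'), when several rows tie for the last of the n places, nlargest() keeps the ones that appear first in the DataFrame. It does not remove duplicate values; it only decides which tied rows make the cut. We can change this behavior with the keep parameter.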
df.nlargest(3, 'Sales', keep='all')
This returns every row tied with the smallest selected value, so the result can contain more than 3 rows. It is useful in situations where we don’t want to drop ties arbitrarily.
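The sample data used so far has no ties at the cutoff, so the three keep options would all give the same result there. The sketch below uses a small hypothetical DataFrame, tied, in which three products share the same sales figure, to make the difference visible.

```python
import pandas as pd

# Hypothetical data: Laptop, Smartphone and Desktop are tied at 500
tied = pd.DataFrame({
    'Product': ['Laptop', 'Tablet', 'Smartphone', 'Monitor', 'Desktop'],
    'Sales': [500, 700, 500, 200, 500],
})

# Only one of the three tied rows fits into the top 2
first = tied.nlargest(2, 'Sales', keep='first')  # Tablet, then Laptop
last = tied.nlargest(2, 'Sales', keep='last')    # Tablet, then Desktop

# keep='all' returns every tied row, so we get 4 rows instead of 2
all_ties = tied.nlargest(2, 'Sales', keep='all')

print(first)
print(last)
print(all_ties)
```

Because keep='all' can return more rows than requested, it is worth checking the length of the result before assuming it contains exactly n rows.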
Table Summary of nlargest() Functionality
Functionality | Description |
---|---|
Extracting top n rows | Returns the n rows with the largest values in a column |
Multiple column sorting | Resolves ties by comparing additional columns |
Efficiency | Typically faster than a full sort when n is small relative to the dataset |
Handling duplicates | The keep parameter controls which tied rows are returned |
Final Thoughts
Understanding how pandas.nlargest() works in Python is crucial for efficient data analysis. It extracts the rows with the highest values quickly, provides flexibility in handling ties, and is typically faster than sorting the whole DataFrame when only a few rows are needed. Whether you’re working with sales figures, rankings, or scores, nlargest() is an excellent tool to keep in your pandas toolkit.