How pandas nsmallest works in Python? Best example

When working with data in Python, especially with large datasets, we often need to find the smallest values in a column without paying for a full sort. The nsmallest() method, available on both DataFrame and Series, retrieves the n smallest elements efficiently.

Understanding nsmallest()

The nsmallest() method in pandas returns the n rows of a DataFrame (or the n entries of a Series) with the smallest values, ordered by one or more columns. It gives the same result as sorting in ascending order and taking the first n rows, but it is usually faster when n is small relative to the number of rows.

Its syntax looks like this:

DataFrame.nsmallest(n, columns, keep='first')

Here’s what each parameter does:

  • n: The number of rows to return.
  • columns: A column label, or a list of labels, to order by. Each column must be numeric or datetime-like.
  • keep: Determines which rows are returned when several rows tie for the last of the n places:
    • 'first' (default): Prefers the tied rows that appear first in the DataFrame.
    • 'last': Prefers the tied rows that appear last.
    • 'all': Returns every tied row, even if the result then has more than n rows.
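
The same method exists on a Series, where there is no columns argument because there is only one set of values to rank. The sketch below uses illustrative product prices as a labeled Series; the variable names are assumptions chosen for this example.

```python
import pandas as pd

# Illustrative prices, indexed by product name
prices = pd.Series(
    [1200, 800, 400, 150, 250, 300],
    index=['Laptop', 'Phone', 'Tablet', 'Headphones', 'Smartwatch', 'Monitor'],
)

# Series.nsmallest takes n (and optionally keep), but no columns argument
cheapest = prices.nsmallest(3)

print(cheapest)
```

The result is itself a Series, sorted in ascending order and still carrying the original index labels.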

Best Example: Using nsmallest() in Python

Let’s consider a practical example where we have a dataset of products with prices, and we want to find the three cheapest products.

import pandas as pd

# Creating a sample DataFrame
data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Smartwatch', 'Monitor'],
    'Price': [1200, 800, 400, 150, 250, 300]
}

df = pd.DataFrame(data)

# Finding the 3 cheapest products
smallest_prices = df.nsmallest(3, 'Price')

print(smallest_prices)

The output will look like this (the left-hand column is the original row index, which nsmallest() preserves):

      Product  Price
3  Headphones    150
4  Smartwatch    250
5     Monitor    300

Why Use nsmallest() Instead of Sorting?

One might wonder what the advantage of nsmallest() is over simply sorting the DataFrame and taking the first n rows. The main benefits are:

  1. Performance: Instead of ordering every row, nsmallest() only has to identify the n smallest ones, which is typically faster when n is small compared to the size of the dataset.
  2. Memory Efficiency: sort_values() builds a fully reordered copy of the DataFrame before head() discards most of it, whereas nsmallest() returns only the n selected rows. This can reduce peak memory use on large datasets.

To illustrate this, sorting and slicing would look like this:

df.sort_values(by='Price').head(3)

While this works and returns the same rows, it is typically slower than nsmallest() on large DataFrames because it orders every row only to keep three of them.
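
A rough way to check both claims on your own machine is to run the two approaches on the same data and compare results and timings. The sketch below assumes a synthetic DataFrame of random prices; the row count and variable names are arbitrary choices for illustration, and the timings will vary with hardware and pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic data: 500,000 random prices (assumed size, chosen for illustration)
rng = np.random.default_rng(0)
big = pd.DataFrame({'Price': rng.random(500_000)})

start = time.perf_counter()
via_nsmallest = big.nsmallest(3, 'Price')
t_nsmallest = time.perf_counter() - start

start = time.perf_counter()
via_sort = big.sort_values(by='Price').head(3)
t_sort = time.perf_counter() - start

# Both approaches should select the same rows in the same order
print(via_nsmallest.equals(via_sort))
print(f"nsmallest: {t_nsmallest:.4f}s, sort_values + head: {t_sort:.4f}s")
```

A single timed run like this is only indicative; for a more reliable comparison, repeat the measurement several times (for example with the timeit module).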

Handling Duplicates with nsmallest()

If several rows share the same value and that tie falls at the cutoff of the n smallest elements, the keep parameter decides which of the tied rows are returned.

data = {
    'Product': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Monitor', 'Monitor'],
    'Price': [1200, 800, 400, 150, 300, 300]
}

df = pd.DataFrame(data)

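# n=2 with keep='all': the two Monitors tie for second place, so both are kept
smallest_prices = df.nsmallest(2, 'Price', keep='all')

print(smallest_prices)

Here, the result has three rows rather than two: “Headphones” (150) and both instances of “Monitor”, which tie at 300 for second place. With the default keep='first', only the first “Monitor” row would be returned.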
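
The keep option only changes the result when a tie straddles the cutoff. As a minimal sketch, the snippet below reuses the same product data with n=2, so that the two 300-priced rows compete for the last place, and compares all three settings. The variable names are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Product': ['Laptop', 'Phone', 'Tablet', 'Headphones', 'Monitor', 'Monitor'],
    'Price': [1200, 800, 400, 150, 300, 300],
})

# Rows 4 and 5 both cost 300 and tie for the second of two places
first = df.nsmallest(2, 'Price', keep='first')    # picks the earlier tied row
last = df.nsmallest(2, 'Price', keep='last')      # picks the later tied row
all_ties = df.nsmallest(2, 'Price', keep='all')   # keeps both tied rows

print(sorted(first.index))
print(sorted(last.index))
print(sorted(all_ties.index))
```

With n=3 on this data the three settings would return the same rows, because the tie no longer falls on the cutoff.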

Using nsmallest() with Multiple Columns

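If two or more rows have the same value in the first column, additional columns can be used to break the tie. Pass a list of column names in order of priority:

df.nsmallest(3, ['Price', 'Weight'])

This orders by ‘Price’ first and, for rows with equal prices, by ‘Weight’ (a hypothetical numeric column that is not part of the sample data above). Every column passed to nsmallest() must be numeric or datetime-like; a text column such as ‘Product’ raises a TypeError. For an alphabetical tie-break, use df.sort_values(['Price', 'Product']).head(3) instead.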
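
A minimal sketch of tie-breaking across columns follows. The 'Weight' column and the 'Speaker' row are hypothetical additions made for this illustration. The example also falls back to sort_values() for an alphabetical tie-break, since nsmallest() is expected to reject text columns with a TypeError.

```python
import pandas as pd

# 'Weight' is a hypothetical numeric column added for illustration
df = pd.DataFrame({
    'Product': ['Tablet', 'Headphones', 'Monitor', 'Speaker'],
    'Price': [400, 150, 300, 300],
    'Weight': [500, 250, 4000, 900],
})

# Monitor and Speaker tie on 'Price'; the smaller 'Weight' wins the last place
by_price_weight = df.nsmallest(2, ['Price', 'Weight'])
print(by_price_weight)

# A text column is not a valid ordering column for nsmallest()
try:
    df.nsmallest(2, ['Price', 'Product'])
except TypeError as exc:
    print('TypeError:', exc)

# Alphabetical tie-break: sort on both columns, then take the first rows
by_price_name = df.sort_values(['Price', 'Product']).head(2)
print(by_price_name)
```

The two results differ in the tied row: the weight-based tie-break picks the Speaker, while the alphabetical one picks the Monitor.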

Conclusion

The nsmallest() method is an efficient way to extract the rows with the smallest values from a DataFrame or Series without the overhead of a full sort. When only a few rows are needed from a large dataset, it is usually faster than sort_values().head(n), and the keep parameter gives explicit control over ties at the cutoff.
