
If you’re working with pandas in Python, filtering data frames efficiently is crucial. One of the best ways to do this is using the query()
method. It’s not only concise but also enhances readability. In this article, I’ll explain how pandas query works in Python with the best examples.
What is pandas.query()?
The query()
method in pandas allows for querying DataFrames and Series using a readable string-based syntax. Instead of using bracket notation or logical operators, you can filter data using a query language similar to SQL.
Basic Syntax of pandas.query()
The basic syntax of query()
is straightforward:
DataFrame.query(expression, inplace=False, engine='numexpr')
- expression – A string representing the filtering condition.
- inplace – If set to
True
, modifies the DataFrame in place. - engine – Specifies the computation engine, either
'numexpr'
(default) or'python'
.
Filtering Rows Using pandas.query()
Let’s start by creating a sample DataFrame and using query()
to filter it.
import pandas as pd
# Sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', 'David'],
'Age': [25, 30, 35, 40],
'Salary': [50000, 60000, 70000, 80000]}
df = pd.DataFrame(data)
# Filtering with query()
filtered_df = df.query('Age > 30')
print(filtered_df)
Output:
Name Age Salary
2 Charlie 35 70000
3 David 40 80000
Instead of using df[df['Age'] > 30]
, we achieve the same with a cleaner expression.
Using Multiple Conditions in pandas.query()
You can use logical operators like and
(&
) and or
(|
) within the query.
# Filtering with multiple conditions
filtered_df = df.query('Age > 25 & Salary < 80000')
print(filtered_df)
Output:
Name Age Salary
1 Bob 30 60000
2 Charlie 35 70000
Here, I filtered rows where Age is greater than 25 and Salary is less than 80000.
Working with String Columns in pandas.query()
If you have textual (string) data, you can query it using comparison operators.
# Filtering by name
filtered_df = df.query("Name == 'Alice'")
print(filtered_df)
Output:
Name Age Salary
0 Alice 25 50000
Querying Data Using Variables
Instead of hardcoding values, you can use variables inside query()
by prefixing them with @
.
# Variable defined
min_salary = 60000
# Query using variable
filtered_df = df.query("Salary >= @min_salary")
print(filtered_df)
Output:
Name Age Salary
1 Bob 30 60000
2 Charlie 35 70000
3 David 40 80000
Performance Optimization with query()
Using query()
is often faster than traditional boolean masking, especially for large DataFrames. This is because query()
internally uses numexpr
for evaluation, optimizing performance.
Pandas.query() vs. Traditional Filtering
Let's compare query()
with standard filtering:
Method | Code | Readability |
---|---|---|
Standard Filtering | df[df['Age'] > 30] |
Less readable |
Using query() | df.query('Age > 30') |
More readable |
When Should You Use pandas.query()?
While query()
is incredibly useful, it's not always the best choice. Here are some situations where it excels:
- You have complex conditions that would make bracket notation difficult to read.
- You want to improve speed, especially on large DataFrames.
- You prefer a SQL-like filtering approach.
However, if you're dealing with dynamically generated column names or expressions that require advanced Python functions, standard filtering might be a better choice.
Conclusion
Now you know how pandas query works in Python with the best examples. It's a powerful tool for filtering data efficiently and improving code readability. Try implementing query()
in your own projects, and you'll see how much cleaner and more intuitive your code becomes.