
When working with data in Python, one of the most common tasks is combining multiple datasets. Whether you’re dealing with CSV files, database tables, or API responses, pandas.merge()
allows you to efficiently merge, join, and integrate data. In this article, I’ll give you the best example of how pandas merge works in Python and cover everything you need to know to use it effectively.
Understanding pandas.merge()
The pandas.merge()
function is used to combine two DataFrames based on a common column or index. It operates similarly to SQL joins, offering different types of joins such as inner, outer, left, and right. By specifying the right parameters, you can control how data is merged to fit your use case.
Basic Syntax of pandas.merge()
The basic syntax of pandas.merge()
is as follows:
import pandas as pd
merged_df = pd.merge(df1, df2, how='inner', on='common_column')
Here’s what each parameter does:
- df1, df2: The two DataFrames you want to merge.
- how: Specifies the type of merge:
inner
– Returns only matching rows (default).outer
– Returns all rows from both DataFrames, filling in missing values with NaN.left
– Returns all rows from the first DataFrame and matching rows from the second.right
– Returns all rows from the second DataFrame and matching rows from the first.
- on: Specifies the key column(s) to merge on.
Best Example: Merging Two DataFrames
Let’s dive into an example of how pandas merge works in Python using the best approach.
import pandas as pd
# Creating the first DataFrame
data1 = {'ID': [1, 2, 3, 4], 'Name': ['Alice', 'Bob', 'Charlie', 'David'], 'Age': [25, 30, 35, 40]}
df1 = pd.DataFrame(data1)
# Creating the second DataFrame
data2 = {'ID': [3, 4, 5, 6], 'City': ['New York', 'San Francisco', 'Los Angeles', 'Chicago'], 'Salary': [70000, 80000, 90000, 100000]}
df2 = pd.DataFrame(data2)
# Merging the DataFrames on the 'ID' column
merged_df = pd.merge(df1, df2, on='ID', how='inner')
print(merged_df)
Understanding the Output
The result of this merge would be:
ID | Name | Age | City | Salary |
---|---|---|---|---|
3 | Charlie | 35 | New York | 70000 |
4 | David | 40 | San Francisco | 80000 |
Since we performed an inner join, only rows with matching ID
values in both DataFrames (3 and 4) appear in the output.
Other Merge Types: Left, Right, and Outer Joins
Here’s how you can perform different types of merges:
1. Left Join
In a left join, all rows from the left DataFrame are retained, while non-matching rows from the right DataFrame are replaced with NaN.
left_join_df = pd.merge(df1, df2, on='ID', how='left')
print(left_join_df)
2. Right Join
In a right join, all rows from the right DataFrame are retained, and non-matching rows from the left DataFrame are replaced with NaN.
right_join_df = pd.merge(df1, df2, on='ID', how='right')
print(right_join_df)
3. Outer Join
An outer join returns all rows from both DataFrames. Missing values are filled with NaN where no match is found.
outer_join_df = pd.merge(df1, df2, on='ID', how='outer')
print(outer_join_df)
Using Multiple Columns to Merge
Sometimes, merging on just one column isn’t enough. We can specify multiple columns for a more precise merge.
merged_df = pd.merge(df1, df2, on=['ID', 'Name'], how='inner')
Handling Column Name Conflicts
When both DataFrames contain columns with the same name (other than the key column), pandas.merge()
adds suffixes to avoid conflicts.
merged_df = pd.merge(df1, df2, on='ID', how='inner', suffixes=('_df1', '_df2'))
Conclusion
Understanding how pandas.merge()
works in Python is essential for efficient data analysis and manipulation. Whether you need an inner, outer, left, or right join, this function provides a flexible way to combine datasets. With the best example illustrated above, now you can confidently merge DataFrames and manage relational data with ease.