How pandas read_excel works in Python? Best example

If you’re working with Excel files in Python, you’ve probably come across the pandas.read_excel() function. It’s a powerful tool that allows you to import data from Excel spreadsheets into a pandas DataFrame with just a single line of code. In this article, I’ll walk you through how it works, how to use its various options, and some best practices to get the most out of it.

Understanding `pandas.read_excel()`

The pandas.read_excel() function is part of the pandas library and is used to read data from Excel files (both .xls and .xlsx formats). It simplifies the process of loading spreadsheet data into a DataFrame for further analysis. With numerous optional parameters, you can specify exactly how the data should be read.

Basic Usage of `pandas.read_excel()`

Let’s start with a simple example. Suppose you have an Excel file named data.xlsx containing the following table:

Name	Age	City
Alice	25	New York
Bob	30	Los Angeles
Charlie	35	Chicago

To read this file into a pandas DataFrame, use the following code:

import pandas as pd

# Read the Excel file
df = pd.read_excel("data.xlsx")

# Display the DataFrame
print(df)

This will output:

     Name  Age         City
0  Alice   25     New York
1    Bob   30  Los Angeles
2 Charlie   35     Chicago

Reading Specific Sheets

Excel files often contain multiple sheets. By default, pandas.read_excel() reads the first sheet, but you can specify a particular sheet using the sheet_name parameter.

df = pd.read_excel("data.xlsx", sheet_name="Sheet2")

You can also read multiple sheets at once by passing a list:

sheets = pd.read_excel("data.xlsx", sheet_name=["Sheet1", "Sheet2"])

This returns a dictionary where the keys are sheet names and the values are corresponding DataFrames.

Handling Headers and Index Columns

By default, pandas assumes that the first row of the Excel file contains column headers. If your dataset starts from a different row, you can specify it using the header parameter.

df = pd.read_excel("data.xlsx", header=2)

If you want to use a specific column as the DataFrame index, you can use the index_col parameter:

df = pd.read_excel("data.xlsx", index_col=0)

Selecting Specific Columns

Sometimes you don’t need all the columns from the Excel file. You can specify which columns to load using the usecols parameter.

df = pd.read_excel("data.xlsx", usecols=["Name", "Age"])

You can also use column indices. For example, to read only the first two columns:

df = pd.read_excel("data.xlsx", usecols="A:B")

Reading Excel Files More Efficiently

If you’re dealing with large Excel files, reading them efficiently is crucial. Here are a few ways to speed up the process:

Use nrows to read only a subset of rows: df = pd.read_excel("data.xlsx", nrows=1000)
Skip unnecessary rows using skiprows: df = pd.read_excel("data.xlsx", skiprows=3)
Use dtype to specify data types and reduce memory usage: df = pd.read_excel("data.xlsx", dtype={"Age": int})

Handling Missing Data

Missing values are common in spreadsheets. Pandas automatically detects them and fills them with NaN. However, you can customize this behavior using the na_values parameter.

df = pd.read_excel("data.xlsx", na_values=["N/A", "Missing"])

Writing Data Back to Excel

Once you’ve processed your data, you might want to save it back to Excel. Use to_excel() for this purpose.

df.to_excel("output.xlsx", index=False)

The index=False argument prevents pandas from writing the default index column to the Excel file.

Conclusion

The pandas.read_excel() function is a versatile tool for working with Excel data in Python. Whether you’re reading a simple spreadsheet or processing large amounts of data, understanding its parameters can greatly enhance your workflow. From selecting specific sheets to optimizing performance, these tricks should help you master Excel file handling in pandas.

Other interesting article:

How pandas read_csv works in Python? Best example