How Does a pandas DataFrame Work in Python? Best Example


When working with data in Python, one of the most powerful tools at our disposal is the Pandas DataFrame. It provides a flexible and efficient way to store, manipulate, and analyze structured data. In this article, I’ll break down how a pandas.DataFrame works, with clear explanations and practical examples.

What Is a Pandas DataFrame?

A DataFrame is a two-dimensional, labeled data structure in Pandas. Think of it like a table in a database or a spreadsheet in Excel. It consists of rows and columns, where each column has its own data type, so one column can hold text while another holds numbers.

Creating a Pandas DataFrame

A DataFrame can be created in multiple ways, but the most common methods include:

  • Using a dictionary of lists
  • Using a list of dictionaries
  • Reading data from an external source (CSV, Excel, SQL database)

Let’s start with a basic example:

import pandas as pd

# Creating a DataFrame using a dictionary
data = {
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago']
}

df = pd.DataFrame(data)
print(df)

Output:


      Name  Age         City
0    Alice   25     New York
1      Bob   30  Los Angeles
2  Charlie   35      Chicago
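
The list-of-dictionaries approach mentioned earlier can be sketched in the same way. This is a minimal illustration, not the only way to do it: the variable name records and the deliberately missing 'City' value are assumptions added here to show how Pandas fills in gaps.

```python
import pandas as pd

# Each dictionary is one row; the keys become column names.
records = [
    {'Name': 'Alice', 'Age': 25, 'City': 'New York'},
    {'Name': 'Bob', 'Age': 30, 'City': 'Los Angeles'},
    {'Name': 'Charlie', 'Age': 35},  # 'City' left out on purpose (illustrative)
]

df_records = pd.DataFrame(records)
print(df_records)  # the missing 'City' value appears as NaN
```

This form is handy when rows arrive one at a time, for example from an API that returns JSON objects.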

Understanding DataFrame Structure

A DataFrame has three main components:

  1. Index: The row labels (default is numeric but can be customized).
  2. Columns: The column labels that define different data attributes.
  3. Data: The actual values.

We can explore these properties using:

print(df.index)   # Displays the index
print(df.columns) # Displays column names
print(df.to_numpy())  # Displays the data as a NumPy array (df.values also works)
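
To see that the index really can be customized, here is a small sketch that rebuilds a similar table with string row labels. The labels 'alice', 'bob', and 'charlie' are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    {'Age': [25, 30, 35], 'City': ['New York', 'Los Angeles', 'Chicago']},
    index=['alice', 'bob', 'charlie'],  # custom row labels (illustrative)
)

print(df.index.tolist())    # the row labels
print(df.columns.tolist())  # the column labels
print(df.shape)             # (number of rows, number of columns)
print(df.dtypes)            # each column reports its own data type
```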

Reading Data from External Files

One of the greatest strengths of Pandas is the ability to read data from various formats:

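df_csv = pd.read_csv('data.csv')  # Reads a CSV file
df_excel = pd.read_excel('data.xlsx')  # Reads an Excel file (needs an engine such as openpyxl installed)
# 'connection' is assumed to be an open database connection (for example from sqlite3 or SQLAlchemy)
df_sql = pd.read_sql('SELECT * FROM users', connection)  # Reads data from a database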
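
The calls above depend on files and a database that may not exist on your machine. As a self-contained sketch, the example below feeds read_csv an in-memory string instead of a file path; the CSV content is invented purely for illustration.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file such as 'data.csv'.
csv_text = """Name,Age,City
Alice,25,New York
Bob,30,Los Angeles
"""

# read_csv accepts any file-like object, not only a path.
df_csv = pd.read_csv(io.StringIO(csv_text))
print(df_csv.head())  # first rows of the table
print(df_csv.dtypes)  # 'Age' is parsed as an integer column
```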

Accessing and Modifying Data

We can access specific columns, rows, or even single values using different methods:

  • Accessing a single column:
print(df['Name'])
  • Accessing multiple columns:
print(df[['Name', 'Age']])
  • Accessing rows using loc and iloc:
print(df.loc[1])  # Access by label
print(df.iloc[2]) # Access by position
  • Modifying values:
df.at[1, 'Age'] = 31  # Change the age of the second row
df.loc[df['Name'] == 'Alice', 'City'] = 'San Francisco'  # Modify based on condition
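
With the default numeric index, loc and iloc often return the same row, which hides the difference between them. The sketch below uses a made-up index of 10, 20, 30 so that labels and positions no longer coincide.

```python
import pandas as pd

df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 35]},
    index=[10, 20, 30],  # hypothetical non-default row labels
)

by_label = df.loc[20]     # the row whose label is 20
by_position = df.iloc[0]  # the first row, whatever its label is

print(by_label['Name'])
print(by_position['Name'])
```

Here df.loc[0] would raise a KeyError because no row carries the label 0, while df.iloc[0] still works.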

Filtering Data

Filtering a DataFrame helps us focus on specific records that match our conditions:

filtered_df = df[df['Age'] > 30]  # Gets rows where Age > 30
print(filtered_df)
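
Filters can also combine several conditions. The following sketch rebuilds the sample data and shows two common patterns; the variable names are chosen here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})

# Wrap each condition in parentheses; use & for "and", | for "or".
over_25_not_chicago = df[(df['Age'] > 25) & (df['City'] != 'Chicago')]
print(over_25_not_chicago)

# isin() keeps rows whose value appears in a list of allowed values.
in_selected_cities = df[df['City'].isin(['New York', 'Chicago'])]
print(in_selected_cities)
```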

Sorting Data

The sort_values() method sorts rows by one or more columns:

sorted_df = df.sort_values(by=['Age'], ascending=False)
print(sorted_df)
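
Sorting by more than one column works the same way. In this sketch the data is invented so that ages repeat, which lets the second column break the ties.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie', 'Dana'],  # illustrative data
    'Age': [30, 25, 30, 25],
})

# Oldest first; within the same age, names in alphabetical order.
sorted_df = df.sort_values(by=['Age', 'Name'], ascending=[False, True])

# reset_index(drop=True) renumbers the rows after sorting.
sorted_df = sorted_df.reset_index(drop=True)
print(sorted_df)
```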

Basic DataFrame Operations

When working with numerical data, Pandas provides a range of useful functions:

print(df.describe())  # Summary statistics
print(df.mean(numeric_only=True))  # Mean values of numerical columns
print(df['Age'].sum())  # Total age sum
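
When only a few statistics are needed, agg() can compute them in one call. The 'Salary' column below is a hypothetical addition, included only so the table has more than one numeric column.

```python
import pandas as pd

df = pd.DataFrame({
    'Age': [25, 30, 35],
    'Salary': [50000, 60000, 70000],  # hypothetical column for illustration
})

# One row per statistic, one column per numeric column.
summary = df.agg(['min', 'max', 'mean'])
print(summary)
```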

Handling Missing Data

Real-world datasets often contain missing values. We can handle them as follows:

  • Remove rows with missing values:
df.dropna(inplace=True)
  • Or, instead of dropping rows, fill missing values with a default value:
df.fillna(value="Unknown", inplace=True)  # Applies to every column, including numeric ones
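
Before dropping or filling anything, it usually helps to count what is missing, and to pick a sensible default per column. The sketch below assumes a small table with gaps invented for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Age': [25, np.nan, 35],              # illustrative missing number
    'City': ['New York', None, 'Chicago'],  # illustrative missing text
})

# Count missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# A dictionary gives each column its own default and returns a new DataFrame.
filled = df.fillna({'Age': df['Age'].mean(), 'City': 'Unknown'})
print(filled)
```

Passing a dictionary keeps the numeric column numeric, which a single string default would not.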

Grouping Data

Grouping is essential for aggregations. The groupby() method is useful when analyzing subsets of data:

grouped_df = df.groupby('City')['Age'].mean()
print(grouped_df)
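
In the three-row sample every city appears once, so the group means equal the original ages. The sketch below uses made-up data with repeated cities and asks for several aggregates at once.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'Chicago', 'New York', 'Chicago'],  # illustrative data
    'Age': [25, 35, 31, 45],
})

# One row per city, one column per aggregate.
stats = df.groupby('City')['Age'].agg(['mean', 'count', 'max'])
print(stats)
```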

Merging and Concatenating DataFrames

We can combine multiple datasets using different techniques:

  • Concatenation:
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
merged_df = pd.concat([df1, df2], ignore_index=True)  # ignore_index avoids duplicate row labels 0, 1, 0, 1
print(merged_df)
  • Merging based on a common key:
df1 = pd.DataFrame({'ID': [1, 2], 'Name': ['Alice', 'Bob']})
df2 = pd.DataFrame({'ID': [1, 2], 'Age': [25, 30]})
merged_df = pd.merge(df1, df2, on='ID')
print(merged_df)
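
By default, merge keeps only the keys found in both tables. The how parameter changes that, as this sketch shows with invented IDs that only partly overlap.

```python
import pandas as pd

left = pd.DataFrame({'ID': [1, 2, 3], 'Name': ['Alice', 'Bob', 'Charlie']})
right = pd.DataFrame({'ID': [1, 2, 4], 'Age': [25, 30, 40]})  # illustrative data

inner = pd.merge(left, right, on='ID')                  # default how='inner'
left_join = pd.merge(left, right, on='ID', how='left')  # keep every row of left
outer = pd.merge(left, right, on='ID', how='outer')     # keep every ID from both

print(inner)
print(left_join)  # Charlie has no match, so Age is NaN
print(outer)
```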

Pivot Tables

A pivot table summarizes one column by grouping on the values of other columns and applying an aggregation function:

pivot_df = df.pivot_table(values='Age', index='City', aggfunc='mean')
print(pivot_df)
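
A pivot table becomes more useful once a second grouping dimension is spread across the columns. The 'Department' column in this sketch is hypothetical and exists only to illustrate the columns argument.

```python
import pandas as pd

df = pd.DataFrame({
    'City': ['New York', 'New York', 'Chicago', 'Chicago'],
    'Department': ['Sales', 'IT', 'Sales', 'IT'],  # hypothetical column
    'Age': [25, 31, 35, 45],
})

# Rows are cities, columns are departments, cells hold the mean age.
pivot_df = df.pivot_table(
    values='Age', index='City', columns='Department', aggfunc='mean'
)
print(pivot_df)
```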

Exporting Data to Files

After processing the data, we might need to save it:

df.to_csv('output.csv', index=False)  # Export to CSV
df.to_excel('output.xlsx', index=False)  # Export to Excel
df.to_json('output.json')  # Export to JSON
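
To check that an export can be read back, a round trip is a quick test. This sketch writes to an in-memory buffer rather than a real file, so it leaves nothing on disk; the buffer is an assumption made for the example.

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# Write the CSV into memory instead of 'output.csv'.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)

# Read it back and compare with the original.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored.equals(df))
```

Leaving out index=False would write the row labels as an extra unnamed column, which then reappears when the file is read back.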

Conclusion

Understanding how a pandas.DataFrame works is a core skill for working with tabular data in Python. Once you are comfortable loading, selecting, filtering, grouping, merging, and exporting data, most everyday data-handling tasks come down to a few lines of code.
