How pandas memory_usage works in Python? Best example

When working with large datasets in Python using pandas, understanding memory usage is crucial for performance optimization. The memory_usage() method provides insight into how much memory the different parts of a DataFrame or Series occupy. In this article, I’ll explain how this method works, how to use it effectively, and share the best example to maximize its utility.

Understanding pandas.memory_usage()

The memory_usage() method in pandas allows us to check the memory consumption of a DataFrame or Series. This function is useful when trying to optimize data structures, detect memory bottlenecks, or reduce the overall footprint of a dataset in RAM.

By default, memory_usage() returns the memory usage of each column in bytes. The method can be applied to both DataFrame and Series objects.

Basic Syntax


DataFrame.memory_usage(index=True, deep=False)

Where:

  • index (bool, default True): If set to True, the memory usage of the DataFrame index is included. If False, the index size is omitted.
  • deep (bool, default False): If set to True, object dtypes (such as strings) are introspected so that the actual memory consumed by the underlying Python objects is reported, giving a more accurate estimate.
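
The same method also works on a Series. Here is a minimal sketch (the fruit values are just for illustration):


import pandas as pd

s = pd.Series(['apple', 'banana', 'cherry'])

# Shallow estimate: the index plus one 8-byte reference per element
print(s.memory_usage(deep=False))

# Deep estimate: additionally counts the string objects themselves
print(s.memory_usage(deep=True))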

Best Example: Using memory_usage() in Practice

Let’s take a simple DataFrame and see how memory_usage() works.


import pandas as pd

# Creating a sample DataFrame
data = {
    'int_column': [1, 2, 3, 4, 5],
    'float_column': [1.1, 2.2, 3.3, 4.4, 5.5],
    'string_column': ['A', 'B', 'C', 'D', 'E'],
}

df = pd.DataFrame(data)

# Checking memory usage
print(df.memory_usage(deep=True))

Output:


Index            128
int_column        40
float_column      40
string_column    325
dtype: int64

Breaking it down:

  • The Index (the default RangeIndex) takes 128 bytes.
  • Each integer and float column uses 40 bytes: 5 rows × 8 bytes per int64/float64 value.
  • The string column takes significantly more memory because, with deep=True, each Python string object is measured, not just the 8-byte reference to it.
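
The 40-byte figures follow directly from the dtypes: 5 rows times 8 bytes per int64 or float64 value. A quick check using the df built above:


# Confirm the column dtypes (int64, float64, object)
print(df.dtypes)

# Raw buffer sizes: 5 values * 8 bytes = 40 for each numeric column
print(df['int_column'].nbytes)
print(df['float_column'].nbytes)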

Why Use deep=True?

If we omit deep=True, the reported memory for object (string) columns is underestimated, because pandas only counts the 8-byte references stored in the column, not the string objects they point to.

Compare the difference:


print(df.memory_usage(deep=False))

Output:


Index            128
int_column        40
float_column      40
string_column     40
dtype: int64

Without deep=True, the string column reports minimal memory usage, which is misleading.
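
The gap exists because every small Python string carries its own per-object overhead. A quick way to see it (a minimal sketch; the exact sizes depend on your Python version):


import sys

# Size of a single short string object (tens of bytes on CPython 3)
print(sys.getsizeof('A'))

# Shallow: only the 5 object references (5 * 8 = 40 bytes)
print(df['string_column'].memory_usage(index=False, deep=False))

# Deep: the references plus the underlying string objects
print(df['string_column'].memory_usage(index=False, deep=True))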

Optimizing Memory Usage

If your DataFrame is large, reducing memory consumption can significantly improve processing efficiency. Here are some common strategies:

  1. Downcasting numeric types: Use smaller data types to save memory.
  2. Converting object columns to categorical: If a column has few unique values, converting it to categorical can save memory.
  3. Dropping unnecessary columns: Remove columns that are not needed to free up memory.

Example of downcasting numeric types:


# Downcast int64 to the smallest unsigned integer type that fits (uint8 here)
df['int_column'] = pd.to_numeric(df['int_column'], downcast='unsigned')

# Downcast float64 to float32
df['float_column'] = pd.to_numeric(df['float_column'], downcast='float')
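
Strategy 2 from the list above, converting a low-cardinality object column to categorical, looks like this. Keep in mind that the savings only materialize when the column has far fewer unique values than rows, which our tiny example does not:


# Convert the string column to the pandas categorical dtype
df['string_column'] = df['string_column'].astype('category')

# Re-check memory usage after the conversions
print(df.memory_usage(deep=True))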

Analyzing Memory Usage for Large DataFrames

For large datasets, sorting the memory_usage() results can quickly reveal which columns consume the most memory.


# Per-column memory usage, heaviest columns first
usage = df.memory_usage(deep=True)
print(usage.sort_values(ascending=False))

This will display the memory usage sorted from highest to lowest, allowing you to pinpoint the heaviest columns and optimize accordingly.
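
To get a single headline number, sum the per-column values and convert to megabytes (a small sketch):


# Total footprint of the DataFrame, in megabytes
total_bytes = df.memory_usage(deep=True).sum()
print(f"Total memory: {total_bytes / 1024 ** 2:.2f} MB")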

Comparing Memory Reduction in a Table

Here’s an illustrative comparison of memory before and after optimization (exact figures depend on your data, pandas version, and Python version):

Column           Before Optimization (Bytes)    After Optimization (Bytes)
int_column       40                             25
float_column     40                             25
string_column    325                            180

As shown, the memory usage of int_column and float_column decreased significantly after downcasting, and converting object types to categorical can provide an even greater benefit.
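
You can produce this kind of comparison programmatically by capturing memory_usage() before and after the optimizations. A minimal sketch (df_optimized is just an illustrative name; the DataFrame is rebuilt from the original data dictionary so the “before” numbers are unoptimized):


# Rebuild the original DataFrame so the "before" numbers are unoptimized
df = pd.DataFrame(data)
before = df.memory_usage(deep=True)

# Apply the optimizations on a copy
df_optimized = df.copy()
df_optimized['int_column'] = pd.to_numeric(df_optimized['int_column'], downcast='unsigned')
df_optimized['float_column'] = pd.to_numeric(df_optimized['float_column'], downcast='float')
df_optimized['string_column'] = df_optimized['string_column'].astype('category')

# Compare the two side by side
after = df_optimized.memory_usage(deep=True)
print(pd.DataFrame({'Before (bytes)': before, 'After (bytes)': after}))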

Conclusion

The memory_usage() method is a powerful tool for analyzing and optimizing DataFrame memory usage in Python. By using it effectively, you can improve performance and handle large datasets more efficiently. Whether you’re working with numerical data or objects, applying the right memory management techniques can make a significant difference.

Other interesting article: How pandas info works in Python? Best example