How does pandas sample() work in Python? Best example


When working with large datasets in Python, we sometimes need to pick a random subset of the data for analysis, testing, or visualization. That’s where the sample() method from the pandas library comes in handy. In this article, I’ll walk you through how DataFrame.sample() works, the parameters it accepts, and some practical examples of how to use it in your code.

Understanding pandas.sample()

The sample() method in pandas returns a random selection of items from a Series, or of rows or columns from a DataFrame. This is particularly useful when we want to:

  • Extract a random sample of data for analysis.
  • Create test sets for machine learning models.
  • Oversample or bootstrap data by drawing rows at random, optionally with replacement.
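As a quick illustration of the test-set use case, here is a minimal sketch of a train/test split built on sample(). The toy DataFrame, its column names, and the 80/20 ratio are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical toy dataset, used only for illustration
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Hold out 20% of the rows as a test set; the seed makes the split repeatable
test_df = df.sample(frac=0.2, random_state=0)

# Every row that was not sampled goes into the training set
train_df = df.drop(test_df.index)

print(len(train_df), len(test_df))
```

This relies on the index labels being unique, because drop() removes rows by label.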

The basic syntax for DataFrame.sample() is:

DataFrame.sample(n=None, frac=None, replace=False, weights=None, random_state=None, axis=None, ignore_index=False)

Let’s break down the key parameters of this method.

Key Parameters in pandas.sample()

| Parameter | Usage |
| --- | --- |
| n | Number of items to return. Cannot be used together with frac. If neither n nor frac is given, one item is returned. |
| frac | Fraction of the data to sample. For example, frac=0.2 returns 20% of the data. |
| replace | If True, sampling is done with replacement, so the same row can be picked more than once. |
| weights | Weights for each row, which set the probability of that row being sampled. Weights that do not sum to 1 are normalized automatically. |
| random_state | Seed for the random number generator, used for reproducibility. |
| axis | Whether to sample rows (axis=0, the default) or columns (axis=1). |
| ignore_index | If True, the returned sample is relabeled 0, 1, …, n-1 instead of keeping the original index labels. |
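Two of these parameters are easier to see in action than to describe. The sketch below, which uses a small made-up DataFrame, shows that n and frac cannot be combined and what ignore_index changes. Note that ignore_index requires pandas 1.3 or later.

```python
import pandas as pd

# Small made-up DataFrame for illustration
df = pd.DataFrame({"value": [10, 20, 30, 40, 50]})

# n and frac are mutually exclusive: passing both raises a ValueError
error = None
try:
    df.sample(n=2, frac=0.5)
except ValueError as exc:
    error = exc
print("n and frac together:", error)

# By default the sampled rows keep their original index labels
kept = df.sample(n=3, random_state=1)
print(kept)

# With ignore_index=True the same rows are relabeled 0, 1, 2
relabeled = df.sample(n=3, random_state=1, ignore_index=True)
print(relabeled)
```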

Basic Example: Selecting a Random Sample of Rows

Let’s start with a simple example. Suppose we have the following DataFrame:

import pandas as pd

# Sample dataset
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
    'Age': [25, 30, 35, 40, 22],
    'Score': [85, 90, 78, 92, 88]
}

df = pd.DataFrame(data)

# Selecting 2 random rows
sampled_df = df.sample(n=2)

print(sampled_df)

This will output two random rows from the DataFrame. Each time you run it, you’ll likely see different results.

Using the frac Parameter

Instead of selecting a fixed number of rows, we can select a fraction of the data:

sampled_df = df.sample(frac=0.4)  # Selects 40% of the rows

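With our five-row dataset, this returns two rows, since 40% of 5 is exactly 2. When the fraction does not produce a whole number of rows, pandas rounds the result to the nearest whole number. A frac greater than 1 is only allowed together with replace=True.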
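Here is a short sketch of two frac behaviors worth knowing: rounding when the fraction does not divide evenly, and frac=1 as a way to shuffle. The ten-row DataFrame is an assumption chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Ten made-up rows so the fractions are easy to reason about
df = pd.DataFrame({"value": range(10)})

# 33% of 10 rows is 3.3, which is rounded to 3 rows
third = df.sample(frac=0.33, random_state=0)
print(len(third))

# frac=1 returns every row exactly once, in random order (a shuffle)
shuffled = df.sample(frac=1, random_state=0)
print(shuffled["value"].tolist())
```

Shuffling with frac=1 keeps the original index labels attached to each row, so you can still trace rows back to their original positions.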

Sampling with Replacement

By default, sampling is done without replacement, meaning a row can only appear once in the sample. If we set replace=True, rows can be selected multiple times:

sampled_df = df.sample(n=3, replace=True)

Now, the same row may appear more than once in the output.
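Replacement also matters when you want more rows than the dataset contains. The sketch below, using only the names from our example data, shows that such a request fails without replacement and succeeds with it.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie", "David", "Emma"]})

# Asking for 8 rows out of 5 fails when sampling without replacement
error = None
try:
    df.sample(n=8)
except ValueError as exc:
    error = exc
print("Without replacement:", error)

# With replacement the request succeeds, so at least one name must repeat
bootstrap = df.sample(n=8, replace=True, random_state=0)
print(bootstrap["Name"].value_counts())
```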

Setting a Random Seed for Reproducibility

To make results reproducible, we can pass a seed through the random_state parameter. With a fixed seed, the same data produces the same random sample every time we run the code:

sampled_df = df.sample(n=2, random_state=42)

Using a fixed seed is especially useful when sharing code or performing debugging.
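A quick way to convince yourself is to draw twice with the same seed and compare the results. This is a minimal check rather than a full test; the seed value 42 is arbitrary.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Score": [85, 90, 78, 92, 88],
})

# Two independent calls with the same seed
first = df.sample(n=2, random_state=42)
second = df.sample(n=2, random_state=42)

# equals() compares values and index labels, so this prints True
print(first.equals(second))
```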

Sampling Columns Instead of Rows

By default, sample() randomly selects rows. If we want to randomly select columns instead, we can use the axis=1 parameter:

sampled_df = df.sample(n=2, axis=1)

This returns a subset of columns rather than rows.
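To see what this looks like, the following sketch samples two of the three columns from our example data and inspects the result. Which two columns come back depends on the seed, so no specific output is assumed here.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "Age": [25, 30, 35, 40, 22],
    "Score": [85, 90, 78, 92, 88],
})

# Pick 2 of the 3 columns at random; all 5 rows are kept
cols = df.sample(n=2, axis=1, random_state=0)

print(cols.columns.tolist())
print(cols.shape)
```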

Weighted Sampling

Sometimes, we want some rows to have a higher chance of being selected. We can achieve this using the weights parameter:

weights = [0.1, 0.2, 0.3, 0.2, 0.2]
sampled_df = df.sample(n=2, weights=weights, random_state=42)

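In this example, rows with higher weights are more likely to be chosen. The list is matched to the rows by position, so Charlie (weight 0.3) is three times as likely to be drawn first as Alice (weight 0.1).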
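With only two rows drawn, the effect of the weights is hard to see. The sketch below draws many rows with replacement so that the observed frequencies approach the weights. Storing the weights in a column called "weight" is an assumption for this example; sample() accepts a column name for weights when sampling rows from a DataFrame.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David", "Emma"],
    "weight": [0.1, 0.2, 0.3, 0.2, 0.2],  # hypothetical weights column
})

# Draw 10,000 rows with replacement, using the "weight" column as weights
draws = df.sample(n=10_000, replace=True, weights="weight", random_state=0)

# Share of draws per name; these should be close to the weights above
freq = draws["Name"].value_counts(normalize=True)
print(freq.round(2))
```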

Conclusion

The sample() method covers most random sampling needs in pandas. Whether you need a fixed number of rows, a fraction of the data, sampling with replacement, or weighted selection, a single call handles it. Combined with random_state for reproducibility, these options let you draw subsets for analysis and testing with very little code.
