Q&A 20 How do you take a random sample from a large dataset in Python and R?

20.1 Explanation

Sampling is useful when working with large datasets, doing quick inspections, or preparing training/test splits. It lets you explore a manageable portion of the data instead of processing every row.
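
The examples in this section load the full dataset first and then sample from it. When a file is too large for that, one option is to sample while reading it row by row. The sketch below uses reservoir sampling with only the Python standard library. The helper name reservoir_sample and the in-memory CSV text are hypothetical stand-ins for illustration, not part of the diamonds workflow.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable,
    reading it once and keeping only k items in memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


# Hypothetical stand-in for a large CSV file on disk
fake_csv = io.StringIO(
    "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10_000))
)
reader = csv.DictReader(fake_csv)

sample = reservoir_sample(reader, k=5, seed=42)
print(len(sample))
```

With a real file, the same function could be given csv.DictReader(open(path, newline="")) so that only the sampled rows are ever held in memory.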

In this example, we’ll take a random sample of 500 rows from the diamonds dataset. With 53,940 rows and 10 columns, it is considerably larger and more varied than the 150-row iris dataset we’ve used so far.

🎨 Why Switch to the Diamonds Dataset?

The Iris dataset is a great starting point for learning structure, filtering, and variable creation — but it’s small and limited in variety.

The Diamonds dataset is much larger and richer, with:

Categorical variables like cut, color, and clarity
Continuous variables like price, carat, and dimensions
Greater potential for beautiful and insightful visualizations

We’ll continue using Iris for clarity and comparison, but starting in the Visualization Layer, you’ll also work with a sampled Diamonds dataset to practice chart types like:

Boxplots grouped by cut or clarity
Scatter plots of carat vs. price
Histograms of distribution patterns

Sampling allows us to keep things fast and responsive while still leveraging the power of a large, visual-friendly dataset.

20.2 Python Code

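import os

import pandas as pd  # seaborn returns the data as a pandas DataFrame
import seaborn as sns

# Load full diamonds dataset (downloaded on first use, then cached)
diamonds = sns.load_dataset("diamonds")

# Take a random sample of 500 rows; random_state makes it reproducible
sampled_diamonds = diamonds.sample(n=500, random_state=42)

# Save to CSV (optional); create the output folder if it does not exist
os.makedirs("data", exist_ok=True)
sampled_diamonds.to_csv("data/diamonds_sample.csv", index=False)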

# View first few rows
print(sampled_diamonds.head())

       carat        cut color clarity  depth  table  price     x     y     z
1388    0.24      Ideal     G    VVS1   62.1   56.0    559  3.97  4.00  2.47
50052   0.58  Very Good     F    VVS2   60.0   57.0   2201  5.44  5.42  3.26
41645   0.40      Ideal     E    VVS2   62.1   55.0   1238  4.76  4.74  2.95
42377   0.43    Premium     E    VVS2   60.8   57.0   1304  4.92  4.89  2.98
17244   1.55      Ideal     E     SI2   62.3   55.0   6901  7.44  7.37  4.61
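
The same sample() method also covers the other uses mentioned in the explanation, such as sampling by fraction or building a training/test split. The following is a minimal sketch on a small synthetic DataFrame (hypothetical stand-in data, so it runs without downloading anything). The column names cut and price mirror the diamonds dataset, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 1,000 rows with a categorical and a numeric column
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cut": rng.choice(["Fair", "Good", "Ideal"], size=1000, p=[0.1, 0.3, 0.6]),
    "price": rng.integers(300, 19000, size=1000),
})

# Sample a fraction of the rows instead of a fixed count
ten_percent = df.sample(frac=0.10, random_state=42)

# Stratified sample: 10% of the rows within each cut
stratified = df.groupby("cut").sample(frac=0.10, random_state=42)

# Simple 80/20 training/test split: sample 80%, keep the rest for testing
train = df.sample(frac=0.80, random_state=42)
test = df.drop(train.index)

print(len(ten_percent), len(train), len(test))
```

Dropping the sampled index labels only gives a clean split when the index is unique, as it is for the default integer index used here.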

20.3 R Code

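library(ggplot2)  # provides the diamonds dataset
library(dplyr)    # provides sample_n()
library(readr)    # provides write_csv()

# Load full diamonds dataset
data("diamonds")  # loads dataset into the environment

# Access the dataset
diamonds_df <- diamonds

# Take a random sample of 500 rows; set.seed() makes it reproducible
# (sample_n() is superseded in dplyr 1.0.0+ by slice_sample(n = 500))
set.seed(42)
diamonds_sample <- sample_n(diamonds_df, 500)

# Save to CSV (optional); create the output folder if it does not exist
dir.create("data", showWarnings = FALSE)
write_csv(diamonds_sample, "data/diamonds_sample.csv")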

# View first few rows
head(diamonds_sample)

# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.39 Ideal     I     VVS2     60.8    56   849  4.74  4.76  2.89
2  1.12 Very Good G     SI2      63.3    58  4478  6.7   6.63  4.22
3  0.51 Very Good G     VVS2     62.9    57  1750  5.06  5.12  3.2 
4  0.52 Very Good D     VS1      62.5    57  1829  5.11  5.16  3.21
5  0.28 Very Good E     VVS2     61.4    55   612  4.22  4.25  2.6 
6  1.01 Fair      F     SI1      67.2    60  4276  6.06  6     4.05