Q&A 20 How do you take a random sample from a large dataset in Python and R?

20.1 Explanation

Sampling is useful when working with large datasets, running quick inspections, or preparing training/test splits. It lets you explore, plot, and prototype on a manageable portion of the data instead of processing every row at each step.

In this example, we’ll take a random sample of 500 rows from the diamonds dataset, which has 53,940 rows and 10 variables and is considerably larger and more varied than the 150-row iris dataset we’ve used so far.
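The example in this section loads the full table first and then samples it. When a CSV file is too large for that, rows can be skipped while the file is being read. The block below is a minimal sketch of that idea, not part of the original workflow: it relies on the `skiprows` argument of `pandas.read_csv`, which accepts a function of the row index. The file `big_file.csv` and the names `keep_fraction` and `skip_row` are hypothetical and exist only for illustration. The resulting sample size is approximate, because each row is kept or skipped independently.

```python
import csv
import os
import random
import tempfile

import pandas as pd

# Build a stand-in "large" CSV so the sketch is self-contained (hypothetical data).
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "big_file.csv")
with open(csv_path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    for i in range(10_000):
        writer.writerow([i, i * 2])

keep_fraction = 0.05      # keep roughly 5% of the data rows
rng = random.Random(42)   # seeded so repeated runs skip the same rows


def skip_row(row_index):
    """Return True for rows that read_csv should skip."""
    if row_index == 0:
        return False      # row 0 is the header and must always be kept
    return rng.random() > keep_fraction


# Only the rows that survive skip_row are parsed into the DataFrame.
sampled = pd.read_csv(csv_path, skiprows=skip_row)
print(len(sampled), "rows read out of 10000")
```

For a dataset the size of diamonds this is unnecessary, since the whole table fits comfortably in memory; it becomes relevant only when the file itself is the bottleneck.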

🎨 Why Switch to the Diamonds Dataset?

The Iris dataset is a great starting point for learning structure, filtering, and variable creation, but it’s small and limited in variety.

The Diamonds dataset is much larger and richer, with:

  • Categorical variables like cut, color, and clarity
  • Continuous variables like price, carat, and dimensions
  • Greater potential for beautiful and insightful visualizations

We’ll continue using Iris for clarity and comparison, but starting in the Visualization Layer, you’ll also work with a sampled Diamonds dataset to practice chart types like:

  • Boxplots grouped by cut or clarity
  • Scatter plots of carat vs. price
  • Histograms of distribution patterns

Sampling allows us to keep things fast and responsive while still leveraging the power of a large, visual-friendly dataset.
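For grouped charts such as boxplots by cut, a simple random sample can leave rare categories with very few rows. One possible variation, sketched below under stated assumptions, samples the same fraction within each group so every level keeps its share. It assumes pandas 1.1 or later, where `DataFrameGroupBy.sample` is available, and it uses a synthetic table with made-up proportions in place of the real diamonds data so that it runs without a download.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for diamonds with an imbalanced "cut" column (hypothetical data).
rng = np.random.default_rng(42)
cuts = rng.choice(
    ["Fair", "Good", "Very Good", "Premium", "Ideal"],
    size=5_000,
    p=[0.03, 0.09, 0.22, 0.26, 0.40],
)
df = pd.DataFrame({"cut": cuts, "price": rng.integers(300, 19_000, size=5_000)})

# Draw 10% of the rows within each cut, so rare cuts are still represented.
stratified = df.groupby("cut").sample(frac=0.10, random_state=42)

# Compare group sizes in the sample with the full table.
print(stratified["cut"].value_counts())
```

The rest of this section uses a plain random sample, which is usually sufficient for a dataset where every category has hundreds of rows.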


20.2 Python Code

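import os

import pandas as pd
import seaborn as sns

# Load full diamonds dataset (seaborn downloads it on first use, then caches it)
diamonds = sns.load_dataset("diamonds")

# Take a random sample of 500 rows; random_state makes the draw reproducible
sampled_diamonds = diamonds.sample(n=500, random_state=42)

# Save to CSV (optional); create the data/ folder first if it does not exist
os.makedirs("data", exist_ok=True)
sampled_diamonds.to_csv("data/diamonds_sample.csv", index=False)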

# View first few rows
print(sampled_diamonds.head())
       carat        cut color clarity  depth  table  price     x     y     z
1388    0.24      Ideal     G    VVS1   62.1   56.0    559  3.97  4.00  2.47
50052   0.58  Very Good     F    VVS2   60.0   57.0   2201  5.44  5.42  3.26
41645   0.40      Ideal     E    VVS2   62.1   55.0   1238  4.76  4.74  2.95
42377   0.43    Premium     E    VVS2   60.8   57.0   1304  4.92  4.89  2.98
17244   1.55      Ideal     E     SI2   62.3   55.0   6901  7.44  7.37  4.61
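The same `sample()` method also covers the training/test split mentioned in the explanation. The following is a minimal sketch, assuming an 80/20 split is wanted: `frac` draws a proportion of rows instead of a fixed count, and the rows that were not drawn become the test set. The small synthetic table and the names `train` and `test` are illustrative, not part of the original example.

```python
import pandas as pd

# Small synthetic frame standing in for diamonds (hypothetical data).
df = pd.DataFrame(
    {
        "carat": [0.2 + 0.01 * i for i in range(1_000)],
        "price": [300 + 5 * i for i in range(1_000)],
    }
)

# frac draws a proportion of the rows; random_state fixes which rows are drawn.
train = df.sample(frac=0.8, random_state=42)

# Every row whose index was not drawn for training goes to the test set.
test = df.drop(train.index)

print(len(train), len(test))  # 800 200
```

Dropping by index works here because the index values are unique; a table with duplicate index labels would need `reset_index(drop=True)` first.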

20.3 R Code

library(ggplot2)
library(dplyr)
library(readr)

# Load full diamonds dataset
data("diamonds")  # loads dataset into the environment

# Access the dataset
diamonds_df <- diamonds

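# Take a random sample of 500 rows (set.seed makes the draw reproducible)
set.seed(42)
# Note: sample_n() is superseded in dplyr 1.0.0+; slice_sample(diamonds_df, n = 500) is the current equivalent
diamonds_sample <- sample_n(diamonds_df, 500)

# Save to CSV (optional); create the data/ folder first if it does not exist
dir.create("data", showWarnings = FALSE)
write_csv(diamonds_sample, "data/diamonds_sample.csv")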

# View first few rows
head(diamonds_sample)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.39 Ideal     I     VVS2     60.8    56   849  4.74  4.76  2.89
2  1.12 Very Good G     SI2      63.3    58  4478  6.7   6.63  4.22
3  0.51 Very Good G     VVS2     62.9    57  1750  5.06  5.12  3.2 
4  0.52 Very Good D     VS1      62.5    57  1829  5.11  5.16  3.21
5  0.28 Very Good E     VVS2     61.4    55   612  4.22  4.25  2.6 
6  1.01 Fair      F     SI1      67.2    60  4276  6.06  6     4.05