Q&A 20 How do you take a random sample from a large dataset in Python and R?
20.1 Explanation
Sampling is useful when working with large datasets, doing quick inspections, or preparing training/test splits. It lets you explore a manageable subset of the data instead of processing every row each time.
In this example, we’ll take a random sample of 500 rows from the diamonds dataset, which has 53,940 rows and 10 columns and is far larger and more varied than the 150-row iris dataset we’ve used so far.
🎨 Why Switch to the Diamonds Dataset?
The Iris dataset is a great starting point for learning structure, filtering, and variable creation — but it’s small and limited in variety.
The Diamonds dataset is much larger and richer, with:
- Categorical variables like cut, color, and clarity
- Continuous variables like price, carat, and dimensions
- Greater potential for beautiful and insightful visualizations
We’ll continue using Iris for clarity and comparison, but starting in the Visualization Layer, you’ll also work with a sampled Diamonds dataset to practice chart types like:
- Boxplots grouped by cut or clarity
- Scatter plots of carat vs. price
- Histograms of distribution patterns
Sampling allows us to keep things fast and responsive while still leveraging the power of a large, visual-friendly dataset.
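The examples in this section load the full dataset before sampling. When a file is too large for that, a sample can be drawn while streaming through it row by row. The following is a minimal sketch using reservoir sampling (Algorithm R) with only the Python standard library; the function name reservoir_sample and the in-memory CSV text are hypothetical stand-ins, not part of this guide's code.

```python
import csv
import io
import random


def reservoir_sample(rows, k, seed=None):
    """Return up to k items chosen uniformly at random from an iterable."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows
            reservoir.append(row)
        else:
            # Keep the new row with probability k / (i + 1)
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = row
    return reservoir


# Hypothetical stand-in for a large CSV file on disk
csv_text = "id,value\n" + "\n".join(f"{i},{i * 2}" for i in range(10_000))
reader = csv.DictReader(io.StringIO(csv_text))

# Only k rows are held in memory at any time
sample = reservoir_sample(reader, k=500, seed=42)
print(len(sample))
```

Each row ends up in the sample with equal probability, and memory use grows with k rather than with the size of the file.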
20.2 Python Code
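import os
import pandas as pd
import seaborn as sns
# Load full diamonds dataset (downloaded on first use, then cached)
diamonds = sns.load_dataset("diamonds")
# Take a random sample of 500 rows; random_state makes it reproducible
sampled_diamonds = diamonds.sample(n=500, random_state=42)
# Save to CSV (optional); create the data/ folder first if it is missing
os.makedirs("data", exist_ok=True)
sampled_diamonds.to_csv("data/diamonds_sample.csv", index=False)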
# View first few rows
print(sampled_diamonds.head())
       carat        cut color clarity  depth  table  price     x     y     z
1388    0.24      Ideal     G    VVS1   62.1   56.0    559  3.97  4.00  2.47
50052   0.58  Very Good     F    VVS2   60.0   57.0   2201  5.44  5.42  3.26
41645   0.40      Ideal     E    VVS2   62.1   55.0   1238  4.76  4.74  2.95
42377   0.43    Premium     E    VVS2   60.8   57.0   1304  4.92  4.89  2.98
17244   1.55      Ideal     E     SI2   62.3   55.0   6901  7.44  7.37  4.61
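Beyond a fixed row count, pandas can also sample by fraction, sample within groups, and produce a simple train/test split. The sketch below assumes a small synthetic DataFrame (a hypothetical stand-in for diamonds, so it runs without downloading anything) and pandas 1.1 or later for the grouped sample; the column values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the diamonds data, so the sketch runs offline
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "cut": rng.choice(["Fair", "Good", "Ideal"], size=1_000),
    "price": rng.integers(300, 19_000, size=1_000),
})

# Sample a fraction of rows instead of a fixed count
ten_percent = df.sample(frac=0.10, random_state=42)

# Stratified sample: 10% of the rows within each cut
stratified = df.groupby("cut").sample(frac=0.10, random_state=42)

# Simple train/test split: 80% for training, the remaining rows for testing
train = df.sample(frac=0.80, random_state=42)
test = df.drop(train.index)

print(len(ten_percent), len(train), len(test))
```

Dropping the training index from the original frame guarantees that no row appears in both splits, provided the index values are unique.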
20.3 R Code
library(ggplot2)
library(dplyr)
library(readr)
# Load full diamonds dataset
data("diamonds") # loads dataset into the environment
# Access the dataset
diamonds_df <- diamonds
# Take a random sample of 500 rows
set.seed(42)
diamonds_sample <- sample_n(diamonds_df, 500)
# Save to CSV (optional); create the data/ folder first if it is missing
dir.create("data", showWarnings = FALSE)
write_csv(diamonds_sample, "data/diamonds_sample.csv")
# View first few rows
head(diamonds_sample)
# A tibble: 6 × 10
  carat cut       color clarity depth table price     x     y     z
  <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1  0.39 Ideal     I     VVS2     60.8    56   849  4.74  4.76  2.89
2  1.12 Very Good G     SI2      63.3    58  4478  6.7   6.63  4.22
3  0.51 Very Good G     VVS2     62.9    57  1750  5.06  5.12  3.2
4  0.52 Very Good D     VS1      62.5    57  1829  5.11  5.16  3.21
5  0.28 Very Good E     VVS2     61.4    55   612  4.22  4.25  2.6
6  1.01 Fair      F     SI1      67.2    60  4276  6.06  6     4.05