Q&A 3 What are common sources of datasets for Python and R?

3.1 Explanation

Before working with data, it’s important to know where data comes from. In both Python and R, you can use:

  1. Public datasets from libraries or platforms
  2. Downloaded datasets from repositories
  3. Real-world data from research, surveys, APIs, or government sources

These sources help you practice data skills using real, structured information.

Common sources include:

  • Built-in datasets:
    • Python: seaborn, sklearn.datasets, statsmodels, pydataset
    • R: datasets package, MASS, ggplot2, palmerpenguins
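  • Online repositories:
    • UCI Machine Learning Repository, Kaggle Datasets, data.gov, World Bank Open Data (links in section 3.4)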
  • Research & Surveys:
    • CSV/Excel/JSON files published with academic papers or institutions
    • Survey data from organizations (e.g., Pew Research, Eurostat)
  • APIs and live feeds:
    • Weather, financial markets, genomics, social media (e.g., Twitter API)
  • Local files:
    • Saved from tools like Excel, Google Sheets, SPSS, or exported from databases
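
Files from these sources often arrive as CSV or JSON, and the two formats behave a little differently once loaded. The sketch below uses only Python's standard library and two small in-memory strings; the records are made up for illustration, and the helper names rows_from_csv and rows_from_json are hypothetical. It shows that CSV values are read as text, while JSON keeps numbers as numbers.

```python
import csv
import io
import json

# Made-up sample records, used only for illustration.
csv_text = "species,bill_length_mm\nAdelie,39.1\nGentoo,46.1\n"
json_text = (
    '[{"species": "Adelie", "bill_length_mm": 39.1},'
    ' {"species": "Gentoo", "bill_length_mm": 46.1}]'
)


def rows_from_csv(text):
    """Parse CSV text into a list of dicts; every value is read as a string."""
    return list(csv.DictReader(io.StringIO(text)))


def rows_from_json(text):
    """Parse a JSON array of objects into a list of dicts; numbers keep their type."""
    return json.loads(text)


csv_rows = rows_from_csv(csv_text)
json_rows = rows_from_json(json_text)

# CSV carries no type information, so numeric columns need an explicit conversion.
for row in csv_rows:
    row["bill_length_mm"] = float(row["bill_length_mm"])

# After the conversion, both sources describe the same records.
print(csv_rows == json_rows)
```

In practice a data-frame library usually handles this type inference for you; the point here is only that the source format affects what you get back.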

3.2 Python Package-Based Datasets

  • Seaborn:
import seaborn as sns
iris = sns.load_dataset("iris")
print(iris[:5])
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa
  • Scikit-learn:
from sklearn import datasets
irisml = datasets.load_iris()
print(irisml.data[:5])
[[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]]
  • Statsmodels:

import statsmodels.api as sm
df = sm.datasets.get_rdataset("Guerry", "HistData").data

3.3 R Package-Based Datasets

  • datasets package:

    iris <- datasets::iris
    head(iris)
      Sepal.Length Sepal.Width Petal.Length Petal.Width Species
    1          5.1         3.5          1.4         0.2  setosa
    2          4.9         3.0          1.4         0.2  setosa
    3          4.7         3.2          1.3         0.2  setosa
    4          4.6         3.1          1.5         0.2  setosa
    5          5.0         3.6          1.4         0.2  setosa
    6          5.4         3.9          1.7         0.4  setosa
  • ggplot2:

    data("diamonds", package = "ggplot2")
    head(diamonds)
      carat       cut color clarity depth table price    x    y    z
    1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
    2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
    3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
    4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
    5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
    6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
  • palmerpenguins (if installed):

    library(palmerpenguins)
    head(penguins)
    # A tibble: 6 × 8
      species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
      <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
    1 Adelie  Torgersen           39.1          18.7               181        3750
    2 Adelie  Torgersen           39.5          17.4               186        3800
    3 Adelie  Torgersen           40.3          18                 195        3250
    4 Adelie  Torgersen           NA            NA                  NA          NA
    5 Adelie  Torgersen           36.7          19.3               193        3450
    6 Adelie  Torgersen           39.3          20.6               190        3650
    # ℹ 2 more variables: sex <fct>, year <int>

3.4 Online Public Data Sources

Source                              Link
UCI Machine Learning Repository     https://archive.ics.uci.edu/ml/
Kaggle Datasets                     https://www.kaggle.com/datasets
data.gov (US Government)            https://www.data.gov
Awesome Public Datasets             https://github.com/awesomedata/awesome-public-datasets
World Bank Open Data                https://data.worldbank.org/

💡 Tip: Always save downloaded datasets in your data/ folder and reference them using relative paths like data/filename.csv.
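
One way to follow this tip is a small helper that downloads a file into data/ once and then reuses the local copy. This is a minimal sketch using only the standard library; the function name fetch_dataset and its parameters are hypothetical, and the URL in the usage comment is a placeholder, not a real dataset.

```python
import shutil
from pathlib import Path
from urllib.request import urlopen


def fetch_dataset(url: str, filename: str, data_dir: str = "data") -> Path:
    """Download url to data_dir/filename unless that file already exists."""
    target = Path(data_dir) / filename  # relative path such as data/filename.csv
    target.parent.mkdir(parents=True, exist_ok=True)  # create data/ if missing
    if not target.exists():  # reuse the local copy on later runs
        with urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return target


# Usage (placeholder URL, replace with a real dataset link):
# path = fetch_dataset("https://example.org/filename.csv", "filename.csv")
# print(path)
```

Because the returned path is relative to the project folder, the same script should keep working when the project is moved or shared, as long as it is run from the project root.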


✅ Now that you know where to find data, let’s learn how to save, load and preview it in your Python or R environment.