Q&A 3 What are common sources of datasets for Python and R?
3.1 Explanation
Before working with data, it’s important to know where data comes from. In both Python and R, you can use:
- Public datasets from libraries or platforms
- Downloaded datasets from repositories
- Real-world data from research, surveys, APIs, or government sources
These sources help you practice data skills using real, structured information.
Common sources include:
- Built-in datasets:
  - Python: seaborn, sklearn.datasets, statsmodels, pydataset
  - R: datasets package, MASS, ggplot2, palmerpenguins
- Online repositories (listed in 3.4 Online Public Data Sources)
- Research & Surveys:
  - CSV/Excel/JSON files published with academic papers or institutions
  - Survey data from organizations (e.g., Pew Research, Eurostat)
- APIs and live feeds:
  - Weather, financial markets, genomics, social media (e.g., Twitter API)
- Local files:
  - Saved from tools like Excel, Google Sheets, SPSS, or exported from databases
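Whatever the source, the data usually arrives as text in a format such as CSV or JSON. The sketch below is only an illustration, using Python's standard library: the two sample strings and the helper names `records_from_csv` and `records_from_json` are made up for this example and stand in for an exported spreadsheet file and an API response body.

```python
import csv
import io
import json

# Hypothetical CSV content, standing in for a file exported from a spreadsheet.
CSV_TEXT = "city,temp_c\nOslo,4.5\nLima,19.0\n"

# Hypothetical JSON payload, standing in for the body of an API response.
JSON_TEXT = '[{"city": "Oslo", "temp_c": 4.5}, {"city": "Lima", "temp_c": 19.0}]'


def records_from_csv(text):
    """Parse CSV text into a list of dicts, converting temp_c to float."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"city": row["city"], "temp_c": float(row["temp_c"])} for row in rows]


def records_from_json(text):
    """Parse a JSON array of objects into a list of dicts."""
    return json.loads(text)


csv_records = records_from_csv(CSV_TEXT)
json_records = records_from_json(JSON_TEXT)

# Both formats end up as the same list of records.
print(csv_records)
print(csv_records == json_records)
```

CSV stores every value as text, so the numeric column has to be converted explicitly, while JSON keeps numbers as numbers.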
3.2 Python Package-Based Datasets
- Seaborn (output of sns.load_dataset("iris").head(), with seaborn imported as sns):
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
- Scikit-learn (first five rows of load_iris().data from sklearn.datasets):
[[5.1 3.5 1.4 0.2]
[4.9 3. 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5. 3.6 1.4 0.2]]
- Statsmodels:
    import statsmodels.api as sm
    df = sm.datasets.get_rdataset("Guerry", "HistData").data
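For a dataset that ships with the package and loads without a network connection, here is a minimal sketch using scikit-learn's `load_iris`. It assumes scikit-learn is installed; the variable names are illustrative. By contrast, seaborn's `load_dataset` and statsmodels' `get_rdataset` fetch their data online.

```python
from sklearn.datasets import load_iris

# load_iris returns a Bunch object holding the data array and its metadata.
iris = load_iris()

# Feature matrix: 150 rows (flowers) by 4 columns (measurements in cm).
first_five = iris.data[:5]

print(iris.feature_names)  # column names for the feature matrix
print(first_five)          # first five rows of measurements

# Targets are integer codes; target_names maps them back to species labels.
first_five_species = iris.target_names[iris.target[:5]]
print(first_five_species)
```

The feature matrix holds only numbers; the species labels live in a separate `target` array, unlike the seaborn version, where species is a column of the DataFrame.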
3.3 R Package-Based Datasets
- datasets package:
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
- ggplot2:
  carat       cut color clarity depth table price    x    y    z
1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
- palmerpenguins (if installed):
# A tibble: 6 × 8
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
1 Adelie  Torgersen           39.1          18.7               181        3750
2 Adelie  Torgersen           39.5          17.4               186        3800
3 Adelie  Torgersen           40.3          18                 195        3250
4 Adelie  Torgersen           NA            NA                  NA          NA
5 Adelie  Torgersen           36.7          19.3               193        3450
6 Adelie  Torgersen           39.3          20.6               190        3650
# ℹ 2 more variables: sex <fct>, year <int>
3.4 Online Public Data Sources
| Source | Link |
|---|---|
| UCI Machine Learning Repo | https://archive.ics.uci.edu/ml/ |
| Kaggle Datasets | https://www.kaggle.com/datasets |
| data.gov (US Government) | https://www.data.gov |
| Awesome Public Datasets | https://github.com/awesomedata/awesome-public-datasets |
| World Bank Open Data | https://data.worldbank.org/ |
💡 Tip: Always save downloaded datasets in your data/ folder and reference them using relative paths like data/filename.csv.
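The tip can be sketched in Python with the standard library only. The helper names `save_rows` and `load_rows`, the file name example.csv, and the sample rows are hypothetical, and a temporary directory stands in for your project folder so the example leaves nothing behind.

```python
import csv
import tempfile
from pathlib import Path


def save_rows(project_dir, filename, header, rows):
    """Write rows to <project_dir>/data/<filename>; return the path relative to project_dir."""
    data_dir = Path(project_dir) / "data"
    data_dir.mkdir(parents=True, exist_ok=True)  # create data/ if it is missing
    target = data_dir / filename
    with target.open("w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        writer.writerow(header)
        writer.writerows(rows)
    return target.relative_to(project_dir)


def load_rows(project_dir, relative_path):
    """Read a CSV file given by a path relative to project_dir."""
    with (Path(project_dir) / relative_path).open(newline="", encoding="utf-8") as fh:
        return list(csv.reader(fh))


# A temporary directory plays the role of the project folder here.
with tempfile.TemporaryDirectory() as project:
    rel = save_rows(project, "example.csv", ["id", "value"], [[1, "a"], [2, "b"]])
    rel_posix = rel.as_posix()
    loaded = load_rows(project, rel)

print(rel_posix)
print(loaded)
```

Because only the relative path is stored, the same code keeps working when the project folder is moved or shared with someone else.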
✅ Now that you know where to find data, let’s learn how to save, load and preview it in your Python or R environment.