Appendix: Data Sources

This appendix documents the datasets used in this book. For each one: where it came from, what we did to it, and how to cite it.

The Meuse Dataset

We use the Meuse River dataset throughout the geostatistics chapters. The original meuse.all from the gstat package has 17 columns, most of which we never touch - landuse codes, indicator variables for which survey the row came from, a lime field test result. Handing students a 17-column data frame when you only ever use 7 of those columns is a tax on attention. Same problem with the original meuse.grid: it had a normalized distance column and two arbitrary area-split columns (part.a, part.b) that served no purpose in this book. So we stripped both down. meuse2 has the coordinates and the metals plus organic matter. meuse.grid2 has the grid coordinates and a set of modern, interpretable covariates. The covariates that matter for the regression kriging chapter - soil type, flooding frequency, river distance - live in meuse.grid2 and get extracted onto the point data when needed using terra::extract(). That workflow is more realistic than having covariates pre-joined into the point dataset anyway.

Origin

The original data were collected in the early 1990s near the village of Stein in the Netherlands, in the floodplain of the Meuse River. The data include topsoil heavy metal concentrations (cadmium, copper, lead, zinc) at 155 sample locations, along with a handful of environmental covariates. The dataset was published and distributed with the sp package in R by Edzer Pebesma and Roger Bivand, and it has been a workhorse teaching dataset in spatial statistics ever since.

The original dataset comes in two objects from the sp package: meuse (155 sample points) and meuse.grid (the prediction grid). In this book we use trimmed versions of both. For the point data we use meuse2, and for the prediction grid we use meuse.grid2. Load them like this:

meuse2 <- readRDS("data/meuse2.Rds")            # 164 sample points
meuse.grid2 <- readRDS("data/meuse.grid2.Rds")  # prediction grid

The prediction grid

meuse.grid2 extends the original meuse.grid with modern covariate layers pulled from open data sources. Here is what was added and where it came from.

Elevation and terrain derivatives were pulled using the elevatr package (Hollister et al. 2025), which fetches AWS terrain tiles at roughly 10 m resolution (zoom level 14). The elevation raster was reprojected to RD New (EPSG:28992) before computing terrain derivatives with terra::terrain() and terra::flowAccumulation() (Hijmans 2026).

Land cover is from ESA WorldCover 2021, a global 10 m land cover product based on Sentinel-1 and Sentinel-2 imagery. The data are freely available from the ESA public S3 bucket and were accessed using the rstac package (Simoes et al. 2021). The study area straddles the boundary between two 3x3 degree tiles (N48E003 and N51E003), which were merged before cropping.

River distance was computed using main channel geometries pulled from OpenStreetMap via the osmdata package (Mark Padgham et al. 2017), filtered to features tagged as waterway=river with names matching Maas, Meuse, or Afgesneden Maas. Streams and canals were excluded to avoid spurious near-zero distances in the upland parts of the study area. Distance from each grid cell to the nearest retained feature was calculated in meters using sf::st_distance() (Pebesma 2026).

Fourteen boundary cells with NA terrain values (edge effects from the neighborhood calculations in terrain()) were dropped, leaving 3089 grid cells.

The script that builds the augmented grid is in rScripts/buildMeuseCovariates.R.

Variable descriptions

meuse2 (point data, 164 rows)

Variable Units Description
x meters Easting, RD New (EPSG:28992)
y meters Northing, RD New (EPSG:28992)
cadmium mg/kg Topsoil cadmium concentration
copper mg/kg Topsoil copper concentration
lead mg/kg Topsoil lead concentration
zinc mg/kg Topsoil zinc concentration
om % Topsoil organic matter

Note that meuse.all from gstat has 164 observations, nine more than the 155 in sp::meuse. The column in.meuse155 in the original flags which rows belong to the classic 155-point dataset. We keep all 164 in meuse2.

meuse.grid2 (prediction grid, 3089 rows)

Variable Type Description
x numeric Easting, RD New (EPSG:28992), meters
y numeric Northing, RD New (EPSG:28992), meters
soil factor Soil type (see below)
ffreq factor Flooding frequency class (see below)
elev_m numeric Elevation above sea level, meters
slope_deg numeric Terrain slope, degrees
aspect_deg numeric Aspect (direction of slope), degrees from north
tpi numeric Topographic position index: positive values are ridges/hilltops, negative values are valleys
twi numeric Topographic wetness index: higher values indicate areas where water accumulates
landcover integer ESA WorldCover class code (see below)
river_dist_m numeric Distance to nearest OSM waterway, meters

Categorical variable codes

Soil type (soil) — Dutch 1:50,000 soil map:

Code Name Description
1 Rd10A Calcareous weakly-developed meadow soils, light sandy clay
2 Rd90C/VII Non-calcareous weakly-developed meadow soils, heavy sandy clay to light clay
3 Bkd26/VII Red Brick soil, fine-sandy, silty light clay

A note from the original documentation: it is questionable whether the soil data in meuse.grid2 come from a real soil map, as they do not match the published 1:50,000 map. Use them with that caveat in mind.

Flooding frequency (ffreq):

Code Description
1 Once in two years
2 Once in ten years
3 Once in 50 years

The original documentation notes that it is not known how the flooding frequency map was generated.

ESA WorldCover land cover (landcover):

Code Class
10 Tree cover
20 Shrubland
30 Grassland
40 Cropland
50 Built-up
60 Bare / sparse vegetation
80 Permanent water bodies
90 Herbaceous wetland

The full WorldCover legend has a few additional classes (snow/ice, mangroves, moss/lichen) that don’t appear in this area.

Maps

Code
library(tidyverse)
library(terra)
library(tidyterra)
library(patchwork)

meuse.grid2 <- readRDS("data/meuse.grid2.Rds")
r <- rast(meuse.grid2, type = "xyz", crs = "EPSG:28992")

# helper for continuous maps
cmap <- function(lyr, title) {
  ggplot() +
    geom_spatraster(data = r[[lyr]]) +
    scale_fill_whitebox_c(palette = "muted", na.value = NA) +
    labs(title = title, fill = NULL) +
    theme_minimal() +
    theme(axis.text = element_blank(), axis.title = element_blank(),
          panel.grid = element_blank())
}

# categorical variables need factor levels and discrete fill
soil_r <- as.factor(r[["soil"]])
levels(soil_r) <- data.frame(value = 1:3,
                              label = c("Rd10A", "Rd90C/VII", "Bkd26/VII"))
p_soil <- ggplot() +
  geom_spatraster(data = soil_r) +
  scale_fill_manual(values = c("#d8b365","#5ab4ac","#c7eae5"),
                    na.value = NA, na.translate = FALSE) +
  labs(title = "Soil type", fill = NULL) +
  theme_minimal() +
  theme(axis.text = element_blank(), axis.title = element_blank(),
        panel.grid = element_blank())

ffreq_r <- as.factor(r[["ffreq"]])
levels(ffreq_r) <- data.frame(value = 1:3,
                               label = c("< 2 yr", "2-10 yr", "10-50 yr"))
p_ffreq <- ggplot() +
  geom_spatraster(data = ffreq_r) +
  scale_fill_manual(values = c("#2166ac","#92c5de","#f4a582"),
                    na.value = NA, na.translate = FALSE) +
  labs(title = "Flood frequency", fill = NULL) +
  theme_minimal() +
  theme(axis.text = element_blank(), axis.title = element_blank(),
        panel.grid = element_blank())

lc_codes  <- c(10, 20, 30, 40, 50, 60, 80, 90)
lc_labels <- c("Tree cover","Shrubland","Grassland","Cropland",
               "Built-up","Bare/sparse","Water","Wetland")
lc_colors <- c("#1a9641","#a6d96a","#d9ef8b","#ffffbf",
               "#d7191c","#bdbdbd","#4575b4","#74add1")
lc_r <- as.factor(r[["landcover"]])
lc_present <- sort(unique(na.omit(values(r[["landcover"]]))))
idx <- lc_codes %in% lc_present
p_lc <- ggplot() +
  geom_spatraster(data = lc_r) +
  scale_fill_manual(values = lc_colors[idx], labels = lc_labels[idx],
                    na.value = NA, na.translate = FALSE) +
  labs(title = "Land cover (ESA WorldCover)", fill = NULL) +
  theme_minimal() +
  theme(axis.text = element_blank(), axis.title = element_blank(),
        panel.grid = element_blank())

(cmap("elev_m",       "Elevation (m)")      +
 cmap("slope_deg",    "Slope (degrees)")    +
 cmap("tpi",          "TPI")                +
 cmap("twi",          "TWI")                +
 cmap("river_dist_m", "River distance (m)") +
 p_soil + p_ffreq + p_lc) +
  plot_layout(ncol = 2)


Mexico Bird Richness

The file birdRichnessMexico.rds contains 200 point locations across Mexico, each with a count of bird species (nSpecies) plus a set of WorldClim bioclimatic variables.

Origin

The richness values come from the global bird species richness maps at BiodiversityMapping.org, assembled by Clinton Jenkins at Florida International University. The maps are derived from BirdLife International v7 range maps (2018), processed at 10x10 km resolution in the Eckert IV equal-area projection, native extant species only.

We sampled 500 random points across Mexico from the richness raster, then subsampled to 200 for the exercises. The points were reprojected to North America Albers Equal Area Conic centered on Mexico (SR-ORG:38), giving coordinates in kilometers.

WorldClim variables

Six bioclimatic variables from WorldClim v2.1 (Fick and Hijmans 2017) at 10 arc-minute resolution were extracted to the point locations using the geodata and terra packages:

Variable WorldClim code Description Units
mat BIO1 Annual mean temperature °C
tempSeason BIO4 Temperature seasonality SD × 100
tempRange BIO7 Temperature annual range °C
map BIO12 Annual precipitation mm
precipSeason BIO15 Precipitation seasonality CV
precipDryQ BIO17 Precipitation of driest quarter mm

The script that builds the dataset is in data/buildBirdRichnessMexico.R.

Permissions and citation

BiodiversityMapping.org permits free use for personal, not-for-profit purposes by students and educational institutions, with attribution. Cite the underlying papers when using the data:

Jenkins CN, Pimm SL, Joppa LN (2013). Global patterns of terrestrial vertebrate diversity and conservation. PNAS 110(28). doi:10.1073/pnas.1302251110

Pimm SL, Jenkins CN, Abell R, Brooks TM, Gittleman JL, Joppa LN, Raven PH, Roberts CM, Sexton JO (2014). The biodiversity of species and their rates of extinction, distribution, and protection. Science 344(6187): 1246752.


California Ozone and Precipitation

Two California datasets appear in the spatial data crash course and the interpolation chapters: ozone monitoring station records and weather station precipitation records. Both come from the rspat package by Robert Hijmans, which bundles teaching datasets for the rspatial.org tutorials.

California ozone (californiaOzonePoints.csv)

345 ozone monitoring stations across California with mean daily ozone (parts per billion, averaged 1980-2009). Originally from the US EPA Air Quality System. In rspat this is the airqual dataset; we exported it to CSV and trimmed to three columns (ozone, longitude, latitude). Used in the spatial data crash course to demonstrate point data and raster extraction.

California precipitation (prcpCA.rds, gridCA.rds)

prcpCA.rds has 432 weather stations with monthly and annual precipitation totals (mm), station names, elevations, and coordinates in the California Teale Albers projection (EPSG:3310). Station IDs follow the rspat convention (e.g., ID741 = Death Valley). In rspat this is the precipitation dataset.

gridCA.rds is a 4053-point regular prediction grid covering California in the same projection, used as the interpolation target in the IDW and kriging exercises.

Citation

Both datasets are distributed with the rspat package and should be cited as:

Hijmans RJ (2023). rspat: rspatial.org data. R package. https://github.com/rspatial/rspat