Code
library(tidyverse)This is likely review for most of you, so roll your eyes as needed. The idea is simple: once you have a fitted model in R, you can use it to predict values for data the model has never seen. This comes up constantly in regression kriging, where you fit a model to your sample points and then predict across an entire grid. Let’s walk through it.
We’ll use tidyverse(Wickham 2023) for data wrangling.
Here we create a data.frame called originalDat with three variables: x1 and x2 are n draws from a normal distribution (\(N(0,1)\)) and y is a function of x1 and x2 following a linear model with coefficients b0, b1, and b2 plus some noise (epsilon) drawn from \(N(0,2)\).
We fit a linear model with lm:
Call:
lm(formula = y ~ x1 + x2, data = originalDat)
Residuals:
Min 1Q Median 3Q Max
-3.4578 -0.8958 -0.1260 0.5328 6.1281
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.6995 0.2680 13.805 < 2e-16 ***
x1 0.7764 0.2406 3.226 0.00229 **
x2 3.0430 0.2996 10.158 1.93e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.883 on 47 degrees of freedom
Multiple R-squared: 0.6885, Adjusted R-squared: 0.6752
F-statistic: 51.93 on 2 and 47 DF, p-value: 1.253e-12
We have one clearly useful predictor and one borderline one. Now, say we want to extend that model to a new data set. The object lm1 is class lm, and most model classes in R have a predict function that accepts a newdata argument for exactly this purpose.
See the Understanding Methods and Generics in R aside if the idea of a generic predict function dispatching to predict.lm is unclear.
Let’s imagine a new data set with x1 and x2 but no observed y:
x1 x2
1 -0.04069848 -2.0009292
2 -1.55154482 0.3337772
3 1.16716955 1.1713251
4 -0.27364570 2.0595392
5 -0.46784532 -1.3768616
6 -1.23825233 -1.1508556
We pass newDat to predict via the newdata argument. The predicted values go into a new column yhat – the hat signaling that these are estimates, not observations.
x1 x2 yhat
1 -0.04069848 -2.0009292 -2.4210168
2 -1.55154482 0.3337772 3.5105920
3 1.16716955 1.1713251 8.1699702
4 -0.27364570 2.0595392 9.7542369
5 -0.46784532 -1.3768616 -0.8535832
6 -1.23825233 -1.1508556 -0.7639540
If you call predict without supplying newdata, it returns the fitted values from the original data. In other words, predict(lm1), fitted(lm1), and lm1$fitted.values all return the same thing.
This pattern – fit on sample data, predict onto a grid – is exactly what regression kriging does, with the added step of kriging the residuals on top of the OLS prediction.