Aside: Predicting New Data with a Fitted Model

This is likely review for most of you, so roll your eyes as needed. The idea is simple: once you have a fitted model in R, you can use it to predict values for data the model has never seen. This comes up constantly in regression kriging, where you fit a model to your sample points and then predict across an entire grid. Let’s walk through it.

Packages

We’ll use tidyverse(Wickham 2023) for data wrangling.

Code
library(tidyverse)

A Simulated Example

Here we create a data.frame called originalDat with three variables: x1 and x2 are n draws from a normal distribution (\(N(0,1)\)) and y is a function of x1 and x2 following a linear model with coefficients b0, b1, and b2 plus some noise (epsilon) drawn from \(N(0,2)\).

Code
set.seed(42)
n <- 50
b0 <- 4
b1 <- 0.6
b2 <- 3
originalDat <- data.frame(x1=rnorm(n), x2=rnorm(n), epsilon = rnorm(n,sd = 2))
originalDat$y <- b0 + b1 * originalDat$x1 + b2 * originalDat$x2 + originalDat$epsilon

We fit a linear model with lm:

Code
lm1 <- lm(y~x1 + x2, data = originalDat)
summary(lm1)

Call:
lm(formula = y ~ x1 + x2, data = originalDat)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4578 -0.8958 -0.1260  0.5328  6.1281 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   3.6995     0.2680  13.805  < 2e-16 ***
x1            0.7764     0.2406   3.226  0.00229 ** 
x2            3.0430     0.2996  10.158 1.93e-13 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.883 on 47 degrees of freedom
Multiple R-squared:  0.6885,    Adjusted R-squared:  0.6752 
F-statistic: 51.93 on 2 and 47 DF,  p-value: 1.253e-12

We have one clearly useful predictor and one borderline one. Now, say we want to extend that model to a new data set. The object lm1 is class lm, and most model classes in R have a predict function that accepts a newdata argument for exactly this purpose.

Note

See the Understanding Methods and Generics in R aside if the idea of a generic predict function dispatching to predict.lm is unclear.

Let’s imagine a new data set with x1 and x2 but no observed y:

Code
newDat <- data.frame(x1=rnorm(n), x2=rnorm(n))
head(newDat)
           x1         x2
1 -0.04069848 -2.0009292
2 -1.55154482  0.3337772
3  1.16716955  1.1713251
4 -0.27364570  2.0595392
5 -0.46784532 -1.3768616
6 -1.23825233 -1.1508556

We pass newDat to predict via the newdata argument. The predicted values go into a new column yhat – the hat signaling that these are estimates, not observations.

Code
newDat$yhat <- predict(object = lm1, newdata = newDat)
head(newDat)
           x1         x2       yhat
1 -0.04069848 -2.0009292 -2.4210168
2 -1.55154482  0.3337772  3.5105920
3  1.16716955  1.1713251  8.1699702
4 -0.27364570  2.0595392  9.7542369
5 -0.46784532 -1.3768616 -0.8535832
6 -1.23825233 -1.1508556 -0.7639540

If you call predict without supplying newdata, it returns the fitted values from the original data. In other words, predict(lm1), fitted(lm1), and lm1$fitted.values all return the same thing.

This pattern – fit on sample data, predict onto a grid – is exactly what regression kriging does, with the added step of kriging the residuals on top of the OLS prediction.