How to Validate a Model

Residuals, sensitivity, uncertainty, and honest tests for geographic models

Published April 4, 2026

Before You Start

You should know
What a model is, what an input and output are, and why simplified models can still be useful.

You will learn
How to test a model honestly, how to read residuals, how to ask which assumptions matter most, and how to report uncertainty without pretending to know more than you do.

Why this matters
Most weak modelling work fails not because the mathematics is too hard, but because the test is too easy or the uncertainty is hidden.

If this gets hard, focus on…
The four basic questions: Did the model fit? Did it generalize? What assumptions drive it? How uncertain is the answer?

In 1978, forecasters at the U.S. National Weather Service already knew that a single deterministic weather run could look impressively precise while still being wrong in exactly the ways that mattered most. The issue was not just “bad physics.” It was evaluation. A forecast could match yesterday’s observations closely, then fail tomorrow because the initial state was uncertain, the test set was too easy, or the model was sensitive to one badly known parameter. That lesson is not unique to weather. The same pattern appears in watershed delineation, fire spread, trade gravity models, snowpack estimation, and land-cover classification. A model that is not checked honestly is only a polished hypothesis.

This chapter makes validation a first-class modelling skill. The goal is not to turn every reader into a statistician. The goal is to give every reader a disciplined way to ask whether a model is good enough for the question at hand. We do that by separating four tasks that are often blurred together: fitting, validation, sensitivity analysis, and uncertainty reporting.

1. The Question

How do we tell whether a geographic model is trustworthy?

A model can fail in at least four different ways:

  • It fits the training data badly.
  • It fits the training data well but fails on new places or times.
  • It is too sensitive to one uncertain assumption.
  • It returns a single number without showing how uncertain that number is.

The practical question is therefore not just “what is the model output?” It is:

  1. How close is the model to observations?
  2. Does it still work away from the data used to build it?
  3. Which assumptions matter most?
  4. How wide is the plausible answer range?

2. The Conceptual Model

Validation is easiest to understand as a sequence of increasingly harder tests.

Validation Ladder

A trustworthy model passes more than one kind of test. Fitting is only the first rung. A useful model also has to survive out-of-sample testing, reveal where it is wrong, and show how much its answer moves when uncertain inputs change.

  • Fit: compare predictions with known observations and summarize the mismatch.
  • Validate: test on withheld places, times, or cases the model did not see during fitting.
  • Probe sensitivity: change important assumptions and watch which outputs move the most.
  • Report uncertainty: return a range, interval, or scenario spread instead of a single overconfident value.

A strong model is not the one with the prettiest map. It is the one that keeps working as the tests become more demanding.

Fit is not the same as validation

Suppose we predict streamflow, land surface temperature, or crop yield from a set of inputs. If we evaluate the model on the same cases used to tune it, we are mostly asking whether the model can reproduce familiar data. That is a fit question.

Validation asks the harder question: does the model still work on new cases?

For geographic data, “new” often needs to mean:

  • a different place
  • a different time period
  • a different event
  • or at least a spatially separated block of observations

This is why random train-test splits are often too optimistic in geography. Nearby samples share climate, terrain, and land-cover context. A random split may leak that shared structure into both train and test sets.
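One way to build such a test is to carve the study area into coarse blocks and withhold whole blocks at once. The sketch below shows the idea with invented coordinates, values, and block size; it illustrates the principle rather than prescribing settings for any particular dataset.

# Minimal sketch of a spatial block holdout (coordinates, values, and block size are illustrative)
samples = [
    {"x": 2.0,  "y": 3.0,  "value": 16.0},
    {"x": 2.5,  "y": 3.2,  "value": 17.0},
    {"x": 40.0, "y": 41.0, "value": 29.0},
    {"x": 41.0, "y": 40.5, "value": 30.0},
    {"x": 80.0, "y": 12.0, "value": 47.0},
    {"x": 81.0, "y": 11.5, "value": 46.0},
]

block_size = 25.0  # block width in the same units as x and y

def block_id(sample):
    # Samples falling in the same coarse grid cell share a block ID
    return (int(sample["x"] // block_size), int(sample["y"] // block_size))

held_out_block = (0, 0)  # withhold every sample from this block, not random individual samples
train = [s for s in samples if block_id(s) != held_out_block]
test = [s for s in samples if block_id(s) == held_out_block]
print("Training samples:", len(train), "Held-out samples:", len(test))

Because the held-out samples sit near each other but far from the training blocks, the test is closer to asking whether the model works in a new place.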

Residuals are maps of model failure

A residual is the difference between an observed value and a predicted value:

e_i = y_i - \hat{y}_i

Residuals are not just bookkeeping. They tell you where the model is systematically wrong.

  • Residuals clustered by region suggest missing spatial processes.
  • Residuals that grow with prediction size suggest heteroscedasticity.
  • Residuals that flip sign by season suggest temporal structure the model missed.

If the residual map still has obvious structure, the model has not captured all the structure in the system.
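A quick first diagnostic for the regional case is to group residuals by a region label and compare the group means: if one region is consistently over- or under-predicted, its mean residual sits far from zero. The labels and values below are illustrative only.

# Illustrative residuals tagged with a region label
residuals_by_site = [
    {"region": "north", "residual": 2.1},
    {"region": "north", "residual": 1.8},
    {"region": "south", "residual": -1.9},
    {"region": "south", "residual": -2.4},
]

# Mean residual per region; means far from zero suggest missing spatial structure
by_region = {}
for r in residuals_by_site:
    by_region.setdefault(r["region"], []).append(r["residual"])

for name, values in by_region.items():
    print(name, "mean residual:", round(sum(values) / len(values), 2))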

Sensitivity asks which assumptions matter

Many geographic models depend on uncertain inputs:

  • roughness coefficients in flood models
  • emissivity in thermal remote sensing
  • discount rates in energy economics
  • blend ratios in pipeline models
  • lapse rates in mountain temperature models

Sensitivity analysis varies one or more uncertain inputs and measures how strongly the output changes.

For a model output Y that depends on parameter p, a simple local sensitivity is:

S_p = \frac{\Delta Y / Y}{\Delta p / p}

Interpretation:

  • S_p = 0.2: a 10% change in the parameter moves the output by only about 2%
  • S_p = 1: the output changes in proportion to the parameter
  • S_p > 1: the output changes more than proportionally, so the model is highly sensitive to that parameter
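In code, this local sensitivity is just a ratio of relative changes around a baseline. The toy proportional model and baseline values below are assumptions chosen for illustration.

# Local sensitivity S_p = (relative change in output) / (relative change in parameter)
def model_output(p, forcing=25.0):
    # Toy one-parameter model: output proportional to the parameter
    return p * forcing

p0 = 1.5            # baseline parameter value
delta_p = 0.1 * p0  # perturb the parameter by 10%

y0 = model_output(p0)
y1 = model_output(p0 + delta_p)

S_p = ((y1 - y0) / y0) / (delta_p / p0)
print("Local sensitivity S_p:", round(S_p, 2))  # 1.0: output moves in proportion to the parameter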

Uncertainty is a range, not a confession of defeat

An uncertainty interval does not mean the model failed. It means the model is being honest about what is not perfectly known.

Common uncertainty sources:

  • measurement error in observations
  • uncertain parameters
  • uncertain initial conditions
  • structural uncertainty in the model form

The simplest useful rule is: if important inputs are uncertain, important outputs should usually be reported as ranges or scenario bands.
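One simple way to put that rule into practice is Monte Carlo sampling: draw many plausible values for an uncertain input, run the model for each draw, and report the spread of outputs rather than a single number. The uniform range and toy model below are illustrative assumptions, not a recommendation for any particular system.

import random

random.seed(42)  # fixed seed so the sketch is reproducible

def model_output(p, forcing=25.0):
    # Toy model: output proportional to one uncertain parameter
    return p * forcing

# Assume the uncertain parameter plausibly lies anywhere between 1.3 and 1.7
draws = [model_output(random.uniform(1.3, 1.7)) for _ in range(1000)]
draws.sort()

low = draws[int(0.05 * len(draws))]   # 5th percentile
high = draws[int(0.95 * len(draws))]  # 95th percentile
print("Central 90% range:", round(low, 1), "to", round(high, 1))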


3. Worked Example by Hand

Imagine a simple runoff model that predicts peak discharge Q from rainfall intensity I:

Q = cI

where c is a runoff coefficient.

Suppose we calibrate the model using three observed storms:

Storm   Observed rainfall I (mm/h)   Observed peak flow Q (m³/s)
1       10                           16
2       20                           29
3       30                           47

We choose c = 1.5, so the model predictions are:

Storm   Predicted Q (m³/s)   Residual e = y - \hat{y}
1       15                   1
2       30                   -1
3       45                   2

Step 1: summarize fit

The mean absolute error is:

MAE = \frac{|1| + |-1| + |2|}{3} = \frac{4}{3} \approx 1.33 \text{ m}^3/\text{s}

That tells us the model is off by about 1.33 m³/s on average for these storms.

Step 2: validate on a held-out storm

Now test a new storm with rainfall intensity 25 mm/h and observed peak flow 41 m³/s.

Predicted:

\hat{Q} = 1.5 \times 25 = 37.5 \text{ m}^3/\text{s}

Residual:

e = 41 - 37.5 = 3.5 \text{ m}^3/\text{s}

The model looked quite good on the calibration storms, but it performed worse on this new event. That is exactly why validation must be separate from fitting.

Step 3: test sensitivity

Suppose the runoff coefficient is uncertain and might be 1.3, 1.5, or 1.7.

For the same 25 mm/h storm:

  • Q = 1.3 \times 25 = 32.5
  • Q = 1.5 \times 25 = 37.5
  • Q = 1.7 \times 25 = 42.5

So uncertainty in a single parameter creates an output range of 32.5 to 42.5 m³/s.

Step 4: report uncertainty honestly

A defensible statement is not “the peak flow will be 37.5 m³/s.”

A defensible statement is:

“Under plausible runoff coefficients, the model predicts a peak flow of about 38 m³/s, with a simple parameter range of roughly 33 to 43 m³/s.”


4. Practical Workflow

When you finish a model, run through this checklist:

  1. Check fit metrics. Use MAE, RMSE, classification accuracy, or another metric appropriate to the task.
  2. Look at residuals. Plot them against space, time, and predicted magnitude.
  3. Hold something out honestly. Use a different place, time block, or event when possible.
  4. Vary uncertain inputs. Learn which parameters actually control the answer.
  5. Report a range. Use intervals, scenarios, or ensemble spread if uncertainty is material.

Which validation style fits which problem?

Problem type                    Better validation habit          Why
Raster classification           spatial holdout blocks           nearby pixels are not independent
Time-series forecasting         future-period holdout            random splits leak future information
Watershed or hazard modelling   event-based holdout              one storm is not all storms
Trade or mobility models        out-of-sample regions or years   coefficients may drift by context
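For the time-series row in the table above, the corresponding habit is easy to express directly: split by period, not at random, so no future information leaks into the training set. The years and values below are invented for illustration.

# Future-period holdout sketch: fit on earlier years, test only on later years
records = [(2018, 12.0), (2019, 14.5), (2020, 13.8), (2021, 15.2), (2022, 16.0)]  # (year, value)

cutoff_year = 2022
train = [r for r in records if r[0] < cutoff_year]
test = [r for r in records if r[0] >= cutoff_year]

print("Training years:", [r[0] for r in train])
print("Held-out years:", [r[0] for r in test])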

5. Computational Implementation

# Basic validation workflow for a simple regression-style geographic model

observed = [16.0, 29.0, 47.0]
predicted = [15.0, 30.0, 45.0]

residuals = [y - yhat for y, yhat in zip(observed, predicted)]
mae = sum(abs(r) for r in residuals) / len(residuals)
rmse = (sum(r**2 for r in residuals) / len(residuals)) ** 0.5

print("Residuals:", residuals)
print("MAE:", round(mae, 2))
print("RMSE:", round(rmse, 2))

# Simple one-parameter sensitivity analysis
rainfall = 25.0
coefficients = [1.3, 1.5, 1.7]
outputs = [c * rainfall for c in coefficients]

print("Sensitivity outputs:", outputs)
print("Prediction range:", (min(outputs), max(outputs)))
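The held-out storm from Step 2 of the worked example can be checked in the same style, using the same numbers as above:

# Validation on the held-out storm (rainfall 25 mm/h, observed peak flow 41 m³/s)
c = 1.5
holdout_rainfall = 25.0
holdout_observed = 41.0

holdout_predicted = c * holdout_rainfall                  # 37.5 m³/s
holdout_residual = holdout_observed - holdout_predicted   # 3.5 m³/s
print("Held-out prediction:", holdout_predicted)
print("Held-out residual:", holdout_residual)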

Professional tools go further, but the logic is the same:

  • sklearn for train/test workflows and metrics
  • spatial block cross-validation for geographic ML
  • Monte Carlo sampling for parameter uncertainty
  • ensemble methods when initial conditions are uncertain
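As one concrete version of the second item, scikit-learn's GroupKFold can treat spatial block labels as groups, so that no block ever appears in both the training and test folds. The feature array, values, and block labels below are illustrative; this is a sketch of the pattern, not a complete workflow.

# Spatial block cross-validation sketch using block labels as groups
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.array([[10.0], [12.0], [20.0], [22.0], [30.0], [32.0]])  # one predictor per sample
y = np.array([16.0, 17.0, 29.0, 30.0, 45.0, 47.0])              # observed values
blocks = np.array(["A", "A", "B", "B", "C", "C"])               # spatial block label per sample

gkf = GroupKFold(n_splits=3)
for train_idx, test_idx in gkf.split(X, y, groups=blocks):
    # Each fold holds out whole blocks, never isolated nearby samples
    print("Held-out blocks:", sorted(set(blocks[test_idx])))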

6. What Could Go Wrong?

Easy test sets

If training and test data are too similar, validation looks better than reality.

Single-metric thinking

A model can have a respectable RMSE and still fail in exactly the spatial locations that matter most.

Ignoring structural uncertainty

Even perfect parameter tuning cannot rescue a model built on the wrong process assumptions.

False precision

Reporting 41.73 instead of about 42 does not make a model more reliable. It often just hides uncertainty behind decimal places.


Summary

  • Fit and validation are different tasks.
  • Residuals show where the model still fails.
  • Sensitivity analysis tells you which assumptions matter most.
  • Uncertainty ranges make outputs more useful, not less.
  • A trustworthy model is one that survives harder tests, not one that only looks neat on the data it already saw.