Regression for Continuous Variables

Predicting spatial fields from spectral and environmental features

Published

March 1, 2026

Before You Start

You should know
That regression learns a relationship from examples, and that matrices can be used to organize many predictors at once.

You will learn
How linear regression predicts continuous variables, why gradient descent finds the coefficients, and why geographic validation has to respect spatial dependence.

Why this matters
A large share of geographic machine learning is really structured prediction of things like elevation, soil moisture, fire severity, yield, or biomass.

If this gets hard, focus on…
The central idea that we fit a function from features to a numeric target, then check whether that function generalizes to new places.

After the 2019–20 Australian Black Summer fires burned 18.6 million hectares — an area larger than Cambodia — ecologists needed rapid estimates of fire severity across landscapes too vast and too dangerous to survey on foot. The solution was regression. Spectral indices from Landsat and Sentinel-2 satellites, calibrated against 5,000 field plots where trained observers had recorded char depth, crown scorch height, and tree mortality, were used to predict a continuous fire severity score for every 10-metre pixel across the fire perimeter. The result was a map that would have taken a decade to produce in the field, delivered in weeks from a laptop.

This is the core bargain of supervised regression: a small number of expensive, precise measurements (field plots, sensor deployments, laboratory analyses) are used to train a model that can then predict the same quantity cheaply, at every point in space, from easy-to-acquire satellite observations. The relationship between predictors and the target variable is learned from data, not hardcoded from theory — which makes regression enormously flexible, and also capable of learning relationships that exist only in the training set and nowhere else.

The machinery underneath is elegant. Linear regression minimises a squared error loss by gradient descent, finding the weight vector that best fits the training data. Regularisation (L1 and L2 penalties) prevents the model from over-fitting by penalising extreme weights. Spatial cross-validation replaces standard k-fold validation, because geographic data violates independence: nearby pixels share the same weather, geology, and disturbance history, and a model that looks accurate in standard cross-validation may have simply memorised spatial autocorrelation. Getting the validation right is as important as getting the model right.

Regression Workflow

Regression Turns Sparse Measurements Into A Continuous Surface, But Only If Validation Respects Geography

This chapter is easiest to hold together when the reader sees four distinct stages: expensive ground truth, cheap wall-to-wall features, a fitted function, and a validation step that tests the model on new places rather than nearby pixels from the same landscape patch.

Field data

Measure A Small Sample Well

Probe points, field plots, or lab assays provide expensive but trustworthy target values.

Feature stack

Assemble Predictors Everywhere

Satellite bands, indices, terrain, and climate layers are available wall-to-wall across the map.

Model fit

Learn The Function

Regression estimates how each feature helps predict the continuous target and then maps predictions to every cell.

Bad test

Random Nearby Holdouts

The model can look excellent simply because nearby pixels share the same landscape signal.

Better test

Spatially Separated Validation

Hold out entire areas so the model has to generalize to genuinely new places, not just neighboring cells.

Goal

A Surface You Can Trust

The output is not just a prediction map. It is a prediction map whose generalization error has been tested geographically.

Most of the real discipline in geographic regression sits at the validation step. A good-looking map is not enough if the test design leaks spatial autocorrelation.

1. The Question

Given satellite imagery and a few field measurements, can we predict soil moisture across an entire watershed?

Direct measurement doesn’t scale.

Soil moisture probe: One location = $500

Watershed area: 100 km² = 1 million 10×10 m cells

Full coverage: $500 million (impossible)

Regression provides a solution:

Measure 100 locations with probes ($50,000).

Extract satellite features (vegetation, thermal, topography).

Learn relationship: Soil moisture = f(features)

Predict all 1 million cells.

Applications:

  • Digital elevation models (DEM from spectral features)
  • Soil property mapping (carbon, pH, texture)
  • Biomass estimation (forest carbon stocks)
  • Temperature mapping (urban heat islands)
  • Crop yield prediction (precision agriculture)
  • Snow water equivalent (mountain hydrology)

Why regression works:

Physical relationships exist between observables and targets.

Examples:

  • Higher NDVI → More biomass (vegetation absorbs carbon)
  • Lower thermal radiance → Higher soil moisture (evaporative cooling)
  • Steeper terrain → Lower soil depth (erosion)
  • Higher NIR/Red ratio → Taller, denser vegetation (an indirect proxy that can carry a local elevation signal)

Regression learns these patterns from data.


2. The Conceptual Model

Linear Regression

Simplest model: Linear relationship

y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p + \epsilon

Where:

  • y = target variable (e.g., elevation)
  • x_1, ..., x_p = features (NDVI, slope, aspect, etc.)
  • \beta_0 = intercept
  • \beta_1, ..., \beta_p = coefficients (weights)
  • \epsilon = error term (noise)

Matrix notation:

\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}

Where:

  • \mathbf{y} = n \times 1 vector of observations
  • \mathbf{X} = n \times (p+1) design matrix (includes intercept column)
  • \boldsymbol{\beta} = (p+1) \times 1 coefficient vector
  • \boldsymbol{\epsilon} = n \times 1 error vector

Goal: Find \boldsymbol{\beta} minimizing prediction error

Loss Function

Mean Squared Error (MSE):

L(\boldsymbol{\beta}) = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 = \frac{1}{n} \|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2

Why squared error?

  • Differentiable (enables gradient descent)
  • Penalizes large errors more (quadratic)
  • Corresponds to maximum likelihood under Gaussian noise

Minimization:

Take derivative, set to zero:

\frac{\partial L}{\partial \boldsymbol{\beta}} = -\frac{2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = 0

Closed-form solution (Normal Equation):

\boldsymbol{\beta}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}

Issues with Normal Equation:

  • Requires matrix inversion (O(p^3) computation)
  • Fails if \mathbf{X}^T\mathbf{X} singular (correlated features)
  • Memory intensive for large datasets
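
For small problems the closed form is easy to apply directly. A minimal NumPy sketch (using np.linalg.solve on the normal equations rather than an explicit inverse, which is cheaper and more stable; the synthetic data is illustrative):

import numpy as np

def fit_ols(X, y):
    """Solve the normal equations X^T X beta = X^T y directly."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# Example: 100 samples, design matrix with an intercept column
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
y = X @ np.array([200.0, -300.0, 45.0]) + rng.normal(scale=20, size=100)
print(fit_ols(X, y))   # approximately [200, -300, 45]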

Alternative: Gradient Descent

Gradient Descent

Iterative optimization:

Start with random \boldsymbol{\beta}^{(0)}

Repeat until convergence:

\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \alpha \nabla L(\boldsymbol{\beta}^{(t)})

Where:

  • \alpha = learning rate (step size)
  • \nabla L = gradient of loss function

Gradient calculation:

\nabla L(\boldsymbol{\beta}) = \frac{\partial L}{\partial \boldsymbol{\beta}} = -\frac{2}{n}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})

Algorithm (as a runnable NumPy sketch):

import numpy as np

def gradient_descent(X, y, learning_rate=0.01, max_epochs=1000, tolerance=1e-6):
    n, p = X.shape
    beta = np.random.randn(p) * 0.01          # initialize randomly
    for epoch in range(max_epochs):
        predictions = X @ beta
        errors = y - predictions
        gradient = -(2 / n) * X.T @ errors
        beta = beta - learning_rate * gradient
        if np.linalg.norm(gradient) < tolerance:  # convergence check
            break
    return beta

Advantages:

  • Scalable to large datasets
  • No matrix inversion
  • Works with regularization

Regularization

Problem: Overfitting

Complex model fits training noise.

Solution: Penalty on coefficient magnitude

Ridge Regression (L2):

L_{ridge}(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2

Penalizes large coefficients, shrinks toward zero.

Lasso Regression (L1):

L_{lasso}(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \sum_{j=1}^p |\beta_j|

Encourages sparsity (some coefficients exactly zero).

Elastic Net (combination):

L_{elastic}(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda_1 \sum_{j=1}^p |\beta_j| + \lambda_2 \|\boldsymbol{\beta}\|^2

Hyperparameter \lambda:

  • \lambda = 0: No regularization (ordinary least squares)
  • \lambda \to \infty: All coefficients → 0
  • Optimal \lambda: Found via cross-validation
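
To see the two penalties behave differently in practice, a short sketch with scikit-learn (synthetic data; note scikit-learn calls the penalty strength alpha, and each estimator scales its loss slightly differently):

import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 1.0]) + rng.normal(size=200)

for lam in [0.01, 1.0, 100.0]:
    ridge = Ridge(alpha=lam).fit(X, y)
    lasso = Lasso(alpha=lam).fit(X, y)
    print(f"lambda={lam}: ridge={np.round(ridge.coef_, 2)}, "
          f"lasso={np.round(lasso.coef_, 2)}")
# Ridge shrinks all coefficients smoothly; Lasso zeroes some out entirely.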

3. Building the Mathematical Model

Deriving Gradient Descent Update

Loss function:

L(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^n (y_i - \mathbf{x}_i^T\boldsymbol{\beta})^2

Expand:

L(\boldsymbol{\beta}) = \frac{1}{n}\sum_{i=1}^n y_i^2 - 2y_i\mathbf{x}_i^T\boldsymbol{\beta} + (\mathbf{x}_i^T\boldsymbol{\beta})^2

Gradient with respect to \boldsymbol{\beta}:

\frac{\partial L}{\partial \boldsymbol{\beta}} = \frac{1}{n}\sum_{i=1}^n -2y_i\mathbf{x}_i + 2\mathbf{x}_i(\mathbf{x}_i^T\boldsymbol{\beta})

= \frac{2}{n}\sum_{i=1}^n \mathbf{x}_i(\mathbf{x}_i^T\boldsymbol{\beta} - y_i)

Matrix form:

\nabla L = \frac{2}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y})

Update rule:

\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \frac{2\alpha}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta}^{(t)} - \mathbf{y})

Learning rate selection:

Too small: Slow convergence
Too large: Divergence (overshoots minimum)

Typical: \alpha = 0.01 to 0.1, or adaptive (decrease over time)
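
A tiny synthetic experiment makes the trade-off visible (a sketch with one feature and 50 gradient steps; the true slope is 3):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

for alpha in [0.001, 0.1, 2.5]:   # too small, reasonable, too large
    beta = 0.0
    for _ in range(50):
        grad = -(2 / len(y)) * X[:, 0] @ (y - X[:, 0] * beta)
        beta -= alpha * grad
    print(f"alpha={alpha}: beta={beta:.3g}")
# Output pattern: slow crawl toward 3, clean convergence to ~3, divergence.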

Ridge Regression Gradient

Loss with L2 penalty:

L_{ridge}(\boldsymbol{\beta}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\beta}\|^2 + \lambda \|\boldsymbol{\beta}\|^2

Gradient:

\nabla L_{ridge} = \frac{2}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} - \mathbf{y}) + 2\lambda\boldsymbol{\beta}

Update:

\boldsymbol{\beta}^{(t+1)} = \boldsymbol{\beta}^{(t)} - \alpha[\frac{2}{n}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta}^{(t)} - \mathbf{y}) + 2\lambda\boldsymbol{\beta}^{(t)}]

Closed form (if desired):

\boldsymbol{\beta}^*_{ridge} = (\mathbf{X}^T\mathbf{X} + n\lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}

Adding \lambda\mathbf{I} ensures invertibility even with correlated features.
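
As a sketch, the ridge closed form in NumPy (the identity is scaled by n\lambda to match the loss above; texts differ on this convention, and in practice the intercept column is usually left unpenalized):

import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge: solve (X^T X + n*lam*I) beta = X^T y."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)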

Cross-Validation

K-Fold Cross-Validation:

  1. Split data into K folds (typical: K=5 or 10)
  2. For each fold k:
    • Train on K-1 folds
    • Test on fold k
    • Record error E_k
  3. Average: CV_{error} = \frac{1}{K}\sum_{k=1}^K E_k

Use for:

  • Hyperparameter tuning (\lambda selection)
  • Model selection (compare algorithms)
  • Performance estimation (unbiased error estimate)

Spatial Cross-Validation:

Standard CV invalid for spatial data (autocorrelation).

Solution: Spatial blocks

Divide region into spatial blocks.

Each fold = one or more blocks.

Ensures test data spatially separated from training.
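
A minimal sketch of block assignment using scikit-learn's GroupKFold, assuming each sample carries projected x/y coordinates; the block size and helper name are illustrative:

import numpy as np
from sklearn.model_selection import GroupKFold

def spatial_blocks(coords, block_size=10_000):
    """Assign each sample a block ID from its projected x/y coordinates."""
    bx = (coords[:, 0] // block_size).astype(int)
    by = (coords[:, 1] // block_size).astype(int)
    return bx * 100_000 + by   # unique ID per square block

# Synthetic example: 1,000 points scattered over a 100 km x 100 km region
rng = np.random.default_rng(0)
coords = rng.uniform(0, 100_000, size=(1000, 2))
X = rng.normal(size=(1000, 3))
y = rng.normal(size=1000)

groups = spatial_blocks(coords)
cv = GroupKFold(n_splits=5)
for train_idx, test_idx in cv.split(X, y, groups=groups):
    print(len(train_idx), len(test_idx))   # entire blocks held out together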


4. Worked Example by Hand

Problem: Predict elevation from NDVI and slope.

Training data:

| Location | NDVI | Slope | Elevation (m) |
|----------|------|-------|---------------|
| 1        | 0.7  | 5°    | 450           |
| 2        | 0.4  | 15°   | 850           |
| 3        | 0.6  | 8°    | 600           |
| 4        | 0.3  | 20°   | 1100          |
| 5        | 0.5  | 12°   | 750           |

Goal: Fit linear model manually using gradient descent.

Step 1: Set up matrices

Feature matrix \mathbf{X} (with intercept):

\mathbf{X} = \begin{bmatrix} 1 & 0.7 & 5 \\ 1 & 0.4 & 15 \\ 1 & 0.6 & 8 \\ 1 & 0.3 & 20 \\ 1 & 0.5 & 12 \end{bmatrix}

Target vector \mathbf{y}:

\mathbf{y} = \begin{bmatrix} 450 \\ 850 \\ 600 \\ 1100 \\ 750 \end{bmatrix}

Step 2: Initialize coefficients

\boldsymbol{\beta}^{(0)} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} (start at origin)

Step 3: First iteration

Predictions:

\hat{\mathbf{y}}^{(0)} = \mathbf{X}\boldsymbol{\beta}^{(0)} = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 0 \\ 0 \end{bmatrix}

Errors:

\mathbf{y} - \hat{\mathbf{y}}^{(0)} = \begin{bmatrix} 450 \\ 850 \\ 600 \\ 1100 \\ 750 \end{bmatrix}

Gradient:

\nabla L^{(0)} = -\frac{2}{5}\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}}^{(0)})

\mathbf{X}^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 \\ 0.7 & 0.4 & 0.6 & 0.3 & 0.5 \\ 5 & 15 & 8 & 20 & 12 \end{bmatrix}

\mathbf{X}^T(\mathbf{y} - \hat{\mathbf{y}}^{(0)}) = \begin{bmatrix} 3750 \\ 1720 \\ 50800 \end{bmatrix}

\nabla L^{(0)} = -\frac{2}{5}\begin{bmatrix} 3750 \\ 1720 \\ 50800 \end{bmatrix} = \begin{bmatrix} -1500 \\ -688 \\ -20320 \end{bmatrix}

Update (learning rate \alpha = 0.001):

\boldsymbol{\beta}^{(1)} = \boldsymbol{\beta}^{(0)} - 0.001 \times \begin{bmatrix} -1500 \\ -688 \\ -20320 \end{bmatrix}

= \begin{bmatrix} 1.5 \\ 0.688 \\ 20.32 \end{bmatrix}

Step 4: Second iteration

Predictions:

\hat{y}_1^{(1)} = 1.5 + 0.688(0.7) + 20.32(5) = 1.5 + 0.482 + 101.6 = 103.58

\hat{y}_2^{(1)} = 1.5 + 0.688(0.4) + 20.32(15) = 1.5 + 0.275 + 304.8 = 306.58

(Continue for all 5 samples…)

New errors, gradient, update…

Step 5: Convergence

Gradient descent converges slowly here because the features sit on very different scales (in practice, standardize features first). Iterating until \|\nabla L\| falls below tolerance, the coefficients approach the least-squares solution:

Final coefficients:

\boldsymbol{\beta}^* \approx \begin{bmatrix} -277.3 \\ 636.4 \\ 59.1 \end{bmatrix}

Model:

Elevation = -277.3 + 636.4 \times NDVI + 59.1 \times Slope

Interpretation:

  • Intercept: -277.3 m (not meaningful on its own; it anchors the fitted plane)
  • NDVI: +636.4 coefficient, even though NDVI and elevation move in opposite directions in the raw data. NDVI and slope are strongly correlated in this tiny sample, so the coefficients share the signal: a first taste of multicollinearity (Section 7)
  • Slope: +59.1 per degree (steeper slopes → higher elevation)

Step 6: Prediction

New location: NDVI = 0.5, Slope = 10°

\hat{Elevation} = -277.3 + 636.4(0.5) + 59.1(10) = -277.3 + 318.2 + 590.9 \approx 632 \text{ m}

Step 7: Evaluate

Test on training data (for illustration):

| Actual | Predicted | Error |
|--------|-----------|-------|
| 450    | 463.6     | -13.6 |
| 850    | 863.6     | -13.6 |
| 600    | 577.3     | 22.7  |
| 1100   | 1095.5    | 4.5   |
| 750    | 750.0     | 0.0   |

RMSE: \sqrt{\frac{13.6^2+13.6^2+22.7^2+4.5^2+0^2}{5}} = \sqrt{\frac{905}{5}} \approx 13.5 \text{ m}

R²: (coefficient of determination)

R^2 = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{905}{245000} \approx 0.996

A near-perfect fit on training data is expected with 5 points and 3 parameters. (Would need a test set for true evaluation.)
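
The hand computation is easy to verify with NumPy's least-squares solver:

import numpy as np

# Design matrix and targets from the worked example
X = np.array([[1, 0.7, 5],
              [1, 0.4, 15],
              [1, 0.6, 8],
              [1, 0.3, 20],
              [1, 0.5, 12]], dtype=float)
y = np.array([450, 850, 600, 1100, 750], dtype=float)

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                               # approx [-277.3, 636.4, 59.1]
residuals = y - X @ beta
print(np.sqrt(np.mean(residuals**2)))     # RMSE approx 13.5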


5. Computational Implementation

Tier 1: Foundation Path (Pi/Laptop)

Implementation from scratch (Python/NumPy):

import numpy as np

class LinearRegression:
    def __init__(self, learning_rate=0.01, n_iterations=1000, 
                 regularization='none', lambda_=0.1):
        self.lr = learning_rate
        self.n_iter = n_iterations
        self.regularization = regularization
        self.lambda_ = lambda_
        self.weights = None
        self.bias = None
        self.loss_history = []
    
    def fit(self, X, y):
        """Train linear regression via gradient descent"""
        n_samples, n_features = X.shape
        
        # Initialize parameters
        self.weights = np.zeros(n_features)
        self.bias = 0
        
        # Gradient descent
        for i in range(self.n_iter):
            # Forward pass
            y_pred = self.predict(X)
            
            # Compute loss
            loss = np.mean((y - y_pred)**2)
            
            # Add regularization to loss
            if self.regularization == 'ridge':
                loss += self.lambda_ * np.sum(self.weights**2)
            elif self.regularization == 'lasso':
                loss += self.lambda_ * np.sum(np.abs(self.weights))
            
            self.loss_history.append(loss)
            
            # Compute gradients
            dw = -(2/n_samples) * X.T @ (y - y_pred)
            db = -(2/n_samples) * np.sum(y - y_pred)
            
            # Add regularization to gradients
            if self.regularization == 'ridge':
                dw += 2 * self.lambda_ * self.weights
            elif self.regularization == 'lasso':
                dw += self.lambda_ * np.sign(self.weights)
            
            # Update parameters
            self.weights -= self.lr * dw
            self.bias -= self.lr * db
            
            # Check convergence
            if i > 0 and abs(self.loss_history[-1] - self.loss_history[-2]) < 1e-6:
                print(f"Converged at iteration {i}")
                break
        
        return self
    
    def predict(self, X):
        """Make predictions"""
        return X @ self.weights + self.bias
    
    def score(self, X, y):
        """Calculate R² score"""
        y_pred = self.predict(X)
        ss_res = np.sum((y - y_pred)**2)
        ss_tot = np.sum((y - np.mean(y))**2)
        return 1 - (ss_res / ss_tot)


# Example usage
if __name__ == "__main__":
    # Generate synthetic data
    np.random.seed(42)
    X = np.random.randn(100, 2)  # NDVI, Slope
    true_weights = np.array([-300, 45])
    true_bias = 200
    y = X @ true_weights + true_bias + np.random.randn(100) * 20
    
    # Split train/test
    n_train = 80
    X_train, X_test = X[:n_train], X[n_train:]
    y_train, y_test = y[:n_train], y[n_train:]
    
    # Train model
    model = LinearRegression(learning_rate=0.01, n_iterations=1000,
                            regularization='ridge', lambda_=0.1)
    model.fit(X_train, y_train)
    
    # Evaluate
    train_score = model.score(X_train, y_train)
    test_score = model.score(X_test, y_test)
    
    print(f"Learned weights: {model.weights}")
    print(f"Learned bias: {model.bias:.2f}")
    print(f"Train R²: {train_score:.3f}")
    print(f"Test R²: {test_score:.3f}")
    
    # Predictions
    y_pred = model.predict(X_test)
    rmse = np.sqrt(np.mean((y_test - y_pred)**2))
    print(f"Test RMSE: {rmse:.2f}")

Runtime: Training <1 second for 100 samples, 2 features

Memory: Minimal (~1 KB for this dataset)


Tier 2: Professional Path (Go Further)

Scikit-learn with cross-validation:

from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import numpy as np

# Synthetic stand-in for a larger dataset (10,000 samples, 20 features)
rng = np.random.default_rng(42)
X_large = rng.normal(size=(10_000, 20))
y_large = X_large @ rng.normal(size=20) + rng.normal(scale=0.5, size=10_000)

X_train, X_test, y_train, y_test = train_test_split(
    X_large, y_large, test_size=0.2, random_state=42)

# Set up cross-validation
# (In reality, would use spatial blocks)
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Hyperparameter grid (scikit-learn calls the ridge penalty strength `alpha`)
param_grid = {
    'alpha': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
}

# Grid search for ridge regression
ridge = Ridge()
grid_search = GridSearchCV(ridge, param_grid, cv=cv,
                           scoring='neg_mean_squared_error',
                           n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best lambda: {grid_search.best_params_['alpha']}")
print(f"Best CV RMSE: {np.sqrt(-grid_search.best_score_):.2f}")

# Evaluate the refit best model on held-out data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)

print(f"Test RMSE: {rmse:.2f}")
print(f"Test R²: {r2:.3f}")

# Feature importance (coefficient magnitudes)
importance = np.abs(best_model.coef_)
for i, imp in enumerate(importance):
    print(f"Feature {i}: {imp:.3f}")

Runtime: 10-30 seconds for 10k samples with grid search

For larger datasets: gradient-boosted trees (XGBoost, LightGBM), optionally GPU-accelerated


6. Interpretation

Real Application: Digital Soil Mapping

GlobalSoilMap project:

Predict soil properties (carbon, pH, texture) globally at 90m resolution.

Training data:

  • ~50,000 soil profile observations worldwide
  • Soil databases (ISRIC, USDA)

Features:

  • Landsat spectral bands
  • Elevation, slope, curvature
  • Climate (temperature, precipitation)
  • Land cover
  • Parent material geology

Target: Soil organic carbon (SOC) at 0-30 cm depth

Model: Random forest regression (ensemble of decision trees)

Performance:

  • RMSE: 10-15 g C/kg soil
  • R²: 0.40-0.60 (moderate, high natural variability)

Insights from the fitted model (random forests have no coefficients; these come from variable importance and partial dependence):

Positive predictors (higher SOC):

  • Higher precipitation (+)
  • Lower temperature (+) [decomposition slower when cold]
  • Forest land cover (+)
  • Higher clay content (+) [protects carbon]

Negative predictors (lower SOC):

  • Higher elevation (-) [thin soils]
  • Steeper slopes (-) [erosion]
  • Cropland (-) [tillage oxidizes carbon]

Residual Spatial Autocorrelation

Problem:

Regression residuals (errors) spatially clustered.

Indicates missing spatial structure in model.

Moran’s I test:

I = \frac{n}{W}\frac{\sum_i\sum_j w_{ij}(e_i - \bar{e})(e_j - \bar{e})}{\sum_i (e_i - \bar{e})^2}

Where:

  • e_i = residual at location i
  • w_{ij} = spatial weights (1 if neighbors, 0 otherwise)
  • W = sum of all weights

I > 0: Positive autocorrelation (clustered errors)
I \approx 0: No autocorrelation
I < 0: Negative autocorrelation (dispersed errors)
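
A minimal sketch of the statistic, assuming a precomputed binary neighbor matrix W (dense NumPy for clarity; real workflows typically use sparse weights from a library such as libpysal):

import numpy as np

def morans_i(residuals, W):
    """Moran's I: (n / W_sum) * (e^T W e) / (e^T e) for centered residuals e."""
    e = residuals - residuals.mean()
    n = len(e)
    return (n / W.sum()) * (e @ W @ e) / (e @ e)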

Solutions:

  1. Add spatial features (coordinates, distance to…)
  2. Use spatial regression models (SAR, CAR)
  3. Ensemble predictions with kriging residuals

Prediction Uncertainty

Point predictions insufficient.

Need: Prediction intervals.

Bootstrap method:

  1. Resample training data with replacement (1000 times)
  2. Train model on each bootstrap sample
  3. Predict test point with all 1000 models
  4. Calculate 95% interval from predictions

Example:

Predicted elevation: 750m
95% interval: [720m, 780m]
Uncertainty: ±30m

Wider intervals when:

  • Extrapolating beyond training data
  • High noise in training data
  • Few training samples near test location
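
A sketch of the bootstrap procedure (this version captures model uncertainty only; a full prediction interval would also add a residual noise term; the helper name and Ridge settings are illustrative):

import numpy as np
from sklearn.linear_model import Ridge

def bootstrap_interval(X_train, y_train, x_new, n_boot=1000, seed=0):
    """95% bootstrap interval for one new location."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    preds = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample rows with replacement
        model = Ridge(alpha=0.1).fit(X_train[idx], y_train[idx])
        preds[b] = model.predict(x_new.reshape(1, -1))[0]
    return np.percentile(preds, [2.5, 97.5])

# Example with synthetic data
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(size=100)
lo, hi = bootstrap_interval(X, y, np.array([0.5, 0.5]), n_boot=200)
print(lo, hi)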


7. What Could Go Wrong?

Multicollinearity

Problem:

Features highly correlated (e.g., NDVI and NIR: r = 0.95).

Coefficient estimates unstable, variance inflated.

Symptoms:

  • Coefficients change dramatically with small data changes
  • Large standard errors
  • Counterintuitive coefficient signs

Solution:

  • Ridge regression (L2 penalty stabilizes)
  • Remove correlated features (keep one from each pair)
  • PCA to decorrelate features (Model 73)

Extrapolation

Problem:

Predict outside range of training data.

Linear model extrapolates to absurdity.

Example:

Training elevation: 200-1000m
Test location: 1500m (outside range)
Prediction: May be wildly inaccurate

Solution:

  • Flag predictions outside training range
  • Use domain knowledge constraints
  • Non-linear models (polynomial, splines)
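
Flagging out-of-range predictions takes only a few lines; a sketch using a per-feature range check against the training data (helper name illustrative):

import numpy as np

def extrapolation_mask(X_train, X_new):
    """True where any feature of X_new falls outside the training range."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return ((X_new < lo) | (X_new > hi)).any(axis=1)

X_train = np.array([[0.3, 5.0], [0.7, 20.0]])
X_new = np.array([[0.5, 10.0], [0.9, 12.0]])
print(extrapolation_mask(X_train, X_new))   # [False  True]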

Non-linear Relationships

Problem:

True relationship non-linear, linear model fits poorly.

Example:

Temperature vs elevation: T = T_0 - \Gamma \times elevation

Linear works.

Soil moisture vs precipitation: Saturates at high rainfall.

Linear fails (needs logistic or power transform).

Solution:

  • Feature engineering (polynomial terms, interactions)
  • Non-linear models (decision trees, neural networks)
  • Transform target (log, sqrt)
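
Feature engineering lets the linear machinery capture curvature; a sketch using scikit-learn's PolynomialFeatures on a synthetic saturating precipitation-moisture relationship:

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
precip = rng.uniform(0, 2000, size=(300, 1))           # mm/year
moisture = 40 * (1 - np.exp(-precip[:, 0] / 500))      # saturating response
moisture += rng.normal(scale=2, size=300)

# Quadratic terms let a linear model bend toward the saturating curve
model = make_pipeline(PolynomialFeatures(degree=2), Ridge(alpha=1.0))
model.fit(precip, moisture)
print(model.score(precip, moisture))   # R² on training data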

Heteroscedasticity

Problem:

Error variance not constant.

Example:

High elevation: ±50m error
Low elevation: ±10m error

Violates regression assumptions.

Detection:

Plot residuals vs predictions. Funnel shape = heteroscedasticity.

Solution:

  • Transform target (log, Box-Cox)
  • Weighted least squares
  • Robust regression (Huber loss)
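
Robust regression is available off the shelf; a sketch comparing ordinary least squares with scikit-learn's HuberRegressor on data containing a few gross outliers:

import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=200)
y[np.argsort(X[:, 0])[-10:]] += 50   # gross outliers at high X

print(LinearRegression().fit(X, y).coef_)   # slope pulled up by outliers
print(HuberRegressor().fit(X, y).coef_)     # stays near the true slope of 3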

8. Extension: Spatial Cross-Validation

Why Standard CV Fails

Spatial autocorrelation:

Nearby locations similar (Tobler’s Law).

Standard random CV: Test points near training points.

Overestimates accuracy (information leakage).

Spatial Block CV

Method:

  1. Divide region into spatial blocks (e.g., 10×10 grid)
  2. Each fold = contiguous block(s)
  3. Train on distant blocks, test on held-out block

Effect:

Test data truly independent (spatially).

Realistic accuracy estimate.

Typical result:

Standard CV: R² = 0.85
Spatial CV: R² = 0.65

Difference reveals spatial leakage.


9. Math Refresher: Matrix Calculus

Gradient of Quadratic Form

Function:

f(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x}

Gradient:

\nabla f = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}

If \mathbf{A} symmetric: \nabla f = 2\mathbf{A}\mathbf{x}

Example:

L(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{y} - \mathbf{X}\boldsymbol{\beta})

Expand:

= \mathbf{y}^T\mathbf{y} - 2\mathbf{y}^T\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\beta}^T\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}

Gradient:

\nabla L = -2\mathbf{X}^T\mathbf{y} + 2\mathbf{X}^T\mathbf{X}\boldsymbol{\beta}
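
Whenever a gradient is derived by hand, a finite-difference check catches mistakes; a short sketch verifying this gradient numerically:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
beta = rng.normal(size=3)

def loss(b):
    r = y - X @ b
    return r @ r                     # (y - Xb)^T (y - Xb)

analytic = -2 * X.T @ y + 2 * X.T @ X @ beta
eps = 1e-6
numeric = np.array([
    (loss(beta + eps * np.eye(3)[j]) - loss(beta - eps * np.eye(3)[j])) / (2 * eps)
    for j in range(3)
])
print(np.max(np.abs(analytic - numeric)))   # near zero (floating-point tolerance)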


Summary

  • Linear regression predicts continuous targets as a weighted sum of features, fit by minimizing a mean squared error loss
  • Gradient descent iteratively updates coefficients via β^(t+1) = β^(t) - α∇L, converging toward the optimal solution
  • Ridge regularization adds a λ||β||² penalty that shrinks coefficients, preventing overfitting with correlated features
  • Cross-validation estimates generalization error by splitting data into training and validation folds and averaging performance metrics
  • Spatial cross-validation uses geographic blocks so test data is spatially separated from training data, avoiding autocorrelation bias
  • Real-world soil mapping achieves R² of 0.40-0.60 predicting carbon content from climate, terrain, and spectral features
  • Multicollinearity inflates coefficient variance; ridge regression or removing correlated predictors stabilizes estimates
  • Residual spatial autocorrelation, detected via Moran's I, indicates missing spatial structure and calls for additional features or spatial models
  • Prediction uncertainty is quantified via bootstrap resampling, which generates intervals around point estimates
  • Professional implementations use scikit-learn grid search to tune regularization strength