Whitepaper v0.1 · April 2026 · 14 min read

Forecasting US metro HPI: model specification, features, and accuracy

We describe the production model behind re-invest.ai — a global gradient-boosted tree ensemble trained on 50 years of FHFA HPI history across 410 US MSAs — along with the feature set, training protocol, evaluation methodology, and the full accuracy metrics (MAE, RMSE, WMAPE, MdAPE, and skill score) computed out-of-sample.

1. Problem statement

The target is the 4-quarter-ahead log return of FHFA All-Transactions HPI, seasonally adjusted, at the metro (Core-Based Statistical Area) level. This is equivalent to forecasting approximate year-over-year price change one year into the future: a horizon long enough to be useful to investment committees but short enough that the signal-to-noise ratio remains tractable.

Working in log returns rather than levels stabilizes the variance across markets of different sizes and makes multiplicative price shocks additive in the feature space.
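
As a concrete sketch of the target construction (the index values and dates here are hypothetical, not FHFA data):

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly HPI levels for one metro.
hpi = pd.Series(
    [100.0, 102.0, 103.5, 105.0, 108.0, 110.0, 111.5, 114.0, 118.0],
    index=pd.period_range("2020Q1", periods=9, freq="Q"),
)

# 4-quarter-ahead log return: the target at quarter t looks 4 quarters forward.
log_hpi = np.log(hpi)
target = log_hpi.shift(-4) - log_hpi  # y_t = log(HPI_{t+4}) - log(HPI_t)

# Log returns make multiplicative shocks additive: two consecutive +8% and +2%
# moves sum in log space rather than compounding.
```

The last four quarters of the sample have no target (the future value does not exist yet), which is exactly the rows dropped before training.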

2. Data

Training data is pulled from five federal and open sources, all ingested into a Postgres warehouse keyed by (series_id, geo_id, period_start). No paid data is used. All sources and cadences:

| Source | Series | Cadence | Role |
| --- | --- | --- | --- |
| FHFA | HPI_AT (metro SA) | Quarterly | Target + autocorrelation features |
| FRED | MORTGAGE30US, DGS10 | Weekly / Daily | Affordability, rate path |
| FRED | HOUST, PERMIT, CPIHOSSL | Monthly | Supply + shelter inflation |
| BLS | CES, LNS14000000 | Monthly | Demand (jobs, unemployment) |
| Census | ACS 5-year (per MSA) | Annual | Population, income, tenure |
| Zillow Research | ZHVI, ZORI | Monthly | Price + rent cross-checks |

The working sample covers 410 MSAs over approximately 200 quarters (1975 Q1 – 2025 Q4), yielding roughly 70,000 HPI observations after dropping rows without sufficient history to compute lagged features.

3. Features

Features fall into four groups:

  • Autocorrelation: 1Q, 4Q, and 12Q log returns of HPI — captures momentum and mean reversion.
  • Macro as-of: FRED series joined to each MSA×period at the last value before the target window opens (no look-ahead).
  • MSA identity: CBSA code as a categorical feature. Lets the booster learn persistent per-metro offsets without per-market fitting.
  • Demographic (future): ACS features are ingested but not yet in the v1 feature set — they'll land in v0.2.

All macro features are joined as-of the last observation before the forecast window, so training data never sees information from its own future.
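
The as-of join can be sketched with pandas' merge_asof; the column names and values below are hypothetical, not the production schema:

```python
import pandas as pd

# Hypothetical weekly mortgage-rate observations (e.g. MORTGAGE30US).
macro = pd.DataFrame({
    "obs_date": pd.to_datetime(["2024-12-26", "2025-01-02", "2025-01-09"]),
    "mortgage30": [6.85, 6.91, 6.93],
})

# Quarterly MSA panel rows; the forecast window opens at period_start.
panel = pd.DataFrame({
    "geo_id": ["12060", "16980"],
    "period_start": pd.to_datetime(["2025-01-01", "2025-01-01"]),
})

# As-of join: attach the last macro value strictly before the window opens,
# so no training row can see data from its own forecast period.
joined = pd.merge_asof(
    panel.sort_values("period_start"),
    macro.sort_values("obs_date"),
    left_on="period_start",
    right_on="obs_date",
    allow_exact_matches=False,  # strictly before, never at, period_start
)
```

Setting allow_exact_matches=False is what enforces the "last value before the window opens" rule rather than "on or before".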

4. Model

Single global XGBoost regressor (histogram tree method, 600 estimators, depth 5, learning rate 0.03, subsample 0.8, column subsample 0.8). One model for all metros, with CBSA encoded as a learned categorical split. This was chosen over per-metro fits because:

  • Many small metros have fewer than 30 years of data — per-metro fits overfit.
  • Global priors are shareable across markets that move together (e.g. sunbelt, rust belt).
  • Training is ~40 seconds on a laptop. Monthly retraining is cheap.
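
The stated configuration maps onto xgboost's scikit-learn API roughly as follows; the exact parameter names (e.g. colsample_bytree for the column subsample, enable_categorical for the CBSA feature) are our assumptions about how the settings are expressed:

```python
# Hyperparameters as stated above, keyed by xgboost's sklearn-style names.
xgb_params = {
    "tree_method": "hist",       # histogram tree method
    "n_estimators": 600,
    "max_depth": 5,
    "learning_rate": 0.03,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "enable_categorical": True,  # CBSA code enters as a categorical feature
}
# model = xgboost.XGBRegressor(**xgb_params)
```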

An AR(1)-style persistence baseline (next return = last return) is evaluated on the same split for comparison. Any promoted model must beat this baseline on out-of-sample MAE.

5. Evaluation

Evaluation is strictly out-of-sample on a temporal holdout: the last 8 quarters of the sample are held out from training. Every metric is computed on this holdout, across all 410 metros × 8 quarters = 3,280 predictions.
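
A minimal sketch of the temporal holdout split, assuming a panel keyed by (geo_id, quarter); the identifiers and date range here are hypothetical:

```python
import pandas as pd

# Hypothetical panel: one row per (geo_id, quarter).
quarters = pd.period_range("2018Q1", "2025Q4", freq="Q")
panel = pd.DataFrame(
    [(g, q) for g in ["12060", "16980"] for q in quarters],
    columns=["geo_id", "quarter"],
)

# Temporal holdout: the last 8 quarters never enter training.
holdout_quarters = 8
cutoff = quarters[-holdout_quarters]
train = panel[panel["quarter"] < cutoff]
test = panel[panel["quarter"] >= cutoff]
```

Splitting on the time axis (rather than randomly) is what makes the metrics genuinely out-of-sample: every test quarter is strictly after every training quarter.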

Metrics reported (all on log-return space):

| Metric | Definition | Why it matters |
| --- | --- | --- |
| MAE | mean(\|y − ŷ\|) | Robust to outliers; interpretable in return units |
| RMSE | √mean((y − ŷ)²) | Penalizes large errors harder |
| WMAPE | Σ\|y − ŷ\| / Σ\|y\| | Unit-free %; works when some y ≈ 0 |
| MdAPE | median(\|y − ŷ\| / \|y\|) | Outlier-robust relative error |
| Skill score | 1 − MAE / MAE_baseline | Gain over the AR(1) baseline |
| Bias | mean(ŷ − y) | Systematic over/under-forecast |

We avoid MAPE directly because several MSAs have periods where YoY log return is near zero — MAPE explodes when |y| → 0. WMAPE and MdAPE sidestep that failure mode.
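
The reported metrics are straightforward to compute from holdout arrays; a sketch in NumPy following the definitions above (function name and return shape are ours, not the production code's):

```python
import numpy as np

def evaluate(y, yhat, yhat_baseline):
    """Holdout metrics on log-return space, per the table above."""
    err = y - yhat
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    wmape = np.sum(np.abs(err)) / np.sum(np.abs(y))
    mdape = np.median(np.abs(err) / np.abs(y))  # assumes no exact zeros in y
    mae_baseline = np.mean(np.abs(y - yhat_baseline))
    skill = 1.0 - mae / mae_baseline             # > 0 means we beat the baseline
    bias = np.mean(yhat - y)
    return {"MAE": mae, "RMSE": rmse, "WMAPE": wmape,
            "MdAPE": mdape, "skill": skill, "bias": bias}
```

Note that WMAPE stays finite even when individual y values sit near zero, because the normalizer is the sum of |y| over the whole holdout rather than a per-observation denominator.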

6. Current state (v0 baseline)

The model in production today is a naive YoY extrapolation — last year's realized growth, applied forward. It exists so the dashboard shows real FHFA data while we validate the full XGBoost pipeline.

The XGBoost model is implemented and passes unit tests; first production publish is scheduled for the next monthly refresh (28th of next month). Expected accuracy targets for v1:

  • MAE ≤ 2.0% on national 12-month HPI
  • WMAPE ≤ 18% (relative error across all MSAs)
  • Skill score ≥ 0.12 vs AR(1) baseline

These numbers will land on the methodology page the morning after the first production run and update monthly thereafter.

7. What we don't do

  • We do not produce level forecasts (dollar prices), only returns. Levels are reconstructed from the last realized observation times the forecast growth factor.
  • We do not use MLS-listing data, proprietary feeds (Moody's, CoreLogic, ATTOM, NAR), or any source we can't link to.
  • We do not re-weight training data to favor recent periods. The model sees the full 50-year history equally, and temporal holdout handles regime shifts naturally.
  • We do not publish forecasts we can't defend to an investment committee. If a metro's forecast has unusually wide intervals or poor backtest performance, it's flagged with a lower risk score (C–F).
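
The level reconstruction described in the first bullet amounts to one line; the index value and forecast below are hypothetical:

```python
import numpy as np

# Last realized HPI level for a metro and the model's 4Q-ahead forecast.
last_level = 312.4           # hypothetical FHFA index value
forecast_log_return = 0.035  # model output, in log-return space

# Levels are never forecast directly: they are reconstructed as the last
# realized level times the growth factor exp(log return).
forecast_level = last_level * np.exp(forecast_log_return)
```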

8. Reproducibility

Every production forecast is reproducible from the commit SHA of /models, plus the raw data snapshot in s3://re-invest-raw/ at the forecast run's timestamp. A research subscriber who wants to re-run a historical forecast can do so with:

git clone https://github.com/evanmcclure1/re-invest
cd re-invest/models && pip install -e .
python scripts/run_training.py --holdout-quarters 8