Whitepaper v0.1 · April 2026 · 14 min read

Forecasting US metro HPI: model specification, features, and accuracy

We describe the production model behind re-invest.ai — a global gradient-boosted tree ensemble trained on 50 years of FHFA HPI history across 410 US MSAs — along with the feature set, training protocol, evaluation methodology, and the full accuracy metrics (MAE, RMSE, WMAPE, MdAPE, and skill score) computed out-of-sample.

1. Problem statement

The target is the 4-quarter-ahead log return of FHFA All-Transactions HPI, seasonally adjusted, at the metro (Core-Based Statistical Area) level. This is equivalent to forecasting approximate year-over-year price change one year into the future: a horizon long enough to be useful to investment committees but short enough that the signal-to-noise ratio remains tractable.

Working in log returns rather than levels stabilizes the variance across markets of different sizes and makes multiplicative price shocks additive in the feature space.
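
As a concrete sketch of the target construction (the index values and dates here are hypothetical, not FHFA data):

```python
import numpy as np
import pandas as pd

# Hypothetical quarterly HPI levels for one metro.
hpi = pd.Series(
    [100.0, 102.0, 103.5, 105.0, 108.0, 110.0, 111.5, 114.0, 118.0],
    index=pd.period_range("2020Q1", periods=9, freq="Q"),
)

# 4-quarter-ahead log return: the target at quarter t looks 4 quarters forward.
log_hpi = np.log(hpi)
target = log_hpi.shift(-4) - log_hpi  # y_t = log(HPI_{t+4}) - log(HPI_t)

# Log returns make multiplicative shocks additive: two consecutive +8% and +2%
# moves sum in log space rather than compounding.
```

The last four quarters of the sample have no target (the future value does not exist yet), which is exactly the rows dropped before training.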

2. Data

Training data is pulled from five federal and open sources, all ingested into a Postgres warehouse keyed by (series_id, geo_id, period_start). No paid data is used. All sources and cadences:

| Source | Series | Cadence | Role |
| --- | --- | --- | --- |
| FHFA | HPI_AT (metro SA) | Quarterly | Target + autocorrelation features |
| FRED | MORTGAGE30US, DGS10 | Weekly / Daily | Affordability, rate path |
| FRED | HOUST, PERMIT, CPIHOSSL | Monthly | Supply + shelter inflation |
| BLS | CES, LNS14000000 | Monthly | Demand (jobs, unemployment) |
| Census | ACS 5-year (per MSA) | Annual | Population, income, tenure |
| Zillow Research | ZHVI, ZORI | Monthly | Price + rent cross-checks |

The working sample covers 410 MSAs over approximately 200 quarters (1975 Q1 – 2025 Q4), yielding roughly 70,000 HPI observations after dropping rows without sufficient history to compute lagged features.

3. Features

Features fall into four groups:

  • Autocorrelation: 1Q, 4Q, and 12Q log returns of HPI — captures momentum and mean reversion.
  • Macro as-of: FRED series joined to each MSA×period at the last value before the target window opens (no look-ahead).
  • MSA identity: CBSA code as a categorical feature. Lets the booster learn persistent per-metro offsets without per-market fitting.
  • Demographic (future): ACS features are ingested but not yet in the v1 feature set — they'll land in v0.2.

All macro features are joined as-of the last observation before the forecast window, so training data never sees information from its own future.
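
The as-of join can be sketched with pandas' merge_asof; the column names and values below are hypothetical, not the production schema:

```python
import pandas as pd

# Hypothetical weekly mortgage-rate observations (e.g. MORTGAGE30US).
macro = pd.DataFrame({
    "obs_date": pd.to_datetime(["2024-12-26", "2025-01-02", "2025-01-09"]),
    "mortgage30": [6.85, 6.91, 6.93],
})

# Quarterly MSA panel rows; the forecast window opens at period_start.
panel = pd.DataFrame({
    "geo_id": ["12060", "16980"],
    "period_start": pd.to_datetime(["2025-01-01", "2025-01-01"]),
})

# As-of join: attach the last macro value strictly before the window opens,
# so no training row can see data from its own forecast period.
joined = pd.merge_asof(
    panel.sort_values("period_start"),
    macro.sort_values("obs_date"),
    left_on="period_start",
    right_on="obs_date",
    allow_exact_matches=False,  # strictly before, never at, period_start
)
```

Setting allow_exact_matches=False is what enforces the "last value before the window opens" rule rather than "on or before".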

4. Model

Single global XGBoost regressor (histogram tree method, 600 estimators, depth 5, learning rate 0.03, subsample 0.8, column subsample 0.8). One model for all metros, with CBSA encoded as a learned categorical split. This was chosen over per-metro fits because:

  • Many small metros have fewer than 30 years of data — per-metro fits overfit.
  • Global priors are shareable across markets that move together (e.g. sunbelt, rust belt).
  • Training is ~40 seconds on a laptop. Monthly retraining is cheap.
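
The stated configuration maps onto xgboost's scikit-learn API roughly as follows; the exact parameter names (e.g. colsample_bytree for the column subsample, enable_categorical for the CBSA feature) are our assumptions about how the settings are expressed:

```python
# Hyperparameters as stated above, keyed by xgboost's sklearn-style names.
xgb_params = {
    "tree_method": "hist",       # histogram tree method
    "n_estimators": 600,
    "max_depth": 5,
    "learning_rate": 0.03,
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    "enable_categorical": True,  # CBSA code enters as a categorical feature
}
# model = xgboost.XGBRegressor(**xgb_params)
```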

An AR(1)-style persistence baseline (next return = last return) is evaluated on the same split for comparison. Any promoted model must beat this baseline on out-of-sample MAE.

5. Evaluation

Evaluation is strictly out-of-sample on a temporal holdout: the last 8 quarters of the sample are held out from training. Every metric is computed on this holdout, across all 410 metros × 8 quarters = 3,280 predictions.
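
A minimal sketch of the temporal holdout split, assuming a panel keyed by (geo_id, quarter); the identifiers and date range here are hypothetical:

```python
import pandas as pd

# Hypothetical panel: one row per (geo_id, quarter).
quarters = pd.period_range("2018Q1", "2025Q4", freq="Q")
panel = pd.DataFrame(
    [(g, q) for g in ["12060", "16980"] for q in quarters],
    columns=["geo_id", "quarter"],
)

# Temporal holdout: the last 8 quarters never enter training.
holdout_quarters = 8
cutoff = quarters[-holdout_quarters]
train = panel[panel["quarter"] < cutoff]
test = panel[panel["quarter"] >= cutoff]
```

Splitting on the time axis (rather than randomly) is what makes the metrics genuinely out-of-sample: every test quarter is strictly after every training quarter.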

Metrics reported (all on log-return space):

| Metric | Definition | Why it matters |
| --- | --- | --- |
| MAE | mean(\|y − ŷ\|) | Robust to outliers; interpretable in return units |
| RMSE | √mean((y − ŷ)²) | Penalizes large errors harder |
| WMAPE | Σ\|y − ŷ\| / Σ\|y\| | Unit-free %; works when some y ≈ 0 |
| MdAPE | median(\|y − ŷ\| / \|y\|) | Outlier-robust relative error |
| Skill score | 1 − MAE / MAE_baseline | Gain over the AR(1) baseline |
| Bias | mean(ŷ − y) | Systematic over/under-forecast |

We avoid MAPE directly because several MSAs have periods where YoY log return is near zero — MAPE explodes when |y| → 0. WMAPE and MdAPE sidestep that failure mode.
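
The reported metrics are straightforward to compute from holdout arrays; a sketch in NumPy following the definitions above (function name and return shape are ours, not the production code's):

```python
import numpy as np

def evaluate(y, yhat, yhat_baseline):
    """Holdout metrics on log-return space, per the table above."""
    err = y - yhat
    mae = np.mean(np.abs(err))
    rmse = np.sqrt(np.mean(err ** 2))
    wmape = np.sum(np.abs(err)) / np.sum(np.abs(y))
    mdape = np.median(np.abs(err) / np.abs(y))  # assumes no exact zeros in y
    mae_baseline = np.mean(np.abs(y - yhat_baseline))
    skill = 1.0 - mae / mae_baseline             # > 0 means we beat the baseline
    bias = np.mean(yhat - y)
    return {"MAE": mae, "RMSE": rmse, "WMAPE": wmape,
            "MdAPE": mdape, "skill": skill, "bias": bias}
```

Note that WMAPE stays finite even when individual y values sit near zero, because the normalizer is the sum of |y| over the whole holdout rather than a per-observation denominator.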

6. Current state (v0 baseline)

The model in production today is a naive YoY extrapolation — last year's realized growth, applied forward. It exists so the dashboard shows real FHFA data while we validate the full XGBoost pipeline.

The XGBoost model is implemented and passes unit tests; first production publish is scheduled for the next monthly refresh (28th of next month). Expected accuracy targets for v1:

  • MAE ≤ 2.0% on national 12-month HPI
  • WMAPE ≤ 18% (relative error across all MSAs)
  • Skill score ≥ 0.12 vs AR(1) baseline

These numbers will land on the methodology page the morning after the first production run and update monthly thereafter.

7. What we don't do

  • We do not produce level forecasts (dollar prices), only returns. Levels are reconstructed from the last realized observation times the forecast growth factor.
  • We do not use MLS-listing data, proprietary feeds (Moody's, CoreLogic, ATTOM, NAR), or any source we can't link to.
  • We do not re-weight training data to favor recent periods. The model sees the full 50-year history equally, and temporal holdout handles regime shifts naturally.
  • We do not publish forecasts we can't defend to an investment committee. If a metro's forecast has unusually wide intervals or poor backtest performance, it's flagged with a lower risk score (C–F).
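
The level reconstruction described in the first bullet amounts to one line; the index value and forecast below are hypothetical:

```python
import numpy as np

# Last realized HPI level for a metro and the model's 4Q-ahead forecast.
last_level = 312.4           # hypothetical FHFA index value
forecast_log_return = 0.035  # model output, in log-return space

# Levels are never forecast directly: they are reconstructed as the last
# realized level times the growth factor exp(log return).
forecast_level = last_level * np.exp(forecast_log_return)
```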

8. Reproducibility

Every production forecast is reproducible from the commit SHA of /models, plus the raw data snapshot in s3://re-invest-raw/ at the forecast run's timestamp. A research subscriber who wants to re-run a historical forecast can do so with:

git clone https://github.com/evanmcclure1/re-invest
cd re-invest/models && pip install -e .
python scripts/run_training.py --holdout-quarters 8