Budapest rental price prediction

A reproducible offline modelling artifact for ingatlan.com's data team — temporal validation, geospatial enrichment, and model diagnostics.

MSc Business Analytics · Central European University · June 2026 · source repo ↗

[§U]Use case

Suggest a rent in seconds, list faster

A landlord opens the upload flow and enters the characteristics of the flat they want to let — area, rooms, floor, condition, balcony, and so on. Before they finish, the page shows a suggested monthly rent. Because the number is credible and arrives instantly, they price competitively from day one, list sooner, and start earning rent earlier.

The chain is simple: faster pricing → quicker listing → happier landlord → higher engagement → more supply on the platform. For ingatlan.com, which leads the Hungarian for-sale market but faces stiffer competition in rentals, this is a direct lever to grow the rental user base.

Fig. Mock of a stepped ingatlan.com listing upload flow with an inline price estimator (Ingatlan árbecslő, BÉTA), showing area, balcony, rooms and half-rooms inputs. Reference UI direction: the model output slots into an existing wizard step rather than a separate tool.

[§1]The decision this supports

Can we suggest a credible rent at upload?

The capstone stays offline — modelling and recommendations, not deployment or A/B testing. The question is whether a model can hit the accuracy target given the data, scope, and product constraints.

Scope: Budapest rental flats.
Target: monthly rent per square metre (HUF/sqm).
Metric: MdAPE (median absolute percentage error), the typical absolute % miss.
Success criterion: ≤10% MdAPE, set by the client. Met at ~9.8% in time-ordered CV.
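
The metric can be written down in a few lines. This is a minimal sketch of MdAPE as defined above, using only the standard library; the function name and the sample numbers are illustrative and not taken from the project code or data.

```python
from statistics import median


def mdape(actual, predicted):
    """Median absolute percentage error, in percent."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and the same length")
    # Targets are assumed strictly positive (HUF/sqm after scope filters).
    errors = [abs(p - a) * 100.0 / a for a, p in zip(actual, predicted)]
    return median(errors)


# Illustrative HUF/sqm values only, not project data.
actual = [6000, 5000, 8000, 4000, 7000]
predicted = [6300, 4500, 8080, 4600, 7000]
print(mdape(actual, predicted))  # 5.0, which would pass a 10% criterion
```

Because the median ignores the size of the worst misses, MdAPE describes the typical listing rather than the tail, which is why the leaderboard reports MAPE and MAE alongside it.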

[§2]Data & temporal validation

Why the newest 20% is reserved

Budapest rental prices move quickly, so a random split would leak market-regime information from the future. Candidate selection uses rolling time-series CV on older listings; a frozen holdout on the newest 20% is kept for the eventual production go/no-go.
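
The split logic described above can be sketched as follows. This is an illustration under stated assumptions, not the project's pipeline: the column name `listed_on`, the number of folds, and the expanding-window fold layout are all hypothetical choices.

```python
from datetime import date, timedelta


def chronological_split(rows, holdout_share=0.20):
    """Sort by listing date and reserve the newest share as a frozen holdout."""
    ordered = sorted(rows, key=lambda r: r["listed_on"])
    cut = len(ordered) - round(len(ordered) * holdout_share)
    return ordered[:cut], ordered[cut:]


def rolling_folds(n_rows, n_folds=4):
    """Yield (train_idx, valid_idx) ranges over date-sorted rows.

    The training window expands over time and every validation block
    lies strictly after the rows used to fit it.
    """
    block = n_rows // (n_folds + 1)
    for k in range(1, n_folds + 1):
        train_end = block * k
        valid_end = n_rows if k == n_folds else block * (k + 1)
        yield range(0, train_end), range(train_end, valid_end)


# Synthetic demo: 100 listings, one per day (illustrative only).
rows = [{"listed_on": date(2021, 1, 1) + timedelta(days=i)} for i in range(100)]
train, holdout = chronological_split(rows)
folds = list(rolling_folds(len(train), n_folds=4))
print(len(train), len(holdout), len(folds))  # 80 20 4
```

Applied to 50,898 rows, the same arithmetic gives 40,718 older rows and 10,180 holdout rows. The holdout is never touched during candidate selection; only the folds are.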

50,898

modelling rows

Budapest rental flats, through 2026-06-02

40,718

train rows

2021-01-01 → 2025-11-29

10,180

holdout rows

2025-11-29 → 2026-06-02

+17.6%

holdout-period rent shift

median HUF/sqm higher in newest 20%

[§3]Pipeline · medallion architecture

Bronze → silver → gold

The medallion pattern layers a warehouse into three progressively curated tiers. Bronze is raw ingested data as it arrived, with lineage but no cleaning. Silver is typed, deduplicated, and quality-checked — the same shape as source but trustworthy. Gold is modelling-ready feature marts: business rules applied, scope filters baked in, joins with enrichment tables done.

This capstone follows the same shape: raw listing dumps land in a bronze layer, the data-prep pipeline produces a cleaned silver extract, and the modelling table plus geospatial features form the gold layer that CV and the holdout consume.

  bronze                silver                 gold
  ──────                ──────                 ────
  raw dumps    ─────►   typed / cleaned  ────► modelling extract
  ingatlan.com          scope filters          + WorldPop lag features
  scrape                dedup + QA             + NDVI H3 features
                                               + 90-day comparable prior
                                                     │
                                                     ▼
                                          time-series CV  +  holdout
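
To make the layering concrete, here is a minimal sketch of the three tiers as plain functions. All field names (`listing_id`, `city`, `area_sqm`, `rent_huf`) are hypothetical, and the placement of the scope filter in the silver step follows the diagram above; the real pipeline may organize these rules differently.

```python
def to_silver(bronze_rows):
    """Type, scope-filter, deduplicate and quality-check raw rows."""
    seen, silver = set(), []
    for raw in bronze_rows:
        try:
            row = {
                "listing_id": raw["listing_id"],
                "city": raw["city"].strip(),
                "area_sqm": float(raw["area_sqm"]),
                "rent_huf": float(raw["rent_huf"]),
            }
        except (KeyError, ValueError, AttributeError):
            continue  # QA: drop rows that cannot be typed
        if row["city"] != "Budapest":
            continue  # scope filter
        if row["area_sqm"] <= 0 or row["rent_huf"] <= 0:
            continue  # QA: non-positive area or rent
        if row["listing_id"] in seen:
            continue  # dedup on listing id
        seen.add(row["listing_id"])
        silver.append(row)
    return silver


def to_gold(silver_rows):
    """Add the modelling target; enrichment joins would also happen here."""
    return [{**r, "rent_per_sqm": r["rent_huf"] / r["area_sqm"]} for r in silver_rows]


# Bronze stays exactly as scraped: strings, duplicates and bad values included.
bronze = [
    {"listing_id": "a1", "city": "Budapest", "area_sqm": "50", "rent_huf": "300000"},
    {"listing_id": "a1", "city": "Budapest", "area_sqm": "50", "rent_huf": "300000"},
    {"listing_id": "b2", "city": "Debrecen", "area_sqm": "40", "rent_huf": "180000"},
    {"listing_id": "c3", "city": "Budapest", "area_sqm": "n/a", "rent_huf": "250000"},
]
gold = to_gold(to_silver(bronze))
print(gold)
```

Keeping bronze untouched means any silver or gold rule can be changed and replayed without re-scraping.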

[§4]Model leaderboard

Boosting clears the 10% bar; linear stays close

Same target, same rolling time-series CV, same metric. Boosting leads at ~9.8–9.9% MdAPE, but well-specified linear models are only ~0.2 pp behind — and both are ~7 pp better than a district-median baseline. Stability and product fit matter more than the marginal CV gain.

model	CV MdAPE	CV MAPE	MAE HUF/sqm
XGBoost (log price/sqm)	9.84%	12.19%	629
HistGradientBoosting	9.86%	12.24%	632
LightGBM	9.94%	12.33%	635
Ridge	10.06%	12.61%	646
OLS	10.07%	12.58%	645
Random Forest	10.78%	13.29%	693
district median (baseline)	16.80%	19.32%	1,010
dummy median (baseline)	17.95%	20.26%	1,067
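
For reference, the two baselines at the bottom of the table are simple enough to sketch in full. This is an assumed implementation, not the project's code: medians are fitted on the training fold only, and districts unseen in training fall back to the overall ("dummy") median.

```python
from collections import defaultdict
from statistics import median


def fit_district_median(train_rows):
    """Learn per-district medians plus an overall fallback from training rows."""
    by_district = defaultdict(list)
    for r in train_rows:
        by_district[r["district"]].append(r["rent_per_sqm"])
    overall = median(r["rent_per_sqm"] for r in train_rows)
    return {d: median(v) for d, v in by_district.items()}, overall


def predict_district_median(model, rows):
    """Predict the district median; unseen districts get the overall median."""
    medians, overall = model
    return [medians.get(r["district"], overall) for r in rows]


# Illustrative values only.
train_rows = [
    {"district": "V", "rent_per_sqm": 8000},
    {"district": "V", "rent_per_sqm": 9000},
    {"district": "V", "rent_per_sqm": 10000},
    {"district": "XI", "rent_per_sqm": 6000},
    {"district": "XI", "rent_per_sqm": 7000},
]
model = fit_district_median(train_rows)
valid_rows = [{"district": "V"}, {"district": "XI"}, {"district": "XXIII"}]
print(predict_district_median(model, valid_rows))  # [9000, 6500.0, 8000]
```

Any candidate model has to beat this under the same folds to justify its extra complexity.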

[§5]What the model attends to

Size, comparables, condition, location

Filling core structured fields matters more than adding NLP at this stage. The neighbourhood comparable prior and location features carry strong signal, confirming the client team's interim feedback. Importance is not causality — use it for field prioritization and sanity checks.

[size & layout] [local comparables] [condition] [balcony ratio] [air conditioning] [centrality]

Fig. SHAP beeswarm for the best model. Each dot is one held-out observation; horizontal position is its contribution to predicted log HUF/sqm, colour is the feature value.

Interaction terms kept in the model

A small set of numeric interactions, chosen to represent plausible housing-market effects without exploding the one-hot feature space.

term	formula	interpretation
Distance × log area	distance to centre × log(area)	Centrality gradient differs by flat size.
Balcony × distance	balcony × distance to centre	Balcony premium may vary with centrality.
Elevator × floor	elevator × floor number	Higher floors are valued differently when an elevator exists.
Room equivalent × log area	eff. rooms × log(area)	Layout usefulness interacts with size.
Balcony-to-area ratio	balcony size / area (capped)	Bounded ratio, not simple multiplication.
Room density	eff. rooms / area	Captures compactness of the layout.
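
The terms in the table could be computed roughly as below. Several details are assumptions, not confirmed by the source: effective rooms are taken as rooms plus half a room per half-room, the balcony term uses a 0/1 indicator, distance is in km, and the cap value is hypothetical because the source only says "capped".

```python
import math

BALCONY_RATIO_CAP = 0.5  # hypothetical cap value


def interaction_features(area, rooms, half_rooms, floor, has_elevator,
                         balcony_sqm, dist_centre_km):
    """Build the six numeric interaction terms for one listing."""
    eff_rooms = rooms + 0.5 * half_rooms  # assumed definition of "eff. rooms"
    log_area = math.log(area)
    has_balcony = 1.0 if balcony_sqm > 0 else 0.0
    return {
        "dist_x_log_area": dist_centre_km * log_area,
        "balcony_x_dist": has_balcony * dist_centre_km,
        "elevator_x_floor": (1.0 if has_elevator else 0.0) * floor,
        "eff_rooms_x_log_area": eff_rooms * log_area,
        "balcony_area_ratio": min(balcony_sqm / area, BALCONY_RATIO_CAP),
        "room_density": eff_rooms / area,
    }


# Illustrative listing: 50 sqm, 1 + 1 half room, 3rd floor with elevator.
feats = interaction_features(area=50, rooms=1, half_rooms=1, floor=3,
                             has_elevator=True, balcony_sqm=5, dist_centre_km=4.0)
print(feats)
```

All six terms are numeric, so they add six columns regardless of how many categories the one-hot encoded fields have.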

[§6]Geospatial enrichment

Population and greenness, lagged and historized

WorldPop

Hungary rasters, 2020–2025.
H3 resolutions 6, 7, 8.
Total population + age/sex cohorts.
Lag: a 2025 listing sees 2024 features.
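
The lag rule amounts to a keyed lookup with the year shifted back. A minimal sketch, assuming features are stored per (H3 cell, source year); the cell IDs, population values and the function name are made up for illustration.

```python
def lagged_feature(features_by_cell_year, h3_cell, listing_year, lag_years=1):
    """Return the feature dated lag_years before the listing, or None if absent."""
    return features_by_cell_year.get((h3_cell, listing_year - lag_years))


# Illustrative population totals per (cell, source year); not real WorldPop values.
worldpop_total = {
    ("cell_a", 2024): 1200.0,
    ("cell_a", 2025): 1250.0,
}
print(lagged_feature(worldpop_total, "cell_a", 2025))  # 1200.0, the 2024 value
print(lagged_feature(worldpop_total, "cell_a", 2024))  # None, no 2023 raster here
```

Returning `None` instead of falling back to a later year keeps the missing case visible, so it can be imputed or flagged rather than silently leaking newer data.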

NDVI greenness

Sentinel-2 L2A via Microsoft Planetary Computer.
Summer composites, 2020–2025.
H3 mean, p90, and pixel count.
Point coverage ≈ 100% for the prototype.
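
The three per-cell statistics listed above can be sketched as a simple group-by. This assumes each pixel has already been assigned to an H3 cell upstream; the nearest-rank percentile definition is an assumption, and the project may use an interpolating one.

```python
from collections import defaultdict
from statistics import mean


def p90(values):
    """Nearest-rank 90th percentile (integer arithmetic avoids float edge cases)."""
    ordered = sorted(values)
    rank = (9 * len(ordered) + 9) // 10  # ceil(0.9 * n)
    return ordered[rank - 1]


def aggregate_ndvi(pixels):
    """pixels: iterable of (h3_cell, ndvi) pairs -> per-cell mean, p90, count."""
    by_cell = defaultdict(list)
    for cell, ndvi in pixels:
        by_cell[cell].append(ndvi)
    return {
        cell: {"ndvi_mean": mean(v), "ndvi_p90": p90(v), "pixel_count": len(v)}
        for cell, v in by_cell.items()
    }


# Illustrative pixels only.
pixels = [("cell_a", x) for x in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)]
pixels.append(("cell_b", 0.25))
stats = aggregate_ndvi(pixels)
print(stats["cell_a"]["ndvi_p90"], stats["cell_a"]["pixel_count"])  # 0.9 10
```

The pixel count is worth keeping as a feature-quality signal: a mean over a handful of cloud-free pixels is less trustworthy than one over a full cell.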

Geospatial features stay lagged and source-dated to avoid leakage. Static land cover is useful for EDA but is not aligned with historized validation.

A quick primer on H3

H3 is Uber's open-source discrete global grid: the earth is tiled with hexagons, each with a stable string ID (e.g. 8830953b13fffff). Hexagons are convenient because the centres of all six neighbours are roughly the same distance away, so aggregations and lookups behave uniformly. Resolution controls cell size: res 6 cells average about 36 km² (a large district), res 7 about 5 km² (a neighbourhood), and res 8 (used below) about 0.7 km², a few city blocks, which matches how renters think about local rent.
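
The resolution is encoded in the cell ID itself, so it can be read without any geometry. The sketch below relies on the published H3 index layout (resolution in bits 52 to 55 of the 64-bit index) and the average hexagon areas from the H3 resolution table, rounded; the helper name is illustrative and the `h3` library is deliberately not used here.

```python
# Average hexagon area per resolution, km² (rounded from the H3 resolution table).
H3_AVG_CELL_AREA_KM2 = {6: 36.129, 7: 5.161, 8: 0.737}


def h3_resolution(cell_id: str) -> int:
    """Read the resolution from bits 52-55 of the 64-bit H3 index."""
    return (int(cell_id, 16) >> 52) & 0xF


cell = "8830953b13fffff"
res = h3_resolution(cell)
print(res, H3_AVG_CELL_AREA_KM2[res])  # 8 0.737
```

Each step up in resolution shrinks the average cell area by roughly a factor of seven, so features at resolutions 6, 7 and 8 describe the same location at three nested scales.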

Fig. Summer Sentinel-2 NDVI, aggregated to H3 res-8 cells over Budapest. Darker green = more vegetation, useful as a proxy for green-space amenity around a listing.

[§7]Implications for the upload flow

Fields worth prioritizing

Area, rooms, and room density.
Condition, comfort, subtype, heating.
Balcony size relative to area.
Air conditioning and equipment flags.
District, zone, and enough address detail for local signal.

Potential UX pattern

Predict HUF/sqm; multiply by area for total rent.
Show a range when uncertainty is high.
Suppress the suggestion when coverage or slice reliability is poor.
Optionally show "what drives this estimate" via a local SHAP explanation.
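
The first three rules of this pattern can be sketched as one decision function. The thresholds, the relative-uncertainty input, and the rounding to the nearest 1,000 HUF are all hypothetical product choices; the source leaves them to the UX-band definition.

```python
def suggest_rent(pred_per_sqm, area_sqm, rel_uncertainty, coverage_ok,
                 range_threshold=0.10, suppress_threshold=0.25):
    """Turn a HUF/sqm prediction into a point, a range, or no suggestion."""
    if not coverage_ok or rel_uncertainty > suppress_threshold:
        return {"mode": "none"}  # suppress when inputs or slice are unreliable
    total = pred_per_sqm * area_sqm  # HUF/sqm times area gives monthly rent
    if rel_uncertainty > range_threshold:
        return {
            "mode": "range",
            "low": round(total * (1 - rel_uncertainty), -3),
            "high": round(total * (1 + rel_uncertainty), -3),
        }
    return {"mode": "point", "rent_huf": round(total, -3)}


print(suggest_rent(6000, 50, 0.05, True))   # point suggestion of 300000
print(suggest_rent(6000, 50, 0.15, True))   # range from 255000 to 345000
print(suggest_rent(6000, 50, 0.15, False))  # no suggestion
```

Keeping the policy in one pure function makes the product behaviour easy to review and test separately from the model.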

[§8]Risks & recommended next steps

risk	why it matters	mitigation in the artifact
Temporal leakage / drift	Prices move quickly.	Chronological split, rolling CV, reserved newest 20%.
Aggregate metric hides local failure	Districts / subtypes behave differently.	District + subtype MdAPE slices before product claims.
Missing optional fields	Model can lean on data users don't always provide.	Pipeline imputation, coverage guardrail, missingness profile.
Scope creep to national rentals	Budapest priors misprice countryside listings.	Explicit Budapest-only pilot; retrain + validate for rollout.
Business-metric gap	MdAPE gains may not reduce time-on-market.	Acceptance bands + later link predictions to TOM / experiment.
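
The slice diagnostic in the second row can be sketched as a grouped MdAPE. The column names, the slicing keys and the minimum row count per slice are assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import median


def sliced_mdape(rows, keys=("district", "subtype"), min_rows=30):
    """MdAPE per slice, with a flag for slices too small to trust."""
    groups = defaultdict(list)
    for r in rows:
        ape = abs(r["predicted"] - r["actual"]) * 100.0 / r["actual"]
        groups[tuple(r[k] for k in keys)].append(ape)
    return {
        g: {"n": len(v), "mdape": median(v), "reliable": len(v) >= min_rows}
        for g, v in groups.items()
    }


# Illustrative predictions only.
scored = [
    {"district": "V", "subtype": "brick", "actual": 8000, "predicted": 8400},
    {"district": "V", "subtype": "brick", "actual": 10000, "predicted": 9000},
    {"district": "V", "subtype": "brick", "actual": 9000, "predicted": 9000},
    {"district": "XI", "subtype": "panel", "actual": 5000, "predicted": 6000},
]
report = sliced_mdape(scored, min_rows=3)
print(report)
```

A slice that misses the 10% target, or that is flagged as unreliable, is a candidate for showing a range or suppressing the suggestion rather than for a product claim.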

1 · Freeze evaluation

One-shot holdout run; publish MdAPE / MAPE / RMSE, total-rent errors, coverage, and slice diagnostics.

2 · Define UX bands

Map the 10% MdAPE target to product behaviour: exact suggestion, range, or none.

3 · Pilot safely

Budapest rental flats only; monitor input coverage and prediction drift.

Bottom line: a credible, reproducible offline artifact and a practical roadmap. The remaining decision is product governance, not more exploratory modelling.

[§A]Appendix

Diagnostics, feature policy, data quality

Fig. SHAP group importance. Magnitudes are in log-points, not raw HUF. For explanation and sanity checks, not causal claims.

Fig. Permutation importance for the best model: drop in CV score when each feature is shuffled.

Feature policy

included

Core structure: area, rooms, floor, age.
Quality / amenity: condition, comfort, subtype, furnishing, heating, balcony, aircon.
Location: district / zone aggregates, distance to centre.
Fold-safe 90-day neighbourhood prior + local deltas.
Lagged WorldPop and NDVI features.
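
One way the fold-safe 90-day neighbourhood prior could work is sketched below. This is an assumed reading of the feature, not the project's implementation: the prior is the median HUF/sqm of same-neighbourhood listings dated strictly before the listing and within 90 days, and `history` must contain only rows the current fold is allowed to see.

```python
from datetime import date, timedelta
from statistics import median


def comparable_prior(listing, history, window_days=90):
    """Median HUF/sqm of earlier same-neighbourhood listings in the window."""
    start = listing["listed_on"] - timedelta(days=window_days)
    comps = [
        h["rent_per_sqm"]
        for h in history
        if h["neighbourhood"] == listing["neighbourhood"]
        and start <= h["listed_on"] < listing["listed_on"]  # strictly earlier
    ]
    return median(comps) if comps else None


# Illustrative history only.
history = [
    {"neighbourhood": "Újlipótváros", "listed_on": date(2025, 3, 1), "rent_per_sqm": 7000},
    {"neighbourhood": "Újlipótváros", "listed_on": date(2025, 4, 15), "rent_per_sqm": 7400},
    {"neighbourhood": "Újlipótváros", "listed_on": date(2024, 6, 1), "rent_per_sqm": 5000},
    {"neighbourhood": "Kelenföld", "listed_on": date(2025, 4, 1), "rent_per_sqm": 6000},
]
new_listing = {"neighbourhood": "Újlipótváros", "listed_on": date(2025, 5, 1)}
print(comparable_prior(new_listing, history))  # 7200.0
```

A local delta can then be expressed relative to this prior. When no comparables exist, returning `None` lets the pipeline impute or apply the coverage guardrail instead of inventing a value.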

excluded / constrained

Rent per sqm is the target, never a feature.
Total rent is not a feature.
No fine-grained street or geo-cell identifiers.
Listing text bodies not used; only language-presence flags.
Model-specific filters stay downstream of the modelling table.

Data-quality summary

check	result	action
Duplicate records	None observed	Passes basic uniqueness.
Non-positive target / area	None after filters	OK after downstream scope rules.
Weak location quality	~1% of rows	Coverage guardrail; avoid forcing location features.
Building floor count missing	~22%	Resolved earlier contradiction; imputed inside CV folds.
Common charges missing	~72%	Useful when present; not a mandatory UI dependency.
Utility costs missing	~48%	Impute and monitor; potentially improve upload completion.
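
"Imputed inside CV folds" means the fill value is learned from the training rows of each fold and only then applied to the validation rows. A minimal sketch with a median imputer and a missingness flag; the column name and helper names are hypothetical, and the project may use a different imputation strategy.

```python
from statistics import median


def fit_median_imputer(train_rows, column):
    """Learn the fill value from training rows only."""
    observed = [r[column] for r in train_rows if r.get(column) is not None]
    return median(observed)


def apply_imputer(rows, column, fill_value):
    """Fill missing values and keep a flag so the model can see missingness."""
    return [
        {
            **r,
            column: fill_value if r.get(column) is None else r[column],
            f"{column}_missing": r.get(column) is None,
        }
        for r in rows
    ]


# Illustrative fold only.
train_fold = [{"building_floors": 4}, {"building_floors": 10}, {"building_floors": None}]
valid_fold = [{"building_floors": None}, {"building_floors": 6}]
fill = fit_median_imputer(train_fold, "building_floors")
print(apply_imputer(valid_fold, "building_floors", fill))
```

Fitting the fill value on the full table instead would let validation-period information leak into training, which is the same problem the chronological split is designed to avoid.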

Source repo: github.com/balintdecsi/ceu-public-thesis · Original slide deck: html deck ↗