A Data Scientist goes…House Hunting (Part 1 – Motivation and Database Schema)
Note: The finished dashboard for this project can be seen here.
There are some markets that we only interact with occasionally in our lifetimes. For many of us, buying a home is one of those markets. In the US, where I lived for many years, it is quite typical when buying a property to hire a Buyers Agent – A Realtor who knows the market in which you are considering buying, can understand what you’re looking for and can search listings with that knowledge and find properties of interest. Of course, you pay for this knowledge but not usually directly – the sellers typically pay their Selling Realtor to sell their home and the Selling Realtor will split their commission (typically around 6%) with the Buyers Realtor on some basis (often 50/50). Great for the buyer but kind of sucks for the seller who has to fork out all that commission when the property sells.
In the UK, where I live now, having a Realtor (or Real Estate Agent as they are more commonly known here) act on the buyers behalf is not common at all. So, prospective buyers are somewhat on their own, researching the market and trying to find a home that meets their needs in terms of size, location and so on. The problem is that if you aren’t familiar with the market, how do you know what represents fair value for a particular property that’s listed? Sure, after a while you’ll get some anecdotal knowledge of the market as you start to filter listings on websites, in the estate agency shop window and perhaps even go to some viewings. However, it can be a very frustrating process – aggressively-priced homes can come on the market and sell before you ever realised they were there or got round to viewing them. It can take weeks and weeks of hard work and a lot of time before you really feel like you understand the market and feel truly comfortable putting in offers.
Motivation
In this project we’re going to approach the problem using Data Science. The aim is to provide an immediate full-market context to the buyer so they know from “day one” which properties represent good value and those which do not. As with all my projects, I’m not aiming to just repeat what’s already out there on the existing listing services, I’m looking to add unique additional insight and value – something genuinely different and useful.
Here’s what we’re aiming to achieve:
- A comprehensive picture of the overall market for the specific area of interest – all properties listed for sale, under offer and sold. Prices broken down by location, number of bedrooms, flats vs houses and so on.
- For any particular property for sale, it’s full listing history – days on market, history of price changes, whether or not it’s gone under offer, failed to complete and come back on the market again. We’re looking for the motivated sellers!
- For any particular property for sale, it’s full sales history going back 20 years – all the previous sales, dates and prices.
- Lists of properties of specific interest that meet certain criteria – size, location, new to market, back on market.
- Trends of the market over time – number of properties for sale by type, price type (fixed or offers over), prices by property size, houses vs flats.
- A predictive model that uses current market and historical data, by location, to predict the expected price of a home vs the asking price and therefore aids to assess as to whether it is under-valued, over-valued or fair.
We’re going to use the Edinburgh, Scotland housing market as the basis for this project. Edinburgh, the capital city of Scotland, is a vibrant city with many different housing types and types of locations to live in (inner-city, subrbs, coastal regions etc) and has a diverse, international population. It’s also where I happen to live so I have a vested interest in the project myself!
Data
Our data comes from a mysqlite database with the following schema:
The “addresses” table has all the details that describe each listing (mostly address related) and that don’t change over time:
| Field | Description | Type |
|---|---|---|
| pid | UNIQUE Property ID | INT |
| area | First part of Postcode (e.g. eh1, eh2 etc) | TEXT |
| address | Full postal address as displayed on listing | TEXT |
| dwelling_type | Description of listing – e.g. “3 bedroom flat for sale” | TEXT |
| dwelling_subtype | Higher level of categorisation of property type – e.g. “Flat”, “House”, “Town House” etc | TEXT |
| new_build | Marker to indicate if new build or not (1 or 0) | INT |
| beds | Number of bedrooms | INT |
| baths | Number of bathrooms | INT |
| latitude | Latitude coordinates | FLOAT |
| longitude | Longitude coordinates | FLOAT |
The “prices” table contains all the pricing records for each listing. Each listing can have multiple price records:
| Field | Description | Type |
|---|---|---|
| pid | Property ID. Note: Not UNIQUE. This table will have multiple rows with the same pid as prices get updated. | INT |
| dt_of_price | Date and time of price entry in ISO 8601 format YYYY-MM-DD HH:MM:SS | TEXT |
| price_type | Price type of entry – e.g. “Offers Over”, “Fixed”, “Offers Around” etc | TEXT |
| price | Price in GBP | INT |
The “status” table contains the status change records of each listing. Each listing can have multiple records:
| Field | Description | Type |
|---|---|---|
| pid | Property ID. Note: Not UNIQUE. This table will have multiple rows with the same pid as status gets updated. | INT |
| dt_of_status | Date and time of price entry in ISO 8601 format YYYY-MM-DD HH:MM:SS | TEXT |
| status | Status of listing – e.g. ‘Active’, ‘Under offer’, ‘Sold’ | TEXT |
The “dates” table contains the date / times of each listing:
| Field | Description | Type |
|---|---|---|
| pid | UNIQUE Property ID | INT |
| dt_first_listed | Date and time of first time this pid was listed. Timezone aware format – e.g. “2023-08-21T09:09:03Z” | TEXT |
| dt_first_seen | Date and time this pid was first seen. ISO 8601 format – e.g. YYYY-MM-DD HH:MM:SS | TEXT |
| dt_last_seen | Date and time this pid was last seen. ISO 8601 format – e.g. YYYY-MM-DD HH:MM:SS | TEXT |
The database_update table contains the date / time each time the database was updated with new data:
| Field | Description | Type |
|---|---|---|
| dt_of_update | Date and time this pid was first seen. ISO 8601 format – e.g. YYYY-MM-DD HH:MM:SS | TEXT |
The tables are typically updated once per day with the latest market data drawn from a variety of web sources.
In general, the approach we are going to follow is to use SQL queries to pull a view of the data which we then analyse using the Python Data Science stack – pandas, numpy, matplotlib, Bokeh etc. The reason for doing it this way is to imagine that the source database is too large to fit in memory or to load using Pandas directly which is more typical in real-life than in this limited dataset we are using here.


Pingback : Neil Darragh, PhD » How to Overlay UK Postcode Sectors onto a Map
Pingback : Neil Darragh, PhD » A Data Scientist goes…House Hunting (Part 2 – Data Prep and EDA)
Pingback : Neil Darragh, PhD » A Data Scientist goes…House Hunting (Part 3 – Geo-Location Visualisation)