Let's import the modules we'll need:
import sqlite3
import datetime
import pandas as pd
import numpy as np
First, open our database for the queries:
database_path = 'databases/forsale2_hashed.db'
con = sqlite3.connect(database_path)
cur = con.cursor()
Query when the last and prior database updates were:
res = cur.execute("SELECT dt_of_update FROM database_update ORDER BY dt_of_update DESC LIMIT 2")
db_updates = res.fetchall()
dt_last_db_update = db_updates[0][0]
dt_prior_db_update = db_updates[1][0]
Query how many current listings there are:
res = cur.execute("SELECT * FROM addresses NATURAL JOIN dates WHERE dates.dt_last_seen > ?", (dt_prior_db_update,))
actives = res.fetchall()
num_current = len(actives)
And, finally, let's query the total number of listings that have ever been in the database and then close the connection to the database:
res = cur.execute("SELECT COUNT(*) from addresses")
result = res.fetchall()
num_total_properties = result[0][0]
con.close()
Now, let's put all that information into a pretty table and display it. First we'll define a helper function that makes it easy to quickly make a simple table in Pandas with the styling we want:
def pretty_table(list_of_lists, caption='', columns=[], pctcol=''):
    df = pd.DataFrame(list_of_lists)
    if columns != []:
        df.columns = columns
    if pctcol != '':
        total = df[pctcol].sum()
        df['Pct'] = np.round(100 * df[pctcol] / total, 2)
    dfStyler = df.style
    dfStyler.set_table_styles([{'selector': 'caption', 'props': [('text-align', 'center'), ('font-size', '150%'), ('color', 'black')]},
                               {'selector': 'th', 'props': [('text-align', 'left')]},
                               {'selector': 'td', 'props': 'text-align: left; font-weight: bold;'},
                               ], overwrite=False)
    if pctcol != '':
        dfStyler.format({'Pct': '{:.2f}'})
    dfStyler.set_caption(caption)
    dfStyler.hide(axis='index')
    if columns == []:
        dfStyler.hide(axis='columns')
    display(dfStyler)
And now, display it:
pretty_table([['Date / Time of this analysis', str(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))],
['Date / Time of last DB update', dt_last_db_update],
['Number of properties in DB', num_total_properties],
['Number of current listings', num_current]
], caption = 'Database Summary')
| Date / Time of this analysis | 2023-11-13 11:34:13 |
| Date / Time of last DB update | 2023-11-03 22:09:30 |
| Number of properties in DB | 7670 |
| Number of current listings | 4780 |
We now want to query the tables in the database and make a dataframe out of the current listings. We can do that with the following SQL query directly into a pandas dataframe:
# Current listings are anything last seen after the prior database update
con = sqlite3.connect(database_path)
# Parameter binding is safer and cleaner than interpolating the date into the query string
query = "SELECT * FROM addresses NATURAL JOIN dates WHERE dates.dt_last_seen > ?"
actives_df = pd.read_sql(query, con, params=(dt_prior_db_update,))
con.close()
Let's take a look at the first few records of our active listing dataset:
actives_df.set_index('pid', inplace=True)
actives_df.head(5).style.hide(axis='index')
| area | address | dwelling_type | dwelling_subtype | new_build | beds | baths | latitude | longitude | dt_first_listed | dt_first_seen | dt_last_seen |
|---|---|---|---|---|---|---|---|---|---|---|---|
| eh1 | Demo version: address hidden for privacy | 3 bedroom apartment for sale | Apartment | 1 | 3 | 2 | 55.95484 | -3.18843 | 2023-08-17 11:27:13 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 |
| eh1 | Demo version: address hidden for privacy | 7 bedroom town house for sale | Town House | 0 | 7 | None | 55.958904 | -3.187063 | 2023-04-28 15:09:46 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 |
| eh1 | Demo version: address hidden for privacy | 5 bedroom mews house for sale | Mews | 0 | 5 | 4 | 55.956736 | -3.191266 | 2023-08-24 10:14:06 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 |
| eh1 | Demo version: address hidden for privacy | 2 bedroom apartment for sale | Apartment | 1 | 2 | 2 | 55.95484 | -3.18843 | 2023-08-10 00:36:25 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 |
| eh1 | Demo version: address hidden for privacy | 2 bedroom apartment for sale | Apartment | 1 | 2 | 2 | 55.95484 | -3.18843 | 2023-08-10 00:36:23 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 |
Some notes: I've hidden the addresses for privacy reasons. Data collection started on 24th August, so for many properties the date first seen is that date even though they were first listed earlier. The date last seen is the date of this blog-post update.
Now we are going to want to add some features. First, we want a further higher-level dwelling type that categorises all the different dwelling sub-types into just Houses, Flats or Other. Let's take a look at all the different ways these are described in the listings:
actives_df['dwelling_subtype'].unique()
array(['Apartment', 'Town House', 'Mews', 'Flat', 'Duplex', 'Maisonette',
'Ground Flat', 'Parking', 'Studio', 'Terraced', 'Semi-Detached',
'Not Specified', 'Plot', 'Retirement Property', 'Garages',
'Detached', 'Penthouse', 'Detached Bungalow', 'End of Terrace',
'House', 'Villa', 'Terraced Bungalow', 'Semi-Detached Bungalow',
'Ground Maisonette', 'Bungalow', 'Land', 'Link Detached House',
'Hotel Room', 'Cottage', 'Triplex', 'Detached Villa',
'Sheltered Housing', 'Park Home', 'Farm Land',
'Semi-detached Villa'], dtype=object)
We'll make two strings that respectively contain all the terms being used for Flats and Houses (we note that this requires some domain expertise). Anything that is not a Flat or a House will be given the category 'Other'. We'll call the new feature 'dwelling_cat', as follows:
flats_str = 'Flat|Apartment|Duplex|Ground Flat|Maisonette|Studio|Penthouse|Ground Maisonette|Triplex'
# Built with implicit string concatenation so no stray spaces from the line continuation end up in the regex
houses_str = ('Town House|Mews|Terraced|Semi-Detached|End of Terrace|Detached|House|Semi-Detached Bungalow|Detached Villa|'
              'Detached Bungalow|Villa|Bungalow|Terraced Bungalow|Cottage|Link Detached House|Semi-detached Villa')
# If it's not a Flat, test if it's a House and if not, it's an Other
actives_df['dwelling_cat'] = np.where(actives_df['dwelling_subtype'].str.contains(flats_str), 'Flat', \
np.where(actives_df['dwelling_subtype'].str.contains(houses_str), 'House', 'Other'))
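As a quick sanity check, the nested `np.where` pattern behaves like a two-stage if/elif on the whole series. Here it is on a toy series with shortened match strings (hypothetical data, not the full lists above):

```python
import numpy as np
import pandas as pd

# Toy check of the nested np.where categorisation (shortened match strings)
s = pd.Series(['Apartment', 'Detached Bungalow', 'Plot'])
cat = np.where(s.str.contains('Flat|Apartment'), 'Flat',
               np.where(s.str.contains('Detached|Bungalow'), 'House', 'Other')).tolist()
print(cat)  # ['Flat', 'House', 'Other']
```

Note that `str.contains` does substring matching, which is why e.g. 'Ground Flat' is caught by the 'Flat' term without needing its own entry.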
Now we want to add a days-on-market feature which we will call 'dom'. It's the difference between the last date a property was seen online and the first date it was listed. Strictly speaking, this is only valid for those properties still for sale but we'll compute it for all (even those that are no longer online). The 12-hour timedelta is to force the computation to round to the nearest day.
actives_df['dom'] = (pd.to_datetime(actives_df['dt_last_seen']) - \
pd.to_datetime(actives_df['dt_first_listed']) + \
datetime.timedelta(hours=12)).dt.days
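The effect of the 12-hour offset can be seen on a toy pair of timestamps (hypothetical dates): `.dt.days` truncates the timedelta, so adding 12 hours first turns truncation into round-to-nearest-day.

```python
import datetime
import pandas as pd

# A listing online for 8 days 20 hours: .dt.days truncates to 8,
# but adding 12 hours first rounds to the nearest whole day, giving 9
first_listed = pd.to_datetime(pd.Series(['2023-08-01 02:00:00']))
last_seen = pd.to_datetime(pd.Series(['2023-08-09 22:00:00']))
print((last_seen - first_listed).dt.days[0])                                 # 8
print((last_seen - first_listed + datetime.timedelta(hours=12)).dt.days[0])  # 9
```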
# Convert beds to integer (unlike baths, beds has no missing values here, so the cast is safe)
actives_df['beds'] = actives_df['beds'].astype('int')
Now we want to add the current price and current price type for each of the records in the dataframe. This involves querying the prices table in the database and then performing a GroupBy function to group the prices data by property id (pid). The values we want are the last entries in the GroupBy table (the current ones). First, let's get the pricing history data for all the current listings in the database:
con = sqlite3.connect(database_path)
# Parameter binding, as before, rather than string interpolation
query = "SELECT * FROM prices NATURAL JOIN dates WHERE dates.dt_last_seen > ?"
listings_prices_df = pd.read_sql(query, con, params=(dt_prior_db_update,))
listings_prices_df.set_index('pid', inplace=True)
con.close()
And then define the new features as follows:
actives_df['current_price_type'] = listings_prices_df.groupby('pid').tail(1)['price_type']
actives_df['current_price'] = listings_prices_df.groupby('pid').tail(1)['price']
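To see why `tail(1)` gives the current values, here's a toy price history (hypothetical pids and prices). The key assumption is that rows in the prices table are appended in chronological order, so the last row per pid is the latest one:

```python
import pandas as pd

# Toy price history, assumed appended in chronological order within each pid
hist = pd.DataFrame({'pid': [1, 1, 2],
                     'price': ['250000', '240000', '310000']}).set_index('pid')
current = hist.groupby('pid').tail(1)['price']
print(current.to_dict())  # {1: '240000', 2: '310000'}
```

Because the result keeps pid as its index, assigning it to a column of `actives_df` (also indexed by pid) aligns the values automatically.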
# There's a small number of entries that have something other than a price - e.g. 'Coming Soon' or 'POA'
# They aren't of use to us so we'll remove them
print('Number of non-priced properties:', len(actives_df[~actives_df['current_price'].str.isdigit()]))
actives_df = actives_df[actives_df['current_price'].str.isdigit()]
actives_df['current_price'] = actives_df['current_price'].astype('int')
Number of non-priced properties: 14
Now do the same for the listings status:
con = sqlite3.connect(database_path)
# Parameter binding, as before, rather than string interpolation
query = "SELECT * FROM status NATURAL JOIN dates WHERE dates.dt_last_seen > ?"
listings_status_df = pd.read_sql(query, con, params=(dt_prior_db_update,))
listings_status_df.set_index('pid', inplace=True)
con.close()
actives_df['current_status'] = listings_status_df.groupby('pid').tail(1)['status']
Now let's take a look at the dataframe with our new features:
actives_df.head(5).style.hide(axis='index')
| area | address | dwelling_type | dwelling_subtype | new_build | beds | baths | latitude | longitude | dt_first_listed | dt_first_seen | dt_last_seen | dwelling_cat | dom | current_price_type | current_price | current_status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| eh1 | Demo version: address hidden for privacy | 3 bedroom apartment for sale | Apartment | 1 | 3 | 2 | 55.95484 | -3.18843 | 2023-08-17 11:27:13 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 | Flat | 78 | Fixed Price | 1495000 | Active |
| eh1 | Demo version: address hidden for privacy | 7 bedroom town house for sale | Town House | 0 | 7 | None | 55.958904 | -3.187063 | 2023-04-28 15:09:46 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 | House | 189 | Offers Over | 1250000 | Under offer |
| eh1 | Demo version: address hidden for privacy | 5 bedroom mews house for sale | Mews | 0 | 5 | 4 | 55.956736 | -3.191266 | 2023-08-24 10:14:06 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 | House | 71 | Offers Over | 1125000 | Active |
| eh1 | Demo version: address hidden for privacy | 2 bedroom apartment for sale | Apartment | 1 | 2 | 2 | 55.95484 | -3.18843 | 2023-08-10 00:36:25 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 | Flat | 86 | Fixed Price | 1075000 | Under offer |
| eh1 | Demo version: address hidden for privacy | 2 bedroom apartment for sale | Apartment | 1 | 2 | 2 | 55.95484 | -3.18843 | 2023-08-10 00:36:23 | 2023-08-24 18:45:17 | 2023-11-03 18:18:38 | Flat | 86 | Fixed Price | 1005000 | Active |
Let's look at how the property types break down by category and dwelling sub-type within each category:
df = pd.DataFrame(actives_df['dwelling_cat'].value_counts())
pretty_table([x for x in zip(df.index.values, df['count'].values)],
caption = 'All Properties', columns=['Type','Num'], pctcol='Num')
| Type | Num | Pct |
|---|---|---|
| Flat | 2644 | 55.48 |
| House | 2012 | 42.22 |
| Other | 110 | 2.31 |
df = pd.DataFrame(actives_df[actives_df['dwelling_cat'] == 'Flat']['dwelling_subtype'].value_counts())
pretty_table([x for x in zip(df.index.values, df['count'].values)], \
caption = 'Flats', columns=['Type','Num'], pctcol='Num')
| Type | Num | Pct |
|---|---|---|
| Flat | 1690 | 63.92 |
| Apartment | 446 | 16.87 |
| Ground Flat | 368 | 13.92 |
| Maisonette | 69 | 2.61 |
| Duplex | 32 | 1.21 |
| Penthouse | 26 | 0.98 |
| Studio | 8 | 0.30 |
| Ground Maisonette | 4 | 0.15 |
| Triplex | 1 | 0.04 |
df = pd.DataFrame(actives_df[actives_df['dwelling_cat'] == 'House']['dwelling_subtype'].value_counts())
pretty_table([x for x in zip(df.index.values, df['count'].values)], \
caption = 'Houses', columns=['Type','Num'], pctcol='Num')
| Type | Num | Pct |
|---|---|---|
| Detached | 650 | 32.31 |
| Semi-Detached | 405 | 20.13 |
| Terraced | 354 | 17.59 |
| End of Terrace | 189 | 9.39 |
| Villa | 71 | 3.53 |
| Detached Bungalow | 56 | 2.78 |
| Town House | 53 | 2.63 |
| Detached Villa | 51 | 2.53 |
| House | 43 | 2.14 |
| Semi-Detached Bungalow | 40 | 1.99 |
| Bungalow | 34 | 1.69 |
| Cottage | 26 | 1.29 |
| Mews | 16 | 0.80 |
| Terraced Bungalow | 11 | 0.55 |
| Link Detached House | 10 | 0.50 |
| Semi-detached Villa | 3 | 0.15 |
df = pd.DataFrame(actives_df[actives_df['dwelling_cat'] == 'Other']['dwelling_subtype'].value_counts())
pretty_table([x for x in zip(df.index.values, df['count'].values)], \
caption = 'Other', columns=['Type','Num'], pctcol='Num')
| Type | Num | Pct |
|---|---|---|
| Retirement Property | 59 | 53.64 |
| Garages | 13 | 11.82 |
| Not Specified | 12 | 10.91 |
| Land | 10 | 9.09 |
| Parking | 4 | 3.64 |
| Plot | 4 | 3.64 |
| Park Home | 4 | 3.64 |
| Sheltered Housing | 2 | 1.82 |
| Hotel Room | 1 | 0.91 |
| Farm Land | 1 | 0.91 |
From this point on, let's not include the 'Other' property types as we're not really interested in them.
actives_df = actives_df[actives_df['dwelling_cat'] != 'Other']
Let's take a look at the pricing and listing status for the currently listed properties:
pretty_table([
['Number of properties with Fixed Prices:', len(actives_df[actives_df['current_price_type'].str.contains('Fixed')])],
['Number of properties Offers Over / Around / From / Guide:', len(actives_df[actives_df['current_price_type'].str.contains('Offers|Guide|From')])],
['Number of properties with no price type (usually New Builds):',len(actives_df[~actives_df['current_price_type'].str.contains('Fixed|Offers|Guide|From')])],
], caption = 'Price Type Summary', columns=['','Num'], pctcol='Num')
| | Num | Pct |
|---|---|---|
| Number of properties with Fixed Prices: | 1134 | 24.36 |
| Number of properties Offers Over / Around / From / Guide: | 3254 | 69.89 |
| Number of properties with no price type (usually New Builds): | 268 | 5.76 |
pretty_table([
['Number Active / For Sale:', len(actives_df[actives_df['current_status'] == 'Active'])],
['Number Under Offer:', len(actives_df[actives_df['current_status'] == 'Under offer'])],
['Number Sold STC:', len(actives_df[actives_df['current_status'].str.contains('Sold')])],
['Number Reserved:', len(actives_df[actives_df['current_status'].str.contains('Reserved')])],
], caption = 'Sales Status Summary', columns=['','Num'], pctcol='Num')
| | Num | Pct |
|---|---|---|
| Number Active / For Sale: | 2333 | 50.11 |
| Number Under Offer: | 1256 | 26.98 |
| Number Sold STC: | 1055 | 22.66 |
| Number Reserved: | 12 | 0.26 |
actives_df['current_price'] = actives_df['current_price'].astype('int')
pretty_table([
['Overall Min price:', '£'+str(int(actives_df['current_price'].min()))],
['Overall Max price:', '£'+str(int(actives_df['current_price'].max()))],
['Overall Median price:', '£'+str(int(actives_df['current_price'].median()))],
['Overall Mean price:', '£'+str(int(actives_df['current_price'].mean()))]
],
caption = 'Overall Price Summary'
)
| Overall Min price: | £38000 |
| Overall Max price: | £6000000 |
| Overall Median price: | £290000 |
| Overall Mean price: | £364304 |
Let's look at the distribution of prices for flats vs houses. For this we can use either histograms, or as I have done here, violin plots. I find violin plots easier to compare distributions with as each attribute gets its own y-axis and all three distributions can be drawn using kernel density estimation more compactly in the same chart.
import matplotlib.pyplot as plt
all_prices = actives_df['current_price'] / 1000
house_prices = actives_df[actives_df['dwelling_cat'] == 'House']['current_price'] / 1000
flat_prices = actives_df[actives_df['dwelling_cat'] == 'Flat']['current_price'] / 1000
p = [0, 0.25, 0.5, 0.75, 0.95]
plt.figure(figsize=(10,4))
plt.violinplot([all_prices, house_prices, flat_prices], vert=False, showmedians=False, widths=0.9, quantiles=[p, p, p])
ax = plt.gca()
ax.xaxis.grid(True)
ax.set_yticks([y + 1 for y in range(3)],
labels=['All', 'Houses', 'Flats'])
ax.set_xlabel('Price £k')
plt.title('All vs House vs Flat Prices (Percentiles: 0%, 25%, 50%, 75%, 95%)')
ax.set_xlim(0, 1100)
plt.show()
We see that the distribution for houses is much more spread out than that for flats: the 95th percentile for houses is over £1m, versus just over £600k for flats.
Now let's develop a visualisation of the mean price split out by the number of beds and what price type the properties are - Fixed Price, Offers Over or New Build. We're going to use Bokeh for plotting here as ultimately we will be putting all the visualisations together into an interactive dashboard - something Bokeh is particularly suited to. First, we'll import the bokeh modules we'll need:
from bokeh.models import ColumnDataSource, LabelSet
from bokeh.plotting import figure
from bokeh.transform import dodge
import bokeh.palettes as bp
from bokeh.io import show
from bokeh.io import output_notebook
output_notebook(hide_banner=True)
beds = [str(x) for x in range(1,6)]
mean_prices_fixed = [actives_df[(actives_df['beds'] == x) & (actives_df['current_price_type'].str.contains('Fixed'))]\
['current_price'].mean()/1000 for x in range(1,6)]
mean_prices_offers = [actives_df[(actives_df['beds'] == x) & (actives_df['current_price_type'].str.contains('Offers'))]\
['current_price'].mean()/1000 for x in range(1,6)]
mean_prices_newbuild = [actives_df[(actives_df['beds'] == x) & (actives_df['new_build'] == '1')]\
['current_price'].mean()/1000 for x in range(1,6)]
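The same per-bed means could also be computed with a single `groupby` per price-type subset rather than one list comprehension per bed count. A minimal sketch on toy data (hypothetical values, standing in for e.g. the Fixed Price subset of `actives_df`):

```python
import pandas as pd

# Toy subset: one groupby replaces the per-bed list comprehension
toy = pd.DataFrame({'beds': [1, 1, 2], 'current_price': [200000, 220000, 300000]})
mean_by_beds = toy.groupby('beds')['current_price'].mean() / 1000
print(mean_by_beds.to_dict())  # {1: 210.0, 2: 300.0}
```

The list-comprehension form used above has one advantage for plotting: it guarantees an entry (possibly NaN) for every bed count from 1 to 5, whereas a groupby only yields the bed counts actually present.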
data = {'Beds' : beds,
'Offers' : mean_prices_offers,
'Fixed' : mean_prices_fixed,
'New Build' : mean_prices_newbuild,
}
source = ColumnDataSource(data=data)
p = figure(x_range=beds, title='Mean Price vs Number of Beds over Price Type',
width=800, height=400, toolbar_location=None, tools="")
# Offers over bars
p.vbar(x=dodge('Beds', -0.25, range=p.x_range), top='Offers', source=source,
width=0.2, color=bp.Light6[0], legend_label="Offers Over")
# Fixed price bars
p.vbar(x=dodge('Beds', 0.0, range=p.x_range), top='Fixed', source=source,
width=0.2, color=bp.Light6[1], legend_label="Fixed Price")
# New build bars
p.vbar(x=dodge('Beds', 0.25, range=p.x_range), top='New Build', source=source,
width=0.2, color=bp.Light6[2], legend_label="New Build")
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None
p.legend.location = "top_left"
p.legend.orientation = "horizontal"
p.yaxis.axis_label = 'Mean Price £k'
p.xaxis.axis_label = 'Num Beds'
p.y_range.start = 0
show(p)
Here we see that the number of bedrooms is strongly correlated with mean price - it's a primary feature people look at when assessing the suitability of a home. We can also see the discount an offers-over price carries relative to a fixed price - around 10% for 1 - 3 bedroom properties. Something interesting happens for 4 and 5 bedroom properties: the offers-over price is higher than the fixed price. We also see that second-hand inventory is priced lower than new builds for 1 - 3 bedroom properties, but this trend too is mostly reversed for 4 and 5 bedroom properties. We will look into the reasons for this a little later, after we have developed some more visualisation tools.
Let's go ahead and save the dataframe we've developed - we'll re-use it in part 3 of this project.
actives_df.to_pickle('actives_df_part2.pkl')
# jupyter nbconvert --to html --no-prompt adsg_house_hunting2_part2.ipynb