Encoding Seasonality to Improve Our Property Valuation Models
November 28, 2022
At Homebound, the Machine Learning & Pricing team is responsible for developing and maintaining an Automated Valuation Model (AVM) that accurately predicts the value of a property. This model, dubbed here the Homebound Valuation Model (HVM), uses characteristics such as the size of a home, its view, proximity to major roads, and much more to estimate the price of a property. This enables our underwriting team to quickly and accurately value properties for customers looking to sell their homes through the Homebound Property Purchase Portal.
Our team is continuously improving the HVM by engineering new features, modifying the model architecture, and improving data quality. One feature we've recently incorporated is seasonality. The time of year at which you buy or sell a house can affect how much the property is worth: depending on the location of a home, prices tend to be higher during the spring and summer months and lower during the winter. While our initial models naively captured some aspects of seasonality through features such as the time since a home was last sold, we wanted to encode seasonality into our models explicitly.
Traditionally, a home's listing price is based on recently sold, comparable homes nearby, or comps. While a seller may not time the listing of their home around the season, an appraiser or real-estate agent factors the season into the listing price. A machine learning model, however, must have this seasonality explicitly included as a feature, and that can be done in numerous ways.
Encoding seasonality through the month in which an event occurred, either ordinally or categorically, is probably the most common and simplest way to extract seasonal features. At Homebound, this is the month in which a property was listed for sale. Although easy to implement and interpret, this method often fails to capture what we want the model to understand about seasonality. For example, if the month is one-hot encoded, each month is treated as independent, whereas in reality a temporal relationship exists between them. If the month is encoded ordinally, we capture the similarity between adjacent months such as January and February while distancing dissimilar months such as January and June, but then January and December end up represented as the least similar months even though they are adjacent in time.
Another way to represent seasonality is by bucketing, or grouping, time into intervals that may be relevant to the outcome being predicted. For example, we manually bucketed list dates in two potentially relevant ways: quarterly and seasonally. While this reduces dimensionality and manually encodes aspects of perceived seasonality, the same issues discussed above persist for both categorical and ordinal representations of the bucketed groups.
While both monthly and bucketed representations of time help in interpreting and representing seasonality as model features, they struggle to capture the cyclic nature of time and seasons. An alternative is to transform each date in the year into a two-dimensional sine and cosine feature space. This is done by normalizing the numerical day of year to a value between 0 and 1, scaling those values to the interval between 0 and 2π, and then taking the sine and cosine of the result (think back to the unit circle from trigonometry). Each day of the year is then represented by a unique sine and cosine value pair, and together these pairs trace out a circle, as seen in the figure below.
One downside to this methodology is a loss of interpretability. Having two features represent a single concept makes it harder to decipher feature importance and the feature's overall effect on the outcome variable. Additionally, some models, such as tree-based methods, may have trouble creating useful splits on two features that jointly represent one element, seasonality. One potential solution is a similar, one-dimensional cyclic feature interpreted as N days from a specific, chosen point in time. This is equivalent to keeping just the normalized cosine representation from above: each day is N days away from January 1st, with N increasing until the two halves of the year converge in July. This does, however, mean that two different times of the year are represented by the same value.
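A minimal sketch of this one-dimensional encoding in pandas, using January 1st as the reference point as described above. The `list_date` column name and date range are illustrative assumptions, not the HVM's actual schema:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"list_date": pd.date_range("2019-01-01", "2021-12-31", freq="D")})

# Days elapsed since the most recent January 1st (0 on January 1st itself).
doy = df["list_date"].dt.dayofyear - 1
days_in_year = np.where(df["list_date"].dt.is_leap_year, 366, 365)

# Wrap-around distance to the nearest January 1st: the value climbs
# through spring, peaks in early July, then falls back toward December 31st.
df["days_from_jan1"] = np.minimum(doy, days_in_year - doy)
```

Note the collision the text warns about: a date in early spring and a date in late fall that are equally far from January 1st receive the same value.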
Ultimately, there is no “right” way to encode seasonality. Each of these methods has its pros and cons based on the problem at hand, business requirements, and the model chosen. Therefore, the only way we could fairly evaluate which method was best for the HVM was through experimentation.
To determine how to best represent seasonality in the HVM, we ran experiments to find the feature that minimized error, as measured by Mean Absolute Percentage Error (MAPE), and ultimately provided the greatest lift in model accuracy. At Homebound, we use MLflow to track experiments, store model details, and evaluate key metrics across different regions. The table below contains information on the seasonality experiments we ran.
Given that cyclic feature encodings are generally used in non-tree-based models, often deep learning models, we were surprised to see them give us the greatest lift in model performance. The HVM is an ensemble of tree-based methods, so we were unsure about its ability to pick up on the seasonality of home prices from a two-dimensional feature space.
If you're unsure how to encode seasonality for your own projects, give these methods a try; you may be surprised by what works!
Next Steps & Future Work
As mentioned previously, one of our team's primary goals is to assist Homebound's underwriting team in making quick, accurate, and fair offers to potential Homebound customers. Providing more explainability into HVM predictions is one way we can improve the partnership between our teams. While incorporating the two-dimensional cyclic features was great for boosting model performance, it somewhat decreased model interpretability.
One way the Machine Learning team has been gaining insight into individual predictions, and the effect of seasonality on a prediction, is through Shapley values. In the future, providing the same level of interpretability to HVM predictions for the underwriting team could help us refine our understanding of home price seasonality in different regional markets and, in turn, improve our models.
This section focuses on implementing, in Python, the different seasonality features discussed above.
We start by importing the packages needed to create a dummy Pandas DataFrame containing dates over the past three years (2019–2021) to extract seasonality features from.
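A sketch of that setup, with `list_date` as an assumed column name for the dates we will transform:

```python
import pandas as pd

# Dummy DataFrame of daily dates spanning 2019-2021 to engineer
# seasonality features from.
df = pd.DataFrame({"list_date": pd.date_range("2019-01-01", "2021-12-31", freq="D")})
```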
For extracting months from dates, pandas has convenient functionality that allows us to grab both month numbers and names. Note that grabbing month names is not strictly necessary for the categorical representation of this feature, since one-hot encoding the month numbers will do, but the result is cleaner this way.
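For example, using the pandas `.dt` accessor (the `list_date` column is our assumed name from the setup above):

```python
import pandas as pd

df = pd.DataFrame({"list_date": pd.date_range("2019-01-01", "2021-12-31", freq="D")})

# Month number (1-12) and human-readable month name from each date.
df["month"] = df["list_date"].dt.month
df["month_name"] = df["list_date"].dt.month_name()
```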
Fortunately, bucketing time into groups such as quarters is also relatively straightforward and can be done in a similar manner to extracting month numbers and names.
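A one-line sketch, again assuming a `list_date` column:

```python
import pandas as pd

df = pd.DataFrame({"list_date": pd.date_range("2019-01-01", "2021-12-31", freq="D")})

# Calendar quarter (1-4) extracted directly from the date.
df["quarter"] = df["list_date"].dt.quarter
```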
Bucketing months into manually created groups, such as seasons, requires a little more work and is done by mapping each month to its desired group.
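One way to sketch this; the season boundaries below (Northern Hemisphere, meteorological) are our assumption, not necessarily the groupings used in the HVM:

```python
import pandas as pd

df = pd.DataFrame({"list_date": pd.date_range("2019-01-01", "2021-12-31", freq="D")})

# Map each month number to an assumed season label.
season_map = {
    12: "winter", 1: "winter", 2: "winter",
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "fall", 10: "fall", 11: "fall",
}
df["season"] = df["list_date"].dt.month.map(season_map)
```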
Lastly, to create the cyclic features discussed above, the day of year from each date needs to be extracted. We can then create the sine and cosine time feature space by normalizing the day of year to a value between 0 and 1, transforming those values between the 0 and 2π sine and cosine intervals, and then obtaining the sine and cosine of those values.
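The steps above can be sketched as follows, normalizing by the actual length of each year so leap years map cleanly onto the unit circle (the `sin_time`/`cos_time` column names are our choice):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"list_date": pd.date_range("2019-01-01", "2021-12-31", freq="D")})

# Normalize day of year to [0, 1], accounting for leap years,
# then project onto the unit circle via sine and cosine.
day_of_year = df["list_date"].dt.dayofyear
days_in_year = np.where(df["list_date"].dt.is_leap_year, 366, 365)
normalized = day_of_year / days_in_year
df["sin_time"] = np.sin(2 * np.pi * normalized)
df["cos_time"] = np.cos(2 * np.pi * normalized)
```

Every (sin_time, cos_time) pair lies on the unit circle, so December 31st and January 1st sit next to each other, which is exactly the cyclic behavior the monthly and bucketed encodings fail to capture.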
Ultimately, the engineered seasonality features should look something like the observations in the table below.
Homebound is hiring! Interested in working at Homebound? Check out our careers page!