by Mamy Ratsimbazafy | April 26, 2017 7:59 pm
Three months ago, the data science competition website Kaggle published a challenge from Two Sigma, an investment fund, and Renthop, a startup they invested in.
The goal was to predict interest in New York apartments listed on Renthop’s website using a very rich dataset with:
The challenge just finished and I want to share my approach. The code is on my GitHub.
This solution features Gradient Boosted Trees (LightGBM) and does not use stacking, due to lack of time.
Features can be activated and deactivated by a single comment in
From the datetime field I created several features:
Furthermore, the day, month and hour are cyclical.
To tell the classifier that after Sunday (day 6) comes Monday (day 0), I’ve projected the time information onto a circle by taking the cos and sin.
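
A minimal sketch of this projection, using only the standard library rather than the code from the repository. The function names `cyclical_encode` and `circle_distance` are hypothetical, and the period of 7 assumes the Monday=0 to Sunday=6 convention mentioned above.

```python
import math

def cyclical_encode(value, period):
    """Project a periodic value onto the unit circle as a (cos, sin) pair."""
    angle = 2.0 * math.pi * value / period
    return math.cos(angle), math.sin(angle)

def circle_distance(a, b):
    """Euclidean distance between two encoded points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Day of week, Monday=0 ... Sunday=6, so the period is 7.
monday, tuesday, sunday = (cyclical_encode(d, 7) for d in (0, 1, 6))

# After encoding, Sunday is as close to Monday as Monday is to Tuesday,
# which is not the case for the raw integers 6 and 0.
gap_sunday_monday = circle_distance(sunday, monday)
gap_monday_tuesday = circle_distance(monday, tuesday)
```

The same function works for hours (period 24) or months (period 12); both the cos and sin columns are needed, since either one alone maps two different times to the same value.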
From the latitude and longitude, I created clusters using Density-based clustering ([HDBSCAN](https://hdbscan.readthedocs.io/en/latest/)).
I would have preferred DBSCAN with epsilon set to 200 meters, but unfortunately DBSCAN is not properly optimized. Trying to get the pairwise haversine distances for 40000 (train set) or 70000 (test set) points goes KABOOM on my memory.
(HDBSCAN creates clusters fully automatically from density, but NYC is too dense.)
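
A back-of-the-envelope estimate shows why the pairwise distances do not fit in memory. This sketch assumes a dense n × n matrix of 8-byte floats; whether a particular DBSCAN implementation really materializes the full matrix depends on its settings, so treat the numbers as an upper-bound illustration. The helper name is hypothetical.

```python
def dense_distance_matrix_gb(n_points, bytes_per_value=8):
    """Size in GB (10**9 bytes) of a dense n x n matrix of distances."""
    return n_points ** 2 * bytes_per_value / 1e9

train_gb = dense_distance_matrix_gb(40_000)  # 40000^2 * 8 bytes = 12.8 GB
test_gb = dense_distance_matrix_gb(70_000)   # 70000^2 * 8 bytes = 39.2 GB
```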
From the public kernels I’ve also taken the coordinates of Central Park, Brooklyn, Queens … to compute the distance of each apartment from those centers.
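
The distance-to-center features can be sketched with a plain haversine formula. The landmark coordinates below are rough placeholders for illustration, not the values taken from the public kernels, and the names `haversine_km`, `LANDMARKS` and `landmark_distances` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate, illustrative coordinates only.
LANDMARKS = {"central_park": (40.78, -73.97)}

def landmark_distances(lat, lon):
    """One distance feature per landmark for a single apartment."""
    return {"dist_" + name: haversine_km(lat, lon, *coords)
            for name, coords in LANDMARKS.items()}

features = landmark_distances(40.75, -73.99)
```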
Apartment features (cat, dog, doorman, laundry in building …) were deduplicated and encoded using a 4-letter encoding scheme to reduce duplication further.
Furthermore, I used Sklearn's CountVectorizer to one-hot encode them and expose their frequency to the classifier.
The description field was one of my main focuses. I did:
TextBlob (unused in the end)
For price, number of bathrooms and bedrooms, I computed the usual combinations (price per room, etc.).
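
A small sketch of such ratio features. The feature names and the choice to return NaN when the divisor is zero (for example a studio with 0 bedrooms) are assumptions for illustration, not necessarily what the repository does.

```python
def price_ratios(price, bedrooms, bathrooms):
    """Ratio features; NaN marks ratios that are undefined for this listing."""
    nan = float("nan")
    rooms = bedrooms + bathrooms
    return {
        "price_per_bedroom": price / bedrooms if bedrooms else nan,
        "price_per_bathroom": price / bathrooms if bathrooms else nan,
        "price_per_room": price / rooms if rooms else nan,
    }

two_bed = price_ratios(3000, 2, 1)
studio = price_ratios(2000, 0, 1)
```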
Address, manager, building id were numerically encoded.
Furthermore, for manager and building id, various other encoding schemes were tested (Bayesian target label encoding, low/mid/high interest count from the Kaggle Forum, manager skill and building hype).
In the end, after multiple leaks on cross-validation, I simply binned managers/buildings by their frequency (top 1%, 2%, 5% …).
This way target labels are not used, which ensures there is no leak, and performance seemed to be similar to Bayesian encoding.
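
The frequency binning can be sketched as follows. This version assumes "top p%" means the p% of distinct ids with the most listings, which is one possible reading; the function name is hypothetical, and which rows the counts are computed on (train only, or train and test together) is left open here. Only the ids are used, never the target.

```python
import math
from collections import Counter

def top_percent_flags(ids, percents=(1, 2, 5, 10, 15, 20, 25, 30, 50), name="manager"):
    """For each row, flag whether its id belongs to the top p% most frequent ids."""
    ranked = [key for key, _ in Counter(ids).most_common()]  # most frequent first
    top_sets = {}
    for p in percents:
        n_top = max(1, math.ceil(len(ranked) * p / 100))  # keep at least one id
        top_sets[p] = set(ranked[:n_top])
    return [
        {"top_%d_%s" % (p, name): int(i in top_sets[p]) for p in percents}
        for i in ids
    ]

# Toy data: manager "a" has 5 listings, "b" has 3, "c" and "d" have 1 each.
manager_ids = ["a"] * 5 + ["b"] * 3 + ["c"] + ["d"]
flags = top_percent_flags(manager_ids)
```

With four distinct managers in this toy example, the top 50% bin contains "a" and "b", while the smaller bins contain only "a".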
Outliers detected in the test set were corrected (117 bathrooms :O).
Prices > 13000 were clipped.
Like many others, I didn’t process the images at all, besides using the magic leak (folder creation time).
The biggest issue was that the number of images per apartment was irregular: some had floor plans, others had furniture, others had nothing.
I did extract metadata from the images to add resolution, image height and width to my model.
Unfortunately the json file was 800MB, or 1.4GB in CSV, with thousands of sparse columns. Pandas couldn’t load that on my machine. The workarounds would be to (a) buy more RAM or (b) use a dictionary structure, but it was clunky and time-consuming.
Example metadata are available in my 000_Data_Exploration.ipynb notebook.
I ran early into scalability issues and cross-validation issues with
In Sklearn, you can use Pipelines to apply modifications on the train and test set independently, but it’s not trivial to use pipelines on a validation set (split from the train set) that you will use as input for LightGBM early stopping.
Furthermore, most features are not inherently leaky and do not need to be recomputed for each fold as
Sklearn has no caching framework
I wrote my own code so that adding each feature is easy and independent; check the star_command.py pipe function.
Now each transformation can be applied with:
# Feature extraction - sequence of transformations
tr_pipeline = feat_extraction_pipe(
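
The idea of chaining independent transformations can be sketched as below. The name `feat_extraction_pipe` comes from the snippet above, but its signature and behavior here are assumptions, not the implementation in star_command.py. The two transformations (`clip_price`, `add_total_rooms`) and the list-of-dicts row format are hypothetical and only serve the example.

```python
def feat_extraction_pipe(*transforms):
    """Compose transformations left to right.

    Each transformation takes (train, test) and returns new (train, test).
    """
    def pipeline(train, test):
        for transform in transforms:
            train, test = transform(train, test)
        return train, test
    return pipeline

def clip_price(train, test, cap=13000):
    """Cap prices at `cap` in both sets without mutating the inputs."""
    def clip(rows):
        return [dict(row, price=min(row["price"], cap)) for row in rows]
    return clip(train), clip(test)

def add_total_rooms(train, test):
    """Add a `rooms` column equal to bedrooms + bathrooms."""
    def add(rows):
        return [dict(row, rooms=row["bedrooms"] + row["bathrooms"]) for row in rows]
    return add(train), add(test)

# Feature extraction - sequence of transformations
tr_pipeline = feat_extraction_pipe(
    clip_price,
    add_total_rooms,  # commenting out this line disables the feature
)

train_out, test_out = tr_pipeline(
    [{"price": 20000, "bedrooms": 2, "bathrooms": 1}],
    [{"price": 2500, "bedrooms": 0, "bathrooms": 1}],
)
```

Because every step has the same (train, test) in, (train, test) out shape, steps can be added, removed or reordered without touching the others.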
Feature selection was done the same way, with a framework that can deal with dataframes and sparse arrays. There is even a glimpse of feature selection on multiple processes, but it was slower due to Python’s Global Interpreter Lock.
Each feature can be chained with Scikit-learn’s transformers like TfIdf or PCA.
Multiple features can be declared at the same time.
select_feat = [
TruncatedSVD(2), # 2 or 3
# Normalizer(copy=False) # Not needed for trees ensemble and Leaky on CV
# TfidfVectorizer(tokenizer=identity, preprocessor=None, lowercase=False)]
("description", CountVectorizer(vocabulary=vocab_metro_lines,binary=True, lowercase=False)),
(['top_' + str(p) + '_manager' for p in [1,2,5,10,15,20,25,30,50]], None),
(['top_' + str(p) + '_building' for p in [1,2,5,10,15,20,25,30,50]], None),
Each transformation can be cached in a “database” with shelve and retrieved easily with a key. See
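
As a minimal illustration of key-based caching with the standard library's shelve module (not the repository code), the helper below computes a value only when its key is missing from the shelf. The names `cached_transform` and `expensive_feature` and the key `"squares_v1"` are hypothetical; values stored this way must be picklable.

```python
import os
import shelve
import tempfile

def cached_transform(db_path, key, compute):
    """Return the value stored under `key`, computing and storing it on a miss."""
    with shelve.open(db_path) as db:
        if key not in db:
            db[key] = compute()
        return db[key]

calls = []

def expensive_feature():
    """Stand-in for a slow feature computation; records each real call."""
    calls.append(1)
    return [x * x for x in range(5)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature_cache")
    first = cached_transform(path, "squares_v1", expensive_feature)   # computed
    second = cached_transform(path, "squares_v1", expensive_feature)  # read from the shelf
```

Putting a version suffix in the key is one simple way to invalidate the cache when a transformation changes.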
And finally I wrote my own cross-validation and out of fold prediction code.
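
For readers unfamiliar with out-of-fold prediction, here is a generic standard-library sketch of the technique, not the code from the repository. Each row is predicted by a model fitted on the other folds only. The "model" is a deliberately trivial class-prior estimator standing in for LightGBM, and all function names are hypothetical.

```python
import random

def kfold_indices(n_rows, n_folds, seed=0):
    """Shuffle row indices and deal them into n_folds disjoint folds."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    return [idx[i::n_folds] for i in range(n_folds)]

def fit_class_prior(labels, classes):
    """Placeholder model: predicted probabilities are the class frequencies."""
    return [labels.count(c) / len(labels) for c in classes]

def out_of_fold_predictions(labels, classes, n_folds=5, seed=0):
    """Predict every row with a model that never saw that row's fold."""
    oof = [None] * len(labels)
    for fold in kfold_indices(len(labels), n_folds, seed):
        held_out = set(fold)
        train_labels = [y for i, y in enumerate(labels) if i not in held_out]
        prior = fit_class_prior(train_labels, classes)
        for i in fold:
            oof[i] = prior
    return oof

classes = ["low", "medium", "high"]
labels = ["low"] * 6 + ["medium"] * 3 + ["high"]
oof = out_of_fold_predictions(labels, classes)
```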
Thank you for your attention
Also published on Medium.
Source URL: https://andre-ratsimbazafy.com/predicting-apartment-interest/
Copyright ©2017 Marie & Mamy's Insights unless otherwise noted.