Predicting apartment interest from listings with structured data, free text, geolocation, time data and images

by Mamy Ratsimbazafy | April 26, 2017 7:59 pm

Three months ago, the data science competition website Kaggle published a challenge from Two Sigma, an investment fund, and Renthop, a startup they invested in.

The goal was to predict interest in New York apartments listed on Renthop’s website using a very rich dataset with:

The challenge[1] just finished and I want to share my approach. The code is on my GitHub[2].

Overview of my solution

This solution features Gradient Boosted Trees (XGBoost and LightGBM) and does not use stacking, due to lack of time.

Feature engineering

Features can be activated and deactivated by a single comment in

Time features

From the datetime field I created several features:

Furthermore, day, month, hour are cyclical.
To tell the classifier that after Sunday (day 6) comes Monday (day 0), I projected the time information onto a circle by taking the cos and sin.
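Here is a minimal sketch of that circular projection, using only the standard library. The function name `encode_cyclical` is made up for illustration and is not taken from my repository.

```python
import math

def encode_cyclical(value, period):
    """Project a cyclical value (e.g. weekday 0-6 with period 7) onto the unit circle."""
    angle = 2 * math.pi * value / period
    return math.cos(angle), math.sin(angle)

# Weekdays use Monday = 0 ... Sunday = 6, as in the text above.
monday = encode_cyclical(0, 7)
thursday = encode_cyclical(3, 7)
sunday = encode_cyclical(6, 7)

# On the circle, Sunday ends up closer to Monday than Thursday does,
# which a raw 0-6 integer encoding cannot express.
print(math.dist(sunday, monday) < math.dist(thursday, monday))
```

The same function works for hours (period 24) or months (period 12); each cyclical field becomes two columns, one for the cos and one for the sin.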

Geo-localization features

From the latitude and longitude, I created clusters using Density-based clustering ([HDBSCAN](

I would have preferred DBSCAN with epsilon set to 200 meters, but unfortunately Scikit-learn’s DBSCAN is not properly optimized: computing pairwise haversine distances for 40000 (train set) or 70000 (test set) points goes KABOOM on my memory.
(HDBSCAN creates clusters fully automatically from density, but NYC is too dense.)

From the public kernels I’ve also taken the coordinates of Central Park, Brooklyn, Queens … to compute the distance of each apartment from those centers.
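A distance-to-landmark feature of this kind could be sketched as follows. The landmark coordinates below are rough, illustrative values that I am assuming here, not the ones from the public kernels, and the names `haversine_km` and `landmark_distances` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

# Approximate coordinates, for illustration only (assumption).
LANDMARKS = {
    "central_park": (40.7829, -73.9654),
    "brooklyn": (40.6782, -73.9442),
    "queens": (40.7282, -73.7949),
}

def landmark_distances(lat, lon):
    """Return one distance feature per landmark for a single apartment."""
    return {
        f"dist_{name}_km": haversine_km(lat, lon, *coords)
        for name, coords in LANDMARKS.items()
    }

# Made-up apartment location.
print(landmark_distances(40.75, -73.98))
```

Each apartment gets one numeric column per landmark, which lets the trees split on "how far from Central Park" without needing the clusters.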

Apartment features

Apartment features (cat, dog, doorman, laundry in building …) were deduplicated and encoded using a 4-letter encoding scheme to reduce duplication further.
I then used Sklearn’s CountVectorizer to one-hot encode them and expose their frequency to the classifier.

Description features (NLP / Text-mining)

The description field was one of my main focuses. I did:

Categorical features

For price and the number of bathrooms and bedrooms, I built the usual combinations (price per room, etc.).
Address, manager, building id were numerically encoded.

Furthermore, for manager and building id, various other encoding schemes were tested (Bayesian target label encoding, low/mid/high interest count from the Kaggle Forum, manager skill and building hype).

In the end, after multiple leaks on cross-validation, I simply binned managers and buildings by their frequency (top 1%, 2%, 5% …).
This way target labels are not used, so there is no leak, and performance seemed to be similar to Bayesian encoding.
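As a rough sketch of that frequency binning: rank the ids by how often they appear and give each one the smallest "top x%" bucket it fits in. The bucket boundaries, the label names and the function `frequency_bins` are assumptions for illustration; only the listing counts are used, never the target.

```python
import math
from collections import Counter

def frequency_bins(ids, top_fractions=(0.01, 0.02, 0.05, 0.10)):
    """Map each id to the smallest 'top x%' bucket it falls in, ranked by count."""
    counts = Counter(ids)
    ranked = [key for key, _ in counts.most_common()]  # most frequent first
    n = len(ranked)
    bins = {}
    for rank, key in enumerate(ranked):
        label = "rest"  # everything outside the listed fractions
        for frac in top_fractions:
            # Number of ids allowed in this bucket (at least one).
            cutoff = max(1, math.ceil(round(frac * n, 9)))
            if rank < cutoff:
                label = f"top_{round(frac * 100)}pct"
                break
        bins[key] = label
    return bins

# Made-up data: 100 managers, manager "m0" has the most listings.
manager_ids = [f"m{i}" for i in range(100) for _ in range(100 - i)]
bins = frequency_bins(manager_ids)
print(bins["m0"], bins["m1"], bins["m4"], bins["m50"])
```

Note that the buckets here are exclusive: an id in the top 1% is labeled only `top_1pct`, not also `top_2pct`. The resulting label is an ordinary categorical feature that can be computed once on train and test together.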

Outliers removal

Outliers detected in the test set were corrected (117 bathrooms :O)
Prices > 13000 were clipped
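For reference, clipping here is taken to mean capping the value at the threshold rather than dropping the listing, which is my assumption about the wording. A tiny sketch with made-up prices and a hypothetical helper `clip_price`:

```python
PRICE_CAP = 13000  # threshold mentioned above

def clip_price(price, cap=PRICE_CAP):
    """Cap a price at `cap`; the listing itself is kept."""
    return min(price, cap)

prices = [2500, 4200, 13000, 111111]  # made-up example values
clipped = [clip_price(p) for p in prices]
print(clipped)
```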


Like many others, I didn’t process the images at all, besides using the magic leak (folder creation time).
The biggest issue was that the number of images per apartment was irregular: some had floor plans, others had furniture, others had nothing.

I did extract metadata from the images to add resolution, image height and width to my model.
Unfortunately the JSON file was 800MB, or 1.4GB as CSV, with thousands of sparse columns. Pandas couldn’t load that on my machine. The workarounds would have been to (a) buy more RAM or (b) use a dictionary structure, but that was clunky and time-consuming.

Example metadata are available in my 000_Data_Exploration.ipynb notebook.

Overview of the architecture

I ran into scalability and cross-validation issues with Scikit-Learn early on.

In Sklearn, you can use Pipelines to apply modifications on the train and test set independently,
but it’s not trivial to use pipelines on a validation set (split from the train set) that you will use as input for XGBoost or LightGBM early stopping.
Furthermore, most features are not inherently leaky and do not need to be recomputed for each fold as Sklearn does.
Lastly, Sklearn has no caching framework.


I wrote my own code so that adding each feature is easy and independent; check the pipe function.
Now each transformation can be applied with:


Feature selection was done the same way, with a framework that can deal with both dataframes and sparse arrays. There is even a glimpse of feature selection on multiple processes, but it was slower due to Python’s Global Interpreter Lock.
Each feature can be chained with Scikit’s transformers like TfIdf or PCA.
Multiple features can be declared at the same time.


Each transformation can be cached in a “database” with shelve and retrieved easily with a key. See
And finally, I wrote my own cross-validation and out-of-fold prediction code.
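To give an idea of the shelve-based caching mentioned above, here is a minimal sketch of the pattern. It is not the code from my repository: the helper `cached`, the key name and the toy feature function are all made up for illustration.

```python
import os
import shelve
import tempfile

def cached(db_path, key, compute):
    """Return the value stored under `key`, computing and storing it on a cache miss."""
    with shelve.open(db_path) as db:
        if key not in db:
            db[key] = compute()  # the value must be picklable
        return db[key]

calls = []  # records how many times the expensive function really runs

def expensive_feature():
    calls.append(1)
    # Stand-in for a slow transformation: word lengths of a toy description.
    return [len(word) for word in "no fee sunny two bedroom".split()]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "features")
    first = cached(path, "desc_word_lengths", expensive_feature)   # computed
    second = cached(path, "desc_word_lengths", expensive_feature)  # read from disk

print(first == second, len(calls))
```

The key has to change whenever the transformation or its inputs change, otherwise a stale result is returned; a real setup would use a persistent path instead of a temporary directory.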

Thank you for your attention


Also published on Medium[3].

  1. The challenge:
  2. github:
  3. Medium:
