Labeling neighborhoods with the KNN algorithm

Gabriel Eduardo de Lima Machado
4 min read · Mar 27, 2020

This is the final project for the Coursera-IBM Data Science capstone. Its main purpose is to practice and apply the skills learned along the course. In this project we will try to build a recommendation system that can find the most similar neighborhood in one city given a neighborhood in another city, using data about the venues in these cities.

Introduction

Imagine that person A wants to visit a new city. Since this city is very big, she does not have the time to look up all the places she could go and visit, and she doesn’t even know exactly what she wants to see there. However, person A has been to a lot of places and she knows which ones she liked. With that in mind, can we build a model, based on some data, that is capable of predicting the places she would probably like to see?

I used the data provided by the Foursquare API to query venues from Toronto and Manhattan; using the categories of these venues (whether they are restaurants, gyms, and so on), we can measure the similarity between neighborhoods.
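As a rough sketch, a query against the Foursquare v2 venues/explore endpoint could look like this (the credentials, coordinates, and the helper name are placeholders, not the exact code used in the project):

```python
import requests

# Placeholder credentials -- replace with your own Foursquare keys
CLIENT_ID = 'your_client_id'
CLIENT_SECRET = 'your_client_secret'
VERSION = '20200327'  # API version date

def get_nearby_venues(lat, lng, radius=500, limit=100):
    """Query the Foursquare venues/explore endpoint around a point
    and return a list of (venue name, category name) tuples."""
    url = 'https://api.foursquare.com/v2/venues/explore'
    params = {
        'client_id': CLIENT_ID,
        'client_secret': CLIENT_SECRET,
        'v': VERSION,
        'll': f'{lat},{lng}',
        'radius': radius,
        'limit': limit,
    }
    items = requests.get(url, params=params).json()['response']['groups'][0]['items']
    return [(i['venue']['name'], i['venue']['categories'][0]['name'])
            for i in items]
```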

Methodology

I put the venue data into two dataframes, one for Toronto and the other for Manhattan. Looking at the graph above, we can see the frequency of the 15 most common venue categories in both cities, and we can guess that the two distributions are correlated. This may indicate that, in the end, the two cities are not that different.
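For reference, a chart like that could be built along these lines, assuming each city’s dataframe has one row per venue and a 'Venue Category' column (the column and variable names are assumptions):

```python
import pandas as pd

# Relative frequency of each category in each city
freq = pd.DataFrame({
    'Toronto': toronto_venues['Venue Category'].value_counts(normalize=True),
    'Manhattan': manhattan_venues['Venue Category'].value_counts(normalize=True),
}).fillna(0)

# Keep the 15 most common categories overall and plot them side by side
top15 = freq.loc[freq.sum(axis=1).nlargest(15).index]
top15.plot.bar(figsize=(12, 5))
```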

I removed the venues whose categories didn’t exist in both cities and then transformed the category column into a set of dummy columns. The KNN implementation in sklearn doesn’t handle categorical variables, so it is necessary to turn them into numerical ones. After that I grouped all the venues by neighborhood and took the mean of every category column; these new dataframes are then ready to be used in the KNN algorithm.
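A sketch of that preprocessing, assuming the venue dataframes have 'Venue Category' and 'Neighborhood' columns (the column names are assumptions):

```python
import pandas as pd

# Keep only the categories that appear in both cities
common = set(toronto_venues['Venue Category']) & set(manhattan_venues['Venue Category'])
toronto_venues = toronto_venues[toronto_venues['Venue Category'].isin(common)]
manhattan_venues = manhattan_venues[manhattan_venues['Venue Category'].isin(common)]

def to_features(venues):
    """One-hot encode the categories and average them per neighborhood,
    yielding one row of category frequencies per neighborhood."""
    dummies = pd.get_dummies(venues['Venue Category'])
    dummies['Neighborhood'] = venues['Neighborhood']
    return dummies.groupby('Neighborhood').mean()

toronto_grouped = to_features(toronto_venues)
manhattan_grouped = to_features(manhattan_venues)
```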

KNN algorithm

KNN is a supervised learning algorithm, which means it must be provided with labeled data. It assigns to a sample the class that is most common among its K nearest neighbors, which is where its name comes from. Here the features were the mean category frequencies and the labels were simply the neighborhoods. I decided to train it on the neighborhoods of Manhattan and use it to label the ones from Toronto; it could have been the other way around, but this is how I did it.

The samples are compared by Euclidean distance, where the dimensions are the categories, and I had to choose exactly one neighbor (K=1), because of course there is just one sample of each neighborhood (they are unique). Running the algorithm, we can then assign a Manhattan label to each Toronto neighborhood. It is natural for different neighborhoods in Toronto to be classified as the same one in Manhattan. Here is a bar chart showing how many times each neighborhood in Manhattan was assigned.
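A minimal sketch of this step, using the grouped dataframes from above (manhattan_grouped and toronto_grouped are my placeholder names, indexed by neighborhood):

```python
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier

# Each Manhattan neighborhood is one sample and one class, so K must
# be 1; the default metric (minkowski with p=2) is Euclidean distance
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(manhattan_grouped.values, manhattan_grouped.index)

# Align the columns, then label each Toronto neighborhood with its
# nearest Manhattan neighborhood
toronto_labels = knn.predict(toronto_grouped[manhattan_grouped.columns].values)

# Count how many times each Manhattan neighborhood was assigned
pd.Series(toronto_labels).value_counts().plot.bar(figsize=(12, 5))
```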

We can see that ‘Upper West Side’ was the most frequently assigned neighborhood. Let’s try to see why.

Analysis

We can see here the most common venues in Upper West Side. Below is another chart showing the venue frequencies of both Upper West Side and Adelaide (which was labeled as similar to Upper West Side).

It is kind of messy, but we can try to spot some pattern in the bars.
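For what it’s worth, a side-by-side chart like that one could be drawn from the grouped dataframes sketched above (the index labels are assumptions about how the neighborhoods are named in the data):

```python
import pandas as pd

# Pull the category-frequency rows for the two matched neighborhoods
pair = pd.concat({
    'Upper West Side': manhattan_grouped.loc['Upper West Side'],
    'Adelaide': toronto_grouped.loc['Adelaide'],
}, axis=1)

# Plot the ten categories with the highest combined frequency
pair.loc[pair.sum(axis=1).nlargest(10).index].plot.bar(figsize=(12, 5))
```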

Conclusions

It is hard to discuss results from this kind of experiment, because the similarity between neighborhoods can be very subjective. What matters most in this case is the idea: with more data and more features, I believe this is a good starting point for a recommendation system.
