A walk through Boston’s Airbnb data

Airbnb has made renting out your house, apartment, room (or even boat!) more accessible to a lot of people. Boston is a vibrant city with a lot of visitors and travelers, so let’s see what we can tell about Boston given some Airbnb data.

For this walkthrough, I am using the Boston Airbnb data found on Kaggle. The data I used has 3585 samples (rows) and 95 features (columns). Each sample is a specific listing on Boston Airbnb, and each of the 95 features describes something about that listing, for example its location, the host’s response rate, its price, etc.

Before we start, let’s walk through some data preprocessing.

I removed all features that contain URLs and/or IDs, since they won’t be helpful for this analysis. I also removed other irrelevant features and features that I found redundant (for example, zipcode and neighborhood convey almost the same information, so I kept neighborhood and dropped zipcode). Finally, I removed any feature with more than 30% missing values. After all these removals I ended up with 27 features.
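As a rough sketch of the last step (dropping features with more than 30% missing values), something like the following would work with pandas. The helper name drop_sparse_columns and the toy column names are made up for illustration and are not taken from the original notebook.

```python
import numpy as np
import pandas as pd


def drop_sparse_columns(df: pd.DataFrame, max_missing: float = 0.30) -> pd.DataFrame:
    """Keep only the columns whose fraction of missing values is at most max_missing."""
    missing_fraction = df.isna().mean()  # per-column share of NaN values
    return df.loc[:, missing_fraction <= max_missing]


# Tiny made-up frame (hypothetical columns, not the real Kaggle data).
toy = pd.DataFrame({
    "price": [100.0, 150.0, 80.0, 120.0],
    "square_feet": [np.nan, np.nan, 500.0, np.nan],  # 75% missing, gets dropped
    "bedrooms": [1.0, 2.0, np.nan, 1.0],             # 25% missing, is kept
})

cleaned = drop_sparse_columns(toy)
print(list(cleaned.columns))
```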

I then plotted the price distribution to see if there are any outliers, and indeed the plot (shown below on the left) suggests that there are. So I decided to keep only data points with prices between the 5th and 95th percentiles and drop everything outside that range, which removed 353 samples. I then replotted the distribution (shown below on the right). The new plot suggests fewer outliers are present, so I went ahead and used the remaining 3232 samples.
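The percentile filter can be sketched as below. This is only one way to do it: the function name trim_price_outliers is hypothetical, and treating both bounds as inclusive is my assumption, since the exact boundary handling isn't spelled out above.

```python
import pandas as pd


def trim_price_outliers(df: pd.DataFrame, column: str = "price",
                        lower: float = 0.05, upper: float = 0.95) -> pd.DataFrame:
    """Keep rows whose value in `column` lies between the two quantiles (inclusive)."""
    low, high = df[column].quantile([lower, upper])
    return df[df[column].between(low, high)]


# Made-up prices 1..100, just to show the effect of the filter.
toy = pd.DataFrame({"price": range(1, 101)})
trimmed = trim_price_outliers(toy)
print(len(toy), "->", len(trimmed))
```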

Now let’s start!

First, a walk through Boston’s neighborhoods: which neighborhoods have the highest and lowest prices, and why?

To answer this question, I grouped the listings by neighborhood, computed the average listing price in each neighborhood (the sum of the prices divided by the number of listings), and sorted the averages to get the plot below.
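In pandas, this kind of aggregation is a one-liner with groupby. The sketch below uses a tiny made-up table, and the column names neighborhood and price are placeholders rather than the dataset’s actual column names.

```python
import pandas as pd

# Made-up listings; "A", "B", "C" stand in for real neighborhoods.
listings = pd.DataFrame({
    "neighborhood": ["A", "A", "B", "B", "C"],
    "price": [200.0, 100.0, 80.0, 120.0, 300.0],
})

# Average price per neighborhood, sorted from most to least expensive.
avg_price = (
    listings.groupby("neighborhood")["price"]
    .mean()
    .sort_values(ascending=False)
)
print(avg_price)
```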

It seems like the Financial District is at the top when it comes to average listing price, and Chestnut Hill is at the bottom. But let’s take a deeper look into the types of listings found in those neighborhoods. First, breaking the prices down by room type and keeping the same neighborhood order as in the previous plot, we get the figure below.

People in both the Financial District and Chestnut Hill seem to be renting out entire homes/apartments or private rooms, and the average prices in both categories differ greatly between the two neighborhoods. But the most expensive average price for an entire home/apartment belongs to Brookline, and for private rooms it’s Downtown Crossing. Shared rooms don’t seem to be very prevalent in any of Boston’s neighborhoods, and their average prices are very close to each other.
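A breakdown like this (and the property-type one that follows) can be sketched with a pivot table: one row per neighborhood, one column per category, and the average price in each cell. The data and column names below are invented for illustration.

```python
import pandas as pd

# Made-up listings with a hypothetical room_type column.
listings = pd.DataFrame({
    "neighborhood": ["A", "A", "A", "B", "B", "C"],
    "room_type": ["Entire home/apt", "Entire home/apt", "Private room",
                  "Entire home/apt", "Private room", "Private room"],
    "price": [200.0, 300.0, 90.0, 150.0, 120.0, 60.0],
})

# Average price for every (neighborhood, room type) pair; missing pairs become NaN.
table = listings.pivot_table(index="neighborhood", columns="room_type",
                             values="price", aggfunc="mean")

# Neighborhood with the highest average price for each room type.
top = table.idxmax()
print(table)
print(top)
```

Swapping room_type for a property-type column gives the second breakdown with no other changes.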

Now let’s move to analyzing our data by property type.

Boston has everything to offer from apartments to boats!

The Financial District and Chestnut Hill keep their places as the highest and lowest when it comes to average apartment price. The Theater District has the highest average price for houses and lofts, Beacon Hill has the highest for townhouses, and for villas it’s Downtown. And if you fancy renting a boat, you can go to either East Boston or the North End, which have very similar average prices.

Second, a walk through Boston’s Airbnb prices: can we make accurate price predictions using our data?

In the dataset, we have access to other features besides the neighborhood that can help in predicting the price of a listing. Some features are categorical, like the cancellation policy and the property type, and some are quantitative, like the review score value and the host response rate.

We have a total of 12 categorical features, which I chose to encode with dummy variables. This led to a total of 71 features after the encoding.
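For readers who haven’t seen dummy encoding before, here is a minimal sketch using pandas.get_dummies on a made-up frame. The column names and values are illustrative, and I am assuming one indicator column per category level (no level dropped), which may differ from the exact setup used for the 71 features.

```python
import pandas as pd

# Made-up frame with two categorical columns and one quantitative column.
toy = pd.DataFrame({
    "cancellation_policy": ["flexible", "strict", "moderate", "strict"],
    "property_type": ["Apartment", "Boat", "Apartment", "Apartment"],
    "review_scores_value": [9.0, 10.0, 8.0, 9.0],
})

# Each categorical column is replaced by one 0/1 column per distinct value.
encoded = pd.get_dummies(toy, columns=["cancellation_policy", "property_type"])
print(list(encoded.columns))
```

Here 3 cancellation policies and 2 property types turn 2 categorical columns into 5 indicator columns, which is how a dozen categorical features can grow into several dozen encoded ones.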

For the quantitative features, we have a total of 14 features.

With the price as our target variable and a total of 85 features (71 encoded categorical plus 14 quantitative), we begin by dropping any row containing a missing value, which leaves us with 2314 samples. We then split the data into training and testing sets with a 70%/30% ratio, and train a linear regression model on 5 different splits. The average R2 score on the testing data across those splits was 0.675, which is a reasonably good value (the best possible R2 score is 1).
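The evaluation loop can be sketched roughly as follows with scikit-learn. The data here is synthetic (random features with a known linear relationship plus noise), so the score it prints has nothing to do with the 0.675 reported above. The helper mean_test_r2 and the choice of using seeds 0 to 4 for the five splits are my own assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def mean_test_r2(X, y, model, n_splits=5, test_size=0.30):
    """Average test-set R2 over several random train/test splits."""
    scores = []
    for seed in range(n_splits):
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        model.fit(X_train, y_train)
        scores.append(r2_score(y_test, model.predict(X_test)))
    return float(np.mean(scores))


# Synthetic stand-in for the listings data (not the real features).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([50.0, 20.0, -10.0]) + 150.0 + rng.normal(scale=5.0, size=200)

score = mean_test_r2(X, y, LinearRegression())
print(round(score, 3))
```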

I also plotted the true values vs. the predictions for 4 of those splits to visualize the results better, as shown in the figure below.

Next, I moved to ridge regression, which is similar to linear regression but uses regularization to prevent overfitting. I plotted the average R2 score (averaged across 5 splits on the testing data) for different values of the regularization parameter alpha.

The best R2 score was 0.677, which happened for a regularization value of alpha = 1. This is very close to the average R2 score we got from linear regression, so we didn’t gain much from regularization.
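A sweep over alpha like the one described can be sketched with scikit-learn’s Ridge and ShuffleSplit. The alpha grid and the synthetic data below are assumptions for illustration only, so the best alpha found here says nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in data: 5 features, two of which carry no signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X @ np.array([50.0, 20.0, -10.0, 0.0, 0.0]) + 150.0
     + rng.normal(scale=20.0, size=200))

# Five random 70/30 splits, reused for every alpha so the scores are comparable.
cv = ShuffleSplit(n_splits=5, test_size=0.30, random_state=0)

alphas = [0.01, 0.1, 1.0, 10.0, 100.0]  # hypothetical grid
scores = {
    alpha: cross_val_score(Ridge(alpha=alpha), X, y, cv=cv, scoring="r2").mean()
    for alpha in alphas
}

best_alpha = max(scores, key=scores.get)
print(best_alpha, round(scores[best_alpha], 3))
```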

Do we gain anything by imputing the missing data instead of dropping it? To answer this, I repeated the above experiments, imputing the missing values with the column means instead of dropping the rows that contain them. For linear regression the average R2 score was 0.609, and for ridge regression the best value was R2 = 0.613, again at alpha = 1. Thus in both cases the R2 score deteriorated when using column-mean imputation instead of dropping the rows.
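The two ways of handling missing values compared here look roughly like this in pandas. The frame is made up, and computing the means over the whole table is a simplification on my part, since the text doesn’t say whether the means came from the training portion only.

```python
import numpy as np
import pandas as pd

# Made-up quantitative features with a couple of gaps.
toy = pd.DataFrame({
    "bedrooms": [1.0, 2.0, np.nan, 3.0],
    "review_scores_value": [9.0, np.nan, 8.0, 10.0],
    "price": [100.0, 150.0, 80.0, 200.0],
})

# Option 1: drop every row that has at least one missing value.
dropped = toy.dropna()

# Option 2: keep all rows and fill each gap with its column's mean.
imputed = toy.fillna(toy.mean())

print(len(dropped), len(imputed))
```

Imputation keeps more rows, but the filled-in values carry no real information about those listings, which is one plausible reason the scores did not improve.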

Third, a walk through potential factors that affect the listing price: which of the features we processed affect the price the most?

I already did some analysis on how the neighborhood and the property type affect the average price. But I wanted to use the price prediction model here to learn which features the model deemed important for predicting the price. To do this, I plotted the coefficients that the linear model learned when trained on all the available data. Since we have a total of 85 features, the plot is rather big and hard to interpret, so instead I picked only the features that have a coefficient of 25 or higher, and I plotted the results below.
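Pulling out the large coefficients can be sketched as below. The feature names and the noise-free synthetic target are invented so that the learned coefficients are easy to check, and the threshold of 25 is the one mentioned above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic features with hypothetical names (not the real encoded columns).
rng = np.random.default_rng(0)
X = pd.DataFrame(
    rng.normal(size=(100, 3)),
    columns=["feature_a", "feature_b", "feature_c"],
)
# Noise-free target, so the model recovers the coefficients 40, 5 and -30.
y = 40.0 * X["feature_a"] + 5.0 * X["feature_b"] - 30.0 * X["feature_c"] + 100.0

model = LinearRegression().fit(X, y)

# Pair each coefficient with its feature name and keep those at 25 or above.
coefs = pd.Series(model.coef_, index=X.columns)
large = coefs[coefs >= 25].sort_values(ascending=False)
print(large)
```

Note that a cutoff on the raw value keeps only features that push the price up, so a strongly negative coefficient (like feature_c here) is left out. Coefficient sizes also depend on the scale of each feature, so they are easiest to compare between features measured on similar scales, such as the 0/1 dummy variables.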

According to the linear regression model, having a property in Back Bay, Beacon Hill, Cambridge, Downtown Crossing, the Financial District, or the Leather District is associated with a higher predicted price. That’s also true if the property type is a boat or a guest house, or if you’re renting out an entire home/apartment and have more bedrooms.

Hope you enjoyed the walk!
Please post your feedback or comments :)

All code is available in my GitHub repo here.
