Grilled Lamb Burgers

Grilled Lamb Burgers. This is a simple recipe for BBQ amateur cooks to try it out. This was my first recipe, and I designed this recipe to understand the heating….

独家优惠奖金 100% 高达 1 BTC + 180 免费旋转

Machine Learning for your flat hunt

Part 1 — Visualisation of data, Preprocessing and a basic Prediction Model

Have you ever looked for a flat? Would you like to add some machine learning and make a process more interesting?

Today I will tell you about applying Machine Learning for finding an optimal flat.

First of all, I want to clarify this moment and explain what “an optimal flat” does mean. It is a flat with a set of different characteristics like “area”, “district”, “number of balconies” and so on. And for these features of the flat, we expect a specific price. Looks like a function which takes several parameters and returns a number. Or maybe a black box which provides some magic.

But… there is big “but”, sometimes you can face a flat which is overpriced because of a set of reasons like a good geo-position. Also, there are more prestigious districts in the center of a city and districts outside the town. Or… sometimes people want to sell their apartments because they move to another point of Earth. In other words, there are many factors which can affect the price. Does it sound familiar?

Before I continue, let me make a little lyrical digression.

I lived in the Yekaterinburg (the city between Europe and Asia, one of the cities which had held The Football World Championship in 2018) for 5 years.
I was in love with these concrete jungles. And I hated that city for Russian winter and public transport. It is a growing city and every month there are thousands and thousands of flats to be sold.

Yes, it is an overcrowded, polluted city. At the same time — it is a good place for analysing a real estate market. I received a lot of advertisements for flats, from the Internet. And I will use that information to a further extent.

Also, I tried to visualize different offers on Yekaterinburg’s map.

There are over 2 thousand 1-bedroom flat’s which had been sold in July of 2019 in Yekaterinburg. They had a different price, from less than a million to almost 14 million roubles.

Colour of points on the map represents the price, the lower is the price near to blue colour, the higher is the price near to red. You can consider it as an analogy with cold and warm colours, the warmer colour is the bigger is price. Please, remember that moment, the redder is the colour, the higher is a value of something. The same idea works for blue but in the direction of the lowest price.

Now you are having a general overview of the picture and time to analyse is coming.

What did I want when l lived in Yekaterinburg? I looked for a good enough flat, or if we talk in terms of ML — I wanted to build a model which will give me a recommendation about buying.
On the one hand, if a flat is overpriced, the model should recommend waiting for price decreasing by showing the expected price for that flat.
On the other hand — if a price is good enough, according to the market state — perhaps I should consider that offer.

Of course, there is nothing ideal and I was ready to accept a mistake in calculations. Usually, for this kind of task use mean error of prediction and I was ready to 10% error. For example, if you have 2–3 million Russian roubles, you can ignore mistake in 200–300 thousand, you can afford it. As it seemed to me.

As I mentioned before, there were a lot of apartments, let’s look at them closely.

2310 flats for one month, we could extract something useful from that. What about a general data overview?

There is not something extraordinary — longitude, latitude, price of a flat(the label “cost”), and so on. Yes, for that moment I used “cost” instead of “price”, I hope It will not lead to misunderstanding, please consider them as same.

Does every record have the same meaning? Some of them are represented flats like a cubicle, you can work there, but you would not like to live there. They are small cramped rooms, not a real flat. Let remove them.

Prediction price of flat comes from the oldest issues in economics and related fields. There was nothing related to the term “ML” and people tried to guess price based on square meters/feet.
So, we look at these columns/labels and try to get the distribution of them.

Well… there is nothing special, looks like a normal distribution. Perhaps we need to go deeper?

Oops… something wrong is there. Clean outliers in these fields and try to analyse our data again.

Well, now it looks better and a distribution seems quite good.

The label “year”, which is pointed at a year of construction should be transformed into something more informative. Let it be the age of building, in other words how a specific house is old.

Let have a look at the result.

There are all kinds of data, categorical, Nan-values, text-description and some geo-information (longitude and latitude). Let us put aside the last ones because on that stage they useless. We will back to them later.

Well, there are over twenty possible districts, could we add over 20 additional variables in our model? Of course, we could, but… should we? We are people and we could compare things, is not it?

First of all — not every district is equivalent to another. In the centre of city prices for one square meter is higher, further from downtown — it becomes to decrease. Does it sound logical? Could we use that?

Yes, definitely we could match any district with a specific coefficient and the further district is the cheaper flats are. After matching the city and using another web service map changed and has a similar view

I used the same idea as for flat’s visualization. The most “prestigious” and “expensive” district coloured in red and the least — blue. A colour temperature, do you remember about it?

Also, we ought to do some manipulation over our dataframe.

The same approach will be used for describing the internal quality of the flat. Sometimes it needs some repair, sometimes flat is quite well and ready for living. And in other cases, you should spend additional money on making it look better(to change taps, to paint walls). There also could be use coefficients.

By the way, about walls. Of course, it also influences to price of flat. Modern material is better than older, brick is better than concrete. Walls from wood is quite a controversial moment, perhaps it is a good choice for the countryside, but not so good for urban life.

We use the same approach as before, plus make a suggestion about rows we do not know anything. Yes, sometimes people do not provide all the information about their flat. But, based on history we can try to guess about the material of walls. In a specific period of time (for example period Khrushchev’s leading) — we know about typical material for building.

Also, there is information about the balcony. In my humble opinion — the balcony is a really useful thing, so I could not help myself considering it.

Unfortunately, there are some null values. If the author of an advertisement had checked information about it, we would have more realistic information.

Well, if there is no information it will mean “there is not a balcony”.

After that, we drop columns with information about the year of building(we have a good alternative for its). Also, we remove column with information about the type of building because it has a lot of NaN-values and I have not found any opportunity to fill these gaps. And we drop all rows with NaN which we have.

So… we used a not standard approach and replace categorical values to their numerical representation. And now we finished with a transformation of our data.

A part of the data has been dropped, but in general, it is a quite good dataset. Let look what happened with it.

Erm… it became very interesting.

Positive correlation

Total area — balconies. Why not? If our flat is big there will be a balcony.

Negative correlation

Total area — age. The newer is flat, the bigger is an area for living. Sound logical, new are more spacious flat than older ones.

Age — balcony. The older is flat the fewer balconies it has. Seem like a correlation through another variable. Perhaps it is a triangle Age-Balcony-Area where one variable has an implicit influence on another. Put that on hold for a time.

Age — district. The older flat is the big probability that will be placed in the more prestigious districts. Could it be related to higher price near the center?

Also, we could see the correlation between dependent and independent variables

The correlation with the dependent variable

Here we go…

The very strong correlation between the area of flat and price. If you want to have a bigger place for living it will require more money.

There is a negative correlation between pairs “age/cost” and “district/cost”. A flat in a newer house less affordable than the old one. And in the countryside flats are cheaper.

Anyhow, it seems clear and understandable, so I decided to go with it.

For tasks related to prediction flat’s price usually, use linear regression. According to significant correlation from a previous stage, we could try to use it as well. It is a workhorse which is suitable for many tasks.

Prepare our data for next actions

Also, we create some simple functions for prediction and evaluation of the result. Let do our first try to predict price!

The result of a prediction

Well… 76.67% of accuracy. Is it a big number or not? According to my point of view, it is not bad. Moreover, it is a good starting point. Of course, it is not ideal, and there is a potential for improvement.

At the same time — we tried to predict only one part of the data. What about applying the same strategy for other data? Yes, time for cross-validation.

Now we take another result. 73 is less than 76. But, it also a good candidate until a moment when we will have a better one. Also, it means that a linear regression works quite stable on our dataset.

And time for the last step. We will look at the best feature of linear regression — interpretability.

This family of models, in opposite to more complex ones, has a better ability to for understanding. There are just some numbers with coefficients and you can put your numbers in the equation, make some simple math and have a result.

Let try to interpret our model

The picture looks quite logical. Balcony/Walls/Area/Repair give a positive contribution to a flat price.

The further flat is the bigger a negative contribution. Also applicable for age. The older flat is the lower price will be.

So, it was a fascinating journey.
We started from the ground, use the untypical approach for data transformation based on the human point of view(numbers instead dummy variables), checked variables and their relation to each other. After that, we build our simple model, used cross-validation for testing its. And as the cherry on the cake — look at the internals of model, what gives us confidence about our way.

But! It is not the finish our journey but only a break. We will try to change our model in the future and maybe (just maybe) it increases the accuracy of prediction.

Thanks for reading

Grilled Lamb Burgers

Machine Learning for your flat hunt

Part 1 — Visualisation of data, Preprocessing and a basic Prediction Model

Add a comment

Related posts:

The Alternate Reality of the Dems Impeachment Gambit

Inspirational Sayings

1. Feeling Overwhelmed