
Machine Learning Fundamentals using Analogies

The intention of this article is to explain very basic concepts like precision, recall, training and testing sets, and cross validation using analogies, to make them easier for ML students and enthusiasts to understand. I aim to use analogies basic enough that no prior knowledge of ML or data science should be required to follow along.

Precision and Recall

Imagine it’s your birthday and you have to blow out the candles on top of your cake, but you have to leave the last one burning for good luck. If you blow out all the candles, you have 100% recall, as you successfully managed to blow out every candle you wanted to. However, in doing so, you also blew out the “good luck” candle. This means your precision is low, as you also took out a candle you weren’t supposed to. If instead you blew out only 5 of the candles on your cake, you have 100% precision, as every candle you blew out was one you were supposed to get. But you missed a fair chunk of the candles you were meant to get, so your recall will be horrible. This is a fundamental trade-off in ML.

For example, if you are writing fraud-detection code for a credit card company, you want to flag as many fraudulent swipes as possible. You might be okay with flagging some non-fraudulent swipes, because those customers can use another card, or call the company to make sure the charge goes through. In this situation, you want a very high recall, which might come at the cost of precision. In an alternate example, if you are working for YouTube and have to create a filter for videos that are PG-13, it would be okay to miss a few videos in that category, but allowing ones that weren’t safe could lead to a lot of backlash. Thus, you value precision over recall here.

Precision and recall are two of the most common performance measures of classification models.
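To make the candle analogy concrete, here is a minimal sketch of the two measures in Python. The function name and the candle setup are my own illustrations, not part of any library:

```python
def precision_recall(predicted, actual):
    """Compute precision and recall for a set of binary predictions.

    predicted: set of items the model flagged as positive
    actual:    set of items that are truly positive
    """
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Candles 0-9 sit on the cake; candle 9 is the "good luck" candle you must
# leave burning, so the real targets are candles 0-8.
targets = set(range(9))

# Blowing out everything (including candle 9): perfect recall, lower precision.
p, r = precision_recall(set(range(10)), targets)  # p == 0.9, r == 1.0

# Blowing out only 5 of the right candles: perfect precision, lower recall.
p, r = precision_recall(set(range(5)), targets)   # p == 1.0, r == 5/9
```

Note the symmetry: precision divides by what you predicted, recall by what you were supposed to find.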

Train and Test Sets

Imagine you are taking a Data Structures course, and your professor has just come up with 20 tests. All 20 of these tests have the same types of questions, with slight variations. He breaks them up into two sets: the ones he gives out for practice (the training set), and the ones he intends to use for the examination (the testing set). The practice set is available to you before the exam, so you can use it to learn how to take the test, and then apply the knowledge earned from practice during the exam. This is how training a model in ML works. The train set is used to design the model, and the test set is used to measure its performance. The latter is more of a black box, in that you are not allowed to change your model by looking at it.
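The professor's split can be sketched in a few lines of plain Python. This is an illustrative helper, not the scikit-learn function of the same name; the 25% test fraction is just one common choice:

```python
import random

def train_test_split(data, test_fraction=0.25, seed=42):
    """Shuffle the data and split it into a train set and a held-out test set."""
    shuffled = data[:]  # copy, so the original list is left untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

# The 20 tests the professor wrote, as in the analogy above.
tests = [f"test_{i}" for i in range(20)]
practice, exam = train_test_split(tests, test_fraction=0.25)
# len(practice) == 15, len(exam) == 5
```

Shuffling before splitting matters: if the data is ordered (say, by date), a naive head/tail split would train and test on systematically different examples.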

Cross Validation

Using the same analogy as above, imagine you are given 15 of the tests for practice. Cross validation is the concept of testing yourself before the exam, to improve your understanding of the material. Take the 15 tests you are given and divide them into two subsets: 12 for learning and 3 for practice. You use your learning set by keeping the tests and answer keys open, and figuring out how the professor came up with solutions to the problems. Once you think you have figured out how to get the answers, you test your knowledge on the practice subset (the cross validation set). Close the answer keys to these tests, and try to solve the problems. Once you have taken the practice tests, you look at the answer keys to make sure you got things right, and if you didn’t, you make adjustments to your understanding of the questions (tuning your hyperparameters).
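In practice this splitting is usually repeated so that every test gets a turn as the validation set, which is known as k-fold cross validation. A minimal sketch, assuming the 15 practice tests from the analogy and 5 folds of 3 tests each:

```python
def k_fold_splits(data, k=5):
    """Yield (learning, validation) pairs: each fold takes one turn as the
    held-out validation set while the remaining folds are used for learning."""
    fold_size = len(data) // k
    for i in range(k):
        validation = data[i * fold_size:(i + 1) * fold_size]
        learning = data[:i * fold_size] + data[(i + 1) * fold_size:]
        yield learning, validation

practice = [f"test_{i}" for i in range(15)]
for learning, validation in k_fold_splits(practice, k=5):
    # Every round: 12 tests for learning, 3 held out for validation.
    assert len(learning) == 12 and len(validation) == 3
```

Averaging your score across the folds gives a more stable estimate than a single 12/3 split, since no single lucky or unlucky fold dominates.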

Overfitting

Now imagine that the actual exam is nothing like the practice test, but you have become adept at taking that practice test. Given the practice test, your score is incredibly high, as you are very well trained on it, but when you take a different type of test, you fail. In other words, you are fit to the practice test, but don’t generalize well. This is the concept of overfitting.
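The extreme case of this failure to generalize is a student who simply memorizes the practice answers. A toy sketch, with made-up questions of my own for illustration:

```python
# A "memorizing student": looks up answers seen during practice, guesses otherwise.
practice_qa = {"2+2": "4", "3+3": "6", "5+5": "10"}

def memorizer(question):
    return practice_qa.get(question, "I don't know")

def score(model, exam):
    """Fraction of exam questions the model answers correctly."""
    return sum(model(q) == a for q, a in exam.items()) / len(exam)

# On the practice questions, the memorizer looks perfect...
train_score = score(memorizer, practice_qa)            # 1.0
# ...but on unseen questions of the same kind, it fails completely.
exam_qa = {"4+4": "8", "6+6": "12"}
test_score = score(memorizer, exam_qa)                 # 0.0
```

A large gap between training score and test score, as here, is the usual symptom of overfitting.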

Splitting the data into train and test sets is usually one of the first steps of building a machine learning model. Cross validation is a very commonly used technique for optimizing hyperparameters and confirming that your model is well trained. Overfitting is a common problem in model training, and regularization is one way to tackle it.
