14.2 K-Nearest Neighbour
Nearest Neighbour classification (Quick Introduction)
Idea: Things that are alike are likely to have similar properties.
ML uses this principle to classify data by placing it in the same category as its similar, or “nearest”, neighbours.
Computers apply a human-like ability to recall past experiences in order to draw conclusions about current circumstances.
Nearest neighbour methods are based on a simple idea, yet they can be extremely powerful.
In general, nearest neighbour methods are well suited for classification tasks.
If a concept is difficult to define, but you know it when you see it, then nearest neighbours might be appropriate.
If the data is noisy and thus no clear distinction exists among the groups, nearest neighbour algorithms may struggle to identify the class boundaries.
K-nearest neighbour (KNN) is a very simple algorithm in which each observation is predicted based on its “similarity” to other observations.
The letter k is a variable term, implying that any number of nearest neighbours could be used.
After choosing k, the algorithm requires a training dataset made up of examples that have been classified into several categories, as labelled by a nominal variable.
Unlike most methods in ML, KNN is a memory-based algorithm and cannot be summarized by a closed-form model.
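Because KNN is memory-based, "training" amounts to storing the labelled examples; prediction searches that stored data directly. A minimal sketch in pure Python (all names here are illustrative, not from the text):

```python
import math
from collections import Counter

def euclidean(p, q):
    """Straight-line distance between two feature vectors."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest
    training examples -- the 'model' is just the training data itself."""
    by_distance = sorted(zip(train_X, train_y),
                         key=lambda xy: euclidean(xy[0], query))
    votes = [label for _, label in by_distance[:k]]
    return Counter(votes).most_common(1)[0][0]

# Toy training set: two well-separated clusters.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
           (5.0, 5.0), (5.2, 4.8), (4.9, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]
print(knn_predict(train_X, train_y, (1.1, 0.9), k=3))  # -> a
```

Note that there is no fitted parameter vector to inspect: every prediction re-scans the stored examples, which is why KNN has no closed-form model summary.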
Measuring similarity in distance
Locating the nearest neighbours requires a distance function, i.e., a way of measuring the similarity between two instances.
k-NN typically uses Euclidean distance, the straight-line distance between two points (the shortest direct route).
For two instances p and q with n features, the Euclidean distance is
dist(p, q) = √((p₁ − q₁)² + (p₂ − q₂)² + … + (pₙ − qₙ)²)
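As a quick numerical check of the Euclidean distance between two instances, a short sketch (pure Python; the function name is illustrative):

```python
import math

def euclidean_distance(p, q):
    # Sum the squared per-feature differences, then take the square root.
    return math.sqrt(sum((p_i - q_i) ** 2 for p_i, q_i in zip(p, q)))

# Example: two instances with n = 3 features.
# (1-4)^2 + (2-6)^2 + (3-3)^2 = 9 + 16 + 0 = 25, so the distance is 5.
print(euclidean_distance((1, 2, 3), (4, 6, 3)))  # -> 5.0
```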
The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff.
Choosing a large k reduces the variance caused by noisy data, but it can bias the learner so that it risks ignoring small but important patterns.
There is no general rule about the best k as it depends greatly on the nature of the data.
A grid search can be performed to assess the best value of k.
Grid Search: A grid search is an automated approach to searching across many combinations of hyperparameter values.
- A grid search predefines a candidate set of values for k (e.g., k = 1, 2, …, j) and uses a resampling method (e.g., k-fold cross-validation) to estimate which value of k generalizes best to unseen data.
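The grid-search procedure can be sketched as follows. This is an illustrative pure-Python example on a toy dataset, using leave-one-out cross-validation (the special case of k-fold CV where each fold holds out a single example); all names are assumptions, not from the text:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda xy: math.dist(xy[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out CV: hold each point out, predict it from the rest."""
    hits = sum(knn_predict(data[:i] + data[i + 1:], x, k) == y
               for i, (x, y) in enumerate(data))
    return hits / len(data)

data = [((1.0, 1.0), "a"), ((1.1, 0.9), "a"), ((0.9, 1.2), "a"),
        ((5.0, 5.0), "b"), ((5.1, 4.9), "b"), ((4.8, 5.2), "b")]

# Candidate grid of k values; keep the one with the highest CV accuracy.
grid = [1, 3, 5]
best_k = max(grid, key=lambda k: loo_accuracy(data, k))
print(best_k)
```

In practice the resampling and grid search would be delegated to a library (e.g., caret's train() in R or scikit-learn's GridSearchCV in Python) rather than hand-rolled as above.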
The caret package can be used to implement the knn method.
The next section demonstrates an application of Logistic Regression and KNN using an example from the applied finance domain.