## 14.2 K-Nearest Neighbour

### Nearest Neighbour classification (Quick Introduction)

• Idea: Things that are alike are likely to have similar properties.

• ML applies this principle to classify data by placing it in the same category as its most similar, or “nearest”, neighbours.

• Computers apply a human-like ability to recall past experiences in order to draw conclusions about current circumstances.

• Nearest neighbour methods are based on a simple idea, yet they can be extremely powerful.

• In general, nearest neighbour methods are well suited to classification tasks.

• If a concept is difficult to define, but you know it when you see it, then nearest neighbours might be appropriate.

• If the data is noisy and thus no clear distinction exists among the groups, nearest neighbour algorithms may struggle to identify the class boundaries.

### k-NN Algorithm

• k-nearest neighbours (k-NN) is a very simple algorithm in which each observation is predicted based on its “similarity” to other observations.

• The letter k is a variable term, implying that any number of nearest neighbours could be used.

• After choosing k, the algorithm requires a training dataset made up of examples that have been classified into several categories, as labelled by a nominal variable.

• Unlike most ML methods, k-NN is a memory-based algorithm and cannot be summarised by a closed-form model.
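The memory-based nature described above can be sketched in a few lines: “training” simply stores the labelled examples, and each prediction scans them at query time. This is an illustrative Python sketch, not any particular library's implementation; the toy data and function name are assumptions.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Predict the label of `query` by majority vote among its k
    nearest training examples (Euclidean distance)."""
    # Compute the distance from the query to every stored example
    distances = [(math.dist(x, query), label)
                 for x, label in zip(train_X, train_y)]
    # Take the k closest labels and return the most common one
    distances.sort(key=lambda pair: pair[0])
    k_labels = [label for _, label in distances[:k]]
    return Counter(k_labels).most_common(1)[0][0]

# Toy training set: two features, two classes
train_X = [(1.0, 1.0), (1.2, 0.8), (6.0, 6.0), (5.8, 6.2)]
train_y = ["A", "A", "B", "B"]
print(knn_predict(train_X, train_y, (1.1, 0.9), k=3))  # -> A
```

Note that no model is fitted: all the work happens at prediction time, which is why k-NN is sometimes called a “lazy” learner.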

#### Measuring similarity with distance

• Locating the nearest neighbours requires a distance function, which measures the similarity between two instances.

• k-NN typically uses Euclidean distance, the distance between two points measured as the shortest direct route.

• For two instances p and q (to be compared), each with n features:

$$dist(p,q)=\sqrt{(p_{1}-q_{1})^{2}+(p_{2}-q_{2})^{2}+\ldots+(p_{n}-q_{n})^{2}} \tag{14.2}$$
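Equation (14.2) translates directly into code. A minimal Python sketch (the function name is illustrative):

```python
import math

def euclidean_distance(p, q):
    """Equation (14.2): square root of the sum of squared
    feature-wise differences between instances p and q."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# A 3-4-5 right triangle: differences are 3 and 4, distance is 5
print(euclidean_distance((1, 2), (4, 6)))  # -> 5.0
```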

#### Choosing k

• The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff.

• Choosing a large k reduces the impact of variance caused by noisy data, but can bias the learner such that it runs the risk of ignoring small but important patterns.

• There is no general rule about the best k as it depends greatly on the nature of the data.

• A grid search can be performed to find the best value of k.

• Grid Search: A grid search is an automated approach to searching across many combinations of hyperparameter values.

• A grid search predefines a candidate set of values for k (e.g., k = 1, 2, …, j) and uses a resampling method (e.g., k-fold CV) to estimate which k value generalises best to unseen data.

• The caret package can be used to implement the knn method.
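caret is an R package; as a parallel sketch in Python, scikit-learn's `GridSearchCV` performs the same resampling over a candidate set of k values. This assumes scikit-learn is installed, and the dataset and grid of candidate values are purely illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate set of k values, each evaluated with 5-fold cross-validation
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 11))},
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)  # the k that scored best across the folds
```

The same idea in caret would supply `method = "knn"` to `train()` with a tuning grid and a `trainControl` resampling scheme.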

The next section demonstrates an application of logistic regression and k-NN using an example from the applied finance domain.