14.2 K-Nearest Neighbour

Nearest Neighbour classification (Quick Introduction)

  • Idea: Things that are alike are likely to have similar properties.

  • ML uses this principle to classify data by placing it in the same category as its most similar, or “nearest”, neighbours.

  • Computers apply a human-like ability to recall past experiences in order to draw conclusions about current circumstances.

  • Nearest neighbour methods are based on a simple idea, yet they can be extremely powerful.

  • In general, nearest neighbour methods are well suited for classification tasks.

  • If a concept is difficult to define, but you know it when you see it, then nearest neighbours might be appropriate.

  • If the data is noisy and thus no clear distinction exists among the groups, nearest neighbour algorithms may struggle to identify the class boundaries.

k-NN Algorithm

  • k-nearest neighbours (k-NN) is a very simple algorithm in which each observation is predicted based on its “similarity” to other observations.

  • The letter k is a variable term implying that any number of nearest neighbours could be used.

  • After choosing k, the algorithm requires a training dataset made up of examples that have been classified into several categories, as labeled by a nominal variable.

  • Unlike most methods in ML, k-NN is a memory-based algorithm and cannot be summarized by a closed-form model; a minimal sketch is shown below.
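To make this concrete, here is a minimal sketch (not the text's own example) of k-NN classification in R, using the knn() function from the class package and the built-in iris data; the train/test split and the choice k = 5 are arbitrary assumptions for illustration.

``` r
library(class)

set.seed(123)
idx   <- sample(nrow(iris), 100)   # hypothetical 100-row training split
train <- iris[idx, 1:4]            # training features
test  <- iris[-idx, 1:4]           # test features
cl    <- iris$Species[idx]         # training labels (the nominal variable)

# Each test observation is classified by a vote of its k = 5 nearest
# training neighbours (Euclidean distance on the four features)
pred <- knn(train = train, test = test, cl = cl, k = 5)
table(predicted = pred, actual = iris$Species[-idx])
```

Note that nothing is “fit” here: the training data themselves are the model, which is what memory-based means in practice.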

Measuring similarity with distance

  • Locating the nearest neighbours requires a distance function, which measures the similarity between two instances.

  • k-NN traditionally uses Euclidean distance, the straight-line distance between two points (i.e., the shortest direct route).

  • For two instances p and q, each with n features:

\[\begin{equation} dist(p,q)=\sqrt{(p_{1}-q_{1})^{2}+(p_{2}-q_{2})^{2}+\ldots+(p_{n}-q_{n})^{2}} \tag{14.2} \end{equation}\]
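As a quick illustration of Equation (14.2), the snippet below computes the distance by hand for two hypothetical instances with n = 3 features and cross-checks the result against R's built-in dist() function.

``` r
# Equation (14.2) for two hypothetical instances with n = 3 features
p <- c(5.1, 3.5, 1.4)
q <- c(6.2, 2.9, 4.3)

euclidean <- function(p, q) sqrt(sum((p - q)^2))
euclidean(p, q)                            # sqrt(1.21 + 0.36 + 8.41) ~ 3.16

# Cross-check with R's built-in distance function
dist(rbind(p, q), method = "euclidean")
```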

Choosing k

  • The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff.

  • Choosing a large k reduces the variance caused by noisy data, but can bias the learner such that it risks ignoring small but important patterns.

  • There is no general rule about the best k as it depends greatly on the nature of the data.

  • A grid search can be performed to assess the best value of k.

  • Grid Search: A grid search is an automated approach to searching across many combinations of hyperparameter values.

    • A grid search would predefine a candidate set of values for k (e.g., k = 1, 2, …, j) and perform a resampling method (e.g., k-fold CV) to estimate which value of k generalizes best to unseen data.
  • The caret package can be used to implement the knn method, as sketched below.
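Below is a hedged sketch of such a grid search with caret, again on the built-in iris data; the candidate grid k = 1, 3, …, 25 and the use of 10-fold CV are illustrative assumptions, not prescriptions.

``` r
library(caret)

set.seed(123)
fit <- train(
  Species ~ .,                         # predict the class label from all features
  data       = iris,
  method     = "knn",
  preProcess = c("center", "scale"),   # k-NN is distance-based, so standardize
  trControl  = trainControl(method = "cv", number = 10),  # 10-fold CV
  tuneGrid   = data.frame(k = seq(1, 25, by = 2))         # candidate k values
)
fit$bestTune                           # the k with the best resampled accuracy
```

Centering and scaling matter here because Euclidean distance is dominated by features measured on larger scales; standardizing puts all features on an equal footing before neighbours are found.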

The next section demonstrates an application of Logistic Regression and k-NN using an example from the applied finance domain.