13.2 Data Splitting: Sampling

  • Training set: data examples that are used to learn or build a classifier.

  • Validation set: data examples used to check the built classifier during development, so that its settings can be tuned to improve accuracy.

  • Testing set: data examples that help assess the performance of the classifier.

Machine learning typically requires the data to be split into three main sets. The first two (the training and validation sets) are usually drawn from the portion of the data selected to build the model.
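
The three sets described above can be produced with a single shuffle followed by slicing. The following is a minimal sketch using only the Python standard library; the function name `three_way_split`, the 60/20/20 fractions, and the seed are illustrative assumptions, not values prescribed by the text.

```python
import random

def three_way_split(examples, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle the examples, then slice them into training, validation, and testing sets."""
    shuffled = list(examples)
    # A seeded generator makes the split reproducible.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_frac)
    n_val = round(len(shuffled) * val_frac)
    test = shuffled[:n_test]                 # held out for the final assessment
    val = shuffled[n_test:n_test + n_val]    # used for tuning while building the model
    train = shuffled[n_test + n_val:]        # used to fit the model
    return train, val, test

# Hypothetical data: 100 example ids.
train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 60 20 20
```

Because the slices come from one shuffled list, no example can appear in more than one set.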

  • Overfitting: Building a model that memorizes the training data, and does not generalize well to new data. Generalization error > training error.

  • The two most common ways of splitting data:

    • Simple random sampling
    • Stratified sampling
  • Typical recommendations for splitting your data into training and testing sets include 60% (training)–40% (testing), 70%–30%, or 80%–20%. It is good to keep the following points in mind:

  • Allocating too much data to training (e.g., >80%) leaves too few testing examples to get a good assessment of predictive performance. We may find a model that fits the training data very well but is not generalizable (overfitting).

  • Allocating too much data to testing (e.g., >40%) can leave too few training examples to get good estimates of the model parameters.
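
The two sampling strategies named above can be compared on a small example. Simple random sampling draws the testing set without looking at the class labels, while stratified sampling splits each class separately so that both sets keep roughly the original class proportions. The sketch below is a standard-library illustration under stated assumptions: the function names, the 80%–20% split, and the imbalanced label list are all hypothetical.

```python
import random
from collections import defaultdict

def simple_random_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), drawn without regard to class labels."""
    idx = list(range(len(labels)))
    random.Random(seed).shuffle(idx)
    n_test = round(len(idx) * test_frac)
    return idx[n_test:], idx[:n_test]

def stratified_split(labels, test_frac=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class at the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)            # group example indices by class label
    train_idx, test_idx = [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_test = round(len(members) * test_frac)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return train_idx, test_idx

# Hypothetical imbalanced labels: 90 negatives (0) and 10 positives (1).
labels = [0] * 90 + [1] * 10
train_simple, test_simple = simple_random_split(labels)
train_strat, test_strat = stratified_split(labels)

# Stratified: the testing set holds exactly 20% of each class.
print(sum(labels[i] for i in test_strat))   # 2
# Simple random: the number of positives in the testing set depends on the shuffle.
print(sum(labels[i] for i in test_simple))
```

With rare classes, a simple random split can leave the testing set with very few (or no) positives, which is the usual reason to prefer a stratified split for imbalanced data.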