13.4 Stratified Sampling

  • Simple random sampling does not control the proportions of the response variable's values in the resulting samples.
  • Machine learning methods may require similar proportions in the training and testing sets, so that the response variable is not distributed differently in one set than in the other.
  • Stratified sampling yields similar distributions of the response variable in the training and testing sets.
  • It can be applied to both classification and regression problems.
  • With a continuous response variable, stratified sampling segments Y (the response variable) into quantiles and randomly samples from each. This helps ensure a balanced representation of the response distribution in both the training and testing sets.
  • The rsample package can be used to create stratified samples.
  • The following code demonstrates this on a dataset suitable for classification.
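library(caret)    # provides the GermanCredit data
library(rsample)  # provides initial_split(), training(), and testing()
data("GermanCredit")  # credit risk data from the caret package
idx4 = initial_split(GermanCredit, prop = 0.7, strata = "Class")  # Class is the binary response variable
train4 = training(idx4)  # 70% of the rows, stratified by Class
test4 = testing(idx4)    # remaining 30% of the rows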
# check the proportion of outcomes

prop.table(table(train4$Class))  #training set

      Bad      Good 
0.3004292 0.6995708 
prop.table(table(test4$Class))  #testing set

      Bad      Good 
0.2990033 0.7009967 

The training and testing sets above have nearly identical class proportions (about 30% Bad and 70% Good in each).
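
For a continuous response, the quantile-based stratification described earlier can be sketched by hand in base R. This is only an illustration of the idea, not rsample's internal implementation: the simulated data frame `dat`, its columns `x` and `y`, the choice of four quantile bins, and the 70/30 split are all assumptions made for the example.

```r
# Hypothetical data: 1000 observations with a right-skewed continuous response y
set.seed(42)
dat <- data.frame(x = rnorm(1000), y = rexp(1000, rate = 0.1))

# Segment y into four quantile bins (quartiles); each bin is one stratum
cuts <- quantile(dat$y, probs = seq(0, 1, by = 0.25))
dat$stratum <- cut(dat$y, breaks = cuts, include.lowest = TRUE, labels = FALSE)

# Within each stratum, randomly pick 70% of the row indices for training
rows_by_stratum <- split(seq_len(nrow(dat)), dat$stratum)
train_idx <- unname(unlist(lapply(rows_by_stratum, function(rows) {
  rows[sample.int(length(rows), size = round(0.7 * length(rows)))]
})))

train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Compare the response distribution in the two sets
summary(train$y)
summary(test$y)
```

Because every quartile of y contributes the same share of its rows to each set, the two summaries should look similar, including in the sparse upper tail that a simple random split could under- or over-represent.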