13.4 Stratified Sampling
- Simple random sampling does not control the proportions of the target variable's values that end up in each sample.
- Machine learning methods may require similar proportions in the training and testing sets, so that the response variable is not distributed differently between the two.
- Stratified sampling yields similar distributions of the response variable in the training and testing sets.
- It can be applied to both classification and regression problems.
- With a continuous response variable, stratified sampling segments Y (the response variable) into quantile-based bins and randomly samples from each bin. This helps ensure a balanced representation of the response distribution in both the training and testing sets.
- The rsample package can be used to create stratified samples.
- The following code demonstrates that on a dataset suitable for classification.
library(caret)   # provides the GermanCredit data
library(rsample) # initial_split(), training(), testing()
set.seed(999)
data("GermanCredit") # credit risk data from the caret package
idx4 = initial_split(GermanCredit, prop = 0.7, strata = "Class") # Class is the binary response variable
train4 = training(idx4)
test4 = testing(idx4)
# check the proportion of outcomes
prop.table(table(train4$Class)) # training set
      Bad      Good 
0.3004292 0.6995708 
prop.table(table(test4$Class)) #testing set
      Bad      Good 
0.2990033 0.7009967 
The training and testing sets above have nearly the same proportions of the class values (about 30% Bad and 70% Good in each).
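The same approach can be sketched for a regression problem, where the response is continuous and is binned into quantiles before sampling. The example below is a minimal illustration, not part of the original analysis: the data frame `sim` is simulated, and the names `split_y`, `train_y`, and `test_y` are hypothetical. It assumes the `breaks` argument of `initial_split()`, which sets the number of quantile bins (4 gives quartiles).

```r
library(rsample)

set.seed(999)
# Simulated data (hypothetical): a right-skewed continuous response y
sim <- data.frame(x = rnorm(1000), y = rexp(1000, rate = 0.1))

# Stratify on the continuous response: y is cut into 4 quantile-based bins
# and about 70% of the rows are drawn from each bin
split_y <- initial_split(sim, prop = 0.7, strata = "y", breaks = 4)
train_y <- training(split_y)
test_y  <- testing(split_y)

# Compare the response distribution in the two sets
summary(train_y$y)
summary(test_y$y)

# Share of each set falling in each quartile of the full data;
# with stratification, every share should be close to 0.25
bins <- quantile(sim$y, probs = seq(0, 1, by = 0.25))
train_share <- prop.table(table(cut(train_y$y, bins, include.lowest = TRUE)))
test_share  <- prop.table(table(cut(test_y$y,  bins, include.lowest = TRUE)))
train_share
test_share
```

Without the `strata` argument, the quartile shares in the two sets could drift further from 0.25, especially for small or strongly skewed data.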