Chapter 3. Division of the dataset

Before classification, we need two main subsets: a training set and a test set. To obtain them, we must apply a technique such as stratified K-fold cross-validation, stratified Hold-Out, or Leave-One-Out.

  • Stratified K-fold cross-validation: First, stratify the whole dataset, i.e., group the instances by class. Second, randomly shuffle the instances within each class. Then divide the dataset into K folds of (approximately) the same size, filling the folds class by class so that every fold preserves the class proportions. K-1 folds constitute the training set, and the remaining fold forms the test set. The procedure is repeated K times, so that each fold serves as the test set exactly once.
  • Stratified Hold-Out: First, stratify the whole dataset. Second, randomly shuffle the instances within each class. Then select, from each class, the percentage assigned to the training set; the rest forms the test set.
  • Leave-One-Out: This technique does not involve randomness. It is a particular case of K-fold cross-validation in which K equals the total number of instances (a.k.a. patterns).
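
The three techniques above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a reference implementation: the function names (group_by_class, stratified_k_fold, stratified_hold_out, leave_one_out), the seed parameter, and the round-robin way of dealing instances into folds are assumptions made for this example. The functions work on instance indices, so they can be combined with any way of storing the dataset.

```python
import random
from collections import defaultdict


def group_by_class(labels):
    """Stratify: map each class label to the indices of its instances."""
    groups = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)
    return groups


def stratified_k_fold(labels, k, seed=0):
    """Yield k (train, test) pairs of index lists, one pair per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for indices in group_by_class(labels).values():
        rng.shuffle(indices)  # random order inside each class
        # Deal the instances of this class into the folds in turn, so the
        # per-class counts of any two folds differ by at most one.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    for i in range(k):
        test = sorted(folds[i])  # fold i is the test set this round
        train = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train, test


def stratified_hold_out(labels, train_fraction, seed=0):
    """Return one (train, test) pair; train_fraction is taken per class."""
    rng = random.Random(seed)
    train, test = [], []
    for indices in group_by_class(labels).values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


def leave_one_out(n_instances):
    """Yield n_instances pairs; each instance is the test set exactly once."""
    for i in range(n_instances):
        train = [j for j in range(n_instances) if j != i]
        yield train, [i]


if __name__ == "__main__":
    # Hypothetical toy dataset: 10 instances of class "a", 5 of class "b".
    toy_labels = ["a"] * 10 + ["b"] * 5
    for train, test in stratified_k_fold(toy_labels, k=5):
        print("train:", train, "test:", test)
```

With the toy labels used in the demo (10 instances of one class and 5 of another) and K=5, every test fold holds two instances of the first class and one of the second, so the 2:1 class proportion of the whole dataset is kept in each fold. When a class size is not a multiple of K, this simple dealing scheme gives the extra instances to the first folds.
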
Here is an illustrative example of stratified K-fold cross-validation.
Stratified K-fold cross-validation with K=5.



Selection of instances with K=5.


THE TASK FOR YOU:
  • Get the most recent list of journals indexed in the Journal Citation Reports (JCR).
  • Investigate, in papers from JCR journals, the typical values that machine learning researchers use for K-fold cross-validation (e.g., K=5) and Hold-Out.
  • What are the typical values for K-fold cross-validation and Hold-Out?
  • Which division method is most commonly used?