
Assess optimal number of components for sPLS-DA on omics dataset from MultiDataSet object
Source: R/prefiltering.R
Performs cross-validation for a PLS-DA run (implemented in the mixOmics
package) on an omics dataset from a MultiDataSet object. This makes it possible to
estimate the optimal number of latent components to construct. This is
intended for feature preselection in the omics dataset (see examples below).
Usage
perf_splsda(
  splsda_input,
  ncomp_max = 5,
  validation = "Mfold",
  folds = 5,
  nrepeat = 50,
  measure = "BER",
  distance = "centroids.dist",
  cpus = 1,
  progressBar = TRUE,
  seed = NULL
)
Arguments
- splsda_input
Input for the sPLS-DA functions from mixOmics, created with get_input_splsda().
- ncomp_max
Integer, the maximum number of latent components to test when estimating the number of latent components to use. Default value is 5.
- validation
Character, which cross-validation method to use, can be one of "Mfold" or "loo" (see mixOmics::perf()). Default value is "Mfold".
- folds
Integer, number of folds to use in the M-fold cross-validation (see mixOmics::perf()). Default value is 5.
- nrepeat
Integer, number of times the cross-validation is repeated (see mixOmics::perf()). Default value is 50.
- measure
Performance measure used to select the optimal value of ncomp, can be one of "BER" or "overall" (see mixOmics::perf()). Default value is "BER".
- distance
Distance metric used to select the optimal value of ncomp, can be one of "max.dist", "centroids.dist" or "mahalanobis.dist" (see mixOmics::perf()). Default value is "centroids.dist".
- cpus
Integer, number of CPUs to use. Default value is 1.
- progressBar
Logical, whether to display a progress bar during the optimisation of ncomp. Default value is TRUE.
- seed
Integer, seed to use. Default is NULL, i.e. no seed is set inside the function.
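The difference between the two options for measure can be illustrated with a minimal base R sketch. It assumes the usual definitions: the overall error rate is the fraction of misclassified samples, and the balanced error rate (BER) is the average of the per-class error rates. The labels below are invented for illustration only and do not come from any real dataset.

```r
# Toy labels, invented purely for illustration: 8 samples in class A, 2 in class B
truth <- factor(c("A", "A", "A", "A", "A", "A", "A", "A", "B", "B"))
pred  <- factor(c("A", "A", "A", "A", "A", "A", "A", "A", "A", "B"),
                levels = levels(truth))

# Overall error rate: fraction of all samples that are misclassified
overall_error <- mean(pred != truth)

# Balanced error rate: error rate computed within each class, then averaged
per_class_error <- tapply(pred != truth, truth, mean)
ber <- mean(per_class_error)

overall_error  # 0.1
ber            # 0.25
```

With unbalanced groups, the overall error rate is dominated by the largest class, whereas the BER gives each class the same weight.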
Value
A list as per the output of the mixOmics::perf() function, with
the following additional elements:
- dataset_name: the name of the dataset analysed;
- group: column name in the samples information data-frame used as samples group;
- optim_ncomp: the optimal number of latent components as per the measure and distance specified;
- optim_measure: the measure used to select the optimal number of latent components;
- optim_distance: the distance metric used to select the optimal number of latent components.
In addition, the name of the dataset analysed and the column name in the samples information data-frame used as samples group are stored as attributes dataset_name and group, respectively.
Details
This function uses the mixOmics::plsda() and mixOmics::perf() functions
from the mixOmics package.
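The two underlying calls can be sketched directly with mixOmics on simulated data. This is a rough illustration of the general workflow, not the internal code of perf_splsda(). The matrix X, the factor Y and all settings are assumptions made for the example, and the choice.ncomp element is assumed to be laid out as in current mixOmics releases (rows for measures, columns for distances).

```r
# Only run if mixOmics is installed (it is distributed through Bioconductor)
if (requireNamespace("mixOmics", quietly = TRUE)) {
  set.seed(123)

  # Simulated data, for illustration only: 40 samples, 20 features, 2 groups
  n <- 40
  p <- 20
  X <- matrix(
    rnorm(n * p),
    nrow = n,
    dimnames = list(paste0("sample_", seq_len(n)), paste0("feature_", seq_len(p)))
  )
  Y <- factor(rep(c("groupA", "groupB"), each = n / 2))

  # Shift the first three features in group B so the groups can be separated
  X[Y == "groupB", 1:3] <- X[Y == "groupB", 1:3] + 2

  # Fit a PLS-DA model with the maximum number of components to test
  fit <- mixOmics::plsda(X, Y, ncomp = 3)

  # Repeated M-fold cross-validation of the fitted model
  cv <- mixOmics::perf(
    fit,
    validation = "Mfold",
    folds = 5,
    nrepeat = 5,
    progressBar = FALSE
  )

  # Cross-validated BER per number of components and distance
  print(cv$error.rate$BER)

  # Number of components suggested for a given measure and distance
  print(cv$choice.ncomp["BER", "centroids.dist"])
}
```

The nrepeat value is kept small here so the sketch runs quickly; the default of 50 in perf_splsda() gives more stable error estimates.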