The moiraine R package user manual

Author

Olivia Angelin-Bonnet

Introduction

Omics datasets provide an overview of the content of cells for a specific molecular layer (e.g. transcriptome, proteome, metabolome). By integrating different omics datasets obtained on the same biological samples, we can gain a deeper understanding of the interactions between these molecular layers, and shed light on the regulations occurring both within and between layers.

A number of statistical methods have been developed to extract such information from multi-omics datasets, and many have been implemented as R packages. However, these tools differ conceptually, in terms of the input data they require, the assumptions they make, the statistical approaches they use or even the biological questions they seek to answer. They also differ at a practical level in terms of the format required for data input, the parameters to tune or select, and the format in which the results are returned. These differences render the application of several integration tools to the same multi-omics dataset and the comparison of their results complex and time-consuming.

The moiraine package aims at alleviating these issues, by providing a framework to easily and consistently apply different integration tools to a same multi-omics dataset, and to compare the results from different integration tools. It implements numerous visualisation and reporting functions to facilitate the interpretation of the integration results as well as the comparison of these results across integration methods. In addition, in an effort to make these computations reproducible, moiraine heavily relies on the targets package for the construction of reproducible analysis pipelines.

The moiraine package

The workflow for a typical multi-omics integration analysis handled with moiraine includes the following steps:

  • Data import: this covers the import of omics measurements as well as associated metadata (i.e. information about the omics features and samples) – moiraine relies on the MultiDataSet package to store this information in a consistent format;

  • Inspection of the omics datasets: including checking values density distribution, samples overlap between omics datasets, or presence of missing values;

  • Preprocessing of the omics datasets: missing values imputation, transformation, and pre-filtering of samples and omics features;

  • Integration of the omics datasets by one or more of the supported tools; currently, the following integration methods are covered in moiraine:

    • sPLS and DIABLO from the mixOmics package

    • sO2PLS from the OmicsPLS package

    • MOFA and MEFISTO from the MOFA2 package

  • Interpretation of the integration results using standardised visualisations enriched with features and samples metadata;

  • Comparison of the integration results obtained by different methods or pre-processing approaches.

Note that in the moiraine package, we refer to the different biological entities measured in a given dataset (e.g. genes, transcripts, metabolic compounds, etc) as features, and the observations in a dataset as samples. Samples metadata and features metadata denote information about the samples (such as treatment, collection date, etc) and features (e.g. name, biological function, etc), respectively.

About this manual

In this manual, we are showcasing the functionalities of the moiraine package by walking through an in-depth example of a multi-omics integration analysis. We will be covering not only the functionalities of moiraine, but also discuss the different integration tools and provide recommendations for parameters setting and interpretation.

The moiraine package was designed to be used in the context of targets pipelines, therefore some familiarity with the targets package is necessary to follow this manual. Nevertheless, for users preferring R scripts to targets pipelines, alternative code will be provided to translate the target factories implemented in moiraine. To learn about targets, we refer readers to the excellent targets manual.

For clarity, throughout this manual, any code that belongs in a targets script (e.g. in the targets list in _targets.R) will be shown in a blue chunk, e.g.:

tar_target(
    my_first_target,
    c(2 + 2)
)

Any code shown in a regular code chunk (see below) is code that should be run in the command line, or saved in a regular R script. It often showcases how to inspect a certain target output after having run tar_make() to execute the targets pipeline, e.g.:

library(targets)

tar_read(my_first_target)
#> [1] 4

Note that for readers that prefer to use regular R scripts to targets pipelines, the target command in the example target chunk above can be converted to R script-compatible code as follows:

my_first_target <- 2 + 2

and the calls to tar_read() and tar_load() ignored.

Lastly, throughout the manual, the following options are set to improve upon the default colour scales:

options(
  ggplot2.continuous.colour = "viridis",
  ggplot2.continuous.fill = "viridis",
  ggplot2.discrete.colour = function() {
    ggplot2::scale_colour_brewer(
      palette = "Paired", 
      na.value = "grey"
    )
  } ,
  ggplot2.discrete.fill = function() {
    ggplot2::scale_fill_brewer(
      palette = "Paired",
      na.value = "grey"
    )
  } 
)