3  Importing data

The first step of the pipeline is to read in the different omics datasets and associated metadata. Once imported into R, they will be combined in a MultiDataSet object from the MultiDataSet package (Hernandez-Ferrer et al. (2017)) which will be used in the rest of the analysis pipeline.

For each omics dataset, we need to import:

Typically, omics datasets as well as associated metadata are stored in csv files, although there are some other file formats that moiraine can read, and we will see examples of this in the following sections.

3.1 The example dataset files

The dataset analysed this manual is presented in Chapter 2. The associated files that we will use here are:

  • Genomics data:

    • genomics_dataset.csv: contains the genomic variants’ dosage, with genomic variants as rows and samples as columns.

    • genomics_features_info.csv: contains information about the genomic variants (chromosome, genomic position, etc, as well as the results of a GWAS analysis).

  • Transcriptomics data:

    • transcriptomics_dataset.csv: contains the raw read counts for the measured genes – rows correspond to transcripts, and columns to samples.

    • bos_taurus_gene_model.gff3: the genome annotation file used to map the transcriptomics reads to gene models.

    • transcriptomics_de_results.csv: the results of a differential expression analysis run on the transcriptomics dataset to compare healthy and diseased animals.

    • transcriptomics_go_annotation.csv: contains the correspondence between genes and GO terms in a long format (one row per gene/GO term pair).

  • Metabolomics data:

    • metabolomics_dataset.csv: contains the area peak values – rows correspond to samples, and columns to compounds.

    • metabolomics_features_info.csv: contains information about the compounds (such as mass, retention time, and formula and name if the compounds has been identified) as well as the results of a differential expression analysis run on the metabolomics dataset to compare healthy and diseased animals.

  • Samples information: stored in the samples_info.csv file, in which each row corresponds to a sample.

Each of these files is available through the moiraine package, and can be retrieved via system.file("extdata/genomics_dataset.csv", package = "moiraine").

3.2 Importing the datasets

We will show how to import the datasets, first manually, and then in an automated way (using a target factory function).

3.2.1 Manually

We can start by creating targets that track the different data files. This ensures that when a data file changes, the target is considered outdated and any analysis relying on this data file will be re-run (see here for more information). For example, for the genomics dataset, we write:

tar_target(
  dataset_file_geno,
  system.file("extdata/genomics_dataset.csv", package = "moiraine"),
  format = "file"
)

The created target, called dataset_file_geno, takes as value the path to the file:

tar_read(dataset_file_geno)
#> [1] "/powerplant/workspace/hrpoab/RENV_CACHE/v5/R-4.2/x86_64-pc-linux-gnu/moiraine/1.0.0.9000/314a3529519bccd65ee1fc0b4352b7b4/moiraine/extdata/genomics_dataset.csv"

The next step is to import this dataset in R. We use the import_dataset_csv() function for that, rather than the readr::read_csv() or similar functions, as it ensures that the data is imported with the correct format for further use with the moiraine package. When importing a dataset, we need to specify the path to the file, as well as the name of the column in the csv file that contains the row names (through the col_id argument). In addition, we need to specify whether the features are represented in rows in the csv file, or in columns. This is done through the argument features_as_rows. For example, we can load the genomics dataset through:

tar_target(
  data_geno,
  import_dataset_csv(
    dataset_file_geno, 
    col_id = "marker", 
    features_as_rows = TRUE)
)

The function returns a matrix in which the rows correspond to the features measured, and the columns correspond to the samples:

tar_read(data_geno) |> dim()
#> [1] 23036   139
tar_read(data_geno)[1:5, 1:3]
#>                             R21 Y3660 Y3243
#> 1_41768691                    1     0     2
#> 10-27008241-A-C-rs42918694    2     2     2
#> 10-37505419-T-C-rs136559242   0     1     0
#> 10-49904259-G-A-rs471723345   1     2     2
#> 1-109550832-G-A-rs209732846   2     2     1

Note that import_dataset_csv() uses readr::read_csv() to read in the data. It accepts arguments that will be passed on to read_csv(), which can be useful to control how the data file must be read, e.g. by specifying the columns’ type, or which characters must be considered as missing values.

3.2.2 Using a target factory function

Creating a target to track the raw file and using the import_dataset_csv() function to read it can be a bit cumbersome if we want to import several datasets. Luckily, this process can be automated with the import_dataset_csv_factory() function. It takes as an input a vector of files path, and for each file creates:

  • a target named dataset_file_XX (XX explained below), which tracks the raw data file;

  • a target named data_XX, which corresponds to the data matrix that has been imported through the import_dataset_csv function.

For each file, we need to specify the name of the column giving the row names (argument col_ids), and whether the features are stored as rows or as columns (argument features_as_rowss). Note that these arguments are the same as in the primary function import_dataset_csv(), except that they have an additional ‘s’ at the end of their name. This will be the case for most of the target factory functions from the package.

In addition, we have to provide a unique suffix which will be appended to the name of the targets created (i.e. the XX mentioned above) through the target_name_suffixes argument. This allows us to track which target corresponds to which dataset.

So the following code (note that it is not within a tar_target() call):

import_dataset_csv_factory(
  files = c(
    system.file("extdata/genomics_dataset.csv", package = "moiraine"),
    system.file("extdata/transcriptomics_dataset.csv", package = "moiraine"),
    system.file("extdata/metabolomics_dataset.csv", package = "moiraine")
  ),
  col_ids = c("marker", "gene_id", "sample_id"),
  features_as_rowss = c(TRUE, TRUE, FALSE),
  target_name_suffixes = c("geno", "transcripto", "metabo")
)

will create the following targets:

  • dataset_file_geno, dataset_file_transcripto, dataset_file_metabo

  • data_geno, data_metabo, data_transcripto

tar_read(data_geno) |> dim()
#> [1] 23036   139
tar_read(data_transcripto) |> dim()
#> [1] 20335   143
tar_read(data_metabo) |> dim()
#> [1]  55 139

With this factory function, it is not possible to pass arguments to read_csv(). If you want to control how the files are read, please use the import_dataset_csv() function directly instead, as shown in Section 3.2.1.

Converting targets factory to R script

There is no simple way to convert this target factory to regular R script using loops, so we can instead write the code separately for each omics dataset.

dataset_file_geno <- system.file(
  "extdata/genomics_dataset.csv",
  package = "moiraine"
)
data_geno <- import_dataset_csv(
  dataset_file_geno,
  col_id = "marker",
  features_as_rows = TRUE
)

dataset_file_transcripto <- system.file(
  "extdata/transcriptomics_dataset.csv", 
  package = "moiraine"
)
data_transcripto <- import_dataset_csv(
  dataset_file_transcripto,
  col_id = "gene_id",
  features_as_rows = TRUE
)

dataset_file_metabo <- system.file(
  "extdata/metabolomics_dataset.csv", 
  package = "moiraine"
)
data_metabo <- import_dataset_csv(
  dataset_file_metabo,
  col_id = "sample_id",
  features_as_rows = FALSE
)

3.3 Importing the features metadata

Similarly to how we imported the datasets, there are two ways of importing features metadata: either manually, or using a target factory function. The two options are illustrated below.

3.3.1 Manually

As shown in the previous section, we can start by creating a target that tracks the raw features metadata file, then read the file into R using the import_fmetadata_csv() function. It has the similar arguments as the import_dataset_csv() function, but returns a data-frame (rather than a matrix); and does not have the options to read a csv where the features are columns (they must be in rows):

list(
  tar_target(
    fmetadata_file_geno,
    system.file("extdata/genomics_features_info.csv", package = "moiraine"),
    format = "file"
  ),
  
  tar_target(
    fmetadata_geno,
    import_fmetadata_csv(
      fmetadata_file_geno,
      col_id = "marker",
      col_types = c("chromosome" = "c")
    )
  )
)

Notice that in the import_fmetadata_csv() call, we’ve added an argument (col_types) which will be passed on to read_csv(). This is to ensure that the chromosome column will be read as character (even though the chromosomes are denoted with integers).

tar_read(fmetadata_geno) |> head()
#>                                                feature_id chromosome  position
#> 1_41768691                                     1_41768691          1  42139849
#> 10-27008241-A-C-rs42918694     10-27008241-A-C-rs42918694         10  26971270
#> 10-37505419-T-C-rs136559242   10-37505419-T-C-rs136559242         10  37388728
#> 10-49904259-G-A-rs471723345   10-49904259-G-A-rs471723345          0         0
#> 1-109550832-G-A-rs209732846   1-109550832-G-A-rs209732846          1 108696486
#> 11-104555023-A-G-rs109353933 11-104555023-A-G-rs109353933         11 104498929
#>                              gen_train_score ref alt ilmn_strand
#> 1_41768691                            0.6786   T   G         BOT
#> 10-27008241-A-C-rs42918694            0.8050   A   C         TOP
#> 10-37505419-T-C-rs136559242           0.7890   A   G         TOP
#> 10-49904259-G-A-rs471723345           0.7970   A   G         TOP
#> 1-109550832-G-A-rs209732846           0.8909   T   C         BOT
#> 11-104555023-A-G-rs109353933          0.8673   T   C         BOT
#>                              customer_strand norm_id    qtl_type qtl_effect
#> 1_41768691                               BOT       2 non signif.         NA
#> 10-27008241-A-C-rs42918694               TOP       1 non signif.         NA
#> 10-37505419-T-C-rs136559242              BOT       1 non signif.         NA
#> 10-49904259-G-A-rs471723345              TOP       2 non signif.         NA
#> 1-109550832-G-A-rs209732846              TOP       3 non signif.         NA
#> 11-104555023-A-G-rs109353933             TOP       1 non signif.         NA
#>                              p_value fdr
#> 1_41768691                        NA  NA
#> 10-27008241-A-C-rs42918694        NA  NA
#> 10-37505419-T-C-rs136559242       NA  NA
#> 10-49904259-G-A-rs471723345       NA  NA
#> 1-109550832-G-A-rs209732846       NA  NA
#> 11-104555023-A-G-rs109353933      NA  NA

You can see that in the data-frame of features metadata, the feature IDs are present both as row names and in the feature_id column. This makes it easier to subset the datasets later on.

3.3.2 Using a target factory function

Alternatively, we can use a target factory function that automates the process when we have to read in several features metadata files. In our case, we have to do it for the genomics and metabolomics datasets only, as the transcriptomics dataset has a different features metadata format. However because we need to specify the column types for the genomics dataset, we will use the targets factory function to read in the metabolomics features metadata only. The arguments are almost the same as for import_dataset_csv_factory() (except for features_as_rowss):

import_fmetadata_csv_factory(
  files = c(
    system.file("extdata/metabolomics_features_info.csv", package = "moiraine")
  ),
  col_ids = c("feature_id"),
  target_name_suffixes = c("metabo")
)

The targets created are:

  • fmetadata_file_metabo

  • fmetadata_metabo

tar_read(fmetadata_metabo) |> head()
#>           feature_id     hmdb_id                  name chemical_formula
#> HMDB00001  HMDB00001 HMDB0000001     1-Methylhistidine        C7H11N3O2
#> HMDB00008  HMDB00008 HMDB0000008 2-Hydroxybutyric acid           C4H8O3
#> HMDB00357  HMDB00357 HMDB0000011 3-Hydroxybutyric acid           C4H8O3
#> HMDB00042  HMDB00042 HMDB0000042           Acetic acid           C2H4O2
#> HMDB00043  HMDB00043 HMDB0000043               Betaine         C5H12NO2
#> HMDB00060  HMDB00060 HMDB0000060      Acetoacetic acid           C4H6O3
#>           monisotopic_molecular_weight cas_registry_number
#> HMDB00001                    169.08513            332-80-9
#> HMDB00008                    104.04734           3347-90-8
#> HMDB00357                    104.04734            625-72-9
#> HMDB00042                     60.02113             64-19-7
#> HMDB00043                    118.08680           6915-17-9
#> HMDB00060                    102.03169            541-50-4
#>                                smiles                    inchikey kegg_id
#> HMDB00001 CN1C=NC(C[C@H](N)C(O)=O)=C1 BRMWTNUJHUMWMS-LURJTMIESA-N  C01152
#> HMDB00008            CC[C@H](O)C(O)=O AFENDNXGAFYKQO-VKHMYHEASA-N  C05984
#> HMDB00357           C[C@@H](O)CC(O)=O WHBMMWSBFZVSSR-GSVOUGTGSA-N  C01089
#> HMDB00042                     CC(O)=O QTBSBXVTEAMEQO-UHFFFAOYSA-N  C00033
#> HMDB00043          C[N+](C)(C)CC(O)=O KWIUHFFTVRNATP-UHFFFAOYSA-O    <NA>
#> HMDB00060               CC(=O)CC(O)=O WDJHALXBUFZDSR-UHFFFAOYSA-N  C00164
#>                                    direct_parent                   super_class
#> HMDB00001              Histidine and derivatives Organic acids and derivatives
#> HMDB00008    Alpha hydroxy acids and derivatives Organic acids and derivatives
#> HMDB00357     Beta hydroxy acids and derivatives Organic acids and derivatives
#> HMDB00042                       Carboxylic acids Organic acids and derivatives
#> HMDB00043                      Alpha amino acids Organic acids and derivatives
#> HMDB00060 Short-chain keto acids and derivatives Organic acids and derivatives
#>               t_value      p_value         padj de_signif     de_status
#> HMDB00001  -0.5557020 5.797635e-01 6.784466e-01    Not DE        Not DE
#> HMDB00008   0.2181562 8.276321e-01 8.925444e-01    Not DE        Not DE
#> HMDB00357  -9.7388879 2.353250e-17 2.157146e-16        DE downregulated
#> HMDB00042 -12.5323491 1.753101e-24 4.821028e-23        DE downregulated
#> HMDB00043  -7.9073179 7.827088e-13 3.913544e-12        DE downregulated
#> HMDB00060  -0.4369834 6.628164e-01 7.439776e-01    Not DE        Not DE

Again, the targets factory function does not allow to pass arguments to read_csv() (if you need them, please use import_fmetadata_csv() directly as we have done in Section 3.3.1).

Converting targets factory to R script
fmetadata_file_metabo <- system.file(
  "extdata/metabolomics_features_info.csv", 
  package = "moiraine"
)
fmetadata_metabo <- import_fmetadata_csv(
  fmetadata_file_metabo,
  col_id = "feature_id"
)

3.3.3 Importing features metadata from a GTF/GFF file

The moiraine package can also extract features metadata from a genome annotation file (.gtf or .gff). We’ll demonstrate that for the transcriptomics dataset, for which information about the position and name of the transcripts can be found in the genome annotation used to map the reads. The function is called import_fmetadata_gff() (it is also the function you would use to read in information from a .gtf file). The type of information to extract from the annotation file is specified through the feature_type argument, which can be either 'genes' or 'transcripts'. In addition, if the function does not extract certain fields from the annotation file, these can be explicitly called using the add_fields parameter.

In this example, we want to extract information about the genes from the gtf file. We also want to make sure that the Name and descriptionfield are imported, as they give the name and description of the genes. To read in this information “manually”, we create the following targets:

list(
  tar_target(
    fmetadata_file_transcripto,
    system.file("extdata/bos_taurus_gene_model.gff3", package = "moiraine"),
    format = "file"
  ),
  
  tar_target(
    fmetadata_transcripto,
    import_fmetadata_gff(
      fmetadata_file_transcripto,
      feature_type = "genes",
      add_fields = c("Name", "description")
    )
  )
)

As for the other import functions, there exists a more succinct target factory version, called import_fmetadata_gff_factory():

import_fmetadata_gff_factory(
  files = system.file("extdata/bos_taurus_gene_model.gff3", package = "moiraine"),
  feature_types = "genes",
  add_fieldss = c("Name", "description"),
  target_name_suffixes = "transcripto"
)

This will create two targets: fmetadata_file_transcripto and fmetadata_transcripto.

As with import_fmetadata, the function returns a data-frame of features information:

tar_read(fmetadata_transcripto) |> head()
#>                            feature_id chromosome    start      end  width
#> ENSBTAG00000000005 ENSBTAG00000000005         17 65389743 65505336 115594
#> ENSBTAG00000000008 ENSBTAG00000000008         29 32214439 32244810  30372
#> ENSBTAG00000000009 ENSBTAG00000000009         18 12338037 12342272   4236
#> ENSBTAG00000000010 ENSBTAG00000000010         21 34209956 34223394  13439
#> ENSBTAG00000000011 ENSBTAG00000000011          8  7950815  7971600  20786
#> ENSBTAG00000000012 ENSBTAG00000000012         20 33708626 33732944  24319
#>                    strand  Name
#> ENSBTAG00000000005      +  GRK3
#> ENSBTAG00000000008      - KCNJ1
#> ENSBTAG00000000009      + FOXF1
#> ENSBTAG00000000010      +  UBL7
#> ENSBTAG00000000011      -   TDH
#> ENSBTAG00000000012      + TTC33
#>                                                                                                       description
#> ENSBTAG00000000005                        G protein-coupled receptor kinase 3 [Source:VGNC Symbol;Acc:VGNC:53904]
#> ENSBTAG00000000008 potassium inwardly rectifying channel subfamily J member 1 [Source:VGNC Symbol;Acc:VGNC:30453]
#> ENSBTAG00000000009                                            forkhead box F1 [Source:VGNC Symbol;Acc:VGNC:29084]
#> ENSBTAG00000000010                                           ubiquitin like 7 [Source:VGNC Symbol;Acc:VGNC:50128]
#> ENSBTAG00000000011                                 L-threonine dehydrogenase [Source:VGNC Symbol;Acc:VGNC:108945]
#> ENSBTAG00000000012                         tetratricopeptide repeat domain 33 [Source:VGNC Symbol;Acc:VGNC:36471]
Converting targets factory to R script
fmetadata_file_transcripto <- system.file(
  "extdata/bos_taurus_gene_model.gff3", 
  package = "moiraine"
)
fmetadata_transcripto <- import_fmetadata_gff(
  fmetadata_file_transcripto,
  feature_type = "genes",
  add_fields = c("Name", "description")
)

3.4 Importing the samples metadata

As for importing datasets or features metadata, the import_smetadata_csv() function reads in a csv file that contains information about the samples measured. Similarly to import_fmetadata_csv(), this function assumes that the csv file contains samples as rows. In this example, we have one samples information file for all of our omics datasets, but it is possible to have one separate samples metadata csv file for each omics dataset (if there are some omics-specific information such as batch, technology specifications, etc).

We can do this by manually creating the following targets:

list(
  tar_target(
    smetadata_file_all,
    system.file("extdata/samples_info.csv", package = "moiraine"),
    format = "file"
  ),

  tar_target(
    smetadata_all,
    import_smetadata_csv(
      smetadata_file_all,
      col_id = "animal_id"
    )
  )
)

which is equivalent to the (more succinct) command:

import_smetadata_csv_factory(
  files = system.file("extdata/samples_info.csv", package = "moiraine"),
  col_ids = "animal_id",
  target_name_suffixes = "all"
)

The latter command creates the targets smetadata_file_all and smetadata_all. smetadata_all stores the samples metadata imported as a data-frame:

tar_read(smetadata_all) |> head()
#>          id feedlot gender  status day_on_feed rnaseq_batch geno_comp_1
#> R21     R21      F1 female Control          31           B2    0.007853
#> Y3660 Y3660      F1   male Control          19           B2    0.852220
#> Y3243 Y3243      F1   male Control          16           B2    0.044171
#> R5764 R5764      F2   male Control          46           B1    0.213094
#> P4669 P4669      F3   male     BRD          35           B2    0.389393
#> R5452 R5452      F2   male Control          49           B1    0.265998
#>       geno_comp_2 geno_comp_3 geno_comp_cluster
#> R21      0.820691    0.171456                K3
#> Y3660    0.093585    0.054195                K2
#> Y3243    0.190262    0.765567                K1
#> R5764    0.676299    0.110607                K3
#> P4669    0.503471    0.107136                K3
#> R5452    0.733992    0.000010                K3

Note that in the samples metadata data-frame, the sample IDs are present both as row names and in the id column. This makes it easier to subset the datasets later on.

As for the other import functions, import_smetadata_csv() accepts arguments that will be passed to read_csv() in order to specify how the file should be read. The targets factory version does not have this option.

Converting targets factory to R script
smetadata_file_all <- system.file("extdata/samples_info.csv", package = "moiraine")
smetadata_all <- import_smetadata_csv(
  smetadata_file_all,
  col_id = "animal_id"
)

3.5 Creating the omics sets

Once each dataset and associated features and samples metadata have been imported, we need to combine them into omics sets. In practice, this means that for each omics dataset, we will create an R object that stores the actual dataset alongside its relevant metadata. moiraine relies on the Biobase containers derived from Biobase::eSet to store the different omics datasets; for example, Biobase::ExpressionSet objects are used to store transcriptomics measurements. Currently, moiraine support four types of omics containers:

  • genomics containers, which are Biobase::SnpSet objects. The particularity of this data type is that the features metadata data-frame must contain a column named chromosome and a column named position, which store the chromosome and genomic position within the chromosome (in base pairs) of a given genomic marker or variant.

  • transcriptomics containers, which are Biobase::ExpressionSet objects. The particularity of this data type is that the features metadata data-frame must contain the following columns: chromosome, start, end, giving the chromosome, start and end positions (in base pairs) of the genes or transcripts. Moreover, the values in start and end must be integers, and for each row the value in end must be higher than the value in start.

  • metabolomics containers, which are MetabolomeSet objects (implemented within moiraine). There are no restrictions on the features metadata table for this type of containers.

  • phenotype containers, which are PhenotypeSet objects (implemented within moiraine). There are no restrictions on the features metadata table for this type of containers.

In practice, the nuances between these different containers are not very important, and the type of container used to store a particular dataset will have no impact on the downstream analysis apart from the name that will be given to the omics dataset. So in order to create a container for a transcriptomics dataset in the absence of features metadata, we have to create a dummy data-frame with the columns chromosome, start and end containing the values ch1, 1, and 10 (for example) and use that as features metadata. Alternately, or for other omics data (e.g. proteomics), it is possible to use a PhenotypeSet object instead.

3.5.1 Creating a single omics set

The function create_omics_set() provides a convenient wrapper to create such container objects from the imported datasets and metadata. It has two mandatory arguments: the dataset, which should be in the form of a matrix where the rows correspond to features and the columns to samples; and the type of omics data that the dataset represents ('genomics', 'transcriptomics', 'metabolomics' or 'phenomics'). The latter determines which type of container will be generated. Optionally, a features metadata and/or a samples metadata data-frame can be passed on via the features_metadata and samples_metadata arguments, respectively. For example, let’s create a set for the genomics data:

tar_target(
  set_geno,
  create_omics_set(
    data_geno,
    omics_type = "genomics",
    features_metadata = fmetadata_geno,
    samples_metadata = smetadata_all
  )
)

If executed, this command will return the following warning:

#> Warning: 5 samples in samples metadata not in dataset, will be removed from
#> metadata.

This is because, when providing features and samples metadata information, the function makes sure that the feature or sample IDs present in the metadata tables match those used in the dataset. In our case, 5 sample IDs from the metadata data-frame are not present in the dataset. We can confirm that by comparing the column names of the genomics dataset to the row names of the samples metadata:

setdiff(
  tar_read(smetadata_all) |> rownames(),
  tar_read(data_geno) |> colnames()
)
#> [1] "P4744" "P4772" "R8953" "U5416" "R9909"

Rather than throwing an error, the function will add a row for each missing sample ID to the metadata data-frame, with a NA in every column, and will remove from the metadata data-frame any sample not present in the dataset. The same applies for features metadata.

The resulting object is a SnpSet:

tar_read(set_geno)
#> SnpSet (storageMode: lockedEnvironment)
#> assayData: 23036 features, 139 samples 
#>   element names: call, callProbability 
#> protocolData: none
#> phenoData
#>   rowNames: R21 Y3660 ... O5108 (139 total)
#>   varLabels: id feedlot ... geno_comp_cluster (10 total)
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: 1_41768691 10-27008241-A-C-rs42918694 ... STAT5_13516_2
#>     (23036 total)
#>   fvarLabels: feature_id chromosome ... fdr (13 total)
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:

which can be queried using specific methods from the Biobase package, e.g.:

tar_load(set_geno)

dim(set_geno)
#> Features  Samples 
#>    23036      139

featureNames(set_geno) |> head()
#> [1] "1_41768691"                   "10-27008241-A-C-rs42918694"  
#> [3] "10-37505419-T-C-rs136559242"  "10-49904259-G-A-rs471723345" 
#> [5] "1-109550832-G-A-rs209732846"  "11-104555023-A-G-rs109353933"

sampleNames(set_geno) |> head()
#> [1] "R21"   "Y3660" "Y3243" "R5764" "P4669" "R5452"

fData(set_geno) |> head() ## extracts features metadata
#>                                                feature_id chromosome  position
#> 1_41768691                                     1_41768691          1  42139849
#> 10-27008241-A-C-rs42918694     10-27008241-A-C-rs42918694         10  26971270
#> 10-37505419-T-C-rs136559242   10-37505419-T-C-rs136559242         10  37388728
#> 10-49904259-G-A-rs471723345   10-49904259-G-A-rs471723345          0         0
#> 1-109550832-G-A-rs209732846   1-109550832-G-A-rs209732846          1 108696486
#> 11-104555023-A-G-rs109353933 11-104555023-A-G-rs109353933         11 104498929
#>                              gen_train_score ref alt ilmn_strand
#> 1_41768691                            0.6786   T   G         BOT
#> 10-27008241-A-C-rs42918694            0.8050   A   C         TOP
#> 10-37505419-T-C-rs136559242           0.7890   A   G         TOP
#> 10-49904259-G-A-rs471723345           0.7970   A   G         TOP
#> 1-109550832-G-A-rs209732846           0.8909   T   C         BOT
#> 11-104555023-A-G-rs109353933          0.8673   T   C         BOT
#>                              customer_strand norm_id    qtl_type qtl_effect
#> 1_41768691                               BOT       2 non signif.         NA
#> 10-27008241-A-C-rs42918694               TOP       1 non signif.         NA
#> 10-37505419-T-C-rs136559242              BOT       1 non signif.         NA
#> 10-49904259-G-A-rs471723345              TOP       2 non signif.         NA
#> 1-109550832-G-A-rs209732846              TOP       3 non signif.         NA
#> 11-104555023-A-G-rs109353933             TOP       1 non signif.         NA
#>                              p_value fdr
#> 1_41768691                        NA  NA
#> 10-27008241-A-C-rs42918694        NA  NA
#> 10-37505419-T-C-rs136559242       NA  NA
#> 10-49904259-G-A-rs471723345       NA  NA
#> 1-109550832-G-A-rs209732846       NA  NA
#> 11-104555023-A-G-rs109353933      NA  NA

pData(set_geno) |> head() ## extracts samples metadata
#>          id feedlot gender  status day_on_feed rnaseq_batch geno_comp_1
#> R21     R21      F1 female Control          31           B2    0.007853
#> Y3660 Y3660      F1   male Control          19           B2    0.852220
#> Y3243 Y3243      F1   male Control          16           B2    0.044171
#> R5764 R5764      F2   male Control          46           B1    0.213094
#> P4669 P4669      F3   male     BRD          35           B2    0.389393
#> R5452 R5452      F2   male Control          49           B1    0.265998
#>       geno_comp_2 geno_comp_3 geno_comp_cluster
#> R21      0.820691    0.171456                K3
#> Y3660    0.093585    0.054195                K2
#> Y3243    0.190262    0.765567                K1
#> R5764    0.676299    0.110607                K3
#> P4669    0.503471    0.107136                K3
#> R5452    0.733992    0.000010                K3

Note that these methods can also be applied to the other types of containers.

3.5.2 Using a target factory for creating omics sets

The function create_omics_set_factory() allows us to create several omics sets at once. It returns a list of targets, each storing one of the created omics set container. It takes as input arguments vectors that give for each omics set the arguments required by create_omics_set().

create_omics_set_factory(
  datasets = c(data_geno, data_transcripto, data_metabo),
  omics_types = c("genomics", "transcriptomics", "metabolomics"),
  features_metadatas = c(fmetadata_geno, fmetadata_transcripto, fmetadata_metabo),
  samples_metadatas = c(smetadata_all, smetadata_all, smetadata_all)
)

Again, the warnings raised by the function originate from discrepancies between the datasets and associated metadata. It is always good practice to double-check manually to make sure that it is not due to a typo in the IDs or similar error.

If one of the datasets has no associated features or samples metadata, use NULL in the corresponding input arguments, e.g.:

create_omics_set_factory(
  datasets = c(data_geno, data_transcripto, data_metabo),
  omics_types = c("genomics", "transcriptomics", "metabolomics"),
  features_metadatas = c(NULL, fmetadata_transcripto, fmetadata_metabo),
  samples_metadatas = c(smetadata_all, NULL, smetadata_all)
)

The create_omics_set_factory() function has a target_name_suffixes argument to customise the name of the created targets. However, if this argument is not provided, the function will attempt to read the suffixes to use from the name of the dataset targets. So in this case, it knows that the suffixes to use are 'geno', 'transcripto' and 'metabo'. Consequently, the function creates the following targets: set_geno, set_transcripto, set_metabo.

tar_read(set_geno)
#> SnpSet (storageMode: lockedEnvironment)
#> assayData: 23036 features, 139 samples 
#>   element names: call, callProbability 
#> protocolData: none
#> phenoData
#>   rowNames: R21 Y3660 ... O5108 (139 total)
#>   varLabels: id feedlot ... geno_comp_cluster (10 total)
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: 1_41768691 10-27008241-A-C-rs42918694 ... STAT5_13516_2
#>     (23036 total)
#>   fvarLabels: feature_id chromosome ... fdr (13 total)
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
tar_read(set_transcripto)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 20335 features, 143 samples 
#>   element names: exprs 
#> protocolData: none
#> phenoData
#>   rowNames: R9497 R5969 ... Y9816 (143 total)
#>   varLabels: id feedlot ... geno_comp_cluster (10 total)
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: ENSBTAG00000000005 ENSBTAG00000000008 ...
#>     ENSBTAG00000055314 (20335 total)
#>   fvarLabels: feature_id chromosome ... description (8 total)
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
tar_read(set_metabo)
#> MetabolomeSet (storageMode: lockedEnvironment)
#> assayData: 55 features, 139 samples 
#>   element names: call 
#> protocolData: none
#> phenoData
#>   rowNames: R21 Y3660 ... U5416 (139 total)
#>   varLabels: id feedlot ... geno_comp_cluster (10 total)
#>   varMetadata: labelDescription
#> featureData
#>   featureNames: HMDB00001 HMDB00008 ... HMDB01881 (55 total)
#>   fvarLabels: feature_id hmdb_id ... de_status (16 total)
#>   fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
Converting targets factory to R script

Again there not an easy way to use loops to convert this targets factory, so instead we’ll write the code for each omics dataset.

set_geno <- create_omics_set(
  data_geno,
  omics_type = "genomics",
  features_metadata = fmetadata_geno,
  samples_metadata = smetadata_all
)

set_transcripto <- create_omics_set(
  data_transcripto,
  omics_type = "transcriptomics",
  features_metadata = fmetadata_transcripto,
  samples_metadata = smetadata_all
)

set_metabo <- create_omics_set(
  data_metabo,
  omics_type = "metabolomics",
  features_metadata = fmetadata_metabo,
  samples_metadata = smetadata_all
)

3.6 Creating the multi-omics set

Finally, we can combine the different omics sets into one multi-omics set object. moiraine makes use of the MultiDataSet package for that. MultiDataSet (Hernandez-Ferrer et al. 2017) implements a multi-omics data container that collects, in one R object, several omics datasets alongside their associated features and samples metadata. One of the main advantages of using a MultiDataSet container is that we can pass all of the information associated with a set of related omics datasets with only one R object. In addition, the MultiDataSet package implements a number of very useful functions. For example, it is possible to assess the samples that are common to several omics sets. This is particularly useful for data integration, as the moiraine package can automatically discard samples missing from one or more datasets prior to the integration step if needed. Note that sample matching between the different omics datasets is based on sample IDs, so they must be consistent between the different datasets.

We will create the multi-omics set with the create_multiomics_set() function. It requires a list of the omics sets (that we created via either create_omics_set() or create_omics_set_factory()) to include, and returns a MultiDataSet::MultiDataSet-class object.

tar_target(
  mo_set,
  create_multiomics_set(
    list(set_geno,
         set_transcripto,
         set_metabo)
  )
)
tar_read(mo_set)
#> Object of class 'MultiDataSet'
#>  . assayData: 3 elements
#>     . snps: 23036 features, 139 samples 
#>     . rnaseq: 20335 features, 143 samples 
#>     . metabolome: 55 features, 139 samples 
#>  . featureData:
#>     . snps: 23036 rows, 13 cols (feature_id, ..., p_value)
#>     . rnaseq: 20335 rows, 8 cols (feature_id, ..., Name)
#>     . metabolome: 55 rows, 16 cols (feature_id, ..., de_signif)
#>  . rowRanges:
#>     . snps: YES
#>     . rnaseq: YES
#>     . metabolome: NO
#>  . phenoData:
#>     . snps: 139 samples, 10 cols (id, ..., geno_comp_3)
#>     . rnaseq: 143 samples, 10 cols (id, ..., geno_comp_3)
#>     . metabolome: 139 samples, 10 cols (id, ..., geno_comp_3)

Within the MultiDataSet object, each omics set is assigned a name. The name depends first on the omics container type: a SnpSet set will be named snps, an ExpressionSet set will be named rnaseq, a MetabolomeSet will be named metabolome and a PhenotypeSet will be called phenotypes. If several sets of the same type are provided, they will be assigned unique names, e.g. snps+1 and snps+2 (the + symbol used as separator is set in the MultiDataSet package and cannot be changed). Alternatively, we can provide custom names for the datasets, using the datasets_names argument. These will be added to the type name (e.g. snps+customname). For example:

tar_target(
  mo_set_with_names,
  create_multiomics_set(
    list(set_geno,
         set_transcripto,
         set_metabo),
    datasets_names = c("CaptureSeq", "RNAseq", "LCMS")
  )
)

returns:

tar_read(mo_set_with_names)
#> Object of class 'MultiDataSet'
#>  . assayData: 3 elements
#>     . snps+CaptureSeq: 23036 features, 139 samples 
#>     . rnaseq+RNAseq: 20335 features, 143 samples 
#>     . metabolome+LCMS: 55 features, 139 samples 
#>  . featureData:
#>     . snps+CaptureSeq: 23036 rows, 13 cols (feature_id, ..., p_value)
#>     . rnaseq+RNAseq: 20335 rows, 8 cols (feature_id, ..., Name)
#>     . metabolome+LCMS: 55 rows, 16 cols (feature_id, ..., de_signif)
#>  . rowRanges:
#>     . snps+CaptureSeq: YES
#>     . rnaseq+RNAseq: YES
#>     . metabolome+LCMS: NO
#>  . phenoData:
#>     . snps+CaptureSeq: 139 samples, 10 cols (id, ..., geno_comp_3)
#>     . rnaseq+RNAseq: 143 samples, 10 cols (id, ..., geno_comp_3)
#>     . metabolome+LCMS: 139 samples, 10 cols (id, ..., geno_comp_3)

Importantly, the create_multiomics_set() function makes sure that samples metadata is consistent across the datasets for common samples. That is, if the same column (i.e. with the same name) is present in the samples metadata of several omics datasets, the values in this column must match for each sample present in all datasets. Otherwise, the function returns an error.

In the following chapter on Inspecting the MultiDataSet object, we will see how to handle the MultiDataSet object we just created. Alternatively, the MultiDataSet package vignette provides examples of constructing, querying and subsetting MultiDataSet objects.

3.7 Recap – targets list

For convenience, here is the list of targets that we created in this section:

Targets list for data import
list(
  ## Data import using a target factory
  import_dataset_csv_factory(
    files = c(
      system.file("extdata/genomics_dataset.csv", package = "moiraine"),
      system.file("extdata/transcriptomics_dataset.csv", package = "moiraine"),
      system.file("extdata/metabolomics_dataset.csv", package = "moiraine")
    ),
    col_ids = c("marker", "gene_id", "sample_id"),
    features_as_rowss = c(TRUE, TRUE, FALSE),
    target_name_suffixes = c("geno", "transcripto", "metabo")
  ),
  
  ## Genomics features metadata file
  tar_target(
    fmetadata_file_geno,
    system.file("extdata/genomics_features_info.csv", package = "moiraine"),
    format = "file"
  ),
  
  ## Genomics features metadata import
  tar_target(
    fmetadata_geno,
    import_fmetadata_csv(
      fmetadata_file_geno,
      col_id = "marker",
      col_types = c("chromosome" = "c")
    )
  ),
  
  ## Metabolomics features metadata import
  import_fmetadata_csv_factory(
    files = c(
      system.file("extdata/metabolomics_features_info.csv", package = "moiraine")
    ),
    col_ids = c("feature_id"),
    target_name_suffixes = c("metabo")
  ),
  
  ## Transcriptomics features metadata import
  import_fmetadata_gff_factory(
    files = system.file("extdata/bos_taurus_gene_model.gff3", package = "moiraine"),
    feature_types = "genes",
    add_fieldss = c("Name", "description"),
    target_name_suffixes = "transcripto"
  ),
  
  ## Samples metadata import
  import_smetadata_csv_factory(
    files = system.file("extdata/samples_info.csv", package = "moiraine"),
    col_ids = "animal_id",
    target_name_suffixes = "all"
  ),
  
  ## Creating omics sets for each dataset
  create_omics_set_factory(
    datasets = c(data_geno, data_transcripto, data_metabo),
    omics_types = c("genomics", "transcriptomics", "metabolomics"),
    features_metadatas = c(fmetadata_geno, fmetadata_transcripto, fmetadata_metabo),
    samples_metadatas = c(smetadata_all, smetadata_all, smetadata_all)
  ),
  
  ## Creating the MultiDataSet object
  tar_target(
    mo_set,
    create_multiomics_set(
      list(set_geno,
           set_transcripto,
           set_metabo)
    )
  )
)