3 Importing data
The first step of the pipeline is to read in the different omics datasets and associated metadata. Once imported into R, they will be combined in a MultiDataSet
object from the MultiDataSet
package (Hernandez-Ferrer et al. (2017)) which will be used in the rest of the analysis pipeline.
For each omics dataset, we need to import:
- the dataset itself (i.e. the matrix of omics measurements);
the features metadata, i.e. information about the features measured in the dataset;
the samples metadata, i.e. information about the samples measured in the dataset.
Typically, omics datasets as well as associated metadata are stored in csv
files, although there are some other file formats that moiraine
can read, and we will see examples of this in the following sections.
3.1 The example dataset files
The dataset analysed this manual is presented in Chapter 2. The associated files that we will use here are:
-
Genomics data:
genomics_dataset.csv
: contains the genomic variants’ dosage, with genomic variants as rows and samples as columns.genomics_features_info.csv
: contains information about the genomic variants (chromosome, genomic position, etc, as well as the results of a GWAS analysis).
-
Transcriptomics data:
transcriptomics_dataset.csv
: contains the raw read counts for the measured genes – rows correspond to transcripts, and columns to samples.bos_taurus_gene_model.gff3
: the genome annotation file used to map the transcriptomics reads to gene models.transcriptomics_de_results.csv
: the results of a differential expression analysis run on the transcriptomics dataset to compare healthy and diseased animals.transcriptomics_go_annotation.csv
: contains the correspondence between genes and GO terms in a long format (one row per gene/GO term pair).
-
Metabolomics data:
metabolomics_dataset.csv
: contains the area peak values – rows correspond to samples, and columns to compounds.metabolomics_features_info.csv
: contains information about the compounds (such as mass, retention time, and formula and name if the compounds has been identified) as well as the results of a differential expression analysis run on the metabolomics dataset to compare healthy and diseased animals.
Samples information: stored in the
samples_info.csv
file, in which each row corresponds to a sample.
Each of these files is available through the moiraine
package, and can be retrieved via system.file("extdata/genomics_dataset.csv", package = "moiraine")
.
3.2 Importing the datasets
We will show how to import the datasets, first manually, and then in an automated way (using a target factory function).
3.2.1 Manually
We can start by creating targets that track the different data files. This ensures that when a data file changes, the target is considered outdated and any analysis relying on this data file will be re-run (see here for more information). For example, for the genomics dataset, we write:
tar_target(
dataset_file_geno,
system.file("extdata/genomics_dataset.csv", package = "moiraine"),
format = "file"
)
The created target, called dataset_file_geno
, takes as value the path to the file:
tar_read(dataset_file_geno)
#> [1] "/powerplant/workspace/hrpoab/RENV_CACHE/v5/R-4.2/x86_64-pc-linux-gnu/moiraine/1.0.0.9000/314a3529519bccd65ee1fc0b4352b7b4/moiraine/extdata/genomics_dataset.csv"
The next step is to import this dataset in R. We use the import_dataset_csv()
function for that, rather than the readr::read_csv()
or similar functions, as it ensures that the data is imported with the correct format for further use with the moiraine
package. When importing a dataset, we need to specify the path to the file, as well as the name of the column in the csv file that contains the row names (through the col_id
argument). In addition, we need to specify whether the features are represented in rows in the csv file, or in columns. This is done through the argument features_as_rows
. For example, we can load the genomics dataset through:
tar_target(
data_geno,
import_dataset_csv(
dataset_file_geno,
col_id = "marker",
features_as_rows = TRUE)
)
The function returns a matrix in which the rows correspond to the features measured, and the columns correspond to the samples:
Note that import_dataset_csv()
uses readr::read_csv()
to read in the data. It accepts arguments that will be passed on to read_csv()
, which can be useful to control how the data file must be read, e.g. by specifying the columns’ type, or which characters must be considered as missing values.
3.2.2 Using a target factory function
Creating a target to track the raw file and using the import_dataset_csv()
function to read it can be a bit cumbersome if we want to import several datasets. Luckily, this process can be automated with the import_dataset_csv_factory()
function. It takes as an input a vector of files path, and for each file creates:
a target named
dataset_file_XX
(XX
explained below), which tracks the raw data file;a target named
data_XX
, which corresponds to the data matrix that has been imported through theimport_dataset_csv
function.
For each file, we need to specify the name of the column giving the row names (argument col_ids
), and whether the features are stored as rows or as columns (argument features_as_rowss
). Note that these arguments are the same as in the primary function import_dataset_csv()
, except that they have an additional ‘s’ at the end of their name. This will be the case for most of the target factory functions from the package.
In addition, we have to provide a unique suffix which will be appended to the name of the targets created (i.e. the XX
mentioned above) through the target_name_suffixes
argument. This allows us to track which target corresponds to which dataset.
So the following code (note that it is not within a tar_target()
call):
import_dataset_csv_factory(
files = c(
system.file("extdata/genomics_dataset.csv", package = "moiraine"),
system.file("extdata/transcriptomics_dataset.csv", package = "moiraine"),
system.file("extdata/metabolomics_dataset.csv", package = "moiraine")
),
col_ids = c("marker", "gene_id", "sample_id"),
features_as_rowss = c(TRUE, TRUE, FALSE),
target_name_suffixes = c("geno", "transcripto", "metabo")
)
will create the following targets:
dataset_file_geno
,dataset_file_transcripto
,dataset_file_metabo
data_geno
,data_metabo
,data_transcripto
With this factory function, it is not possible to pass arguments to read_csv()
. If you want to control how the files are read, please use the import_dataset_csv()
function directly instead, as shown in Section 3.2.1.
Converting targets factory to R script
There is no simple way to convert this target factory to regular R script using loops, so we can instead write the code separately for each omics dataset.
dataset_file_geno <- system.file(
"extdata/genomics_dataset.csv",
package = "moiraine"
)
data_geno <- import_dataset_csv(
dataset_file_geno,
col_id = "marker",
features_as_rows = TRUE
)
dataset_file_transcripto <- system.file(
"extdata/transcriptomics_dataset.csv",
package = "moiraine"
)
data_transcripto <- import_dataset_csv(
dataset_file_transcripto,
col_id = "gene_id",
features_as_rows = TRUE
)
dataset_file_metabo <- system.file(
"extdata/metabolomics_dataset.csv",
package = "moiraine"
)
data_metabo <- import_dataset_csv(
dataset_file_metabo,
col_id = "sample_id",
features_as_rows = FALSE
)
3.3 Importing the features metadata
Similarly to how we imported the datasets, there are two ways of importing features metadata: either manually, or using a target factory function. The two options are illustrated below.
3.3.1 Manually
As shown in the previous section, we can start by creating a target that tracks the raw features metadata file, then read the file into R using the import_fmetadata_csv()
function. It has the similar arguments as the import_dataset_csv()
function, but returns a data-frame (rather than a matrix); and does not have the options to read a csv where the features are columns (they must be in rows):
list(
tar_target(
fmetadata_file_geno,
system.file("extdata/genomics_features_info.csv", package = "moiraine"),
format = "file"
),
tar_target(
fmetadata_geno,
import_fmetadata_csv(
fmetadata_file_geno,
col_id = "marker",
col_types = c("chromosome" = "c")
)
)
)
Notice that in the import_fmetadata_csv()
call, we’ve added an argument (col_types
) which will be passed on to read_csv()
. This is to ensure that the chromosome
column will be read as character (even though the chromosomes are denoted with integers).
tar_read(fmetadata_geno) |> head()
#> feature_id chromosome position
#> 1_41768691 1_41768691 1 42139849
#> 10-27008241-A-C-rs42918694 10-27008241-A-C-rs42918694 10 26971270
#> 10-37505419-T-C-rs136559242 10-37505419-T-C-rs136559242 10 37388728
#> 10-49904259-G-A-rs471723345 10-49904259-G-A-rs471723345 0 0
#> 1-109550832-G-A-rs209732846 1-109550832-G-A-rs209732846 1 108696486
#> 11-104555023-A-G-rs109353933 11-104555023-A-G-rs109353933 11 104498929
#> gen_train_score ref alt ilmn_strand
#> 1_41768691 0.6786 T G BOT
#> 10-27008241-A-C-rs42918694 0.8050 A C TOP
#> 10-37505419-T-C-rs136559242 0.7890 A G TOP
#> 10-49904259-G-A-rs471723345 0.7970 A G TOP
#> 1-109550832-G-A-rs209732846 0.8909 T C BOT
#> 11-104555023-A-G-rs109353933 0.8673 T C BOT
#> customer_strand norm_id qtl_type qtl_effect
#> 1_41768691 BOT 2 non signif. NA
#> 10-27008241-A-C-rs42918694 TOP 1 non signif. NA
#> 10-37505419-T-C-rs136559242 BOT 1 non signif. NA
#> 10-49904259-G-A-rs471723345 TOP 2 non signif. NA
#> 1-109550832-G-A-rs209732846 TOP 3 non signif. NA
#> 11-104555023-A-G-rs109353933 TOP 1 non signif. NA
#> p_value fdr
#> 1_41768691 NA NA
#> 10-27008241-A-C-rs42918694 NA NA
#> 10-37505419-T-C-rs136559242 NA NA
#> 10-49904259-G-A-rs471723345 NA NA
#> 1-109550832-G-A-rs209732846 NA NA
#> 11-104555023-A-G-rs109353933 NA NA
You can see that in the data-frame of features metadata, the feature IDs are present both as row names and in the feature_id
column. This makes it easier to subset the datasets later on.
3.3.2 Using a target factory function
Alternatively, we can use a target factory function that automates the process when we have to read in several features metadata files. In our case, we have to do it for the genomics and metabolomics datasets only, as the transcriptomics dataset has a different features metadata format. However because we need to specify the column types for the genomics dataset, we will use the targets factory function to read in the metabolomics features metadata only. The arguments are almost the same as for import_dataset_csv_factory()
(except for features_as_rowss
):
import_fmetadata_csv_factory(
files = c(
system.file("extdata/metabolomics_features_info.csv", package = "moiraine")
),
col_ids = c("feature_id"),
target_name_suffixes = c("metabo")
)
The targets created are:
fmetadata_file_metabo
fmetadata_metabo
tar_read(fmetadata_metabo) |> head()
#> feature_id hmdb_id name chemical_formula
#> HMDB00001 HMDB00001 HMDB0000001 1-Methylhistidine C7H11N3O2
#> HMDB00008 HMDB00008 HMDB0000008 2-Hydroxybutyric acid C4H8O3
#> HMDB00357 HMDB00357 HMDB0000011 3-Hydroxybutyric acid C4H8O3
#> HMDB00042 HMDB00042 HMDB0000042 Acetic acid C2H4O2
#> HMDB00043 HMDB00043 HMDB0000043 Betaine C5H12NO2
#> HMDB00060 HMDB00060 HMDB0000060 Acetoacetic acid C4H6O3
#> monisotopic_molecular_weight cas_registry_number
#> HMDB00001 169.08513 332-80-9
#> HMDB00008 104.04734 3347-90-8
#> HMDB00357 104.04734 625-72-9
#> HMDB00042 60.02113 64-19-7
#> HMDB00043 118.08680 6915-17-9
#> HMDB00060 102.03169 541-50-4
#> smiles inchikey kegg_id
#> HMDB00001 CN1C=NC(C[C@H](N)C(O)=O)=C1 BRMWTNUJHUMWMS-LURJTMIESA-N C01152
#> HMDB00008 CC[C@H](O)C(O)=O AFENDNXGAFYKQO-VKHMYHEASA-N C05984
#> HMDB00357 C[C@@H](O)CC(O)=O WHBMMWSBFZVSSR-GSVOUGTGSA-N C01089
#> HMDB00042 CC(O)=O QTBSBXVTEAMEQO-UHFFFAOYSA-N C00033
#> HMDB00043 C[N+](C)(C)CC(O)=O KWIUHFFTVRNATP-UHFFFAOYSA-O <NA>
#> HMDB00060 CC(=O)CC(O)=O WDJHALXBUFZDSR-UHFFFAOYSA-N C00164
#> direct_parent super_class
#> HMDB00001 Histidine and derivatives Organic acids and derivatives
#> HMDB00008 Alpha hydroxy acids and derivatives Organic acids and derivatives
#> HMDB00357 Beta hydroxy acids and derivatives Organic acids and derivatives
#> HMDB00042 Carboxylic acids Organic acids and derivatives
#> HMDB00043 Alpha amino acids Organic acids and derivatives
#> HMDB00060 Short-chain keto acids and derivatives Organic acids and derivatives
#> t_value p_value padj de_signif de_status
#> HMDB00001 -0.5557020 5.797635e-01 6.784466e-01 Not DE Not DE
#> HMDB00008 0.2181562 8.276321e-01 8.925444e-01 Not DE Not DE
#> HMDB00357 -9.7388879 2.353250e-17 2.157146e-16 DE downregulated
#> HMDB00042 -12.5323491 1.753101e-24 4.821028e-23 DE downregulated
#> HMDB00043 -7.9073179 7.827088e-13 3.913544e-12 DE downregulated
#> HMDB00060 -0.4369834 6.628164e-01 7.439776e-01 Not DE Not DE
Again, the targets factory function does not allow to pass arguments to read_csv()
(if you need them, please use import_fmetadata_csv()
directly as we have done in Section 3.3.1).
Converting targets factory to R script
fmetadata_file_metabo <- system.file(
"extdata/metabolomics_features_info.csv",
package = "moiraine"
)
fmetadata_metabo <- import_fmetadata_csv(
fmetadata_file_metabo,
col_id = "feature_id"
)
3.3.3 Importing features metadata from a GTF/GFF file
The moiraine
package can also extract features metadata from a genome annotation file (.gtf
or .gff
). We’ll demonstrate that for the transcriptomics dataset, for which information about the position and name of the transcripts can be found in the genome annotation used to map the reads. The function is called import_fmetadata_gff()
(it is also the function you would use to read in information from a .gtf
file). The type of information to extract from the annotation file is specified through the feature_type
argument, which can be either 'genes'
or 'transcripts'
. In addition, if the function does not extract certain fields from the annotation file, these can be explicitly called using the add_fields
parameter.
In this example, we want to extract information about the genes from the gtf file. We also want to make sure that the Name
and description
field are imported, as they give the name and description of the genes. To read in this information “manually”, we create the following targets:
list(
tar_target(
fmetadata_file_transcripto,
system.file("extdata/bos_taurus_gene_model.gff3", package = "moiraine"),
format = "file"
),
tar_target(
fmetadata_transcripto,
import_fmetadata_gff(
fmetadata_file_transcripto,
feature_type = "genes",
add_fields = c("Name", "description")
)
)
)
As for the other import functions, there exists a more succinct target factory version, called import_fmetadata_gff_factory()
:
import_fmetadata_gff_factory(
files = system.file("extdata/bos_taurus_gene_model.gff3", package = "moiraine"),
feature_types = "genes",
add_fieldss = c("Name", "description"),
target_name_suffixes = "transcripto"
)
This will create two targets: fmetadata_file_transcripto
and fmetadata_transcripto
.
As with import_fmetadata
, the function returns a data-frame of features information:
tar_read(fmetadata_transcripto) |> head()
#> feature_id chromosome start end width
#> ENSBTAG00000000005 ENSBTAG00000000005 17 65389743 65505336 115594
#> ENSBTAG00000000008 ENSBTAG00000000008 29 32214439 32244810 30372
#> ENSBTAG00000000009 ENSBTAG00000000009 18 12338037 12342272 4236
#> ENSBTAG00000000010 ENSBTAG00000000010 21 34209956 34223394 13439
#> ENSBTAG00000000011 ENSBTAG00000000011 8 7950815 7971600 20786
#> ENSBTAG00000000012 ENSBTAG00000000012 20 33708626 33732944 24319
#> strand Name
#> ENSBTAG00000000005 + GRK3
#> ENSBTAG00000000008 - KCNJ1
#> ENSBTAG00000000009 + FOXF1
#> ENSBTAG00000000010 + UBL7
#> ENSBTAG00000000011 - TDH
#> ENSBTAG00000000012 + TTC33
#> description
#> ENSBTAG00000000005 G protein-coupled receptor kinase 3 [Source:VGNC Symbol;Acc:VGNC:53904]
#> ENSBTAG00000000008 potassium inwardly rectifying channel subfamily J member 1 [Source:VGNC Symbol;Acc:VGNC:30453]
#> ENSBTAG00000000009 forkhead box F1 [Source:VGNC Symbol;Acc:VGNC:29084]
#> ENSBTAG00000000010 ubiquitin like 7 [Source:VGNC Symbol;Acc:VGNC:50128]
#> ENSBTAG00000000011 L-threonine dehydrogenase [Source:VGNC Symbol;Acc:VGNC:108945]
#> ENSBTAG00000000012 tetratricopeptide repeat domain 33 [Source:VGNC Symbol;Acc:VGNC:36471]
Converting targets factory to R script
fmetadata_file_transcripto <- system.file(
"extdata/bos_taurus_gene_model.gff3",
package = "moiraine"
)
fmetadata_transcripto <- import_fmetadata_gff(
fmetadata_file_transcripto,
feature_type = "genes",
add_fields = c("Name", "description")
)
3.4 Importing the samples metadata
As for importing datasets or features metadata, the import_smetadata_csv()
function reads in a csv file that contains information about the samples measured. Similarly to import_fmetadata_csv()
, this function assumes that the csv file contains samples as rows. In this example, we have one samples information file for all of our omics datasets, but it is possible to have one separate samples metadata csv file for each omics dataset (if there are some omics-specific information such as batch, technology specifications, etc).
We can do this by manually creating the following targets:
list(
tar_target(
smetadata_file_all,
system.file("extdata/samples_info.csv", package = "moiraine"),
format = "file"
),
tar_target(
smetadata_all,
import_smetadata_csv(
smetadata_file_all,
col_id = "animal_id"
)
)
)
which is equivalent to the (more succinct) command:
import_smetadata_csv_factory(
files = system.file("extdata/samples_info.csv", package = "moiraine"),
col_ids = "animal_id",
target_name_suffixes = "all"
)
The latter command creates the targets smetadata_file_all
and smetadata_all
. smetadata_all
stores the samples metadata imported as a data-frame:
tar_read(smetadata_all) |> head()
#> id feedlot gender status day_on_feed rnaseq_batch geno_comp_1
#> R21 R21 F1 female Control 31 B2 0.007853
#> Y3660 Y3660 F1 male Control 19 B2 0.852220
#> Y3243 Y3243 F1 male Control 16 B2 0.044171
#> R5764 R5764 F2 male Control 46 B1 0.213094
#> P4669 P4669 F3 male BRD 35 B2 0.389393
#> R5452 R5452 F2 male Control 49 B1 0.265998
#> geno_comp_2 geno_comp_3 geno_comp_cluster
#> R21 0.820691 0.171456 K3
#> Y3660 0.093585 0.054195 K2
#> Y3243 0.190262 0.765567 K1
#> R5764 0.676299 0.110607 K3
#> P4669 0.503471 0.107136 K3
#> R5452 0.733992 0.000010 K3
Note that in the samples metadata data-frame, the sample IDs are present both as row names and in the id
column. This makes it easier to subset the datasets later on.
As for the other import functions, import_smetadata_csv()
accepts arguments that will be passed to read_csv()
in order to specify how the file should be read. The targets factory version does not have this option.
Converting targets factory to R script
smetadata_file_all <- system.file("extdata/samples_info.csv", package = "moiraine")
smetadata_all <- import_smetadata_csv(
smetadata_file_all,
col_id = "animal_id"
)
3.5 Creating the omics sets
Once each dataset and associated features and samples metadata have been imported, we need to combine them into omics sets. In practice, this means that for each omics dataset, we will create an R object that stores the actual dataset alongside its relevant metadata. moiraine
relies on the Biobase
containers derived from Biobase::eSet
to store the different omics datasets; for example, Biobase::ExpressionSet
objects are used to store transcriptomics measurements. Currently, moiraine
support four types of omics containers:
genomics containers, which are
Biobase::SnpSet
objects. The particularity of this data type is that the features metadata data-frame must contain a column namedchromosome
and a column namedposition
, which store the chromosome and genomic position within the chromosome (in base pairs) of a given genomic marker or variant.transcriptomics containers, which are
Biobase::ExpressionSet
objects. The particularity of this data type is that the features metadata data-frame must contain the following columns:chromosome
,start
,end
, giving the chromosome, start and end positions (in base pairs) of the genes or transcripts. Moreover, the values instart
andend
must be integers, and for each row the value inend
must be higher than the value instart
.metabolomics containers, which are
MetabolomeSet
objects (implemented withinmoiraine
). There are no restrictions on the features metadata table for this type of containers.phenotype containers, which are
PhenotypeSet
objects (implemented withinmoiraine
). There are no restrictions on the features metadata table for this type of containers.
In practice, the nuances between these different containers are not very important, and the type of container used to store a particular dataset will have no impact on the downstream analysis apart from the name that will be given to the omics dataset. So in order to create a container for a transcriptomics dataset in the absence of features metadata, we have to create a dummy data-frame with the columns chromosome
, start
and end
containing the values ch1
, 1
, and 10
(for example) and use that as features metadata. Alternately, or for other omics data (e.g. proteomics), it is possible to use a PhenotypeSet
object instead.
3.5.1 Creating a single omics set
The function create_omics_set()
provides a convenient wrapper to create such container objects from the imported datasets and metadata. It has two mandatory arguments: the dataset, which should be in the form of a matrix where the rows correspond to features and the columns to samples; and the type of omics data that the dataset represents ('genomics'
, 'transcriptomics'
, 'metabolomics'
or 'phenomics'
). The latter determines which type of container will be generated. Optionally, a features metadata and/or a samples metadata data-frame can be passed on via the features_metadata
and samples_metadata
arguments, respectively. For example, let’s create a set for the genomics data:
tar_target(
set_geno,
create_omics_set(
data_geno,
omics_type = "genomics",
features_metadata = fmetadata_geno,
samples_metadata = smetadata_all
)
)
If executed, this command will return the following warning:
#> Warning: 5 samples in samples metadata not in dataset, will be removed from
#> metadata.
This is because, when providing features and samples metadata information, the function makes sure that the feature or sample IDs present in the metadata tables match those used in the dataset. In our case, 5 sample IDs from the metadata data-frame are not present in the dataset. We can confirm that by comparing the column names of the genomics dataset to the row names of the samples metadata:
Rather than throwing an error, the function will add a row for each missing sample ID to the metadata data-frame, with a NA
in every column, and will remove from the metadata data-frame any sample not present in the dataset. The same applies for features metadata.
The resulting object is a SnpSet
:
tar_read(set_geno)
#> SnpSet (storageMode: lockedEnvironment)
#> assayData: 23036 features, 139 samples
#> element names: call, callProbability
#> protocolData: none
#> phenoData
#> rowNames: R21 Y3660 ... O5108 (139 total)
#> varLabels: id feedlot ... geno_comp_cluster (10 total)
#> varMetadata: labelDescription
#> featureData
#> featureNames: 1_41768691 10-27008241-A-C-rs42918694 ... STAT5_13516_2
#> (23036 total)
#> fvarLabels: feature_id chromosome ... fdr (13 total)
#> fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
which can be queried using specific methods from the Biobase
package, e.g.:
tar_load(set_geno)
dim(set_geno)
#> Features Samples
#> 23036 139
featureNames(set_geno) |> head()
#> [1] "1_41768691" "10-27008241-A-C-rs42918694"
#> [3] "10-37505419-T-C-rs136559242" "10-49904259-G-A-rs471723345"
#> [5] "1-109550832-G-A-rs209732846" "11-104555023-A-G-rs109353933"
sampleNames(set_geno) |> head()
#> [1] "R21" "Y3660" "Y3243" "R5764" "P4669" "R5452"
fData(set_geno) |> head() ## extracts features metadata
#> feature_id chromosome position
#> 1_41768691 1_41768691 1 42139849
#> 10-27008241-A-C-rs42918694 10-27008241-A-C-rs42918694 10 26971270
#> 10-37505419-T-C-rs136559242 10-37505419-T-C-rs136559242 10 37388728
#> 10-49904259-G-A-rs471723345 10-49904259-G-A-rs471723345 0 0
#> 1-109550832-G-A-rs209732846 1-109550832-G-A-rs209732846 1 108696486
#> 11-104555023-A-G-rs109353933 11-104555023-A-G-rs109353933 11 104498929
#> gen_train_score ref alt ilmn_strand
#> 1_41768691 0.6786 T G BOT
#> 10-27008241-A-C-rs42918694 0.8050 A C TOP
#> 10-37505419-T-C-rs136559242 0.7890 A G TOP
#> 10-49904259-G-A-rs471723345 0.7970 A G TOP
#> 1-109550832-G-A-rs209732846 0.8909 T C BOT
#> 11-104555023-A-G-rs109353933 0.8673 T C BOT
#> customer_strand norm_id qtl_type qtl_effect
#> 1_41768691 BOT 2 non signif. NA
#> 10-27008241-A-C-rs42918694 TOP 1 non signif. NA
#> 10-37505419-T-C-rs136559242 BOT 1 non signif. NA
#> 10-49904259-G-A-rs471723345 TOP 2 non signif. NA
#> 1-109550832-G-A-rs209732846 TOP 3 non signif. NA
#> 11-104555023-A-G-rs109353933 TOP 1 non signif. NA
#> p_value fdr
#> 1_41768691 NA NA
#> 10-27008241-A-C-rs42918694 NA NA
#> 10-37505419-T-C-rs136559242 NA NA
#> 10-49904259-G-A-rs471723345 NA NA
#> 1-109550832-G-A-rs209732846 NA NA
#> 11-104555023-A-G-rs109353933 NA NA
pData(set_geno) |> head() ## extracts samples metadata
#> id feedlot gender status day_on_feed rnaseq_batch geno_comp_1
#> R21 R21 F1 female Control 31 B2 0.007853
#> Y3660 Y3660 F1 male Control 19 B2 0.852220
#> Y3243 Y3243 F1 male Control 16 B2 0.044171
#> R5764 R5764 F2 male Control 46 B1 0.213094
#> P4669 P4669 F3 male BRD 35 B2 0.389393
#> R5452 R5452 F2 male Control 49 B1 0.265998
#> geno_comp_2 geno_comp_3 geno_comp_cluster
#> R21 0.820691 0.171456 K3
#> Y3660 0.093585 0.054195 K2
#> Y3243 0.190262 0.765567 K1
#> R5764 0.676299 0.110607 K3
#> P4669 0.503471 0.107136 K3
#> R5452 0.733992 0.000010 K3
Note that these methods can also be applied to the other types of containers.
3.5.2 Using a target factory for creating omics sets
The function create_omics_set_factory()
allows us to create several omics sets at once. It returns a list of targets, each storing one of the created omics set container. It takes as input arguments vectors that give for each omics set the arguments required by create_omics_set()
.
create_omics_set_factory(
datasets = c(data_geno, data_transcripto, data_metabo),
omics_types = c("genomics", "transcriptomics", "metabolomics"),
features_metadatas = c(fmetadata_geno, fmetadata_transcripto, fmetadata_metabo),
samples_metadatas = c(smetadata_all, smetadata_all, smetadata_all)
)
Again, the warnings raised by the function originate from discrepancies between the datasets and associated metadata. It is always good practice to double-check manually to make sure that it is not due to a typo in the IDs or similar error.
If one of the datasets has no associated features or samples metadata, use NULL
in the corresponding input arguments, e.g.:
create_omics_set_factory(
datasets = c(data_geno, data_transcripto, data_metabo),
omics_types = c("genomics", "transcriptomics", "metabolomics"),
features_metadatas = c(NULL, fmetadata_transcripto, fmetadata_metabo),
samples_metadatas = c(smetadata_all, NULL, smetadata_all)
)
The create_omics_set_factory()
function has a target_name_suffixes
argument to customise the name of the created targets. However, if this argument is not provided, the function will attempt to read the suffixes to use from the name of the dataset targets. So in this case, it knows that the suffixes to use are 'geno'
, 'transcripto'
and 'metabo'
. Consequently, the function creates the following targets: set_geno
, set_transcripto
, set_metabo
.
tar_read(set_geno)
#> SnpSet (storageMode: lockedEnvironment)
#> assayData: 23036 features, 139 samples
#> element names: call, callProbability
#> protocolData: none
#> phenoData
#> rowNames: R21 Y3660 ... O5108 (139 total)
#> varLabels: id feedlot ... geno_comp_cluster (10 total)
#> varMetadata: labelDescription
#> featureData
#> featureNames: 1_41768691 10-27008241-A-C-rs42918694 ... STAT5_13516_2
#> (23036 total)
#> fvarLabels: feature_id chromosome ... fdr (13 total)
#> fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
tar_read(set_transcripto)
#> ExpressionSet (storageMode: lockedEnvironment)
#> assayData: 20335 features, 143 samples
#> element names: exprs
#> protocolData: none
#> phenoData
#> rowNames: R9497 R5969 ... Y9816 (143 total)
#> varLabels: id feedlot ... geno_comp_cluster (10 total)
#> varMetadata: labelDescription
#> featureData
#> featureNames: ENSBTAG00000000005 ENSBTAG00000000008 ...
#> ENSBTAG00000055314 (20335 total)
#> fvarLabels: feature_id chromosome ... description (8 total)
#> fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
tar_read(set_metabo)
#> MetabolomeSet (storageMode: lockedEnvironment)
#> assayData: 55 features, 139 samples
#> element names: call
#> protocolData: none
#> phenoData
#> rowNames: R21 Y3660 ... U5416 (139 total)
#> varLabels: id feedlot ... geno_comp_cluster (10 total)
#> varMetadata: labelDescription
#> featureData
#> featureNames: HMDB00001 HMDB00008 ... HMDB01881 (55 total)
#> fvarLabels: feature_id hmdb_id ... de_status (16 total)
#> fvarMetadata: labelDescription
#> experimentData: use 'experimentData(object)'
#> Annotation:
Converting targets factory to R script
Again there not an easy way to use loops to convert this targets factory, so instead we’ll write the code for each omics dataset.
set_geno <- create_omics_set(
data_geno,
omics_type = "genomics",
features_metadata = fmetadata_geno,
samples_metadata = smetadata_all
)
set_transcripto <- create_omics_set(
data_transcripto,
omics_type = "transcriptomics",
features_metadata = fmetadata_transcripto,
samples_metadata = smetadata_all
)
set_metabo <- create_omics_set(
data_metabo,
omics_type = "metabolomics",
features_metadata = fmetadata_metabo,
samples_metadata = smetadata_all
)
3.6 Creating the multi-omics set
Finally, we can combine the different omics sets into one multi-omics set object. moiraine
makes use of the MultiDataSet
package for that. MultiDataSet
(Hernandez-Ferrer et al. 2017) implements a multi-omics data container that collects, in one R object, several omics datasets alongside their associated features and samples metadata. One of the main advantages of using a MultiDataSet
container is that we can pass all of the information associated with a set of related omics datasets with only one R object. In addition, the MultiDataSet
package implements a number of very useful functions. For example, it is possible to assess the samples that are common to several omics sets. This is particularly useful for data integration, as the moiraine
package can automatically discard samples missing from one or more datasets prior to the integration step if needed. Note that sample matching between the different omics datasets is based on sample IDs, so they must be consistent between the different datasets.
We will create the multi-omics set with the create_multiomics_set()
function. It requires a list of the omics sets (that we created via either create_omics_set()
or create_omics_set_factory()
) to include, and returns a MultiDataSet::MultiDataSet-class
object.
tar_target(
mo_set,
create_multiomics_set(
list(set_geno,
set_transcripto,
set_metabo)
)
)
tar_read(mo_set)
#> Object of class 'MultiDataSet'
#> . assayData: 3 elements
#> . snps: 23036 features, 139 samples
#> . rnaseq: 20335 features, 143 samples
#> . metabolome: 55 features, 139 samples
#> . featureData:
#> . snps: 23036 rows, 13 cols (feature_id, ..., p_value)
#> . rnaseq: 20335 rows, 8 cols (feature_id, ..., Name)
#> . metabolome: 55 rows, 16 cols (feature_id, ..., de_signif)
#> . rowRanges:
#> . snps: YES
#> . rnaseq: YES
#> . metabolome: NO
#> . phenoData:
#> . snps: 139 samples, 10 cols (id, ..., geno_comp_3)
#> . rnaseq: 143 samples, 10 cols (id, ..., geno_comp_3)
#> . metabolome: 139 samples, 10 cols (id, ..., geno_comp_3)
Within the MultiDataSet
object, each omics set is assigned a name. The name depends first on the omics container type: a SnpSet
set will be named snps
, an ExpressionSet
set will be named rnaseq
, a MetabolomeSet
will be named metabolome
and a PhenotypeSet
will be called phenotypes
. If several sets of the same type are provided, they will be assigned unique names, e.g. snps+1
and snps+2
(the +
symbol used as separator is set in the MultiDataSet
package and cannot be changed). Alternatively, we can provide custom names for the datasets, using the datasets_names
argument. These will be added to the type name (e.g. snps+customname
). For example:
tar_target(
mo_set_with_names,
create_multiomics_set(
list(set_geno,
set_transcripto,
set_metabo),
datasets_names = c("CaptureSeq", "RNAseq", "LCMS")
)
)
returns:
tar_read(mo_set_with_names)
#> Object of class 'MultiDataSet'
#> . assayData: 3 elements
#> . snps+CaptureSeq: 23036 features, 139 samples
#> . rnaseq+RNAseq: 20335 features, 143 samples
#> . metabolome+LCMS: 55 features, 139 samples
#> . featureData:
#> . snps+CaptureSeq: 23036 rows, 13 cols (feature_id, ..., p_value)
#> . rnaseq+RNAseq: 20335 rows, 8 cols (feature_id, ..., Name)
#> . metabolome+LCMS: 55 rows, 16 cols (feature_id, ..., de_signif)
#> . rowRanges:
#> . snps+CaptureSeq: YES
#> . rnaseq+RNAseq: YES
#> . metabolome+LCMS: NO
#> . phenoData:
#> . snps+CaptureSeq: 139 samples, 10 cols (id, ..., geno_comp_3)
#> . rnaseq+RNAseq: 143 samples, 10 cols (id, ..., geno_comp_3)
#> . metabolome+LCMS: 139 samples, 10 cols (id, ..., geno_comp_3)
Importantly, the create_multiomics_set()
function makes sure that samples metadata is consistent across the datasets for common samples. That is, if the same column (i.e. with the same name) is present in the samples metadata of several omics datasets, the values in this column must match for each sample present in all datasets. Otherwise, the function returns an error.
In the following chapter on Inspecting the MultiDataSet object, we will see how to handle the MultiDataSet
object we just created. Alternatively, the MultiDataSet package vignette provides examples of constructing, querying and subsetting MultiDataSet
objects.
3.7 Recap – targets list
For convenience, here is the list of targets that we created in this section:
Targets list for data import
list(
## Data import using a target factory
import_dataset_csv_factory(
files = c(
system.file("extdata/genomics_dataset.csv", package = "moiraine"),
system.file("extdata/transcriptomics_dataset.csv", package = "moiraine"),
system.file("extdata/metabolomics_dataset.csv", package = "moiraine")
),
col_ids = c("marker", "gene_id", "sample_id"),
features_as_rowss = c(TRUE, TRUE, FALSE),
target_name_suffixes = c("geno", "transcripto", "metabo")
),
## Genomics features metadata file
tar_target(
fmetadata_file_geno,
system.file("extdata/genomics_features_info.csv", package = "moiraine"),
format = "file"
),
## Genomics features metadata import
tar_target(
fmetadata_geno,
import_fmetadata_csv(
fmetadata_file_geno,
col_id = "marker",
col_types = c("chromosome" = "c")
)
),
## Metabolomics features metadata import
import_fmetadata_csv_factory(
files = c(
system.file("extdata/metabolomics_features_info.csv", package = "moiraine")
),
col_ids = c("feature_id"),
target_name_suffixes = c("metabo")
),
## Transcriptomics features metadata import
import_fmetadata_gff_factory(
files = system.file("extdata/bos_taurus_gene_model.gff3", package = "moiraine"),
feature_types = "genes",
add_fieldss = c("Name", "description"),
target_name_suffixes = "transcripto"
),
## Samples metadata import
import_smetadata_csv_factory(
files = system.file("extdata/samples_info.csv", package = "moiraine"),
col_ids = "animal_id",
target_name_suffixes = "all"
),
## Creating omics sets for each dataset
create_omics_set_factory(
datasets = c(data_geno, data_transcripto, data_metabo),
omics_types = c("genomics", "transcriptomics", "metabolomics"),
features_metadatas = c(fmetadata_geno, fmetadata_transcripto, fmetadata_metabo),
samples_metadatas = c(smetadata_all, smetadata_all, smetadata_all)
),
## Creating the MultiDataSet object
tar_target(
mo_set,
create_multiomics_set(
list(set_geno,
set_transcripto,
set_metabo)
)
)
)