Difference: CompendiumDB (1 vs. 42)

Revision 42
Changes from r40 to r42
Line: 1 to 1
 
META TOPICPARENT name="SystemsGenomicsBioLab"
Changed:
<
<

compendiumdb: Tools for retrieval and storage of functional genomics data

>
>

compendiumdb: tools for retrieval and storage of functional genomics data

 

Public repositories such as the Gene Expression Omnibus (GEO) contain thousands of high-throughput functional genomics datasets. These datasets are a rich source of useful biological information. Extraction of meaningful information often requires the integration of a large number of datasets from different studies and platforms. The R package compendiumdb provides a flexible platform for the systematic retrieval and storage of functional genomics data downloaded from GEO in the form of a MySQL database accessed via R functions. It provides functions to (i) download data from GEO, (ii) store data in the database and (iii) retrieve data from the database. The R package is available at CRAN. Hands-on information on how to install the package and its basic functionality is provided in the package vignette. Here, we provide more detailed information on the different file types that the package downloads from GEO, the structure of the BigMac directory used to store the downloaded data and the entity relationship schema of the MySQL database.
Revision 40
Changes from r38 to r40
Line: 1 to 1
 
META TOPICPARENT name="SystemsGenomicsBioLab"
Changed:
<
<

compendiumdb: a database and R package for storing and analyzing gene expression data

Introduction

>
>

compendiumdb: a database and R package for storing and analyzing gene expression data

U.K.Nandal(at)amc.uva.nl, P.D. Moerland(at)amc.uva.nl, A.H.C. Kampen(at)amc.uva.nl

Public repositories such as the Gene Expression Omnibus (GEO) contain thousands of microarray experiment datasets. These datasets are a rich source of useful biological information. Extraction of meaningful information often requires the integration of a large number of datasets from different microarray studies and platforms. The package compendiumdb provides a flexible platform for the systematic collection, storage and retrieval of gene expression data downloaded from GEO in the form of a MySQL database accessed via R functions. It provides functions to (a) download data from GEO and load data into the database, (b) store data in the database and (c) retrieve data from the database. The R package is available at CRAN. This page explains the different file types that the package downloads from GEO, the entity relationship schema of the MySQL database and the structure of the BigMac directory.
 
Changed:
<
<
Public repositories such as the Gene Expression Omnibus (GEO) contain thousands of microarray experiment datasets. These datasets are a rich source of useful biological information. Extraction of meaningful information often requires the integration of a large number of datasets from different microarray studies and platforms. The package compendiumdb provides a flexible platform for the systematic collection, storage and retrieval of gene expression data downloaded from GEO in the form of a MySQL database accessed via R functions. It provides functions to (a) download data from GEO and load data into the database, (b) store data in the database and (c) retrieve data from the database. The R package is available at CRAN. Below, this page provides detailed information on the different file types that the package downloads from GEO and also explains the entity relationship schema of the MySQL database and the structure of the BigMac directory.
>
>

Contents

 
Changed:
<
<

What is GEO ?

>
>

Gene Expression Omnibus (GEO)

 

GEO (Gene Expression Omnibus) is a curated database for high-throughput data from single- and dual-channel microarray-based experiments that measure mRNA, miRNA, genomic DNA and proteins. GEO organizes these data into the following types of records:
  1. Platform record (GPL): describes the properties of the microchip, e.g. cDNA or oligonucleotide probesets. Each platform has a unique identifier (GPLxxx).
Line: 19 to 21
 
  1. Series record (GSE): links a number of related individual samples together and provides a description of the whole study, the data obtained, the analysis and the conclusions. Each series has a unique identifier (GSExxx).
Fig 1 below shows the interrelation among these three types of records.
Changed:
<
<
A platform record (GPL) contains the mapping of each spot on the platform to its corresponding gene. The submitter submits a GPLxxxx file to GEO to provide this mapping information. However, the annotation of spots changes over time, so for some platforms GEO curates and provides a GPLxxxx.annot file with recent probe annotation. These platform annotation files can then be used to map the probes to their corresponding genes.
>
>
A platform record (GPL) contains the mapping of each spot on the platform to its corresponding gene. The submitter submits a GPLxxxx file to GEO to provide this mapping information. However, the annotation of spots changes over time, so for some platforms GEO curates and provides a GPLxxxx.annot file with recent probe annotation. These platform annotation files can then be used to map the probes to their corresponding genes.
 

A sample record (GSM) usually holds, together with many other data, the level of fluorescence in each spot. Sometimes it holds other measurements instead.

Records in GEO are manually curated. The summary of the experiment supplied by the submitter is stored in GEO as a GSExxx record. GSE records that are processed using the same platform and are biologically and statistically comparable are analyzed and reassembled by GEO staff into GDSxxx records (GEO DataSets). GSE records that are not comparable with other GSE records are not assembled into a GDS, hence some records do not have a GDS number in GEO. Annotation means that samples are classified in groups: for example, some samples are labeled “normal” and some “diseased”.
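
The four record types described above can be told apart by the prefix of their identifier. The following is a minimal Python sketch of that idea, not part of compendiumdb: the function name classify_accession and the record-type labels are assumptions made for illustration, and the pattern only covers the GPLxxx, GSMxxx, GSExxx and GDSxxx forms mentioned on this page.

```python
import re

# Accession prefix -> record type, following the description in the text.
# The label strings are illustrative choices, not GEO or compendiumdb terms.
RECORD_TYPES = {
    "GPL": "platform",
    "GSM": "sample",
    "GSE": "series",
    "GDS": "dataset",
}

# A known prefix followed by one or more digits, e.g. "GSE1234".
_ACCESSION = re.compile(r"^(GPL|GSM|GSE|GDS)(\d+)$")


def classify_accession(accession):
    """Return (record_type, numeric_id) for a GEO identifier.

    Raises ValueError if the string does not look like one of the
    four identifier forms described above.
    """
    match = _ACCESSION.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised GEO accession: {accession!r}")
    prefix, number = match.groups()
    return RECORD_TYPES[prefix], int(number)
```

A helper like this could be used to decide, for a mixed list of identifiers, which ones refer to series and which to platforms before anything is downloaded.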
Changed:
<
<
>
>
 

Fig 1. General schema of GEO. A single platform can be related to more than one sample record, but not vice versa. Comparable GSE records holding the information of related samples (GSM) are grouped into a GDS by manual curation. A sample record (GSM) can be part of more than one GSE record, as shown by the dashed red arrow. GEO does not provide a GDS record for GSEs that have no comparable set.
Changed:
<
<

Why compendiumdb?

>
>

Advantage of compendiumdb

  Accessing and analyzing GEO data has several shortcomings.

  • First, data from GEO must be downloaded manually to the local system before it can be analyzed. This becomes more complicated when the analysis involves multiple datasets. Organizing these data files in local folders is a further problem when the files are very large.

Line: 38 to 40
 

compendiumdb enables building a domain-specific compendium of expression datasets via a flexible and homogeneous framework in the form of a MySQL database. All the data is stored in the form of MySQL tables. The compendiumdb package consists of a number of R functions developed to access the database either locally or remotely. The database schema has been designed to be rich enough to store most of the information provided by MIAME-compliant expression databases such as GEO.
Deleted:
<
<

Structure of BigMac directory

The BigMac directory is created by the downloadGEOdata function of the package. It systematically stores the .soft files of GSEs, GSMs, GPLs and GDSs downloaded from GEO in their respective directories, as shown in the figure below.

 

Relationship diagram of compendium database tables

Changed:
<
<
>
>
 

The schema of the compendiumdb MySQL database can be divided into four parts:
  1. Description of the Experiment
Line: 77 to 75
 

Expression data

expressionset: The compendiumdb R package creates a Bioconductor ExpressionSet object containing the expression data and other assay data of the experiment. The binary file of the R object is then split into parts and stored in the "expressionset" table.
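
The split-and-store idea described above can be sketched in Python as follows. This only illustrates the general technique of cutting a serialized object into numbered parts and reassembling it; it is not the package's actual R implementation, and the function names and the part size are hypothetical.

```python
import pickle


def split_blob(blob, part_size):
    """Cut a bytes object into (part_index, chunk) pairs of at most part_size bytes."""
    if part_size <= 0:
        raise ValueError("part_size must be positive")
    return [
        (start // part_size, blob[start:start + part_size])
        for start in range(0, len(blob), part_size)
    ]


def join_parts(parts):
    """Reassemble the original bytes from (part_index, chunk) pairs in any order."""
    ordered = sorted(parts, key=lambda part: part[0])
    return b"".join(chunk for _, chunk in ordered)


# Stand-in for a serialized expression object (hypothetical toy data).
toy_object = {"GSM1": [1.0, 2.5], "GSM2": [0.3, 4.1]}
blob = pickle.dumps(toy_object)

# Each (index, chunk) pair would correspond to one row in a parts table.
rows = split_blob(blob, part_size=16)
restored = pickle.loads(join_parts(reversed(rows)))
```

Storing the part index alongside each chunk means the rows can be read back in any order and still be joined correctly.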
Changed:
<
<

Files

>
>

Structure of BigMac directory

The BigMac directory is created by the downloadGEOdata function of the package. It systematically stores the .soft files of GSEs, GSMs, GPLs and GDSs downloaded from GEO in their respective directories, as shown in the figure below.
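
Since the figure is not reproduced in this view, the following Python sketch illustrates what "respective directories" could look like. The subdirectory names (one folder per record-type prefix) and the function name plan_layout are assumptions for illustration only; the actual layout created by downloadGEOdata may differ.

```python
from pathlib import PurePosixPath

# Record-type prefixes named in the text above.
KNOWN_PREFIXES = ("GSE", "GSM", "GPL", "GDS")


def plan_layout(root, filenames):
    """Map each .soft file name to a path under a per-record-type folder.

    Assumed (hypothetical) layout: <root>/<PREFIX>/<filename>.
    """
    layout = {}
    for name in filenames:
        prefix = name[:3].upper()
        if prefix not in KNOWN_PREFIXES:
            raise ValueError(f"unexpected file name: {name!r}")
        layout[name] = str(PurePosixPath(root) / prefix / name)
    return layout
```

For example, under this assumed layout a file named GSE1234.soft would be planned at BigMac/GSE/GSE1234.soft.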

Documentation

Comprehensive documentation is available that describes how to install the package and illustrates its basic functionality.
 

META TOPICMOVED by="PerryMoerland" date="1347628597" from="BioLab.CompendiumAnalysis" to="BioLab.CompendiumDB"
Revision 20
Changes from r18 to r20
Line: 1 to 1
 
META TOPICPARENT name="SystemsGenomicsBioLab"
Deleted:
<
<
 
Added:
>
>

compendiumdb: a database and R package for storing and analyzing gene expression data

 

Introduction

Changed:
<
<
Public repositories such as Gene Expression Omnibus (GEO) contain thousands of microarray experiment datasets. These datasets are a rich source of useful biological information. Extraction of meaningful information often requires the integration of a large number of datasets from different microarray studies and platforms. The package compendiumdb provides a flexible platform for the systematic collection, storage and retrieval of gene expression data downloaded from GEO in the form of a MySQL database accessed via R functions. It provides functions to (a) download data from GEO and load data into the database, (b) store data in the database and (c) retrieve data from the database.
>
>
Public repositories such as the Gene Expression Omnibus (GEO) contain thousands of microarray experiment datasets. These datasets are a rich source of useful biological information. Extraction of meaningful information often requires the integration of a large number of datasets from different microarray studies and platforms. The package compendiumdb provides a flexible platform for the systematic collection, storage and retrieval of gene expression data downloaded from GEO in the form of a MySQL database accessed via R functions. It provides functions to (a) download data from GEO and load data into the database, (b) store data in the database and (c) retrieve data from the database. The R package is available at CRAN. Below, this page provides detailed information on the different file types that the package downloads from GEO and also explains the entity relationship schema of the MySQL database and the structure of the BigMac directory.
 

What is GEO ?

Line: 36 to 36
 
  • Second, the SOFT format that is typical for GEO records cannot be easily manipulated with analysis software such as R, Rosetta or Spotfire, and GEO itself does not provide broad options for data analysis.

  • Third, probes or spots on the array are not always mapped properly to their corresponding genes, or gene information is missing for some spots in the platform files provided by GEO, i.e. GPLxxx. Improper platform annotation hampers streamlined analysis of the data as well as cross-platform or cross-species comparisons.

Changed:
<
<
compendiumdb enables building a domain-specific compendium of expression datasets via a flexible and homogeneous framework in the form of a MySQL database. All the data is stored in the form of MySQL tables. The compendiumdb package consists of a number of R functions developed to access the database either locally or remotely. The database schema has been designed to be rich enough to store most of the information provided by MIAME-compliant expression databases such as GEO.
>
>
compendiumdb enables building a domain-specific compendium of expression datasets via a flexible and homogeneous framework in the form of a MySQL database. All the data is stored in the form of MySQL tables. The compendiumdb package consists of a number of R functions developed to access the database either locally or remotely. The database schema has been designed to be rich enough to store most of the information provided by MIAME-compliant expression databases such as GEO.
 
Changed:
<
<

Structure of BigMac directory

The BigMac directory is created by the downloadGEOdata function of the package. It systematically stores the .soft files of GSEs, GSMs, GPLs and GDSs downloaded from GEO in their respective directories, as shown in the figure below.
>
>

Structure of BigMac directory

The BigMac directory is created by the downloadGEOdata function of the package. It systematically stores the .soft files of GSEs, GSMs, GPLs and GDSs downloaded from GEO in their respective directories, as shown in the figure below.
 

Relationship diagram of compendium database tables

Changed:
<
<
The schema of the compendiumdb MySQL database can be divided into four parts:
>
>
The schema of the compendiumdb MySQL database can be divided into four parts:
 
  1. Description of the Experiment
  2. Sample information
  3. Platform information
Line: 75 to 75
 
  1. organism: Organism data are also imported from the GPL SOFT file, in the same way as for the tables described above in this section.

Expression data

Changed:
<
<
expressionset: The compendiumdb R package creates a Bioconductor ExpressionSet object containing the expression data and other assay data of the experiment. The binary file of the R object is then split into parts and stored in the "expressionset" table.
>
>
expressionset: The compendiumdb R package creates a Bioconductor ExpressionSet object containing the expression data and other assay data of the experiment. The binary file of the R object is then split into parts and stored in the "expressionset" table.
 

Files

Revision 18
Changes from r16 to r18
Line: 1 to 1
 
META TOPICPARENT name="SystemsGenomicsBioLab"
Deleted:
<
<
Medical Bioinformatics