Analysis Match

Automatically discover other IPA Core Analyses with similar (or opposite) biological results as compared to yours, to help confirm your interpretation of the results or to provide unexpected insights into underlying shared biological mechanisms.

Analysis Match automatically compares your analysis against other analyses you have created (in your Project Manager) as well as thousands of other human and mouse expression analyses curated from public sources. This “analysis-to-analysis” matching is based on shared patterns of Canonical Pathways, Upstream Regulators, Causal Networks, and Diseases and Functions.
 

Source of matching analyses:

The analyses included in Analysis Match were generated in IPA from more than 49,000 highly curated and quality-controlled human and mouse disease and oncology datasets re-processed from SRA, GEO, ArrayExpress, TCGA, LINCS and more. These datasets were generated by QIAGEN’s recently acquired company, OmicSoft, and are the “comparisons” found in DiseaseLand and OncoLand. They represent various contrasts, such as disease vs. normal, treatment vs. non-treatment and many more. Matches against your own analyses, analyses shared with you, and IPA's example analyses are also returned in Analysis Match.


Analysis Match results appear in a new tab

Analysis Match is presented in a new tab at the right side of the IPA Core Analysis window. If you have licensed the feature, it will be populated with a table that ranks other analyses you have access to versus the one you opened. 

By default, the analyses are ranked from most similar to least similar to the one you opened, based on overall similarity score. The analyses are matched based on a set of signatures that are created for each analysis, namely for Canonical Pathways, Upstream Regulators, Causal Networks, and Diseases and Functions. Each signature is used independently to match against other analyses. See the section "How signatures are created and compared" below for more detail on how the signatures are created and scored.

The Analysis Match tab is shown below for an analysis of the transcriptome of kidney tissue from mice treated with the NRF2 (NFE2L2) activating chemical CDDO-Me, expressed as a ratio to DMSO-treated controls (PMID 26422507).

User-added image

The results above have been filtered to show only the most strongly similar and dissimilar analyses based on the overall z-score percentage. Each of the first four colored columns shows the percentage similarity of one type of signature to the analysis you opened. Fuchsia indicates similarity and cyan indicates dissimilarity. The first scoring column (“CP”) is the match for the Canonical Pathway signature, the second (“UR”) is for Upstream Regulators, the third (“CN”) is for Causal Networks and the last (“DE”) is for Downstream Effects (i.e. Diseases and Functions). The final fuchsia and cyan column is the average of those four signature matches. The white and purple columns to the right of the z-score columns display the Fisher's exact test p-value for each of the signature matches.


Filtering your results

You can filter the results by any of the columns by clicking on the funnel icon at the top of the column and entering numbers or text. In the case of the z-score columns, a cutoff value you enter is treated as an absolute value. For example, if you enter 50, the results will be filtered to those with score >50 or <-50.

You can limit the results to certain (or all) OmicSoft "Lands" and/or any of your own projects.
Click on the Project filter funnel, then click on one or more Lands to select a subset, or click the OmicSoft icon to select all Lands at once.
Or switch to wild card searches using the radio button.
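
The absolute-value behavior of the z-score cutoff described above can be illustrated with a short Python sketch. This is not IPA code and IPA exposes no such function; the row layout, the column name "overall" and the function name filter_by_abs_score are hypothetical, chosen only to show that a cutoff of 50 keeps scores >50 or <-50.

```python
def filter_by_abs_score(rows, column, cutoff):
    """Keep rows whose score in `column` is > cutoff or < -cutoff.

    Hypothetical helper: mimics how a z-score column cutoff is described
    as being treated as an absolute value.
    """
    threshold = abs(cutoff)
    return [
        row for row in rows
        if row.get(column) is not None and abs(row[column]) > threshold
    ]


# Made-up example rows (not real Analysis Match results).
matches = [
    {"analysis": "A", "overall": 72.5},
    {"analysis": "B", "overall": -61.0},
    {"analysis": "C", "overall": 12.3},
    {"analysis": "D", "overall": 50.0},
]

strong = filter_by_abs_score(matches, "overall", 50)
print([m["analysis"] for m in strong])
```

With these made-up rows, both the strongly similar analysis (A) and the strongly dissimilar one (B) are kept, while a score of exactly 50 is excluded because the comparison is strict.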

User-added image

You can also filter on metadata that has been captured about each analysis. Each analysis has been annotated with the species, the type of comparison (Disease vs. Normal, Treatment vs. Control etc.) and much more. The full list of metadata columns that can be added / removed using the Customize Table menu is attached to this article.

For example, below is the Analysis Match tab with columns added and removed, and filtered on the "sampledatamode" field to limit to only RNA-seq analyses.

User-added image

Note that unlike this example, sometimes there are no Canonical Pathway signature matches. This is due to the relative sparsity of z-scores for Canonical Pathways.

The next step is to view the underlying details of the matches with a heatmap. Please see Related-analyses-heatmap for details.

 

The OmicSoft Dataset and Analysis Repository in IPA

This section describes what datasets are in the repository in IPA.
 

Scope of the repository

The OmicSoft repository is organized into several project folders in IPA:

DiseaseLand
  • HumanDisease
  • LINCS
  • MouseDisease
  • RatDisease
OncoLand
  • Hematology
  • MetastaticCancer
  • OncoGEO
  • Pediatrics
  • TCGA
  
Breakout of analyses (as of July 2019)

DiseaseLand

HumanDisease (8,683)
  • 505 diseases
  • 249 tissues
  • 65 expression platforms
  • 1,428 RNA-seq datasets

MouseDisease (10,218)
  • 326 diseases
  • 219 tissues
  • 56 expression platforms
  • 4,002 RNA-seq datasets

RatDisease (802)
  • 36 diseases
  • 59 tissues
  • 325 RNA-seq datasets

LINCS (28,234)
  • 23 cell lines
  • 374 chemical treatments or gene overexpression
  • 226 different targets (or groups of target genes)

OncoLand

OncoGEO (2,306)
  • 140 cancers
  • 72 tissues
  • 41 expression platforms
  • 391 RNA-seq datasets

TCGA (4,789)
  • 33 cancers
  • 27 tissues
  • 385 different mutational status / clinical signs

Pediatrics (444)
  • 47 cancers
  • 23 tissues

MetastaticCancer (81)
  • 27 cancers
  • 18 tissues

Hematology (1,013)
  • 36 cancers
  • 10 tissues

How OmicSoft datasets were analyzed in IPA

OmicSoft completely re-processes, normalizes, quality-checks and annotates data from public repositories. The resulting datasets derive from a number of different experimental designs, cell types, tissues, array platforms and RNA-seq technologies. Because the repository is so diverse, it was not feasible to apply one, or even a small set of, standard cutoffs across it. Therefore, IPA uses the following strategy to obtain a fairly uniform set of analysis-ready genes for each dataset.

User-added image

To "mark" analysis-ready genes, the datasets have been given a value of 1 in the Expr Other column for each analysis-ready gene. This appears as an up arrow in IPA when viewing the dataset, but is not treated as up-regulated by IPA. 

The repository contains over 56,000 datasets, with the majority containing ~1000 analysis-ready genes. A subset has fewer than that because fewer than 1000 genes passed the p-value <0.01* cutoff. In each case, the reference set was assigned to the complete dataset, meaning both the analysis-ready genes and all other genes in the dataset. The repository will be updated quarterly.

*Note that for LINCS, the p-value threshold was set to 0.05 rather than 0.01.
 

How signatures are created and compared

After the analysis is created, IPA creates a set of up to 4 signatures for the analysis. The maximum size of each signature is shown in parentheses.
  • Canonical Pathways (up to 20 pathways)
  • Upstream Regulators (up to 100 regulators)
  • Causal Networks (up to 100 master regulators)
  • Diseases & Functions (up to 100 diseases or functions)
Each signature is created as described in the illustration below:

User-added image

Not every analysis has enough significant entities of each type to form a full signature for each. For example, there may only be 6 Canonical Pathways with significant z-scores for a particular analysis, and so for that analysis the Canonical Pathway Signature would only contain 6 entities (i.e. 6 pathways). 
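
The size limits above can be sketched in Python as follows. This is only an illustration of the "up to N entities" rule, not IPA's implementation: it assumes the significant entities have already been selected and ranked (the criteria are given in the illustration and are not modeled here), and the names SIGNATURE_CAPS and build_signature are hypothetical.

```python
# Maximum signature sizes as listed in this article.
SIGNATURE_CAPS = {
    "Canonical Pathways": 20,
    "Upstream Regulators": 100,
    "Causal Networks": 100,
    "Diseases & Functions": 100,
}


def build_signature(signature_type, ranked_entities):
    """Return at most the capped number of entities for a signature type.

    Assumption: `ranked_entities` already holds only the significant
    entities, in whatever order IPA ranks them (not modeled here).
    """
    cap = SIGNATURE_CAPS[signature_type]
    return list(ranked_entities[:cap])


# Made-up inputs: 6 significant pathways and 150 significant regulators.
pathways = [f"pathway_{i}" for i in range(1, 7)]
regulators = [f"regulator_{i}" for i in range(1, 151)]

print(len(build_signature("Canonical Pathways", pathways)))
print(len(build_signature("Upstream Regulators", regulators)))
```

An analysis with only 6 significant pathways yields a 6-entity Canonical Pathway signature, while 150 significant regulators are truncated to the 100-entity limit.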


Scoring of signature against other analyses 

IPA computes a z-score for the match of the "query" signature against the signatures of all other analyses as shown:

User-added image

That "raw" z-score is a hidden-by-default column in the Analysis Match tab. To make the score more useful, IPA normalizes the score by computing the maximum possible z-score for the query signature. This is the best match a signature could possibly have -- a match to itself:

User-added image

The actual match (the raw z-score) is then expressed as a percentage of that maximum. For example, a very strongly matching signature might have a raw z-score that is 80% of the maximum, whereas a weakly matching signature might have a raw z-score that is 20% of the maximum.
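
The normalization step can be sketched in Python as below. How the raw and maximum z-scores themselves are computed is given in the illustrations and is not reproduced here, so the numeric inputs are made up and the function names percent_of_max and overall_score are hypothetical. The averaging function reflects the earlier description of the final score column as the average of the four signature matches; how IPA handles a signature with no match is not described in this article and is not modeled.

```python
def percent_of_max(raw_z, max_z):
    """Express a raw match z-score as a percentage of the query's maximum.

    `max_z` stands for the z-score of the query signature matched to
    itself, which is assumed to be positive.
    """
    if max_z <= 0:
        raise ValueError("max_z must be positive")
    return 100.0 * raw_z / max_z


def overall_score(percentages):
    """Average the per-signature percentages (CP, UR, CN, DE)."""
    return sum(percentages) / len(percentages)


# Made-up values for illustration only.
print(percent_of_max(8.0, 10.0))    # strong match
print(percent_of_max(2.0, 10.0))    # weak match
print(percent_of_max(-5.0, 10.0))   # dissimilar (opposite) match
print(overall_score([80.0, 60.0, 40.0, 20.0]))
```

A negative percentage corresponds to a dissimilar (opposite) match, which is why the z-score filters treat the cutoff as an absolute value.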
 


You can also use the repository without your own analysis, just by searching for available analyses of interest.


The OmicSoft datasets and analyses are stored in the IPA Library, and you can use Dataset and Analysis Search to quickly find analyses of interest. Note that they are read-only and cannot be downloaded out of IPA.

User-added image
 

The image below shows a search for human asthma analyses, excluding those involving albuterol. From search results like these, you can double-click to open an analysis, or select up to 20 to visualize in a full comparison analysis.

User-added image
 

If Analysis Match is not currently active on your license, please contact your local QIAGEN customer solutions manager or AdvancedGenomicsSupport@qiagen.com for details on how to get access.