Batch correction evaluation framework using a-priori gene-gene associations: Applied to the GTEx dataset

Judith Somekh, Shai S. Shen-Orr, Isaac S. Kohane

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects will result in some biologically meaningful signals being lost. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher.

Results: We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach is based on comparing co-expression of adjusted gene-gene pairs to a-priori knowledge of highly confident gene-gene associations based on thousands of unrelated experiments derived from an external reference. Our framework includes three steps: (1) data adjustment with the desired methods, (2) calculating gene-gene co-expression measurements for adjusted datasets, and (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissue datasets derived from the GTEx project.

Conclusions: Our framework enables the evaluation of batch correction methods to better preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
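The three steps named in the abstract can be illustrated with a minimal Python sketch. This is not the B-CeF R package or its API: the function names (`regress_out`, `coexpression_scores`, `auc`) are hypothetical, the adjustment shown is plain multiple linear regression on known confounders, co-expression is taken as absolute Pearson correlation, and AUC is assumed here as the performance measure against the gold standard; the paper's exact choices may differ.

```python
import numpy as np


def regress_out(expr, covariates):
    """Step 1 (assumed form): remove known confounders by multiple linear regression.

    expr: samples x genes matrix; covariates: samples x k matrix of known confounders.
    Returns the residuals of each gene after fitting an intercept plus the covariates.
    """
    expr = np.asarray(expr, dtype=float)
    covariates = np.asarray(covariates, dtype=float).reshape(len(expr), -1)
    design = np.column_stack([np.ones(len(expr)), covariates])
    beta, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ beta


def coexpression_scores(expr):
    """Step 2 (assumed form): absolute Pearson correlation for every gene pair.

    Returns the scores together with the row and column gene indices of each pair.
    """
    corr = np.corrcoef(expr, rowvar=False)
    rows, cols = np.triu_indices(corr.shape[0], k=1)
    return np.abs(corr[rows, cols]), rows, cols


def auc(scores, is_known_pair):
    """Step 3 (assumed metric): probability that a known pair outscores an unknown one.

    Uses a direct pairwise comparison, which is quadratic in memory and
    only suitable for small illustrative inputs.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(is_known_pair, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    wins = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (wins + 0.5 * ties) / (pos.size * neg.size)
```

Under these assumptions, one would run `regress_out` (or any other correction method) on an expression matrix, pass the result to `coexpression_scores`, mark which gene pairs appear in the external reference, and compare the resulting `auc` values across correction methods.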

Original language: English
Article number: 268
Journal: BMC Bioinformatics
Volume: 20
Issue number: 1
DOIs
State: Published - 28 May 2019

Bibliographical note

Funding Information:
The work was supported by the IMOS (Israel Ministry of Science) Yitzhak Shamir Fellowship to JS. The funding source had no role in the design of the study, analysis and interpretation of data and in writing the manuscript.

Publisher Copyright:
© 2019 The Author(s).

Keywords

  • Batch correction
  • Batch effect
  • ComBat
  • GTEx
  • Gene expression
  • Principal component analysis

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics
