Batch correction evaluation framework using a-priori gene-gene associations: Applied to the GTEx dataset

Judith Somekh, Shai S. Shen-Orr, Isaac S. Kohane

Research output: Contribution to journal › Article › peer-review

Abstract

Background: Correcting a heterogeneous dataset that presents artefacts from several confounders is often an essential bioinformatics task. Attempting to remove these batch effects can cause some biologically meaningful signal to be lost. Thus, a central challenge is assessing whether the removal of unwanted technical variation harms the biological signal that is of interest to the researcher.

Results: We describe a novel framework, B-CeF, to evaluate the effectiveness of batch correction methods and their tendency toward over- or under-correction. The approach compares the co-expression of gene-gene pairs in the adjusted data to a-priori knowledge of high-confidence gene-gene associations, derived from an external reference based on thousands of unrelated experiments. Our framework includes three steps: (1) adjusting the data with the desired methods; (2) calculating gene-gene co-expression measurements for the adjusted datasets; and (3) evaluating the performance of the co-expression measurements against a gold standard. Using the framework, we evaluated five batch correction methods applied to RNA-seq data of six representative tissue datasets derived from the GTEx project.

Conclusions: Our framework enables batch correction methods to be evaluated by how well they preserve the original biological signal. We show that using a multiple linear regression model to correct for known confounders outperforms factor analysis-based methods that estimate hidden confounders. The code is publicly available as an R package.
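The three steps listed in the Results can be illustrated with a small, self-contained sketch. This is not the authors' R package or its API. The function names (`regress_out`, `coexpression`, `auc_against_reference`), the synthetic data, and the choice of ROC AUC as the score are all assumptions made for illustration, since the abstract does not name the evaluation metric. The adjustment step uses ordinary least squares on a known confounder, in the spirit of the multiple linear regression approach mentioned in the Conclusions.

```python
import numpy as np

def regress_out(expr, covariates):
    """Step 1 (assumed form): remove known confounders from each gene by OLS.
    expr is samples x genes, covariates is samples x k."""
    design = np.column_stack([np.ones(expr.shape[0]), covariates])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    # Subtract only the covariate contribution; keep each gene's intercept.
    return expr - design[:, 1:] @ coef[1:]

def coexpression(expr):
    """Step 2: absolute Pearson correlation for every gene pair."""
    return np.abs(np.corrcoef(expr, rowvar=False))

def auc_against_reference(coexpr, reference_pairs):
    """Step 3 (assumed metric): ROC AUC, treating reference pairs as
    positives and all other gene pairs as negatives."""
    i, j = np.triu_indices(coexpr.shape[0], k=1)
    scores = coexpr[i, j]
    ref = {tuple(sorted(p)) for p in reference_pairs}
    labels = np.array([(int(a), int(b)) in ref for a, b in zip(i, j)])
    # Rank-based AUC (Mann-Whitney U); ties are ignored, which is
    # acceptable for continuous scores.
    ranks = scores.argsort().argsort() + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic example: genes 0-1 and genes 2-3 share a biological factor;
# a known batch variable shifts every gene with a gene-specific sign.
rng = np.random.default_rng(0)
n = 300
z1, z2 = rng.normal(size=n), rng.normal(size=n)
biology = np.column_stack([
    z1 + 0.3 * rng.normal(size=n),
    z1 + 0.3 * rng.normal(size=n),
    z2 + 0.3 * rng.normal(size=n),
    z2 + 0.3 * rng.normal(size=n),
    rng.normal(size=n),
    rng.normal(size=n),
])
batch = (np.arange(n) % 2).astype(float).reshape(-1, 1)
loadings = np.array([1.0, -1.0, 1.0, -1.0, 1.0, 1.0])
raw = biology + 3.0 * batch * loadings

reference = [(0, 1), (2, 3)]  # stand-in for the a-priori gold standard
auc_before = auc_against_reference(coexpression(raw), reference)
auc_after = auc_against_reference(
    coexpression(regress_out(raw, batch)), reference
)
print(f"AUC before: {auc_before:.2f}, after: {auc_after:.2f}")
```

In this toy setting the batch variable masks the reference pairs before adjustment, so a higher score after adjustment indicates that the correction recovered the known associations. Comparing several correction methods would amount to swapping out `regress_out` and comparing the resulting scores.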

Original language: English
Article number: 268
Journal: BMC Bioinformatics
Volume: 20
Issue number: 1
State: Published - 28 May 2019

Bibliographical note

Publisher Copyright:
© 2019 The Author(s).

Keywords

  • Batch correction
  • Batch effect
  • ComBat
  • GTEx
  • Gene expression
  • Principal component analysis

ASJC Scopus subject areas

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics
