TY - JOUR
T1 - Bounds on identification of genome evolution pacemakers
AU - Snir, Sagi
N1 - Funding Information:
We thank Eugene Koonin and Yuri Wolf for the inspiring question, and Ilan Newman and Nick Harvey for helpful discussions. Part of this study was done while the author was visiting the NIH, United States. The authors wish to acknowledge the Israel Science Foundation (ISF) for its kind support in doing this research.
Publisher Copyright:
© Copyright 2019, Mary Ann Liebert, Inc., publishers.
PY - 2019/8
Y1 - 2019/8
N2 - Several studies have pointed out that the tight correlation between genes' evolutionary rates is better explained by a model called the Universal PaceMaker (UPM) than by the simple rate constancy posited by the classical molecular clock (MC) hypothesis. Under UPM, each gene is associated with a single pacemaker (PM) and varies its evolutionary rate according to that PM's ticks. Hence, the relative rates of all genes associated with the same PM remain nearly constant, whereas the absolute rates can change arbitrarily according to the PM's ticks. A natural follow-up question is how to find the gene-PM association from the gene sequence data alone. This, however, turns out to be a nontrivial task that is affected by the number of variables, their random noise, and the amount of available information. To this end, a clustering heuristic was devised that exploits the correlation between corresponding edge lengths across thousands of gene trees. Nevertheless, no theoretical study of the relationship between these parameters has been done. Here we study this question by providing theoretical bounds, expressed in terms of the system parameters, on the probabilities of positive and negative results. We corroborate these results with a simulation study that reveals the critical role of the variances.
AB - Several studies have pointed out that the tight correlation between genes' evolutionary rates is better explained by a model called the Universal PaceMaker (UPM) than by the simple rate constancy posited by the classical molecular clock (MC) hypothesis. Under UPM, each gene is associated with a single pacemaker (PM) and varies its evolutionary rate according to that PM's ticks. Hence, the relative rates of all genes associated with the same PM remain nearly constant, whereas the absolute rates can change arbitrarily according to the PM's ticks. A natural follow-up question is how to find the gene-PM association from the gene sequence data alone. This, however, turns out to be a nontrivial task that is affected by the number of variables, their random noise, and the amount of available information. To this end, a clustering heuristic was devised that exploits the correlation between corresponding edge lengths across thousands of gene trees. Nevertheless, no theoretical study of the relationship between these parameters has been done. Here we study this question by providing theoretical bounds, expressed in terms of the system parameters, on the probabilities of positive and negative results. We corroborate these results with a simulation study that reveals the critical role of the variances.
KW - Chernoff bounds
KW - DNA sequence evolution
KW - chi square distribution
KW - probabilistic geometrical clustering
UR - http://www.scopus.com/inward/record.url?scp=85066863555&partnerID=8YFLogxK
U2 - 10.1089/cmb.2018.0178
DO - 10.1089/cmb.2018.0178
M3 - Article
C2 - 30676086
AN - SCOPUS:85066863555
SN - 1066-5277
VL - 26
SP - 806
EP - 821
JO - Journal of Computational Biology
JF - Journal of Computational Biology
IS - 8
ER -