Detection of new genes in a bacterial genome using Markov models for three gene classes

M. Borodovsky, J. D. McIninch, E. V. Koonin, K. E. Rudd, C. Medigue, A. Danchin

Research output: Contribution to journal (Article)

110 Citations (Scopus)

Abstract

We further investigated the statistical features of the three classes of Escherichia coli genes that have been previously delineated by factorial correspondence analysis and dynamic clustering methods. A phased Markov model for a nucleotide sequence of each gene class was developed and employed for gene prediction using the GeneMark program. The protein-coding region prediction accuracy was determined for class-specific Markov models of different orders when the programs implementing these models were applied to gene sequences from the same or other classes. It is shown that at least two training sets and two program versions derived for different classes of E. coli genes are necessary to achieve a high accuracy of coding region prediction for uncharacterized sequences. Some annotated E. coli genes from Class I and Class III are shown to be spurious, whereas many open reading frames (ORFs) that have not been annotated in GenBank as genes are predicted to encode proteins. The amino acid sequences of the putative products of these ORFs initially did not show similarity to already known proteins. However, conserved regions have been identified in several of them by screening the latest entries in protein sequence databases and applying methods for motif search, while some others of these new genes have been identified in independent experiments.
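The idea of a phased (three-periodic) Markov model can be illustrated with a short Python sketch. This is a simplified illustration under stated assumptions, not the GeneMark implementation: the function names (`train_markov`, `log_likelihood`, `coding_score`), the pseudocount smoothing, the uniform fallback for unseen contexts, and the toy training sequences are all hypothetical choices made for the example. A coding model keeps a separate transition table for each codon position, a non-coding model keeps a single table, and a sequence is scored by the log-likelihood ratio between the two, maximised over the three reading frames.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(seqs, order=1, period=3, pseudocount=1.0):
    """Estimate P(base | previous `order` bases, position mod `period`).

    period=3 gives a phased model for in-frame coding sequences;
    period=1 gives an ordinary homogeneous model (used here for non-coding DNA).
    """
    counts = [defaultdict(lambda: defaultdict(float)) for _ in range(period)]
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[i % period][seq[i - order:i]][seq[i]] += 1.0

    model = []
    for phase in range(period):
        table = {}
        for context, nxt in counts[phase].items():
            total = sum(nxt.values()) + pseudocount * len(ALPHABET)
            table[context] = {
                b: (nxt.get(b, 0.0) + pseudocount) / total for b in ALPHABET
            }
        model.append(table)
    return model


def log_likelihood(seq, model, order=1, frame=0):
    """Log-probability of `seq` given the model, with codons starting at `frame`."""
    period = len(model)
    uniform = 1.0 / len(ALPHABET)  # assumption: fallback for unseen contexts
    total = 0.0
    for i in range(order, len(seq)):
        probs = model[(i - frame) % period].get(seq[i - order:i])
        p = probs.get(seq[i], uniform) if probs else uniform
        total += math.log(p)
    return total


def coding_score(seq, coding_model, noncoding_model, order=1):
    """Best coding vs. non-coding log-likelihood ratio over the three frames."""
    null = log_likelihood(seq, noncoding_model, order)
    return max(
        log_likelihood(seq, coding_model, order, frame=f) - null for f in range(3)
    )


if __name__ == "__main__":
    # Toy training data, purely illustrative (not real E. coli genes).
    coding = train_markov(["GCTGCTGCTGCTGCTGCT"], order=1, period=3)
    noncoding = train_markov(["ATATATATATATATAT"], order=1, period=1)
    print(coding_score("GCTGCTGCTGCT", coding, noncoding))
    print(coding_score("ATATATATATAT", coding, noncoding))
```

In the spirit of the class-specific setup described in the abstract, one such coding model could be trained per gene class and a sequence scored against each of them; how the actual program combines the classes is not specified here, so that step is left out of the sketch.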

Original language: English
Pages (from-to): 3554-3562
Number of pages: 9
Journal: Nucleic Acids Research
Volume: 23
Issue number: 17
State: Published - Jan 1 1995
Externally published: Yes

Fingerprint

Bacterial Genomes
Markov Model
Genome
Genes
Gene
Open Reading Frames
Escherichia coli
Proteins
Protein
Prediction
MHC Class I Genes
Coding
Protein Databases
Correspondence Analysis
Nucleic Acid Databases
Class
Factorial
Protein Sequence

ASJC Scopus subject areas

  • Genetics
  • Statistics, Probability and Uncertainty
  • Applied Mathematics
  • Health, Toxicology and Mutagenesis
  • Toxicology
  • Genetics (clinical)

Cite this

Borodovsky, M., McIninch, J. D., Koonin, E. V., Rudd, K. E., Medigue, C., & Danchin, A. (1995). Detection of new genes in a bacterial genome using Markov models for three gene classes. Nucleic Acids Research, 23(17), 3554-3562.

@article{6bce4425206e4aa6819584e3781a5f12,
title = "Detection of new genes in a bacterial genome using Markov models for three gene classes",
abstract = "We further investigated the statistical features of the three classes of Escherichia coli genes that have been previously delineated by factorial correspondence analysis and dynamic clustering methods. A phased Markov model for a nucleotide sequence of each gene class was developed and employed for gene prediction using the GeneMark program. The protein-coding region prediction accuracy was determined for class-specific Markov models of different orders when the programs implementing these models were applied to gene sequences from the same or other classes. It is shown that at least two training sets and two program versions derived for different classes of E. coli genes are necessary to achieve a high accuracy of coding region prediction for uncharacterized sequences. Some annotated E. coli genes from Class I and Class III are shown to be spurious, whereas many open reading frames (ORFs) that have not been annotated in GenBank as genes are predicted to encode proteins. The amino acid sequences of the putative products of these ORFs initially did not show similarity to already known proteins. However, conserved regions have been identified in several of them by screening the latest entries in protein sequence databases and applying methods for motif search, while some others of these new genes have been identified in independent experiments.",
author = "M. Borodovsky and McIninch, {J. D.} and Koonin, {E. V.} and Rudd, {K. E.} and C. Medigue and A. Danchin",
year = "1995",
month = "1",
day = "1",
language = "English",
volume = "23",
pages = "3554--3562",
journal = "Nucleic Acids Research",
issn = "0305-1048",
publisher = "Oxford University Press",
number = "17",

}

TY - JOUR

T1 - Detection of new genes in a bacterial genome using Markov models for three gene classes

AU - Borodovsky, M.

AU - McIninch, J. D.

AU - Koonin, E. V.

AU - Rudd, K. E.

AU - Medigue, C.

AU - Danchin, A.

PY - 1995/1/1

Y1 - 1995/1/1

AB - We further investigated the statistical features of the three classes of Escherichia coli genes that have been previously delineated by factorial correspondence analysis and dynamic clustering methods. A phased Markov model for a nucleotide sequence of each gene class was developed and employed for gene prediction using the GeneMark program. The protein-coding region prediction accuracy was determined for class-specific Markov models of different orders when the programs implementing these models were applied to gene sequences from the same or other classes. It is shown that at least two training sets and two program versions derived for different classes of E. coli genes are necessary to achieve a high accuracy of coding region prediction for uncharacterized sequences. Some annotated E. coli genes from Class I and Class III are shown to be spurious, whereas many open reading frames (ORFs) that have not been annotated in GenBank as genes are predicted to encode proteins. The amino acid sequences of the putative products of these ORFs initially did not show similarity to already known proteins. However, conserved regions have been identified in several of them by screening the latest entries in protein sequence databases and applying methods for motif search, while some others of these new genes have been identified in independent experiments.

UR - http://www.scopus.com/inward/record.url?scp=0029056051&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0029056051&partnerID=8YFLogxK

M3 - Article

C2 - 7567469

AN - SCOPUS:0029056051

VL - 23

SP - 3554

EP - 3562

JO - Nucleic Acids Research

JF - Nucleic Acids Research

SN - 0305-1048

IS - 17

ER -