Sequence similarity analysis of Escherichia coli proteins: Functional and evolutionary implications

Eugene V. Koonin, Roman L. Tatusov, Kenneth E. Rudd

Research output: Contribution to journal (article)

89 Citations (Scopus)

Abstract

A computer analysis of 2328 protein sequences comprising about 60% of the Escherichia coli gene products was performed using methods for database screening with individual sequences and alignment blocks. A high fraction of E. coli proteins (86%) shows significant sequence similarity to other proteins in current databases; about 70% show conservation at least at the level of distantly related bacteria, and about 40% contain ancient conserved regions (ACRs) shared with eukaryotic or Archaeal proteins. For >90% of the E. coli proteins, either functional information or sequence similarity, or both, are available. Forty-six percent of the E. coli proteins belong to 299 clusters of paralogs (intraspecies homologs) defined on the basis of pairwise similarity. Another 10% could be included in 70 superclusters using motif detection methods. The majority of the clusters contain only two to four members. In contrast, nearly 25% of all E. coli proteins belong to the four largest superclusters, namely permeases, ATPases and GTPases with the conserved 'Walker-type' motif, helix-turn-helix regulatory proteins, and NAD(FAD)-binding proteins. We conclude that bacterial protein sequences generally are highly conserved in evolution, with about 50% of all ACR-containing protein families represented among the E. coli gene products. With the current sequence databases and methods of their screening, computer analysis yields useful information on the functions and evolutionary relationships of the vast majority of genes in a bacterial genome. Sequence similarity with E. coli proteins allows the prediction of functions for a number of important eukaryotic genes, including several whose products are implicated in human diseases.
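The abstract says that clusters of paralogs were defined on the basis of pairwise similarity but does not spell out the grouping rule. As a rough illustration only, the sketch below assumes single-linkage grouping: any two proteins joined by a significant pairwise hit end up in the same cluster. The function name, the protein identifiers, and the input pairs are hypothetical and are not taken from the paper; the authors' actual procedure may differ.

```python
from collections import defaultdict


def cluster_paralogs(similar_pairs):
    """Group protein IDs into clusters by single linkage (an assumed rule).

    similar_pairs: iterable of (id_a, id_b) tuples, each representing one
    pairwise similarity judged significant by some upstream search.
    Returns a list of sorted clusters, largest first.
    """
    parent = {}

    def find(x):
        # Register unseen IDs as their own root, then walk up to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = defaultdict(set)
    for protein in list(parent):
        clusters[find(protein)].add(protein)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical input: P1~P2 and P2~P3 link three proteins; P4~P5 link two.
hypothetical_pairs = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
clusters = cluster_paralogs(hypothetical_pairs)
print(clusters)
```

Single linkage is the most permissive choice: one hit is enough to merge two groups, so the clusters it yields depend strongly on the significance threshold applied to the pairwise comparisons.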

Original language: English
Pages (from-to): 11921-11925
Number of pages: 5
Journal: Proceedings of the National Academy of Sciences of the United States of America
Volume: 92
Issue number: 25
DOI: 10.1073/pnas.92.25.11921
State: Published - Dec 5 1995
Externally published: Yes

Fingerprint

Escherichia coli Proteins
Sequence Analysis
Databases
Genes
Archaeal Proteins
Helix-Turn-Helix Motifs
Escherichia coli
Bacterial Genomes
Proteins
Flavin-Adenine Dinucleotide
Bacterial Proteins
Membrane Transport Proteins
Sequence Alignment
GTP Phosphohydrolases
Protein Sequence Analysis
NAD
Adenosine Triphosphatases
Carrier Proteins
Bacteria

Keywords

  • ancient conserved regions
  • Escherichia coli genome
  • paralogous protein clusters
  • protein sequence similarity

ASJC Scopus subject areas

  • Genetics
  • General

Cite this

Sequence similarity analysis of Escherichia coli proteins: Functional and evolutionary implications. / Koonin, Eugene V.; Tatusov, Roman L.; Rudd, Kenneth E.

In: Proceedings of the National Academy of Sciences of the United States of America, Vol. 92, No. 25, 05.12.1995, p. 11921-11925.

@article{aa3c35b33a2543f68bf3ac1c9798ca84,
title = "Sequence similarity analysis of Escherichia coli proteins: Functional and evolutionary implications",
keywords = "ancient conserved regions, Escherichia coli genome, paralogous protein clusters, protein sequence similarity",
author = "Koonin, {Eugene V.} and Tatusov, {Roman L.} and Rudd, {Kenneth E.}",
year = "1995",
month = "12",
day = "5",
doi = "10.1073/pnas.92.25.11921",
language = "English",
volume = "92",
pages = "11921--11925",
journal = "Proceedings of the National Academy of Sciences of the United States of America",
issn = "0027-8424",
number = "25",
}

TY - JOUR

T1 - Sequence similarity analysis of Escherichia coli proteins

T2 - Functional and evolutionary implications

AU - Koonin, Eugene V.

AU - Tatusov, Roman L.

AU - Rudd, Kenneth E.

PY - 1995/12/5

Y1 - 1995/12/5

KW - ancient conserved regions

KW - Escherichia coli genome

KW - paralogous protein clusters

KW - protein sequence similarity

UR - http://www.scopus.com/inward/record.url?scp=0029559311&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0029559311&partnerID=8YFLogxK

U2 - 10.1073/pnas.92.25.11921

DO - 10.1073/pnas.92.25.11921

M3 - Article

C2 - 8524875

AN - SCOPUS:0029559311

VL - 92

SP - 11921

EP - 11925

JO - Proceedings of the National Academy of Sciences of the United States of America

JF - Proceedings of the National Academy of Sciences of the United States of America

SN - 0027-8424

IS - 25

ER -