Low molecular weight proteins: A challenge for post-genomic research

Kenneth E. Rudd, Ian Humphery-Smith, Valerie C. Wasinger, Amos Bairoch

Research output: Contribution to journal (Article)

46 Citations (Scopus)

Abstract

The EcoGene project involves the examination of Escherichia coli K-12 DNA sequences and accompanying annotation in the public databases in order to refine the representation and prediction of the entire set of E. coli K-12 chromosomally encoded protein sequences. The results of this ongoing effort have been deposited in the SWISSPROT protein sequence database as sequencing of the E. coli genome has progressed to completion in recent years. Through this continuing research, we have discovered that the prediction of low molecular weight (small) proteins, arbitrarily defined as protein sequences ≤ 150 amino acids (aa) in length, is problematic and requires special attention. We describe the small protein subset of EcoGene and the approach used to derive this subset from the complete E. coli genome sequence and database annotations. These E. coli proteins have helped to identify new small genes in other organisms and to identify conserved residues (motifs) using database searches and multiple alignments. Two thirds of the E. coli small proteins have not been characterized experimentally. The careful application of computer and laboratory methods to the analysis of small proteins is needed for accurate prediction, verification and characterization. The problem of accurate protein sequence identification is not limited to small proteins or to E. coli; these problems are encountered to varying degrees throughout all sequence databases.
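The abstract's working definition of a small protein (a sequence of at most 150 amino acids) can be illustrated with a short filter. This is a minimal sketch, not the EcoGene procedure: the function names `parse_fasta` and `small_proteins` and the constant `SMALL_PROTEIN_MAX_AA` are hypothetical, and the input is assumed to be FASTA-formatted protein text.

```python
# Hypothetical helper names; only the 150 aa cutoff comes from the abstract.
SMALL_PROTEIN_MAX_AA = 150


def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            current_id = header[0] if header else ""
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def small_proteins(records, max_len=SMALL_PROTEIN_MAX_AA):
    """Return the records whose sequence length is at most max_len residues.

    A trailing '*' (stop-codon marker used by some translation tools) is
    not counted as a residue.
    """
    return {
        seq_id: seq
        for seq_id, seq in records.items()
        if len(seq.rstrip("*")) <= max_len
    }
```

Applied to a proteome file, `small_proteins(parse_fasta(text))` returns only the candidates at or below the cutoff. A length filter of this kind only selects a subset; as the abstract stresses, deciding which short open reading frames encode real proteins still needs careful computational and laboratory verification.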

Original language: English
Pages (from-to): 536-544
Number of pages: 9
Journal: Electrophoresis
Volume: 19
Issue number: 4
DOI: https://doi.org/10.1002/elps.1150190413
State: Published - Apr 1 1998
Externally published: Yes

Fingerprint

Staphylococcal Protein A
Molecular Weight
Escherichia coli
Research
Databases
Proteins
Escherichia coli Proteins
Genes
Genome
Protein Databases
DNA sequences
Amino Acid Sequence
Amino Acids

Keywords

  • Bioinformatics
  • Escherichia coli
  • Post-genome research
  • Short open reading frames
  • Small proteins

ASJC Scopus subject areas

  • Clinical Biochemistry

Cite this

Rudd, K. E., Humphery-Smith, I., Wasinger, V. C., & Bairoch, A. (1998). Low molecular weight proteins: A challenge for post-genomic research. Electrophoresis, 19(4), 536-544. https://doi.org/10.1002/elps.1150190413

@article{5eafa2175da648c492e9abc55e20d2bb,
title = "Low molecular weight proteins: A challenge for post-genomic research",
keywords = "Bioinformatics, Escherichia coli, Post-genome research, Short open reading frames, Small proteins",
author = "Rudd, {Kenneth E.} and Humphery-Smith, Ian and Wasinger, {Valerie C.} and Bairoch, Amos",
year = "1998",
month = "4",
day = "1",
doi = "10.1002/elps.1150190413",
language = "English",
volume = "19",
pages = "536--544",
journal = "Electrophoresis",
issn = "0173-0835",
publisher = "Wiley-VCH Verlag",
number = "4",
}

TY - JOUR
T1 - Low molecular weight proteins
T2 - A challenge for post-genomic research
AU - Rudd, Kenneth E.
AU - Humphery-Smith, Ian
AU - Wasinger, Valerie C.
AU - Bairoch, Amos
PY - 1998/4/1
Y1 - 1998/4/1
AB - The EcoGene project involves the examination of Escherichia coli K-12 DNA sequences and accompanying annotation in the public databases in order to refine the representation and prediction of the entire set of E. coli K-12 chromosomally encoded protein sequences. The results of this ongoing effort have been deposited in the SWISSPROT protein sequence database as sequencing of the E. coli genome has progressed to completion in recent years. Through this continuing research, we have discovered that the prediction of low molecular weight (small) proteins, arbitrarily defined as protein sequences ≤ 150 amino acids (aa) in length, is problematic and requires special attention. We describe the small protein subset of EcoGene and the approach used to derive this subset from the complete E. coli genome sequence and database annotations. These E. coli proteins have helped to identify new small genes in other organisms and to identify conserved residues (motifs) using database searches and multiple alignments. Two thirds of the E. coli small proteins have not been characterized experimentally. The careful application of computer and laboratory methods to the analysis of small proteins is needed for accurate prediction, verification and characterization. The problem of accurate protein sequence identification is not limited to small proteins or to E. coli; these problems are encountered to varying degrees throughout all sequence databases.

KW - Bioinformatics
KW - Escherichia coli
KW - Post-genome research
KW - Short open reading frames
KW - Small proteins
UR - http://www.scopus.com/inward/record.url?scp=0031918278&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0031918278&partnerID=8YFLogxK
U2 - 10.1002/elps.1150190413
DO - 10.1002/elps.1150190413
M3 - Article
C2 - 9588799
AN - SCOPUS:0031918278
VL - 19
SP - 536
EP - 544
JO - Electrophoresis
JF - Electrophoresis
SN - 0173-0835
IS - 4
ER -