Protein Family Classification from Scratch: A CNN Based Deep Learning Approach

Research output: Contribution to journal › Article › peer-review

2 Scopus citations


Next-generation sequencing techniques provide us with an opportunity for generating sequenced proteins and identifying the biological families and functions of these proteins. However, compared with identified proteins, uncharacterized proteins make up a notable percentage of all proteins in the bioinformatics research field. Traditional family classification methods often focus on extracting N-gram features from sequences while ignoring motif information as well as affinity information between motifs and adjacent amino acids. Previous clustering-based algorithms have typically been used to define protein features with domain knowledge and to annotate protein families based on extensive data samples. In this paper, we apply CNN-based amino acid representation learning with a limited number of characterized proteins to explore classification performance on annotated protein families, taking into account amino acid location information. Additionally, we apply the method to all reviewed protein sequences, with their families, retrieved from the UniProt database to evaluate our approach. Last but not least, we verify our model using unreviewed protein records, which are typically ignored by other methods.
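
The contrast the abstract draws between N-gram features and location-aware convolution can be illustrated with a small sketch. This is not the paper's model: the function names, the toy sequences, and the hand-built filter are all hypothetical, and a real CNN would learn its filter weights from data rather than copy them from a known motif. The sketch only shows that a bag of N-grams discards where a motif occurs, while a 1D convolution over one-hot encoded residues keeps that position.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def ngram_counts(seq, n=2):
    """Bag of n-grams: counts overlapping n-mers and discards their positions."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def one_hot(seq):
    """Encode each residue as a 20-dimensional indicator vector, in sequence order."""
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = []
    for aa in seq:
        row = [0.0] * len(AMINO_ACIDS)
        row[index[aa]] = 1.0
        rows.append(row)
    return rows


def conv1d(encoded, kernel):
    """Slide one filter (width x 20 weights) along the sequence: one score per start position."""
    width = len(kernel)
    scores = []
    for start in range(len(encoded) - width + 1):
        total = 0.0
        for offset in range(width):
            total += sum(w * x for w, x in zip(kernel[offset], encoded[start + offset]))
        scores.append(total)
    return scores


def motif_kernel(motif):
    """Hypothetical hand-built filter (an assumption): scores highest on an exact motif match."""
    return one_hot(motif)


# Two toy sequences (illustrative only) that contain the same 2-grams in a different order.
seq_a = "ACADA"
seq_b = "ADACA"

same_bag = ngram_counts(seq_a) == ngram_counts(seq_b)

kernel = motif_kernel("AD")
scores_a = conv1d(one_hot(seq_a), kernel)
scores_b = conv1d(one_hot(seq_b), kernel)

# Position of the strongest filter response in each sequence.
peak_a = scores_a.index(max(scores_a))
peak_b = scores_b.index(max(scores_b))

print(same_bag, scores_a, scores_b, peak_a, peak_b)
```

Here the two sequences are indistinguishable by their 2-gram counts, yet the filter response peaks at different positions, which is the kind of location information a convolutional model can exploit.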

Original language: English (US)
Pages (from-to): 1996-2007
Number of pages: 12
Journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics
Issue number: 5
State: Published - 2021


  • convolutional neural network
  • feature engineering
  • Protein family classification

ASJC Scopus subject areas

  • Biotechnology
  • Genetics
  • Applied Mathematics

