TY - GEN
T1 - Automatic discovery of bioluminescent proteins from large protein databases
AU - Meng, Tao
AU - Shyu, Mei Ling
AU - Zhang, Hua
PY - 2013/12/1
Y1 - 2013/12/1
AB - Accurate annotation of protein features is becoming increasingly important for enriching gene ontology databases. In this work, we present a framework to predict the bioluminescence of any given protein sequence. Bioluminescent proteins are produced by living organisms and emit light naturally. Bioluminescence is thought to serve several functions in living organisms, including camouflage, attraction of prey, and communication. In addition, bioluminescent proteins are widely used as labels in assay development, reporters of gene expression, and imaging agents in biotechnology. Currently, bioluminescent proteins are mainly curated by researchers through experimental analysis, which is a time-consuming process. Data-mining-based algorithms, however, provide an efficient way to detect candidate bioluminescent proteins and to prioritize the experimental work. While traditional alignment-based algorithms (such as BLAST) show promising results in sequence analysis, they suffer from the limitation that the test sequence must show homology to sequences in the available training data sets. To overcome this limitation, our proposed framework uses a set of homology-independent features that are extracted directly from the primary sequences to represent the global physicochemical properties as well as the sequence-order characteristics of proteins. In addition, a novel subspace-based data filtering algorithm is proposed to eliminate noise from the training data. One existing framework addressing the same problem was implemented and compared with our proposed framework. The experimental results indicate that our proposed framework shows promising performance. Moreover, the proposed framework is generic and could easily be applied to the annotation of other protein properties.
KW - Bioluminescence
KW - Classification
KW - Lasso
KW - Subspace-based filtering
UR - http://www.scopus.com/inward/record.url?scp=84893971344&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84893971344&partnerID=8YFLogxK
U2 - 10.1109/ICSC.2013.67
DO - 10.1109/ICSC.2013.67
M3 - Conference contribution
AN - SCOPUS:84893971344
SN - 9780769551197
T3 - Proceedings - 2013 IEEE 7th International Conference on Semantic Computing, ICSC 2013
SP - 355
EP - 362
BT - Proceedings - 2013 IEEE 7th International Conference on Semantic Computing, ICSC 2013
T2 - 2013 IEEE 7th International Conference on Semantic Computing, ICSC 2013
Y2 - 16 September 2013 through 18 September 2013
ER -