TY - GEN
T1 - Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains
AU - Dendamrongvit, Sareewan
AU - Kubat, Miroslav
PY - 2010/9/30
Y1 - 2010/9/30
N2 - Text categorization is an important application domain of multi-label classification where each document can simultaneously belong to more than one class. The most common approach is to address the problem of multi-label examples by inducing a separate binary classifier for each class, and then use these classifiers in parallel. What the information-retrieval community has all but ignored, however, is that such classifiers are almost always induced from highly imbalanced training sets. The study reported in this paper shows how taking this aspect into consideration with a majority-class undersampling we used here can indeed improve classification performance as measured by criteria common in text categorization: macro/micro precision, recall, and F1. We also show how a slight modification of an older undersampling technique helps further improve the results.
AB - Text categorization is an important application domain of multi-label classification where each document can simultaneously belong to more than one class. The most common approach is to address the problem of multi-label examples by inducing a separate binary classifier for each class, and then use these classifiers in parallel. What the information-retrieval community has all but ignored, however, is that such classifiers are almost always induced from highly imbalanced training sets. The study reported in this paper shows how taking this aspect into consideration with a majority-class undersampling we used here can indeed improve classification performance as measured by criteria common in text categorization: macro/micro precision, recall, and F1. We also show how a slight modification of an older undersampling technique helps further improve the results.
UR - http://www.scopus.com/inward/record.url?scp=77957042586&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77957042586&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-14640-4_4
DO - 10.1007/978-3-642-14640-4_4
M3 - Conference contribution
AN - SCOPUS:77957042586
SN - 3642146392
SN - 9783642146398
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 40
EP - 52
BT - New Frontiers in Applied Data Mining - PAKDD 2009 International Workshops, Revised Selected Papers
T2 - 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2009
Y2 - 27 April 2009 through 30 April 2009
ER -