Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains

Sareewan Dendamrongvit, Miroslav Kubat

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

12 Scopus citations

Abstract

Text categorization is an important application domain of multi-label classification, where each document can simultaneously belong to more than one class. The most common approach addresses multi-label examples by inducing a separate binary classifier for each class and then using these classifiers in parallel. What the information-retrieval community has all but ignored, however, is that such classifiers are almost always induced from highly imbalanced training sets. The study reported in this paper shows that taking this aspect into consideration through majority-class undersampling can indeed improve classification performance as measured by criteria common in text categorization: macro/micro precision, recall, and F1. We also show how a slight modification of an older undersampling technique helps further improve the results.
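
The two ideas named in the abstract, one binary classifier per class trained on an undersampled majority class, and macro versus micro averaging, can be illustrated with a minimal sketch. This is not the authors' technique: it swaps in plain random undersampling, and the function names, the `ratio` parameter, and the example counts are all hypothetical choices made for illustration.

```python
import random
from typing import Dict, Hashable, List, Sequence, Set, Tuple


def binary_training_set(
    docs: Sequence[Hashable],
    labels: Sequence[Set[str]],
    target: str,
    ratio: float = 1.0,
    seed: int = 0,
) -> Tuple[List[Hashable], List[Hashable]]:
    """Build a one-vs-rest training set for `target`.

    Documents labeled with `target` are positives; all others are negatives.
    Negatives (usually the majority class) are randomly undersampled so that
    at most `ratio` negatives are kept per positive. Random undersampling is
    an assumption here, not the method evaluated in the paper.
    """
    rng = random.Random(seed)
    positives = [d for d, ls in zip(docs, labels) if target in ls]
    negatives = [d for d, ls in zip(docs, labels) if target not in ls]
    keep = min(len(negatives), max(1, int(ratio * len(positives))))
    return positives, rng.sample(negatives, keep)


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true positives, false positives, and false negatives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float]:
    """Return (macro-F1, micro-F1) from per-class (tp, fp, fn) counts.

    Macro averages the per-class F1 values; micro pools the counts over all
    classes first and then computes a single F1.
    """
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return macro, f1(tp, fp, fn)


if __name__ == "__main__":
    # Hypothetical corpus: 100 documents, only 5 carry the label "rare".
    docs = list(range(100))
    labels = [{"rare", "common"} if i < 5 else {"common"} for i in docs]
    pos, neg = binary_training_set(docs, labels, "rare", ratio=2.0)
    print(len(pos), len(neg))

    # Hypothetical (tp, fp, fn) counts for one frequent and one rare class.
    print(macro_micro_f1({"frequent": (90, 10, 10), "rare": (1, 0, 9)}))
```

With the hypothetical counts above, the pooled micro-F1 is dominated by the frequent class, while macro-F1 gives the poorly recognized rare class equal weight. That difference is why both averages are usually reported together in text categorization.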

Original language: English (US)
Title of host publication: New Frontiers in Applied Data Mining - PAKDD 2009 International Workshops, Revised Selected Papers
Pages: 40-52
Number of pages: 13
DOIs: https://doi.org/10.1007/978-3-642-14640-4_4
State: Published - Sep 30 2010
Event: 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2009 - Bangkok, Thailand
Duration: Apr 27 2009 to Apr 30 2009

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5669 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2009
Country: Thailand
City: Bangkok
Period: 4/27/09 to 4/30/09

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

  • Cite this

    Dendamrongvit, S., & Kubat, M. (2010). Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains. In New Frontiers in Applied Data Mining - PAKDD 2009 International Workshops, Revised Selected Papers (pp. 40-52). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5669 LNAI). https://doi.org/10.1007/978-3-642-14640-4_4