Irrelevant attributes and imbalanced classes in multi-label text-categorization domains

Sareewan Dendamrongvit, Peerapon Vateekul, Miroslav Kubat

Research output: Contribution to journal › Article

19 Scopus citations

Abstract

An interesting issue in machine learning is induction in multi-label domains where each example can be labeled with two or more classes at the same time. In a work focusing on text categorization, we followed the most commonly used approach and induced a binary classifier for each class. Analyzing the results, we noticed that performance had been impaired by two factors. First, in text domains, each class is characterized by a different set of attributes; an appropriate attribute-selection technique thus has to be applied separately to each of them. Second, the individual classes often have to be induced from imbalanced training sets, a circumstance we addressed here by majority-class undersampling. The paper provides details of the induction system and reports the results of systematic experimentation.
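The per-class pipeline the abstract describes can be sketched roughly as follows: one binary problem per class, majority-class undersampling for that problem, and attribute selection repeated separately for each class. This is only an illustrative sketch, not the authors' system. The function names, the toy documents, and the attribute score (absolute difference in document frequency between positive and negative examples) are assumptions made here for illustration; the abstract does not say which attribute-selection criterion or base classifier the paper uses.

```python
import random
from collections import Counter


def binary_targets(label_sets, label):
    """Turn multi-label annotations into 0/1 targets for a single class."""
    return [1 if label in labels else 0 for labels in label_sets]


def undersample_majority(docs, targets, rng):
    """Randomly drop majority-class examples until both classes are equal in size."""
    pos = [i for i, t in enumerate(targets) if t == 1]
    neg = [i for i, t in enumerate(targets) if t == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [docs[i] for i in kept], [targets[i] for i in kept]


def select_attributes(docs, targets, k):
    """Keep the k words whose document frequency differs most between classes.

    The score is a stand-in chosen for this sketch, not the paper's criterion.
    """
    n_pos = sum(targets)
    n_neg = len(targets) - n_pos
    pos_df, neg_df = Counter(), Counter()
    for doc, t in zip(docs, targets):
        (pos_df if t else neg_df).update(set(doc))
    vocab = set(pos_df) | set(neg_df)

    def score(word):
        return abs(pos_df[word] / max(n_pos, 1) - neg_df[word] / max(n_neg, 1))

    # Ties are broken alphabetically so the result is deterministic.
    return sorted(vocab, key=lambda w: (-score(w), w))[:k]


def prepare_per_class(docs, label_sets, k=2, seed=0):
    """Build a balanced training set and a separate attribute set for every class."""
    rng = random.Random(seed)
    labels = sorted(set().union(*label_sets))
    prepared = {}
    for label in labels:
        targets = binary_targets(label_sets, label)
        bal_docs, bal_targets = undersample_majority(docs, targets, rng)
        attributes = select_attributes(bal_docs, bal_targets, k)
        prepared[label] = (attributes, bal_docs, bal_targets)
    return prepared


# Toy multi-label corpus (hypothetical data): each document is a list of words.
docs = [
    ["goal", "match", "team"],
    ["election", "vote"],
    ["team", "vote", "stadium"],
    ["market", "stocks"],
    ["stocks", "vote"],
    ["match", "goal"],
]
label_sets = [
    {"sports"},
    {"politics"},
    {"sports", "politics"},
    {"finance"},
    {"finance", "politics"},
    {"sports"},
]

prepared = prepare_per_class(docs, label_sets, k=2, seed=0)
for label, (attributes, bal_docs, bal_targets) in prepared.items():
    print(label, attributes, bal_targets)
```

Each entry of `prepared` could then be handed to any binary learner, with documents encoded over that class's own attribute list. Keeping the attribute lists separate is what lets different classes rely on different vocabularies.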

Original language: English (US)
Pages (from-to): 843-859
Number of pages: 17
Journal: Intelligent Data Analysis
Volume: 15
Issue number: 6
State: Published - Dec 1 2011

Keywords

  • imbalanced classes
  • induction
  • multi-label examples
  • text categorization

ASJC Scopus subject areas

  • Artificial Intelligence
  • Theoretical Computer Science
  • Computer Vision and Pattern Recognition
