Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains

Sareewan Dendamrongvit, Miroslav Kubat

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

10 Citations (Scopus)

Abstract

Text categorization is an important application domain of multi-label classification, where each document can belong to more than one class simultaneously. The most common approach to multi-label examples is to induce a separate binary classifier for each class and then use these classifiers in parallel. What the information-retrieval community has all but ignored, however, is that such classifiers are almost always induced from highly imbalanced training sets. The study reported in this paper shows how taking this aspect into account through majority-class undersampling can indeed improve classification performance as measured by criteria common in text categorization: macro/micro precision, recall, and F1. We also show how a slight modification of an older undersampling technique further improves the results.
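The majority-class undersampling the abstract refers to can be illustrated with a minimal sketch. In a one-vs-rest setup, each label's binary training set is dominated by negative (majority) examples; random undersampling discards negatives until the classes are roughly balanced. The helper below is hypothetical illustration only, not the authors' code, and the paper's exact scheme (and its modified variant) may differ:

```python
import random

def undersample_majority(X, y, ratio=1.0, seed=0):
    """Randomly drop negative (majority-class) examples so that at most
    ratio * len(positives) negatives remain.  Positives are kept intact."""
    rng = random.Random(seed)
    pos = [(x, label) for x, label in zip(X, y) if label == 1]
    neg = [(x, label) for x, label in zip(X, y) if label == 0]
    keep = min(len(neg), int(ratio * len(pos)))
    neg = rng.sample(neg, keep)          # random subset of the majority class
    data = pos + neg
    rng.shuffle(data)
    xs, ys = zip(*data)
    return list(xs), list(ys)

# Toy one-vs-rest scenario: 100 "documents", only 10% positive for this label.
docs = list(range(100))
labels = [1 if d < 10 else 0 for d in docs]
bx, by = undersample_majority(docs, labels, ratio=1.0)  # balanced 10 vs 10
```

In a full multi-label pipeline this rebalancing would be applied per label, producing a differently undersampled training set for each binary classifier.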

Original language: English
Title of host publication: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Pages: 40-52
Number of pages: 13
Volume: 5669 LNAI
ISBN: 3642146392, 9783642146398
DOI: 10.1007/978-3-642-14640-4_4
State: Published - Sep 30 2010
Event: 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2009 - Bangkok, Thailand
Duration: Apr 27 2009 - Apr 30 2009

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5669 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 13th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2009
Country: Thailand
City: Bangkok
Period: 4/27/09 - 4/30/09

ASJC Scopus subject areas

  • Computer Science (all)
  • Theoretical Computer Science

Cite this

Dendamrongvit, S., & Kubat, M. (2010). Undersampling approach for imbalanced training sets and induction from multi-label text-categorization domains. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5669 LNAI, pp. 40-52). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5669 LNAI). https://doi.org/10.1007/978-3-642-14640-4_4
