Multimodal deep representation learning for video classification

Haiman Tian, Yudong Tao, Samira Pouyanfar, Shu Ching Chen, Mei-Ling Shyu

Research output: Contribution to journal › Article

Abstract

Real-world applications usually encounter data with various modalities, each containing valuable information. To enhance these applications, it is essential to effectively analyze all information extracted from the different data modalities, yet most existing learning models ignore some data types and focus on a single modality. This paper presents a new multimodal deep learning framework for event detection from videos that leverages recent advances in deep neural networks. First, several deep learning models are utilized to extract useful information from multiple modalities: pre-trained Convolutional Neural Networks (CNNs) for visual and audio feature extraction and a word embedding model for textual analysis. Then, a novel fusion technique is proposed that integrates the different data representations at two levels, namely the frame level and the video level. Unlike existing multimodal learning algorithms, the proposed framework can reason about a missing data type using the other available data modalities. The framework is applied to a new video dataset containing natural disaster classes. The experimental results illustrate its effectiveness compared to single-modality deep learning models as well as conventional fusion techniques: the final accuracy is improved by more than 16% and 7% over the best single-modality and fusion baselines, respectively.
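
For readers who want a concrete picture of the two-level fusion described in the abstract, the following PyTorch sketch illustrates the general idea under stated assumptions: per-frame visual and audio features (e.g., from pre-trained CNNs) and a video-level text embedding are assumed to be precomputed, and the class name, feature dimensions, number of classes, and the concrete fusion operators (concatenation, temporal averaging, and a zero vector for a missing text modality) are illustrative placeholders rather than the paper's actual architecture.

# Minimal PyTorch sketch of a two-level (frame-level, then video-level)
# multimodal fusion classifier. All dimensions, layer choices, and the
# handling of a missing text modality are illustrative assumptions.
import torch
import torch.nn as nn


class TwoLevelFusion(nn.Module):
    def __init__(self, visual_dim=2048, audio_dim=128, text_dim=300,
                 hidden_dim=256, num_classes=7):
        super().__init__()
        self.text_dim = text_dim
        # Frame-level fusion: combine per-frame visual and audio features.
        self.frame_fusion = nn.Sequential(
            nn.Linear(visual_dim + audio_dim, hidden_dim),
            nn.ReLU(),
        )
        # Video-level fusion: combine the temporally aggregated frame
        # representation with the video-level text embedding.
        self.video_fusion = nn.Sequential(
            nn.Linear(hidden_dim + text_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, visual, audio, text=None):
        # visual: (batch, frames, visual_dim); audio: (batch, frames, audio_dim)
        frames = self.frame_fusion(torch.cat([visual, audio], dim=-1))
        video = frames.mean(dim=1)  # average over frames
        if text is None:
            # Crude stand-in for reasoning about a missing modality:
            # fall back to a zero vector so classification still proceeds.
            text = video.new_zeros(video.size(0), self.text_dim)
        return self.video_fusion(torch.cat([video, text], dim=-1))


# Example with random stand-ins for CNN frame features, audio features,
# and an averaged word embedding of the video's transcript.
model = TwoLevelFusion()
logits = model(torch.randn(2, 30, 2048), torch.randn(2, 30, 128),
               torch.randn(2, 300))
print(logits.shape)  # torch.Size([2, 7])

The frame-level stage fuses modalities within each frame, while the video-level stage aggregates over time and folds in the textual representation, mirroring the two fusion levels named in the abstract.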

Original language: English (US)
Pages (from-to): 1-17
Number of pages: 17
Journal: World Wide Web
DOIs: https://doi.org/10.1007/s11280-018-0548-3
State: Accepted/In press - May 3, 2018

Keywords

  • Disaster management system
  • Multi-stage fusion
  • Multimodal deep learning
  • Transfer learning

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Multimodal deep representation learning for video classification. / Tian, Haiman; Tao, Yudong; Pouyanfar, Samira; Chen, Shu Ching; Shyu, Mei-Ling.

In: World Wide Web, 03.05.2018, p. 1-17.

Research output: Contribution to journal › Article

Tian, Haiman; Tao, Yudong; Pouyanfar, Samira; Chen, Shu Ching; Shyu, Mei-Ling. / Multimodal deep representation learning for video classification. In: World Wide Web. 2018; pp. 1-17.
@article{1123380c0d14466197d743fd41555e46,
title = "Multimodal deep representation learning for video classification",
abstract = "Real-world applications usually encounter data with various modalities, each containing valuable information. To enhance these applications, it is essential to effectively analyze all information extracted from different data modalities, while most existing learning models ignore some data types and only focus on a single modality. This paper presents a new multimodal deep learning framework for event detection from videos by leveraging recent advances in deep neural networks. First, several deep learning models are utilized to extract useful information from multiple modalities. Among these are pre-trained Convolutional Neural Networks (CNNs) for visual and audio feature extraction and a word embedding model for textual analysis. Then, a novel fusion technique is proposed that integrates different data representations in two levels, namely frame-level and video-level. Different from the existing multimodal learning algorithms, the proposed framework can reason about a missing data type using other available data modalities. The proposed framework is applied to a new video dataset containing natural disaster classes. The experimental results illustrate the effectiveness of the proposed framework compared to some single modal deep learning models as well as conventional fusion techniques. Specifically, the final accuracy is improved more than 16{\%} and 7{\%} compared to the best results from single modality and fusion models, respectively.",
keywords = "Disaster management system, Multi-stage fusion, Multimodal deep learning, Transfer learning",
author = "Haiman Tian and Yudong Tao and Samira Pouyanfar and Chen, {Shu Ching} and Mei-Ling Shyu",
year = "2018",
month = "5",
day = "3",
doi = "10.1007/s11280-018-0548-3",
language = "English (US)",
pages = "1--17",
journal = "World Wide Web",
issn = "1386-145X",
publisher = "Springer New York",

}

TY - JOUR

T1 - Multimodal deep representation learning for video classification

AU - Tian, Haiman

AU - Tao, Yudong

AU - Pouyanfar, Samira

AU - Chen, Shu Ching

AU - Shyu, Mei-Ling

PY - 2018/5/3

Y1 - 2018/5/3

AB - Real-world applications usually encounter data with various modalities, each containing valuable information. To enhance these applications, it is essential to effectively analyze all information extracted from different data modalities, while most existing learning models ignore some data types and only focus on a single modality. This paper presents a new multimodal deep learning framework for event detection from videos by leveraging recent advances in deep neural networks. First, several deep learning models are utilized to extract useful information from multiple modalities. Among these are pre-trained Convolutional Neural Networks (CNNs) for visual and audio feature extraction and a word embedding model for textual analysis. Then, a novel fusion technique is proposed that integrates different data representations in two levels, namely frame-level and video-level. Different from the existing multimodal learning algorithms, the proposed framework can reason about a missing data type using other available data modalities. The proposed framework is applied to a new video dataset containing natural disaster classes. The experimental results illustrate the effectiveness of the proposed framework compared to some single modal deep learning models as well as conventional fusion techniques. Specifically, the final accuracy is improved more than 16% and 7% compared to the best results from single modality and fusion models, respectively.

KW - Disaster management system

KW - Multi-stage fusion

KW - Multimodal deep learning

KW - Transfer learning

UR - http://www.scopus.com/inward/record.url?scp=85046399048&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85046399048&partnerID=8YFLogxK

U2 - 10.1007/s11280-018-0548-3

DO - 10.1007/s11280-018-0548-3

M3 - Article

SP - 1

EP - 17

JO - World Wide Web

JF - World Wide Web

SN - 1386-145X

ER -