Many multimedia systems stream real-time visual data continuously for a wide variety of applications. These systems can produce vast amounts of data, but few studies take advantage of the versatile and real-time data. This paper presents a novel model based on the Convolutional Neural Networks (CNNs) to handle such imbalanced and heterogeneous data and successfully identifies the semantic concepts in these multimedia systems. The proposed model can discover the semantic concepts from the data with a skewed distribution using a dynamic sampling technique. The paper also presents a system that can retrieve real-time visual data from heterogeneous cameras, and the run-time environment allows the analysis programs to process the data from thousands of cameras simultaneously. The evaluation results in comparison with several state-of-the-art methods demonstrate the ability and effectiveness of the proposed model on visual data captured by public network cameras.