Deep learning has led to many breakthroughs in machine perception and data mining. Although there are many substantial advances of deep learning in the applications of image recognition and natural language processing, very few work has been done in video analysis and semantic event detection. Very deep inception and residual networks have yielded promising results in the 2014 and 2015 ILSVRC challenges, respectively. Now the question is whether these architectures are applicable to and computationally reasonable in a variety of multimedia datasets. To answer this question, an efficient and lightweight deep convolutional network is proposed in this paper. This network is carefully designed to decrease the depth and width of the state-of-the-art networks while maintaining the high-performance. The proposed deep network includes the traditional convolutional architecture in conjunction with residual connections and very light inception modules. Experimental results demonstrate that the proposed network not only accelerates the training procedure, but also improves the performance in different multimedia classification tasks.