Deep learning, particularly Convolutional Neural Networks (CNNs), has significantly improved visual data processing. In recent years, video classification has attracted considerable attention in the multimedia and deep learning communities. It is among the most challenging tasks because both visual and temporal information must be processed effectively. Existing techniques either disregard the temporal information between video frames or build very complex, computationally expensive models to integrate the spatiotemporal data. In addition, most deep learning techniques do not automatically account for the data imbalance problem. This paper presents an effective deep learning framework for imbalanced video classification that exploits both spatial and temporal information. The framework combines spatiotemporal synthetic oversampling to handle data with a skewed distribution, a pre-trained CNN model for spatial feature extraction over frame sequences, and a residual bidirectional Long Short-Term Memory (LSTM) network to capture temporal dependencies in video data. Experimental results on two imbalanced video datasets demonstrate the superiority of the proposed framework over state-of-the-art approaches.
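The spatiotemporal oversampling component can be illustrated with a minimal SMOTE-style sketch over per-frame feature sequences. This is a NumPy-only sketch under assumptions: the function name `oversample_sequences`, the nearest-neighbour selection, and the linear interpolation scheme are illustrative, not the authors' exact method.

```python
import numpy as np

def oversample_sequences(minority, n_new, rng=None):
    """Synthesize spatiotemporal samples by interpolating each minority-class
    feature sequence toward its nearest neighbour (SMOTE-style sketch).

    minority : array of shape (n_samples, n_frames, n_features)
    n_new    : number of synthetic sequences to generate
    """
    rng = np.random.default_rng(rng)
    n = len(minority)
    flat = minority.reshape(n, -1)  # flatten (frames x features) for distances
    # pairwise Euclidean distances; mask the diagonal so a sample
    # is never its own nearest neighbour
    d = np.linalg.norm(flat[:, None] - flat[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)  # nearest-neighbour index per sample

    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)
        lam = rng.random()  # interpolation factor in [0, 1)
        # new sequence lies on the segment between a sample and its neighbour
        synthetic.append(minority[i] + lam * (minority[nn[i]] - minority[i]))
    return np.stack(synthetic)
```

In a pipeline like the one described, the synthetic sequences would be appended to the minority class before the residual bidirectional LSTM is trained, so the classifier sees a roughly balanced class distribution.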