A Comprehensive Video Dataset for Multi-Modal Recognition Systems

This paper presents a comprehensive, high-definition, fully labelled video dataset. The dataset consists of videos of 67 different subjects. Every video contains the same text: the digits 1 to 20, recited by each subject under the same experimental setup. The dataset can serve researchers and analysts as a resource for training deep neural networks to build efficient and accurate recognition models in various domains of computer vision, such as face recognition, expression recognition, speech recognition, and text recognition. We also train face recognition and speech recognition models on our dataset and compare the results with publicly available datasets to show the effectiveness of our dataset. The experimental results show that the models achieve higher accuracy on our dataset than on the other datasets on which they are tested.


Specifications

Data accessibility
The dataset is publicly and freely available for research and educational purposes.

Value of the Data
• Today, Human-Machine Intelligence (Lai, 2012) and computer vision (Frizzell et al., 2018) have become pervasive because of their variety of applications, ranging from medical informatics (Goodfellow et al., 2016) to face recognition (Sun, Wu, and Hoi, 2018) and the building of smart surveillance systems (Memos et al., 2018). At the core of many such applications is image classification and recognition (Simonyan and Zisserman, 2014). Hence, our dataset (Handa, Agarwal, and Kohli, 2018) is a valuable resource for the vision and learning community.
• Our dataset contains the videos, the frames of a subject's video, the boundary coordinates of the face, an audio file of the entire video, the recited text (digits 1 to 20) as audio files, and the corresponding waveforms.
• Our dataset allows researchers to apply various neural network models, machine learning algorithms, and deep neural network models in domains such as face recognition, expression recognition, and text and speech recognition (Shekar Naganna et al., 2018) to build robust recognition models.
• The features listed in Table 1.1 are extracted for one particular subject as a demonstration, and the scripts with which similar data can be generated for every subject are also provided in our dataset.

Data
A high-definition video in .MOV format is captured for each subject using the experimental setup described in Table 1.1. Every subject recites the same text, the digits 1 to 20, to maintain uniformity in the data. The following data values are available in multiple formats and can be used for further modeling:

Boundary Coordinates
The frames from the sample video are used to find the boundary coordinates of the face of a subject in our dataset. We create .csv files for the frames, and each .csv file contains the coordinates of each boundary box. The fields of a boundary box are LowerLeft(X), LowerLeft(Y), UpperLeft(X), UpperLeft(Y), UpperRight(X), UpperRight(Y), LowerRight(X), and LowerRight(Y), recorded for each frame. Figure 2 shows some sample frames with boundary boxes. Table 1 shows the sample .csv format for the video DSC_0020.MOV, containing the boundary box coordinates.
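As an illustration of how the boundary box fields listed above might be consumed, the following is a minimal sketch that parses one row and derives the width and height of the box. The exact header spelling in the released .csv files is an assumption, the helper names `read_boxes` and `box_size` are hypothetical, and the coordinate values are made up for the example.

```python
import csv
import io

# Column names follow the field list in the text; the exact header spelling
# in the released .csv files is an assumption.
FIELDS = [
    "LowerLeft(X)", "LowerLeft(Y)",
    "UpperLeft(X)", "UpperLeft(Y)",
    "UpperRight(X)", "UpperRight(Y)",
    "LowerRight(X)", "LowerRight(Y)",
]


def read_boxes(csv_file):
    """Return one dict of float coordinates per row of a boundary-box CSV."""
    reader = csv.DictReader(csv_file)
    return [{name: float(row[name]) for name in FIELDS} for row in reader]


def box_size(box):
    """Width and height of an axis-aligned box from its corner coordinates."""
    width = abs(box["LowerRight(X)"] - box["LowerLeft(X)"])
    height = abs(box["UpperLeft(Y)"] - box["LowerLeft(Y)"])
    return width, height


# Made-up values for illustration only; they are not taken from the dataset.
sample = io.StringIO(
    ",".join(FIELDS) + "\n"
    "100,400,100,150,350,150,350,400\n"
)
boxes = read_boxes(sample)
print(box_size(boxes[0]))
```

For a real file, `read_boxes(open(path, newline=""))` would replace the in-memory sample, provided the header matches the assumed field names.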

Audio file in .wav format
In this module, we generate a complete audio file for every subject in our dataset, which researchers can use for audio recognition models or for other audio-detection learning models. The files are stored in .wav format for a sample video, and the script can be customized to generate the same for every subject video in our dataset.
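One common way to obtain such a .wav file from a .MOV video is the ffmpeg command-line tool. The sketch below is not the script shipped with the dataset: the use of ffmpeg, the 44.1 kHz sample rate, the mono downmix, and the helper names are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path


def build_ffmpeg_command(video_path, wav_path, sample_rate=44100):
    """Build an ffmpeg call that drops the video stream and writes 16-bit PCM."""
    return [
        "ffmpeg", "-y",            # overwrite the output file if it exists
        "-i", str(video_path),     # input video
        "-vn",                     # ignore the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM, i.e. plain .wav
        "-ar", str(sample_rate),   # sample rate (assumed value)
        "-ac", "1",                # mono (assumed)
        str(wav_path),
    ]


def extract_audio(video_path, out_dir="."):
    """Run ffmpeg on one video; requires ffmpeg to be installed and on PATH."""
    wav_path = Path(out_dir) / (Path(video_path).stem + ".wav")
    subprocess.run(build_ffmpeg_command(video_path, wav_path), check=True)
    return wav_path


print(" ".join(build_ffmpeg_command("DSC_0020.MOV", "DSC_0020.wav")))
```

Calling `extract_audio` in a loop over the subject videos would produce one audio file per subject.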

Wave Forms of Audio files
The audio files in the .wav format are generated for each video, and then we create the waveforms. This is done to help researchers in getting the peak amplitudes so that they can compare these amplitudes with the individual peak amplitudes of the text that is recited in the video. This is an intermediate step for text and voice recognition models. The graphs corresponding to the full-length videos are available in a .jpeg format. Figure 3 shows a sample waveform for a sample video.
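The peak-amplitude comparison described above can be sketched with the Python standard library alone. The example assumes 16-bit PCM .wav files and uses synthetic tones in place of real recordings; `peak_amplitude` and `synthetic_tone` are hypothetical helper names, not part of the dataset scripts.

```python
import io
import math
import struct
import wave


def peak_amplitude(wav_file):
    """Largest absolute sample value in a 16-bit PCM .wav file."""
    with wave.open(wav_file, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        raw = wav.readframes(wav.getnframes())
    samples = struct.unpack("<%dh" % (len(raw) // 2), raw)
    return max(abs(s) for s in samples) if samples else 0


def synthetic_tone(amplitude, rate=8000, seconds=0.1, freq=440.0):
    """In-memory mono .wav holding a sine tone; stands in for a real recording."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        frames = b"".join(
            struct.pack("<h", int(amplitude * math.sin(2 * math.pi * freq * i / rate)))
            for i in range(int(rate * seconds))
        )
        wav.writeframes(frames)
    buf.seek(0)
    return buf


full_peak = peak_amplitude(synthetic_tone(12000))   # stands in for a full-length file
digit_peak = peak_amplitude(synthetic_tone(9000))   # stands in for one spoken digit
print(full_peak, digit_peak, digit_peak <= full_peak)
```

With real data, `peak_amplitude` accepts a file path in place of the in-memory buffer.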

Split text files
In this step, we generate the data corresponding to the text recited by each subject in our dataset, i.e., the digits 1 to 20. This data is important for designing learning models and architectures for text recognition, and for identifying who has spoken a particular digit and precisely which digit, out of 1 to 20, is spoken at a given point in the video. These files, in .wav format for the sample video DSC_0020.MOV, and the customized scripts are available in our dataset.
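The text does not state how the full-length audio is cut into individual digits. One simple possibility, shown here purely as an assumption, is energy-based segmentation: frames whose mean absolute amplitude exceeds a threshold are treated as speech, and consecutive speech frames are merged into one segment. The function name, frame length, and threshold are illustrative choices, and the signal is a toy example.

```python
def find_segments(samples, frame_len=400, threshold=500, min_frames=2):
    """Return (start, end) sample indices of runs of frames louder than threshold."""
    segments = []
    start = None
    n_frames = len(samples) // frame_len
    for f in range(n_frames):
        frame = samples[f * frame_len:(f + 1) * frame_len]
        loud = sum(abs(s) for s in frame) / frame_len > threshold
        if loud and start is None:
            start = f                      # a loud run begins
        elif not loud and start is not None:
            if f - start >= min_frames:    # keep only runs that are long enough
                segments.append((start * frame_len, f * frame_len))
            start = None
    if start is not None and n_frames - start >= min_frames:
        segments.append((start * frame_len, n_frames * frame_len))
    return segments


# Toy signal: silence, burst, silence, burst (values are made up).
silence = [0] * 1600
burst = [3000, -3000] * 800
signal = silence + burst + silence + burst
print(find_segments(signal))
```

Each returned index pair could then be written out as its own .wav file, one per spoken digit.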

Split text waveforms
In this step, we generate graphs for each piece of text present in the video, i.e., for the digits 1 to 20 obtained in the previous step. To build more accurate and efficient text or speech recognition models, the waveform of the full-length video needs to be compared with the split text waveforms. This comparison gives more information about the individual amplitudes of the text present in the video and also helps to identify noise, if present, in any video file. The graphs are generated for each split text file and stored in .jpeg format. The X-axis represents time and the Y-axis represents amplitude. Figures 4 and 5 illustrate sample graphs for digits 1 and 2, respectively, from the video sample DSC_0020.MOV.

Experimental Design and Analysis of Our Dataset
To analyze the effectiveness of our video dataset, we use a convolutional neural network (CNN) (Krizhevsky, Sutskever, and Hinton 2012) model for face recognition (Acharya et al. 2018) and speech recognition. We train a CNN model using the high-resolution images present as frames in our dataset. Table 2 shows the configuration of the CNN, whose output layer is variable and depends on the task to be performed. We randomly select 70% of our dataset as the training set and use the rest of the dataset to evaluate performance. For the speech recognition model, we train the CNN using the waveforms.
There are a total of 1340 waveforms in our dataset, corresponding to the digits 1 to 20 recited by 67 different subjects. We split the dataset in a ratio of 70:30 for training and testing our model. Figure 6 shows the architecture of our CNN model for speech recognition. Training the model requires high computational power. Hence, we use an NVidia Tesla K80 GPU for our evaluations. Table 3 shows the accuracy and training loss for the face recognition model on our dataset. Similarly, Table 4 shows the accuracy and training loss for the speech recognition model on our dataset. Figure 7 and Figure 8 show the accuracy and training loss graphs for both recognition models on our dataset over 500 epochs. The x-axis and the y-axis denote the number of epochs and the accuracy/training loss, respectively.
Table 1: .csv format for the boundary box coordinates of each frame for sample video DSC_0020.MOV.
Figure 6: Architecture of CNN for speech recognition model (Zhao et al. 2017).
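The waveform count and the 70:30 split described above can be checked with a short sketch. The (subject, digit) pairs stand in for the actual waveform files, and the fixed seed and the helper name `split_70_30` are assumptions made so that the example is reproducible; the text only states that the split is random.

```python
import random


def split_70_30(items, seed=0):
    """Shuffle a copy of the items and cut it into 70% training, 30% testing."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(0.7 * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# One placeholder entry per waveform: 67 subjects x 20 digits.
waveforms = [(subject, digit) for subject in range(1, 68) for digit in range(1, 21)]
train_set, held_out = split_70_30(waveforms)
print(len(waveforms), len(train_set), len(held_out))
```

With 1340 waveforms, this yields 938 training items and 402 test items.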

Effectiveness Comparison of Our Dataset with Open Source Datasets
In this paper, we also compare the effectiveness of our dataset with other publicly available datasets. For the face recognition comparison, we use the JAFFE dataset (Lyons et al., 1998), which consists of 213 images posed by 10 Japanese models. Similarly, for speech recognition, we use the Free Spoken Digit Dataset (FSDD) (Jackson et al., 2018), an audio/speech dataset that contains recordings of digits in .wav format. The recordings are made at 8 kHz and do not contain noise. The total number of recordings is 1500, corresponding to 3 subjects speaking the digits 0 to 9, with each digit recited 50 times by each subject. Table 5 and Table 6 show the accuracy and training loss on the JAFFE and FSDD datasets for the face recognition and speech recognition models, respectively. Figures 9 and 10 show the corresponding accuracy and training loss graphs on the JAFFE and FSDD datasets. To evaluate the performance of the face and speech recognition models trained on our dataset, we test the face recognition model on the JAFFE dataset and achieve an accuracy of 93.04%, as shown in Table 7. Similarly, we test the speech recognition model on the FSDD dataset and achieve an accuracy of 90.11%, as shown in Table 8. Figure 11 shows the graphs of test accuracy versus the number of epochs for the face recognition and speech recognition models.

Conclusion
This paper presents a video dataset of 67 subjects in which all subjects recite the same text, i.e., the digits 1 to 20. We present ways to extract useful information such as video frames in .jpeg format, the full-length audio of each video in .wav format, the spoken digits (1-20) as audio in .wav format, and the waveforms for the full-length video and for the spoken digits. To show the effectiveness of the data, we train CNN models for face recognition and speech recognition and test their accuracy. The results show higher accuracy on our dataset than on two publicly available datasets: JAFFE, used to show the effectiveness for face recognition, and FSDD, used to show the effectiveness for speech recognition. This is a comprehensive video dataset, and to the best of our knowledge there is no publicly available dataset that provides all the data values related to a video that we present in this paper.
A few limitations remain. First, the video dataset contains text information only for the digits 1 to 20; more text can be incorporated in future work. Second, the videos contain only two classes of emotion, happy and neutral; more emotions such as anger, disgust, fear, and sadness may be added in future.

Ethics and Consent
All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional committee. A copy of consent from the institutional academic council of UIET, CSJM University, Kanpur is attached for reference.