1. Specifications Table

Subject area: Image processing, computer vision, machine learning, and deep learning.
More specific subject area: Feature extraction, speech recognition, and text recognition.
Type of data: Images, audio files, tables, and figures.
How data was acquired (Experimental Setup): Original videos were captured at University Institute of Engineering Technology, Kanpur, using a Canon EOS 1200D 18 MP digital SLR camera with 18–55 mm and 55–250 mm lenses in a noise-free experimental laboratory.
Data format: Videos are in .MOV format, frames are in .jpg format, audio files are in .wav format, and wave graphs are in .png format.
Experimental factors: The video samples generated for the subjects are de-noised using Neat Video.
Experimental features: Various biometric traits are extracted for every subject: frames, boundary box coordinates, the audio of the subject's entire video, the audio wave signal for the entire video length, the split audio of the text spoken by the subject, and the split audio waveforms.
Data source location: University Institute of Engineering Technology, Kanpur, India.
Data accessibility: The dataset is publicly and freely available for research and educational purposes.

2. Value of the Data

  • Human-machine intelligence and computer vision have become pervasive because of their wide range of applications, from medical informatics to face recognition and smart surveillance systems. Image classification and recognition are at the core of many of these applications. Hence, our dataset is a valuable resource for the vision and learning community.
  • Our dataset contains the videos, the frames of a subject's video, its boundary box coordinates, the audio file of the subject's entire video, the spoken text (digits 1 to 20) as audio files, and their waveforms.
  • Our dataset allows researchers to apply neural network models, machine learning algorithms, and deep neural network models in domains such as face recognition, expression recognition, and text and speech recognition to build robust recognition models.
  • The features listed in Table 1.1 are extracted for one subject as a demonstration, and the scripts that generate the same data for every other subject are also provided in our dataset.

3. Data

A high-definition video in .MOV format is captured for each subject using the experimental setup described in Table 1.1. Every subject recites the same text, the digits 1 to 20, to maintain uniformity in the data. The following data, available in multiple formats, can be used for further modeling:

3.1. Frames

We collect the frames of the video corresponding to each subject and store them in .jpg format. The script takes the path of a video and generates its frames, so it can be run on any video in the dataset. Figure 1 shows example frames out of the 1011 frames generated for the sample video DSC_0020.MOV from our dataset.

Figure 1 

Frames generated for a sample video DSC_0020.MOV.
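
The frame generation step described above can be sketched as follows. This is a minimal illustration and not the script shipped with the dataset: it assumes the third-party OpenCV package (`cv2`) is installed, and the output naming scheme (`<video>_frame_0001.jpg`) and the function names `frame_path` and `extract_frames` are hypothetical.

```python
from pathlib import Path


def frame_path(video_path, index, out_dir):
    """Build the output .jpg path for one frame (hypothetical naming scheme)."""
    stem = Path(video_path).stem  # e.g. "DSC_0020" for "DSC_0020.MOV"
    return Path(out_dir) / f"{stem}_frame_{index:04d}.jpg"


def extract_frames(video_path, out_dir):
    """Write every frame of the video to out_dir as .jpg and return the count."""
    import cv2  # assumption: opencv-python is available; imported lazily

    Path(out_dir).mkdir(parents=True, exist_ok=True)
    capture = cv2.VideoCapture(str(video_path))
    count = 0
    try:
        while True:
            ok, frame = capture.read()
            if not ok:  # end of the stream, or the video could not be read
                break
            count += 1
            cv2.imwrite(str(frame_path(video_path, count, out_dir)), frame)
    finally:
        capture.release()
    return count
```

Calling `extract_frames("DSC_0020.MOV", "frames")` would then write one .jpg file per frame into the `frames` directory and return the number of frames written.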

3.2. Boundary Coordinates

The frames from the sample video are used to find the boundary coordinates of the subject's face. We create .csv files for each frame, and each .csv file contains the coordinates of each boundary box. The fields of a boundary box are LowerLeft(X), LowerLeft(Y), UpperLeft(X), UpperLeft(Y), UpperRight(X), UpperRight(Y), LowerRight(X), and LowerRight(Y), recorded for each frame. Figure 2 shows sample frames with their boundary boxes. Table 1 shows the .csv format of the boundary box coordinates for the video DSC_0020.MOV.

Figure 2 

Boundary box for the frames generated for a sample video.

Table 1

.csv format for the boundary box coordinates of each frame for sample video DSC_0020.MOV.

| Frames | Lower Left (X) | Lower Left (Y) | Upper Left (X) | Upper Left (Y) | Upper Right (X) | Upper Right (Y) | Lower Right (X) | Lower Right (Y) |
|---|---|---|---|---|---|---|---|---|
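
As an illustration of the layout in Table 1, the sketch below converts a face box given as (x, y, width, height) into the eight corner fields and writes one row per frame. It rests on assumptions that the text does not state: image coordinates with the origin at the top-left and y increasing downward, a face detector that returns boxes in (x, y, width, height) form, and a single .csv file with one row per frame. The function names are hypothetical.

```python
import csv
import io

FIELDS = ["Frames",
          "LowerLeft(X)", "LowerLeft(Y)", "UpperLeft(X)", "UpperLeft(Y)",
          "UpperRight(X)", "UpperRight(Y)", "LowerRight(X)", "LowerRight(Y)"]


def box_to_row(frame_name, x, y, w, h):
    """Convert an (x, y, width, height) box to the four-corner layout.

    Assumes image coordinates (y grows downward), so the "lower"
    corners carry the larger y value.
    """
    left, right = x, x + w
    top, bottom = y, y + h
    return {
        "Frames": frame_name,
        "LowerLeft(X)": left, "LowerLeft(Y)": bottom,
        "UpperLeft(X)": left, "UpperLeft(Y)": top,
        "UpperRight(X)": right, "UpperRight(Y)": top,
        "LowerRight(X)": right, "LowerRight(Y)": bottom,
    }


def write_boxes(rows, handle):
    """Write the header and one row per frame to an open text handle."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Usage with a made-up box; the numbers are illustrative, not dataset values.
buffer = io.StringIO()
write_boxes([box_to_row("frame_0001.jpg", 100, 50, 200, 300)], buffer)
print(buffer.getvalue())
```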


3.3. Audio file in .wav format

In this module, we generate a complete audio file for every subject in our dataset, which researchers can use for audio recognition or audio detection models. The file for the sample video is stored in .wav format, and the script can be run to produce the same file for every subject video in our dataset.
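
One possible way to produce such a file is sketched below. The text does not name the tool used, so this example assumes the `ffmpeg` command-line program is installed and on the PATH; the function names are hypothetical. The duration helper uses only Python's standard `wave` module.

```python
import subprocess
import wave


def ffmpeg_command(video_path, wav_path):
    """Build an ffmpeg call that drops the video stream and writes 16-bit PCM."""
    return ["ffmpeg", "-y", "-i", str(video_path),
            "-vn", "-acodec", "pcm_s16le", str(wav_path)]


def extract_audio(video_path, wav_path):
    """Run ffmpeg (assumed to be installed) to create the .wav file."""
    subprocess.run(ffmpeg_command(video_path, wav_path), check=True)


def wav_duration_seconds(wav_path):
    """Return the length of a .wav file in seconds."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

For example, `extract_audio("DSC_0020.MOV", "DSC_0020.wav")` followed by `wav_duration_seconds("DSC_0020.wav")` would give the length of the extracted audio track.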

3.4. Wave Forms of Audio files

We generate the audio files in .wav format for each video and then create their waveforms. The waveforms help researchers obtain the peak amplitudes of the full recording and compare them with the peak amplitudes of the individual pieces of text recited in the video. This is an intermediate step for text and voice recognition models. The graphs for the full-length videos are available in .jpeg format. Figure 3 shows the waveform for a sample video.

Figure 3 

Waveform of a sample video DSC_0020.MOV.
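
The peak amplitude and the points of such a waveform can be computed with the standard library alone, as in the sketch below. It assumes 16-bit PCM .wav input and keeps only the first channel; the function names are hypothetical, and plotting is left out so that the example has no third-party dependencies.

```python
import array
import sys
import wave


def read_mono_samples(wav_path):
    """Return (sample_rate, samples) for a 16-bit PCM .wav, first channel only."""
    with wave.open(str(wav_path), "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())
    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":  # .wav data is little-endian on disk
        samples.byteswap()
    return rate, list(samples[::channels])  # interleaved: take every n-th sample


def peak_amplitude(samples):
    """Largest absolute sample value (0 for an empty signal)."""
    return max((abs(s) for s in samples), default=0)


def waveform_points(rate, samples):
    """(time in seconds, amplitude) pairs, i.e. the x and y values of the graph."""
    return [(i / rate, s) for i, s in enumerate(samples)]
```

The list returned by `waveform_points` can be passed to any plotting library to draw time on the x-axis against amplitude on the y-axis.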

3.5. Split text files

In this step, we generate the data corresponding to the text recited by each subject in our dataset, which consists of the digits 1 to 20. This data is important for designing learning models and architectures for text recognition, or for identifying who has spoken a particular digit and precisely which digit, out of 1 to 20, was spoken at a given point in the video. The .wav files for the sample video DSC_0020.MOV and the customized scripts are available in our dataset.
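
The text does not describe how the full-length audio is split into individual digits. One common approach, shown here purely as an assumption, is to cut the signal wherever the amplitude stays below a threshold for long enough. The function name and both parameters are hypothetical, and the values would need tuning for real recordings.

```python
def split_on_silence(samples, threshold, min_silence):
    """Return (start, end) index pairs of the loud runs in a sample list.

    A sample is "loud" when its absolute value exceeds threshold. A run
    ends once min_silence consecutive quiet samples follow it; shorter
    pauses are kept inside the run. The end index is exclusive.
    """
    segments = []
    start = None      # first index of the current loud run, if any
    last_loud = None  # most recent loud index inside the current run
    quiet = 0         # consecutive quiet samples since last_loud
    for i, sample in enumerate(samples):
        if abs(sample) > threshold:
            if start is None:
                start = i
            last_loud = i
            quiet = 0
        elif start is not None:
            quiet += 1
            if quiet >= min_silence:
                segments.append((start, last_loud + 1))
                start = None
    if start is not None:  # the signal ended while a run was still open
        segments.append((start, last_loud + 1))
    return segments


# Toy signal: two bursts separated by three quiet samples.
signal = [0, 0, 5, 6, 0, 7, 0, 0, 0, 8, 9, 0]
print(split_on_silence(signal, threshold=1, min_silence=3))  # [(2, 6), (9, 11)]
```

With a recording of the digits 1 to 20, a suitable threshold and pause length would ideally yield 20 segments, each of which could then be written to its own .wav file.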

3.6. Split text waveforms

In this step, we generate graphs for each piece of text present in the video, i.e., for each of the digits 1 to 20 produced in the previous step. To build more accurate and efficient text or speech recognition models, we need to compare the waveform of the full-length video with the split text waveforms. This comparison gives more information about the individual amplitudes of the text present in the video, and it also helps to identify any noise present in a video file. The graphs are generated for each split text file and stored in .jpeg format. The x-axis represents time and the y-axis represents amplitude. Figures 4 and 5 illustrate sample graphs for digits 1 and 2, respectively, from the video sample DSC_0020.MOV.

Figure 4 

Waveform for digit 1 recited in DSC_0020.MOV.

Figure 5 

Waveform for digit 2 recited in DSC_0020.MOV.

4. Experimental Design and Analysis of Our Dataset

To analyze the effectiveness of our video dataset, we use a convolutional neural network (CNN) model for face recognition and speech recognition. We train a CNN model on the high-resolution images stored as frames in our dataset. Table 2 shows the configuration of the CNN; its output layer is variable and depends on the task to be performed. We randomly select 70% of our dataset as the training set and use the rest to evaluate performance.

For the speech recognition model, we train the CNN on waveforms. Our dataset contains a total of 1340 waveforms, corresponding to 67 subjects each reciting the digits 1 to 20. We split these in a 70:30 ratio for training and testing. Figure 6 shows the architecture of our CNN model for speech recognition.

Training the models requires high computational power, so we use an NVIDIA Tesla K80 GPU for our evaluations. Table 3 shows the accuracy and training loss of the face recognition model on our dataset. Similarly, Table 4 shows the accuracy and training loss of the speech recognition model on our dataset. Figures 7 and 8 show the accuracy and training loss curves of the two recognition models on our dataset over 500 epochs. The x-axis denotes the number of epochs and the y-axis denotes the accuracy or training loss.
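
The random 70:30 split of the 1340 waveforms can be sketched as follows. The identifiers (subject, digit) and the fixed seed are assumptions made for illustration; the text does not say whether the split is stratified by subject or by digit, so this sketch uses a plain random shuffle.

```python
import random


def split_70_30(items, seed=0):
    """Shuffle a copy of items and return (train, test) in a 70:30 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding in the cut point.
    n_train = len(shuffled) * 70 // 100
    return shuffled[:n_train], shuffled[n_train:]


# 67 subjects x 20 digits = 1340 waveform identifiers (hypothetical naming).
waveforms = [(subject, digit)
             for subject in range(1, 68)
             for digit in range(1, 21)]
train, test = split_70_30(waveforms)
print(len(waveforms), len(train), len(test))  # 1340 938 402
```

Under this split, 938 waveforms are used for training and 402 for testing.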

Table 2

Configuration of Convolutional Neural Network.

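| Layers | Filter Size | Strides | No. of filters |
|---|---|---|---|
| Convolution Layer 1 | 5 × 5 | 1 | 32 |
| Pooling Layer 1 | 2 × 2 | 2 | |
| Convolution Layer 2a | 1 × 1 | 2 | 64 |
| Convolution Layer 2a_1 | 3 × 3 | 1 | 64 |
| Convolution Layer 2b | 3 × 1 | 1 | 64 |
| Convolution Layer 2b_1 | 1 × 3 | 1 | 64 |
| Pool 2b | 2 × 2 | 2 | |
| Convolution Layer 2c | 1 × 1 | 2 | 64 |
| Pool 2 | 2 × 2 | 2 | |
| Fully Connected Layer 1 | | | 1024 |
| Fully Connected Layer 2 | | | 1024 |

Figure 6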

Architecture of CNN for speech recognition model.
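
To make the entries of Table 2 concrete, the sketch below applies the standard formulas for the output size and parameter count of a convolution layer to the first two rows. Table 2 does not give the input resolution, the padding, or the number of input channels, so the values used here (a 224 × 224 input, no padding, 3 input channels) are assumptions chosen only for illustration.

```python
def conv_output_size(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (size + 2 * padding - kernel) // stride + 1


def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Number of weights plus biases in a convolution layer."""
    return kernel_h * kernel_w * in_channels * out_channels + out_channels


# Convolution Layer 1: 5 x 5 filter, stride 1, 32 filters (from Table 2).
# Assumed: 224 x 224 input, no padding, 3 input channels.
after_conv1 = conv_output_size(224, kernel=5, stride=1)
# Pooling Layer 1: 2 x 2 window, stride 2 (from Table 2).
after_pool1 = conv_output_size(after_conv1, kernel=2, stride=2)

print(after_conv1, after_pool1)   # 220 110
print(conv_params(5, 5, 3, 32))   # 2432
```

The same two functions can be applied to the remaining rows once the actual input size, padding, and layer connectivity are known.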

Table 3

Face Recognition Model Results on Our Dataset.

| Dataset | Training/Testing Percentage | Accuracy | Training Loss |
|---|---|---|---|
| Our Dataset | 70% and 30% | 99.14% | 0.56% |

Table 4

Speech Recognition Model Results on Our Dataset.

| Dataset | Training/Testing Percentage | Accuracy | Training Loss |
|---|---|---|---|
| Our Dataset | 70% and 30% | 96.42% | 0.67% |

Figure 7

Accuracy and training loss results graph on our dataset for face recognition.

Figure 8 

Accuracy and training loss results graph on our dataset for speech recognition.

5. Effectiveness Comparison of Our Dataset with Open Source Datasets

In this paper, we also compare the effectiveness of our dataset with other publicly available datasets. For face recognition, we use the JAFFE dataset, which consists of 213 images posed by 10 Japanese models. For speech recognition, we use the Free Spoken Digit Dataset (FSDD), an audio/speech dataset that contains recordings of digits in .wav format. The recordings are made at 8 kHz and do not contain noise. There are 1500 recordings in total from 3 subjects, each of whom recites every digit from 0 to 9 fifty times (3 × 10 × 50 = 1500). Tables 5 and 6 show the accuracy and training loss on the JAFFE and FSDD datasets for the face recognition and speech recognition models, respectively. Figures 9 and 10 show the corresponding accuracy and training loss curves on the JAFFE and FSDD datasets.

To evaluate the face and speech recognition models trained on our dataset, we test the face recognition model on the JAFFE dataset and achieve an accuracy of 93.04%, as shown in Table 7. Similarly, we test the speech recognition model on the FSDD dataset and achieve an accuracy of 90.11%, as shown in Table 8. Figure 11 shows the test accuracy versus the number of epochs for the face recognition and speech recognition models.

Table 5

Face Recognition Model Results on JAFFE Dataset.

| Dataset | Training/Testing Percentage | Accuracy | Training Loss |
|---|---|---|---|
| JAFFE Dataset | 70% and 30% | 92.1% | 0.78% |

Table 6

Speech Recognition Model Results on FSDD Dataset.

| Dataset | Training/Testing Percentage | Accuracy | Training Loss |
|---|---|---|---|
| FSDD Dataset | 70% and 30% | 89.2% | 0.81% |

Figure 9

Accuracy and training loss results graph on JAFFE dataset for face recognition.

Figure 10 

Accuracy and training loss results graph on FSDD dataset for speech recognition.

Table 7

Face Recognition test results of our trained model for JAFFE dataset.

| Training Dataset | Test Dataset | Training/Testing Percentage | Accuracy |
|---|---|---|---|
| Our Dataset | JAFFE Dataset | 70% and 30% | 93.04% |

Table 8

Speech Recognition test results of our trained model for FSDD dataset.

| Training Dataset | Test Dataset | Training/Testing Percentage | Accuracy |
|---|---|---|---|
| Our Dataset | FSDD Dataset | 70% and 30% | 90.11% |

Figure 11

Test accuracy of face and speech recognition model trained on our dataset.

6. Conclusion

This paper presents a video dataset of 67 subjects in which all subjects recite the same text, i.e., the digits 1 to 20. We present ways to extract useful information such as video frames in .jpeg format, the full-length audio of each video in .wav format, the audio of the spoken digits (1–20) in .wav format, and the waveforms of both the full-length audio and the spoken digits. To show the effectiveness of the data, we train CNN models for face recognition and speech recognition and test their accuracy. The results show that the models achieve higher accuracy on our dataset than on two publicly available datasets: JAFFE for face recognition and FSDD for speech recognition. This is a comprehensive video dataset and, to the best of our knowledge, there is no publicly available dataset that provides all the data related to a video that we present in this paper.

The dataset has a few limitations. First, the spoken text is limited to the digits 1 to 20; more text can be incorporated in future work. Second, the videos contain only two classes of emotion: happy and neutral. More emotions, such as anger, disgust, fear, and sadness, may be added in the future.

Data Accessibility Statement

Anand Handa, Dr. Rashi Agarwal, and Prof. Narendra Kohli. A comprehensive video dataset for Multi-Modal Recognition Systems [Data set]. Zenodo. http://doi.org/10.5281/zenodo.1492227