STANDARDIZATION OF SPEECH CORPUS

A speech corpus is the basis for analyzing the characteristics of speech signals and developing speech synthesis and recognition systems. In China, almost all speech research and development institutions are developing their own speech corpora. With so many Chinese speech corpora of different kinds, it is important to be able to share them conveniently, both to avoid wasting time and money and to make research work more efficient. The primary goal of this research is to find a standard scheme that allows a corpus to be established more efficiently and to be used or shared more easily. RASC863 (a Regional Accent Speech Corpus funded by the National 863 Project), a large speech corpus of Chinese spoken with 10 regional accents, is used as an example to illustrate the standardization of speech corpus production.


INTRODUCTION
A speech corpus, the collection of speech signals together with their annotations, metadata, and documents, is the basis for both analyzing the characteristics of speech signals and developing speech synthesis and recognition systems.
Speech corpus-based technology is widely used in people's lives, although it is still an unfamiliar concept for many. An example is the automatic broadcasting system for traffic information. In this kind of system, the sound is not pronounced by actual speakers but synthesized by a TTS (text-to-speech) system based on a speech corpus.
A speech corpus is very important not only for TTS technology but also for ASR (Automatic Speech Recognition) and phonetic research. For phonetic research, a speech corpus can provide diverse and accurate data to help researchers find the rules of languages. For ASR, a large speech corpus is necessary in order to "train" the system to "understand" the voice of any speaker. Taking advantage of the statistical data of a speech corpus, the ASR system can transform speech signals into text strings by using phonological, linguistic, and stochastic analysis. That is why ASR can "understand" the human voice (Yin, 2006).
With the development of speech corpus technology, a new problem has appeared: on the one hand, many corpora have been established, and much money and time have been invested in them; on the other hand, these corpora are difficult to share among different institutions. The main reason for this problem is the lack of general specifications for corpus collection, annotation, and distribution. To solve this problem, standardization research on speech corpora is necessary, and specifications should be stipulated.

Data Science Journal, Volume 6, Supplement, 18 November 2007

STANDARDIZATION RESEARCH OF SPEECH CORPUS
Standardization of speech corpus includes many aspects as described below.

Legal considerations
Speech corpora and their production must abide by the laws of the nation. The following legal documents should be prepared: a property rights statement for the corpora, an agreement with the speakers, an agreement with the users, etc.

Standardization of speech corpus collection procedures
Although speech corpus collection is only a procedure, it determines the quality of the corpus and the efficiency of its production. Therefore, the production procedure of a speech corpus should be standardized, as the ISO system does for industry. Figure 1 shows the general procedure for producing a speech corpus. It is not necessary to follow all of these steps. Some of them, such as collecting and annotating, can be carried out simultaneously, and some can be skipped in a specific task; the producer can also introduce additional steps (Li et al., 2006).
1) Project analysis and design: analyze the speech corpus project and draft its blueprint. The specifications of the corpus are set at this stage: corpus size, number of speakers, speech style, recording equipment, etc.
2) Preparing for collection: prepare for the collection according to the blueprint: design the input prompts, prepare hardware and software, raise money and organize staff, find speakers, etc.
3) Pre-collecting: if the speech corpus is very large and complicated, pre-collecting a few samples is absolutely necessary. It helps to find problems and improve the plan, thus avoiding possible mistakes in the formal collection.
4) Pre-validation: evaluating the pre-collected corpus and improving the blueprint.
7) Compiling lexical dictionaries.
8) Post-validation: evaluating the speech corpus and examining whether or not it has met the criteria. This step is used to accept or reject the corpus.
9) Distribution: distributing the speech corpus that has passed the post-validation.

Standardization of the speech corpus
Not only the procedure for producing a corpus but also the corpus itself should be standardized. Table 1 shows the major specifications in producing a speech corpus (Li & Zu, 2006).

| Specifications | Comments |
|---|---|
| Specification of speakers | Describing the speaker's features such as age, gender, educational background, voice quality, language and accent. |
| Specification of corpus design | Describing the corpus organization and contents. For instance, the detailed information or script (prompt) organization of reading and spontaneous speech, dialogues or monologues, elicited spontaneous speech (answering questions, etc.), expressive speech. Introduction to the phonetic or linguistic coverage and the algorithm used for selecting the corpus scripts. |
| Specification of recording | Describing the recording technical specifications for recording equipment, environmental conditions, recording platform and data storage strategy, such as sampling rate, speech wave format, … |
| Specification of annotation | Describing the annotation conventions of sound-to-characters transcription, phonetic annotation or other information such as syntactic annotation. |
| Validation criteria | Setting explicit criteria that the corpus should fulfill. Giving an overview of the features to be checked and the criteria employed to accept or reject the corpus. |
| Specification of distribution | Describing the distribution plan, principles and the storage medium. |

Table 1. Specifications of speakers and corpus collection
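
To illustrate how a recording specification and explicit validation criteria can work together, the following minimal Python sketch checks a WAV file against a recording specification and reports any violations. The names `RecordingSpec` and `validate_wav` are hypothetical, and the target values (16 kHz, 16-bit, mono) are assumptions chosen for illustration; they are not the RASC863 settings.

```python
import wave
from dataclasses import dataclass


@dataclass(frozen=True)
class RecordingSpec:
    """Hypothetical recording specification; the values are illustrative only."""
    sample_rate_hz: int = 16000
    sample_width_bytes: int = 2  # 16-bit samples
    channels: int = 1            # mono


def validate_wav(path, spec=RecordingSpec()):
    """Return a list of violations; an empty list means the file is accepted."""
    problems = []
    with wave.open(str(path), "rb") as wav:
        if wav.getframerate() != spec.sample_rate_hz:
            problems.append(
                f"sampling rate is {wav.getframerate()} Hz, "
                f"expected {spec.sample_rate_hz} Hz"
            )
        if wav.getsampwidth() != spec.sample_width_bytes:
            problems.append(
                f"sample width is {wav.getsampwidth()} bytes, "
                f"expected {spec.sample_width_bytes} bytes"
            )
        if wav.getnchannels() != spec.channels:
            problems.append(
                f"channel count is {wav.getnchannels()}, expected {spec.channels}"
            )
        if wav.getnframes() == 0:
            problems.append("file contains no audio frames")
    return problems
```

A file would be accepted when `validate_wav(path)` returns an empty list and rejected otherwise. A real validation would also cover the other features listed in the validation criteria, such as annotation and documentation.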

DETAILED SPECIFICATIONS EXEMPLIFIED BY RASC863
In this section, RASC863, a Regional Accented Speech Corpus funded by the National 863 Project, will be used to illustrate the above-mentioned standardization.
There are 10 dialect families in China: Guan (Mandarin), Jin, Wu, Hui, Xiang, Gan, Kejia (Hakka), Yue (Cantonese), Min, and Ping. It is well known that Chinese dialects differ greatly from each other and are not mutually intelligible. Thus, it is quite natural that Putonghua (Standard Chinese, hereafter SC), which is phonetically based on Beijing Mandarin, has been chosen as the communicative spoken language among people from different dialectal regions. However, people with different dialectal backgrounds typically speak SC with a certain degree of accent because of the influence of their mother tongue dialect. This kind of influence can be phonetic, lexical, and/or syntactical.
In recent years, with the development of ASR techniques, accented spontaneous speech corpora have come to be urgently needed in the field of speech technology as well as in the phonetic sciences.
Funded by the National 863 High-Tech Project, we collected a speech corpus with 10 representative regional accents: Chongqing, Shanghai, Guangzhou, Xiamen, Taiyuan, Changsha, Nanchang, Wenzhou, Luoyang, and Nanjing. However, only the data for the first four regions have been distributed by ChineseLDC (http://www.ChineseLdc.org). Therefore, the following introduction focuses on only these four regions.
The corpus consists of spontaneous speech, read speech, and selected dialectal words. For spontaneous speech, each speaker was asked to select a topic of their own or to use one from our prepared topic sheet of 160 varied topics and then to give a 4-5 minute spontaneous speech on this topic. Each speaker was also asked to answer 15 selected questions spontaneously. The read speech consisted of 2200 phonetically balanced sentences and 460 frequently used sentences in the daily-life domain. For each dialectal region, we prepared words or phrases that are frequently used in daily life and that differ from Standard Chinese, and each speaker was asked to read 15 dialectal words. A total of 800 speakers (200 from each region, balanced in terms of age, gender, and educational background) were recruited for the project.
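
The figures above imply rough totals for the four distributed regions. The short Python calculation below only multiplies the numbers stated in the text; it assumes that every speaker completed every task, which the text does not confirm, and the variable names are illustrative.

```python
# Figures taken from the corpus description above.
REGIONS = 4
SPEAKERS_PER_REGION = 200
SPEECH_MINUTES_MIN, SPEECH_MINUTES_MAX = 4, 5  # spontaneous speech per speaker
QUESTIONS_PER_SPEAKER = 15
DIALECT_WORDS_PER_SPEAKER = 15

total_speakers = REGIONS * SPEAKERS_PER_REGION

# Assumption: every speaker delivered one spontaneous speech of 4-5 minutes.
spontaneous_hours_min = total_speakers * SPEECH_MINUTES_MIN / 60
spontaneous_hours_max = total_speakers * SPEECH_MINUTES_MAX / 60

total_answers = total_speakers * QUESTIONS_PER_SPEAKER
total_dialect_words = total_speakers * DIALECT_WORDS_PER_SPEAKER

print(f"speakers: {total_speakers}")
print(f"spontaneous speech: {spontaneous_hours_min:.1f}-{spontaneous_hours_max:.1f} hours")
print(f"answers to questions: {total_answers}")
print(f"dialectal word readings: {total_dialect_words}")
```

Under these assumptions, the spontaneous monologues alone amount to roughly 53 to 67 hours of speech, in addition to 12,000 spoken answers and 12,000 dialectal word readings.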
The detailed specifications of RASC863 are as follows:

1) Specification of speakers
The specification of speakers describes the number of speakers to be recorded for each dialectal accent and their characterization. Sometimes it also describes the speaking styles. Speaker characterization concerns the distribution of age, education level, and gender, and the intended dialectal coverage. The speaking styles can be read speech, answering speech, command/control speech, descriptive speech, non-prompted speech, spontaneous speech, neutral vs. emotional speech, and dialogue. The content of the speech can be described in different ways: by task, by topic, or simply by a text description.
Table 2 illustrates the distribution of speakers' ages, genders, accent degrees, and educational backgrounds for each region. In the accent category, L1-L3 stand for the three major accent degree levels from better to worse, and A and B stand for two sub-levels within each level.

2) Specification of corpus design
The aim of speech corpus design is to determine what is to be recorded and to obtain the necessary script. Whether a corpus needs a designated script before collection is determined by the corpus type and corpus content (Li et al., 2004). The RASC863 prompt sheet for each speaker is shown in Table 3.

5) Annotation Specification
Speech corpus annotation includes speech-to-characters transcription, segmental annotation, and prosodic annotation.Specification of annotation describes the annotation format, rules, tools, and consistency criteria.
If more than one transcriber is transcribing or annotating at the same time, the consistency of their annotations should be checked first. In the RASC863 project, Chinese character transcription as well as paralinguistic and non-linguistic labeling were made for the spontaneous part. Additionally, phonetic annotation was made for the read speech of 80 speakers, 20 from each dialectal region. The speech software Praat was employed for phonetic annotation. The C-ToBI3.0 and SAMPA-C annotation systems were used for prosodic annotation and segmental annotation (Li & Zu, 2006).
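
One simple way to check consistency between two transcribers is to compare their character transcriptions of the same utterance. The sketch below uses a character-level edit distance as the disagreement measure. This measure, the function names, and the 5% threshold are assumptions made for illustration; the consistency criterion actually used in RASC863 is not stated here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two character strings."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def disagreement_rate(transcript_1: str, transcript_2: str) -> float:
    """Edit distance normalized by the length of the longer transcript."""
    longest = max(len(transcript_1), len(transcript_2))
    if longest == 0:
        return 0.0
    return edit_distance(transcript_1, transcript_2) / longest


def is_consistent(transcript_1: str, transcript_2: str, max_rate: float = 0.05) -> bool:
    """Hypothetical acceptance rule: disagreement must not exceed max_rate."""
    return disagreement_rate(transcript_1, transcript_2) <= max_rate
```

Utterances for which `is_consistent` returns `False` could be sent back to the transcribers for discussion before the full annotation starts.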

6) Legal agreement
The agreement between producer and speaker, often called the speaker agreement, which specifies the usage of the recorded speech data and even of some of the speaker's information, is very important. Other aspects, such as whether the speech data can be distributed or copied without limit, should also be described in the agreement.
Before recording, every speaker should sign the agreement.

7) Validation and distribution specification
Corpus validation criterion is the final validation after the pre-validation and the finishing of the whole corpus.

//Root
DATA: speech data // subdirectories may be added such as Male/Female, Recording session, Speech types (read, spontaneous…), …
ANNOT: annotation data
META: metadata about corpus itself
Specs: specifications of corpus
Prom: prompt files
DOC: documents
LEX: lexicon or its statistic files
TOOLS: recording, analysis or annotation tools

Because of the importance of speech corpora in China, corpora production has received long term support from various national funds such as the 863 Hi-tech Project and 973 Development Program of China and the National Science Foundation of China. Many speech research and development affiliations have developed their own speech corpora in recent years.

Fig. 1: Flow chart of the production of corpus

Table 2. Speakers' distribution for each region

Table 4. Metadata for each sound file

Table 5. A typical corpus structure
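
The typical corpus structure (a root directory containing DATA, ANNOT, META, Specs, Prom, DOC, LEX, and TOOLS) can be created programmatically so that every corpus starts from the same layout. The Python sketch below is one possible way to do this. It assumes that all eight directories sit directly under the root, which the flattened listing does not make fully clear, and the function name `create_corpus_skeleton` and the example subdirectory names are hypothetical.

```python
from pathlib import Path

# Directory names and descriptions follow the typical corpus structure.
# Assumption: all directories are siblings directly under the corpus root.
TOP_LEVEL_DIRS = {
    "DATA": "speech data",
    "ANNOT": "annotation data",
    "META": "metadata about the corpus itself",
    "Specs": "specifications of the corpus",
    "Prom": "prompt files",
    "DOC": "documents",
    "LEX": "lexicon or its statistic files",
    "TOOLS": "recording, analysis or annotation tools",
}


def create_corpus_skeleton(root, data_subdirs=()):
    """Create the directory layout and return the top-level directory names."""
    root = Path(root)
    for name in TOP_LEVEL_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    # Optional subdirectories under DATA, e.g. by gender or speech type.
    for sub in data_subdirs:
        (root / "DATA" / sub).mkdir(parents=True, exist_ok=True)
    return sorted(p.name for p in root.iterdir() if p.is_dir())
```

For example, `create_corpus_skeleton("my_corpus", ["Male/read", "Female/spontaneous"])` would create the eight top-level directories plus two illustrative subdirectories under DATA.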