THE ELECTRONIC DATA AND RETRIEVAL OF THE SECRET HISTORY OF THE MONGOLS

This paper discusses the principles of electronic data and retrieval methods for the Secret History of the Mongols, a great classical historical work written in the 13th century and preserved in Chinese characters transliterated from Mongol. This handwritten work contains rich textual information, all of which should be captured in an electronic database. The original book carries several types of information, including layouts, volumes, chapters, characters, interlinear translation, segments, and Chinese translation; each has been analyzed in detail and delimited separately with markers. On the basis of this analysis, our project builds a complete electronic retrieval system for the book, which can restore the original shape of the archaic handwritten form, in which three lines represent one unit of content. The retrieval methods of the system are also designed according to the original text formats: concordance, which prints out retrieved objects with their contexts; retrieval with statistical data; and free browsing.


OUTLINES
The Secret History of the Mongols (SHM) is a great classical historical document that records the origins of the Mongol people and the conquests of Genghis Khan and his army; Genghis Khan was the conqueror of Central Asia and southern Russia and the founder and pioneer of the Yuan Dynasty. The original SHM was written in the ancient Uighur script in the middle of the 13th century. However, the original Uighur document has been lost. Today only a version of SHM in Chinese-transliterated characters survives, which was preserved to serve the Ming Dynasty's political, military, and diplomatic purposes. The Chinese SHM bears the name 忙豁仑•纽察•脱卜察安 (Monqolun Nihuča Tobčiyan), namely 'The Secret History of the Mongols'.
The SHM is the most important book about the society, politics, warfare, and social conventions of Mongolian history at that time. It is also a literary work, which portrays many typical characters of the grassland peoples in poetic texts. The SHM wins universal praise because it is a folk epic of the Mongolian nation. Since the Qing Dynasty, the SHM has been a research focus in academic fields. The study of its contents includes the origins and development of its versions, historical events, places and their people, textual research on languages and lexemes, transliteration, and translation. According to statistics, the research achievements on SHM worldwide add up to hundreds and thousands, and translations have been published in English, French, German, Japanese, Chinese, Hungarian, Russian, Polish, Czech, Turkish, Spanish, and other languages. The decision of UNESCO (United Nations Educational, Scientific, and Cultural Organization), made at the celebration of the 750th anniversary of the Secret History of the Mongols, points out that the SHM holds a lofty status in world cultural history and that it could be considered not only as the remarkable masterpiece of Mongolian literature but also as an outstanding literary monument of world significance. The SHM has by now become a field of academic research worldwide: SHM-ology.
Data Science Journal, Volume 6, Supplement, 9 July 2007, S393
In today's information society, the SHM, as a cultural heritage and an eternal classic, needs further research and understanding. Many puzzles, such as who wrote the book, remain to be resolved in order to clarify the record. In this paper we describe the making of a full electronic version of SHM, which can help SHM experts perform in-depth research on the original book in Chinese characters. The electronic version is built on the basis of SibuCongkan (四部丛刊), a classical Chinese collection.

THE TEXT FEATURES OF SHM AND THE PRINCIPLES OF ELECTRONIZATION
The handwritten SHM is rich in both content and format. Its contents involve history, geography, religion, military affairs, ethnic groups, and social life, and its format relates to versions, languages, grammatical phenomena and vocabulary, Chinese translation, transliteration, orthography, and so on. The most important principle for creating an electronic version is to keep all the information of the original book, including layouts, volumes, chapters, pages, characters, segmentation, interlinear translation, and Chinese translation.

Information about layouts:
The so-called layout refers to the original shape of the archaic handwritten form, in which one unit of content is represented by three lines of characters; it is a particularly complicated text written in vertical lines (Figure 1). The middle line holds the Chinese characters transliterated from Mongol. The right (first) line holds the side-for-side Chinese translation and grammatical annotation. The left (third) line holds initials, which represent the initial consonants in the pronunciation of the aligned characters. In addition, the small characters that follow the Chinese-transliterated characters within the middle line are endings, which represent the pronunciation of final sounds.
For example, "成吉思 中 合罕" is a string of transliterated characters; the interlinear "太祖" (the first founder of a dynasty) is a side-for-side translation of "成吉思" (Genghis), and "皇帝" (emperor) is the interlinear translation of "合罕". The interlinear character " 中 " annotates the pronunciation of the initial consonant of the character "合". The following chart changes the original vertical lines into horizontal lines.
Let us have a look at the original handwritten format with a photograph of one page of the book from volume 5.

Information about characters:
In order to transliterate Mongol with Chinese characters accurately, the transliterators used auxiliary symbols to represent Mongol pronunciation. The initials help in reading the consonants of characters, and the endings help in reading the finals of characters. Therefore, a complete character may consist of a transliterated character, an initial, and an ending. As a result, there are four types of characters: type C, which is a single character, such as "安", "客"; type xC, which consists of one initial and one normal character, such as " 舌 剌", " 中 豁"; type Cy, which is a normal character with an ending, such as "阿勒", "迭克"; and type xCy, which is a normal character with both an initial and an ending, such as " 舌 魯黑", " 中 忽勒".
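The four-way classification can be sketched in a few lines of Python. This is only an illustrative sketch, not the project's implementation: the class name `TranslitChar` and its field names are hypothetical, and the split of each example into initial, base character, and ending follows the description above (the initial precedes the base character and the ending follows it).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TranslitChar:
    """One transliterated character with optional auxiliary symbols (hypothetical model)."""
    base: str          # the normal transliterated character (C)
    initial: str = ""  # small symbol marking the initial consonant (x)
    ending: str = ""   # small character marking the final sound (y)

    @property
    def kind(self) -> str:
        # Builds one of "C", "xC", "Cy", "xCy" from the parts that are present.
        return ("x" if self.initial else "") + "C" + ("y" if self.ending else "")


# Examples taken from the types listed in the text.
samples = [
    TranslitChar("安"),                          # type C
    TranslitChar("剌", initial="舌"),            # type xC
    TranslitChar("阿", ending="勒"),             # type Cy
    TranslitChar("魯", initial="舌", ending="黑"),  # type xCy
]

# Counting characters per type over a sequence of characters.
type_counts = Counter(ch.kind for ch in samples)
```

Applied to a whole digitized text rather than four samples, a counter of this kind would yield per-type totals of the sort the statistics report.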
According to the statistics, the number of each type is as follows.

Information about interlinear translation:
The interlinear translation of SHM aligns with the strings of transliterated characters, which is important because it allows a Mongol-Chinese dictionary of the 13th century to be compiled from the text. In addition, the lines of interlinear translation carry much grammatical information: for example, "每" represents the plural category, "自的行" represents one kind of objective case, and "有來" marks the perfect aspect.
Information about translation:
SHM, as a complete historical and literary work, is not only a great Mongol book, but also an important Chinese document. The Chinese translation contains much cultural and historical information as well as information about the grammar and vocabulary of middle ancient times.
So far we have discussed only a small part of the information in SHM. Here we limit ourselves to the format and formal information needed to build an electronic version and a retrieval system for SHM. The electronic version of SHM keeps all the basic features and retains all of the original information. We designed an aligned format between transliteration and side-for-side translation, kept the types of characters with initials and/or endings, and preserved all the simplified and traditional Chinese characters as well as variant characters. Information about volumes, pages, segments, chapters, and punctuation is kept as in the original book. For the electronic version, we made some revisions. Although we put all the volumes together into one system, users can directly select any volume and operate within just that volume.
Within one volume, users do not need to turn pages; they simply scroll the window, in which the contents remain divided into pages. The most prominent revision is the reading direction of the characters, which changes from vertical lines to horizontal lines. Figure 2 is an example of the electronic data.

THE ELECTRONIC RETRIEVAL SYSTEM OF SHM
From the viewpoint of modern corpus linguistics, the SHM is a complicated text with many embedded layers and sections. For electronic processing, it is necessary to differentiate these "sections" and to add markers between them.
The main sections are sections of transliterated characters, of translation, and of interlinear translation. The other sections are sections of initials, of endings, of volumes, of pages, and of original notes. The sections of transliteration are the main body of the electronic project. The sections of translation are the character strings that follow the transliteration; they differ from the transliteration in that they contain no interlinear characters. The sections of side-for-side translation are the strings that accompany the transliterated characters interlinearly. When we digitized the work, we added markers within the text. The marker "/…/" is used for translation strings, "[]" for side-for-side translation strings, "{}" for initial strings, "<>" for ending strings, and "()" for strings of original notes, as shown in Figure 3. With this classification of sections, it is convenient to design different retrieval methods for different sections.
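As an illustration of how such markers make the sections machine-separable, the following Python sketch splits a marked-up line into typed tokens. It rests on several assumptions that the text does not confirm: the markers are plain ASCII delimiters, an initial marker precedes the character it annotates, an ending or side-for-side string follows the characters it belongs to, and unmarked text is transliteration. The function name `tokenize` and the sample lines are hypothetical and are not taken from the project's data files.

```python
import re

# One alternative per marker type described in the text; anything left
# unmarked is treated as transliterated characters (an assumption).
TOKEN_RE = re.compile(
    r"/(?P<translation>[^/]*)/"             # /.../  translation strings
    r"|\[(?P<side_translation>[^\]]*)\]"    # [...]  side-for-side translation
    r"|\{(?P<initial>[^}]*)\}"              # {...}  initials
    r"|<(?P<ending>[^>]*)>"                 # <...>  endings
    r"|\((?P<note>[^)]*)\)"                 # (...)  original notes
    r"|(?P<translit>[^/\[\]{}<>()\s]+)"     # unmarked transliteration
)


def tokenize(line: str) -> list[tuple[str, str]]:
    """Return (section_type, text) pairs in the order they occur in the line."""
    tokens = []
    for match in TOKEN_RE.finditer(line):
        kind = match.lastgroup  # name of the single alternative that matched
        tokens.append((kind, match.group(kind)))
    return tokens


# Hypothetical encoding of the example "成吉思 中 合罕" with its interlinear glosses.
sample_tokens = tokenize("成吉思[太祖]{中}合罕[皇帝]")
```

Once every token carries its section type, a retrieval method can restrict itself to one type, for example searching only the `translit` tokens or only the `side_translation` tokens.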
While processing sections of transliteration, the alignment of transliterated characters with interlinear characters is difficult, as is the appropriate processing of initials and endings. We propose a comparison algorithm for these strings. After we get the length of a transliterated character string, we compare it with the length of the extracted interlinear string. If the transliterated string is longer than the interlinear string, we fix the position of the interlinear string at the position of the last character of the transliterated string. If the interlinear string is longer than the transliterated string and clashes with previous strings, both strings are shifted rearward synchronously. With these section processing and positioning techniques, we can reconstruct the integrated three-line format in the digitized system. For sections of translation, we can use the same method as for sections of transliterated character strings.
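The positioning rule can be sketched as follows. This is one possible reading of the description, not the project's code: the interlinear string is anchored so that it ends at the last character of its transliterated string, and when that placement would overlap the previous interlinear string, both strings are shifted rearward by the same amount. Each character is assumed to occupy one column, initials and endings are ignored, and the names `Placed`, `align`, and `render` are hypothetical.

```python
from dataclasses import dataclass

PAD = "\u3000"  # ideographic space, used to pad CJK columns when rendering


@dataclass
class Placed:
    word: str         # transliterated string
    gloss: str        # interlinear (side-for-side) string
    word_start: int   # column of the first character of the word
    gloss_start: int  # column of the first character of the gloss


def align(pairs: list[tuple[str, str]]) -> list[Placed]:
    placed = []
    word_cursor = 0   # first free column on the transliteration line
    gloss_cursor = 0  # first free column on the interlinear line
    for word, gloss in pairs:
        word_start = word_cursor
        # Anchor the gloss so that it ends at the word's last character.
        gloss_start = word_start + len(word) - len(gloss)
        if gloss_start < gloss_cursor:
            # The gloss would clash with the previous one: shift both rearward.
            shift = gloss_cursor - gloss_start
            word_start += shift
            gloss_start += shift
        placed.append(Placed(word, gloss, word_start, gloss_start))
        word_cursor = word_start + len(word)
        gloss_cursor = gloss_start + len(gloss)
    return placed


def render(placed: list[Placed]) -> tuple[str, str]:
    """Return (interlinear line, transliteration line) padded to the computed columns."""
    gloss_line, word_line = "", ""
    for p in placed:
        gloss_line += PAD * (p.gloss_start - len(gloss_line)) + p.gloss
        word_line += PAD * (p.word_start - len(word_line)) + p.word
    return gloss_line, word_line


layout = align([("成吉思", "太祖"), ("合罕", "皇帝")])
```

In this sketch a gloss that is longer than its word pushes the word rearward, so the two lines stay synchronized at the end of every word; a third line for initials could be positioned with the same column bookkeeping.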
To retrieve the text of SHM, we have designed three retrieval techniques: concordance, browsing, and statistical retrieval. Each method processes its own distinct sections.
The concordance method processes strings of transliterated characters. When the retrieval results are shown, their contexts need to be shown with them. Figure 4 gives an example for the retrieval string "必孫". Users can see the preceding and following character strings together with their interlinear characters, initials, and endings.
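A minimal keyword-in-context sketch of concordance retrieval is given below. It assumes the transliterated characters of a passage are available as one plain string with markers, interlinear strings, initials, and endings already stripped; the function name `concordance`, the `width` parameter, and the sample string (a repetition of words quoted earlier, not a passage from SHM) are illustrative assumptions.

```python
def concordance(text: str, query: str, width: int = 5) -> list[tuple[int, str, str, str]]:
    """Find every occurrence of query and return (offset, left context, hit, right context)."""
    if not query:
        raise ValueError("query must not be empty")
    hits = []
    start = text.find(query)
    while start != -1:
        end = start + len(query)
        left = text[max(0, start - width):start]   # up to `width` characters before the hit
        right = text[end:end + width]              # up to `width` characters after the hit
        hits.append((start, left, query, right))
        start = text.find(query, start + 1)        # continue searching after this hit
    return hits


# Illustrative string only; it is not a quotation from the SHM text.
sample_text = "成吉思合罕成吉思"
sample_hits = concordance(sample_text, "成吉思", width=2)
```

The character offsets returned for each hit could then be mapped back to the volume and page of the occurrence, and the aligned interlinear strings re-attached for display.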

Figure 1. The original format of SHM (with English annotations)

Figure 2. A piece of the electronic retrieval system of SHM

Figure 3. Markers added to text

Table 1. The statistics for the four types of transliterated characters