Nevskaya, I.A.

Linguistic Computer Databases as a Basis for Preservation and Revitalization of Indigenous Turkic Languages of Siberia

1. Present Concept of Language Documentation

Language documentation, which uses up-to-date technologies for gathering and processing linguistic data, has only recently emerged as an independent discipline (Language documentation and description; Lehmann 1983; Lehmann: Internet publication, etc.). It is currently going through a period of rapid growth: its methods are being developed, and its goals are being specified as distinct from those of descriptive linguistics, which aims at gathering a corpus of texts, creating a dictionary and writing a grammar of a language. Language documentation is gaining significance in the present historical period, when rapid globalization has endangered thousands of languages. The attention of the general public is being drawn to these issues: associations for endangered languages have been or are being established, foundations for their research are being formed, and academic programs aimed at their documentation are being launched. In particular, we can mention the DoBeS (Dokumentation Bedrohter Sprachen) program, funded by the German nongovernmental Volkswagen Foundation (Volkswagen-Stiftung), and the ELDP (Endangered Languages Documentation Programme), launched at the School of Oriental and African Studies (SOAS) of the University of London and funded by a private foundation of Hans Rausing.

The modern concept of language documentation, its methods and its technologies were discussed at a typological school held at Frankfurt University in September 2004 and organized by a group of professors from the department of language and culture disciplines. The international conference “Multilingual World” was held within the framework of the typological school. It presented the first results of projects funded by the aforementioned programs for the documentation of endangered languages, together with a theoretical interpretation of these results with respect to the development of methods and technologies of language documentation as an independent linguistic discipline.

The main goal of language documentation is not merely to record a linguistic system but to document a language in the different settings of its natural use, that is, to record different communicative situations in various social and cultural contexts. This documentation draws on all technologies currently available to a researcher: audio and video recordings, pictures and graphics, handwritten texts. All types of information are then transcribed, annotated, analyzed, commented on and archived. Priority is given to the following principles of language documentation (Austin 2004):

  1. collection of a wide variety of high-quality linguistic materials as a basis for the description of different linguistic phenomena;
  2. establishment of a basis for the revitalization of a language, even if all other sources of linguistic material have been lost;
  3. creation of materials for preserving and teaching a language.

Linguistic data are documented on different media:

  1. video recordings;
  2. audio recordings;
  3. photographs and pictures;
  4. written records (e.g., transcription, morphological analysis, description of specific phenomena);
  5. metadata (structured data on the materials gathered).
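
The notion of metadata as structured data on the materials gathered can be illustrated with a minimal sketch. The field names, the allowed media types and the example values below are illustrative assumptions, not a schema prescribed by the cited sources or by any particular archive.

```python
import json
from dataclasses import dataclass, field, asdict

# Media types taken from the list above; the labels themselves are assumptions.
ALLOWED_MEDIA = {"video", "audio", "image", "text"}


@dataclass
class ItemMetadata:
    """Hypothetical metadata record describing one documented item."""
    identifier: str
    language: str
    media_type: str                      # one of ALLOWED_MEDIA
    genre: str                           # e.g. "epic tale", "dialogue"
    recorded_on: str                     # ISO 8601 date string
    participants: list = field(default_factory=list)
    related_files: list = field(default_factory=list)


def validate(record):
    """Return a list of human-readable problems; an empty list means 'looks fine'."""
    problems = []
    if record.media_type not in ALLOWED_MEDIA:
        problems.append(f"unknown media type: {record.media_type}")
    if not record.participants:
        problems.append("no participants listed")
    return problems


# Placeholder values only; they do not describe a real recording.
example = ItemMetadata(
    identifier="item-0001",
    language="Shor",
    media_type="audio",
    genre="epic tale",
    recorded_on="2000-01-01",
    participants=["speaker-01"],
    related_files=["item-0001-transcription.txt"],
)

# ensure_ascii=False keeps Cyrillic or other non-ASCII text readable in the output.
print(json.dumps(asdict(example), ensure_ascii=False, indent=2))
```

Storing such records as plain JSON text keeps them readable without special software, which fits the requirement that documentation remain usable in the long term.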

Taken together, these principles form the basis of language documentation, which should include the following components:

  1. recordings (video or audio) of spoken language in different styles and contexts, supplemented with a transcription, a translation into a metalanguage and an annotation;
  2. records of written texts in different styles and contexts, supplemented with a transcription, a translation into a metalanguage and an annotation;
  3. significant sociological and cultural information;
  4. a bilingual dictionary;
  5. a thesaurus;
  6. educational and methodological materials;
  7. a grammatical sketch.

The documentation methods and terminology used should make information on a specific language accessible to a wide audience: linguists, indigenous community members, teachers and pupils.

One interesting attempt to put the results of the scientific documentation of a disappearing Turkic language to practical use is the multimedia Karaite CD (Csato and Nathan 2004).

2. The Main Principles of Language Documentation

High-quality language documentation should meet the following requirements (Woodbury 2004):

  1. the material gathered should be diverse: it should represent different participants with different social statuses and social roles; different channels of linguistic communication, such as spoken language, written language and electronic mail; different linguistic genres, including dialogues and monologues, formal and informal communication, etc.; and different dialects and jargons;
  2. the material gathered should be large in volume and statistically relevant;
  3. documentation should be ongoing and should involve as many participants as possible, especially representatives of the indigenous community whose language is being documented; they should be taught the relevant recording methods, equipped with the necessary technology and encouraged in every possible way to continue the documentation; this type of documentation is thus opposed to the traditional one, which is conducted, as a rule, individually by linguists belonging to a traditional society;
  4. documentation should be “transparent”: data should be processed and indexed in such a way that they can still be used even 500 years from now; all data should be translated into a more widely used metalanguage; transcription should rely on a phonetic and phonological description of the language documented; sentence (syntactic) structures should be identified; a simple collection of texts on audio or video media is not sufficient for the documentation to be considered complete;
  5. documentation should be archived in such a way that it can easily be preserved or, if necessary, transferred to new media, which, as we know, are replaced every 5-10 years; full and comprehensive metadata on the linguistic materials gathered are required;
  6. documentation should be carried out according to the ethical standards of scientific research: researchers should respect and observe intellectual property rights; they should conduct documentation in collaboration with the community which supplies the linguistic data; they should respect the customs of this people and meet its wishes.
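
The archiving requirement, that materials can be transferred to new media without loss, can be supported in practice by a checksum manifest. The following is a minimal sketch under assumptions: the function names are hypothetical, and the cited sources do not prescribe SHA-256 or any other specific technique.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its checksum."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify_copy(manifest, new_root):
    """Compare a migrated copy against the manifest.

    Returns a dict of relative path -> "missing" or "changed";
    an empty dict means every listed file arrived intact.
    """
    new_root = Path(new_root)
    problems = {}
    for relative, expected in manifest.items():
        target = new_root / relative
        if not target.is_file():
            problems[relative] = "missing"
        elif sha256_of(target) != expected:
            problems[relative] = "changed"
    return problems
```

A manifest built before migration with `build_manifest` can be stored alongside the metadata and checked with `verify_copy` after each transfer to a new medium.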

3. Documentation of Indigenous Turkic Languages of Siberia

The documentation of the indigenous Turkic languages of Siberia is currently one of the priorities of Siberian linguistics. It is also an urgent task, since dozens of indigenous languages of Siberia are endangered. Modern documentation of Turkic languages is only taking its first steps. We should mention the first attempt in Southern Siberia to create a machine-readable (computer) data collection for the Shor language, undertaken in 1990-1992 by the Turkologists A.V. Esipova and I.A. Nevskaya of Novokuznetsk State Pedagogical Institute (now Kuzbass State Pedagogical Academy) and the programmer D.Y. Ivanov. A software environment for textual and dictionary databases was developed, work on building them began, and a program for automated morphological analysis was created (Esipova and Nevskaya 1994). Because of the economic reforms that began in the country and had a dramatic effect on science, and because of the lack of funding, the work had to be interrupted. It was resumed in 1998-2001 within an international project on the creation of computer databases of the Shor language, funded by the RFH (Russian Foundation for the Humanities) and the German Academic Research Society. The Russian research group was directed by A.V. Esipova and the German one by Marsel Erdahl, while I.A. Nevskaya coordinated the work of the Russian and German research groups.

The Shor language, a language of the indigenous population of Siberia, is endangered. We faced the task of creating an electronic corpus of Shor texts based on unpublished texts and expedition materials. It was also to include published but hard-to-obtain sources (e.g., 19th-century missionary literature) and samples of the developing modern Shor literature.

The resulting corpus of Shor texts gave researchers material of historical and linguistic interest. It served as a basis for various studies in the grammar, lexicology and dialectology of the Shor language, as well as in sociolinguistics and folklore.

Work on the project largely followed the stated goals, tasks and plan, which included:

  1. creation of the technical prerequisites for the project’s implementation;
  2. collection of textual materials and preparation of a basis for the electronic corpus of Shor texts;
  3. work on automated analysis and the building of the electronic corpus of Shor texts itself;
  4. linguistic description of the resulting corpus of texts.

Creating the technical prerequisites for the project’s implementation involved the following:

  • To render correctly in electronic form the special characters needed to record Shor texts in Cyrillic and Latin transcription, a number of fonts were created (Siberia_fix, Janalif.ttf).
  • Shor Cyrillic and Latin texts scanned with the FineReader program formed the databases used for the recognition of Shor texts. At present, the rate of correct recognition of Shor Cyrillic reaches 99% when the scanned text is of high quality.
  • The program Shoebox was chosen for creating the electronic corpus of texts and for their morphological analysis. The choice was determined by its capacity for automated morphological segmentation of word forms and by its ease of use. However, the program had to be adapted to the purposes of our project, especially for creating the structure of the Shor database.
  • A package of utility programs (in Turbo Pascal and Visual Basic) was created to convert text files and Shoebox-format files for automated work in Shoebox and Word.
  • So that users of different operating systems could read all the special characters, the presentation of Shor data on the Internet was prepared using Unicode. Tables of the encoding of Shor characters were built and published on the Internet.
  • A converter of Shor texts from Shoebox to HTML format was created.
  • The Shor website (http://shoriya.ngpi.rdtc.ru) was designed so that information is displayed identically in two popular browsers, Netscape Navigator and Internet Explorer.
  • Thanks to Shoebox, we could simultaneously create an electronic lexicon (a Shor-Russian-English dictionary) and analyze the Shor texts included in the electronic corpus.
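
The conversion of Shoebox files to HTML mentioned in the list above can be sketched as follows. This is not the project's converter, which was written in other languages and whose details are not given here. The sketch assumes Shoebox's plain-text "standard format", in which each field starts with a backslash marker; the markers `\ref`, `\tx`, `\mb`, `\ge` and `\ft` and the sample record are assumptions chosen for illustration, and the sample text is placeholder material rather than Shor data.

```python
import html

# Placeholder interlinear records in an assumed marker set:
# \ref = record id, \tx = text, \mb = morpheme breaks,
# \ge = English glosses, \ft = free translation.
SAMPLE = r"""\ref 001
\tx word1 word2
\mb stem1-SUF stem2
\ge gloss1-CASE gloss2
\ft Free translation of the sentence.

\ref 002
\tx word3
\ft Another translation.
"""


def parse_records(text, record_marker="ref"):
    """Split standard-format text into a list of {marker: value} dicts."""
    records = []
    current = None
    last_marker = None
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if not line:
            continue
        if line.startswith("\\"):
            marker, _, value = line[1:].partition(" ")
            if marker == record_marker:
                current = {}
                records.append(current)
            if current is None:
                continue  # header fields before the first record are skipped
            current[marker] = value.strip()
            last_marker = marker
        elif current is not None and last_marker is not None:
            # A line without a marker continues the previous field.
            current[last_marker] += " " + line.strip()
    return records


def record_to_html(record, tiers=("tx", "mb", "ge", "ft")):
    """Render one record as an HTML table with one row per tier."""
    rows = [
        f"  <tr><th>{html.escape(tier)}</th><td>{html.escape(record[tier])}</td></tr>"
        for tier in tiers
        if tier in record
    ]
    ref = html.escape(record.get("ref", ""))
    return f'<table data-ref="{ref}">\n' + "\n".join(rows) + "\n</table>"


def to_html_page(records, title):
    """Wrap all records in a minimal UTF-8 HTML page."""
    body = "\n".join(record_to_html(record) for record in records)
    return (
        '<!DOCTYPE html>\n<html><head><meta charset="utf-8">'
        f"<title>{html.escape(title)}</title></head>\n"
        f"<body>\n{body}\n</body></html>"
    )


page = to_html_page(parse_records(SAMPLE), "Sample texts")
```

Declaring UTF-8 in the page and escaping every field value with `html.escape` is one way to make special characters display consistently across browsers and operating systems, which is the concern the Unicode item in the list addresses.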

In parallel with the creation of the technical prerequisites, textual materials were collected, which resulted in a basis for the electronic corpus of Shor texts. In this connection the following should be mentioned:

  • Expansion of the bibliographical reference book on Shor literature being prepared by A.V. Esipova and I.A. Nevskaya.
  • Accumulation of a corpus of texts of 1700 pages, gathered through work in the archives and libraries of Moscow, Novosibirsk, Abakan, Novokuznetsk, Tashtagol and Mysky, and including, along with expedition materials, unpublished and published texts and samizdat literature (a set of books and journals, photocopies).
  • Recording of different forms of folklore (including four epic stories) on audiotape during a trip to Shor communities.

The work on the automated morphological analysis of texts and on the creation of the electronic corpus of Shor texts itself included the following:

  • Creation of an electronic textual database comprising one third of the whole corpus of Shor texts collected during the project. Some of the Shor texts (in Cyrillic and Latin transliteration) are supplemented with a morphological analysis. Practically all the texts included in the electronic corpus are translated into Russian. All texts supplemented with a morphological analysis are translated into English. The legendary story Kazyr Too, for instance, which is being prepared for publication in Germany, has been translated into German.
  • The aforementioned sources served as a basis for an interactive Shor-Russian-English glossary to the texts. The glossary, created in the process of textual analysis, contains over 4300 lexical units. It is the most comprehensive of all existing Shor-Russian dictionaries and has no trilingual analogues.
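
How a glossary can emerge as a by-product of textual analysis may be illustrated with a small sketch: morphemes in a segmented line are paired with their glosses, and the pairs are accumulated across texts. The record layout (an `mb` line of hyphen-segmented words and a matching `ge` line of glosses), the function names and the placeholder forms are assumptions for illustration; they are not Shor data and do not reproduce how Shoebox builds its lexicon internally.

```python
from collections import Counter, defaultdict


def align(mb_line, ge_line):
    """Pair each morpheme with its gloss; both lines must have the same shape."""
    mb_words = mb_line.split()
    ge_words = ge_line.split()
    if len(mb_words) != len(ge_words):
        raise ValueError("word counts differ between morpheme and gloss lines")
    pairs = []
    for mb_word, ge_word in zip(mb_words, ge_words):
        morphs = mb_word.split("-")
        glosses = ge_word.split("-")
        if len(morphs) != len(glosses):
            raise ValueError(f"segment counts differ in {mb_word!r} / {ge_word!r}")
        pairs.extend(zip(morphs, glosses))
    return pairs


def build_glossary(records):
    """Accumulate morpheme -> Counter of glosses over all analyzed records."""
    glossary = defaultdict(Counter)
    for record in records:
        for morph, gloss in align(record["mb"], record["ge"]):
            glossary[morph][gloss] += 1
    return glossary


# Placeholder forms only ("stemA", "sufX", ...), not real Shor morphemes.
analyzed = [
    {"mb": "stemA-sufX stemB", "ge": "house-PL big"},
    {"mb": "stemA stemC-sufX", "ge": "house tree-PL"},
]

glossary = build_glossary(analyzed)
for morph in sorted(glossary):
    print(morph, dict(glossary[morph]))
```

Counting glosses per morpheme also exposes inconsistencies in the analysis, for example a morpheme glossed in two different ways, which can then be reviewed by hand.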

The project has both scientific and applied aspects, the latter connected with the revitalization of the written language and the national education of the Shor people. The electronic corpus of Shor texts created by the project participants, the ready-made morphological analysis, the Russian and English terminology included in it, the Shor-Russian-English dictionary, etc. can be used (and are being used) by researchers of different disciplines and by students and teachers of domestic and foreign universities for both educational and scientific work: compiling manuals and textbooks, preparing theoretical courses, and conducting typological and comparative-historical studies.

The development of the electronic corpus of Shor texts was accompanied by various linguistic studies of a synchronic, diachronic and areal character, whose results were used to improve the morphological analysis and to prepare the electronic lexicon. The electronic lexicon of the Shor language, developed as a glossary, contains 4300 lexical units and is already the most comprehensive dictionary of the language, with no analogous trilingual dictionaries.

The website http://shoriya.ngpi.rdtc.ru, which has been created and is being updated, has acquired a mirror on the server of Frankfurt University. The site contains information on the project’s goals and participants, a bibliography of the texts included in the electronic corpus of Shor texts, samples of spoken Shor, and Shor-Russian and Russian-Shor language materials. The website provides historical and geographical background and information on the ethnography of the Shor people: their crafts and customs, way of life and religion. It presents samples of automated morphological and lexical analysis of Shor texts, namely the story “In the Hunting Field” by A.I. Chudoyakov and the Shor folktale “Mashmoruk”, translated into Russian and English.

For all its undeniable merits, the Shor database does not fully meet the demands of present-day language documentation. For instance, the Shor folklore collection lacks samples of the spoken language and video recordings of different communicative situations; there are only audio recordings of several epic tales but no samples of speech, etc. The expansion of the Shor collection should continue, especially since all the technical prerequisites are in place.

Work on the documentation of Siberian languages should also be conducted in other Siberian regions. Turkologists are now paying particular attention to Altai, where a significant number of linguistic varieties spoken by Altai’s Turkic population have been recognized as independent languages rather than as dialects of the Altai literary language, which has created an urgent need to document them. They have been almost completely outside researchers’ field of view for a long time. Almost all of them are endangered or are approaching a state in which measures for their revitalization must be taken. The situation is aggravated by the fact that all of them are unwritten. At the same time, the rise of nationalism among these indigenous ethnic groups and their struggle to preserve their language and culture for their descendants mean that there is a social need for the documentation of these languages, which therefore has not only a purely scientific but also a tremendous social dimension.

An international project on the Chelkan language is the first attempt at the scientific documentation of an Altai Turkic language. On the German side, it is also based at Frankfurt University and directed by M. Erdahl. The Russian side is represented by the Institute of Philology, Siberian Branch of the RAS (Russian Academy of Sciences); it is directed by A.N. Ozonova and coordinated by I.A. Nevskaya. The project largely follows the Shor example, but it uses Toolbox, an up-to-date version of the linguistic database program.

Bibliography

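  1. Есипова, А. В. и Невская, И. А. Машинный фонд шорского языка и создание научно-методической базы для изучения родного языка шорцами [Esipova, A. V. and Nevskaya, I. A. A machine-readable collection of the Shor language and the creation of a scientific and methodological basis for the study of their native language by the Shors] // Шорский сборник [Shor Collection]. Кемерово: Кемеровский ГУ [Kemerovo: Kemerovo State University], 1994. С. 255-259.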
  2. Austin, P. K. Language documentation and your data. A lecture held during the Summer School “Language documentation” at Frankfurt University, 2004.
  3. Csato, E. and Nathan, D. Multimedia and documentation of endangered languages // Language documentation and description. Working papers. Issue 1. Ed.: P. Austin. London: London University, 2004.
  4. Language documentation and description. Working papers. Issue 1. Ed.: P. Austin. London: London University, 2004.
  5. Lehmann, Ch. et al. Linguistic documentation. Terminological and bibliographical database. http://www.uni-erfurt.de/sprachwissenschaft/proxy.php?file=lido/servlet/Lido_Servl.
  6. Lehmann, Ch. Directions for interlinear morphemic translation // Folia Linguistica, 16, 1983. 193-224.
  7. Samarin, W. J. Field linguistics: A guide to linguistic field work. New York: Holt, Rinehart and Winston, 1966.
  8. Woodbury, T. Defining documentary linguistics // Language documentation and description. Working papers. Issue 1. Ed.: P. Austin. London: London University, 2004.

Translated into English by O.A. Povoroznyuk

© IEA RAS, 2005
This website was created with support from UNESCO Moscow Office