Thursday, 14 October 2004
Need for Character Allocation Tables to incorporate the scripts of Sinhala and Tamil
by S. Donald E. Gaminitillake, CEO, S. Donald E. Gaminitillake Associates
For Sri Lanka to be a global information technology competitor, it is now recognised that a skilled, technologically literate workforce is imperative. For workers to be highly productive, they must have the education and training necessary to keep them abreast of the development of technology.
To achieve this aim, Information and Communication Technologies (ICT) must be developed in Sri Lanka, and to develop ICT in Sri Lanka, character allocation tables in both Sinhala and Tamil need to be introduced. Without this crucial step, a most unfortunate digital divide could result, with adverse knock-on effects: under-utilisation of resources and inequality in income distribution, growth and productivity. It is therefore of paramount importance that Sinhala and Tamil character allocation tables be developed and introduced.
There are no publications or even a simple chart giving the complete Sinhala alphabet in Sri Lanka. In Sinhala, there are three versions of the alphabet. The basic pure Sinhala alphabet (Elu Hodiya) consists of 12 vowels and 25 consonants. A mixed Sinhala alphabet (Mishra Sinhala Akshara Malawa) consists of 18 vowels and 41 consonants. However, the accepted Sinhala alphabet (Sammatha Sinhala Akshara Malawa) consists of 20 vowels and 41 consonants. In Sinhala, all consonants are expanded for each vowel combination. The present-day Sinhala alphabet contains a total of 1660 individual characters.
Unlike other South Asian scripts, Tamil does not have signs for voiceless aspirated (such as /kh/), voiced (/g/), and voiced aspirated stops (/gh/), which explains the relatively small number of signs in the Tamil script compared to other South Asian scripts. To write some of these sounds, some signs have multiple sound values: the Tamil letter ka stands for both /ka/ and /ga/. Sometimes these phonetic alternations are conditioned by the sound's position in the word. Borrowing from Sanskrit also added some special letters to Tamil.
Six characters borrowed from Grantha have been used to write Sanskrit loanwords. Nowadays they are used to write words of English origin as well. As in the Sinhala script, a Tamil letter carries the inherent vowel /a/. To change this vowel to another, extra strokes or signs are placed around the letter. Even the absence of the vowel is indicated, by a dot called the virama, written above the letter. At present, the Tamil alphabet contains a total of 247 characters.
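The figure of 247 can be checked with a little arithmetic. The sketch below uses the conventional breakdown of the Tamil alphabet (12 vowels, 18 consonants, every consonant-vowel combination, and the single character aytham). That breakdown is not spelled out in this article and is supplied here as an assumption; the tally leaves out the borrowed Grantha letters.

```python
# Conventional breakdown of the Tamil alphabet (an assumption, not from the article).
VOWELS = 12       # independent vowels
CONSONANTS = 18   # pure consonants, written with the dot above
AYTHAM = 1        # the single special character aytham


def tamil_character_count() -> int:
    """Count vowels, consonants, every consonant-vowel combination, and aytham."""
    combined = CONSONANTS * VOWELS  # each consonant expanded for each vowel
    return VOWELS + CONSONANTS + combined + AYTHAM


print(tamil_character_count())
```

The same kind of expansion, every consonant combined with every vowel, is what drives the much larger Sinhala total quoted above.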
The Sinhala and Tamil languages have unique characters for the vowels and for most of the consonants known in the English language. In Sinhala, several characters exist for close sound values, and sequences of graphical symbols cluster around a character. This is one of the reasons that, as yet, no character allocation tables for these languages have been developed.
Many Asian languages such as Arabic, Chinese, Japanese, and Korean already have character allocation tables. At present, neither Sinhala nor Tamil can progress in the Information Technology sector due to the lack of efficient and sensible character allocation tables.
To use Sinhala and Tamil characters efficiently on a QWERTY keyboard, a methodical, logical and easy translation needs to be carried out from Roman characters to Sinhala and Tamil characters: a task usually achieved via a character allocation table. However, this simple translation is hampered when the individual characters are broken into parts (glyphs) and those parts are assigned to the QWERTY keyboard.
As the method of assigning glyphs differs from one software package to another, text composed in one package becomes incomprehensible in another. For example, a document composed with software X using font type AB must also be read on an application running software X using font type AB. Otherwise, a character "............." used in one software would be reproduced as ............ in another. Key-in methods encounter the same obstacles.
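The incompatibility can be illustrated with a minimal sketch. The two assignment tables below are invented for illustration: the code values and glyph names are hypothetical and do not describe any real Sinhala or Tamil package.

```python
# Hypothetical glyph assignments used by two font-specific packages.
# All code values and glyph names are invented for this illustration.
SOFTWARE_X = {0x61: "ka-body", 0x62: "vowel-sign-i", 0x63: "ga-body"}
SOFTWARE_Y = {0x61: "vowel-sign-i", 0x62: "ga-body", 0x63: "ka-body"}


def render(codes, assignment):
    """Look up each stored code in one package's glyph assignment."""
    return [assignment.get(code, "missing-glyph") for code in codes]


# A document typed in software X stores only the glyph codes.
document = [0x61, 0x62]

as_written = render(document, SOFTWARE_X)   # what the author saw
as_reopened = render(document, SOFTWARE_Y)  # what another package shows

print(as_written)
print(as_reopened)
```

The stored codes are identical in both cases; only the assignment differs, which is why the same file reads correctly in one package and as nonsense in the other.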
The usage of characters is also limited: uncommon characters have been discarded. A user is therefore restricted to individual systems and unable to use different types of software. This is a direct result of existing software for Sinhala and Tamil having fixed parameters built around a fixed set of fonts predetermined by the software developers.
Two different keyboard layouts are used to identify these broken individual characters. This difficulty is particularly acute when confronted with words with many characters and different combinations.
Nevertheless, existing software relies solely on glyphs to construct a complete character. If the character allocation is composed of glyphs, however, modern technology such as Optical Character Readers (OCR) will not be able to identify all these glyphs correctly and reconstruct every individual character.
For a meaningful solution, all Sinhala and Tamil characters first need to be identified and named individually. Subsequently, all individual characters should be allocated into a matrix, with a unique location number specified for each.
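The two steps, naming every character and then giving each one a fixed location, can be sketched as follows. The character names, the starting location and the function name are all hypothetical; the article does not prescribe any particular naming scheme or numbering.

```python
def build_allocation_table(names, base=0x0100):
    """Assign each named character one unique, fixed location number."""
    if len(set(names)) != len(names):
        raise ValueError("every character name must be unique")
    return {name: base + offset for offset, name in enumerate(names)}


# Step 1: identify and name complete characters (a tiny hypothetical sample).
CHARACTER_NAMES = [
    "SINHALA KA", "SINHALA KAA", "SINHALA KI",
    "TAMIL KA", "TAMIL KAA",
]

# Step 2: allocate them into the matrix.
TABLE = build_allocation_table(CHARACTER_NAMES)

# The reverse view lets any program go from a location back to the character.
LOCATION_TO_NAME = {location: name for name, location in TABLE.items()}

for name, location in TABLE.items():
    print(f"{location:#06x}  {name}")
```

Because each complete character occupies exactly one location, two programs that share the table agree on what a stored number means, whatever font either of them uses.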
This solution also has the added advantage of allowing a multi-layer matrix, with one layer where sound values may be stored and another where security data may be entered. Using the simple QWERTY keyboard, it would be possible to access the matrix from any software with a simple dictionary backup, which would look up the fixed locations in the matrix.
Because of the fixed matrix value, the targeted characters will be represented correctly regardless of the method, keyboard or application used. This special software would run between the operating system and the application program. This concept of allocating characters has been used successfully for other languages such as Korean and Arabic.
Although these languages initially did use glyphs for ICT, they recognised the practical obstacles and have since moved to using complete characters in their allocation tables.
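A minimal sketch of the dictionary lookup described above is given here, assuming a romanised key-in scheme and a longest-match rule. Neither the scheme, the location numbers nor the function name comes from the article; they are placeholders chosen for illustration.

```python
# Hypothetical key-in dictionary: romanised keystrokes -> fixed matrix location.
KEY_DICTIONARY = {"ka": 0x0100, "kaa": 0x0101, "ki": 0x0102}


def keystrokes_to_locations(typed, dictionary=KEY_DICTIONARY):
    """Convert QWERTY input to matrix locations, preferring the longest match."""
    longest = max(len(key) for key in dictionary)
    locations = []
    i = 0
    while i < len(typed):
        # Try the longest possible chunk first so "kaa" wins over "ka".
        for size in range(min(longest, len(typed) - i), 0, -1):
            chunk = typed[i:i + size]
            if chunk in dictionary:
                locations.append(dictionary[chunk])
                i += size
                break
        else:
            raise KeyError(f"no character for input starting at {typed[i:]!r}")
    return locations


print([hex(loc) for loc in keystrokes_to_locations("kaaki")])
```

Only the location numbers are stored in the document, so a different keyboard layout would need only a different dictionary, not a different table.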
Another possible solution could be the utilisation of OCR to identify a complete character. The OCR hardware scans the material and compares the image with the text matrix in the software. The character is analysed in a Cartesian coordinate system on a pre-defined grid to identify the character.
Once the shape of the character is determined, the program could search for similar characters in the character allocation matrix and thereby determine the character. However, imprecise placement of the characters to be identified could lead to mismatches. Further emphasis also needs to be placed on the exact combination and sequence in identifying characters.
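The grid comparison can be sketched as simple template matching: count the grid cells where the scanned image and a stored shape disagree, and pick the location with the fewest disagreements. The 3x3 shapes and location numbers below are invented placeholders, far smaller than any real character image, and real OCR systems use considerably more elaborate methods.

```python
# Hypothetical stored shapes, keyed by matrix location ("#" = ink, "." = blank).
TEMPLATES = {
    0x0100: ("#.#",
             "###",
             "#.#"),
    0x0101: ("###",
             "#..",
             "###"),
}


def mismatch(a, b):
    """Number of grid cells in which two equal-sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def identify(scanned, templates=TEMPLATES):
    """Return the matrix location whose stored shape is closest to the scan."""
    return min(templates, key=lambda loc: mismatch(scanned, templates[loc]))


def shift_right(bitmap):
    """Simulate a misplaced scan by moving the image one cell to the right."""
    return tuple("." + row[:-1] for row in bitmap)


noisy_scan = ("#.#",
              "###",
              "#..")  # one cell lost in scanning
print(hex(identify(noisy_scan)))

misplaced = shift_right(TEMPLATES[0x0100])
print(mismatch(misplaced, TEMPLATES[0x0100]))
```

A single damaged cell still leaves the scan closest to the right shape, while a one-cell shift makes the same character disagree with its own template in most cells, which is the placement problem noted above.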
A character allocation table containing a digital matrix of all individual characters in Sinhala and Tamil would make it easy to identify the characters and display them on a Video Display Terminal (VDT). The terminal provides a visual window into this data. The data will be editable text and will not appear as images (e.g. JPEG). The typographic industry deals with data that must be accessed, corrected, deleted, added or shifted. Many terminals will provide hyphenation and justification routines.
A similar digital matrix could be used for sound recognition using text-to-voice and voice-to-text software. Once the sound values are placed on a layer of the digital matrix, the software would convert words from a computer document (e.g. word processor document, web page) into audible speech via the computer.
This would be helpful to people who need or want oral verification of text, and would allow sound to be transmitted in cyberspace. Other benefits include emails being read aloud in either Sinhala or Tamil for users whose vision may be impaired.
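The idea of a sound layer can be sketched as a second field stored at each matrix location. The entry layout, the location numbers and the romanised sound values below are assumptions made for illustration; a real text-to-speech system would pass such values on to a speech synthesiser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MatrixEntry:
    """One matrix location with two layers: the character name and its sound."""
    name: str
    sound: str


# Hypothetical entries; locations and romanised sound values are placeholders.
MATRIX = {
    0x0100: MatrixEntry("SINHALA KA", "ka"),
    0x0101: MatrixEntry("SINHALA KAA", "kaa"),
    0x0102: MatrixEntry("SINHALA KI", "ki"),
}


def to_sound_values(locations, matrix=MATRIX):
    """Read the sound layer for each stored location, in document order."""
    return [matrix[location].sound for location in locations]


stored_text = [0x0101, 0x0102]
print(to_sound_values(stored_text))
```

Since the text is stored as locations of complete characters, the sound lookup is one step per character, with no need to reassemble glyph fragments first.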
Text-to-speech technology could be integrated with optical character reading systems. This digital matrix would also enable speech synthesis markup requirements for voice markup languages. It would provide a mechanism to specify accurately the desired acoustic-phonetic rendering of a given text segment.
Further, the matrix would be able to incorporate the difference between speech and non-speech audio output (e.g. wave and MIDI files). Other advantages include mobile telecommunications using SMS (Short Message Service), which would no longer be restricted to English but would also be available in Sinhala and Tamil.
All these advantages are possible because the software is built on a correct two-byte allocation table, rather than an allocation table restricted to glyphs.
The computerised two-byte unique character allocation table does not require the alteration or simplification of Sinhala or Tamil, and in fact permits all characters in both languages to be incorporated and images.
The two-byte unique character allocation table also allows other characters to be included, such as Roman, Greek and Russian characters, alphanumeric characters, the special characters of German, French and East European languages, and all diacritical marks. Any delay or shortfall in bringing these characters into general use in Sri Lanka would impede the use of Sinhala and Tamil in Sri Lanka's future economic, academic and linguistic development.
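The capacity of a two-byte table can be checked against the character totals quoted in this article. In the sketch below, the big-endian byte order and the example location numbers are assumptions made for illustration; only the figures 1660 and 247 come from the text.

```python
import struct

TWO_BYTE_CAPACITY = 2 ** 16   # distinct values that fit in two bytes
SINHALA_CHARACTERS = 1660     # total quoted in the article
TAMIL_CHARACTERS = 247        # total quoted in the article

# Locations left over for Roman, Greek, Russian and other characters.
remaining = TWO_BYTE_CAPACITY - SINHALA_CHARACTERS - TAMIL_CHARACTERS


def encode(locations):
    """Store each location as exactly two bytes (big-endian is an assumption)."""
    return b"".join(struct.pack(">H", location) for location in locations)


def decode(data):
    """Recover the locations from the two-byte stream."""
    return [value for (value,) in struct.iter_unpack(">H", data)]


stored = encode([0x0101, 0x0102])
print(remaining, stored, decode(stored))
```

Every character costs the same two bytes, so a document's length in bytes is always twice its length in characters under this scheme.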
As has been argued above, the development of character allocation tables to enable the use of Sinhala and Tamil scripts in ICT is absolutely imperative. The sections above have discussed the methodology of this development and demonstrated its feasibility and applicability. This development needs to be implemented with all possible speed if Sri Lanka is not to be left behind in the global arena of Information Technology.
Produced by Lake House