Automatic Encoding Detection And Unicode Conversion Engine Computer Science Essay
In computers, characters are represented using numbers. The earliest encoding schemes were designed to support the English alphabet, which has a limited number of symbols. Later, the need for a worldwide character encoding scheme to support multilingual computing was identified. The solution was a 16-bit encoding scheme per character, capable of supporting a large character set. The current Unicode version contains 107,000 characters covering 90 scripts. Today, operating systems such as Windows 7 and UNIX-based systems, applications such as word processors, and data exchange technologies all support this standard, enabling internationalization across the IT industry. Even though Unicode has become the de facto standard, certain applications still use proprietary encoding schemes to represent their data. As an example, popular Sinhala news sites still do not adopt Unicode-based fonts to represent their content. This causes issues such as the requirement to download proprietary fonts and browser dependencies, undermining the efforts of the Unicode standard. In addition to web site content itself, there are collections of information in documents such as PDFs in non-Unicode fonts, making them difficult to search through search engines unless the search term is entered in that particular font encoding.
This has created the requirement of automatically detecting the encoding and transforming the text into the Unicode encoding of the corresponding language, so that the problems mentioned are avoided. In the case of web sites, a browser plug-in supporting automatic non-Unicode to Unicode conversion would eliminate the requirement of downloading legacy fonts that use proprietary character encodings. Although some web sites provide the source font information, certain web applications do not, making the auto-detection process more difficult. Hence the encoding must be detected before the text is fed to the transformation process. This has given rise to a research area of auto-detecting the language encoding of a given text based on language characteristics.
This problem will be addressed with a statistical language encoding detection mechanism. The technique will be demonstrated with support for all the Sinhala non-Unicode encodings. The demonstration implementation will be an extensible solution, able to support any given language based on future requirements.
Since the beginning of the computer age, many encoding schemes have been created to represent various writing scripts/characters for computerized data. With the advent of globalization and the development of the Internet, information exchanges crossing both language and regional boundaries are becoming ever more important. However, the existence of multiple coding schemes presents a significant barrier. Unicode provides a universal coding scheme, but it has not so far replaced existing regional coding schemes, for a variety of reasons. Thus, today's global software applications are required to handle multiple encodings in addition to supporting Unicode.
In computers, characters are encoded as numbers. A typeface is the scheme of letterforms, and the font is the computer file or program which physically embodies the typeface. Legacy fonts use different encoding systems for assigning numbers to characters, so two legacy font encodings may define different numbers for the same character. This can lead to conflicts in how characters are encoded across systems and requires maintaining multiple encoding fonts. The requirement for a standard with unique character identification was satisfied with the introduction of Unicode. Unicode enables a single software product or a single website to be targeted across multiple platforms, languages and countries without re-engineering.
Unicode
Unicode is a computing industry standard for the consistent encoding, representation and handling of text expressed in most of the world's writing systems. The latest version of Unicode contains more than 107,000 characters covering 90 scripts, organized as a set of code charts. The Unicode Consortium co-ordinates Unicode's development, with the goal of eventually replacing existing character encoding schemes with Unicode and its standard Unicode Transformation Format (UTF) schemes. The standard is supported in many recent technologies, including programming languages and modern operating systems. All W3C recommendations have used Unicode as their document character set since HTML 4.0. Web browsers have supported Unicode, especially UTF-8, for many years [4], [5].
Sinhala Legacy Font Conversion Requirement for Web Content
Sinhala language usage in computer technology dates back to the 1980s, but the lack of standards in character representation resulted in proprietary fonts. Sinhala was added to Unicode in 1998 with the intention of overcoming the limitations of proprietary character encodings. Dinamina, DinaminaUniWeb, Iskoola Pota, KandyUnicode, KaputaUnicode, Malithi Web and Potha are some Sinhala Unicode fonts developed so that the numbers assigned to the characters are the same. Still, some major news sites which display Sinhala content have not adopted the Unicode standards; legacy font encoding schemes are used instead, causing conflicts in content representation. In order to minimize the problems, font families were created in which only the shapes of the characters differ while the encoding remains the same. The FM Font Family and DL Font Family are examples where the font family concept is used to group Sinhala fonts with similar encodings [1], [2].
Adoption of non-Unicode encodings causes many compatibility issues when content is viewed in different browsers and operating systems. Operating systems such as Windows Vista and Windows 7 come with Sinhala Unicode support and do not require external fonts to be installed to read Sinhalese script. GNU/Linux distributions such as Debian and Ubuntu also provide Sinhala Unicode support. Enabling non-Unicode applications, especially web content, with support for Unicode fonts will allow users to view content without installing legacy fonts.
Non Unicode PDF Documents
In addition to content on the web, there exists a whole body of government documents in PDF format whose contents are encoded with legacy fonts. Such documents are not searchable through search engines by entering search terms in Unicode. To overcome this problem, it is important to convert such documents into a Unicode font so that they are searchable and their data can be used by other applications consistently, irrespective of the font. As another part of the project, this problem will be addressed through a converter tool which creates a Unicode version of an existing PDF document currently in a legacy font.
The Problem
Sections 1.3 and 1.4 describe two domains in which non-Unicode to Unicode conversion is required. The conversion involves identifying non-Unicode content and replacing it with the corresponding Unicode content. The content replacement requires a mapping engine, which performs the proper segmentation of the input text and maps each segment to the corresponding Unicode code point. The mapping engine can perform this task only if it knows the source text encoding. In general, the encoding is specified along with the content, so the mapping engine can consume it directly. However, in certain cases the encoding is not specified along with the content. Hence detecting the encoding through an encoding detection engine constitutes a research area, especially for non-Unicode content. In addition, incorporating the detection engine along with a conversion engine forms another part of the problem, addressing the application areas in 1.3 and 1.4.
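As an illustrative sketch of the mapping engine idea, the fragment below segments input text by greedy longest match against a conversion table and maps each segment to its Unicode equivalent. The two-entry table is invented purely for demonstration; a real engine would load the full code-point table of the detected legacy font.

```python
# Hypothetical mapping table: the legacy keys "l" and "ls" and their Unicode
# targets are invented for illustration only, not taken from a real font.
LEGACY_TO_UNICODE = {
    "l": "\u0dbd",          # maps to SINHALA LETTER DANTAJA LAYANNA
    "ls": "\u0dbd\u0dca",   # a two-character legacy cluster: la + al-lakuna
}

def convert(text, table):
    """Greedy longest-match segmentation followed by per-segment mapping."""
    out = []
    i = 0
    max_len = max(map(len, table))
    while i < len(text):
        for n in range(max_len, 0, -1):   # try the longest segment first
            seg = text[i:i + n]
            if seg in table:
                out.append(table[seg])
                i += n
                break
        else:
            out.append(text[i])           # unmapped characters pass through
            i += 1
    return "".join(out)
```

The greedy longest-match step matters because legacy encodings often map multi-character clusters to single Unicode sequences, so single-character substitution alone would mis-convert them.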
Project Scope
The system will initially target Sinhala fonts used by local sites. Later, the same mechanism will be extended to support other languages and scripts (Tamil, Devanagari).
Deliverables and outcomes
Web Service/Plug-in to Local Language web site Font Conversion which automatically converts website contents from legacy fonts to Unicode.
PDF document conversion tool to convert legacy fonts to Unicode
In both implementations, the language encoding detection will use the proposed encoding detection mechanism. It can be considered the core of the implementations, in addition to the translation engine which performs the non-Unicode to Unicode mapping.
Literature Review
Character Encodings
Character Encoding Schemes
Encoding refers to the process of representing information in some form. Human language is an encoding system by which information is represented in terms of sequences of lexical units, and those in terms of sound or gesture sequences. Written language is a derivative system of encoding by which those sequences of lexical units, sounds or gestures are represented in terms of the graphical symbols that make up some writing system.
A character encoding is an algorithm for presenting characters in digital form as sequences of octets. There are hundreds of encodings, and many of them have different names. There is a standardized procedure for registering an encoding. A primary name is assigned to an encoding, and possibly some alias names. For example, ASCII, US-ASCII, ANSI_X3.4-1986, and ISO646-US are different names for an encoding. There are also many unregistered encodings and names that are used widely. The character encoding names are not case sensitive and hence “ASCII” and “Ascii” are equivalent [25].
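The case-insensitive alias behaviour can be observed directly in Python, whose codecs registry resolves the registered names and aliases of an encoding to one canonical codec:

```python
import codecs

# "ASCII", "ascii", "US-ASCII" and "ISO646-US" are all names for the same
# codec; lookup() normalizes case and punctuation before resolving the alias.
for name in ("ASCII", "ascii", "US-ASCII", "ISO646-US"):
    print(name, "->", codecs.lookup(name).name)
```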
Figure 2.1: Character Encoding Example
Single Octet Encodings
When a character repertoire contains at most 256 characters, the simplest and most obvious approach is to assign a number in the range 0–255 to each character and use an octet with that value to represent it. Such encodings, called single-octet or 8-bit encodings, are widely used and will remain important [22].
Multi-Octet Encodings
In multi-octet encodings, more than one octet is used to represent a single character. A simple two-octet encoding is sufficient for a character repertoire that contains at most 65,536 characters. Two-octet schemes are uneconomical if the text mostly consists of characters that could be presented in a single-octet encoding. On the other hand, the objective of supporting a universal character set is not achievable with just 65,536 unique codes. Thus, encodings that use a variable number of octets per character are more common. The most widely used among such encodings is UTF-8 (UTF stands for Unicode Transformation Format), which uses one to four octets per character.
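The variable length of UTF-8 is easy to verify: encoding characters from different code-point ranges yields one to four octets each.

```python
# One to four octets per character under UTF-8, depending on the code point:
# ASCII "A", Latin "é", Sinhala letter ka, and an emoji from the supplementary plane.
samples = ["A", "\u00e9", "\u0d9a", "\U0001f600"]
for ch in samples:
    octets = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} -> {len(octets)} octet(s)")
```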
Principles of Unicode Standard
Unicode is used as the universal encoding standard to encode characters in all living languages. To this end, it follows a set of fundamental principles. The Unicode standard is simple and consistent: it does not depend on states or modes for encoding special characters.
The Unicode standard incorporates the character sets of many existing standards: For example, it includes Latin-I, character set as its first 256 characters. It includes repertoire of characters from numerous other corporate, national and international standards as well.
Modern businesses need to handle characters from a wide variety of languages at the same time. With Unicode, a single internationalization process can produce code that handles the requirements of all the world's markets at once. Data corruption problems do not occur, since Unicode has a single definition for each character. Because it handles the characters for all markets in a uniform way, it avoids the complexities of different character code architectures. All modern operating systems, from PCs to mainframes, support Unicode now or are actively developing support for it, and the same is true of databases. There are 10 design principles associated with Unicode.
Universality
Unicode is designed to be universal. The repertoire must be large enough to encompass all characters that are likely to be used in general text interchange. Unicode needs to encompass a variety of essentially different collections of characters and writing systems. For example, it cannot postulate that all text is written left to right, or that all letters have uppercase and lowercase forms, or that text can be divided into words separated by spaces or other whitespace.
Efficiency
Software does not have to maintain state or look for special escape sequences, and character synchronization from any point in a character stream is quick and unambiguous. A fixed character code allows for efficient sorting, searching, display, and editing of text. However, Unicode's efficiency involves certain trade-offs, notably the storage requirement of up to four octets per character, and certain representation forms such as UTF-8 require linear processing of the data stream in order to identify characters. Unicode also contains a large number of characters and features that have been included only for compatibility with other standards. These may require preprocessing that deals with compatibility characters and with different Unicode representations of the same character (e.g., the letter é as a single character or as two characters).
Characters, not glyphs
Unicode assigns code points to characters as abstractions, not to visual appearances. A character in Unicode represents an abstract concept rather than its manifestation as a particular form or glyph. As shown in Figure 2.2, the glyphs of many fonts that render the Latin character "a" all correspond to the same abstract character.
Figure 2.2: Abstract Latin Letter “a” and Style Variants
Another example is the Arabic presentation form. An Arabic character may be written in up to four different shapes. Figure 2.3 shows an Arabic character written in its isolated form, and at the beginning, in the middle, and at the end of a word. According to the design principle of encoding abstract characters, these presentation variants are all represented by one Unicode character.
Figure 2.3: Arabic character with four representations
The relationship between characters and glyphs is rather simple for languages like English: mostly each character is presented by one glyph, taken from a font that has been chosen. For other languages, the relationship can be much more complex routinely combining several characters into one glyph.
Semantics
Characters have well-defined meanings. When the Unicode standard refers to semantics, it often means the properties of characters, such as spacing, combinability, and directionality, rather than what the character really means.
Plain text
Unicode deals with plain text, i.e., strings of characters without formatting or structuring information (except for things like line breaks).
Logical order
The default representation of Unicode data uses logical order of data, as opposed to approaches that handle writing direction by changing the order of characters.
Unification
The principle of uniqueness was also applied to decide that certain characters should not be encoded separately. Unicode encodes duplicates of a character as a single code point, if they belong to the same script but different languages. For example, the letter ü denoting a particular vowel in German is treated as the same as the letter ü in Spanish.
The Unicode standard uses Han unification to consolidate Chinese, Korean, and Japanese ideographs. Han unification is the process of assigning the same code point to characters historically perceived as being the same character but represented as unique in more than one East Asian ideographic character standard. This results in a group of ideographs shared by several cultures and significantly reduces the number of code points needed to encode them. The Unicode Consortium chose to represent shared ideographs only once because the goal of the Unicode standard was to encode characters independent of the languages that use them. Unicode makes no distinctions based on pronunciation or meaning; higher-level operating systems and applications must take that responsibility. Through Han unification, Unicode assigned about 21,000 code points to ideographic characters instead of the 120,000 that would be required if the Asian languages were treated separately. It is true that the same character might look slightly different in Chinese than in Japanese, but that difference in appearance is a font issue, not a "uniqueness" issue.
Figure 2.4: Han Unification example
The Unicode standard allows for character composition in creating marked characters. It encodes each character and diacritic or vowel mark separately, and allows the characters to be combined to create a marked character. It provides single codes for marked characters when necessary to comply with preexisting character standard.
Dynamic composition
Characters with diacritic marks can be composed dynamically, using characters designated as combining marks.
Equivalent sequences
Unicode has a large number of characters that are precomposed forms, such as é. They have decompositions that are declared as equivalent to the precomposed form. An application may still treat the precomposed form and the decomposition differently, since as strings of encoded characters, they are distinct.
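This distinction is visible with Unicode normalization: the precomposed and decomposed forms compare unequal as code-point strings, yet normalizing maps each equivalent sequence onto the other.

```python
import unicodedata

precomposed = "\u00e9"      # é as one code point
decomposed = "e\u0301"      # "e" followed by COMBINING ACUTE ACCENT

print(precomposed == decomposed)                                  # distinct as strings
print(unicodedata.normalize("NFC", decomposed) == precomposed)    # composed form
print(unicodedata.normalize("NFD", precomposed) == decomposed)    # decomposed form
```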
Convertibility
Character data can be accurately converted between Unicode and other character standards and specifications.
South Asian Scripts
The scripts of South Asia share so many common features that a side-by-side comparison of a few will often reveal structural similarities even in the modern letterforms. With minor historical exceptions, they are written from left to right. They are all abugidas in which most symbols stand for a consonant plus an inherent vowel (usually the sound /a/). Word-initial vowels in many of these scripts have distinct symbols, and word-internal vowels are usually written by juxtaposing a vowel sign in the vicinity of the affected consonant. Absence of the inherent vowel, when that occurs, is frequently marked with a special sign [17].
Another designation is preferred in some languages. As an example in Hindi, the word hal refers to the character itself, and halant refers to the consonant that has its inherent vowel suppressed. The virama sign nominally serves to suppress the inherent vowel of the consonant to which it is applied; it is a combining character, with its shape varying from script to script.
Most of the scripts of South Asia, from north of the Himalayas to Sri Lanka in the south, from Pakistan in the west to the easternmost islands of Indonesia, are derived from the ancient Brahmi script. The oldest lengthy inscriptions of India, the edicts of Ashoka from the third century BCE, were written in two scripts, Kharoshthi and Brahmi. These are both ultimately of Semitic origin, probably deriving from Aramaic, which was an important administrative language of the Middle East at that time. Kharoshthi, written from right to left, was supplanted by Brahmi and its derivatives. The descendants of Brahmi spread with myriad changes throughout the subcontinent and outlying islands. There are said to be some 200 different scripts deriving from it. By the eleventh century, the modern script known as Devanagari was in ascendancy in India proper as the major script of Sanskrit literature.
The North Indian branch of scripts was, like Brahmi itself, chiefly used to write Indo-European languages such as Pali and Sanskrit, and eventually the Hindi, Bengali, and Gujarati languages, though it was also the source for scripts for non-Indo-European languages such as Tibetan, Mongolian, and Lepcha.
The South Indian scripts are also derived from Brahmi and, therefore, share many structural characteristics. These scripts were first used to write Pali and Sanskrit but were later adapted for use in writing non-Indo-European languages including Dravidian family of southern India and Sri Lanka.
Sinhala Language
Characteristics of Sinhala
The Sinhala script, also known as Sinhalese, is used to write the Sinhala language, the majority language of Sri Lanka. It is also used to write the Pali and Sanskrit languages. The script is a descendant of Brahmi and resembles the scripts of South India in form and structure. Sinhala differs from other languages of the region in that it has a series of prenasalized stops that are distinguished from the combination of a nasal followed by a stop. In other words, both forms occur and are written differently [23].
Figure 2.5: Example for prenasalized stop in Sinhala
In addition, Sinhala has separate distinct signs for both a short and a long low front vowel sounding similar to the initial vowel of the English word “apple,” usually represented in IPA as U+00E6 æ latin small letter ae (ash). The independent forms of these vowels are encoded at U+0D87 and U+0D88.
Because of these extra letters, the encoding for Sinhala does not precisely follow the pattern established for the other Indic scripts (for example, Devanagari). It does use the same general structure, making use of phonetic order, matra reordering, and use of the virama (U+0DCA sinhala sign al-lakuna) to indicate conjunct consonant clusters. Sinhala does not use half-forms in the Devanagari manner, but does use many ligatures.
Sinhala Writing System
The Sinhala writing system can be called an abugida, as each consonant has an inherent vowel (/a/), which can be changed with the different vowel signs. Thus, for example, the basic form of the letter k is ක "ka". For "ki", a small arch is placed over the ක: කි. This replaces the inherent /a/ by /i/. It is also possible to have no vowel following a consonant. In order to produce such a pure consonant, a special marker, the hal kirīma, has to be added: ක්. This marker suppresses the inherent vowel.
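At the code-point level, these compositions are simply sequences of a base consonant followed by a combining sign:

```python
# Sinhala consonant-plus-sign composition in terms of Unicode code points.
KA = "\u0d9a"          # SINHALA LETTER ALPAPRAANA KAYANNA ("ka")
VOWEL_I = "\u0dd2"     # SINHALA VOWEL SIGN KETTI IS-PILLA
AL_LAKUNA = "\u0dca"   # SINHALA SIGN AL-LAKUNA (the hal kirima)

ki = KA + VOWEL_I      # "ki": the inherent /a/ replaced by /i/
k = KA + AL_LAKUNA     # pure consonant "k": inherent vowel suppressed
print(ki, k)
```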
Figure 2.6: Character associative Symbols in Sinhala
Historical Symbols. Neither U+0DF4 sinhala punctuation kunddaliya nor the Sinhala numerals are in general use today, having been replaced by Western-style punctuation and Western digits. The kunddaliya was formerly used as a full stop or period. It is included for scholarly use. The Sinhala numerals are not presently encoded.
Sinhala and Unicode
In 1997, Sri Lanka submitted a proposal for the Sinhala character code at the Unicode working group meeting in Crete, Greece. This proposal competed with proposals from UK, Ireland and the USA. The Sri Lankan draft was finally accepted with slight modifications. This was ratified at the 1998 meeting of the working group held at Seattle, USA and the Sinhala Code Chart was included in Unicode Version 3.0 [2].
It has been suggested by the Unicode consortium that ZWJ and ZWNJ should be introduced in Orthographic languages like Sinhala to achieve the following:
1. ZWJ joins two or more consonants to form a single unit (conjunct consonants).
2. ZWJ can also alter shape of preceding consonants (cursiveness of the consonant).
3. ZWNJ can be used to disjoin a single ligature into two or more units.
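A short sketch of how ZWJ and ZWNJ are placed in a Sinhala code-point sequence (the rendered result depends on font support):

```python
# ZWJ requests a joined (conjunct/ligated) rendering, ZWNJ an explicit split;
# both sit between consonant + al-lakuna and the following consonant.
ZWJ, ZWNJ = "\u200d", "\u200c"
KA, AL_LAKUNA, VA = "\u0d9a", "\u0dca", "\u0dc0"   # ka, hal sign, va

conjunct = KA + AL_LAKUNA + ZWJ + VA    # rendered as a single unit where supported
disjoint = KA + AL_LAKUNA + ZWNJ + VA   # the same letters kept as separate units
```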
Encoding auto Detection
Browser and auto-detection
In designing auto detection algorithms to auto detect encodings in web pages it needs to depend on the following assumptions on input data [24].
Input text is composed of words/sentences readable to readers of a particular language.
Input text is from typical web pages on the Internet which is not an ancient dead language.
The input text may contain extraneous noises which have no relation to its encoding, e.g. HTML tags, non-native words (e.g. English words in Chinese documents), space and other format/control characters.
Methods of auto detection
The paper [24] discusses three different methods for detecting the encoding of text data.
Coding Scheme Method
In any of the multi-byte coding schemes, not all possible code points are used. If an illegal byte or byte sequence (i.e. an unused code point) is encountered when verifying a certain encoding, it is possible to immediately conclude that this is not the right guess. An efficient algorithm for detecting the character set using the coding scheme method, through a parallel state machine, is discussed in [24].
For each coding scheme, a state machine is implemented to verify a byte sequence for this particular encoding. For each byte the detector receives, it will feed that byte to every active state machine available, one byte at a time. The state machine changes its state based on its previous state and the byte it receives. In a typical example, one state machine will eventually provide a positive answer and all others will provide a negative answer.
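The state-machine idea can be sketched for a single encoding. The verifier below accepts or rejects a byte stream as UTF-8; it is deliberately simplified (it omits the overlong-form and surrogate-range checks of a full validator), but it shows how a single illegal byte lets the detector drop a candidate encoding immediately.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Simplified UTF-8 state machine: tracks how many continuation
    bytes are still expected after a lead byte."""
    remaining = 0
    for b in data:
        if remaining:                      # inside a multi-octet sequence
            if 0x80 <= b <= 0xBF:
                remaining -= 1
            else:
                return False               # illegal continuation byte
        elif b <= 0x7F:
            continue                       # single-octet (ASCII) character
        elif 0xC2 <= b <= 0xDF:
            remaining = 1                  # lead byte of a two-octet sequence
        elif 0xE0 <= b <= 0xEF:
            remaining = 2                  # three-octet sequence
        elif 0xF0 <= b <= 0xF4:
            remaining = 3                  # four-octet sequence
        else:
            return False                   # byte can never start a sequence
    return remaining == 0                  # reject a truncated final sequence
```

A parallel detector would run one such machine per candidate encoding and discard each machine as soon as it reports an illegal sequence.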
Character Distribution Method
In any given language, some characters are used more often than others. This fact can be used to devise a data model for each language script. It is particularly useful for languages with a large number of characters, such as Chinese, Japanese and Korean. The tests were carried out with data for simplified Chinese encoded in GB2312, traditional Chinese encoded in Big5, Japanese and Korean. It was observed that a rather small set of code points covers a significant percentage of the characters used.
A parameter called the distribution ratio was defined and used for the purpose of separating the two encodings.
Distribution Ratio = the Number of occurrences of the 512 most frequently used characters divided by the Number of occurrences of the rest of the characters.
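Under that definition, the ratio can be computed from simple character counts; a minimal sketch:

```python
from collections import Counter

def distribution_ratio(text, head_size=512):
    """Occurrences of the `head_size` most frequent characters divided by
    occurrences of all remaining characters, per the definition above."""
    ranked = [count for _, count in Counter(text).most_common()]
    head = sum(ranked[:head_size])
    tail = sum(ranked[head_size:])
    return head / tail if tail else float("inf")
```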
Two-Char Sequence Distribution Method
In languages that use only a small number of characters, we need to go further than counting the occurrences of each single character: combinations of characters reveal more language-characteristic information. A 2-char sequence is defined as two characters appearing immediately one after another in the input text, and the order is significant. Just as not all characters are used equally frequently in a language, the 2-char sequence distribution also turns out to be extremely language/encoding dependent.
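Counting 2-char sequences amounts to sliding a window of width two over the text:

```python
from collections import Counter

def two_char_sequences(text):
    """Overlapping 2-char sequences of the input; order is significant."""
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

print(two_char_sequences("abab"))
```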
Current Approaches to Solve Encoding Problems
Siyabas Script
SiyabasScript is an attempt to develop a browser plug-in which solves the problem of legacy fonts in Sinhala news sites [6]. It is an extension to the Mozilla Firefox and Google Chrome web browsers. The solution was designed for a limited number of target web sites using specific fonts, and has the limitation of requiring the plug-in to be re-engineered whenever a new version of the browser is released. It is also not general, since it does not have the ability to support a new site that uses a Sinhala legacy font. To overcome this, the proposed solution will identify fonts and encodings based on the content rather than the site. There is a chance that SiyabasScript will stop working if a site decides to adopt another legacy font, as it cannot detect encoding scheme changes. There is also a significant delay in the conversion process: the user notices the content displayed in legacy-font characters before it is converted to Unicode, a performance issue that can be identified as another area to improve. Finally, the conversion process is not always exact, especially when characters need to be combined in Unicode; several Sinhala words can be cited as examples of such conversion issues.
The plug-in supports Sinhala Unicode conversion for the sites www.lankadeepa.lk, www.lankaenews.com and www.lankascreen.com, but the other websites mentioned in the paper do not get properly converted to Sinhala with Firefox version 3.5.17.
Aksharamukha Asian Script Converter
Aksharamukha is a South and South-East Asian script converter tool. It supports transliteration between Brahmi-derived Asian scripts and also has the functionality to transliterate web pages from Indic scripts to other scripts. The converter scrapes the HTML page, transliterates the Indic scripts and displays the HTML. The tool has certain issues with alignment with the original web page: misalignments, missing images and unconverted hyperlinks are some of them.
Figure 2.7: Aksharamukha Asian Script Converter
Corpus-based Sinhala Lexicon
The lexicon of a language is its vocabulary, including higher-order constructs such as words and expressions. It can be used as a supporting tool in detecting the encoding of a given text. The corpus-based Sinhala lexicon has nearly 35,000 entries based on a corpus consisting of 10 million words from diverse genres such as technical writing, creative writing and news reportage [7], [9]. The text distribution across genres is given in Table 2.1.
Table 2.1: Distribution of Words across Genres [7]
Genre                Number of words   Percentage of words
Creative Writing     2,340,999         23%
Technical Writing    4,357,680         43%
News Reportage       3,433,772         34%
N-gram-based language, script, and encoding scheme-detection
An N-gram is a sequence of N characters; N-grams are a well-established technique for classifying the language of text documents. The method detects the language, script, and encoding scheme of a target text document by checking how many byte sequences of the target match the byte sequences that can appear in texts belonging to a given language, script, and encoding scheme. N-grams are extracted from a string, or a document, by a sliding window that shifts one character at a time.
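A minimal sketch of this classification: build a profile of the most common byte n-grams for each known language/encoding from training text, then score a target document by how many of its n-grams appear in each profile.

```python
from collections import Counter

def ngram_profile(data: bytes, n=2, top=300):
    """The `top` most common byte n-grams, extracted by a sliding window
    that shifts one position at a time."""
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    return {gram for gram, _ in counts.most_common(top)}

def score(target: bytes, profile, n=2):
    """Fraction of the target's n-grams that also occur in the profile."""
    grams = [target[i:i + n] for i in range(len(target) - n + 1)]
    return sum(g in profile for g in grams) / len(grams) if grams else 0.0
```

In use, one profile would be trained per (language, encoding) pair, and the pair whose profile scores highest on the target bytes would be reported as the detected encoding.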
Sinhala Enabled Mobile Browser for J2ME Phones
Mobile phone usage is rapidly increasing throughout the world as well as in Sri Lanka; the phone has become the most ubiquitous communication device. Accessing the internet through the mobile phone has become a common activity, especially for messaging and news. On J2ME-enabled phones, Sinhala Unicode support is yet to be developed: they do not allow the installation of external fonts, so those devices cannot display Unicode content, especially on the web, until Unicode is supported by the platform. Integrating Unicode viewing support provides a good opportunity to carry the technology to remote areas by presenting it in the native language. If this is facilitated, in addition to the urban crowd, people from rural areas will be able to subscribe to a daily newspaper on their mobile. One major advantage of such an application is that it provides a phone-model-independent solution which supports any Java-enabled phone.
Cillion is a mini-browser which displays Unicode content on J2ME phones. It is an application developed with integrated fonts which can identify web content in Sinhala Unicode and map it to a displayable format, so that the user can still view Unicode sites although the phone has no native support for Unicode.
Transliteration Tools
Transliteration is a process in which words in one alphabet are represented in another alphabet. A number of rules govern transliteration between different alphabets, designed to ensure that it is uniform, allowing readers to clearly understand transliterations. Transliteration is not quite the same thing as transcription, although the two are very similar. Many cultures around the world use different scripts to represent their languages. By transliterating, people can make their languages more accessible to people who do not understand their scripts. There are a few tools which support this.
Google Transliterate Tool
Google Transliteration IME is an input method editor which allows users to enter text in one of the supported languages using a Roman keyboard. Users can type a word the way it sounds using Latin characters and Google Transliteration IME will convert the word to its native script. Note that this is not the same as translation. It is the sound of the words that is converted from one alphabet to the other, not their meaning. Converted content will always be in Unicode [15].
Figure 2.8: Google Transliteration
This tool shows very high accuracy and also supports suggestions. In addition to Sinhala, it supports several other scripts such as Amharic, Arabic, Bengali, Chinese, Greek, Hebrew, Hindi, Malayalam, Marathi, Nepali, Oriya, Punjabi, Sanskrit, Serbian, Tamil, Telugu and Urdu.
Unicode real-time font conversion Utility
This tool offers the same functionality as Google Transliteration: it converts text written in English, based on the pronunciation of the words, into Sinhala. The utility refers to this scheme as the Singlish scheme. It has been developed by the Language Technology Research Laboratory of the University of Colombo School of Computing. However, this tool does not support other languages, nor does it offer advanced features such as word suggestions. It is again an online utility, giving the user the flexibility to access it from anywhere with connectivity [16].
Figure 2.9: Unicode Real-time Conversion Utility
Firefox Sinhala Tamil Type Plug-in
The Firefox Sinhala Tamil Type plug-in converts content written in English to Sinhala and Tamil based on pronunciation.
Figure 2.10: Firefox Sinhala Tamil Type Plug-in
Methodology
Unicode conversion Browser Plug-in
Browser plug-ins add extra capabilities to a web browser. Unicode conversion support for Sinhala web sites is the extra capability that this plug-in provides.
Browser plug-in Architecture
Figure 3.1 shows the high-level architecture of the browser plug-in to be implemented for automatic font conversion.
Figure 3.1: HTML Unicode Conversion Engine (diagram showing the plug-in pipeline: browser HTML rendering; detection of legacy encoding; normalize data, removing invalid characters; font conversion engine, applying the conversion rules and linguistic rules; replacing converted contents; rearranging HTML contents, adjusting hyperlinks and font sizes; show converted HTML page to user)
Legacy Encoding Detection
In this phase, the input is the HTML text. The first decision of this subsystem is whether the page contains any fonts with legacy encodings. In order for the browser to identify the font, the following mechanisms are used in practice.
Using the face attribute of the font tag: <font face="Name of the Font">. If the specified font is not found on the system, it defaults to Times New Roman.
Specifying it in CSS (Cascading Style Sheets): TD { font-family: Arial; }
Doing this makes all the text inside TD tags use the specified font family. Detecting this font specification is harder than the previous case, where the font is specified in a separate HTML tag.
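As an illustration, both detection mechanisms can be sketched with a small scanner. The font names (FMAbhaya, DLManel) and the encoding labels are only placeholders for whatever legacy fonts the plug-in is configured to recognize:

```python
import re

# Known legacy Sinhala fonts and the encoding each implies (illustrative
# entries; the real list would come from the plug-in's configuration).
LEGACY_FONTS = {"fmabhaya": "fm-legacy", "dlmanel": "dl-legacy"}

def detect_legacy_fonts(html: str) -> set:
    """Collect legacy encodings implied by <font face="..."> attributes
    and CSS font-family declarations found in the page source."""
    found = set()
    # Case 1: <font face="FMAbhaya"> style markup.
    for name in re.findall(r'<font[^>]*\bface\s*=\s*["\']?([^"\'>]+)', html, re.I):
        enc = LEGACY_FONTS.get(name.strip().lower())
        if enc:
            found.add(enc)
    # Case 2: CSS rules such as td { font-family: FMAbhaya; }
    for fam in re.findall(r'font-family\s*:\s*([^;}"\']+)', html, re.I):
        for name in fam.split(","):
            enc = LEGACY_FONTS.get(name.strip().lower())
            if enc:
                found.add(enc)
    return found

page = '<td style="font-family: FMAbhaya">...</td><font face="DLManel">x</font>'
print(detect_legacy_fonts(page))  # both legacy encodings are reported
```

A production rewriter would parse the HTML and CSS properly instead of using regular expressions, but the lookup-table idea is the same.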
If such legacy fonts are detected, the process moves to the next step. Once the font name or font family is identified, it is possible to determine the encoding used to represent the characters. If the legacy encoding cannot be detected with the above techniques, encoding auto-detection algorithms are executed. These algorithms use linguistic features to predict the encoding. The outputs of previous statistical research on language constructs, such as corpus frequencies of the most common bigrams and trigrams, serve as clues for deciding the encoding. If the system cannot identify the encoding exactly, the user is shown a part of the text rendered in the different candidate encodings and is prompted to select the matching one. Once the user selects the proper encoding, the system's language rules are updated, so that on the next view the user's decision is taken into account when detecting the encoding.
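The statistical fallback might be sketched as follows; the bigram tables here are invented stand-ins for the corpus-derived frequency data the text refers to:

```python
from collections import Counter

# Hypothetical frequency tables: for each candidate legacy encoding, the
# byte bigrams that a corpus study found most frequent in that encoding.
FREQUENT_BIGRAMS = {
    "encoding-a": {b"ab", b"bc", b"cd"},
    "encoding-b": {b"xy", b"yz", b"zx"},
}

def guess_encoding(data: bytes):
    """Return the candidate encoding whose frequent-bigram table best
    matches the bigrams observed in the text, or None when no candidate
    scores above zero (the user would then be prompted to choose)."""
    bigrams = Counter(data[i:i + 2] for i in range(len(data) - 1))
    best, best_score = None, 0
    for enc, table in FREQUENT_BIGRAMS.items():
        score = sum(cnt for bg, cnt in bigrams.items() if bg in table)
        if score > best_score:
            best, best_score = enc, score
    return best

print(guess_encoding(b"abcdabab"))  # → encoding-a
```

A real detector would weight each bigram by its corpus probability rather than counting set membership, but the scoring structure is the same.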
Conversion content Identification
The next step is to separate the actual display text from the HTML elements. This requires a rule-based algorithm that, given text with HTML elements, emits the content part.
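A minimal sketch of this separation step, using Python's built-in HTML parser rather than a hand-written rule engine:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Separate displayable text from HTML markup, skipping elements
    such as <script> and <style> whose content is never rendered."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Only text outside skipped elements is display content.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> list:
    parser = TextExtractor()
    parser.feed(html)
    return parser.chunks

print(extract_text("<p>Hello</p><script>var x=1;</script><b>world</b>"))
# → ['Hello', 'world']
```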
Normalizing and Error checking
Normalization of the content and error checking is an important step prior to the conversion process. Separating the Unicode conversion logic from the normalization step yields an extendible solution. Certain typing errors in the content may not be visible while it is encoded in a legacy font, but may become visible once it is converted to Unicode. Hence, a normalization phase needs to be carried out before the conversion rules are applied. In the normalization process, the language constructs are validated using the following NFA.
The NFA is a 5-tuple (Q, ∑, q0, A, δ) where
Q = { A, B, C, D, E, F, G, H, I, J, K, L }
∑ = { S, S0 }
q0 = { A } ; q0 ∈ Q
A = { D, E, F, G, H, I, J, K, L } ; A ⊆ Q
δ = the transition function, mapping Q × ∑ to 2^Q
The Sinhala characters and associative symbols are categorized into a finite number of sets, where
S = { x | x is any Sinhala character }
S0 = { x1 | x1 is any character associative symbol }
S3, S4, S5, S6, S7, S8 are subsets of S0
Figure 3.2: Proposed NFA for Error Checking
The character construction is governed by the state-transition sequence and the input symbols. The states E, F, G, J, K and L are considered final accepting states. As soon as the system reaches one of these states, the state machine is re-initialized and the next input is treated as a new input to the machine. A valid character is constructed only at an accepting state.
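The validation step can be sketched as a small state machine. The transition table below is an illustrative two-transition fragment, not the full transition function of the NFA in Figure 3.2:

```python
# 'S' stands for any base Sinhala character, 'S0' for any associative
# symbol, following the set names used in the NFA definition above.
START = "A"
ACCEPTING = {"E"}
TRANSITIONS = {
    ("A", "S"): "E",    # a base character alone forms a valid character
    ("E", "S0"): "E",   # associative symbols may follow a base character
}

def validate(symbols) -> bool:
    """Return True if the symbol sequence drives the machine to an
    accepting state. (Simplified: the full system re-initializes the
    machine after each accepted character.)"""
    state = START
    for sym in symbols:
        nxt = TRANSITIONS.get((state, sym))
        if nxt is None:
            return False          # invalid construct, flagged for correction
        state = nxt
    return state in ACCEPTING

print(validate(["S", "S0", "S0"]))  # base char + two modifiers: True
print(validate(["S0"]))             # modifier with no base char: False
```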
Unicode Conversion Engine
Once the legacy font is automatically detected and the data is normalized, the conversion rules that convert the legacy-encoded content to Unicode content are applied. The encoding conversion can be performed through mapping tables, which hold the mapping between the legacy encoding and Unicode. State machines are used to implement the language rules for combining the various symbols of a given script; they interpret certain user errors and ensure that each character conforms to the language rules.
The font conversion takes place in a separate component which acts as the conversion engine. The conversion engine needs to support fast conversion to ensure that the application delivers acceptable performance to the user.
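A minimal sketch of the mapping-table idea. The two entries are invented: each maps a code point of a hypothetical legacy font to the Unicode character whose glyph it displays:

```python
# In a legacy font, the glyph shown for a given byte is a Sinhala letter
# even though the underlying code point is a Latin one; the table maps
# each such legacy code point to the proper Unicode character.
LEGACY_TO_UNICODE = {
    0x61: "\u0d85",   # glyph at 'a' -> SINHALA LETTER AYANNA
    0x62: "\u0d86",   # glyph at 'b' -> SINHALA LETTER AAYANNA
}

def convert(legacy_text: str) -> str:
    """Replace each legacy-encoded character via the mapping table,
    passing through characters with no mapping (digits, punctuation)."""
    return "".join(LEGACY_TO_UNICODE.get(ord(ch), ch) for ch in legacy_text)

print(convert("ab!"))  # → 'අආ!'
```

A real table covers the font's full glyph repertoire, and the state machine described above runs over the output to reorder and validate combining symbols.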
Converted content to HTML
In web pages there are other elements which need to be transformed in addition to the content itself. One example is hyperlinks, specified through the href attribute of the <a> tag: when the user clicks a hyperlink, the next loaded page (or part of a page) also needs to go through the conversion process so that Unicode text is displayed consistently. Images, by contrast, are not rendered through a font and hence require no conversion. After converting the font faces, it may also be necessary to adjust font sizes, since letter sizes can differ between fonts. Once the conversion is finished, the content is viewable in the browser even without the legacy font, because Unicode content is supported by the browser and the operating system.
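The face-attribute rewrite mentioned above might look like this; Iskoola Pota is used as an example Unicode Sinhala font, and the regular expression is a simplified stand-in for a real HTML rewriter:

```python
import re

def replace_face(html: str, legacy_font: str, unicode_font: str) -> str:
    """Swap a legacy font name for a Unicode-capable font wherever it
    appears in a face attribute, so the browser renders the converted
    text directly."""
    pattern = re.compile(r'(face\s*=\s*["\']?)' + re.escape(legacy_font), re.I)
    return pattern.sub(lambda m: m.group(1) + unicode_font, html)

page = '<font face="FMAbhaya">text</font>'
print(replace_face(page, "FMAbhaya", "Iskoola Pota"))
# → '<font face="Iskoola Pota">text</font>'
```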
Unicode Conversion tool for PDF Documents
PDF documents encoded in legacy fonts are not searchable through search engines when the search term is entered in Unicode. Unicode has become the universal encoding scheme supporting internationalization, and hence there is a requirement to convert PDFs with legacy fonts into Unicode PDFs. This conversion differs from internet content conversion in that it is a one-time process. It can be designed as a batch process in which PDF documents are converted and the Unicode encoding is attached to them without manual intervention.
PDF Unicode Converter Architecture
The following diagram shows the high-level architecture of the PDF Unicode converter to be implemented for offline PDF Unicode conversion.
Figure 3.3: PDF File Conversion (diagram showing the converter pipeline: PDF reader; font detail extraction; text extraction; normalizing; font conversion engine; Unicode converted text; rearranging the content; write converted PDF)
Legacy Encoding Detection
In PDF documents the encoding scheme and the font are embedded within the document, and there is no straightforward way of extracting the font information from the document contents. The techniques used by PDF readers such as Adobe Acrobat Reader to display the document are useful in designing the encoding-detection mechanism. If the encoding cannot be detected through such a process, statistical language detection based on linguistic features is used.
PDF Conversion content Identification
In PDF documents, the convertible content needs to be extracted from the data format used to create the document; open-source library products can be helpful for this purpose [23]. The normalization and error-checking module is based on the same algorithms described in section 3.1.4, and the Unicode conversion logic is the same as in section 3.1.5, since the requirement is again to translate between a legacy encoding and Unicode.
Converted content to PDF
As the last step of the process, the final PDF with Unicode encoding is generated. Non-text content (e.g. images) needs to be properly aligned in accordance with the original PDF document. The open-source library iText supports generating documents in PDF format.
System optimisations
Caching and Concurrency
The delay between the original display of the web page and the Unicode conversion is a significant factor in the user experience. To achieve high performance, the processing is performed concurrently so that the user does not experience a considerable delay, and caching mechanisms are used to minimize disk-access delays. Since the system is not accessed by multiple users at the same time, cache-synchronization issues do not arise.
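The caching idea can be sketched with Python's built-in memoisation; the uppercase conversion is a stand-in for the real conversion engine:

```python
from functools import lru_cache

# Conversion results for repeated text runs (menus, headers that recur
# across pages) are memoised so each distinct run is converted only once.
# Single-user access means no invalidation or synchronisation is needed.
CALLS = []

@lru_cache(maxsize=4096)
def convert_cached(text: str) -> str:
    CALLS.append(text)            # records real conversions, for illustration
    return text.upper()           # stand-in for the real conversion engine

convert_cached("menu"); convert_cached("menu"); convert_cached("body")
print(len(CALLS))  # → 2: the repeated run hit the cache
```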
Extensibility to other languages
The system architecture supports extension to other South Asian scripts. A user interface is provided to feed the system with new encodings and the mapping scheme between Unicode and each legacy encoding.
Conclusion
In conclusion, the problem of converting a given non-Unicode encoding to the matching Unicode encoding will be resolved through two implementations. Existing tools support online web-content transformation as well as PDF document transformation, but those solutions are designed to translate between languages. In this project, the major research focus is on solving the encoding-detection problem. The techniques used for language classification and encoding detection will be composed into an extensible rule-based algorithm to achieve the goals of the project. In summary, this will eliminate the need for computer users to depend on proprietary encoding schemes. The initial implementation will be based on Sinhala non-Unicode web content and non-Unicode encoded Sinhala PDF documents.