Algorithm For Segmentation Of Urdu Script English Language Essay

Segmentation of script plays a vital role in script recognition. It is vital to understand the script that is used in writing a document before developing or using a model to recognize it. Chain codes etc. In ligature model, word model is used at document, page and word level for segmentation. Our algorithm for segmentation of Urdu script used character model and Hidden Markov Model (HMM) to enhance work done previously. We have extracted features from images and calculated the maximum likelihood to match characters in inference algorithm with a feature extracted from a text sample. The main features used in the system will be pre-processing, connected component analysis, recognition and segmentation of text up to character level. The algorithm will provide a means to implement an Urdu OCR system on the basis of the character model.

Keywords – Preprocessing, Segmentation of characters, character model, Optical character recognition (OCR), max and argmax.

Introduction

We use an OCR system / scanner to get images of text [1]. Into preprocessing image will be converted to noiseless B/W image.

1.1 Segmentation

Segmentation is dividing an image into smaller segments or pieces [2]. Segmentation occurs on two levels. At first level both text and graphics are separated for further processing. At second level, segmentation is performed on text to separate paragraphs, words, and characters etc. Segmentation of text can be performed on a document, page, paragraph and character levels [3]. They suggested various segmentation approaches namely [4].

Holistic Method

Segmentation based approach

Segmentation free approach

In holistic method whole word is classified using a dictionary, the features of test input are matched against trained prototypes [5]. The limitation is that the method is not good for larger classes and it can only be used with the other two methods. Segmentation divides a word into smaller segments. The image of the word is broken up into several entities called graphemes [4]. Segmentation depends on human intuition. In segmentation free approach character model can be used to concatenate characters and form words. For instance segmentation free approach can be based on Hidden Markov Model (HMM) that is a stochastic model.

1.2. Urdu Language and Text Segmentation

Urdu is a cursive (written with the characters joined) writing language. Urdu language characters are similar in shape and have curves that make it difficult to recognize by a machine. Moreover it has more than one symbol to represent a character. Due to its cursive nature characters / scripts in Urdu language are hard to recognize by a computer program. A very accurate technique is needed to recognize / understand Urdu characters. Urdu characters have four elementary shapes

Basic Symbols (38 Symbols)

Table 1 shows the basic symbols / shapes for Urdu Language.

Beginning Symbols (26 Symbols)

Table 2 shows the basic symbols / shapes for Urdu Language.

Mid Symbols (40 Symbols)

Table 3 shows the basic symbols / shapes for Urdu Language.

Other Symbols

This includes symbols for numbers, special symbols like zabar, zair, paish etc.

The symbol tables, Table 1, Table 2, Table3 and Table 4, for Urdu language are given below as:

Table1. Basic Symbols

Table 2. Beginning Symbols

Table 3. Mid Symbols

Table 4. Other Symbols

We used Urdu script Nastaliq for our work. We extracted images for Urdu character set like basic, beginning, mid and other symbols using available Nastaliq font.

Literature Review

In a structural approach to script identification, stroke geometry has been utilized for script characterization and identification [6]. Individual character images in a document are classified either by applying a prototype classification or by using support vector machine. Ligatures are used for segmentation / recognition of Urdu characters. The ligature is a sequence of characters in a word separated by non-joiner characters like space.

Their approach in [1] used ligature model and it is divided into two stages:

Line Segmentation

Line segmentation deals with the detection of text lines in the image. The image is scanned horizontally from right to left direction, upwards to downwards, in search of a text pixel. Afterwards, it is determined whether this pixel belongs to a primary ligature or a secondary ligature as shown in Fig 1. The freeman chain codes (FCC) of the ligature are compared with already calculated FCC of the secondary ligatures.

Character Segmentation

The text is skeletonized and a label matrix is constructed which contains the identifiers of all ligatures in the image. The position of individual characters in a word is determined. Segmentation is done using primary ligatures only.

Fig 1. (a) Urdu word (b) Seven ligatures (c) Three Primary ligatures

(d) Four Secondary ligatures [7].

Limitations of the method are: firstly, they performed segmentation on the basis of primary ligatures only, therefore, it will not differentiate between seen and sheen because it will ignore secondary ligatures i.e. dots. Secondly, dictionary of images stored for training will be huge. Thirdly, there are problems of over segmentation and under segmentation. In [8], they have proposed a ligature and word model for Urdu word segmentation. It was done in three phases:

In 1st phase, data is collected. They identified Ligatures and calculated word probabilities using probabilistic measure. From the input set of ligatures, all sequences of words are generated and ranked using the lexicon lookup.

In the 2nd phase, top k sequences are selected using a selected beam value for further processing. It uses valid words heuristic for selection process.

In the third phase, maximum probable sequence from these k word sequences is selected. Their method used dictionary of ligatures/words, chain codes, and to find best probable sequences they used HMM toolkit HTK to recognize a word / ligature. They have recommended that their work can be further improved by using the character model for Urdu text segmentation [9].

A poor segmentation will lead to poor recognition [10]. They divided image into smaller blocks, check for uniformity, group uniform block using color similarity and identify text in this block [11]. They used edge density based noise detection to segment out text areas in video/ images [12]. Segmentation of an image into text and non-text regions effect performance in OCR development [13]. They proposed line segmentation method using histogram equalization, indicated various problems and text line into ligature using chain codes [14]. They presented bounding box based approach for segmentation of table of contents in Urdu script [15]. They analyzed horizontal and vertical projection profiles for line and character segmentation. Misclassification occurs at character level [16]. They proposed text line extraction using vertical projection, marking all points where pixel values are not found and text line into ligatures using stroke geometry [17]. They proposed identification of partial words (i.e. connected components) in text line and using horizontal / vertical projections to identify words using relative distance matching [18]. They used dictionary for text line and ligature segmentation in online text [19].

Problem Statement

Previous work has limitations that it cannot correctly perform segmentation in few cases and there will be misclassification problems. Moreover it can recognize a limited set of connected components or ligatures only.

Proposed Segmentation Algorithm

We will enhance previous work by proposing an improved algorithm for Urdu script segmentation that will use a character model. For this purpose we have created a set of characters. There are about 114 characters excluding some special characters like zabar, zair, paish etc. We have used characters of fixed size and style in this work. We are using all the variations of each character in a writing style e.g. bay has three shapes a basic, a beginning and mid shapes. Our algorithm uses a character model with Hidden Markov Models (HMMs) for segmentation of Urdu text. To the best of our knowledge, this work has not been done previously. We have offline text i.e., scanned pre-processed B/W Urdu characters and we are using Matlab ver. 7.12 as programming tool.

4.1 Our Method

Our method is divided into three broad steps:

Step#1 Data Acquisition / Feature Extraction:

In the first step, algorithm transforms images of symbols into binary form as a matrix. Then extract features from the images using our feature extraction program and store it into a disk. These features are represented as hidden states: X(i) = { x(0), x(1), . . . , x (k)} where each X (i) represents a feature (in matrix form) for each shape in an Urdu character set; x (k) is a position vector in the matrix X (i).

Step#2 Get Observed data:

The observed data contain sequences of Urdu characters. In our study we have used a line of Urdu text. After acquiring this filtered image, we have transformed it into binary form. Then extracted features from an image using our feature extraction program. This feature contains several Urdu characters in it. The algorithm will scan it and perform segmentation by calculating maximum probabilities with hidden states and locating observations in feature using HMMs. These observations form observable states: O(i) = { o(0), o(1), . . . , o(k)} where each O(i) represents feature (in matrix form) for each shape in observed states; o(k) is a positional vector in matrix O(i).

Step#3 Apply HMMs:

We are given:

Hidden states: X(i) = { x(1), x(2), . . . , x(k)} where i = 1,2, … , m (for m characters).

Observable states: O(i) = { o(1), o(2), . . . , o(k)} where i = 1,2, … , n.

Initial Distribution X(0).

In a hidden Markov model the state variable x(i) is observable only through its measurements o(i). Now, suppose that a sequence O(i) of emission has been observed.

Fig 2 shows transformation of a character and an observed sequence that are captured using MATLAB matrices.

(a)

(b)

Fig 2: (a) A m x n matrix showing Urdu character Alif. (b) Sample observation showing a connected component of two characters bay and alif spelled out ba.

Instead of using characters our algorithm extracted features from all the characters to reduce computation complexity. These features will be used as hidden states in HMM i.e. x(i) and are stored on disk for example, features showing character alif and bay, captured using MATLAB, are shown below in fig 3.

(a)

(b)

(c)

Fig 3. (a) Feature for character Alif, (b) Feature for character Bay and

The algorithm extracts feature from line of sample text S(i). In forward algorithm, the feature s(1), … , s(k) is matched against each of the hidden states x(i) by matching rows of x(i) with rows of S(i). The process continues for all characters and stops after calculating probabilities for all the characters i.e. P(X(i)|Z(i)). Afterwards it finds the maximization of probability and in this way it finds observation O(1) from the S(1). The forward algorithm will continue from s(k+1), …, s(L) to find observations O(2), … , O(n). If there is more than one probable character, then we can use a so called Viterbi algorithm that will find argmax and will give the optimal probable sequence if we are not near to actual results. The algorithm for the HMMs is as under:

Algorithm Segsha (S, L)

j=1

while ( j < L )

for i = 1 to n

Sample s(j) ~ {w}

wi’ = pr(s(j)|X(i))

end-for

O(i) = O(i) U {max( wi’ )}

s(j) = s(j) + 1

end-while

Where S is a sample feature of vectors obtained from an observed sequence O(i) i.e., a line of Urdu text; L is the dimension of S (length of S); S(j) is a sample taken from S each time to match against character feature X(i) and probability of matching will give us weights, wi’, for each character; max(wi’) is maximization of probability that proceeds as follows:

Here max(wi’) can be calculated by comparing wi’ ~ w and calculated by using the eq.1 [20].

Result

A total of 1200 words were used that include all the characters in our character set. Sample scanned text was taken from Nastaliq font with point size 36. We found that 1176 out of 1200 were completely recognized. Not the whole word but only one or two characters in a word were misclassified. The accuracy of 97% was very encouraging for us and we are looking forward to work further in this area.

Conclusion

We tested our approach on images of text taken from Nastaliq font scanned at 300 dpi and found that better results can be achieved by using HMM with the character model. These results were checked on a prototype using a set of characters. We have achieved 97% accuracy.

Future Work and Enhancements

In future we are planning on two things:

1. To eliminate restriction of fixed font size and style.

2. To work with handwritten Urdu text.

We will use both of the options using the same method but that is another story.

Order Now