Searching In Roman Urdu English Language Essay

After the emergence of the internet, computer scientists encountered many new problems, such as multilingual and transliteration issues. Different standards have been created for different countries, i.e. i18n (internationalization: international standards for date, time and language) and l10n (localization: localized standards for date, time and language). Such problems arise in South Asian countries such as Pakistan and India, since Urdu and Hindi are not written using Roman characters. This resulted in a new writing style known as Roman Urdu, which can be defined as writing Urdu using the Roman alphabet. In the past, many efforts were made to transliterate Roman Urdu into Urdu and Hindi script, but no serious effort was made to create an information retrieval system for searching Roman Urdu content over the internet. Because Roman Urdu is not a standardized language, people write it according to their literacy and intuition. In this research we have tried to resolve these issues by generating multiple queries from the user query and applying advanced search techniques, such as n-gram modeling, to these queries.

Index Terms – Searching In Urdu, Roman Urdu Search, Information retrieval.

Introduction

Writing a language in its own script is a basic need of its speakers. However, standard keyboards are designed around the English language. After the emergence of the internet, people started communicating in their local languages, and new writing styles emerged for different languages as a result. One of these, introduced by Hindi and Urdu speakers, is known as Roman Urdu. A large amount of Roman Urdu content is available on the web, much of it in discussion forums. There are 240 million users of Roman Urdu all over the world. In this research we tried to create an information retrieval system that finds Roman Urdu content on the internet better than existing systems.

Why Searching In Roman Urdu Is Different From Others? [7]

Roman Urdu is written according to the user's intuition.

Urdu is a morphologically complex language, which is why a single word can be written in many ways.

There is no standard for Roman Urdu.

Different character sets have been proposed for Roman Urdu transliteration that map Roman characters onto Urdu characters on the basis of phonemes, but because of the morphological nature of Urdu the mapping is ambiguous, so these schemes do not solve the Roman Urdu search problem. Several efforts have been made to transliterate Roman Urdu into Urdu, but no substantial effort has been made toward information retrieval over Roman Urdu data. Many schemas have been proposed for Urdu transliteration, yet there is no single standard way to transliterate Roman Urdu into Urdu script, since Urdu has more characters than the Roman alphabet. The solution proposed in [9] introduces a three-tier generation architecture with three phases: 1) pre-processing, 2) cross-script mapping, 3) trie generation. Special symbols and capitalization of Roman characters are used to map Urdu onto Roman characters. Urdu has 15 long and 3 short vowels, and different people spell these vowels differently. The authors provide a schema that maps the Roman alphabet to the Urdu vowels and then generate a tree that provides all possible options; as a result it also generates many irrelevant options, which causes ambiguity. The best solution to this type of problem is to provide an unambiguous way of mapping Urdu letters to English letters. They provide an alphabet-to-alphabet mapping of Urdu letters to Roman letters [9].

Because the Roman alphabet has fewer letters than Urdu, they added special characters to the Roman alphabet to obtain a one-to-one mapping.

Another study focused on transliteration from Roman Urdu to Urdu. The authors tried to provide a context-free grammar (CFG) for Roman Urdu and to use it to transliterate Roman Urdu into Urdu. They described different techniques for inserting a word into the DSFA [8]. They also defined their own schema, listing the vowels and their mapping onto Urdu letters as provided below:

Using this mapping and the vowel sounds, we created our own vowel mapping schema based on the sounds of the phonemes. The one-to-one mapping of vowels according to their pronunciation produced the following five tables.

A

Aa

Ae

e

Aa

Ae

Ai

Ao

Au

eu

Table for ‘a’ and its variations

e

Ai

i

ee

Ea

Ee

Ei

Eo

eEu

Eu

Table for ‘e’ and its variations

I

Ai

ie

Ia

ee

Ie

Ii

Io

Iu

Table for ‘i’ and its variations

O

Ao

oo

ou

Oa

Oe

Oi

Oo

Ou

Table for ‘o’ and its variations

u

O

oo

Ao

Ua

Ue

Ui

Uo

Uu

Table for ‘u’ and its variations

We use these tables to generate multiple queries from the user query and fetch data on the basis of all these queries instead of a single query. This helps resolve issues related to the inconsistent writing style of Roman Urdu users.

To test this, we built the following schema so that we could test our vowel replacement technique.

To implement the schema in the above-mentioned diagram, we have to know how basic search engines and crawlers work. Since existing search engines do not provide a free application programming interface (API) with which we could test our schema, we had to build our own search engine, which also lets us apply some advanced information retrieval techniques. First, we need to look at how the multi-query generator works.

The multi-query generator is an algorithm that identifies the vowels in the query string and replaces each with the phonetically equivalent vowels provided in the tables. It generates multiple queries from a single user query, then fetches data for each query and displays the search results. The algorithm is given below:

Tokenize the query on the basis of the locations of vowels or vowel combinations.

Replace each vowel or vowel combination with the phonetically close vowels or vowel combinations.

Combine the strings in n-1 ways, where n is the number of combinations against a single vowel or combination of vowels.

Fetch the results for the generated queries and display them on the front end.
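The first three steps above can be sketched as follows. This is a minimal illustration, not the original implementation: the `VOWEL_VARIANTS` table is a small hypothetical subset loosely based on the vowel tables, the function names are invented for the example, and the combination step is interpreted here as the Cartesian product of the per-vowel options, which is an assumption.

```python
import re
from itertools import product

# Hypothetical subset of the vowel tables; the exact groupings are an assumption.
VOWEL_VARIANTS = {
    "a": ["a", "aa"],
    "e": ["e", "i", "ee"],
    "i": ["i", "ee", "ie"],
    "o": ["o", "oo", "ou"],
    "u": ["u", "o", "oo"],
}

VOWEL_RUN = re.compile(r"[aeiou]+")


def tokenize(word):
    """Step 1: split a word into alternating consonant and vowel segments."""
    segments = []
    pos = 0
    for match in VOWEL_RUN.finditer(word):
        if match.start() > pos:
            segments.append(word[pos:match.start()])
        segments.append(match.group())
        pos = match.end()
    if pos < len(word):
        segments.append(word[pos:])
    return segments


def variants(segment):
    """Step 2: look up phonetically close spellings; unknown segments stay as-is."""
    return VOWEL_VARIANTS.get(segment, [segment])


def generate_queries(word):
    """Step 3: combine every option of every segment, keeping the first occurrence."""
    options = [variants(segment) for segment in tokenize(word.lower())]
    queries = []
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate not in queries:
            queries.append(candidate)
    return queries


print(generate_queries("kitab"))
```

The number of generated queries grows multiplicatively with the number of vowels in the word, which is consistent with the observation later in the text that the generator produces many useless queries.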

A basic crawler starts from a seed URL, extracts more URLs from that page and adds them to the frontier. It then fetches pages by taking URLs from the frontier. How the crawler picks URLs from the frontier and how it saves data in the database depend on its architecture [12].
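The crawling loop described above can be sketched as below. To keep the example self-contained it runs against a hypothetical in-memory link graph (`TOY_WEB`) instead of fetching real pages, and it assumes a first-in, first-out frontier; a real crawler would replace `fetch_links` with an HTTP fetch and link extraction, and would store page content in a database.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> outgoing links (stands in for real fetching).
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
}


def fetch_links(url):
    """Return the links found on a page; here a lookup in the toy graph."""
    return TOY_WEB.get(url, [])


def crawl(seed, max_pages=100):
    """Visit pages starting from the seed, using a FIFO frontier."""
    frontier = deque([seed])
    seen = {seed}      # URLs already added to the frontier
    visited = []       # URLs in the order they were fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


print(crawl("seed"))
```

Swapping the FIFO queue for a priority queue is what turns this loop into the best-first variants discussed next.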

During the literature review, I went through the following crawling techniques:

Naïve Best-First Crawling [4]

These crawlers fetch pages and represent each page as a vector of words weighted by the frequency of each word in the text. The crawler then computes the cosine similarity of each fetched page to the topic and uses it to score the links found on that page (each link inherits the score of its parent page). In a multi-threaded environment it works like best-N-first, where N is the number of threads executing simultaneously.
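A minimal sketch of this idea, under stated assumptions: pages come from a hypothetical in-memory dictionary (`PAGES`), word vectors are raw term counts, and each link is queued with the similarity score of the page it was found on. The names are illustrative and not taken from [4].

```python
import heapq
import math
from collections import Counter

# Hypothetical pages: URL -> (text, outgoing links).
PAGES = {
    "seed": ("urdu forum", ["sports", "poetry"]),
    "poetry": ("urdu shayari", ["ghazal"]),
    "sports": ("cricket score", ["cricket"]),
    "ghazal": ("urdu ghazal", []),
    "cricket": ("match score", []),
}


def term_vector(text):
    """Represent a text as word -> frequency."""
    return Counter(text.lower().split())


def cosine_similarity(v1, v2):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(count * v2[term] for term, count in v1.items() if term in v2)
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)


def best_first_crawl(seed, topic, max_pages=10):
    """Always fetch the queued URL whose parent page was most similar to the topic."""
    topic_vec = term_vector(topic)
    frontier = [(-1.0, seed)]  # min-heap, so scores are stored negated
    seen = {seed}
    order = []
    while frontier and len(order) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = PAGES[url]
        order.append(url)
        score = cosine_similarity(term_vector(text), topic_vec)
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return order


print(best_first_crawl("seed", "urdu shayari"))
```

In this toy graph the link found on the on-topic page is fetched before the off-topic branch, which is the behaviour that distinguishes best-first crawling from a plain FIFO crawl.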

Focused Crawler [1]

In this technique pages are crawled and categorized on the basis of a topic taxonomy. At the start the crawler requires a topic taxonomy and example URLs. These URLs are classified into categories of the taxonomy, and the classification can be corrected by the user through an interactive process.

Context Focused Crawler [2]

These crawlers use a Bayesian classifier to guide the crawl. The classifiers provide the basis for estimating the distance between the base page and the currently crawled page. These crawlers maintain a context graph of N layers for each seed URL.

InfoSpiders [3]

An adaptive population of agents searches for pages relevant to the topic of the query. Each agent essentially follows the crawling loop while using an adaptive query list and a neural net to decide which links to follow. The algorithm provides an exclusive frontier for each agent.

Each of the above crawlers has its own pros and cons, the discussion of which is out of the scope of this paper. For our research we implemented the best-first crawling technique, since it offers the simplest effective way to maintain the similarity index and record the information.

The current implementation of the project has several issues, listed below:

It misses the context of both the search and the language. For example, if I search for ‘Pakistan ka matlab kiya’, the multi-query generator also produces multiple variants of ‘Pakistan’, which should not happen.

It generates many useless queries.

N-gram Modeling [6]

We can resolve this by adopting advanced information retrieval techniques that search content on the basis of context, such as N-gram modeling. An N-gram model is a language model that estimates the probability of the next word from the preceding sequence: it uses the history of the n-1 preceding words to compute the occurrence probability P of the current word. Some applications of N-gram modeling are given below:

Speech recognition

Handwriting recognition

Information retrieval

Optical character recognition

We can use this technique if a language corpus is available. Many other languages use this technique for information retrieval, and it has been applied to Indian languages [7].
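The n-1 word history described in this section can be illustrated with the simplest case, a bigram model (n = 2), where the probability of a word is estimated as the count of the pair (previous word, word) divided by the count of the previous word. The sketch below uses a tiny invented Roman Urdu corpus and unsmoothed maximum-likelihood estimates; both are assumptions made for illustration, and a real system would train on a proper corpus and apply smoothing.

```python
from collections import Counter


def train_bigrams(sentences):
    """Count unigrams and bigrams, with sentence start and end markers."""
    unigrams = Counter()
    bigrams = Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.lower().split() + ["</s>"]
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_probability(prev, word, unigrams, bigrams):
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


# Hypothetical toy corpus, invented for this example.
corpus = [
    "pakistan ka matlab kya",
    "pakistan ka naam",
    "dil ka matlab",
]
unigrams, bigrams = train_bigrams(corpus)
print(bigram_probability("ka", "matlab", unigrams, bigrams))
```

With such probabilities, a generated query variant whose word sequence never occurs in the corpus can be ranked low or discarded.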

There are two major corpora available for Urdu [11]:

EMILLE Lancaster Corpus

Becker-Riaz Corpus

For our research we will use the EMILLE Lancaster Corpus because it is more stable. Since these corpora are for Urdu, we will generate a one-to-one Roman Urdu to Urdu mapping corpus by developing a standard Roman Urdu word list against the available Urdu corpora. From there we can apply certain semantic web techniques as well.

Proposed solution:

User Query

Multi-query generator

N-gram techniques, probabilistic mapping of words provided in the corpus

Search Engine

Results to the user
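One possible reading of the middle stages of this pipeline is sketched below: the spelling variants produced for each query word are pruned against a Roman Urdu word list with corpus frequencies before the queries are sent to the search engine. The `LEXICON` frequencies, the `toy_variants` function and the `top_k` cut-off are all hypothetical, invented for the example; the word list itself still has to be built from the corpus.

```python
from itertools import product

# Hypothetical Roman Urdu word list with invented corpus frequencies.
LEXICON = {"pakistan": 120, "ka": 300, "matlab": 45, "kya": 80, "kia": 30}


def toy_variants(word):
    """Stand-in for the multi-query generator: spelling variants of one word."""
    table = {"kiya": ["kiya", "kya", "kia"], "pakistan": ["pakistan", "paakistan"]}
    return table.get(word, [word])


def prune_candidates(word, candidates, lexicon, top_k=2):
    """Keep only known spellings, most frequent first; fall back to the word itself."""
    known = [c for c in candidates if c in lexicon]
    if not known:
        return [word]
    return sorted(known, key=lambda c: -lexicon[c])[:top_k]


def expand_query(query, variant_fn, lexicon, top_k=2):
    """Build the final list of queries from the pruned per-word spellings."""
    per_word = [
        prune_candidates(word, variant_fn(word), lexicon, top_k)
        for word in query.lower().split()
    ]
    return [" ".join(combo) for combo in product(*per_word)]


print(expand_query("Pakistan ka matlab kiya", toy_variants, LEXICON))
```

Because unknown spellings are dropped, a word such as ‘Pakistan’ is no longer expanded into useless variants, while ambiguous words still yield a small number of plausible queries.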

