Sinhala Text To Speech (STTS)

The system, called SINHALA TEXT TO SPEECH (STTS), is a research project. This documentation briefly describes the functionality of the STTS and highlights the importance and benefits of the project. The system allows the user to enter Sinhala text, which it converts internally into a pronounceable form once the user selects the “convert to voice” option. In short, the system accepts characters in the Sinhala language (Sinhala fonts) and turns them into sound waves that can be played through speakers. The user can also select a preferred voice type from three options: a child voice, a female voice, and an adult (male) voice, and hear the output in the voice he or she likes most. The system offers several benefits to its users. People who cannot read Sinhala but can understand it verbally are encouraged to use this system, because with it they can overcome that problem very easily; likewise, anyone with documents in Sinhala text can have them read aloud. At present there is no comparable system for the Sinhala language.


INTRODUCTION

We use speech as the main communication medium in our day-to-day lives. However, when it comes to interacting with computers, apart from watching and performing actions, the majority of communication nowadays is achieved by reading the computer screen. This includes surfing the internet and reading emails, eBooks, research papers, and much more, all of which is very time-consuming. Moreover, the visually impaired community in Sri Lanka faces considerable difficulty communicating with computers, since no suitable tool is available for convenient use. As an appropriate solution to this problem, this project proposes an effective tool for Text-To-Speech conversion that accommodates speech in the native language.

What is text-to-speech?

Not everybody can read text when it is displayed on the screen or printed. This may be because the person is partially sighted, or because they are not literate. These people can be helped by generating speech rather than printing or displaying text, using a Text-to-Speech (TTS) system to produce the speech for the given text. A TTS system takes written text (from a web page, text editor, clipboard, etc.) as the input and converts it to an audible format so you can hear what the text contains. It identifies and reads aloud what is being displayed on the screen. With a TTS application, one can listen to computer text instead of reading it. That means you can listen to your emails or eBooks while doing something else, which saves valuable time. Apart from saving time and empowering the visually impaired population, TTS can also be used to overcome the literacy barrier of the common masses, increase the possibilities of improved man-machine interaction through on-line newspaper reading from the internet, and enhance other information systems such as learning guides for students, IVR (Interactive Voice Response) systems, automated weather forecasting systems, and so on [1][2].
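To make the idea concrete, the short sketch below reads a piece of text aloud using pyttsx3, an open-source Python text-to-speech library chosen here purely for illustration; it is not part of the STTS system described in this document.

    # Minimal text-to-speech sketch using the pyttsx3 library
    # (pip install pyttsx3). Illustrative only.
    import pyttsx3

    engine = pyttsx3.init()      # start the platform's default TTS engine
    text = "You can listen to your emails while doing something else."
    engine.say(text)             # queue the text for synthesis
    engine.runAndWait()          # block until the queued text has been spoken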

What is “Sinhala Text To Speech”?

“Sinhala Text To Speech” is the system I selected as my final research project. As a postgraduate student, I selected a research project that converts Sinhala input text into a verbal form.

The term “Text-To-Speech” (TTS) refers to the conversion of input text into a spoken utterance. The input is Sinhala text, which may consist of words, sentences, paragraphs, numbers, and abbreviations. The TTS engine should interpret it without ambiguity and generate the corresponding speech sound wave with acceptable quality. The output should be understandable to an average listener without much effort, which means it should be made as close as possible to natural speech quality.
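As an illustration of this front-end step, the toy normalizer below expands abbreviations and spells out digits before synthesis. The example is in English and both tables are hypothetical placeholders; a real Sinhala front end would need far richer rules.

    # Toy text normalization: expand abbreviations and digits before
    # synthesis. The tables are illustrative placeholders only.
    ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "No.": "Number"}
    DIGIT_WORDS = ["zero", "one", "two", "three", "four",
                   "five", "six", "seven", "eight", "nine"]

    def normalize(text: str) -> str:
        words = []
        for token in text.split():
            if token in ABBREVIATIONS:
                words.append(ABBREVIATIONS[token])   # expand abbreviations
            elif token.isdigit():
                # spell out each digit as a word
                words.extend(DIGIT_WORDS[int(d)] for d in token)
            else:
                words.append(token)                  # pass plain words through
        return " ".join(words)

    print(normalize("Dr. Silva lives at No. 42"))
    # -> "Doctor Silva lives at Number four two"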

Speech is produced when air is forced from the lungs through the vocal cords (glottis) and along the vocal tract. The speech signal can be split into a rapidly varying excitation signal and a slowly varying filter; the envelope of the power spectrum contains the vocal tract information.
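The following minimal sketch, assuming NumPy and SciPy, illustrates this source-filter view: a buzz-like impulse train stands in for the glottal excitation, and two second-order resonators stand in for vocal tract formants. The formant frequencies and bandwidths are illustrative values only, not measured speech data.

    # Source-filter sketch: impulse-train excitation passed through
    # formant resonators (all parameter values are illustrative).
    import numpy as np
    from scipy.signal import lfilter

    fs = 16000                        # sample rate (Hz)
    f0 = 120                          # fundamental frequency (Hz)
    excitation = np.zeros(fs)         # one second of samples
    excitation[::fs // f0] = 1.0      # rapidly varying source: glottal pulses

    def resonator(freq, bandwidth, fs):
        # Second-order all-pole section with a resonance at `freq`:
        # H(z) = 1 / (1 - 2 r cos(theta) z^-1 + r^2 z^-2)
        r = np.exp(-np.pi * bandwidth / fs)
        theta = 2 * np.pi * freq / fs
        return [1.0], [1.0, -2 * r * np.cos(theta), r * r]

    speech = excitation
    for freq, bw in [(700, 110), (1200, 120)]:   # rough /a/-like formants
        b, a = resonator(freq, bw, fs)
        speech = lfilter(b, a, speech)           # slowly varying filter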

The verbal form of the input should be understandable to the receiver, meaning the output should be as close as possible to a natural human voice. The system provides a few main features: after entering the text, the user can select one of several voice qualities (a female voice, a male voice, or a child voice) and can also vary the speed of the voice.

My project offers several main benefits to the users who intend to use it.

The basic architecture of the project is shown below.

Figure 1.2: Basic architecture of the system: Sinhala text input, voice and speed selection, a processing stage, and Sinhala voice output.

1.3 Why is “Sinhala Text To Speech” needed?

Since most commercial computer systems and applications are developed in English, the usage and benefits of those systems are limited to people with English literacy. Because of this, the majority of the world cannot take advantage of such applications, and this scenario applies to Sri Lanka as well. Though Sri Lankans have high language literacy, computer and English literacy in suburban areas is rather low. Therefore, the benefits and advantages that can be gained through computers and information systems are kept away from people in rural areas. One way to overcome this is through localization, and “Sinhala Text To Speech” will act as a strong platform to boost software localization and to reduce the gap between computers and people.


AIMS AND OBJECTIVES

The main objective of the project is to develop a fully featured Sinhala Text to Speech system that gives speech output similar to a human voice while preserving the native prosodic characteristics of the Sinhala language. The system will provide a female voice, which is a major requirement in the current localization software industry. It will act as the main platform for Sinhala Text To Speech, and developers will have the benefit of building end-user applications on top of it. This will benefit the visually impaired population and people with low IT literacy in Sri Lanka by enabling convenient access to information such as emails, eBooks, website contents, documents, and learning tutors. An end-user Windows application will be developed to act as a document reader as well as a screen reader.

The aim is to develop a system that can read text in Sinhala script and convert it into verbal (Sinhala) form. The system will also be able to change the sound output: the user can select the voice quality according to his or her preference from three options, a female voice, a male voice, and a child’s voice. The user can also change the speed of the voice; anyone who prefers slower or faster speech can adjust it to their requirements.

SPECIFIC STUDY OBJECTIVES

Produce a verbal format for the input Sinhala text.

Input Sinhala text, whether typed by the user or taken from a text document, will be transformed into sound waves output through speakers. Visually impaired people will therefore be among the main beneficiaries of the Sinhala Text to Speech system. Undergraduates and researchers who work with many references can also feed text into the system and simply listen to extract what they need.

The output would be more like natural speech.

The human voice is a complex acoustic signal generated by an air stream expelled through the mouth, the nose, or both. Important characteristics of the speech sound are speed, silence, accentuation, and the level of energy output. The tongue and lips, with the help of the other articulators in the vocal system, control the air stream appropriately. Many variations of the speech signal are produced by the speaker’s vocal system in order to convey meaning and emotion to the receiver, who then interprets the message; the receiver’s hearing system likewise relies on many other characteristics to identify what is being said.

Identify an efficient way of translating Sinhala text into verbal form.

By developing this system, we aim to identify and propose a suitable algorithm that can translate Sinhala text to verbal form in a fast and efficient manner.
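As one hedged illustration of what such an algorithm might start from, the sketch below applies simple grapheme-to-phoneme rules to a handful of Sinhala letters. It covers only a few characters and one diacritic rule; the letter inventory and romanization are simplifying assumptions, not the project’s actual rule set.

    # Minimal rule-based Sinhala grapheme-to-phoneme sketch (illustrative
    # only; it ignores most of the real orthographic rules).
    CONSONANTS = {"ක": "k", "ම": "m", "ල": "l"}      # base consonant letters
    VOWEL_LETTERS = {"අ": "a", "ආ": "aa", "ඉ": "i"}  # independent vowels
    VOWEL_SIGNS = {"ා": "aa", "ි": "i"}              # dependent vowel signs
    HAL_KIRIMA = "්"                                  # suppresses the inherent vowel

    def to_phonemes(word: str) -> str:
        out, i = [], 0
        while i < len(word):
            ch = word[i]
            if ch in CONSONANTS:
                out.append(CONSONANTS[ch])
                nxt = word[i + 1] if i + 1 < len(word) else ""
                if nxt in VOWEL_SIGNS:
                    out.append(VOWEL_SIGNS[nxt]); i += 2
                elif nxt == HAL_KIRIMA:
                    i += 2                      # bare consonant, no vowel
                else:
                    out.append("a"); i += 1     # inherent vowel /a/
            elif ch in VOWEL_LETTERS:
                out.append(VOWEL_LETTERS[ch]); i += 1
            else:
                i += 1                          # skip anything unmodelled
        return "".join(out)

    print(to_phonemes("අම්මා"))   # -> "ammaa" (Sinhala for "mother")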

Control the voice speed and the type of voice (e.g. male, female, or child voice).

Users will be able to select the quality of the sound wave they want, and to adjust the speed of the output as they need. People learning Sinhala as a second language can practice elocution properly by reducing or increasing the speed, which improves their listening capabilities.

Small children can also be encouraged to learn the language by varying the speed and voice types, as the sketch below illustrates.
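A minimal sketch of such controls, again assuming the pyttsx3 library used earlier, sets the speaking rate and switches to another installed voice. Which voices are available depends entirely on the operating system.

    # Selecting voice and speed with pyttsx3 (illustrative; the available
    # voices depend on the operating system's installed TTS voices).
    import pyttsx3

    engine = pyttsx3.init()
    engine.setProperty("rate", 120)        # slower than the default rate

    voices = engine.getProperty("voices")  # list the installed voices
    if len(voices) > 1:
        engine.setProperty("voice", voices[1].id)  # pick another voice

    engine.say("A test sentence at a reduced speed.")
    engine.runAndWait()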

Propose ways in which the current system can be extended further for future needs.

This system provides only the basic functions, but it can be enhanced further to satisfy the changing requirements of its users. For example, it could be embedded in toys to improve children’s listening and elocution abilities and so broaden their speaking capacity.

RELEVANCE OF THE PROJECT

The idea of developing a Sinhala Text To Speech (STTS) engine arose when I considered the opportunities available for Sinhala-speaking users to grasp the benefits of Information and Communication Technology (ICT). In Sri Lanka more than 75% of the population speaks Sinhala, but Sinhala software or Sinhala ICT materials are very rare in the market. This directly affects the development of ICT in Sri Lanka.

At present, a few Sinhala text-to-speech packages are available, but they suffer from problems such as poor sound quality, incompatible font schemas, and incorrect pronunciation. Because of these problems, developers are reluctant to use existing STTS packages in their applications. My focus is on developing an engine that can convert digitized Sinhala words into Sinhala pronunciation in an error-free manner. This engine will help in developing further applications.

Some applications where STTS can be used

Document reader. Reads an already digitized document (e.g. e-mails, e-books, newspapers) or a conventional document that has been scanned and processed through an optical character recognizer (OCR).

Aid for disabled people. The vision- or voice-impaired community can use computer-aided devices to communicate directly with the world. A vision-impaired person can be informed by an STTS system, while a voice-impaired person can communicate with others through a keypad and an STTS system.

Talking books & toys. Producing talking books and toys will boost the toy market and education.

Help assistant. A help assistant that speaks in Sinhala, similar to the MS Office help assistant, could be developed.

Automated newscasting. An entirely new breed of television networks, with programs hosted by computer-generated characters, becomes possible.

Sinhala SMS reader. SMS messages contain many abbreviations; a system that reads those messages aloud will help their receivers.

Language education. A high-quality TTS system incorporated into a computer-aided device can be used as a tool for learning a new language. Such tools help learners improve very quickly, since they have access to the correct pronunciation whenever needed.

Travelers guide. A system located inside a vehicle or on a mobile device that gives the current location and other relevant information, incorporated with GPRS.


Alert systems. Control systems can incorporate a TTS component to attract the attention of their operators, since humans are used to having their attention drawn by voice.

In particular, countries like Sri Lanka, which are still struggling to harvest the benefits of ICT, can use a Sinhala TTS engine as a solution for conveying information effectively. Users who can obtain the information they need in their native language will naturally turn their thoughts to the achievable benefits and will be encouraged to use information technology more frequently.

Therefore, the development of a TTS engine for Sinhala will bring personal benefits from a social perspective (e.g. an aid for the disabled, language learning) and definite financial benefits in economic terms (e.g. virtual television networks, toy manufacturing).

RESEARCH METHODOLOGY

The system has been developed using the agile software development method. We aimed to develop the solution through short-term goals, which give a sense of accomplishment and make the work easier to manage. Project reviews were a very useful and powerful way of adding a continuous improvement mechanism: the project supervisors were consulted on a regular basis for reviews and feedback in order to make the right decisions, clear up misunderstandings, and carry out future development effectively and efficiently. Good planning and meeting follow-up were crucial to making these reviews a success.

BACKGROUND AND LITERATURE REVIEW

“Text to speech” is a very popular area in the field of computer science, and several research efforts have been carried out on it. Most research focuses on how to produce more natural speech for a given text. Freely available text-to-speech packages exist, but most software is developed for widely spoken languages such as English, Japanese, and Chinese. Some software companies also distribute text-to-speech development tools for English; the Microsoft Speech SDK is one example of a freely distributed toolkit developed by Microsoft for the English language.

Nowadays, a number of universities and research labs run research projects on text to speech. Carnegie Mellon University focuses part of its research on text to speech (TTS) and provides open-source speech software, toolkits, related publications, and important techniques to undergraduate students and software developers alike. The TCTS Lab is also carrying out research in this area and has introduced a simple but general functional diagram of a TTS system [39].

Figure: A simple but general functional diagram of a TTS system (image credit: Thierry Dutoit).

Before the project initiation, basic research was done to become familiar with TTS systems and to gather information about existing systems of this kind. Later, a comprehensive literature survey was done in the fields of the Sinhala language and its characteristics, Festival and Festvox, generic TTS architecture, building new synthetic voices, Festival and Windows integration, and improving existing voices.

History of Speech Synthesizing

A historical analysis is useful for understanding how current systems work and how they developed into their present form. The history of synthesized speech, from mechanical synthesis to today’s high-quality synthesizers, and some milestones in synthesis-related techniques are discussed in this section.

Efforts to produce synthetic speech began over two hundred years ago. In 1779, the Russian professor Christian Kratzenstein explained the physiological differences between five long vowels (/a/, /e/, /i/, /o/, and /u/) and constructed equipment to create them: acoustic resonators resembling the human vocal tract, activated with vibrating reeds.

In 1791, the “Acoustic-Mechanical Speech Machine” was introduced by Wolfgang von Kempelen; it generated single sounds and combinations of sounds. He described his studies on speech production and his experiments with the speech machine in his publications. A pressure chamber for the lungs, a vibrating reed to act as the vocal cords, and a leather tube for the vocal tract were the crucial components of his machine, and he was able to produce different vowel sounds by controlling the shape of the leather tube. Consonants were created by four separate constricted passages controlled by the fingers, and a model of the vocal tract including a hinged tongue and movable lips was used for plosive sounds.

In the mid-1800s, Charles Wheatstone implemented a version of Kempelen’s speaking machine that was capable of generating vowels, consonant sounds, some sound combinations, and even full words. Vowels were generated using a vibrating reed with all passages closed, and consonants (including nasals) were generated with turbulent flow through an appropriate passage with the reed off.

In the late 1800s, Alexander Graham Bell, together with his father, constructed a similar machine without any significant success. He also experimented by holding his dog between his legs, making it growl, and shaping its vocal tract by hand to produce speech-like sounds.

No significant improvements in research and experiments with mechanical and semi-electrical analogues of the vocal system were made until the 1960s [38].

The first fully electrical synthesis device was introduced by Stewart in 1922 [17]. It had a buzzer for the excitation and two resonant circuits to model the acoustic resonances of the vocal tract. The machine was able to produce single static vowel sounds with the two lowest formants, but it could not produce consonants or connected utterances. A similar synthesizer was made by Wagner [27]: four electrical resonators connected in parallel, also excited by a buzz-like source, whose four outputs were combined in the proper amplitudes to produce vowel spectra. In 1932, the researchers Obata and Teshima discovered the third formant in vowels [28]. The first three formants are generally considered sufficient for intelligible synthetic speech.

The first device that could be considered a speech synthesizer was the VODER (Voice Operating DEmonstratoR), introduced by Homer Dudley at the New York World’s Fair in 1939 [17][27][29]. The VODER was inspired by the VOCODER (Voice CODER), developed at Bell Laboratories in the mid-thirties mainly for communication purposes. The VOCODER was built as a voice-transmitting device, an alternative for low-band telephone lines: it analyzed wideband speech, converted it into slowly varying control signals, sent those over a low-band phone line, and finally transformed the signals back into the original speech [36]. The VODER consisted of touch-sensitive switches to control the voicing and a pedal to control the fundamental frequency.


After the demonstration of the VODER, which showed a machine’s ability to produce human voice intelligibly, people became more interested in speech synthesis. In 1951, Franklin Cooper and his associates developed the pattern playback synthesizer at the Haskins Laboratories [17][29]. Its method was to reconvert recorded spectrogram patterns into sounds, either in original or modified form; the spectrogram patterns were stored optically on transparent belts.

The formant synthesizer was introduced by Walter Lawrence in 1953 [17]; it was named PAT (Parametric Artificial Talker) and consisted of three electronic formant resonators connected in parallel. Either a buzz or a noise was used as the input signal, and the device could control the three formant frequencies, the voicing amplitude, the fundamental frequency, and the noise amplitude. At approximately the same time, Gunnar Fant introduced the first cascade formant synthesizer, named OVE I (Orator Verbis Electris). In 1962, Fant and Martony introduced an improved synthesizer named OVE II, with separate parts to model the transfer function of the vocal tract for vowels, nasals, and obstruent consonants. The OVE projects were further improved, and as a result OVE III and GLOVE were introduced at the Kungliga Tekniska Högskolan (KTH), Sweden; the present commercial Infovox system is originally descended from these [30][31][32].

There was a debate between the PAT and OVE camps on whether the transfer function of the acoustic tube should be modeled in parallel or in cascade. After studying these synthesizers for a few years, John Holmes introduced his parallel formant synthesizer in 1972. The synthesized voice was so good that the average listener could not tell the difference between it and natural speech [17]. About a year later he introduced a parallel formant synthesizer developed with the JSRU (Joint Speech Research Unit) [33].

The first articulatory synthesizer was introduced in 1958 by George Rosen at the Massachusetts Institute of Technology (M.I.T.) [17]. The DAVO (Dynamic Analog of the VOcal tract) was controlled by tape recordings of control signals created by hand. The first experiments with Linear Predictive Coding (LPC) were made in the mid-1960s [28].

The first full text-to-speech system for English was developed at the Electrotechnical Laboratory in Japan in 1968 by Noriko Umeda and colleagues [17]. The synthesis was based on an articulatory model and included a syntactic analysis module with some sophisticated heuristics. Though the system was intelligible, it was still monotonic.

The MITalk laboratory text-to-speech system was developed at M.I.T. by Allen, Hunnicutt, and Klatt in 1979. The system was later also used in the Telesensory Systems Inc. (TSI) commercial TTS system with some modifications [17][34]. Dennis Klatt introduced his famous Klattalk system two years later; it used a new sophisticated voicing source, described in more detail in [17]. The technology used in the MITalk and Klattalk systems forms the basis of many synthesis systems today, such as DECtalk and Prose-2000.

In 1976, the first reading aid with an optical scanner was introduced by Kurzweil. The system could read multifont printed text and was very useful for blind people. Though useful, it was far too expensive for average customers, but it was used in libraries and service centers for visually impaired people [17].

A considerable number of commercial text-to-speech systems were introduced in the late 1970s and early 1980s [17]. In 1978, Richard Gagnon introduced the inexpensive Votrax-based Type-n-Talk system. Two years later, in 1980, Texas Instruments introduced the linear-prediction-coding (LPC) based Speak-n-Spell synthesizer, built around a low-cost linear prediction synthesis chip (TMS-5100). In 1982, Street Electronics introduced the Echo, a low-cost diphone synthesizer based on a newer version of the same chip used in Speak-n-Spell (TMS-5220); at about the same time, Speech Plus Inc. introduced the Prose-2000 text-to-speech system. A year later, the first commercial versions of the famous DECtalk and the Infovox SA-101 synthesizer were introduced [17].

One of the modern synthesis technologies recently applied in speech synthesis is the hidden Markov model (HMM). HMMs have been applied to speech recognition since the late 1970s, and for about two decades they have also been used in speech synthesis systems. A hidden Markov model is a collection of states connected by transitions, with two sets of probabilities in each: a transition probability, which gives the probability of taking that transition, and an output probability density function (pdf), which defines the conditional probability of emitting each output symbol from a finite alphabet, given that the transition is taken [35].
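As a toy illustration of this definition, the sketch below samples an observation sequence from a two-state HMM. For simplicity it attaches a Gaussian output density to each state rather than to each transition, which is a common simplification.

    # Toy two-state HMM: sample a sequence of observations.
    import numpy as np

    rng = np.random.default_rng(0)

    A = np.array([[0.9, 0.1],    # P(next state | current state 0)
                  [0.2, 0.8]])   # P(next state | current state 1)

    # Gaussian (mean, std) output density attached to each state,
    # standing in for the per-transition output pdfs described above.
    emissions = [(0.0, 1.0), (5.0, 1.0)]

    state, observations = 0, []
    for _ in range(100):
        mu, sigma = emissions[state]
        observations.append(rng.normal(mu, sigma))  # emit from the state's pdf
        state = rng.choice(2, p=A[state])           # take a random transition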

Neural networks have also been used in speech synthesis for about ten years, yet the ways of using them are still not fully explored. Like HMMs, neural network technology can be applied to speech synthesis in a promising manner [28].

Figure 6.1: Some milestones in speech synthesis [38]

6.1.1 History of Finnish Speech Synthesis

Compared to English, the number of Finnish-language users is quite small, and the development process is time-consuming and expensive, even though the Finnish text-processing scheme is simple and the correspondence between text and pronunciation is very regular. Demand has increased with new multimedia and telecommunication applications.

In 1977, the first Finnish speech synthesizer, SYNTE2, was introduced at Tampere University of Technology. It was the first microprocessor-based synthesis system and the first portable TTS system in the world. Five years later, an improved SYNTE3 synthesizer was introduced.
