Malay Speech Corpus

CHAPTER 3 MALAY SPEECH CORPUS

3.1 Introduction

The structure, rules and grammar of a language must be understood in depth before an Automatic Speech Recognition (ASR) system is developed for it. This chapter discusses the relevant issues concerning the Malay language and its speech sounds. The Malay corpus and the test collections used in this study are also presented in the following sections.

3.2 Malay Speech Sounds and Language Rules

Malay, known locally as Bahasa Melayu, is an Austronesian language spoken by the Malay people, who are native to the Malay Peninsula, southern Thailand, Singapore and parts of Sumatra. It is the official language of Malaysia. It is an agglutinative language: the meaning of a word can be changed by adding prefixes or suffixes, as explained throughout this section.

The smallest unit of sound in any language is the phoneme. Substituting one phoneme for another may change the meaning of a word (Nong et al. 2001). Phonemes combine to form syllables and words. Malay phonemes are generally classified into three major groups: vowels (V), consonants (C) and miscellaneous sounds (Manaf & Hamid 1996). This structure is broadly similar to that of English, as shown in Figure 3.1 (Karim 1996).

The vowel class comprises six vowels: /a/, /ə/, /i/, /o/, /u/ and /e/. A vowel sound is produced when air leaves the lungs and passes through the mouth without any noise.

The second class, the consonants, can be further divided into seven categories: stops (plosives), affricates, nasals, glides, liquids, fricatives and semivowels. Consonant sounds are also produced by air from the lungs, but they contain noise generated in the mouth or nose, as in the phonemes /p/ and /b/. Figure 3.2 shows the classification of consonants in the Malay language.

The last class, miscellaneous, consists of diphthongs and vowel functions. A vowel function is a combination of two different vowels (ia, io and iu). It occurs most often in words borrowed directly from English, such as radio and audio, and in some original Malay words such as nyiur (coconut) and hias (decorate) (Hussain, 1997).

3.2.1 Malay morphology

Malay morphology is the study of word structure in the Malay language (Lutfi Abas, 1971). Its basic unit is the morpheme, the smallest meaningful unit in a language; in other words, a morpheme is a combination of phonemes that carries meaning. A Malay word may comprise one or more morphemes. Any discussion of Malay morphology must therefore cover word formation. Malay is a derivational language that allows affixes to be added to a base (root or primary) word to form new words. In this respect it differs from English, where word formation involves changes in the phonemes according to their groups. Malay words take four forms: primary words, derivative words, compound words and reduplicative words.

3.2.1.1 Primary word

Primary (root) words are either nouns or verbs that take no affixes or reduplication. A primary word may comprise one or more syllables. A syllable consists of a vowel (V) alone, a vowel with one consonant (C), or a vowel with several consonants; the vowel may come before or after the consonants. Malay has only about 500 one-syllable primary words (Nik Safiah Karim et al. 1995), some of which are borrowed from other languages such as English and Arabic. Their syllable structures are shown in Table 3.1. Two-syllable primary words form the majority in Malay; their structures are shown in Table 3.2, with example words illustrated in Figure 3.3. Primary words with three or more syllables are few, and most of them are borrowed from other languages, as shown in Table 3.3.

Table 3.1: Structure of words with one syllable

Syllable Structure	Example of word
CV	Ya (yes)
VC	Am (common)
CVC	Sen (cent)
CCVC	Stor (store)
CVCC	Bank (bank)
CCCV	Skru (screw)
CCCVC	Skrip (script)

Table 3.2: Structure of words with two syllables

Syllable Structure	Example of word
V + CV	Ibu (mother)
V + VC	Air (water)
V + CVC	Ikan (fish)
VC + CV	Erti (meaning)
VC + CVC	Empat (four)
CV + V	Doa (pray)
CV + VC	Diam (silent)
CV + CV	Guru (teacher)
CV + CVC	Telur (egg)
CVC + CV	Lampu (lamp)
CVC + CVC	Jemput (invite)

Figure 3.3: Syllable structure of two example words: ER + TI (VC + CV) and JEM + PUT (CVC + CVC), where C denotes a consonant and V a vowel.

Table 3.3: Structure of words with three syllables or more

Syllable Structure	Example of word
CV + V + CV	Siapa (who)
CV + V + CVC	Siasat (investigate)
V + CV + V	Usia (age)
CV + CV + V	Semua (all)
CV + CV + VC	Haluan (direction)
CVC + CV + VC	Berlian (diamond)
V + CV + CV	Utara (north)
VC + CV + CV	Isteri (wife)
CV + CV + CV	Budaya (culture)
CVC + CVC + CV	Sempurna (perfect)
CVC + CV + CVC	Matlamat (aim)
CV + CV + VC + CV	Keluarga (family)
CV + CVC + CV + CV	Peristiwa (event)
CV + CV + V + CVC	Mesyuarat (meeting)
CV + CV + CV + CVC	Munasabah (reasonable)
V + CV + CVC + CV + CV	Universiti (University)
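The consonant/vowel labels in Tables 3.1 to 3.3 can be reproduced mechanically. The following is a minimal sketch and not the procedure used in this study: the function name `cv_pattern` and its `digraphs` parameter are illustrative assumptions. It labels letters only and does not locate syllable boundaries (the "+" signs in the tables).

```python
VOWELS = set("aeiou")


def cv_pattern(word, digraphs=()):
    """Label each letter of `word` as C or V.

    `digraphs` is an optional tuple of two-letter spellings (for example
    ("sy",)) that should count as a single consonant. This is an
    assumption of the sketch, not a rule stated in the chapter.
    """
    word = word.lower()
    pattern = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in digraphs:
            pattern.append("C")  # a listed digraph counts as one consonant
            i += 2
            continue
        pattern.append("V" if word[i] in VOWELS else "C")
        i += 1
    return "".join(pattern)


print(cv_pattern("skru"))                          # CCCV
print(cv_pattern("jemput"))                        # CVCCVC
print(cv_pattern("mesyuarat", digraphs=("sy",)))   # CVCVVCVC
```

With "sy" treated as a single consonant, the output for "mesyuarat" agrees with the CV + CV + V + CVC entry in Table 3.3 once the syllable breaks are removed.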

3.2.1.2 Derivative word

Derivative words are formed by adding affixes to primary words. An affix can occur at the beginning of a word (prefix), within it (infix) or at the end (suffix). Affixes that attach to the beginning and the end of a word at the same time are called confixes. Examples of derivative words are “berjalan” (walking), “mempunyai” (having) and “pakaian” (clothes).
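The positions of affixes described above can be illustrated with a small sketch. The helper names `derive` and `affix_kind` are hypothetical, infixes are not covered, and the code simply concatenates strings: it assumes no sound changes at the affix boundary, which does not hold for every Malay affix.

```python
def affix_kind(prefix="", suffix=""):
    """Name the affix type from which positions are filled."""
    if prefix and suffix:
        return "confix"
    if prefix:
        return "prefix"
    if suffix:
        return "suffix"
    return "none"


def derive(root, prefix="", suffix=""):
    """Attach an optional prefix and/or suffix to a primary word.

    Plain concatenation only; no spelling or sound changes are modelled.
    """
    return f"{prefix}{root}{suffix}"


print(derive("jalan", prefix="ber"), affix_kind(prefix="ber"))   # berjalan prefix
print(derive("pakai", suffix="an"), affix_kind(suffix="an"))     # pakaian suffix
```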

3.2.1.3 Compound word

Compound words combine two individual primary words to carry a particular meaning. Malay has many compound words, for example “alat tulis” (stationery), “jalan raya” (road), “kapal terbang” (aeroplane), “Profesor Madya” (associate professor), “hak milik” (ownership) and “pita suara” (vocal folds). Some Malay idioms are also compound words, such as “kaki ayam” (bare feet), “buah hati” (sweetheart), “berat tangan” (lazy) and “terima kasih” (thank you).

3.2.1.4 Reduplicative word

Reduplicative words, as the name suggests, are formed by repeating a primary word. Malay has three forms of reduplication: full, partial and rhythmic. Examples of reduplicative words are “mata-mata” (policeman) and “sama-sama” (you are welcome).
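Full reduplication, the simplest of the three forms, can be detected from the spelling alone. The sketch below is illustrative only: the function name `is_full_reduplication` is an assumption, and partial and rhythmic reduplication are deliberately not handled.

```python
def is_full_reduplication(word):
    """Return True when a hyphenated word repeats the same base twice."""
    parts = word.lower().split("-")
    return len(parts) == 2 and parts[0] != "" and parts[0] == parts[1]


for candidate in ("mata-mata", "sama-sama", "jalan"):
    print(candidate, is_full_reduplication(candidate))
```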

3.3 Malay Speech Corpus Design

Malay speech corpus design essentially involves selecting suitable target speech sounds for speech recognition. Malay phonemes can be analysed by descriptive analysis or by distinctive feature analysis; descriptive analysis is generally preferred because it is easier to implement. To develop a baseline system for spoken Malay utterances or word models, a database of isolated spoken Malay words is needed. However, very little literature and reference material in Malay is available in raw electronic form to support research and development work. The material that does exist is often unsuitable for real-life speech recognition systems because of its recording environment, and most of it consists of planned or read speech.

Since no spoken Malay database exists, we developed the Malay corpus from the Hansard documents of the Parliament of Malaysia. These documents cover the Dewan Rakyat (DR) parliamentary debate sessions for the year 2008. They contain spontaneous, formal speech and form the daily record of the words spoken by the 222 elected members of the DR. The Hansard material comprises 51 large raw video and audio files (.avi format) of daily recorded parliamentary sessions and 42 text files (.pdf format). Each session contains six to eight hours of speech recorded under medium noise conditions (less than 30 dB), with interruptions by other speakers (Malay, Chinese and Indian) and different speaking styles (low, medium and high intonation, or shouting). This data was chosen because the speakers talk spontaneously and naturally in formal, standard Malay during the debates.

The analysis covered all recorded sessions in the Hansard documents from mid-term until the end of 2008. Out of 42 text documents and 51 video files, only 22 text documents and 22 video files were selected, because their contents matched exactly between the text and the video and audio source files. The remaining text documents and video files were excluded because some text documents could not be downloaded, some video files were corrupted during recording and some recorded videos had missing sound. This study is concerned only with videos that have audio, since the audio is used to develop the Malay corpus and to evaluate the performance of the isolated spoken Malay speech recognition system. Quantitative information about the selected videos and text documents is given in Table 3.4.

Table 3.4: Quantitative information of Hansard documents selected.

No.	Video & Text Documents	No. of Topic	No. of Speakers	Total Words
1.	DR28052008 (MAY)	11	129	40,283
2.	DR29052008 (MAY)	15	114	39,612
3.	DR24062008 (JUNE)	13	154	49,212
4.	DR25062008 (JUNE)	10	118	38,053
5.	DR30062008 (JUNE)	10	175	58,013
6.	DR02072008 (JULY)	14	187	67,906
7.	DR03072008 (JULY)	12	120	48,411
8.	DR07072008 (JULY)	16	210	72,890
9.	DR10072008 (JULY)	13	132	42,350
10.	DR28082008 (AUGUST)	10	123	40,780
11.	DR03112008 (NOVEMBER)	17	232	78,750
12.	DR04112008 (NOVEMBER)	11	136	43,440
13.	DR10112008 (NOVEMBER)	10	105	39,560
14.	DR20112008 (NOVEMBER)	16	109	42,795
15.	DR26112008 (NOVEMBER)	10	186	38,880
16.	DR27112008 (NOVEMBER)	10	147	41,450
17.	DR01122008 (DECEMBER)	7	118	38,430
18.	DR02122008 (DECEMBER)	9	176	56,815
19.	DR03122008 (DECEMBER)	12	152	48,616
20.	DR04122008 (DECEMBER)	11	192	56,780
21.	DR10122008 (DECEMBER)	6	130	38,677
22.	DR11122008 (DECEMBER)	10	143	52,369
	TOTAL	253	3,288	1,074,072
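The column totals of Table 3.4 can be recomputed directly from its rows. The sketch below copies the 22 rows from the table; the variable and function names are illustrative assumptions. The speaker total sums the per-session counts, so a member who spoke in several sessions is counted more than once.

```python
# (session, topics, speakers, words), copied from Table 3.4
SESSIONS = [
    ("DR28052008", 11, 129, 40283), ("DR29052008", 15, 114, 39612),
    ("DR24062008", 13, 154, 49212), ("DR25062008", 10, 118, 38053),
    ("DR30062008", 10, 175, 58013), ("DR02072008", 14, 187, 67906),
    ("DR03072008", 12, 120, 48411), ("DR07072008", 16, 210, 72890),
    ("DR10072008", 13, 132, 42350), ("DR28082008", 10, 123, 40780),
    ("DR03112008", 17, 232, 78750), ("DR04112008", 11, 136, 43440),
    ("DR10112008", 10, 105, 39560), ("DR20112008", 16, 109, 42795),
    ("DR26112008", 10, 186, 38880), ("DR27112008", 10, 147, 41450),
    ("DR01122008", 7, 118, 38430), ("DR02122008", 9, 176, 56815),
    ("DR03122008", 12, 152, 48616), ("DR04122008", 11, 192, 56780),
    ("DR10122008", 6, 130, 38677), ("DR11122008", 10, 143, 52369),
]


def column_totals(rows):
    """Sum the topics, speakers and words columns over all sessions."""
    topics = sum(row[1] for row in rows)
    speakers = sum(row[2] for row in rows)
    words = sum(row[3] for row in rows)
    return topics, speakers, words


topics, speakers, words = column_totals(SESSIONS)
print(f"sessions={len(SESSIONS)} topics={topics} speakers={speakers} words={words}")
```

The topic total agrees with the 253 topic-level wave files reported in Section 3.3.1.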

The document analysis shows that the majority of Malay words are primary words with two syllables or one syllable. Among these words, the syllable structures VC, CV and CVC are the most common. These structures are preferred because they are pronounced exactly as they are written and they occur in substantial numbers in the Hansard documents. To obtain a good distribution of consonants and vowels in the dataset, the primary (root or base) words most frequently spoken during the parliamentary debates were used. As mentioned previously, most root words are primary words, either nouns or verbs, without any derivation (prefixes and suffixes) or reduplication. From the analysis of the text documents, we identified the 100 primary words most often spoken by the members during the debates: 10 one-syllable words, four words with three or more syllables and 86 two-syllable words, as listed in Table 3.5. The detailed quantitative analysis of the distribution of each word is given in Appendix A. Each primary word has a maximum of 50 repetitions, uttered by the same or different speakers, giving a total of 5000 isolated spoken Malay words for this research. The challenging task is to capture and segment the exact words uttered from the audio in the video files. The process of creating the isolated spoken Malay corpus is illustrated in Figure 3.4 and briefly explained in the following sections.

Table 3.5: Selection of 100 isolated spoken Malay words as the speech target sounds.

No.	Words	Structures	No.	Words	Structures
1	ADA	V + CV	51	LAGI	CV + CV
2	AHLI	VC + CV	52	LAIN	CV + VC
3	AKAN	V + CVC	53	LAMA	CV + CV
4	AKTA	VC + CV	54	LANGKAH	CVCC + CVC
5	ARAH	V + CVC	55	LEBIH	CV + CVC
6	ATAS	V + CVC	56	MAKLUM	CVC + CVC
7	ATAU	V + CVV	57	MANA	CV + CV
8	BAGI	CV + CV	58	MASA	CV + CV
9	BAIK	CV + VC	59	MASIH	CV + CVC
10	BAKAL	CV + CVC	60	MESTI	CVC + CV
11	BANK	CVCC	61	MUNGKIN	CVCC + CVC
12	BARU	CV + CV	62	NANTI	CVC + CV
13	BEKAS	CV + CVC	63	OLEH	V + CVC
14	BERI	CV + CV	64	ORANG	V + CVCC
15	BINCANG	CVC + CVCC	65	PADA	CV + CV
16	BOLEH	CV + CVC	66	PIHAK	CV + CVC
17	BUAT	CV + VC	67	PRINSIP	CCVC + CVC
18	BUKAN	CV + CVC	68	PULA	CV + CV
19	DALAM	CV + CVC	69	PUN	CVC
20	DAN	CVC	70	RAMAI	CV + CVV
21	DASAR	CV + CVC	71	RIBU	CV + CV
22	DATANG	CV + CVCC	72	RUJUK	CV + CVC
23	DENGAN	CV + CCVC	73	SAH	CVC
24	DIA	CVV	74	SAMA	CV + CV
25	EKONOMI	V + CV + CV + CV	75	SANGAT	CV + CCVC
26	ESOK	V + CVC	76	SAYA	CV + CV
27	HADIR	CV + CVC	77	SEBAB	CV + CVC
28	HAK	CVC	78	SEBUT	CV + CVC
29	HAL	CVC	79	SEDANG	CV + CVCC
30	HARI	CV + CV	80	SEDIA	CV + CVV
31	HENDAK	CVC + CVC	81	SUDAH	CV + CVC
32	IAITU	VV + V + CV	82	SUSAH	CV + CVC
33	IALAH	VV + CVC	83	TADI	CV + CV
34	INGAT	VC + CVC	84	TAHU	CV + CV
35	INGIN	VC + CVC	85	TAHUN	CV + CVC
36	INI	V + CV	86	TIDAK	CV + CVC
37	ISU	V + CV	87	TANYA	CV + CCV
38	ITU	V + CV	88	TELAH	CV + CVC
39	IZIN	V + CVC	89	TENTANG	CVC + CVCC
40	JADI	CV + CV	90	TERIMA	CV + CV + CV
41	JANGAN	CV + CCVC	91	TIDAK	CV + CVC
42	JAWAB	CV + CVC	92	TIPU	CV + CV
43	JUGA	CV + CV	93	TUAN	CV + VC
44	JUTA	CV + CV	94	TUGAS	CV + CVC
45	KABINET	CV + CV + CVC	95	TULIS	CV + CVC
46	KASIH	CV + CVC	96	UNTUK	VC + CVC
47	KAUM	CV + VC	97	WAKIL	CV + CVC
48	KES	CVC	98	WAKTU	CVC + CV
49	KIRA	CV + CV	99	WANG	CVCC
50	KITA	CV + CV	100	YANG	CVCC
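The frequency-based selection behind this word list can be sketched as a simple word count over the transcript text. This is an illustrative sketch only: `top_words` and the sample sentence are hypothetical, and the filtering of derivative, compound and reduplicative words that the study applied is not reproduced here.

```python
import re
from collections import Counter


def top_words(text, n):
    """Return the n most frequent alphabetic tokens in `text` with their counts."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens).most_common(n)


# Hypothetical sample text, not taken from the Hansard documents.
sample = "yang ini dan yang itu dan yang lain"
print(top_words(sample, 2))   # [('yang', 3), ('dan', 2)]

# Corpus size implied by the design: 100 target words x 50 repetitions each.
TARGET_WORDS = 100
REPETITIONS = 50
print(TARGET_WORDS * REPETITIONS)   # 5000
```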

3.3.1 Corpus Preparation

The Malay corpus created from the Hansard documents is designed to collect realistic audio data (from the video files) that best represents the actual noise environment of the parliamentary debate sessions. All the recorded video files use the standard CD audio sampling rate of 44.1 kHz in stereo. However, the human ear is most sensitive to frequencies from 500 Hz to 4000 Hz, which roughly corresponds to the speech bandwidth carried by analogue telephone lines (Marsh 1999). Thus, to capture the voiced and unvoiced sounds of the recorded signals, the audio waveforms are re-sampled at 16 kHz and quantized to 16 bits per sample, which is sufficient for this study. The process begins by manually segmenting each video into the debate topics of the day, following the text documents selected as exact matches in the earlier step. Each video covers different topics, and there are typically 10 to 15 topics (usul dan pertanyaan, i.e. motions and questions) debated by the members of the Dewan Rakyat (DR).
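The format conversion described above (44.1 kHz stereo to 16 kHz, 16 bits per sample) involves three small computations, sketched here with the standard library only. The function names are illustrative assumptions, samples are assumed to be floats in the range -1.0 to 1.0, and the mono downmix is an assumption (the chapter does not state the output channel count). The sketch does not perform the resampling itself, which also requires low-pass filtering and was done in this study with a sound editor.

```python
from fractions import Fraction


def resample_ratio(src_hz, dst_hz):
    """Exact ratio of output to input samples for a rate conversion."""
    return Fraction(dst_hz, src_hz)


def to_mono(left, right):
    """Average the two stereo channels sample by sample."""
    return [(l + r) / 2.0 for l, r in zip(left, right)]


def quantize_16bit(samples):
    """Map floats in [-1.0, 1.0] to signed 16-bit integers, clipping the rest."""
    quantized = []
    for s in samples:
        s = max(-1.0, min(1.0, s))          # clip out-of-range values
        quantized.append(int(round(s * 32767)))
    return quantized


print(resample_ratio(44100, 16000))              # 160/441
print(to_mono([1.0, 0.0], [0.0, 0.0]))           # [0.5, 0.0]
print(quantize_16bit([0.0, 1.0, -1.0, 2.0]))     # [0, 32767, -32767, 32767]
```

The ratio 160/441 means that every 441 input samples at 44.1 kHz correspond to 160 output samples at 16 kHz.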

In this study we are concerned only with the audio signals, so the audio must be extracted from the video as waveform signals. An existing shareware tool, Audio Extracted (version 2.3), was used to extract and convert each video topic into an audio signal. This yields 253 topics as wave files (.wav), corresponding to the text documents listed in Table 3.4, to be re-sampled at 16 kHz and quantized to 16 bits per sample. The re-sampling was done with a sound editor, Cool Edit Pro version 2.0. According to the Nyquist theorem (Salam et al. 2000), a sound can be faithfully reproduced when the sampling rate is at least twice the highest frequency in the original sound. Using Cool Edit Pro, all the input signals were then manually segmented to obtain the isolated spoken Malay corpus used for training and testing.
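The Nyquist condition can be checked numerically for the sampling rate used here. This is a minimal sketch with hypothetical function names; the 4000 Hz figure is the upper edge of the speech band quoted in this section.

```python
def nyquist_frequency(sample_rate_hz):
    """Highest frequency that a given sampling rate can represent."""
    return sample_rate_hz / 2.0


def satisfies_nyquist(sample_rate_hz, max_signal_hz):
    """True when the sampling rate is at least twice the highest signal frequency."""
    return sample_rate_hz >= 2 * max_signal_hz


print(nyquist_frequency(16000))          # 8000.0
print(satisfies_nyquist(16000, 4000))    # True
print(satisfies_nyquist(6000, 4000))     # False
```

At 16 kHz the representable band extends to 8 kHz, which covers the 500 Hz to 4000 Hz range with margin.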

3.3.2 Acquisition of Malay Speech Dataset

The Malay speech dataset is divided into a training dataset and a testing dataset. There is no general rule for dividing a dataset into training and testing portions (Zhang et al. 1998), but the amount of data needed for training is generally larger than for testing. In the preliminary study, it was found that a neural network classifier with a Multi-layer Perceptron (MLP) needs more training data to learn well and achieve optimal performance. Hence, the numbers of training and testing samples were chosen following previous research (Salam et al. 2000; Kim & Kim 2001).

The purpose of this research is not to develop a full-scale speech recognizer with a large dataset, but to test new techniques by developing a prototype. With this goal in mind, all the approaches were tested on word recognition with a medium-sized dataset. All the word recognition accuracy experiments reported used 60% of the 5000 isolated spoken words (3000 words) for training and the remaining 40% (2000 words) for testing.
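A 60/40 split of this size can be sketched as follows. The chapter states only the overall proportions; splitting each word's 50 repetitions separately (30 for training, 20 for testing) is an assumption of this sketch, as are the function name `split_repetitions`, the fixed random seed and the placeholder word labels.

```python
import random


def split_repetitions(words, reps_per_word=50, train_fraction=0.6, seed=0):
    """Split each word's repetitions into training and testing items.

    Items are (word, repetition_index) pairs. The per-word split is an
    assumption; it keeps every word equally represented in both sets.
    """
    rng = random.Random(seed)
    n_train = round(reps_per_word * train_fraction)
    train, test = [], []
    for word in words:
        indices = list(range(reps_per_word))
        rng.shuffle(indices)
        train += [(word, i) for i in indices[:n_train]]
        test += [(word, i) for i in indices[n_train:]]
    return train, test


# Placeholder labels standing in for the 100 target words.
words = [f"word{k:03d}" for k in range(100)]
train, test = split_repetitions(words)
print(len(train), len(test))   # 3000 2000
```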

3.4 Summary

This chapter presented the creation of the isolated spoken Malay database, or corpus. The fundamentals of the Malay language and its rules were also presented as guidance for creating the corpus. Creating a Malay corpus is a challenging task because no spoken or natural speech corpus in Malay is available. Capturing and segmenting the audio signals is very time consuming; however, this manual work can serve as a starting point for other researchers, who may use the database as a reference pattern in speech recognition. Finally, the Malay database is used in the next stage of speech processing before proceeding to the recognition phase.
