Formal Methods Of Describing Syntax Information Technology Essay

The program language can be categorized into high level language and low level language. The low level language is for example machine language which the language is in instructions to be carried out straight to numeric code and addresses in computer. The high level language is for example C and Java which the language is used to carried out in code to translate software or user requirements. Therefore, compiler is used to translate the high level language or source program into low level language. The translation involves two main elements in the language which are syntax and semantics.

The model of compilation has six tools or elements:

Lexical analyzer: to decompose source code into tokens.

Syntax analyzer: to merge the tokens into statement, then into program which later checking the syntax.

Semantic analyzer: to check the meaning of the source code.

Intermediate code generator: to create an internal form of a program for example; a=b+c ƒ (+,b,c,L1) (=,L1,a)

Code optimizer: to optimize object program for the purpose of performance and reliability

Code generator: to transformed intermediate representation into target language.

Assignment

Read this interesting paper and write your own opinion on the below topic:

Wallace, D. R. and Fujii, R. U.: Software verification and validation: an overview. Software, IEEE, 6(3), 10-17. (1989)

The Syntax

Any language consists of a set of strings of characters. The combination of the strings makes sentences or statements. The syntax plays the role of putting rules to the sentences.

The syntax of a programming language is the form of its expressions, statements and program units. For example, the syntax of a Java while statement is

while (<boolean expression>) <statement>

The lowest level of the statement is when converted it into small units is called lexemes. The lexemes have lexical description to describe the lexemes which is separated from the syntactic description.

The lexemes are categorized by a token.

For example, an identifier is a token that can have lexemes, or instances, such as sum and total. In some cases, a token has only a single possible lexem. For example, the token for the arithmetic operator symbol +, which may have the name plus_op, has just one possible lexeme. Consider the following Java statement:

index = 2 * count + 17;

The lexemes and tokens of this statement are:

Lexemes Tokens

index identifier

= equal_sign

2 int_literal

* mult_op

count identifier

+ plus_op

17 int_literal

; semicolon

Figure 1: Lexems and Tokens

Language recognizers

Language recognizer is a machine to accept language of the source code. The input is a string of characters where the machine filters the string whether the language is acceptable or not. For example, if the string is recognized by language Java, the language recognizer will accept it and put it in accept state. If not, it will reject the string.

Language Generators

Language generator is a device to construct the sentences of a language. It has a construction description. If the language generator enables to construct all strings is language Java (for example), then the compiler will precede the next process.

Quiz

What is the different between syntax and semantic?

Answer: Syntax refers to formal rules governing the construction of valid statements in a language. Semantics refers to the set of rules which give the meaning of a statement.

Formal Methods of Describing Syntax

Syntax of programming languages uses grammars to describe the formal language mechanisms using BNF of context free grammar.

Context-free Grammars

Context free grammar originally comes from formal language theory. In computer science, context free grammar involves with programming language syntax which is tokens.

Backus-Naur Form (BNF)

In computer science, BNF (Backus Normal Form or Backus-Naur Form) is a notation technique for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document formats, instruction sets and communication protocols (Wikipedia).

Basic of Metalanguage

A metalanguage is a language used to explain and elaborate another language. In programming languages, BNF is the metalanguage.

BNF uses abstractions for syntactic structures. A simple Java assignment statement, for example, might be represented by the abstraction <assign>. The actual definition of <assign> may be given by

LHS – is the abstraction being defined

This statement is a rule

RHS – the definition of LHS, consist of token, lexem, and reference

(Noted: LHS is left hand side, RHS is right hand side)

Figure 2 : metalanguage

This particular rule specifies that the abstraction <assign> is defined as an instance of the abstraction <var>, followed by the lexeme =, followed by an instance of the abstraction <expression>.

Nonterminal symbols are abstraction in BNF description.

Terminal symbols are lexemes and tokens of the rules in BNF.

Collections of rule are grammar in BNF.

Nonterminal symbols have definitions that represent two or more syntactic forms in the language. If there is multiple definitions, it can be written in separate by using symbol |, as a single rule. For example, in Java for if-else statement

<if_stmt> ƒ if <logic_expr> else <stmt>

<if_stmt> ƒ if <logic_expr> else if <stmt> else <stmt>

Or, by joining them together to become

<if_stmt> ƒ if <logic_expr> else <stmt>

| if <logic_expr> else if <stmt> else <stmt>

Lists

BNF uses recursion in rule to represent lists. The rule is recursive if statement in LHS appears in RHS. For example:

<ident_list> ƒ identifier

| identifier, <ident_list>

This defines <ident_list> as either a single token (identifier) or an identifier followed by a comma followed by another instances of <ident_list>.

Grammars and derivations

A grammar starts with a special nonterminal of the grammar which is called the start symbol. The next sentence elaborating the start sentence is called a derivation. Usually, the start symbol starts with <program>. For example,

A grammar for a Small Language

<program > ƒ begin <stmt_list> end

<stmt_list> ƒ <stmt>

| <stmt>; <stmt_list>

<var> ƒ A | B | C

| <var> – <var>

| <var>

Figure 3 : Small language’s grammar

The language starts with begin and ends with end. The statements in the language are separated by using semicolon. The language only has one type of statement which is assignment. The assignment is assigning expression to a variable, var. The expression is either + or -. There are only three variables in the language which are A, B, C.

If we derive the language, it may become:

<program> => begin <stmt_list> end

=> begin <stmt> ; <stmt_list> end

=> begin <var> = <expression> ; <stmt_list> end

=> begin A = <expression> ; <stmt_list> end

=> begin A = <var> + <var> ; <stmt_list> end

=> begin A = B + <var> ; <stmt_list> end

=> begin A = B + C ; <stmt_list> end

=> begin A = B + C ; <stmt> end

=> begin A = B + C ; <var> = <expression> end

=> begin A = B + C ; B = <expression> end

=> begin A = B + C ; B = <var> end

=> begin A = B + C ; B = C end

This derivation starts with the <program> followed by => which means or reads asderives. Each successive string in the sequence is derived from the previous string by replacing one of the nonterminals with one of that nonterminal’s definitions. Another example,

A Grammar for Simple Assignment Statements

<id> ƒ A | B | C

| <id> * <expr>

| (<expr>)

| <id>

Figure 4 : Assignment’s grammar

The grammar above only derives assignment statements. It has id and expr as expression. id is variable A, B, or C. Expression is only addition, multiplication and parentheses. For the statement

A = B * ( A + C )

It is generated by the leftmost derivation as:

=> A = <expr>

=> A = <id> * <expr>

=> A = B * <expr>

=> A = B * (<expr>)

=> A = B * (<id> + <expr>)

=> A = B * (A + <expr>)

=> A = B * (A + <id>)

=> A = B * (A + C)

Parse Tree

Parse tree is a representation of syntactical grammars in hierarchical structure form. Every internal mode of a parse tree is labeled with a nonterminal symbol; every leaf is labeled with a terminal symbol. Every sub-tree of a parse tree describes one instance of an abstraction in the sentence.

If we take above statement, A = B * ( A + C ), below is its parse tree:

<id>

=

<expr>

<id>

*

<expr>

(

<expr>

)

<id>

+

<expr>

<id>

Figure 5 : parse tree

Ambiguity

Ambiguity happens in parse tree if there are two or more distinct parse tree. Let say we have below statement:

A = B + C * A

and below grammar for assignment statement:

<id> ƒ A | B | C

| <expr> * <expr>

| (<expr>)

| <id>

Therefore, the grammar will produce two distinct parse trees which it cannot be determined by compiler.

Figure 6 : Anambiguous parse tree

Operator Precedence

The problem in ambiguous grammar can be solved by having separate rule for addition and multiplication. When they are separated, they are maintained in higher to lower ordering, respectively in the parse tree. Below is the new grammar:

<id> ƒ A | B | C

| <term>

| <factor>

<factor> ƒ (<expr>)

| <id>

If we use above grammar for A = B + C * A, we get below derivation which is left most derivation:

=> A = <expr>

=> A = <expr> + <term>

=> A = <term> + <term>

=> A = <factor> + <term>

=> A = <id> + <term>

=> A = B + <term>

=> A = B + <term> * <factor>

=> A = B + <factor> * <factor>

=> A = B + <id> * <factor>

=> A = B + C * <factor>

=> A = B + C * <id>

=> A = B + C * A

The parse tree using unambiguous grammar is below:

<id>

=

<expr>

+

<term>

*

<id>

Figure 8 : parse tree of operator precedence

Test

What are categories of program language? State.

Answer: Two. High level and low level language

State the six tools of compilation process.

Answer

Lexical analyzer: to decompose source code into tokens.

Syntax analyzer: to merge the tokens into statement, then into program which later checking the syntax.

Semantic analyzer: to check the meaning of the source code.

Intermediate code generator: to create an internal form of a program for example; a=b+c ƒ (+,b,c,L1) (=,L1,a)

Code optimizer: to optimize object program for the purpose of performance and reliability

Code generator: to transformed intermediate representation into target language.

Write BNF descriptions for a Java class definition header statement.

Answer (vary):

(class definition)

(field definition)

(method definition)

(body statement)

(return type)

Using figure 3, show a parse tree and a leftmost derivation for each following statements:

A = A * (B+(C*A))

B = C * (A*C+B)

Order Now