Flex: A Fast Lexical Analyzer Generator
Vern Paxson took over the Software Tools lex project from Jef Poskanzer in 1982. At that point it was written in Ratfor. Around 1987 or so, Paxson translated it into C, and flex was born.
I. COMPILER
A compiler is a program that translates human readable source code into computer executable machine code. To do this successfully the human readable code must comply with the syntax rules of whichever programming language it is written in. The compiler is only a program and cannot fix your programs for you. If you make a mistake, you have to correct the syntax or it won’t compile.
The name “compiler” is primarily used for programs that translate source code from a high-level programming language to a lower-level language (e.g., assembly language or machine code). If the compiled program can run only on a computer whose CPU or operating system differs from the one on which the compiler itself runs, the compiler is known as a cross-compiler. A program that translates from a low-level language to a higher-level one is a decompiler. A program that translates between high-level languages is usually called a language translator, source-to-source translator, or language converter. A language rewriter is usually a program that translates the form of expressions without a change of language.
When code is compiled, the following stages take place:
II. Lexical Analysis
This is the first process where the compiler reads a stream of characters (usually from a source code file) and generates a stream of lexical tokens. For example, the C++ code
int C = (A*B) + 10;
might be analysed as these tokens:
type “int”
variable “C”
equals
left bracket
variable “A”
times
variable “B”
right bracket
plus
literal “10”
Lexical analysis or scanning is the process where the stream of characters making up the source program is read from left-to-right and grouped into tokens. Tokens are sequences of characters with a collective meaning. There are usually only a small number of tokens for a programming language: constants (integer, double, char, string, etc.), operators (arithmetic, relational, logical), punctuation, and reserved words.
The lexical analyzer takes a source program as input, and produces a stream of tokens as output. The lexical analyzer might recognize particular instances of tokens such as:
3 or 255 for an integer constant token
“Fred” or “Wilma” for a string constant token
numTickets or queue for a variable token
Such specific instances are called lexemes. A lexeme is the actual character sequence forming a token; the token is the general class that a lexeme belongs to. Some tokens have exactly one lexeme (e.g., the > character); for others, there are many lexemes (e.g., integer constants).
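For instance (the token names here are illustrative), the statement count = count + 3; might be broken down as follows:
lexeme “count” -> identifier token
lexeme “=” -> assignment-operator token
lexeme “3” -> integer constant token
lexeme “;” -> punctuation token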
fig 1. Lexical analyser
III. SYNTACTICAL ANALYSIS
The output from the lexical analyzer goes to the syntactic analyzer part of the compiler, which uses the rules of the grammar to decide whether the input is valid or not. Unless the variables A and B had been previously declared and were in scope, the compiler might report an error such as
‘A’: undeclared identifier.
Had they been declared but not initialized, the compiler would instead issue a warning such as
Local variable ‘A’ used without being initialized.
You should never ignore compiler warnings; the conditions they point to can break your code in weird and unexpected ways. Always fix compiler warnings.
The purpose of syntactic analysis is to determine the structure of the input text. This structure consists of a hierarchy of phrases, the smallest of which are the basic symbols and the largest of which is the sentence. It can be described by a tree with one node for each phrase. Basic symbols are represented by leaf nodes and other phrases by interior nodes. The root of the tree represents the sentence.
This paper explains how to use a `.con’ specification to describe the set of all possible phrases that could appear in sentences of a language. It also discusses methods of resolving ambiguity in such descriptions, and how to carry out arbitrary actions during the recognition process itself. The use of `.perr’ specifications to improve the error recovery of the generated parser is described as well.
Computations based on the input can be written with attribute grammar specifications that are based on an abstract syntax. The abstract syntax describes the structure of an abstract syntax tree, much the way the concrete syntax describes the phrase structure of the input. Eli uses a tool called Maptool that automatically generates the abstract syntax tree based on an analysis of the concrete and abstract syntaxes and user specifications given in files of type `.map’. This manual describes the rules used by Maptool to determine a unique correspondence between the concrete and abstract syntax, and the information users can provide in `.map’ files to assist in the process.
IV. Generating Machine Code
Assuming that the compiler has successfully completed these stages:
Lexical Analysis.
Syntactical Analysis.
The final stage is generating machine code. This can be an extremely complicated process, especially with modern CPUs. The compiled executable should run as fast as possible, and its speed can vary enormously according to
The quality of the generated code.
How much optimization has been requested.
Most compilers let you specify the amount of optimization. Typically none for debugging (quicker compiles!) and full optimization for the released code.
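As a rough illustration (assuming gcc or a gcc-compatible compiler; the file and program names are placeholders), the optimization level is selected with command-line flags:
gcc -O0 -g -o prog_debug prog.c    (no optimization, debug information: quick compiles)
gcc -O2 -o prog_release prog.c     (a typical optimization level for released code)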
V. Flex and Lexical Analysis
From the area of compilers, we get a host of tools to convert text files into programs. The first part of that process is often called lexical analysis, particularly for such languages as C. A good tool for creating lexical analyzers is flex. It takes a specification file and creates an analyzer, usually called lex.yy.c.
Flex: The Fast Lexical Analyzer
It is a tool for generating programs that perform pattern-matching on text. Flex is a free (but non-GNU) implementation of the original UNIX lex program.
Flex is a tool for generating scanners. A scanner, sometimes called a tokenizer, is a program which recognizes lexical patterns in text. The flex program reads user-specified input files, or its standard input if no file names are given, for a description of a scanner to generate. The description is in the form of pairs of regular expressions and C code, called rules. Flex generates a C source file named lex.yy.c, which defines the function yylex(). The file lex.yy.c can be compiled and linked to produce an executable. When the executable is run, it analyzes its input for occurrences of text matching the regular expressions of each rule. Whenever it finds a match, it executes the corresponding C code.
Flex fast lexical analyzer generator:
Is linked with its library (libfl.a) using -lfl as a compile-time option.
Can be called as yylex().
It is easy to interface with bison/yacc.
.l file -> flex -> lex.yy.c
lex.yy.c and other files -> gcc -> lexical analyzer
input stream -> lexical analyzer -> actions taken when rules are applied
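In terms of actual commands, the pipeline above corresponds to something like the following (the file names are illustrative):
flex scanner.l                  (produces lex.yy.c)
gcc lex.yy.c -lfl -o scanner    (compile and link against the flex library)
./scanner < input.txt           (run the generated lexical analyzer on an input stream)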
A. Simple Example
First some simple examples to get the flavour of how one uses flex. The following flex input specifies a scanner which whenever it encounters the string “username” will replace it with the user’s login name:
%%
username printf( "%s", getlogin() );
By default, any text not matched by a flex scanner is copied to the output, so the net effect of this scanner is to copy its input file to its output with each occurrence of “username” expanded. In this input, there is just one rule. “username” is the pattern and the “printf” is the action. The “%%” marks the beginning of the rules.
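A complete, self-contained version of this scanner might look as follows. This is a minimal sketch: the %{ %} block, the %option noyywrap line, and the main() driver are additions so the file builds on its own without the flex library.

%{
#include <stdio.h>
#include <unistd.h>   /* getlogin() */
%}
%option noyywrap
%%
username    printf( "%s", getlogin() );
%%
int main( void )
{
    yylex();          /* copy input to output, expanding "username" */
    return 0;
}

Saved as, say, expand.l, it can be built with flex expand.l followed by cc lex.yy.c -o expand.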
B. Flex Regular Expression
"s" string s literally
\c character c literally, where c would normally be a lex operator
[s] character class
^ indicates beginning of line
[^s] characters not in character class
[s-t] range of characters
s? s occurs zero or one time
. any character except newline
s* zero or more occurrences of s
s+ one or more occurrences of s
r|s r or s
(s) grouping
$ end of line
s/r s iff followed by r (not recommended) (r is *NOT* consumed)
s{m,n} m through n occurrences of s
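As an illustration of how these operators combine (the patterns below are examples, not part of the table), the following match a C-style identifier, an optionally signed integer, and a simple double-quoted string:
[A-Za-z_][A-Za-z0-9_]*    an identifier: a letter or underscore, then letters, digits, or underscores
-?[0-9]+                  an optionally signed integer constant
\"[^\"\n]*\"              a double-quoted string with no embedded quotes or newlines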
C. Flex input file format
• The flex input file consists of three sections, separated by a line with just `%%’ in it:
definitions
%%
rules
%%
user code
VI. WORKING OF FLEX
Flex is a program generator that produces source code for recognizing regular expressions when given pattern specifications for input. The specifications allow an action to be associated with each input pattern. A Flex-produced DFA (deterministic finite automaton) performs the recognition of regular expressions. Flex is able to deal effectively with ambiguous expressions by always choosing the longest matching string in the input stream.
Flex transforms the user’s input table of regular expressions and actions into a function called yylex(). The yylex() function, when incorporated into your host-language source program, performs each action as the associated pattern is recognized. Flex produces its output as C or C++ source code; in either case, the yylex() function incorporates the highly efficient string-matching routines of Aho and Corasick (Communications of the ACM, No. 18, 1975).
The yylex() function produced by Lex will generally require time proportional to the length of the input stream. This function is linear with respect to the input and independent of the number of rules. As the number and complexity of rules increases, yylex() will tend to increase in size only. Speed will have to decrease when the input rules require extensive forward scanning of input.
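For example (a hypothetical pair of rules; the return values are placeholders), with the two rules below the input ifdef is matched by the identifier rule rather than the keyword rule, because the identifier rule matches a longer string:

if        return 1;   /* hypothetical keyword token code    */
[a-z]+    return 2;   /* hypothetical identifier token code */

When both rules match text of the same length, as for the input if, the rule listed first in the specification wins.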
fig 2. Working of flex
A. Flex Actions
Actions are C source fragments. If an action is compound, or spans more than one line, enclose it in braces ({ }).
Example rules:
[a-z]+      printf( "found word\n" );
[A-Z][a-z]* {
    printf( "found capitalized word:\n" );
    printf( "    '%s'\n", yytext );
}
There are a number of special directives which can be included within an action:
ECHO
Copies yytext to the scanner’s output.
BEGIN
Followed by the name of a start condition, places the scanner in the corresponding start condition.
REJECT
Directs the scanner to proceed on to the “second best” rule which matched the input (or a prefix of the input). The rule is chosen as described above in Matching, and yytext and yyleng are set up appropriately. It may either be one which matched as much text as the originally chosen rule but came later in the flex input file, or one which matched less text. For example, the following will both count the words in the input and call the routine special() whenever frob is seen:
int word_count = 0;
%%
frob special(); REJECT;
[^ \t\n]+ ++word_count;
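Wrapped into a complete specification, the fragment above might look like this. It is only a sketch: the %{ %} block, %option noyywrap, the special() stub, the whitespace rule, and the main() driver are assumptions added so it compiles on its own.

%{
#include <stdio.h>
int word_count = 0;
void special( void ) { printf( "saw frob\n" ); }   /* stand-in for the real routine */
%}
%option noyywrap
%%
frob        special(); REJECT;
[^ \t\n]+   ++word_count;
.|\n        ;                      /* discard whitespace and anything else */
%%
int main( void )
{
    yylex();
    printf( "%d words\n", word_count );
    return 0;
}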
B. Flex Definitions
The form is simply:
name definition
The name is just a word beginning with a letter or an underscore (though I don’t recommend underscores for general use), followed by zero or more letters, digits, underscores, or dashes.
The definition actually goes from the first non-whitespace character to the end of line. You can refer to it via {name}, which will expand to (definition). (cite: this is largely from “man flex”.)
For example:
DIGIT [0-9]
Now if you have a rule that looks like
{DIGIT}*"."{DIGIT}+
that is the same as writing
([0-9])*"."([0-9])+
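A slightly larger set of definitions and rules, in the spirit of the example in the flex manual (the ID name and the printf actions are illustrative):

DIGIT    [0-9]
ID       [a-z][a-z0-9]*
%%
{DIGIT}+              printf( "integer: %s\n", yytext );
{DIGIT}*"."{DIGIT}+   printf( "float: %s\n", yytext );
{ID}                  printf( "identifier: %s\n", yytext );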
VII. GENERATED SCANNER
The output of flex is the file lex.yy.c, which contains the scanning routine yylex(), a number of tables used by it for matching tokens, and a number of auxiliary routines and macros. By default, yylex() is declared as follows:
int yylex()
{
… various definitions and the actions in here …
}
(If your environment supports function prototypes, then it will be “int yylex( void )”.) This definition may be changed by defining the “YY_DECL” macro. For example, you could use:
#define YY_DECL float lexscan (a, b) float a, b;
to give the scanning routine the name lexscan, returning a float, and taking two floats as arguments. Note that if you give arguments to the scanning routine using a K&R-style/non-prototyped function declaration, you must terminate the definition with a semi-colon (;).
Whenever yylex() is called, it scans tokens from the global input file yyin (which defaults to stdin). It continues until it either reaches an end-of-file (at which point it returns the value 0) or one of its actions executes a return statement.
If the scanner reaches an end-of-file, subsequent calls are undefined unless either yyin is pointed at a new input file (in which case scanning continues from that file), or yyrestart() is called. yyrestart() takes one argument, a FILE * pointer (which can be nil, if you’ve set up YY_INPUT to scan from a source other than yyin), and initializes yyin for scanning from that file. Essentially there is no difference between just assigning yyin to a new input file or using yyrestart() to do so; the latter is available for compatibility with previous versions of flex, and because it can be used to switch input files in the middle of scanning. It can also be used to throw away the current input buffer, by calling it with an argument of yyin; but better is to use YY_FLUSH_BUFFER (see above). Note that yyrestart() does not reset the start condition to INITIAL.
If yylex() stops scanning due to executing a return statement in one of the actions, the scanner may then be called again and it will resume scanning where it left off.
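For instance (a sketch showing the tail of a specification; the token code 256 and the rule are illustrative), a scanner whose rules return token codes is usually driven by a loop such as:

[0-9]+    return 256;              /* rules section: return a token code for each number */
%%
int main( void )
{
    int tok;
    while ( (tok = yylex()) != 0 )   /* yylex() returns 0 at end-of-file */
        printf( "token %d, lexeme '%s'\n", tok, yytext );
    return 0;
}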
By default (and for purposes of efficiency), the scanner uses block-reads rather than simple getc() calls to read characters from yyin. The nature of how it gets its input can be controlled by defining the YY_INPUT macro. YY_INPUT’s calling sequence is “YY_INPUT(buf,result,max_size)”. Its action is to place up to max_size characters in the character array buf and return in the integer variable result either the number of characters read or the constant YY_NULL (0 on Unix systems) to indicate EOF. The default YY_INPUT reads from the global file-pointer “yyin”.
A sample definition of YY_INPUT (in the definitions section of the input file):
%{
#define YY_INPUT(buf,result,max_size) \
    { \
    int c = getchar(); \
    result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \
    }
%}
This definition will change the input processing to occur one character at a time.
When the scanner receives an end-of-file indication from YY_INPUT, it then checks the yywrap() function. If yywrap() returns false (zero), then it is assumed that the function has gone ahead and set up yyin to point to another input file, and scanning continues. If it returns true (non-zero), then the scanner terminates, returning 0 to its caller. Note that in either case, the start condition remains unchanged; it does not revert to INITIAL.
If you do not supply your own version of yywrap(), then you must either use %option noyywrap (in which case the scanner behaves as though yywrap() returned 1), or you must link with -lfl to obtain the default version of the routine, which always returns 1.
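A user-supplied yywrap() that chains in a second input file might look like this (a sketch placed in the user code section of the specification; the file name and the static flag are assumptions):

int yywrap( void )
{
    static int second_file_done = 0;
    if ( !second_file_done )
    {
        second_file_done = 1;
        yyin = fopen( "extra-input.txt", "r" );   /* hypothetical file name */
        if ( yyin != NULL )
            return 0;    /* 0: more input is available, keep scanning */
    }
    return 1;            /* 1: really done, the scanner terminates */
}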
Three routines are available for scanning from in-memory buffers rather than files: yy_scan_string(), yy_scan_bytes(), and yy_scan_buffer().
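For example, yy_scan_string() sets up a buffer over a NUL-terminated string (a minimal sketch; the string itself is illustrative):

YY_BUFFER_STATE buf = yy_scan_string( "username logged in" );
yylex();                    /* scan the in-memory string instead of yyin */
yy_delete_buffer( buf );    /* release the buffer when finished */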
The scanner writes its ECHO output to the yyout global (default, stdout), which may be redefined by the user simply by assigning it to some other FILE pointer.
VIII. OPTIONS
-b  Generate backing-up information to lex.backup. This is a list of scanner states which require backing up and the input characters on which they do so. By adding rules one can remove backing-up states. If all backing-up states are eliminated and -Cf or -CF is used, the generated scanner will run faster (see the -p flag). Only users who wish to squeeze every last cycle out of their scanners need worry about this option.
-c  is a do-nothing, deprecated option included for POSIX compliance.
-f  specifies fast scanner. No table compression is done and stdio is bypassed. The result is large but fast. This option is equivalent to -Cfr.
-h  generates a “help” summary of flex’s options to stdout and then exits. -? and --help are synonyms for -h.
-n  is another do-nothing, deprecated option included only for POSIX compliance.
-B  instructs flex to generate a batch scanner, the opposite of the interactive scanners generated by -I. In general, you use -B when you are certain that your scanner will never be used interactively and you want to squeeze a little more performance out of it. If your goal is instead to squeeze out a lot more performance, you should use the -Cf or -CF options, which turn on -B automatically anyway.
IX. LEX
In computer science, lex is a program that generates lexical analyzers (“scanners” or “lexers”). Lex is commonly used with the yacc parser generator. Lex, originally written by Mike Lesk and Eric Schmidt, is the standard lexical analyzer generator on many Unix systems, and a tool exhibiting its behavior is specified as part of the POSIX standard.
Lex reads an input stream specifying the lexical analyzer and outputs source code implementing the lexer in the C programming language.
Though traditionally proprietary software, versions of Lex based on the original AT&T code are available as open source, as part of systems such as OpenSolaris and Plan 9 from Bell Labs. Another popular open source version of Lex is Flex, the “fast lexical analyzer”.
X. STRUCTURE OF LEX
The structure of a lex file is intentionally similar to that of a yacc file; files are divided up into three sections, separated by lines that contain only two percent signs, as follows:
Definition section
%%
Rules section
%%
C code section
The definition section is the place to define macros and to import header files written in C. It is also possible to write any C code here, which will be copied verbatim into the generated source file.
The rules section is the most important section; it associates patterns with C statements. Patterns are simply regular expressions. When the lexer sees some text in the input matching a given pattern, it executes the associated C code. This is the basis of how lex operates.
The C code section contains C statements and functions that are copied verbatim to the generated source file. These statements presumably contain code called by the rules in the rules section. In large programs it is more convenient to place this code in a separate file and link it in at compile time.
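Putting the three sections together, a small but complete specification might look like this (a sketch; the rules and the counter are illustrative):

%{
/* Definition section: C declarations copied verbatim into the generated source. */
#include <stdio.h>
int line_count = 0;
%}
%%
\n          ++line_count;                          /* rules: pattern followed by action */
[0-9]+      printf( "number: %s\n", yytext );
.           ;                                      /* ignore everything else */
%%
/* C code section: typically main() and helper functions. */
int main( void )
{
    yylex();
    printf( "%d lines\n", line_count );
    return 0;
}

Built with lex file.l (or flex file.l) and cc lex.yy.c -ll (or -lfl), this produces a program that counts lines and echoes each number it finds.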
XI. INTRODUCTION TO TOKENS AND LEXEMES
Suppose you’re not only reading data files but reading (and perhaps interpreting) a scripting language input file, such as Perl or VB source code. Lexical analysis is the lowest level translation activity. The purpose of a lexical analyzer or scanner is to convert an incoming stream of characters into an outgoing stream of tokens. The scanner operates by matching patterns of characters into lexemes. Each pattern describes what an instance of a particular token must match. For example, a common pattern for an identifier (for example, user-specified variable or constant) in a script language is a letter followed by one or more occurrences of a letter or digit. Some lexemes that would match this pattern are index, sum, and i47.
Things that your input stream defines as useless, such as white space and comments, are not lexemes and can be safely discarded by the scanner. Several classes of tokens are found in the definitions of most script languages.
TABLE 1
Typical Tokens
Keywords: reserved words (such as procedure and return) that cannot be redefined
Operators: typically short strings (1-3 characters) such as /, >=, and >>= used in expressions
Identifiers: user-specified objects similar to keywords in form
Numbers: integer, real, or double-precision values, as specified
Character constants: single characters such as c or \n
Character strings: zero or more characters, stored differently than character constants
EOLN and EOF: logical end-of-line and end-of-input markers
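A flex fragment that recognizes a few of these token classes might look as follows (a sketch; a real scanner would list every keyword and operator of its language, and would usually return token codes rather than print):

procedure|return        printf( "keyword: %s\n", yytext );
[A-Za-z][A-Za-z0-9]*    printf( "identifier: %s\n", yytext );
[0-9]+(\.[0-9]+)?       printf( "number: %s\n", yytext );
">="|">>="|"/"          printf( "operator: %s\n", yytext );
\n                      printf( "EOLN\n" );
[ \t]+                  ;                          /* whitespace between tokens is discarded */

Because the keyword rule appears before the identifier rule, procedure is reported as a keyword even though both patterns match it.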
XII. Lexical analysis terms
A token is a group of characters having collective meaning.
A lexeme is an actual character sequence forming specific instance of a token, such as num.
A pattern is a rule expressed as a regular expression and describing how a particular token can be formed.
For example, [A-Za-z][A-Za-z_0-9]* is a rule.
Characters between tokens are called whitespace; these include spaces, tabs, newlines, and form feeds. Many people also count comments as whitespace, though since some tools such as lint/splint look at comments, this conflation is not perfect.
Attributes for tokens
Tokens can have attributes that can be passed back to the calling function.
Constants could have the value of the constant, for instance.
Identifiers might have a pointer to a location where information is kept about the identifier.
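With a yacc or bison grammar, these attributes are conventionally passed through the global yylval (a sketch; the NUMBER and IDENT token codes, the %union members ival and symp, and the install_id() helper are assumptions):

[0-9]+           { yylval.ival = atoi( yytext );        /* attribute: the numeric value  */
                   return NUMBER; }
[a-z][a-z0-9]*   { yylval.symp = install_id( yytext );  /* attribute: symbol-table entry */
                   return IDENT; }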
XIII. CONCLUSION
Flex generates C99 function definitions by default. However, flex does have the ability to generate obsolete, er, traditional, function definitions. This is to support bootstrapping gcc on old systems. Unfortunately, traditional definitions prevent us from using any standard data types smaller than int (such as short, char, or bool) as function arguments. For this reason, future versions of flex may generate standard C99 code only, leaving K&R-style functions to the historians.