Introduction

The compiler’s scanner unit is the lexical analyzer or the first step of Lexical Analysis. It converts the input program’s high-level grammar into a token sequence in operation called tokenization. Thus as in the diagram shown below, the output of the compiler is a token-sequence sent to the parser for syntax correction and analysis in a process known as lexical analysis in compiler design implemented with finite deterministic automata. 

  1. What is lexical analysis?
  2. What is a token?
  3. Example of tokens
  4. Example of Non-Tokens

1) What is lexical analysis?

Lexical analysis groups sounds, symbols, or token into a string of or sequence of tokens/units, forming a syntax that is meaningful and known as a parsing operation in linguistics. However, computer science calls this process parsing and/or tokenizing, depending on its use. The differences between the parser and output of the lexical analyzer are that the parser is used for the analysis of the syntax while the lexical analyzer scans the program’s input.

The lexer issues tokens, whereas the parser provides a code’s abstract representation. The parser updates the symbol table entries, whereas the lexical analysis process inserts tokens into the symbol table. A parse tree is produced as the output of the parser, whereas the lexical analyzer provides lexical errors or issues in the lexical analysis as its output.

 2) What is a token?

The lexical token is the output of the lexical analyzer and is a string having a unit of the grammar used in the programming language represented as a sequence of characters. The role of the lexical analyzer and the process adopted by the Lexical Analyzer is that it first tokenizes the program dividing it into valid tokens and removing all characters of white space. It also removes comments and helps in providing the lexical errors based on the column and row numbers of the input data. Take, for example, the below diagram. This lexical analyzer example identifies lexical errors using the grammar of C or C and the automation machine providing the column and row number of the errors.

 If we pass a lexical analysis example statement like a= c b through a lexical analyzer, the output it generates is a token sequence that appears as id=id id; where each of the LHS and RHS terms represent the details of a variable in its symbol table.

Now take the program below to explain the process of lexical analysis.

int main()

{

  // 2 variables

  int b, a;

  b = 20;

 return 0;

}

The valid tokens that are generated after recognition of tokens are ‘main’, ‘int’,'{‘,'(‘  ‘)’,’int’, ‘b’, ‘a’, ‘,’, ‘;’,’b’,’=’,’20’,’;’, ‘return’, ‘0’,’;’ and’}’. 

Comments have been omitted. 

Look at another example illustrated below as the printf statement. –

Here the printf statement has five valid tokens. The standard compiler used is the JavaCC compiler, which forms the lexical analysis tools of the lexer scanner and parser all-in-one and can be used for both parsing and lexical analyzer operations. The use of lexical analysis has certain advantages like simple design, separation of the syntax and lexical analysis parts allowing one to simplify either of these operations and enhancing the portability and efficiency of the compiler.

3) Example of tokens

There are 3 types or specification of tokens examples, namely

Type tokens like number, real, id etc.

Punctuation token like void, IF, return, etc

Alphabetic tokens like keywords etc.

 These groups can be understood by the lexical analysis examples below.

Keywords like while, for, if etc.

Identifiers like function name, variable name,etc.

Operators like ‘  ‘, ‘ ‘, ‘-‘ etc.

Separators like ‘;’ ‘,’’.’etc.

Here’s an example of working out. Count number of tokens in int max(int i)

  • The Lexical analyzer reads int and accepts and outputs a valid token.
  • It then reads max, and since it accepts this as a valid function name issues a valid token.
  • This process continues until it finally issues seven tokens.
  • The tokens are int, (, max, int, ), i and; not in that sequence of course!

4) Example of Non-Tokens

Having commented on tokenization and the various tokens, one will also need to define the non-tokens in the design of the lexical analyzer. These are part of the input but not issued valid tokens. For Ex: spaces between words, comments, macros, preprocessor directive, blanks, newline, tabs, etc. A Lexeme or token is used to refer to the sequence of characters matched to a form using a pattern aka token. The corresponding sequence of characters in the input that forms a single token is a lexeme. A Lexical error is the sequence of input characters not matching the token patterns in lexical analysis. Lexical phase errors are identified in the lexical analysis program’s execution.

Conclusion

The above article has explained lexical analysis, the role of lexical analysis, the process of tokenization, examples, and non-examples of tokens and the differences between syntax and lexical analysis. Lexical analysis is used with the source code 1 character at a time which it then converts into tokens or meaningful lexemes. In syntax analysis, the tokens generated are output as the parse tree.

If you are interested in making it big in the world of data and evolve as a Future Leader, you may consider our Integrated Program in Business Analytics, a 10-month online program, in collaboration with IIM Indore!

Also Read

SHARE