Anuj Mehta's Blog: Generic CLI API for System/Network Management Systems

Proof-Of-Concept: Parsing of command output using ANTLR

As a proof of concept, the Gen-CLI API uses the libraries of ANTLR, a freely available parser generator, to generate code from rules. ANTLR stands for ANother Tool for Language Recognition. It is a sophisticated parser generator that can be used to implement language interpreters, compilers and other translators. The Gen-CLI API takes templates of the command syntax and the command response as input. These templates (rules) are expressed in EBNF (Extended Backus-Naur Form) notation.

Below is the output of the route command on Linux.

root@Server:/home/user# route
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
172.31.0.0      *               255.255.0.0     U     0      0        0 eth0
link-local      *               255.255.0.0     U     1000   0        0 eth0
default         172.31.44.1     0.0.0.0         UG    0      0        0 eth0

Grammar for parsing output
Below is the grammar for parsing the response of the ‘route’ command.

grammar routeResponse;
options {
language=Java;
}

start : 'Kernel IP routing table' theTitles+=TITLE+ rowValue*;

rowValue: rowData+=ADDR+ rowData+=FLAG rowData+=INT+ rowData+=IFACE;

ADDR : 'link-local'
| 'default'
| 'localhost'
| IP_ADDR
;

TITLE : 'Destination'|'Gateway'|'Genmask'|'Flags'|'Metric'|'Ref'|'Use'|'Iface';

fragment
IP_ADDR : INT '.' INT '.' INT '.' INT
| '*'
;

FLAG : ('U' //route is up
| 'H' //target is a host
| 'G' //use gateway
| 'R' //reinstate route for dynamic routing
| 'D' //dynamically installed by daemon or redirect
| 'M' //modified from routing daemon or redirect
| 'A' //installed by addrconf
| 'C' //cache entry
| '!')+ //reject route
;

INT : '0'..'9'+;

IFACE : ('0'..'9'|'a'..'z')+;

WS : (' '|'\r'|'\t'|'\n')+ {skip();};

Here routeResponse is the grammar name, declared by ‘grammar routeResponse’. In the grammar, start and rowValue are the two parser rules. Parser rules specify the grammatical structure. ADDR, TITLE, IP_ADDR, FLAG, INT, IFACE and WS are lexical rules. Lexical rules specify the tokens.
In this grammar the parser rule start is the entry point of execution. This rule parses and stores all the titles, then invokes the parser rule rowValue, which parses and stores a row. The parser rules use the lexical rules to get tokens.

In this example we ignore all spaces, tabs and newlines in the command response, so there is a lexical rule for them:

WS : (' '|'\r'|'\t'|'\n')+ {skip();};

Here WS (White Space) is the rule name. It looks for a space, carriage return, tab or newline character and issues an action that calls the skip() method, asking ANTLR to throw away the matched characters. The characters are surrounded by parentheses and followed by a plus sign, i.e. ( )+, which means one or more consecutive occurrences of these characters are matched.
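The effect of the WS rule can be imitated in plain Java, outside ANTLR. The sketch below is not the code ANTLR generates; it only assumes that throwing away runs of space, carriage return, tab and newline leaves the remaining text as separate tokens. The class name WhitespaceSkipSketch and the method tokens() are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not generated by ANTLR: imitates the effect of the WS rule.
final class WhitespaceSkipSketch {

    // Splits on one or more of ' ', '\r', '\t', '\n',
    // mirroring (' '|'\r'|'\t'|'\n')+ followed by skip().
    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        for (String piece : input.split("[ \\r\\t\\n]+")) {
            if (!piece.isEmpty()) {   // a leading separator yields one empty piece
                result.add(piece);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // One data row of the route output, with mixed separators.
        System.out.println(tokens("default 172.31.44.1\t0.0.0.0\r\nUG 0 0 0 eth0\n"));
    }
}
```

One difference from the real lexer: the heading 'Kernel IP routing table' is a single literal token in the grammar, whereas this sketch would cut it into four pieces.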

As mentioned earlier, start is the entry point of execution. It is defined as follows:

start : 'Kernel IP routing table' theTitles+=TITLE+ rowValue*;

This rule can be depicted pictorially as:

[Figure: diagram of the start rule]

Here the parser first looks for the heading ‘Kernel IP routing table’. On finding the heading, it looks for a list of one or more titles and stores them in theTitles. For this rule, the generated parser code creates an ArrayList (for Java) and stores all the matched titles in it.

In the statement
theTitles+=TITLE+
theTitles is a list for storing titles and TITLE is a lexical rule. “TITLE+” means match one or more tokens (the matching rule is defined in TITLE) and “theTitles+=” means store every matched title in the list.

The lexical rule TITLE is defined as:

TITLE : 'Destination'|'Gateway'|'Genmask'|'Flags'|'Metric'|'Ref'|'Use'|'Iface';

Here TITLE defines a number of headings separated by OR, i.e. ‘|’, which means match any one of these headings. Control returns to the calling rule once a match succeeds.
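To make “theTitles+=TITLE+” concrete, here is a hand-written Java sketch of the same idea: walk the tokens, keep every one that is a TITLE alternative, and stop at the first one that is not. This is an illustration only, not the parser ANTLR generates; the class TitleListSketch and the method collectTitles() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of "theTitles+=TITLE+"; not ANTLR-generated code.
final class TitleListSketch {

    // The eight alternatives of the TITLE lexical rule.
    static final List<String> TITLES = Arrays.asList(
            "Destination", "Gateway", "Genmask", "Flags",
            "Metric", "Ref", "Use", "Iface");

    // Collects the leading run of TITLE tokens into a list.
    // Throws if the run is empty, because '+' demands at least one match.
    static List<String> collectTitles(List<String> tokens) {
        List<String> theTitles = new ArrayList<>();
        for (String token : tokens) {
            if (!TITLES.contains(token)) {
                break;                 // the first non-title token ends the run
            }
            theTitles.add(token);      // the "+=" part: append to the list
        }
        if (theTitles.isEmpty()) {
            throw new IllegalArgumentException("expected at least one TITLE");
        }
        return theTitles;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "Destination", "Gateway", "Genmask", "Flags",
                "Metric", "Ref", "Use", "Iface", "172.31.0.0", "*");
        System.out.println(collectTitles(tokens));
    }
}
```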

After matching the titles, the parser rule start expects zero or more rows, hence it contains ‘rowValue*’, where * denotes zero or more occurrences. Here rowValue is a parser rule which is defined as follows:

rowValue: rowData+=ADDR+ rowData+=FLAG rowData+=INT+ rowData+=IFACE;

Hence each row (after the titles) consists of the following:

• One or more ADDR.
• One FLAG.
• One or more integers (INT).
• One interface (IFACE).

Here ADDR is a lexical rule which is defined as

ADDR : 'link-local'
'default'
'localhost'
IP_ADDR
;

ADDR uses a “fragment” lexical rule, IP_ADDR. IP_ADDR is called a fragment rule because it is never referenced by a parser rule; it is only used by another lexical rule. The definition of IP_ADDR is

fragment
IP_ADDR : INT '.' INT '.' INT '.' INT
| '*'
;

Thus IP_ADDR can be either of the following:
• INT.INT.INT.INT, e.g. 192.168.10.55
• *
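The two rules translate fairly directly into a regular expression, which makes it easy to try out which strings they accept. The following is a sketch under that assumption, using java.util.regex rather than ANTLR; AddrSketch and isAddr() are names invented for this example.

```java
import java.util.regex.Pattern;

// Hypothetical regex rendering of the ADDR and IP_ADDR rules; not ANTLR output.
final class AddrSketch {

    // fragment IP_ADDR : INT '.' INT '.' INT '.' INT | '*'
    static final String IP_ADDR = "[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+|\\*";

    // ADDR : 'link-local' | 'default' | 'localhost' | IP_ADDR
    static final Pattern ADDR =
            Pattern.compile("link-local|default|localhost|" + IP_ADDR);

    // True when the whole token is accepted by the ADDR rule.
    static boolean isAddr(String token) {
        return ADDR.matcher(token).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"192.168.10.55", "*", "default", "eth0", "192.168.10"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isAddr(sample));
        }
    }
}
```

Like the grammar, the pattern only checks the shape of an address. INT is any run of digits, so a string such as 999.1.1.1 is also accepted.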

INT is a lexical rule used both in IP_ADDR and in the parser rule rowValue, hence it is not a fragment rule. It is defined as

INT : '0'..'9'+;

Here INT is defined as one or more digits, each between 0 and 9.

Now back to the explanation of ADDR. If we look at the output of the route command, Destination can be an IP_ADDR, ‘link-local’, ‘default’ or ‘localhost’. Gateway and Genmask are each an IP_ADDR, which is one of the alternatives of ADDR. Hence, to parse the Destination, Gateway and Genmask values of each row, the parser rule rowValue contains “rowData+=ADDR+”: the parser expects one or more ADDR tokens and stores the scanned tokens in the list “rowData”.
The next value in a row of the “route” output is Flags. In the parser rule rowValue it is written as “rowData+=FLAG”, which means exactly one FLAG token is expected.

FLAG : ('U' //route is up
| 'H' //target is a host
| 'G' //use gateway
| 'R' //reinstate route for dynamic routing
| 'D' //dynamically installed by daemon or redirect
| 'M' //modified from routing daemon or redirect
| 'A' //installed by addrconf
| 'C' //cache entry
| '!')+ //reject route
;

Here a FLAG token is a combination of one or more of the flag characters listed above, e.g. UG.
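Once a FLAG token such as UG has been scanned, a caller will usually want to know what each character means. The sketch below shows one possible way to do that in Java, taking the descriptions from the comments in the FLAG rule. It is not part of the grammar or of the Gen-CLI API; FlagSketch and describe() are hypothetical names.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical decoder for a FLAG token; descriptions come from the grammar comments.
final class FlagSketch {

    static final Map<Character, String> MEANINGS = new LinkedHashMap<>();
    static {
        MEANINGS.put('U', "route is up");
        MEANINGS.put('H', "target is a host");
        MEANINGS.put('G', "use gateway");
        MEANINGS.put('R', "reinstate route for dynamic routing");
        MEANINGS.put('D', "dynamically installed by daemon or redirect");
        MEANINGS.put('M', "modified from routing daemon or redirect");
        MEANINGS.put('A', "installed by addrconf");
        MEANINGS.put('C', "cache entry");
        MEANINGS.put('!', "reject route");
    }

    // Returns one description per character of the token, in order.
    // Rejects an empty token and any character outside the FLAG rule.
    static List<String> describe(String flagToken) {
        if (flagToken.isEmpty()) {
            throw new IllegalArgumentException("FLAG needs at least one character");
        }
        List<String> descriptions = new ArrayList<>();
        for (char c : flagToken.toCharArray()) {
            String meaning = MEANINGS.get(c);
            if (meaning == null) {
                throw new IllegalArgumentException("not a FLAG character: " + c);
            }
            descriptions.add(meaning);
        }
        return descriptions;
    }

    public static void main(String[] args) {
        System.out.println(describe("UG"));
    }
}
```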
After Flags, the Metric, Ref and Use values in the output of the “route” command are all integers, hence in rowValue this part is written as
“rowData+=INT+”
i.e. the parser expects one or more integer values. The scanned tokens are stored in the list rowData.

The last value in a row is the interface, i.e. Iface, which is alphanumeric. Hence IFACE is defined as

IFACE : ('0'..'9'|'a'..'z')+;
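Putting the pieces together, the shape that rowValue demands (ADDR+ FLAG INT+ IFACE) can be checked by hand in a few lines of Java. This is only a sketch of the idea, assuming the row has already been split into whitespace-free tokens and using regular expressions in place of the lexical rules; it is not the parser ANTLR generates. RowValueSketch, parseRow(), one() and oneOrMore() are made-up names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical hand-written check of the rowValue shape; not ANTLR output.
final class RowValueSketch {

    // Regex stand-ins for the lexical rules of the grammar.
    static final String ADDR =
            "link-local|default|localhost|[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+|\\*";
    static final String FLAG = "[UHGRDMAC!]+";
    static final String INT = "[0-9]+";
    static final String IFACE = "[0-9a-z]+";

    // rowValue: rowData+=ADDR+ rowData+=FLAG rowData+=INT+ rowData+=IFACE;
    static List<String> parseRow(List<String> tokens) {
        List<String> rowData = new ArrayList<>();
        int pos = 0;
        pos = oneOrMore(tokens, pos, ADDR, rowData);
        pos = one(tokens, pos, FLAG, rowData);
        pos = oneOrMore(tokens, pos, INT, rowData);
        pos = one(tokens, pos, IFACE, rowData);
        if (pos != tokens.size()) {
            throw new IllegalArgumentException("extra token at position " + pos);
        }
        return rowData;
    }

    // Consumes exactly one token matching the rule and returns the next position.
    private static int one(List<String> tokens, int pos, String rule, List<String> rowData) {
        if (pos >= tokens.size() || !tokens.get(pos).matches(rule)) {
            throw new IllegalArgumentException("unexpected token at position " + pos);
        }
        rowData.add(tokens.get(pos));
        return pos + 1;
    }

    // Consumes one token, then as many further matching tokens as follow.
    private static int oneOrMore(List<String> tokens, int pos, String rule, List<String> rowData) {
        int next = one(tokens, pos, rule, rowData);
        while (next < tokens.size() && tokens.get(next).matches(rule)) {
            rowData.add(tokens.get(next));
            next++;
        }
        return next;
    }

    public static void main(String[] args) {
        System.out.println(parseRow(Arrays.asList(
                "default", "172.31.44.1", "0.0.0.0", "UG", "0", "0", "0", "eth0")));
    }
}
```

A row with a missing column, for example one without the flags, fails at the position where the expected token type is absent.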

Dynamic parsing code corresponding to grammar
The grammar described above is given as input to the Gen-CLI API to define the format of the command response. Based on the input grammar, ANTLR generates a number of files.
The noteworthy ones are the lexer and parser files. These two files can be generated in a number of languages such as C, C++, Java, C#, Objective C and Python. The choice of language is specified in the “options” block of the grammar.

options {
language=Java;
}

By default the code is generated in Java. Below is a description of the parser and lexer files generated dynamically from the grammar provided as input:

1. routeResponseLexer: This code scans the stream of input characters (in this case the output of the ‘route’ command) and forms a list of tokens based on the lexical rules defined in the grammar above. It has a method nextToken() which the parser uses to get the next matching token from the input data.

2. routeResponseParser: As described earlier, parser rules define the grammatical structure. This file contains the code corresponding to the parser rules. It gets tokens from a TokenStream (which is fed by the lexer) and checks whether they match the grammatical structure.
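The division of labour between the two files can be sketched without ANTLR: a lexer object hands out one typed token per nextToken() call, and whoever plays the parser keeps pulling until the input is exhausted. The class below is a hand-written stand-in, not the generated routeResponseLexer. PullLexerSketch is a hypothetical name, the "TYPE:text" string format and the "EOF" marker are assumptions of this example, and the table heading is left out because only the title and data tokens are covered.

```java
// Hypothetical pull-style lexer; imitates the nextToken() idea, not ANTLR's classes.
final class PullLexerSketch {

    // Lexical rules as {name, regex}, tried in order; the first match wins.
    private static final String[][] RULES = {
        {"TITLE", "Destination|Gateway|Genmask|Flags|Metric|Ref|Use|Iface"},
        {"ADDR",  "link-local|default|localhost|[0-9]+\\.[0-9]+\\.[0-9]+\\.[0-9]+|\\*"},
        {"FLAG",  "[UHGRDMAC!]+"},
        {"INT",   "[0-9]+"},
        {"IFACE", "[0-9a-z]+"},
    };

    private final String[] words;
    private int position = 0;

    PullLexerSketch(String input) {
        String trimmed = input.trim();
        // Whitespace is dropped up front, which plays the role of the WS rule.
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("[ \\r\\t\\n]+");
    }

    // Returns "TYPE:text" for the next token, or "EOF" when nothing is left.
    String nextToken() {
        if (position >= words.length) {
            return "EOF";
        }
        String word = words[position++];
        for (String[] rule : RULES) {
            if (word.matches(rule[1])) {
                return rule[0] + ":" + word;
            }
        }
        throw new IllegalStateException("no lexical rule matches: " + word);
    }

    public static void main(String[] args) {
        PullLexerSketch lexer = new PullLexerSketch("default 172.31.44.1 0.0.0.0 UG 0 0 0 eth0");
        // The "parser" side: pull tokens one at a time until EOF.
        for (String token = lexer.nextToken(); !token.equals("EOF"); token = lexer.nextToken()) {
            System.out.println(token);
        }
    }
}
```

The rule order matters here: a string such as 0 fits both INT and IFACE, and the sketch resolves that by trying INT first.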

References
1. US Patent application number 20040117452: XML-based network management system and method for configuration management of heterogeneous network devices By Byung Joon Lee, Tae Sang Choi, Tae Soo Jeong

2. http://www.antlr.org/
3. The Definitive ANTLR Reference: Building Domain-Specific Languages by Terence Parr
