Wikipedia: Lexical Analysis

Showing revision 5
Difference (from revision 5 to revision 5) (minor diff, author diff)
(The revisions are identical or unavailable.)

Lexcial analysis is the process of converting a raw input stream--usually but not exclusively a stream of characters--into a stream of tokens for use by a parser. E.g. the string "net_worth_future = (assets - liabilities)*1.1" might be converted into the stream NAME EQUALS OPEN_PARENS NAME MINUS NAME CLOSE_PARENS TIMES NUMBER. The reason that this is done is that it makes writing a parser? much easier. Instead of being forced to deal with the fact that net_worth_future is a name, and assets is a name, and foo_barQUuX? is a name, the parser need only concern itself with syntactical matters. This leads to efficiency of programming, if not efficiency of execution.

Lexers typically operate by using simple regular expressions to recognise tokens in the input stream. For example, a NAME might be any alphabetical character or the underscore followed by any number of any alphanumeric character or the underscore. This could be represented compactly by the string [a-zA-Z_][a-zA-Z_0-9]+. This means "any character a-z, A-Z or _, then 0 or more of a-z, A-Z, _ or 0-9." Regular expressions are a simple way of representing complex problems. What they cannot do is handle recursively defined patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." It takes a full-fledged parser? (there exist tools which accept Context free grammars as input and emit parsers as output) to recognise such patterns.