Lexical analysis is the process of converting a raw input stream--usually but not exclusively a stream of characters--into a stream of tokens for use by a parser. For example, the string "net_worth_future = (assets - liabilities)*1.1" might be converted into the stream NAME EQUALS OPEN_PARENS NAME MINUS NAME CLOSE_PARENS TIMES NUMBER. This is done because it makes writing a parser much easier. Instead of being forced to deal with the fact that net_worth_future is a name, and assets is a name, and foo_barQUuX is a name, the parser need only concern itself with syntactic matters. This leads to efficiency of programming, if not efficiency of execution.
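As a sketch of the idea, here is a minimal lexer in Python. The token names match the stream above; the use of the re module and this particular table-driven structure are illustrative choices, not prescribed by the text.

```python
import re

# Each token is a (name, pattern) pair. Order matters: NUMBER is tried
# before NAME so "1.1" is not split apart. These names are the ones used
# in the example token stream; the patterns are illustrative.
TOKEN_SPEC = [
    ("NUMBER",       r"\d+(\.\d+)?"),
    ("NAME",         r"[a-zA-Z_][a-zA-Z_0-9]*"),
    ("EQUALS",       r"="),
    ("OPEN_PARENS",  r"\("),
    ("CLOSE_PARENS", r"\)"),
    ("MINUS",        r"-"),
    ("TIMES",        r"\*"),
    ("SKIP",         r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{n}>{p})" for n, p in TOKEN_SPEC))

def tokenize(text):
    """Yield (token_name, lexeme) pairs; whitespace is discarded."""
    for m in MASTER.finditer(text):
        if m.lastgroup != "SKIP":
            yield m.lastgroup, m.group()

tokens = [name for name, _ in
          tokenize("net_worth_future = (assets - liabilities)*1.1")]
# tokens == ['NAME', 'EQUALS', 'OPEN_PARENS', 'NAME', 'MINUS',
#            'NAME', 'CLOSE_PARENS', 'TIMES', 'NUMBER']
```

The lexer emits only token names and lexemes; everything about structure (what the parentheses group, what "=" binds) is deferred to the parser.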
Lexers typically operate by using simple regular expressions to recognise tokens in the input stream. For example, a NAME might be any alphabetical character or the underscore, followed by any number of alphanumeric characters or underscores. This can be represented compactly by the pattern [a-zA-Z_][a-zA-Z_0-9]*, which means "any character a-z, A-Z or _, then 0 or more of a-z, A-Z, _ or 0-9." Regular expressions represent such flat, repetitive patterns compactly. What they cannot do is handle recursively defined patterns, such as "n opening parentheses, followed by a statement, followed by n closing parentheses." It takes a full-fledged parser (there exist tools which accept context-free grammars as input and emit parsers as output) to recognise such patterns.
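The NAME pattern can be checked directly. A small sketch, assuming Python's re module; note that the second repetition uses * (zero or more), so a single-character name like "x" is still a valid NAME.

```python
import re

# The NAME pattern from the text: one letter or underscore, then zero
# or more letters, underscores, or digits.
NAME = re.compile(r"[a-zA-Z_][a-zA-Z_0-9]*")

print(NAME.fullmatch("net_worth_future") is not None)  # True
print(NAME.fullmatch("x") is not None)                 # True: * allows zero trailing chars
print(NAME.fullmatch("1st_place") is not None)         # False: must not start with a digit
```

By contrast, no single regular expression accepts exactly the strings with n opening parentheses, a statement, and n matching closing parentheses for unbounded n; counting nesting depth is the parser's job.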