Talk:Lexical analysis
From Wikipedia, the free encyclopedia
| This is the talk page for discussing improvements to the Lexical analysis article. This is not a forum for general discussion of the subject of the article. |
Article policies
|
| Find sources: Google (books · news · scholar · free images · WP refs) · FENS · JSTOR · TWL |
| This article is rated C-class on Wikipedia's content assessment scale. It is of interest to the following WikiProjects: | ||||||||||||||||||
| ||||||||||||||||||
| The content of Tokenization (lexical analysis) was merged into Lexical analysis on 10 June 2017. The former page's history now serves to provide attribution for that content in the latter page, and it must not be deleted as long as the latter page exists. For the discussion at that location, see its talk page. |
Wiki Education Foundation-supported course assignment
This article was the subject of a Wiki Education Foundation-supported course assignment, between 24 August 2020 and 9 December 2020. Further details are available on the course page. Student editor(s): Oemo01.
Above undated message substituted from Template:Dashboard.wikiedu.org assignment by PrimeBOT (talk) 02:31, 17 January 2022 (UTC)
Examples
This article is not clear enough! It needs more examples!
Robert A.
The link to ocaml-ulex in 'Links' is broken.
Frank S.
Robert, this article is actually very poorly articulated. It seems that its authors do not have enough clarity of how to explain it, for it to be comprehensible. And the article should probably be marked that way - that is requires further clarity.Stevenmitchell (talk) 16:08, 25 January 2014 (UTC)
Types of tokens
Can someone explain to me what types of tokens there are? (Keywords, identifiers, literals? Or are there more) And do each of these types go into a symbol table of that type? i.e. an identifier table, keyword table and literal table? Or are they just stored into one uniform symbol table? —Dudboi 02:17, 6 November 2006 (UTC)
It really depends. For instance, for the example language (PL/0 - see the appropriate for example source code) in the article, here are the available token types:
multi-character operators: '+', '-', '*', '/', '=', '(', ')', ':=', '<', '<=', '<>', '>', '>=' language required punctuation: ',', ';', '.', literal numbers - specifically, integers identifiers: a-zA-Z {a-zA-Z0-9} keywords: "begin", "call", "const", "do", "end", "if", "odd", "procedure", "then", "var", "while"
Other languages might have more token types. For instance, in the C language, you would have string literals, character literals, floating point numbers, hexadecimal numbers, directives, and so on.
I've seen them all stored in a single table, and I've also seen them stored in multiple tables. I don't know if there is a standard, but from the books I've read, and the source code I've seen, multiple tables appears to be popular.
second task performed during lexical analysis is to make entry of tokens into a symbol table if it is there.Some other task performed during lexical analysis are: 1.to remove all comments,tab,blank spaces and machin characters. 2.to produce error massages occerrd in a source programs.
See the following links for simple approachable compiler sources:
http://en.wikipedia.org/wiki/PL/0 http://www.246.dk/pascals1.html http://www.246.dk/pascals5.html
See http://www.246.dk/pl0.html for more information on PL/0. 208.253.91.250 18:07, 13 November 2006 (UTC)
merger and clean up
I've ``merged`` token (parser) here. The page is a bit of a mess now though. The headings I've made should help sort that out. The examples should be improved so that they take up less space. The lex file should probably be moved to the flex page. --MarSch 15:37, 30 April 2007 (UTC)
- done the move of the example. --MarSch 15:39, 30 April 2007 (UTC)
Next?
Lexical Errors
There's a mention that scanning fails on an "invalid token". It doesn't seem particularly clear what constitutes a lexical error other than a string of garbage characters. Any ideas? --138.16.23.227 (talk) 04:54, 27 November 2007 (UTC)
If the lexer finds an invalid token, it will report an error.
- The comment needs some context. Generally speaking, a lexical analyzer may report an error. But that is usually (for instance in Lex programming tool) under control of the person who designs the rules for the analysis. The analyzer itelf may reject the rules because they're inconsistent. On the other hand, the parser is more likely to have built-in behavior -- yacc for instance fails on any mismatch, requires the developer to specify how to handle errors (updating the article to reflect these comments requires citing reliable sources of course - talk pages aren't a source) Tedickey (talk) 13:56, 27 November 2007 (UTC)
Lexer, Tokenizer, Scanner, Parser
There should be some more clearly explaned what exactly differs those modules, what are their particular jobs and how they're composed together (maybe some illustration?) —Preceding unsigned comment added by Sasq777 (talk • contribs) 16:06, 11 June 2008 (UTC)
I agree, the article does not make it clear which is what, what comes first etc. In fact, there seems to be a few contradictory statements. A block diagram would be ideal, but a clear explanation is urgently needed. Thanks. 122.29.91.176 (talk) 01:03, 21 June 2008 (UTC)
Something like page 4 of this document, which incidentally can be reproduced easily here (see Preface). 122.29.91.176 (talk) 08:50, 21 June 2008 (UTC)
In computer science and linguistics, parsing, or, more formally, syntactic analysis, is the process of analyzing a sequence of tokens to determine their grammatical structure with respect to a given (more or less) formal grammar.
Parsing is also an earlier term for the diagramming of sentences of natural languages, and is still used for the diagramming of inflected languages, such as the Romance languages or Latin. The term parsing comes from Latin pars (ōrātiōnis), meaning part (of speech).[1][2] —Preceding unsigned comment added by 122.168.58.48 (talk) 13:09, 25 September 2008 (UTC)
There are two different definitions of tokenization, one in the Tokens, the other in the Tokenization subsection. —Preceding unsigned comment added by 134.2.173.25 (talk) 14:24, 28 March 2011 (UTC)
The lexing example should use a list, not XML nor s-expressions
In the article, there are two examples I want to comment on. One shows XML. Another uses an s-expression. These examples create the impression that lexing produces a tree data structure. However, lexical analysis generates a list of tokens (often tuples). To be clear, a list, not a tree. In the lexers I've used, tokens do not contain other tokens. There may be exceptions, but if so, they are uncommon.
To show lexing, it is better to use a (simple) list data structure. Below is an example in Python. (JSON would also be suitable. The idea is to choose a commonly used data representation to get the point across; namely, you only need a list of tuples.)
from enum import Enum
class Token(Enum):
WORD = 1
PUNC = 2
tokenization = [
(Token.WORD, "The"),
(Token.WORD, "quick"),
(Token.WORD, "brown"),
(Token.WORD, "fox"),
(Token.WORD, "jumps"),
(Token.WORD, "over"),
(Token.WORD, "the"),
(Token.WORD, "lazy"),
(Token.WORD, "dog"),
(Token.PUNC, "."),
]
DavidCJames (talk) 15:45, 10 March 2021 (UTC)
Regarding "Lexical analyzer generators" section
It seems to be a duplicate of List of parser generators. Probably can be removed? 98.176.182.199 (talk) 02:57, 6 December 2009 (UTC)
- It's not a duplicate (as the casual reader will observe). Tedickey (talk) 12:00, 6 December 2009 (UTC)
Let's revisit this. At some point in the last nine years, the Comparison of parser generators page has grown sections for both regular- and context-free analyzers. The list here is best converted to a reference into that other page. I vote to perform that operation. 141.197.12.183 (talk) 16:40, 25 January 2019 (UTC)
table-driven vs directly coded
I dont think the table-driven approach is the problem - see 'control table' article - flex appears to be inefficient and is not using a trivial hash function.—Preceding unsigned comment added by 81.132.137.100 (talk • contribs)
- Editor comments that lex/flex doesn't use hashing (that may be relevant). My understanding of the statement is that state-tables can grow very large when compared to hand-coded parsers. Agreeing that they're simpler to implement, there's more than one aspect of efficiency. Tedickey (talk) 20:37, 2 May 2010 (UTC)
- It really depends what is meant by a "table-driven" approach. If you are simply talking about tables of "raw" specifications that have to be parsed/interpreted before execution v. embedded hand coded
Ifstatements, the original text may be correct. - If however you are talking about a really well designed "execution ready" control table - that is then used to input an already compacted and well designed "table of specifications" (v. verbose text based instructions), it can be much superior in algorithmic efficiency, maintainability and just about every other way. In other words, the phrase "table-driven" is maybe used rather ambiguously in this context, without reference to the much deeper "table-driven programming" approach. —Preceding unsigned comment added by 86.139.34.219 (talk) 10:53, 7 May 2010 (UTC)
- It really depends what is meant by a "table-driven" approach. If you are simply talking about tables of "raw" specifications that have to be parsed/interpreted before execution v. embedded hand coded
SICP book example
I am not familiar with the code examples in SICP, can someone please provide the section/example name/number and possibly a link to the example. Thanks, Ckoch786 (talk) 01:43, 7 February 2014 (UTC)