Arnt Richard Johansen scripsit:

> Do we know what these hacks are?

Yes. In grammar.300, the tokens in the 700-series are inserted *before* the
problematic areas to disambiguate them. For example, we don't know whether
a number is going to be a proper number or a -MAI free modifier until we get
to its end, so we insert a lexer_L_712 in front of it in the former case, and
a lexer_A_701 in front of it in the latter case. The assumption is that a
preparser stage inserts these tokens as needed by snuffling ahead as far as
necessary.

The official parser actually uses a slightly different scheme. Rather than
inserting a 700-series token before the problematic sequence, it replaces
the whole sequence with a 900-series token. These tokens are normally
commented out, but are restored when constructing a parser; at the same
time, the 900-series *rules* are commented out.

> I was of the belief that iff YACC is a LALR(1) parser, then it can't cope
> with more than 1-token lookahead. Why are we writing rules for it that
> require more than 1-token lookahead?

It's Lojban, not the YACC grammar, that contains such rules. We use the
inserted (or replacing) tokens to overcome YACC's limitations. Similarly,
we use YACC's error-recovery machinery to insert elidable terminators.

> That is very disturbing. I hope those hacks are only a few, so that we can
> have a hope of *proving* the grammar to be unambiguous.

Currently there are 24 of them. See the comments on the definitions of
the 700-series and 900-series tokens. Most of them are either bounded
(like EK) or unbounded but repetitive (like numbers). The worst case is
tenses, which have a substantial grammar-within-the-grammar.

With the exception of NAI, there is no selma'o that is usable both in the
main grammar and in the "preparser grammar". The only possible ambiguities,
therefore, are between preparser rules.

--
"Well, I'm back." --Sam
John Cowan <jcowan@hidden.email>
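
The 700-series scheme described in the message can be illustrated with a
small sketch. This is not the official parser's code. It assumes, as a
simplification, that a "number" is just a maximal run of PA tokens, and the
function name insert_markers and the (selma'o, word) token layout are
hypothetical. Only the marker names lexer_L_712 and lexer_A_701 come from
the message.

```python
from typing import List, Tuple

# A token is (selma'o or marker name, word). Hypothetical layout for this sketch.
Token = Tuple[str, str]

NUMBER_MARKER = "lexer_L_712"   # goes in front of a proper number
ORDINAL_MARKER = "lexer_A_701"  # goes in front of a number that ends in MAI


def insert_markers(tokens: List[Token]) -> List[Token]:
    """Insert a 700-series marker before each run of PA tokens.

    Assumption: a number is a maximal run of PA tokens. The real
    grammar allows more than that inside a number.
    """
    out: List[Token] = []
    i = 0
    while i < len(tokens):
        if tokens[i][0] != "PA":
            out.append(tokens[i])
            i += 1
            continue
        # Unbounded lookahead: scan to the end of the run, however long.
        j = i
        while j < len(tokens) and tokens[j][0] == "PA":
            j += 1
        # Only at the end of the run do we learn which kind of number it was.
        is_ordinal = j < len(tokens) and tokens[j][0] == "MAI"
        out.append((ORDINAL_MARKER if is_ordinal else NUMBER_MARKER, ""))
        out.extend(tokens[i:j])
        i = j
    return out
```

With the marker in place, the token that follows the marker tells a
one-token-lookahead parser which rule to use, so the YACC grammar never has
to look past the end of the number itself.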