1 Introduction
Many extensions to text-based, data-intensive knowledge management approaches, such as Information Retrieval or Data Mining, focus on integrating the impressive recent advances in language technology. For this, they need fast, robust parsers that deliver linguistic data which is meaningful for the subsequent processing stages. This paper introduces such a parsing system and discusses some of its disambiguation techniques, which are based on learning from a large syntactically annotated corpus.

The paper is organized as follows. Section 2 explains the motivation for writing the parser and why it profits from Dependency Grammar assumptions. Section 3 gives a brief introduction to the parsing system and to evaluation questions. Section 4 presents the probabilistic models and the conducted experiments in detail.
2 Current Parsing Approaches
2.1 Formal Grammar Parsers
On the one hand, a variety of parsers that are based on a formal linguistic theory have existed for a number of years. These are, to name only a few, the Alvey tools (Briscoe et al. 1987) for GPSG, Lingo (Copestake/Flickinger 2000) or Babel (Müller 1996) for HPSG, FIPS (Wehrli 1997) or PAPPI (Fong 1991) for GB, and MINIPAR (Lin 1998) or FDG (Tapanainen/Järvinen 1997) for DG. These systems generally have a very good coverage of most syntactic phenomena, but some of them suffer from run-time problems due to the complexity of the grammar, or from coverage problems due to the rigidness of the grammar or the incompleteness of the lexicon. Typical formal grammar parser complexity is about O(n⁵), which means that parsing time for an exhaustive parse grows with the fifth power of the number of words in a sentence.

These parsers have in common that they are rule-based, and that their scoring systems for disambiguation are heuristic and hand-written rather than learnt from real-world data. For example, in I saw the man [PP with an umbrella], it is syntactically unclear whether the PP should attach to the verb or the noun. A lexicon specific enough to include the information that one cannot see by means of umbrellas is too cumbersome to compile, and structural heuristics are bound to fail, since the structurally equivalent I saw the man [PP with a periscope] is semantically rather a case for verb attachment. A corpus-based probabilistic approach, however, based on whether the participating words are more likely to be seen in a verb or a noun attachment, can prefer and reject readings on empirical grounds (Collins/Brooks 1995).
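To make the intuition concrete, here is a minimal sketch of such a count-based attachment decision. All counts, words and function names are invented for illustration; the actual model of Collins/Brooks (1995) additionally backs off to lower-order counts when the full word tuple is unseen.

```python
from collections import Counter

# Hypothetical attachment counts for (attachment, verb, noun, preposition)
# tuples, as they might be extracted from an annotated corpus.
counts = Counter()
counts[("noun", "see", "man", "with")] += 3  # "saw [the man with an umbrella]"
counts[("verb", "see", "man", "with")] += 1  # "saw [the man] with a periscope"

def attach(verb, noun, prep):
    """Prefer whichever attachment is more frequent for this triple (MLE)."""
    v = counts[("verb", verb, noun, prep)]
    n = counts[("noun", verb, noun, prep)]
    if v + n == 0:
        return "noun"  # unseen triple: default to low (noun) attachment
    return "verb" if v > n else "noun"

print(attach("see", "man", "with"))  # -> "noun"
```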
2.2 Probabilistic Parsers
On the other hand, broad-coverage syntactic parsers that learn from syntactically annotated corpora have now become available (Charniak 2000, Collins 1999, Henderson 2003). They generally have good performance, but they typically produce pure constituency data as output: trees that include neither the grammatical function annotation nor the empty node annotation provided in Treebanks such as the Penn Treebank (Marcus/Santorini/Marcinkiewicz 1993), on which they are usually trained. This means that the extraction of long-distance dependencies (LDD) and the mapping to shallow semantic representations is not always possible, because, first, co-indexation information is not available; second, a single parsing error across a tree fragment containing an LDD makes its extraction impossible (Johnson 2002); and third, some syntactic relations cannot be recovered on configurational grounds alone.

Implementations of probabilistic parsers are very efficient, since no LDDs are expressed. The CYK algorithm, or variants of it such as Nivre (2003), have parsing complexity O(n³).
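For illustration, the following is a minimal CYK recognizer for a toy grammar in Chomsky normal form; the three nested loops over span lengths, start positions and split points are what yields the O(n³) behaviour. Grammar and lexicon are invented for the example.

```python
def cyk(words, lexicon, rules):
    """Minimal CYK recognizer for a grammar in Chomsky normal form.

    lexicon: dict word -> set of nonterminals (rules A -> word)
    rules:   dict (B, C) -> set of nonterminals (rules A -> B C)
    """
    n = len(words)
    # chart[i][j] holds the nonterminals that span words[i:j]
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon.get(w, ()))
    for span in range(2, n + 1):          # O(n) span lengths
        for i in range(n - span + 1):     # O(n) start positions
            j = i + span
            for k in range(i + 1, j):     # O(n) split points
                for b in chart[i][k]:
                    for c in chart[k][j]:
                        chart[i][j] |= rules.get((b, c), set())
    return "S" in chart[0][n]

# Toy grammar, purely illustrative:
lexicon = {"I": {"NP"}, "saw": {"V"}, "him": {"NP"}}
rules = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
print(cyk(["I", "saw", "him"], lexicon, rules))  # -> True
```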
The parser described here aims at combining both approaches into a hybrid: using a deep-linguistic formal grammar theory, dependency grammar (DG), with hand-written rules, but assigning probabilistic lexicalized scores for disambiguation and pruning. While it profits from the low parsing complexity of a CYK implementation, it expresses most LDDs like a formal grammar, as briefly explained in the following (see Schneider 2003b for a detailed discussion).
2.3 Functional Dependency Structures
Dependency Grammar (DG) is essentially a valency grammar in which the valency concept is extended from verbs to nouns and adjectives and finally to all word classes. In its simplest definition, a projective DG is a binary version of a constituency grammar which knows only lexical items, which entails that
- for every mother node, the mother node and exactly one of its daughters, the so-called head, are isomorphic
- projection is deterministic and endocentric and thus cannot fail, which gives DG a robustness advantage
- equivalent constituency CFG trees can be derived, as illustrated in figure 1 and sketched in the code below
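The deterministic, endocentric character of projection can be made explicit in a few lines. The following sketch, using an invented tuple representation rather than the parser's own data structures, folds a head and its dependents into the equivalent binary constituency tree.

```python
def project(head, dependents):
    """Fold dependents around a head into a binary constituency tree.

    head: a word; dependents: (subtree, side) pairs in attachment order,
    side being "left" or "right". Every projected node is headed by
    (isomorphic to) its head daughter, so projection cannot fail.
    """
    tree = head
    for dep, side in dependents:
        tree = (dep, tree) if side == "left" else (tree, dep)
    return tree

# "the man slept": 'the' depends on 'man', 'man' depends on 'slept'
np = project("man", [("the", "left")])
print(project("slept", [(np, "left")]))  # (('the', 'man'), 'slept')
```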
DG was originally conceived as a deep-syntactic, proto-semantic theory (Tesnière 1959). The version of DG used here retains syntactic functions as dependency labels, as in LFG, which means that the dependency analyses returned by the parser are also a simple version of LFG f-structures: a hierarchy of syntactic relations between lexical heads which serves as a bridgehead to semantics.
Figure 1: A dependency representation and its typically unlabeled constituency counterpart
They are close to predicate-argument structures or other surface semantic representations such as discourse representation theory (DRT). They are also easy to read and interpret, as they are based on traditional school grammar terms and are far less nested than constituency-based theories, e.g. GB (Chomsky 1981). Unlike the latter, functional dependency exhibits only a minimal amount of non-projectivity. Non-projectivity, which is typically expressed by transformations, co-indexation or structure-sharing, is absent from functional dependency except for WH-movement, for the following reasons.
Empty units, empty complementizers and empty relative pronouns pose no problem to DG, as they are non-head material. Functional DG only accepts content words as heads. For example, a complementizer is an optional dependent of the subordinated verb. Moved clauses like PPs or clausal complements of verbs of utterance (e.g. Torrential rainfalls are forecast, said the weatherman) do not involve non-local dependencies but are expressed by a non-canonical dependency direction under well-defined conditions. The same applies to subject-verb inversion in affirmative clauses for verbs of utterance (same example). Noun-participle relations (e.g. the report written by ...) are covered by a dedicated syntactic relation expressing the noun-modification and the object character of the noun at the same time. Non-finite verbs and gerunds may act as subjects (e.g. after winning/VBG the race); these are covered by grammar rules of the kind Tesnière called translations, although these rules do not participate in the probability model. Finally, non-surface semantic arguments, for example the subject-verb relation in the subordinate clause of a control construction, are created at the post-parsing stage, where minimal predicate-argument structures are output.
Tables 1 and 2, which are adapted from Johnson (2002), show that the vast majority of LDDs except for WH-movement, i.e. dependencies involving several local subtrees and thus several CFG rules, are covered by strictly local rules of the functional DG grammar expressing local dependencies. Most of the relations expressed by these rules also participate in the machine learning model described in section 4. The numbers in table 2 show how many occurrences were recognised by the patterns described in section 3.3. Because the patterns are formulated with an emphasis on precision, generally not all occurrences of a type are extracted: recall is below 100%. Many of the structures that are not covered by the patterns are still parsed correctly, for example adverbial sentences as unspecified subordinate clauses, but information about shared semantic arguments is then no longer expressed.
3 The Parsing System
The parser differs on the one hand from successful deep-linguistic Dependency Grammar implementations (e.g. Lin 1998, Tapanainen/Järvinen 1997) by using a statistical basis, and on the other hand from state-of-the-art statistical approaches (e.g. Collins 1999) by carefully following an established formal grammar theory, Dependency Grammar (DG). The parser has been implemented in Prolog. It has been tested in SWI-Prolog and Sicstus Prolog under Solaris, Linux, Windows 2000 and Windows XP. Figure 2 shows a screenshot of the graphical SWI-Prolog version. Its parsing speed, including the tagging and chunking preprocessing, is above 100,000 words per hour on a modern PC.

The parser uses a hand-written linguistic grammar. Except for the pre-processing tagger, this part of the system is rule-based and non-probabilistic. But the parser also integrates a probability-based model of language, similar to Collins (1999). It is supervised and based on Maximum Likelihood Estimation (MLE). All the probability models tested are discussed in section 4.
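As a minimal illustration of supervised MLE over treebank counts, the sketch below estimates the probability of a dependency relation given two lexical heads. Relation labels, words and counts are invented; the conditioning and back-off of the models actually used are those discussed in section 4.

```python
from collections import Counter

rel_counts = Counter()   # (relation, head, dependent) -> count
pair_counts = Counter()  # (head, dependent) -> count

def observe(relation, head, dependent):
    """Collect counts from a syntactically annotated corpus."""
    rel_counts[(relation, head, dependent)] += 1
    pair_counts[(head, dependent)] += 1

def p_mle(relation, head, dependent):
    """MLE: P(relation | head, dependent) = count(r,h,d) / count(h,d)."""
    denom = pair_counts[(head, dependent)]
    return rel_counts[(relation, head, dependent)] / denom if denom else 0.0

observe("subj", "sleep", "man")
observe("subj", "sleep", "man")
observe("obj", "see", "man")
print(p_mle("subj", "sleep", "man"))  # -> 1.0
```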
3.1 Tagging and Chunking
The parsing system uses a divide-and-conquer approach. Low-level linguistic tasks that can be reliably solved by finite-state techniques are handed over to them. These low-level tasks are the recognition of parts of speech by means of tagging, and the recognition of base NPs and verbal groups by means of chunking. Finite-state technology is fast enough for unlimited amounts of data. Tagging and chunking are done by a standard tagger and chunker, LTPos (Mikheev 1997). The parser then relies on the disambiguation decisions of the tagging and chunking stage and can profit from a reduced search space, at the cost of a slightly decreased performance due to tagging and chunking errors. The grammar rules are based on the tags of the linguistic heads.

Tagging is a classical machine learning application in Computational Linguistics. Every word token in the running text is assigned a part-of-speech tag depending on its context. For example, the word run is a verb token in I run, as in most cases when it follows a pronoun, but a noun token in the long run, as in most cases when it follows an adjective. Taggers achieve error rates of 2-5%. They often fail if the context that is taken into consideration around the target word is too small for a decision to be taken, or if structural information would be needed. For example, in the article featured in the newspaper discusses that ..., the participle featured is likely to be mistagged as the (generally more frequent) verb past tense, because the immediate left context (e.g. a noun) and right context (e.g. a preposition) of the two readings can be identical. Some frequent tagging errors can be corrected by the parser. For example, the grammar allows verb past forms to act as participles, which leads to local, but rarely to global, parsing ambiguity.
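The run example can be made concrete with a toy tagger in which the previous tag decides the reading of an ambiguous word. The counts below are invented for illustration and do not reflect LTPos's actual model.

```python
from collections import Counter

# Invented counts of tags for an ambiguous word given the previous tag.
tag_given_prev = {
    ("PRP", "run"): Counter({"VBP": 95, "NN": 5}),   # "I run"
    ("JJ", "run"):  Counter({"NN": 90, "VBP": 10}),  # "the long run"
}

def tag(prev_tag, word):
    """Pick the most frequent tag for this word in this left context."""
    dist = tag_given_prev.get((prev_tag, word))
    return dist.most_common(1)[0][0] if dist else "NN"

print(tag("PRP", "run"), tag("JJ", "run"))  # -> VBP NN
```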
The chunker and the head extraction method are completely rule-based. A small evaluation shows about 98% correct head extraction. The extracted heads are lemmatized (Minnen/Carroll/Pearce 2000). Parsing takes place only between the heads of phrases, and only using the best tag suggested by the tagger, which leads to a reduction in complexity.
3.2 The Hand-Written Grammar
Because of the DG isomorphism of a head and its projection, there are no phrase levels, and projection can never fail. All rules apply at the word level and are based on the tags of the Penn Treebank tagset.

The rules contain a simple, tag-based scoring system which disprefers marginal and infrequent relations. A large number of linguistic constraints are encoded, for example that a verb can only have one subject, that adjuncts only follow after all complements, or that verbs can maximally have two noun objects. Some linguistically possible but very rare constructions are ruled out. For example, a verb with a subordinate clause is not allowed to attach a PP after the subordinate clause. Linguistic constructions that are considerably rarer than the error rate they introduce are not desirable in a practically oriented system. A closed list of temporal expressions, possible noun adjuncts, is also used. Only verbs of utterance are allowed to undergo subject-verb inversion in affirmative clauses, and topicalized clauses are only allowed sentence-initially.
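As a hedged sketch, such licensing constraints can be thought of as checks performed before a dependent is attached; the relation names and state fields below are illustrative and do not reproduce the parser's Prolog encoding.

```python
def can_attach(relation, head_state):
    """Check some of the hand-written constraints named above."""
    if relation == "subj" and head_state["n_subj"] >= 1:
        return False  # a verb can only have one subject
    if relation == "obj" and head_state["n_obj"] >= 2:
        return False  # verbs have maximally two noun objects
    if relation == "adjunct" and not head_state["complements_done"]:
        return False  # adjuncts only follow after all complements
    return True

state = {"n_subj": 1, "n_obj": 0, "complements_done": False}
print(can_attach("subj", state))  # -> False: subject slot already filled
print(can_attach("obj", state))   # -> True
```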
Writing grammar rules is a feasible task when using a framework that is close to traditional school grammar assumptions, such as DG. Acknowledged facts are expressed in hand-written declarative rules. Since the tagset is limited and dependency rules are binary, even a broad-coverage set of rules can be written in relatively little time.
Much more difficult than writing the grammar is assessing the scope of application of a rule and the amount of ambiguity it creates. Long real-world sentences typically have dozens to hundreds of syntactically correct complete analyses and thousands of partial analyses, although most of them are semantically so odd that one would never think of them. Here, machine-learning approaches, such as probabilizing the manually written rules, are very useful for a successful parser, for two reasons. First, the syntactically possible analyses can be ranked according to their probabilities. For subsequent processing stages like semantic interpretation or document classification it then often suffices to take the first-ranked or the n first-ranked readings. Second, in the course of the parsing process, very improbable analyses can be abandoned, which greatly improves parsing efficiency.
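The second point amounts to beam pruning. A minimal sketch, with invented readings and probabilities, of keeping only the most probable analyses for a given span might look as follows.

```python
import heapq

def prune(analyses, beam_size=5):
    """Keep the beam_size most probable (probability, analysis) pairs."""
    return heapq.nlargest(beam_size, analyses, key=lambda pa: pa[0])

analyses = [(0.30, "reading 1"), (0.25, "reading 2"),
            (1e-6, "semantically odd reading"), (0.20, "reading 3")]
print([a for _, a in prune(analyses, beam_size=2)])
# -> ['reading 1', 'reading 2']
```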
References
Briscoe, E./Grover, C./Boguraev, B./Carroll, J. (1987): "A formalism and environment for the development of a large grammar of English". In: McDermott, John (ed.): Proceedings of the 10th International Joint Conference on Artificial Intelligence, Milan, Italy. Los Altos, Calif.: 703-708.
Carroll, John/Minnen, Guido/Briscoe, Ted (1999): "Corpus annotation for parser evaluation". In: Uszkoreit, H./Brants, T./Krenn, B. (eds.): Proceedings of the EACL-99 Post-Conference Workshop on Linguistically Interpreted Corpora. Bergen, Norway: 35-41.
Charniak, Eugene (2000): "A maximum-entropy-inspired parser". In: Proceedings of the North American Chapter of the ACL. Seattle, WA: 132-139.
Chomsky, Noam (1981): Lectures on Government and Binding. Dordrecht.
Collins, Michael (1999): Head-Driven Statistical Models for Natural Language Parsing. Ph.d. dissertation, University of Pennsylvania, Philadelphia, PA.
Collins, Michael/Brooks, James (1995): "Prepositional attachment through a backed-off model". In: Yarowsky, David (ed.): Proceedings of the Third Workshop on Very Large Corpora. Cambridge, MA: 27-38.
Copestake, Ann/Flickinger, Dan (2000): "An open-source grammar development environment and broad-coverage English grammar using HPSG". In: Gavrilidou, Maria et al. (eds.): Proceedings of the Second Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece. Paris: 591-600.
Fellbaum, Christiane (ed.) (1998): WordNet: An Electronic Lexical Database. Cambridge, MA.
Fong, Sandiway (1991): Computational Properties of Principle-Based Grammatical Theories. Ph.d. dissertation, MIT Artificial Intelligence Lab, Cambridge, MA.
Henderson, James (2003): "Inducing History Representations for Broad Coverage Statistical Parsing". In: Proceedings of the joint meeting of the North American Chapter of the Association for Computational Linguistics and the Human Language Technology Conference (HLT-NAACL 2003). Edmonton, Canada: 103-110.
Johnson, Mark (2002): "A simple pattern-matching algorithm for recovering empty nodes and their antecedents". In: Proceedings of the 40th Meeting of the ACL. University of Pennsylvania, Philadelphia: 136-143. See http://cog.brown.edu:16080/~mj/Publications.htm
Levin, Beth C. (1993): English Verb Classes and Alternations: A Preliminary Investigation. Chicago, IL.
Lin, Dekang (1995): "A dependency-based method for evaluating broad-coverage parsers". In: Mellish, Chris S. (ed.): Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, Montréal, Canada (IJCAI-95). San Mateo, Calif.: 1420-1425.
Lin, Dekang (1997): "Using syntactic dependency as local context to resolve word sense ambiguity". In: Proceedings of the Thirty-Fifth Annual Meeting of the Association for Computational Linguistics and Eighth Conference of the European Chapter of the Association for Computational Linguistics (ACL/EACL-97). Madrid: 64-71. See http://www.cs.ualberta.ca/~lindek/papers.htm
Lin, Dekang (1998): "Dependency-based evaluation of MINIPAR". In: Rubio, Antonio et al. (eds.): Proceedings of the Workshop on the Evaluation of Parsing Systems, First International Conference on Language Resources and Evaluation (LREC '98), Granada, Spain. Paris. See http://www.cs.ualberta.ca/~lindek/papers.htm
Marcus, Mitchell/Santorini, Beatrice/Marcinkiewicz, Mary Ann (1993): "Building a large annotated corpus of English: the Penn Treebank". Computational Linguistics 19: 313-330.
Mikheev, Andrei (1997): "Automatic rule induction for unknown word guessing". Computational Linguistics 23(3): 405-423.
Minnen, Guido/Carroll, John/Pearce, Darren (2000): "Applied morphological generation". In: Proceedings of the 1st International Natural Language Generation Conference (INLG 2000). Mitzpe Ramon, Israel: 201-208. See http://www.cogs.susx.ac.uk/lab/nlp/carroll/abs/00mcp.html
Müller, Stefan (1996): "The Babel-System: an HPSG Prolog implementation". In: Reintjes, Peter (ed.): Proceedings of the Fourth International Conference on the Practical Application of Prolog, London. Blackpool, Lancashire: 263-277.
Nivre, Joakim (2003): "An Efficient Algorithm for Projective Dependency Parsing". In: Proceedings of the 8th International Workshop on Parsing Technologies (IWPT 03). Nancy: 149-160. See http://www.masda.vxu.se/~nivre/research/cv_eng.html#publ
Preiss, Judita (2003): "Using grammatical relations to compare parsers". In: Proceedings of EACL 03, Budapest, Hungary. East Stroudsburg, PA: 291-298.
Schneider, Gerold (2003a): "A low-complexity, broad-coverage probabilistic dependency parser for english". In: Proceedings of NAACL/HLT 2003 Student session. Edmonton, Canada: 31-36.
Schneider, Gerold (2003b): "Extracting and Using Trace-Free Functional Dependencies from the Penn Treebank to Reduce Parsing Complexity". In: Nivre, Joakim/Hinrichs, Erhard (eds.): Proceedings of Treebanks and Linguistic Theories (TLT) 2003. Växjö, Sweden: 153-164.
Tapanainen, Pasi/Järvinen, Timo (1997): "A non-projective dependency parser". In: Proceedings of the 5th Conference on Applied Natural Language Processing. Association for Computational Linguistics: 64-71. See http://www.ling.helsinki.fi/~tapanain/tekeleet.html
Tesnière, Lucien (1959): Eléments de Syntaxe Structurale. Paris.
Wehrli, Eric (1997): L'analyse syntaxique des langues naturelles. Paris.