Colloquiumfest - Nathan, Lifeng | Department of Linguistics

November 3, 2017

All Day

Knowlton Hall 190

Complex Scoping in Simple English

Nathan Rasmussen

Quantificational noun phrases are common in explanatory text, so disambiguating quantifier scope is essential for understanding such texts correctly. Human readers generally do this easily, but computational systems do not yet. One reason is the lack of training data. Existing scope-annotated corpora either consist of news (Higgins and Sadock, 2003; Srinivasan and Yates, 2009), in which proper names and specific facts make quantifiers less frequent, or are narrow in subject matter (Manshadi et al., 2011; Manshadi et al., 2012), and so are hard to generalize from.

Our new corpus annotates scopes in 336 three-sentence excerpts from the Simple English Wikipedia (1008 sentences in total). The sample covers many different domains of knowledge, but with relatively limited syntax and vocabulary, which we expect will make the texts more tractable for machine learning systems as well as for human readers. Despite this simplicity, it is still rich in quantifiers, often with several per sentence.

Scopes are annotated in a dependency form that allows underspecification when necessary and that integrates with our other work in graph-based representations of meaning. Coreference and certain presuppositions are also annotated, where they are inextricably entangled with scoping concerns. A 99-sentence subsample was used to evaluate inter-annotator agreement, which was found to be comparable to that of Manshadi et al.'s earlier corpus.
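
As a rough illustration of what underspecification in a dependency-style scope annotation can mean, the sketch below stores only the "outscopes" pairs an annotator committed to and enumerates the orderings that remain possible. This is not the corpus's actual format: the function name `readings`, the quantifier labels, and the simplification of a reading to a linear ordering are all assumptions made for the example.

```python
from itertools import permutations


def readings(quantifiers, outscopes):
    """Enumerate the scope orderings consistent with the annotated pairs.

    quantifiers: list of hypothetical quantifier labels.
    outscopes:   set of (wide, narrow) pairs the annotator committed to.
                 Pairs left out are underspecified: either order is allowed.
    """
    consistent = []
    for order in permutations(quantifiers):
        # Position in the tuple stands for scope: earlier means wider.
        rank = {q: i for i, q in enumerate(order)}
        if all(rank[wide] < rank[narrow] for wide, narrow in outscopes):
            consistent.append(order)
    return consistent


# "Every student read a book": two quantifiers, hypothetical labels.
quantifiers = ["every_student", "a_book"]

# No edge annotated: both orderings remain available.
underspecified = readings(quantifiers, set())

# One edge annotated: only the surface-scope ordering survives.
resolved = readings(quantifiers, {("every_student", "a_book")})
```

Leaving a pair unannotated keeps both readings alive, while each annotated edge prunes the orderings that contradict it.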

In the near future, the corpus will be used to train and test two scope predictors. One will be based solely on the geometry of the text's semantic graph, and the other will enrich this information with vector representations of the predicates used, as a way of capturing world knowledge.

The Learning of a Grammar with Limited Memory Depth

Lifeng Jin

A probabilistic context-free grammar (PCFG) can generate sentences with unlimited recursion depth, but humans seem to put a bound on how deeply embedded a phrase can be. This cognitively motivated bound on recursion depth has been used to constrain computational models of grammar acquisition. In this talk, I will present a Bayesian grammar acquisition model that learns a PCFG from raw text and bounds it with a memory depth limit, without assuming any other universals. Such depth-bounding makes grammar acquisition easier and more efficient because the search space of all grammars is smaller. Results on synthetic datasets and child-directed speech show that our model achieves higher parse accuracy than other models, which are not necessarily cognitively motivated, and that it uses category labels consistently. The fact that this model learns a PCFG and then transforms it into a bounded PCFG seems to align with Chomsky's distinction between competence and performance, and has the potential to offer some formal guidance to linguistic inquiry about both kinds of models.
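
To make the idea of depth-bounding concrete, here is a minimal sketch that samples from a toy PCFG while capping how many times a clause may be embedded inside another clause. It is not the Bayesian acquisition model from the talk, and it uses a plain nesting count rather than the memory depth measure the talk refers to. The grammar, the rule probabilities, and the names `GRAMMAR` and `expand` are all invented for illustration.

```python
import random

# Toy PCFG (hypothetical): each nonterminal maps to (right-hand side, probability) pairs.
# Symbols that are not keys of GRAMMAR are treated as terminal words.
GRAMMAR = {
    "S": [(("NP", "VP"), 1.0)],
    "NP": [(("the", "N"), 0.7), (("the", "N", "that", "S"), 0.3)],
    "VP": [(("V",), 0.5), (("V", "NP"), 0.5)],
    "N": [(("dog",), 0.5), (("cat",), 0.5)],
    "V": [(("sleeps",), 0.5), (("sees",), 0.5)],
}


def expand(symbol, rng, max_depth, depth=0):
    """Sample a word sequence from `symbol`; return (words, deepest embedding reached)."""
    if symbol not in GRAMMAR:
        return [symbol], depth
    rules = GRAMMAR[symbol]
    if depth >= max_depth:
        # At the bound, drop rules that would embed another clause.
        # random.choices renormalizes the remaining weights implicitly.
        rules = [(rhs, p) for rhs, p in rules if "S" not in rhs]
    options = [rhs for rhs, _ in rules]
    weights = [p for _, p in rules]
    rhs = rng.choices(options, weights=weights, k=1)[0]
    words, deepest = [], depth
    for child in rhs:
        # Only an embedded clause increases the depth counter.
        child_depth = depth + 1 if child == "S" else depth
        child_words, child_deepest = expand(child, rng, max_depth, child_depth)
        words.extend(child_words)
        deepest = max(deepest, child_deepest)
    return words, deepest


words, deepest = expand("S", random.Random(0), max_depth=2)
print(" ".join(words), "| embedding depth:", deepest)
```

With the bound in place, every derivation that would embed a clause more than `max_depth` times is excluded, which is one simple way to see why a depth limit shrinks the space of analyses a learner has to consider.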

There will be a reception following the talks.