In recent years, there has been considerable interest in the unsupervised induction of grammars from part-of-speech-tagged text. The two standard approaches to this task are based on unlabeled dependency grammars or probabilistic context-free grammars, and return structures that are difficult to interpret linguistically. In this talk I will propose an alternative based on a formalism called Combinatory Categorial Grammar (CCG). CCG is a lexicalized formalism that associates words with complex symbolic categories capturing their syntactic behavior. An unsupervised CCG induction system has to automatically identify an appropriate set of language-specific categories from raw text. It may therefore seem that using CCG needlessly complicates the already difficult task of identifying the correct grammatical structures in an unsupervised fashion. However, I will show that using CCG for unsupervised grammar induction actually has a number of advantages. First, it makes our approach more useful, since we return linguistically interpretable structures. Second, it greatly simplifies the modeling task: although CCG is strictly more expressive than context-free grammars, I will define a non-parametric Bayesian model that is much simpler than corresponding models for context-free grammars. Moreover, unlike other approaches, our grammar remains robust when parsing longer sentences, performing as well as or better than other systems. Finally, the linguistic interpretability of CCG derivations allows us to perform an in-depth error analysis of the kinds of constructions where unsupervised induction fails to identify the correct syntactic structures, and where supervision may be necessary.
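To illustrate what "complex symbolic categories" means (this sketch is not from the talk, and the category strings and helper functions are hypothetical simplifications): in CCG, a transitive verb like "sees" receives the category (S\NP)/NP, meaning it first combines with an NP to its right and then with an NP to its left to yield a sentence S. The two basic combination rules, forward and backward application, can be sketched as:

```python
def strip_parens(cat):
    """Drop one layer of outer parentheses, e.g. (S\\NP) -> S\\NP."""
    if cat.startswith("(") and cat.endswith(")"):
        return cat[1:-1]
    return cat

def forward_apply(fn, arg):
    """Forward application (>): X/Y followed by Y yields X."""
    suffix = "/" + arg
    if fn.endswith(suffix):
        return strip_parens(fn[: -len(suffix)])
    return None  # categories do not combine

def backward_apply(arg, fn):
    """Backward application (<): Y followed by X\\Y yields X."""
    suffix = "\\" + arg
    if fn.endswith(suffix):
        return strip_parens(fn[: -len(suffix)])
    return None

# Toy lexicon: each word is paired with a category encoding its behavior.
lexicon = {"John": "NP", "Mary": "NP", "sees": "(S\\NP)/NP"}

# "sees Mary": (S\NP)/NP + NP -> S\NP (a verb phrase)
vp = forward_apply(lexicon["sees"], lexicon["Mary"])
# "John sees Mary": NP + S\NP -> S (a complete sentence)
s = backward_apply(lexicon["John"], vp)
print(vp, s)  # S\NP S
```

An unsupervised induction system must discover category inventories like this lexicon directly from text, which is what makes the interpretability of the resulting derivations possible.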
Julia Hockenmaier teaches in the Department of Computer Science at the University of Illinois at Urbana-Champaign.