The Ohio State University

Research

Research Areas

Computational Linguistics


Current Projects


DECCA: Detection of Errors and Correction in Corpus Annotation

The success of data-driven approaches and stochastic modeling in computational linguistic research and applications is rooted in the availability of electronic natural language corpora. Despite the central role that annotated corpora play in this work, the question of how errors in corpus annotation can be detected and corrected has received little attention. The project is designed to address this gap by exploring an error detection and correction method that is applicable to a wide range of corpus annotations.

Hybrid methods for acquisition and tuning of lexical information

Broad-coverage dictionaries and ontologies for natural language processing (NLP) are difficult and costly to create and maintain by hand. It is therefore desirable to learn them from distributional information, such as can be obtained from unlabeled or sparsely labeled text corpora. Many linguistic and psycholinguistic theories are distributional, but they emphasize local neighborhood structure more than previous NLP approaches do. Successful visualization techniques such as keyword-in-context also rely on the preservation of neighborhood structure. A similar emphasis is present in emerging techniques for data reduction, such as locally linear embedding (LLE) and min-cut algorithms, whose application to language data the project is investigating.

While the immediate goal of the project is to gain a better understanding of lexical tuning and acquisition, the resulting dictionaries, ontologies and mapping techniques have the potential to help information professionals (such as librarians, translators, patent examiners and paralegal researchers) to navigate through corpora, to understand the significance of the data that they see, and to incorporate insights derived from the data into their working practice.

We are integrating computational linguistics into the undergraduate curriculum of the Department of Linguistics, creating new courses designed primarily to appeal to students majoring in the humanities, and to offer such students fresh options in meeting the scientific, mathematical and quantitative components of the university's breadth requirement.

TAGARELA: Bridging the Gap between Research in Natural Language Processing and Individualized Language Instruction

CoGETI: Constraint-based Grammar: Data, Theory, and Implementation

Completed Projects


MiLCA: Media-intensive teaching modules in the computational linguistics curriculum

From Corpus Resources to Linguistic Phenomena: Using Computational Linguistics Tools to Access Relevant Data for Linguistics

Updating a Grammar Implementation Environment

Personalised Speech Synthesis for English

Machine Translation Using Probabilistic Finite-State Devices

Implementing a Theory of Question Answering in a Dialogue System

Information Extraction from Political Texts

Parsing for Languages with Flexible Constituent Ordering

Linguistic Grammars Online (LinGO collaboration with Stanford University)