hi. i am yanchuan.

— nlp nerd. modelling ninja. coffee junkie. snowboard aficionado. aviator wannabe. —

about

my research interests lie at the intersection of natural language processing, machine learning and social science. i am interested in building models to understand text, especially from a social science perspective.

in my other life, i play tennis, snowboard, run, travel, explore the town hunting for good food, and spend far too much time watching tv.

google scholar profile

google scholar

link to my google scholar profile page

bitbucket repository

bitbucket

bitbucket repository where my open source development code is hosted.

they provide unlimited private repos and space for users with .edu emails!

noah's ark

noah's ark

i am part of noah's ark, a research group at cmu led by prof noah smith.


publications

a list of my recent publications

Measuring Ideological Proportions in Political Speeches

Yanchuan Sim, Brice Acree, Justin H. Gross, Noah A. Smith

EMNLP 2013. Seattle, WA.

abstract | bibtex | pdf | supplementary | slides | code, data & results

We seek to measure political candidates' ideological positioning from their speeches. To accomplish this, we infer ideological cues from a corpus of political writings annotated with known ideologies. We then represent the speeches of U.S. presidential candidates as sequences of cues and lags (filler distinguished only by its length in words). We apply a domain-informed Bayesian HMM to infer the proportions of ideologies each candidate uses in each campaign. The results are validated against a set of preregistered, domain expert authored hypotheses.

@inproceedings{sim2013measuring,
  author = {Sim, Yanchuan and Acree, Brice D. L. and Gross, Justin H. and Smith, Noah A.},
  title = {Measuring Ideological Proportions in Political Speeches},
  booktitle = {Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing},
  month = {October},
  year = {2013},
  address = {Seattle, Washington, USA},
  publisher = {Association for Computational Linguistics},
  pages = {91--101},
  url = {http://www.aclweb.org/anthology/D13-1010},
  series = {EMNLP '13},
  abstract = {We seek to measure political candidates' ideological positioning from their speeches. To accomplish this, we infer ideological cues from a corpus of political writings annotated with known ideologies. We then represent the speeches of U.S. presidential candidates as sequences of cues and lags (filler distinguished only by its length in words). We apply a domain-informed Bayesian HMM to infer the proportions of ideologies each candidate uses in each campaign. The results are validated against a set of preregistered, domain expert authored hypotheses.},
}

Testing the Etch-a-Sketch Hypothesis: A Computational Analysis of Mitt Romney's Ideological Makeover During the 2012 Primary vs. General Elections

Justin H. Gross, Brice Acree, Yanchuan Sim, Noah A. Smith

American Political Science Association (APSA) 2013 Annual Meeting Paper. Chicago, IL.

abstract | bibtex | pdf

Downsian theory predicts that presidential candidates should shift toward the general electorate's median voter after securing their parties' nominations. Motivated by this largely untested hypothesis, we test the theory using candidates' campaign speeches as data. We develop a model to identify ideological cues in political text. After performing validation and robustness checks, we fit the model using presidential candidates' speeches from 2008 and 2012. The results show that Barack Obama, John McCain and Mitt Romney did indeed make substantively significant rhetorical shifts away from the ideological extremes after securing their parties' presidential nominations.

@inproceedings{gross2013testing,
  title = {Testing the Etch-a-Sketch Hypothesis: A Computational Analysis of Mitt Romney's Ideological Makeover During the 2012 Primary vs. General Elections},
  author = {Gross, Justin and Acree, Brice and Sim, Yanchuan and Smith, Noah A},
  booktitle = {APSA 2013 Annual Meeting Paper},
  location = {Chicago, Illinois, USA},
  year = {2013},
  abstract = {Downsian theory predicts that presidential candidates should shift toward the general electorate's median voter after securing their parties' nominations. Motivated by this largely untested hypothesis, we test the theory using candidates' campaign speeches as data. We develop a model to identify ideological cues in political text. After performing validation and robustness checks, we fit the model using presidential candidates' speeches from 2008 and 2012. The results show that Barack Obama, John McCain and Mitt Romney did indeed make substantively significant rhetorical shifts away from the ideological extremes after securing their parties' presidential nominations.},
}

Learning Topics and Positions from Debatepedia

Swapna Gottipati, Minghui Qiu, Yanchuan Sim, Jing Jiang, Noah A. Smith

EMNLP 2013. Seattle, WA.

abstract | bibtex | pdf | supplementary

We explore Debatepedia, a community authored encyclopaedia of socio-political debates, as evidence for inferring a low dimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation's usefulness in attaching opinionated documents to arguments and its consistency with human judgements about positions.

@inproceedings{gottipatti2013learning,
  author = {Gottipati, Swapna and Qiu, Minghui and Sim, Yanchuan and Jiang, Jing and Smith, Noah A.},
  title = {Learning Topics and Positions from {Debatepedia}},
  booktitle = {Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing},
  month = {October},
  year = {2013},
  address = {Seattle, Washington, USA},
  publisher = {Association for Computational Linguistics},
  pages = {1858--1868},
  url = {http://www.aclweb.org/anthology/D13-1191},
  series = {EMNLP '13},
  abstract = {We explore Debatepedia, a community authored encyclopaedia of socio-political debates, as evidence for inferring a low dimensional, human-interpretable representation in the domain of issues and positions. We introduce a generative model positing latent topics and cross-cutting positions that gives special treatment to person mentions and opinion words. We evaluate the resulting representation's usefulness in attaching opinionated documents to arguments and its consistency with human judgements about positions.},
}

A Probabilistic Model for Canonicalizing Named Entity Mentions

Dani Yogatama, Yanchuan Sim, Noah A. Smith

ACL 2012. Jeju, Korea.

abstract | bibtex | pdf

We present a statistical model for canonicalizing named entity mentions into a table whose rows represent entities and whose columns are attributes (or parts of attributes). The model is novel in that it incorporates entity context, surface features, first-order dependencies among attribute-parts, and a notion of noise. Transductive learning from a few seeds and a collection of mention tokens combines Bayesian inference and conditional estimation. We evaluate our model and its components on two datasets collected from political blogs and sports news, finding that it outperforms a simple agglomerative clustering approach and previous work.

@inproceedings{yogatama2012probabilistic,
  author = {Yogatama, Dani and Sim, Yanchuan and Smith, Noah A.},
  title = {A Probabilistic Model for Canonicalizing Named Entity Mentions},
  booktitle = {Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers - Volume 1},
  series = {ACL '12},
  year = {2012},
  location = {Jeju Island, Korea},
  pages = {685--693},
  numpages = {9},
  url = {http://dl.acm.org/citation.cfm?id=2390524.2390621},
  acmid = {2390621},
  publisher = {Association for Computational Linguistics},
  address = {Stroudsburg, PA, USA},
  abstract = {We present a statistical model for canonicalizing named entity mentions into a table whose rows represent entities and whose columns are attributes (or parts of attributes). The model is novel in that it incorporates entity context, surface features, first-order dependencies among attribute-parts, and a notion of noise. Transductive learning from a few seeds and a collection of mention tokens combines Bayesian inference and conditional estimation. We evaluate our model and its components on two datasets collected from political blogs and sports news, finding that it outperforms a simple agglomerative clustering approach and previous work.},
}

Discovering Factions in the Computational Linguistics Community

Yanchuan Sim, Noah A. Smith, David A. Smith

ACL 2012 Rediscovering 50 Years of Discoveries Workshop. Jeju, Korea.

abstract | bibtex | pdf | slides

We present a joint probabilistic model of who cites whom in computational linguistics, and also of the words they use to do the citing. The model reveals latent factions, or groups of individuals whom we expect to collaborate more closely within their faction, cite within the faction using language distinct from citation outside the faction, and be largely understandable through the language used when cited from without. We conduct an exploratory data analysis on the ACL Anthology. We extend the model to reveal changes in some authors' faction memberships over time.

@inproceedings{sim2012discovering,
  title = {Discovering Factions in the Computational Linguistics Community},
  author = {Sim, Yanchuan and Smith, Noah A. and Smith, David A.},
  booktitle = {Proceedings of the ACL-2012 Special Workshop on Rediscovering 50 Years of Discoveries},
  series = {ACL '12},
  year = {2012},
  location = {Jeju Island, Korea},
  pages = {22--32},
  numpages = {11},
  url = {http://dl.acm.org/citation.cfm?id=2390507.2390511},
  acmid = {2390511},
  publisher = {Association for Computational Linguistics},
  address = {Stroudsburg, PA, USA},
  abstract = {We present a joint probabilistic model of who cites whom in computational linguistics, and also of the words they use to do the citing. The model reveals latent factions, or groups of individuals whom we expect to collaborate more closely within their faction, cite within the faction using language distinct from citation outside the faction, and be largely understandable through the language used when cited from without. We conduct an exploratory data analysis on the ACL Anthology. We extend the model to reveal changes in some authors' faction memberships over time.},
}

Entity Linking with Effective Acronym Expansion, Instance Selection and Topic Modeling

Wei Zhang, Yanchuan Sim, Jian Su, Chew Lim Tan

IJCAI 2011. Barcelona, Spain.

abstract | bibtex | pdf

Entity linking maps name mentions in the documents to entries in a knowledge base through resolving the name variations and ambiguities. In this paper, we propose three advancements for entity linking. Firstly, expanding acronyms can effectively reduce the ambiguity of the acronym mentions. However, only rule-based approaches relying heavily on the presence of text markers have been used for entity linking. In this paper, we propose a supervised learning algorithm to expand more complicated acronyms encountered, which leads to 15.1% accuracy improvement over state-of-the-art acronym expansion methods. Secondly, as entity linking annotation is expensive and labor intensive, to automate the annotation process without compromise of accuracy, we propose an instance selection strategy to effectively utilize the automatically generated annotation. In our selection strategy, an informative and diverse set of instances are selected for effective disambiguation. Lastly, topic modeling is used to model the semantic topics of the articles. These advancements give statistical significant improvement to entity linking individually. Collectively they lead the highest performance on KBP-2010 task.

@inproceedings{zhang2011entity,
  author = {Zhang, Wei and Sim, Yan Chuan and Su, Jian and Tan, Chew Lim},
  title = {Entity Linking with Effective Acronym Expansion, Instance Selection and Topic Modeling},
  booktitle = {Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence - Volume Three},
  series = {IJCAI'11},
  year = {2011},
  isbn = {978-1-57735-515-1},
  location = {Barcelona, Catalonia, Spain},
  pages = {1909--1914},
  numpages = {6},
  url = {http://dx.doi.org/10.5591/978-1-57735-516-8/IJCAI11-319},
  doi = {10.5591/978-1-57735-516-8/IJCAI11-319},
  acmid = {2283721},
  publisher = {AAAI Press},
}

NUS-I2R: Learning a Combined System for Entity Linking

Wei Zhang, Yanchuan Sim, Jian Su, Chew Lim Tan

Text Analysis Conference 2010. Gaithersburg, MD, USA.

abstract | bibtex | pdf

In this paper, we report the joint participation of NUS and I2R team in Knowledge Base Population at Text analysis conference 2010. For Entity Linking, we analyze IR approaches and SVM classification in the disambiguation stage and develop a supervised learner for combining these approaches. The combined system performs better than the individual components and achieves results much better than the median. Furthermore, according to our error analysis, quite some errors are caused due to the different Wikipedia version is used, which hinder our system to show significant better performance.

@inproceedings{zhang2010nus,
  title = {NUS-I2R: Learning a Combined System for Entity Linking},
  author = {Zhang, Wei and Sim, Yan Chuan and Su, Jian and Tan, Chew Lim},
  booktitle = {Proceedings of the 3rd Text Analysis Conference},
  location = {Gaithersburg, Maryland, USA},
  year = {2010},
  abstract = {In this paper, we report the joint participation of NUS and I2R team in Knowledge Base Population at Text analysis conference 2010. For Entity Linking, we analyze IR approaches and SVM classification in the disambiguation stage and develop a supervised learner for combining these approaches. The combined system performs better than the individual components and achieves results much better than the median. Furthermore, according to our error analysis, quite some errors are caused due to the different Wikipedia version is used, which hinder our system to show significant better performance.},
}


code

an assortment of tools and code that I use for my projects.

ark-sage

code repo | javadocs

Ark-SAGE is a Java library that implements the L1-regularized version of Sparse Additive GenerativE models of Text (Eisenstein et al., 2011). SAGE is an algorithm for learning sparse representations of text (you can read more about it here).
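The core idea of SAGE can be sketched in a few lines of Python: each class's word distribution is a shared background log-frequency plus a sparse, class-specific deviation, pushed through a softmax. This is a minimal illustration of the model form only, not Ark-SAGE's Java API and not its fitting procedure; the function name `sage_distribution` and the toy vocabulary are assumptions.

```python
import math


def sage_distribution(background_log_freq, deviation):
    """Return p(w) proportional to exp(m_w + eta_w).

    background_log_freq maps every word to its background log-frequency m_w.
    deviation is the sparse vector eta: words that are absent have eta_w = 0.
    """
    scores = {w: m + deviation.get(w, 0.0) for w, m in background_log_freq.items()}
    log_norm = math.log(sum(math.exp(s) for s in scores.values()))
    return {w: math.exp(s - log_norm) for w, s in scores.items()}


# Hypothetical toy vocabulary with its background distribution.
background = {"the": math.log(0.5), "tax": math.log(0.25), "freedom": math.log(0.25)}

# A single non-zero deviation is enough to describe how this class differs.
probs = sage_distribution(background, {"tax": math.log(2.0)})
```

With an empty deviation the class distribution is exactly the background, which is why a sparsity-inducing penalty such as L1 leaves only the words that genuinely distinguish a class.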

yc-pyutils

code repo | docs

A growing collection of handy utility modules for NLP with Python (mainly data processing, e.g. tokenizing, tf-idf, and building and pruning vocabularies).
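As a rough picture of the kind of processing involved, here is a small standard-library sketch of tokenizing, vocabulary pruning by document frequency, and tf-idf weighting. It does not reproduce the yc-pyutils API; the function names, the `min_df` parameter, and the raw tf times log(N / df) weighting are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split on runs of non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]


def build_vocab(docs, min_df=1):
    """Map each token found in at least min_df documents to its document frequency."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each token once per document
    return {tok: n for tok, n in df.items() if n >= min_df}


def tfidf(docs, vocab):
    """Return one {token: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(docs)
    weighted = []
    for tokens in docs:
        tf = Counter(tok for tok in tokens if tok in vocab)
        weighted.append(
            {tok: count * math.log(n_docs / vocab[tok]) for tok, count in tf.items()}
        )
    return weighted


docs = [tokenize(t) for t in ["The cat sat.", "The dog sat!", "A cat ran"]]
vocab = build_vocab(docs, min_df=2)  # prunes tokens seen in only one document
weights = tfidf(docs, vocab)
```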

yc-config

code repo | javadocs

A Java library built on top of Java Simple Argument Parser (JSAP) that allows the use of a default "configuration file" using the --config-file option.
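The default configuration file pattern can be sketched in Python with the standard argparse module. This is only an analogue of the idea, not yc-config's Java API or JSAP itself; the key=value file format, the helpers `read_config` and `parse_args`, and the `--iterations` and `--output` options are assumptions made for illustration.

```python
import argparse


def read_config(path):
    """Read key=value lines; blank lines and lines starting with # are skipped."""
    values = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            values[key.strip()] = value.strip()
    return values


def parse_args(argv):
    """Parse argv, taking defaults from the file named by --config-file if given."""
    # First pass: look only for --config-file and ignore every other option.
    pre = argparse.ArgumentParser(add_help=False)
    pre.add_argument("--config-file")
    known, _ = pre.parse_known_args(argv)

    # Second pass: the full parser. --iterations and --output are hypothetical.
    parser = argparse.ArgumentParser(parents=[pre])
    parser.add_argument("--iterations", type=int, default=10)
    parser.add_argument("--output", default="out.txt")
    if known.config_file:
        # Values from the file replace the built-in defaults; options given
        # explicitly on the command line still take precedence.
        parser.set_defaults(**read_config(known.config_file))
    return parser.parse_args(argv)
```

Parsing in two passes keeps the precedence simple: built-in defaults are overridden by the configuration file, which is in turn overridden by explicit command-line options.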

yc-make-latex

code repo

A basic Python script that handles compiling LaTeX and related files. It supports LaTeX, gnuplot and eps files, and displays the output on the console in a clean manner.


connect

email is the best way to reach me.

email

ghc 5719

gates-hillman complex
5000 forbes ave
pittsburgh, pa 15213
office address

+1 217-703-4454

phone number
facebook
linkedin
quora
twitter
google+