Statistics and Data Science Seminar: Getting Inference Right with LLM Annotations in the Social Sciences: A General Framework of Using Predicted Variables in Downstream Analysis

Speaker: Brandon Stewart, Princeton University

This event is a joint seminar with the Departments of Sociology and Political Science.

Abstract: Text-as-data methods, including large language models (LLMs), have allowed social scientists to measure a wide range of properties of documents. While such predicted text-based variables are often analyzed as if they were observed without error, we show that ignoring prediction errors leads to substantial bias and invalid confidence intervals in downstream analyses, even when the accuracy of the automated annotation step is high, e.g., above 90%. We propose a design-based supervised learning (DSL) framework that provides valid statistical estimates even when predicted variables contain non-random prediction errors. DSL employs a doubly robust procedure to combine predicted labels with a smaller number of high-quality expert annotations. DSL allows scholars to apply advances in LLMs and natural language processing to social science research while maintaining statistical validity. We illustrate its general applicability using two applications where the outcome and independent variables are text-based. This work is joint with Naoki Egami, Musashi Hinck, and Hanying Wei. I will conclude the talk with a broader view of how we can think about the best use of LLMs in the social sciences.

The talk will take place from noon to 1:15 p.m. in Seigle L006. Lunch will be provided, but please arrive a bit early so we can start the talk on time. The talk will be recorded for faculty who cannot attend in person, and you can also attend online using the Zoom link below.

Zoom link: https://wustl.zoom.us/j/94834393326