Computational social scientist Moritz Laurer demonstrates how instruction-based language models can overcome limitations of older machine learning techniques for text classification. Laurer shows how algorithms can learn to categorize texts with less training data, more accurately across multiple languages, and in a less biased manner. In doing so, these models increase the validity, robustness and data efficiency of text-based measurement.
Summary of findings
Moritz Laurer shows that this type of model can reduce the required training data by a factor of ten compared to previous algorithms while achieving the same level of performance across eight tasks. He demonstrates that these models need fewer than 2,000 examples in two languages to create valid measurements across eight other languages and ten other countries. The models are also more robust against group-specific biases: in experiments across nine groups from four datasets, their average test-set performance decreases only marginally when they are trained on biased data. Finally, he explains how these models can act as universal classifiers that learn any number of classification tasks simultaneously, as tested across 33 datasets with 389 classes.
Relevance and supervised machine learning
From millions of social media posts to decades of legal text, more and more relevant information is hidden in digital text corpora that are too large for manual analysis. The key promise of machine learning ("Artificial Intelligence") is to automate parts of the manual analysis process.
One popular method is supervised machine learning for text classification, where a model is trained on examples of manually categorized texts and learns to identify these categories in new texts. Computational social scientists have used this method to create measurements of concepts such as emotions, topics or stances at scale. While measurement with supervised machine learning is established in the social science literature, there are important limitations that reduce the usefulness of established methods for many practical applications.
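To make the idea of supervised text classification concrete, the following is a minimal sketch in Python of a bag-of-words naive Bayes classifier, which stands in here for the older, established methods. It is not one of Laurer's instruction-based models, and the example texts, the labels "economy" and "health", and the function names are all hypothetical illustrations.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Very simple tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def train(examples):
    # examples: list of (text, label) pairs that were categorized manually.
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict(model, text):
    # Return the label with the highest log-probability for a new text.
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in label_counts.items():
        score = math.log(count / total)
        # Add-one smoothing so unseen word/label pairs do not zero out the score.
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical, hand-labelled training examples.
TRAIN = [
    ("taxes should be lowered for families", "economy"),
    ("the budget deficit keeps growing", "economy"),
    ("new hospitals need more nurses", "health"),
    ("vaccines protect children from disease", "health"),
]

model = train(TRAIN)
print(predict(model, "the budget needs lower taxes"))
print(predict(model, "nurses and vaccines"))
```

A classifier like this only recognises words it has seen in its training examples, which is one reason such methods tend to need large amounts of labelled data.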
Limitations of established methods
First, these methods require large amounts of balanced training data to work well. Researchers, however, often have only limited resources for creating training data and need to tailor new data to each new research question. Second, older algorithms struggle with multilingual data, whereas researchers need measurements that are equally valid across cultures and languages. Third, established methods are susceptible to learning shortcuts and biased patterns from their training data, which reduces the validity of measurements across social groups. Fourth, they can be difficult to use, making them accessible only to specialised researchers.
Moritz Laurer’s research shows how instruction-based language models can help overcome these limitations.
The models he developed during his PhD research have been downloaded more than 65 million times and are freely available at: https://huggingface.co/MoritzLaurer.
More information on the thesis.