Measurement and Representation Biases in Digital Trace Data-based Studies
This reading-based seminar covers recent research that uses digital trace data from web and social media platforms such as Facebook, Instagram, and Wikipedia to measure social phenomena such as political attitudes and health behaviours. Through readings and discussions, we will examine how representation and measurement errors can creep into studies that combine this type of data with large-scale computational and data-driven models. The course will also cover methods to quantify and mitigate these errors and demonstrate how to design valid and reliable research studies.
Assessment is based on a presentation and final report on a chosen reading (60%) and on weekly critiques of other papers (40%). A bonus of 10% is available for implementing part of one or more of the discussed papers.
Schedule and Assigned Readings
April 10th Introduction and kickoff
April 24th How to read and review a research paper, and an overview of research with digital traces
- Keshav, Srinivasan. “How to read a paper.” ACM SIGCOMM Computer Communication Review 37.3 (2007): 83-84.
- Pain, Elisabeth. “How to review a paper.”
May 8th + 15th Social data biases
- Olteanu, Alexandra, et al. “Social data: Biases, methodological pitfalls, and ethical boundaries.” Frontiers in Big Data 2 (2019).
- Slides from a related tutorial by Olteanu and colleagues
May 22nd Measurement and Representation Errors
- Groves, Robert M., and Lars Lyberg. “Total survey error: Past, present, and future.” Public Opinion Quarterly 74.5 (2010).
- Sen, Indira, et al. “A total error framework for digital traces of human behavior on online platforms.” Public Opinion Quarterly 85.S1 (2021).
Jun 5 Guest Presentation by Max Pellert
- Pellert, Max, et al. “Validating daily social media macroscopes of emotions.” Scientific reports 12.1 (2022): 11236.
Jun 12 Guest Presentation by Manuel Tonneau
- Tonneau, Manuel, et al. “From Languages to Geographies: Towards Evaluating Cultural Bias in Hate Speech Datasets.” Workshop on Online Abuse and Harms (WOAH), co-located with the North American Chapter of the Association for Computational Linguistics (NAACL) (2024).
Jun 19 Presentation by Leonard Tiedemann
- Jaidka, Kokil, Alvin Zhou, and Yphtach Lelkes. “Brevity is the soul of Twitter: The constraint affordance and political discussion.” Journal of Communication 69.4 (2019): 345-372.
Jun 26 Guest presentation by Nils Feldhus
- Feldhus, Nils, et al. “InterroLang: Exploring NLP Models and Datasets through Dialogue-based Explanations.” Findings of the Association for Computational Linguistics: EMNLP 2023.
- Schmitt, Vera, et al. “The Role of Explainability in Collaborative Human-AI Disinformation Detection.” The 2024 ACM Conference on Fairness, Accountability, and Transparency. 2024.
Jul 3 Guest presentation by Katrin Weller
- Breuer, Johannes, Katrin Weller, and Katharina Kinder-Kurlanda. 2023. The Role of Participants in Online Privacy Research: Ethical and Practical Considerations. In The Routledge Handbook of Privacy and Social Media, edited by Sabine Trepte and Philipp K. Masur, 314-323. Routledge.
- Kinder-Kurlanda, Katharina E., and Katrin Weller. 2020. “Perspective: Acknowledging data work in the social media research lifecycle.” Frontiers in Big Data 3 (509954).
Jul 10 Guest presentation by Giordano de Marzo
- De Marzo, Giordano, Luciano Pietronero, and David Garcia. “Emergence of scale-free networks in social interactions among large language models.” arXiv preprint arXiv:2312.06619 (2023).
Background on LLMs and Social Simulations:
- Riedl, Mark. “A Very Gentle Introduction to Large Language Models without the Hype.”
- Park et al. “Social Simulacra: Creating Populated Prototypes for Social Computing Systems.”
Jul 17 Presentation by Theresa Wagner
- Lucy, Li, and David Bamman. “Gender and representation bias in GPT-3 generated stories.” Proceedings of the third workshop on narrative understanding. 2021.
Jul 24 Guest presentation by Jessica Daikeler
- Daikeler, Jessica, et al. “Assessing data quality in the age of digital social research: A systematic review.” Social Science Computer Review (2024): 08944393241245395.
Jul 31 Presentation by Peer Saleth
- Lasser, Jana, et al. “From alternative conceptions of honesty to alternative facts in communications by US politicians.” Nature human behaviour 7.12 (2023): 2140-2151.
Aug 7 Wrap-up
Aug 15 Final reports due
Suggested Readings
If you can’t access the full text of any of these, email me for a copy.
Construct definition
Ruths, Derek, and Jürgen Pfeffer. “Social media for large studies of behavior.” Science 346.6213 (2014): 1063-1064.
Blodgett, Su Lin, et al. “Language (Technology) is Power: A Critical Survey of ‘Bias’ in NLP.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
Wagner, Claudia, et al. “Measuring algorithmically infused societies.” Nature 595.7866 (2021): 197-204.
Platform Effects
Malik, Momin, and Jürgen Pfeffer. “Identifying platform effects in social media data.” Proceedings of the International AAAI Conference on Web and Social Media. Vol. 10. No. 1. 2016.
Gligorić, Kristina, Ashton Anderson, and Robert West. “How constraints affect content: The case of Twitter’s switch from 140 to 280 characters.” Proceedings of the International AAAI Conference on Web and Social Media. Vol. 12. No. 1. 2018.
Arazy, Ofer, et al. “Information quality in Wikipedia: The effects of group composition and task conflict.” Journal of management information systems 27.4 (2011): 71-98.
Data Collection
Zafar, Muhammad Bilal, et al. “Sampling content from online social networks: Comparing random vs. expert sampling of the twitter stream.” ACM Transactions on the Web (TWEB) 9.3 (2015): 1-33.
Gaffney, Devin, and J. Nathan Matias. “Caveat emptor, computational social science: Large-scale missing data in a widely-published Reddit corpus.” PloS one 13.7 (2018): e0200162.
Pfeffer, Juergen, et al. “This Sample seems to be good enough! Assessing Coverage and Temporal Reliability of Twitter’s Academic API.” Proceedings of the International AAAI Conference on Web and Social Media. Vol. 17. 2023.
Data Preprocessing and Modeling
Zagheni, Emilio, and Ingmar Weber. “Demographic research with non-representative internet data.” International Journal of Manpower 36.1 (2015): 13-25.
Culotta, Aron. “Reducing sampling bias in social media data for county health inference.” Joint Statistical Meetings Proceedings. Citeseer, 2014.
Jurgens, David, et al. “Geolocation prediction in twitter using social networks: A critical analysis and review of current practice.” Proceedings of the international AAAI conference on web and social media. Vol. 9. No. 1. 2015.
Cohen, Raviv, and Derek Ruths. “Classifying political orientation on Twitter: It’s not easy!.” Proceedings of the International AAAI Conference on Web and Social Media. Vol. 7. No. 1. 2013.
Fleisig, Eve, Rediet Abebe, and Dan Klein. “When the majority is wrong: Leveraging annotator disagreement for subjective tasks.” arXiv preprint arXiv:2305.06626 (2023).
Lucy, Li, and David Bamman. “Gender and representation bias in GPT-3 generated stories.” Proceedings of the third workshop on narrative understanding. 2021.