Measurement and Representation Biases in Digital Trace Data-based Studies
This reading-based seminar will cover the latest research that uses digital trace data from web and social media platforms such as Facebook, Instagram, and Wikipedia to measure social phenomena such as political attitudes and health behaviours. Through readings and discussions, we will examine how representation and measurement errors can creep into research studies that combine this type of data with large-scale computational and data-driven models. The course will also cover methods to quantify and mitigate these errors and demonstrate how to design valid and reliable research studies.
The course assessment will be based on a presentation and final report on a chosen reading (60%) and on weekly critiques of other papers (40%). A bonus of 10% is available for implementing part of one or more of the discussed papers.
Schedule and Assigned Readings
April 10th Introduction and kickoff
April 24th How to read and review a research paper AND overview of research with digital traces
- Keshav, Srinivasan. “How to read a paper.” ACM SIGCOMM Computer Communication Review 37.3 (2007): 83-84.
- Pain, Elisabeth. “How to review a paper.”
May 8th Social data biases
- Olteanu, Alexandra, et al. “Social data: Biases, methodological pitfalls, and ethical boundaries.” Frontiers in big data 2 (2019).
May 15th Measurement and Representation Errors
- Groves, Robert M., and Lars Lyberg. “Total survey error: Past, present, and future.” Public opinion quarterly 74.5 (2010).
- Sen, Indira, et al. “A total error framework for digital traces of human behavior on online platforms.” Public Opinion Quarterly 85.S1 (2021).
May 22nd guest presentation
June 5th student presentation
June 12th guest presentation [Max Pellert]
June 19th student presentation
June 26th guest presentation
July 3rd student presentation
July 10th guest presentation [Giordano de Marzo]
July 17th student presentation
July 24th guest presentation
July 31st student presentation
August 7th student presentation
Suggested Readings
If you can’t access the full text of any of these, email me for a copy.
Construct definition
Ruths, Derek, and Jürgen Pfeffer. “Social media for large studies of behavior.” Science 346.6213 (2014): 1063-1064.
Blodgett, Su Lin, et al. “Language (Technology) is Power: A Critical Survey of ‘Bias’ in NLP.” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.
Wagner, Claudia, et al. “Measuring algorithmically infused societies.” Nature 595.7866 (2021): 197-204.
Platform Effects
Malik, Momin, and Jürgen Pfeffer. “Identifying platform effects in social media data.” Proceedings of the International AAAI Conference on Web and Social Media. Vol. 10. No. 1. 2016.
Gligorić, Kristina, Ashton Anderson, and Robert West. “How constraints affect content: The case of Twitter’s switch from 140 to 280 characters.” Proceedings of the International AAAI Conference on Web and Social Media. Vol. 12. No. 1. 2018.
Arazy, Ofer, et al. “Information quality in Wikipedia: The effects of group composition and task conflict.” Journal of management information systems 27.4 (2011): 71-98.
Data Collection
Zafar, Muhammad Bilal, et al. “Sampling content from online social networks: Comparing random vs. expert sampling of the twitter stream.” ACM Transactions on the Web (TWEB) 9.3 (2015): 1-33.
Gaffney, Devin, and J. Nathan Matias. “Caveat emptor, computational social science: Large-scale missing data in a widely-published Reddit corpus.” PloS one 13.7 (2018): e0200162.
Pfeffer, Juergen, et al. “This Sample seems to be good enough! Assessing Coverage and Temporal Reliability of Twitter’s Academic API.” Proceedings of the International AAAI Conference on Web and Social Media. Vol. 17. 2023.
Data Preprocessing and Modeling
Zagheni, Emilio, and Ingmar Weber. “Demographic research with non-representative internet data.” International Journal of Manpower 36.1 (2015): 13-25.
Culotta, Aron. “Reducing sampling bias in social media data for county health inference.” Joint Statistical Meetings Proceedings. Citeseer, 2014.
Jurgens, David, et al. “Geolocation prediction in twitter using social networks: A critical analysis and review of current practice.” Proceedings of the international AAAI conference on web and social media. Vol. 9. No. 1. 2015.
Cohen, Raviv, and Derek Ruths. “Classifying political orientation on Twitter: It’s not easy!.” Proceedings of the International AAAI Conference on Web and Social Media. Vol. 7. No. 1. 2013.
Fleisig, Eve, Rediet Abebe, and Dan Klein. “When the majority is wrong: Leveraging annotator disagreement for subjective tasks.” arXiv preprint arXiv:2305.06626 (2023).
Lucy, Li, and David Bamman. “Gender and representation bias in GPT-3 generated stories.” Proceedings of the third workshop on narrative understanding. 2021.