Research Paper Library

A curated collection of seminal and recent papers in trustworthy machine learning. Papers are organized by topic and include our commentary on significance and practical implications.

Search & Filter

Use Ctrl+F to search for specific topics, authors, or venues. Papers are tagged with key concepts for easy discovery.

Foundational Papers

Fairness & Bias

Seminal Works

Fairness Through Awareness (Dwork et al., 2012)
ITCS 2012 | individual-fairness awareness
Introduces individual fairness: similar individuals should receive similar outcomes, formalized as a Lipschitz condition on the classifier with respect to a task-specific similarity metric.
Equality of Opportunity in Supervised Learning (Hardt et al., 2016)
NIPS 2016 | group-fairness equalized-odds
Defines equalized odds and equality of opportunity for binary classification.
Fairness Definitions Explained (Verma & Rubin, 2018)
FairWare 2018 | survey fairness-metrics
Comprehensive survey of 20+ fairness definitions with mathematical formulations.
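
The group criteria from Hardt et al. can be checked directly on a model's predictions. The sketch below is a minimal illustration, not the paper's code: equality of opportunity compares true positive rates across groups, and equalized odds additionally compares false positive rates. The function names and toy data are hypothetical.

```python
def group_rates(y_true, y_pred, groups):
    """Return {group: (true_positive_rate, false_positive_rate)}."""
    counts = {}
    for y, p, g in zip(y_true, y_pred, groups):
        c = counts.setdefault(g, {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
        if y == 1:
            c["tp" if p == 1 else "fn"] += 1
        else:
            c["fp" if p == 1 else "tn"] += 1
    rates = {}
    for g, c in counts.items():
        pos = c["tp"] + c["fn"]
        neg = c["fp"] + c["tn"]
        tpr = c["tp"] / pos if pos else float("nan")
        fpr = c["fp"] / neg if neg else float("nan")
        rates[g] = (tpr, fpr)
    return rates


def fairness_gaps(y_true, y_pred, groups):
    """Largest between-group gap in TPR and in FPR."""
    rates = group_rates(y_true, y_pred, groups)
    tprs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return max(tprs) - min(tprs), max(fprs) - min(fprs)


# Toy data (hypothetical): binary labels and predictions for groups "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
tpr_gap, fpr_gap = fairness_gaps(y_true, y_pred, groups)
```

A TPR gap near zero indicates approximate equality of opportunity; equalized odds additionally needs the FPR gap to be near zero.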

Recent Advances

AI Fairness 360: An Extensible Toolkit for Detecting and Mitigating Algorithmic Bias (Bellamy et al., 2019)
IBM Journal of Research and Development 2019 | aif360 toolkit industry
Describes an open-source toolkit of fairness metrics and bias mitigation algorithms designed for use in industrial settings.
An Intersectional Definition of Fairness (Foulds et al., 2020)
ICDE 2020 | intersectionality subgroup-fairness
Proposes differential fairness, a criterion that bounds outcome disparities across all intersections of protected attributes.

Robustness & Adversarial ML

Foundational

Intriguing Properties of Neural Networks (Szegedy et al., 2013)
ICLR 2014 | adversarial-examples discovery
First systematic study of adversarial examples in deep neural networks.
Explaining and Harnessing Adversarial Examples (Goodfellow et al., 2014)
ICLR 2015 | fgsm linear-hypothesis
Introduces the FGSM attack and the linearity hypothesis for adversarial vulnerability.
Towards Deep Learning Models Resistant to Adversarial Attacks (Madry et al., 2017)
ICLR 2018 | pgd adversarial-training
Frames adversarial training as a min-max optimization problem and establishes PGD as a standard first-order attack for both training and evaluation.
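
To make the FGSM and PGD entries concrete, here is a minimal sketch on a logistic regression model, where the input gradient of the cross-entropy loss has the closed form (p - y) * w. The model weights and input are hypothetical toy values. The random start that PGD typically uses is omitted for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Probability of class 1 under a logistic regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def input_gradient(w, b, x, y):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict(w, b, x)
    return [(p - y) * wi for wi in w]


def sign(v):
    return (v > 0) - (v < 0)


def fgsm(w, b, x, y, eps):
    """Single step of size eps along the sign of the input gradient."""
    grad = input_gradient(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]


def pgd(w, b, x, y, eps, step, iters):
    """Repeated small FGSM steps, each projected back into the
    L-infinity ball of radius eps around the original x."""
    x_adv = list(x)
    for _ in range(iters):
        x_adv = fgsm(w, b, x_adv, y, step)
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv


# Toy model and input (hypothetical values); the true label is 1.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1
x_fgsm = fgsm(w, b, x, y, eps=0.5)
x_pgd = pgd(w, b, x, y, eps=0.5, step=0.2, iters=5)
```

On this toy input the clean prediction is above 0.5, and both perturbed inputs push it below 0.5 while staying within the eps budget.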

Certified Defenses

Certified Adversarial Robustness via Randomized Smoothing (Cohen et al., 2019)
ICML 2019 | certified-defense randomized-smoothing
Scalable approach to obtaining robustness certificates using Gaussian noise.
Provably Robust Deep Learning via Adversarially Trained Smoothed Classifiers (Salman et al., 2019)
NeurIPS 2019 | certified-training smoothing
Combines adversarial training with randomized smoothing for stronger guarantees.
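
As a rough illustration of randomized smoothing, the sketch below takes a majority vote of a base classifier under Gaussian noise and converts a lower bound on the top-class probability into an L2 radius, sigma * Phi^{-1}(p_a_lower). It assumes the simplified bound in which the runner-up probability is at most 1 - p_a_lower. The function names are hypothetical, and a real certificate needs a statistically valid lower confidence bound on the top-class probability, not the raw vote fraction.

```python
import random
from collections import Counter
from statistics import NormalDist


def smoothed_vote(base_classifier, x, sigma, n_samples, rng):
    """Monte Carlo estimate of the smoothed classifier: majority vote of
    the base classifier over Gaussian perturbations of x."""
    votes = Counter()
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        votes[base_classifier(noisy)] += 1
    label, count = votes.most_common(1)[0]
    return label, count / n_samples


def certified_radius(p_a_lower, sigma):
    """L2 radius sigma * Phi^{-1}(p_a_lower).

    p_a_lower is assumed to be a valid lower bound, strictly below 1,
    on the probability of the top class. No radius is certified when
    the bound does not exceed 0.5.
    """
    if p_a_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_a_lower)


# Hypothetical base classifier: thresholds the first coordinate at zero.
def base(v):
    return int(v[0] > 0.0)


label, vote_fraction = smoothed_vote(base, [3.0], sigma=0.5, n_samples=200,
                                     rng=random.Random(0))
radius = certified_radius(0.9, sigma=0.5)
```

Larger sigma yields larger radii for the same probability bound, usually at the cost of base classifier accuracy under noise.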

Interpretability & Explainability

Core Methods

"Why Should I Trust You?": Explaining the Predictions of Any Classifier (Ribeiro et al., 2016)
KDD 2016 | lime local-explanations
Introduces LIME (Local Interpretable Model-agnostic Explanations), which explains individual predictions by fitting a simple surrogate model around each one.
A Unified Approach to Interpreting Model Predictions (Lundberg & Lee, 2017)
NIPS 2017 | shap shapley-values
Introduces SHAP, a unified feature attribution framework based on Shapley values from cooperative game theory.
Attention is All You Need (Vaswani et al., 2017)
NIPS 2017 | attention transformers interpretability
Primarily an architecture paper; attention weights are often inspected as a form of explanation, although how faithfully they reflect model reasoning is debated.
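
The Shapley values underlying SHAP can be computed exactly for tiny models by enumerating feature coalitions. The sketch below is not the SHAP library's algorithm, which relies on approximations. It assumes one particular value function, in which features outside a coalition are replaced by a baseline value. The toy model is hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of model(x) relative to model(baseline).

    Features outside a coalition take their baseline value. The cost is
    exponential in len(x), so this only suits very small examples.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return model(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


# Hypothetical toy model with an interaction between features 1 and 2.
def model(z):
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
```

The attributions sum to model(x) - model(baseline), and the interaction term is split evenly between the two features that produce it.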

Evaluation & Benchmarking

Evaluating the Visualization of What a Deep Neural Network Has Learned (Samek et al., 2017)
IEEE TNNLS 2017 | saliency-maps evaluation
Early work on evaluating explanation quality through perturbation analysis.
Sanity Checks for Saliency Maps (Adebayo et al., 2018)
NeurIPS 2018 | sanity-checks saliency-evaluation
Demonstrates that many explanation methods fail basic sanity checks.

Privacy-Preserving ML

Differential Privacy

Deep Learning with Differential Privacy (Abadi et al., 2016)
CCS 2016 | differential-privacy sgd deep-learning
Introduces DP-SGD (per-example gradient clipping plus Gaussian noise) and the moments accountant for tracking privacy loss during deep learning training.
The Algorithmic Foundations of Differential Privacy (Dwork & Roth, 2014)
Foundations and Trends | dp-foundations survey
Comprehensive theoretical foundation of differential privacy.
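
The core update in differentially private SGD can be sketched in a few lines: clip each example's gradient to a fixed L2 norm, sum the clipped gradients, add Gaussian noise scaled to the clipping norm, then average. This is a minimal illustration with hypothetical parameter names. It does not compute the resulting privacy budget, which needs a separate accounting step.

```python
import math
import random


def clip(grad, max_norm):
    """Scale grad so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_step(params, per_example_grads, lr, max_norm, noise_multiplier, rng):
    """One noisy gradient step.

    Each per-example gradient is clipped, the clipped gradients are
    summed, Gaussian noise with standard deviation
    noise_multiplier * max_norm is added to the sum, and the result is
    averaged over the batch before the parameter update.
    """
    batch = len(per_example_grads)
    total = [0.0] * len(params)
    for grad in per_example_grads:
        for i, g in enumerate(clip(grad, max_norm)):
            total[i] += g
    noisy = [t + rng.gauss(0.0, noise_multiplier * max_norm) for t in total]
    return [p - lr * n / batch for p, n in zip(params, noisy)]


# Toy batch (hypothetical): the first gradient has norm 5 and gets clipped.
grads = [[3.0, 4.0], [0.3, 0.4]]
updated = dp_sgd_step([0.0, 0.0], grads, lr=1.0, max_norm=1.0,
                      noise_multiplier=1.1, rng=random.Random(0))
```

Setting noise_multiplier to 0 recovers plain SGD with gradient clipping, which is a convenient way to check the clipping logic in isolation.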

Federated Learning

Communication-Efficient Learning of Deep Networks from Decentralized Data (McMahan et al., 2017)
AISTATS 2017 | federated-learning fedavg
Introduces federated learning and the FedAvg algorithm.
Towards Federated Learning at Scale: System Design (Bonawitz et al., 2019)
MLSys 2019 | federated-systems scale
System design considerations for large-scale federated learning deployment.
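
The server-side aggregation step of FedAvg is a weighted average of client model weights, where each client's weight is proportional to its number of local training examples. The sketch below shows only that step, using flat lists of floats as a stand-in for model parameters. Client sampling and local training are omitted, and the function name is hypothetical.

```python
def fedavg(client_weights, client_sizes):
    """Average client weight vectors, weighting each client by its
    number of local training examples."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            averaged[i] += (size / total) * w
    return averaged


# Two hypothetical clients holding 1 and 3 local examples respectively.
global_weights = fedavg([[1.0, 0.0], [3.0, 4.0]], [1, 3])
```

The client with more data pulls the global model further toward its local weights.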

Recent Research (2020-2022)

Emerging Topics

Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022)
Anthropic | constitutional-ai alignment safety
Novel approach to AI alignment using constitutional principles and AI feedback.
Red Teaming Language Models with Language Models (Perez et al., 2022)
EMNLP 2022 | red-teaming llm-safety automated-testing
Automated red teaming approach for identifying harmful LLM behaviors.
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback (Bai et al., 2022)
Anthropic | rlhf helpfulness harmlessness
Balancing helpfulness and harmlessness in conversational AI systems.

Benchmark Papers

BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation (Dhamala et al., 2021)
FAccT 2021 | bias-benchmarks language-generation
Comprehensive benchmark for measuring bias in text generation models.
RobustBench: a Standardized Adversarial Robustness Benchmark (Croce et al., 2020)
NeurIPS 2021 | robustness-benchmark evaluation
Standardized benchmark and leaderboard for adversarial robustness evaluation.

Paper Collections by Venue

Top-Tier Conferences

FAccT (Fairness, Accountability, Transparency)

FAccT 2024 - Latest fairness and accountability research
FAccT 2023 - Includes algorithmic auditing advances
FAccT 2022 - Focus on intersectionality and bias

ICML/NeurIPS

Focus on theoretical foundations and scalable algorithms
Strong representation in robustness and privacy research
Recent emphasis on LLM safety and alignment

ICLR

Cutting-edge deep learning approaches to trustworthy ML
Novel architectures for interpretable models
Adversarial robustness innovations

Specialized Venues

AIES (AI, Ethics, and Society): Interdisciplinary perspectives
S&P, CCS, USENIX Security: Security and privacy focus
CHI, CSCW: Human-computer interaction and social impacts
AAAI: Broad AI applications and theoretical work

Reading Lists by Course Module

For Assignment 1: Bias Detection

Verma & Rubin (2018) - Fairness definitions overview
Bellamy et al. (2019) - Practical fairness toolkit usage
Choose one: Hardt et al. (2016) for group fairness or Dwork et al. (2012) for individual fairness

For Assignment 2: Adversarial Robustness

Goodfellow et al. (2014) - FGSM and basic concepts
Madry et al. (2017) - PGD and evaluation methodology
Cohen et al. (2019) - Certified defenses introduction

For Midterm Preparation

Core papers from each topic area are marked with ⭐ in the full bibliography.

Contributing

Found an important paper we missed? Submit a suggestion to help keep this library comprehensive and current.
