Computer Science Lecture Series

Provable Algorithms for Resilient Data Science

Sitan Chen, PhD Candidate, MIT

Mar 18, 2021

Please register here to join us for the CS Colloquium Series 2020-21 via Zoom. Once you register, you will receive a recurring Zoom link. You only need to register once to be able to attend any of the seminars. 


Datasets in the wild are rarely captured by the clean models on which learning theory is built, yet we run our algorithms on them all the same, with no guarantee that their performance will degrade gracefully as we stray from the models originally posited. To complicate matters further, modern ML tasks are typically high-dimensional, and data often arrives in a dynamic, non-stationary fashion. In these settings, designing learning algorithms that work under even the cleanest of models can be challenging, let alone ones that are resilient to the pathologies of real-world data.

This talk will center around developing provable algorithms for handling all of these issues simultaneously. I will instantiate these techniques in the context of regression and contextual bandits, yielding the first efficient and practical algorithms to achieve optimal error guarantees even when a constant fraction of the data has been corrupted. These results provably cannot be obtained using off-the-shelf techniques like Huber regression or least trimmed squares. And importantly, unlike existing guarantees in algorithmic robust statistics, ours do not posit any model for the data-generating process. Our results represent an appealing instance where ideas from theory can lead to genuinely new primitives that perform well on real-world data.

Speaker Bio

Sitan Chen is a fifth-year graduate student at MIT advised by Ankur Moitra. He has been the recipient of an NSF Mathematical Sciences Postdoctoral Research Fellowship, a Paul and Daisy Soros Fellowship, an Akamai Presidential Fellowship, and the Captain Jonathan Fay Prize. His research focuses on building new provable algorithms for fundamental problems in the theory of machine learning, especially in settings where data is heterogeneous or contaminated.


CS Colloquium


Jessica Brenn