Research

I have a background in statistics, survey data science, and ML. I earned my Ph.D. from the University of Michigan in 2021. My doctoral research, supervised by Professors Michael R. Elliott and Carol A. C. Flannagan, focused on robust and scalable Bayesian approaches to finite population inference from big unstructured data treated as large-scale non-probability samples (find the full text of my dissertation here). I have authored several papers and released open-source code during my research, which can be found on my Google Scholar and GitHub profiles. The heavy reliance of these methods on flexible prediction modeling led me to deepen my theoretical knowledge and skills in ML, with a particular focus on Bayesian learning, e.g. Bayesian Additive Regression Trees (BART) and Gaussian Processes (GPs), which are not only robust but also permit quantifying prediction uncertainty. Since survey data come with additional complexities in sample design and data structure, I also developed expertise in properly accounting for sampling weights and sampling clusters while propagating model prediction uncertainty when training different ML algorithms.
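
As a toy illustration of the kind of uncertainty-aware prediction these methods enable (a minimal sketch on synthetic data, not code from my dissertation), a GP regressor returns a posterior standard deviation alongside each prediction, which yields pointwise predictive intervals:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy data: noisy observations of a smooth function
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(50, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.2, size=50)

# RBF kernel for the signal plus a white-noise term for observation error
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X, y)

# Posterior mean and standard deviation give predictions together with
# pointwise uncertainty, e.g. approximate 95% intervals:
X_new = np.linspace(0, 10, 100).reshape(-1, 1)
mean, sd = gp.predict(X_new, return_std=True)
lower, upper = mean - 1.96 * sd, mean + 1.96 * sd
```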

It has now been about three years since I started my career at Meta as a Research Scientist, where I have had the chance to apply my knowledge and skills to a variety of business problems. For instance, I helped mitigate sampling error in several on-platform surveys by building automated pipelines that employ Bayesian ML techniques to estimate pseudo-weights for survey respondents. Playing a central role as the team's sampling/weighting expert brought me a wide range of responsibilities, from ad-hoc analyses to feature engineering, data processing, and model optimization, and required extensive collaboration with cross-functional teams as well as external academic partners. My expertise also extends to advanced causal inference methods for addressing complex business hypotheses. At Meta, I used this expertise to conduct a panel study embedded in a large-scale experiment, testing critical business-related theoretical DAGs in the presence of time-varying mediators. Furthermore, I ran graph cluster experiments to validate the formulas used for estimating network effects in user engagement, developing linearization methods to quantify the associated uncertainty under the data's complex correlation structure. Where needed, I also proposed novel tools, e.g. a statistical test for comparing weighted means of partially correlated/overlapping samples and a design-based method for mix-shift analysis, and used Monte Carlo simulation studies to assess their performance.
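
To give a flavor of the pseudo-weighting idea, here is a simplified propensity-based sketch (illustrative only, with hypothetical inputs; production pipelines use more flexible Bayesian learners such as BART in place of the logistic model): stack the non-probability sample on a weighted reference probability sample, model membership in the non-probability sample, and invert the estimated odds to form pseudo-weights.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def pseudo_weights(X_nonprob, X_ref, ref_weights):
    """Propensity-based pseudo-weights for a non-probability sample.

    X_nonprob:   covariates of non-probability respondents
    X_ref:       covariates of a reference probability sample
    ref_weights: design weights of the reference sample
    """
    # Stack the two samples; label non-probability membership
    X = np.vstack([X_nonprob, X_ref])
    z = np.concatenate([np.ones(len(X_nonprob)), np.zeros(len(X_ref))])
    # Weight reference units by their design weights so they
    # represent the target population
    w = np.concatenate([np.ones(len(X_nonprob)), ref_weights])

    model = LogisticRegression().fit(X, z, sample_weight=w)
    p = np.clip(model.predict_proba(X_nonprob)[:, 1], 1e-6, 1 - 1e-6)

    # Inverse-odds pseudo-weights, normalized to the sample size
    pw = (1 - p) / p
    return pw * len(X_nonprob) / pw.sum()
```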
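
And as a rough sketch of the weighted-means test just mentioned (a simplified linearization under with-replacement assumptions, not the exact test I developed): shared respondents induce a covariance term between the two weighted means that a naive two-sample test would ignore.

```python
import numpy as np
from scipy import stats

def overlap_weighted_mean_test(xA, wA, idA, xB, wB, idB):
    """z-test for the difference of two weighted means when the samples
    partially overlap (share some respondents, matched by unique id)."""
    mA = np.average(xA, weights=wA)
    mB = np.average(xB, weights=wB)
    # Linearized influence values of each unit on its weighted mean
    uA = wA * (xA - mA) / wA.sum()
    uB = wB * (xB - mB) / wB.sum()

    varA, varB = (uA**2).sum(), (uB**2).sum()
    # Covariance accumulates only over the overlapping units
    shared = np.intersect1d(idA, idB)
    cov = sum(uA[np.where(idA == s)[0][0]] * uB[np.where(idB == s)[0][0]]
              for s in shared)

    z = (mA - mB) / np.sqrt(varA + varB - 2 * cov)
    return z, 2 * stats.norm.sf(abs(z))  # z-statistic, two-sided p-value
```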

With the advent of transformers and the widespread use of generative models in training LLMs, my attention turned to how I can apply my theoretical knowledge of probability and statistics, as well as my practical experience, to building more responsible, trustworthy, and uncertainty-aware AI systems. This is mainly because the research community has identified a major gap in LLMs: their outputs rarely warn the user about the uncertainty or potential unfairness of a response. This led me to expand my knowledge and skills in deep neural networks, natural language processing, reinforcement learning, and the metrics used for model evaluation in this context. Currently, I am passionate about developing scalable methods that establish a unified framework for mitigating unfairness and propagating uncertainty when training ML models, especially deep generative models, using non-parametric Bayesian methods. You will see more about my research findings, as well as relevant open-source code in this area, in the near future.
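
As a toy example of the kind of uncertainty signal LLM outputs currently lack (assuming access to the model's raw logits; this is a minimal sketch, not one of my methods), per-token predictive entropy can flag low-confidence spans of a generated response:

```python
import torch
import torch.nn.functional as F

def token_entropy(logits: torch.Tensor) -> torch.Tensor:
    """Per-token predictive entropy (in nats) from a language model's
    logits of shape (seq_len, vocab_size); higher entropy marks tokens
    the model is less certain about, which could be surfaced to users."""
    logp = F.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1)

# Usage: given logits from any autoregressive LM forward pass,
# entropy = token_entropy(logits) yields one uncertainty score per token.
```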