EXPERIENCE
Last Updated: July 2025
Sijia Li (Nancy)
Summary
Research data scientist/data analyst with a strong foundation in machine learning, statistical analysis, and data visualization, as well as experience in technology, pharmaceuticals, and economics and innovation research. Excels in transforming complex datasets into actionable insights and innovative solutions. Passionate about solving real-world problems, storytelling through data, and driving key decisions.
- Current Location: Boston, MA, United States
- Email: [email protected]
Education
M.S. in Data Science
08/2022 - 05/2024
Harvard University; Cambridge, MA, United States
- Relevant Coursework: Data Science 1 & 2, Machine Learning Operations (MLOps), Natural Language Processing (NLP), Computer Vision (CV), Systems Development, Visualization
- Teaching Fellow Appointments: Visualization, Critical Thinking in Data Science
B.A.Sc. in Industrial Engineering, Minors in Engineering Business & AI Engineering
09/2017 - 06/2022
University of Toronto; Toronto, ON, Canada
- Relevant Coursework: Probability, Statistics, Data Modelling, Operations Research, Object Oriented Programming, Artificial Intelligence, Algorithms & Numerical Methods, Optimization in Machine Learning
- Activities: University of Toronto Consulting Association (Development Co-Director, Consulting Group Associate)
Professional Experience
Research Associate
07/2024 - Present
HBS and PCRI; Boston, MA, United States
Project 1: Chinese Innovation in Critical Technologies
- Developed a suite of publication-quality visualizations (e.g., line charts, stacked area charts, histograms) using Matplotlib and Seaborn in Python, along with summary tables, to show Chinese patenting activity from 1985 to 2023
- Queried Revelio Labs workforce data using SQL, linked it to assignee-inventor patent data to construct firm-level diffusion panels, and ran OLS regressions showing a strong positive correlation between joiners/inventors and future patenting
Project 2: PCRI Research Database
- Spearheaded the migration of a proprietary research database of private equity funds and transactions from Stata-MySQL to Python, modernizing core research infrastructure for long-term scalability and maintainability
- Refactored over 15K lines of Stata code into modular, well-structured Python functions using pandas, NumPy, and other standard Python packages to support efficient and reproducible ETL workflows
Data Engineer Intern
05/2023 - 08/2023
Merck KGaA, EMD Digital; Cambridge, MA, United States
Project: Insufficient Follow-up After Paper Retractions Damages the Scientific Record
- Created a comprehensive dataset of biomedical articles and retractions by scraping and retrieving data from PubMed, PMC, and iCite
- Automated ETL pipelines with PySpark in Palantir Foundry, enabling parallel ingestion of 35M+ articles and reducing manual effort by 90%
- Built interactive dashboards to visualize pre- and post-retraction citation patterns and the impact on scientific publishing
Business Analyst Intern
08/2020 - 05/2021
Sanofi Pasteur; Toronto, ON, Canada
- Analyzed project financial data ($200M+ CapEx and $60M+ OpEx) to identify root causes of budget-to-actual variances
- Supported metrics creation for the Project Portfolio Prioritization and Optimization initiative to evaluate, score, and rank projects from various functional areas, and helped identify effective project selection and prioritization strategies