EXPERIENCE

Last Updated: July 2025

Summary

Sijia Li (Nancy)

Research data scientist/data analyst with a strong foundation in machine learning, statistical analysis, and data visualization, as well as experience in technology, pharmaceuticals, and economics and innovation research. Excels in transforming complex datasets into actionable insights and innovative solutions. Passionate about solving real-world problems, storytelling through data, and driving key decisions.

Education

M.S. in Data Science

08/2022 - 05/2024

Harvard University; Cambridge, MA, United States

  • Relevant Coursework: Data Science 1 & 2, Machine Learning Operations (MLOps), Natural Language Processing (NLP), Computer Vision (CV), Systems Development, Visualization
  • Teaching Fellow Appointments: Visualization, Critical Thinking in Data Science

BASc. in Industrial Engineering, Minors in Engineering Business & AI Engineering

09/2017 - 06/2022

University of Toronto; Toronto, ON, Canada

  • Relevant Coursework: Probability, Statistics, Data Modelling, Operations Research, Object Oriented Programming, Artificial Intelligence, Algorithms & Numerical Methods, Optimization in Machine Learning
  • Activities: University of Toronto Consulting Association (Development Co-Director, Consulting Group Associate)

Professional Experience

Research Associate

07/2024 - Present

HBS and PCRI; Boston, MA, United States

Project 1: Chinese Innovation in Critical Technologies

  • Developed summary tables and a suite of publication-quality visualizations (e.g., line charts, stacked area charts, histograms) using Matplotlib and Seaborn in Python to show Chinese patenting activity from 1985 to 2023
  • Queried Revelio Labs workforce data using SQL, linked it to assignee-inventor patent data to construct firm-level diffusion panels, and ran OLS regressions showing a strong positive correlation between joiners/inventors and future patenting

Project 2: PCRI Research Database

  • Spearheaded the migration of a proprietary research database of private equity funds and transactions from Stata-MySQL to Python, modernizing core research infrastructure for long-term scalability and maintainability
  • Refactored over 15K lines of Stata code into modular, well-structured Python functions using pandas, NumPy, and other standard Python packages to support efficient and reproducible ETL workflows

Data Engineer Intern

05/2023 - 08/2023

Merck KGaA, EMD Digital; Cambridge, MA, United States

Project: Insufficient Follow-up After Paper Retractions Damages the Scientific Record

  • Created a comprehensive dataset of biomedical articles and retractions by scraping and retrieving data from PubMed, PMC, and iCite
  • Automated ETL pipelines with PySpark in Palantir Foundry, enabling parallel ingestion of 35M+ articles and reducing manual effort by 90%
  • Built interactive dashboards to visualize pre- and post-retraction citation patterns and the impact on scientific publishing

Business Analyst Intern

08/2020 - 05/2021

Sanofi Pasteur; Toronto, ON, Canada

  • Analyzed project financial data ($200M+ CapEx and $60M+ OpEx) to identify root causes of budget-to-actual variances
  • Supported metrics creation in the Project Portfolio Prioritization and Optimization initiative to evaluate, score, and rank projects from various functional areas and to identify effective project selection and prioritization strategies