Scientific Retractions

Highlight

Category: Data Engineering
Year: 2023
Keywords: Pharmaceuticals, Life Sciences, Technology, Python, SQL, Agile Methodologies, Scrum, Jira

Description

Scientific retraction is the process of withdrawing papers whose credibility is in question. The objectives are: 1) build an automated data pipeline that pulls from three data sources, PubMed, iCite, and PubMed Central (covering information about articles, citations, and scientific retractions), and 2) develop an interactive dashboard in Foundry that presents the key statistics about scientific retractions.

The data pipeline retrieves the raw data, parses and processes it, and generates output datasets. Ontology objects such as “articles” are built from these output datasets and are used to create the interactive dashboard.

There are three main takeaways. First, on the scale and spread of scientific retractions, around 97% of the papers published have a retracted paper in their citation lineage within 10 generations. Generation 0 is the original retracted paper, Generation 1 is a paper that cites a Generation 0 paper, and so on. Second, on the year-over-year trend, both the percentages and the numbers of retracted and Generation 1 papers have increased over time. Third, on the latest changes, papers were still being retracted in the past month, and more papers cited retracted papers.
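
The generation numbering can be illustrated with a breadth-first search over the citation graph. This is a minimal sketch, not the project's actual code: the function name `assign_generations`, the `cited_by` mapping, and the toy paper IDs are all hypothetical, and it assumes that a paper reachable through several citation paths takes the smallest generation number.

```python
from collections import deque


def assign_generations(cited_by, retracted, max_generation=10):
    """Return {paper_id: generation} using breadth-first search.

    cited_by maps a paper ID to the IDs of the papers that cite it
    (hypothetical structure; real data would come from the pipeline).
    Retracted papers are Generation 0, papers citing them are
    Generation 1, and so on, up to max_generation.
    """
    generation = {paper: 0 for paper in retracted}
    queue = deque(retracted)
    while queue:
        paper = queue.popleft()
        depth = generation[paper]
        if depth == max_generation:
            continue  # do not expand beyond the last generation
        for citing in cited_by.get(paper, ()):
            # Assumption: first visit wins, so each paper keeps its
            # smallest generation number.
            if citing not in generation:
                generation[citing] = depth + 1
                queue.append(citing)
    return generation


# Toy citation graph (invented IDs): "R1" is retracted, "E" cites nothing affected.
cited_by = {
    "R1": ["A", "B"],
    "A": ["C"],
    "B": ["C"],
    "C": ["D"],
}
all_papers = {"R1", "A", "B", "C", "D", "E"}

generations = assign_generations(cited_by, retracted=["R1"])
# Share of papers with a retracted paper in their citation lineage.
share_affected = len(generations) / len(all_papers)
```

In this toy graph, five of the six papers fall within the lineage of the retracted paper, which is the same kind of ratio as the headline percentage above, computed here on invented data.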