Scientific Retractions
Highlight
- Category: Data Engineering
- Year: 2023
- Keywords: Pharmaceuticals, Life Sciences, Technology, Python, SQL, Agile Methodologies, Scrum, Jira
Description
Scientific retraction is the withdrawal of published papers whose credibility is in question. The project
has two objectives: 1) build an automated data pipeline that pulls from three data sources, PubMed, iCite,
and PubMed Central (covering information about articles, citations, and scientific retractions), and
2) develop an interactive dashboard in Foundry that presents the key statistics about scientific retractions.
The pipeline runs from retrieving raw data, through parsing and processing, to generating output
datasets. Ontology objects such as “articles” are built on these output datasets and are used to create
the dashboard. There are three main takeaways. First, looking at the scale and spread of
scientific retractions, around 97% of published papers have a retracted paper in their citation
lineage within 10 generations. Generation 0 is the original retracted paper, Generation 1 is any paper
that cites a Generation 0 paper, and so on. Second, looking at the year-over-year
trend, both the number and the percentage of retracted papers and of Generation 1 papers have increased
over time. Third, looking at the latest changes, the past month still saw additional papers retracted
and additional papers citing retracted papers.
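
The generation labelling described above can be sketched as a breadth-first search over the citation graph. This is a minimal illustration, not the project's actual implementation: the function name `assign_generations`, the `cited_by` mapping, and the toy paper IDs are all hypothetical, and the real pipeline presumably runs on the processed PubMed and iCite datasets rather than an in-memory dictionary.

```python
from collections import deque


def assign_generations(retracted, cited_by, max_generation=10):
    """Label each paper with its shortest citation distance to a retracted paper.

    retracted: iterable of retracted paper IDs (these become Generation 0).
    cited_by: hypothetical mapping {paper_id: [IDs of papers that cite it]}.
    Papers farther than max_generation steps away are left unlabelled.
    """
    generation = {paper_id: 0 for paper_id in retracted}
    queue = deque(generation)
    while queue:
        paper = queue.popleft()
        next_generation = generation[paper] + 1
        if next_generation > max_generation:
            continue
        for citing in cited_by.get(paper, ()):
            # Breadth-first order means the first visit is the shortest path.
            if citing not in generation:
                generation[citing] = next_generation
                queue.append(citing)
    return generation


# Toy data (made up): R1 is retracted; A and B cite R1; C cites A and B;
# D cites C; E cites nothing in the lineage.
cited_by = {"R1": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["D"]}
all_papers = {"R1", "A", "B", "C", "D", "E"}

generations = assign_generations(["R1"], cited_by)
share_in_lineage = len(generations) / len(all_papers)
```

On this toy graph, five of the six papers fall inside the lineage, which is the same kind of share the 97% figure reports. Taking the shortest distance is a design assumption: a paper that cites both a Generation 0 and a Generation 2 paper is counted as Generation 1.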