Data Curation and SARS-CoV-2: population genomics of 2.5 million genomes

COVID-19 has become a global pandemic, and Arkansas has recently seen a dramatic increase in the number of cases, mainly due to a new variant (“Delta”). A population genomics approach shows that we are in a third wave, with the current Delta variant accounting for about 83% of the strains sequenced. It was preceded by the Alpha variant, which peaked in March of this year (2021), and by another, less well characterized variant (Janus), which peaked in September 2020. Each of these variants has become better adapted for infecting and spreading within the human population.

First Steps toward a Data Washing Machine

Data has a life cycle that runs from planning through acquisition, cleansing, storage and sharing, integration, and application to disposal. While AI and machine learning have taken the application of data to new levels, the other phases remain largely manual processes. The research goal of the Data Life Cycle and Curation thrust is to develop fully automated processes for those other phases. Today's presentation describes some of the progress made toward automating the data cleansing and data integration phases.