Introduction
The overall goal of the Data Life Cycle and Curation Theme is to create unsupervised and scalable methods that significantly increase the level of automation in the data curation process, from acquisition to disposal. The primary curation processes targeted for automation in DART research are data quality management, data integration, and data governance. While many tools are already available for these processes, they all depend on human supervision to work effectively. One of the most common complaints of data scientists is that they spend only 20% of their time on modeling and problem solving and the other 80% on data preparation. The same is true in industry, where most of the effort in data operations is consumed in cleansing, standardizing, and integrating data prior to its actual application in information products. As data volume continues to grow at a rapid rate, the curation process has become a significant bottleneck for data operations, resulting in long delays before data are available for analytics and other data-driven operations.
The Arkansas Science and Technology Plan has three primary objectives:
- To identify opportunities for academic and industrial collaboration
- To align future investments in university research competencies with industry areas of technology focus
- To stimulate improvement in technology skills and talent development
Because the lack of automation in data curation is a problem for both industry and academic research, the theme of automating data curation fits well with the first and second objectives. Collaboration with industry through testing on real-world datasets will be an essential component of the research. Because this is the first university research to focus on this theme, the researchers, including student research assistants at participating schools, will develop high-level skills in data analytics, data governance, and machine learning.
Within Arkansas, the UAMS, UAF, and UALR campuses will be the primary drivers for the research. UAMS will bring to the research the special data curation needs of biomedical informatics, UAF will bring expertise in AI and machine learning, and UALR will bring the industry perspective.
In addition to the internal partners, the preliminary research has already attracted interest from a number of external academic and industry collaborators, including the MIT Chief Data Officer and Information Quality program, PiLog Group, and Noetic Partners. As the research matures, it is likely to attract more collaborators and could result in the development of new open-source and commercial products and generate new business opportunities.
Goals
- Automate heterogeneous data curation
  - Faculty Lead: John Talburt
  - Objectives
    - Automate Reference Clustering / Automate Data Quality Assessment
    - Automate Data Cleansing
    - Automate Data Integration
- Explore secure and private distributed data management
  - Faculty Lead: John Talburt
  - Objectives
    - Build a POC and demo for Positive Data Control (PDC)
- Harmonize multi-organizational and siloed data
  - Faculty Lead: David Ussery
  - Objectives
    - Standardize pipelines for genome and proteome storage, retrieval, and visualization
    - Automate quality scores for biological sequence data
    - Apply machine learning methods to systems biology
Advancing the State of the Knowledge
The three most time-consuming data preparation processes are data cleaning, data integration, and data tracking (data governance). The vision for the research is a “data washing machine.” People are accustomed to throwing their dirty laundry into the washer along with some soap, setting the dials for the type of clothes, and letting the washer operate automatically. A data washing machine would work in a similar manner on dirty data – simply ‘throw in dirty data’, push a button, and out comes ‘clean’ or curated data.
If such machines can be built, the benefits will be enormous. They will revolutionize data operations in research, industry, and government. When data cleaning, data integration, and data governance become unsupervised, automated processes, much more data can be ingested and analyzed, and greater advances in data analytics can be made in less time. At the same time, the improvements in data governance will make enterprise data assets more secure while making them more available and discoverable for authorized users.