Introduction

The proposed research will be supported by a data science cyberinfrastructure (CI) platform capable of providing secure, distributed, agile, scalable, and on-demand services. We propose to architect and build a private cloud environment, the Arkansas Research Platform (ARP), and integrate it with existing high-performance computing and petabyte-scale storage resources.

In combination, these will provide:

  1. Libraries of pre-configured containers designed to support a variety of well-known and novel workflows in machine and statistical learning, graph theory, bioinformatics, and geoinformatics
  2. Containers configured for parallel computation and distributed memory on HPC resources for analysis of very large datasets
  3. The ability for researchers to create and share new containers
  4. The ability to stream data to visualization environments, both near to and distant from the computing resources, to aid in analysis and meta-analysis of experiments

Goals

  1. Establish the Arkansas Research Platform as a shared data science resource across the jurisdiction
    • Faculty Leads: Jackson Cothren, Fred Prior
    • Objectives
      • Establish the Arkansas Research Computing Collaborative (ARCC)
      • Upgrade cluster for data science research activity and integrate with existing resources
      • Establish a science DMZ in Little Rock (UAMS, UALR) and a high-speed connection with UAMS
      • Establish a data and code sharing environment (GitHub and Globus)
      • Establish necessary controls to store and manage controlled unclassified, HIPAA-related, and proprietary information at UA and UAMS (other institutions if possible)
  2. Automate heterogeneous data curation
    • Faculty Lead: Jan Springer
    • Objectives
      • Automate Reference Clustering / Automate Data Quality Assessment
      • Automate Data Cleansing
      • Automate Data Integration

Advancing the State of the Knowledge

A special CI advisory board, chaired by James Deaton, executive director of the Great Plains Network, and composed of the CoPIs of the NSF-funded CyberTeam award #1925681, has been formed and will advise on the refinement and management of the Arkansas Research Platform described below. This board will provide external experience in building data science computing platforms and will help coordinate the connection of ARP to the Great Plains Network Research Platform, the Great Plains Augmented Regional Gateway to the Open Science Grid, and on to nationally organized compute and storage resources, complementing existing connections through XSEDE.

A unifying function of the CI is support for the development, optimization, and management of analysis pipelines from each of the research themes. Our preliminary experience with this approach, using existing containerized pipelines for image curation, genomics analysis, and machine learning, has been positive.