World-Class Data Science Firm Builds Data Lake and Pipeline Application to Store & Transform Data

Challenge:

To create unique, cutting-edge predictive solutions for its clients, this world-class AI and analytics data science firm needed:

  • A robust Data Lake for terabytes of diverse and complex datasets.
  • A data pipeline application to transform raw (dirty) operational customer data, public data, and third-party data into standardized data sets.

Solution:

  • Using Apache Airflow on AWS as the data pipeline (ETL) orchestration engine, Converge engineers developed modular applications in Python to ingest, cleanse, parse, enrich, and transform raw data and store it in Amazon S3. The transformed data sets, which follow the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM), an OHDSI healthcare standard, would become a commercially offered product for this client.
  • To process these large datasets, each 300GB-500GB in size, our engineers used performance-optimized, vectorized Python with pandas and other advanced data analytics libraries for the heavy data science processing.
  • The first Cloud-based Data Platform (CDP) leverages Databricks (Apache Spark on AWS) to serve transformed data on demand for ad-hoc analysis, hypothesis testing, exploratory data analysis (EDA), derivative data set generation, and machine-learning (ML) models. The Data Lake interoperates across clouds between the AWS data warehouse and GCP BigQuery, all integrated with Tableau for visualization and Immuta as the data-governance core.
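
The modular ingest, cleanse, parse, enrich, and transform stages described above could be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the client's code and not an Airflow DAG: every function name, the sample records, and the gender lookup are hypothetical, and the output columns are simplified stand-ins rather than the actual OMOP CDM schema. In a real deployment each stage would more likely run as a separate orchestrated task over far larger files.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw input: stray whitespace, inconsistent codes, mixed date formats.
RAW_CSV = """patient_id,gender,birth_date
 001 ,F,1980-02-14
002,male,03/09/1975
,M,1990-01-01
"""

# Hypothetical lookup table; real OMOP concept mappings come from OHDSI vocabularies.
GENDER_LABELS = {"f": "FEMALE", "female": "FEMALE", "m": "MALE", "male": "MALE"}


def ingest(text):
    """Read raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Trim whitespace and drop rows that have no patient identifier."""
    cleaned = [{k: (v or "").strip() for k, v in row.items()} for row in rows]
    return [row for row in cleaned if row["patient_id"]]


def parse(rows):
    """Normalize birth dates from either assumed input format to ISO 8601."""
    for row in rows:
        for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
            try:
                parsed = datetime.strptime(row["birth_date"], fmt)
            except ValueError:
                continue  # try the next format; unparseable values stay as-is
            row["birth_date"] = parsed.date().isoformat()
            break
    return rows


def enrich(rows):
    """Attach a standardized gender label from the lookup table."""
    for row in rows:
        row["gender_label"] = GENDER_LABELS.get(row["gender"].lower(), "UNKNOWN")
    return rows


def transform(rows):
    """Project each row onto the simplified, standardized output layout."""
    return [
        {
            "person_id": int(row["patient_id"]),
            "gender": row["gender_label"],
            "birth_date": row["birth_date"],
        }
        for row in rows
    ]


# Stages are plain functions, so they can be reordered, tested, or replaced independently.
STAGES = [cleanse, parse, enrich, transform]


def run_pipeline(text):
    """Ingest the raw text, then apply each stage in order."""
    rows = ingest(text)
    for stage in STAGES:
        rows = stage(rows)
    return rows


standardized = run_pipeline(RAW_CSV)
```

Keeping each stage as a small function with the same rows-in, rows-out shape is one way to get the modularity the solution describes: a stage can be unit-tested on its own and later wrapped as an individual task in whichever orchestrator is in use.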

Results:

  • This technically capable AI and analytics data science firm found in Converge a partner able to complement its strengths and work alongside its team as equals. Our skilled and experienced consulting team met the stringent requirements of a hedge-fund-backed AI start-up in a fast-paced, emerging, high-value data science marketplace.