Machine Learning Engineer

“Machine Learning Engineers are responsible for integrating ML assets into products”

Viewpoint

Machine Learning Engineers are responsible for taking data- and math-centric assets such as ML models and integrating them into production-ready products. The role encompasses all the challenges of conventional asset development, and must also resolve the problems of versioning very large data sets and efficiently moving large volumes of data onto high-performance computing infrastructure for training. ML models are typically high-risk assets: they are trained on large volumes of sensitive data and go on to make decisions, so they carry increased governance and compliance requirements that must be met to maintain safety and privacy.

View

As a Machine Learning Engineer, I want to be able to manage machine learning models safely and reliably in production. I need to be able to access large stores of structured or unstructured data, and to clean, validate and separate that data into versioned training and testing data sets.
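One way to make the training and testing sets reproducible and versioned is to assign each record to a split from a hash of its identifier, and to label each split with a hash of its contents. The sketch below is illustrative only: the function names (`assign_split`, `dataset_version`, `split_dataset`), the `id` field and the 20% test fraction are all assumptions, not part of any particular tool.

```python
import hashlib
import json


def assign_split(record_id: str, test_fraction: float = 0.2) -> str:
    """Assign a record to 'train' or 'test' from a hash of its ID.

    The same ID always lands in the same split, so re-running the
    pipeline on a grown data set never moves old records between splits.
    """
    digest = hashlib.sha256(record_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if bucket < test_fraction else "train"


def dataset_version(records: list[dict]) -> str:
    """Return a short content hash that does not depend on record order."""
    lines = sorted(json.dumps(r, sort_keys=True) for r in records)
    h = hashlib.sha256()
    for line in lines:
        h.update(line.encode("utf-8"))
        h.update(b"\n")
    return h.hexdigest()[:12]


def split_dataset(records: list[dict], id_key: str = "id",
                  test_fraction: float = 0.2) -> dict:
    """Split records and attach a version label to each split."""
    splits = {"train": [], "test": []}
    for record in records:
        name = assign_split(str(record[id_key]), test_fraction)
        splits[name].append(record)
    return {
        name: {"version": dataset_version(rows), "records": rows}
        for name, rows in splits.items()
    }
```

Because the version label is derived from content, any change to a record produces a new label, which gives later training runs an unambiguous reference to the exact data they used.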

I need to be able to transfer this data to a cloud facility or dedicated high-performance computing environment where I can leverage large numbers of GPU or TPU resources to process the data and produce trained models.

I would like to be able to version all training data and trained models so that I can run iterative comparisons and regression tests on newly trained models.
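A regression test between model versions can be as simple as comparing stored evaluation metrics. The following is a minimal sketch that assumes every metric is "higher is better" and that metrics are kept as plain dictionaries keyed by name; `regression_check` and the `tolerance` parameter are hypothetical names chosen for illustration.

```python
def regression_check(baseline: dict, candidate: dict,
                     tolerance: float = 0.01) -> list[str]:
    """List the metrics on which the candidate model regressed.

    Assumes higher values are better. A metric counts as a regression
    if the candidate is missing it or falls more than `tolerance`
    below the baseline value.
    """
    regressions = []
    for name, base_value in baseline.items():
        new_value = candidate.get(name)
        if new_value is None or new_value < base_value - tolerance:
            regressions.append(name)
    return regressions


# Example: metrics recorded for two versioned models.
baseline_metrics = {"accuracy": 0.91, "recall": 0.85}
candidate_metrics = {"accuracy": 0.92, "recall": 0.80}
failed = regression_check(baseline_metrics, candidate_metrics)
```

Here `failed` contains only `"recall"`, so a release pipeline could block promotion of the candidate until that metric is investigated.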

As part of the training process, I need to be able to perform integration and acceptance testing on the model produced.

I must be able to maintain a forensic audit trail of all training and deployment activities to support compliance. I need models to integrate seamlessly into the asset management and release process for the product of which they are a part.
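One possible shape for a tamper-evident audit trail is a hash-chained log, in which each entry records the hash of the entry before it. This is a sketch under stated assumptions, not a compliance-grade implementation: the field names, the in-memory list and the functions `append_event` and `verify_log` are all hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first entry


def _hash_body(body: dict) -> str:
    """Hash an entry body using a stable JSON encoding."""
    encoded = json.dumps(body, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_event(log: list[dict], actor: str, action: str,
                 details: dict) -> dict:
    """Append a training or deployment event, chained to the previous one."""
    body = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,
        "action": action,
        "details": details,
        "prev_hash": log[-1]["hash"] if log else GENESIS_HASH,
    }
    entry = {**body, "hash": _hash_body(body)}
    log.append(entry)
    return entry


def verify_log(log: list[dict]) -> bool:
    """Return True only if no entry was altered, removed or reordered."""
    prev_hash = GENESIS_HASH
    for entry in log:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev_hash or entry["hash"] != _hash_body(body):
            return False
        prev_hash = entry["hash"]
    return True
```

Editing any earlier entry changes its hash and breaks the chain for every entry after it, so `verify_log` detects the change. In practice the log would also need durable, access-controlled storage, which this sketch leaves out.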

I must be able to efficiently manage the utilization of expensive compute resources.

Value Add from Continuous Delivery

  • Minimizing risk associated with ML
  • Reduced lead times in delivering new capabilities
  • Reduced time to restore from failure
  • Reduced change failure rates
  • Increased deployment frequency
  • Automated testing
  • Automated deployment
  • Regulatory compliance
Last modified September 12, 2022: Added remaining Views and Viewpoints (b3bfaac)