Brief Summary
This YouTube video features a presentation on MLOps by Abined Gupta, an applied machine learning scientist at Shopify. The presentation covers the critical aspects of MLOps, emphasizing that ML code is just a small part of a larger ML system. Gupta discusses the four pillars of an ML system: data, training and experiment tracking, serving, and monitoring. He uses the example of predicting fraudulent credit card transactions to illustrate key concepts and challenges in building and deploying ML systems in production. The presentation highlights the importance of considering scale, latency, throughput, and data drift when designing and implementing MLOps pipelines.
- ML code is only a small part of an ML system
- MLOps is the infrastructure and processes that turn an ML model into a reliable production system
- Ask questions early about the model's scale, latency, and throughput requirements
Introduction
Nishant from the Consulting and Analytics Club, IIT Guwahati, introduces Abined Gupta as the keynote speaker for a session on MLOps. Abined Gupta has over 10 years of experience in data science, risk forecasting, and statistical modeling. He currently works as an applied machine learning scientist at Shopify, focusing on building robust and scalable machine learning systems. Gupta expresses his intention to make the session interactive, encouraging audience participation through questions and feedback.
What is MLOps?
MLOps, or ML Operations, is a set of tools and practices used to operationalize machine learning models, ensuring they can be reliably deployed in production. While courses often focus on the ML model itself, MLOps encompasses the broader system required to serve these models to users. An ML system includes configuration, data collection and verification, model serving processes, and performance monitoring to determine when retraining or model changes are necessary. ML code is a small part of the overall system, with most time spent on other components.
ML System Components
The core of an ML system is the ML code, but it relies on several other components. Configuration involves setting up machines, features, and hyperparameters. Data aspects include collection, feature extraction, and verification. Model serving requires processes for user access within specific requirements. Monitoring tracks model performance to decide when to retrain or modify the model.
Motivating Problem: Fraud Detection
The presentation uses the problem of predicting fraudulent credit card transactions as a motivating example. The goal is to determine whether a transaction is fraudulent and decline it, or approve it if it is legitimate. This problem serves as a through-line to illustrate how to build an ML system around ML code.
MLOps Pipeline: Data, Training, Serving, Monitoring
The MLOps pipeline is broken down into four main pillars: data, training and experiment tracking, serving, and monitoring. The ML code component fits into the training pillar. This structure serves as a roadmap for the presentation, with each section focusing on one of these pillars.
ML Code: Model Selection
The audience is asked to suggest suitable ML models for predicting fraud in credit card transactions. Common suggestions include logistic regression, random forests, and neural networks. Logistic regression is a typical first choice for classification problems. Random forests and XGBoost are also good options, with XGBoost offering faster training and better performance due to optimizations and parallelizations.
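The comparison above can be sketched on synthetic data. This is an illustrative example, not from the talk: the dataset, features, and labels are fabricated, and scikit-learn's gradient-boosted trees stand in for XGBoost, which is used the same way.

```python
# Illustrative baseline comparison on synthetic "transaction" data.
# All data here is fabricated; XGBoost's XGBClassifier would slot in
# the same way as GradientBoostingClassifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 5))
# Synthetic minority-class label driven by two of the features.
y = ((X[:, 0] + 2 * X[:, 1] + rng.normal(size=n)) > 3.0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)

lr = LogisticRegression().fit(X_tr, y_tr)
gb = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

print("logistic regression test accuracy:", lr.score(X_te, y_te))
print("gradient-boosted trees accuracy:  ", gb.score(X_te, y_te))
```

In practice the choice between these models is driven less by raw accuracy than by the latency and throughput constraints discussed later in the talk.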
Data: Labels and Features
The discussion shifts to the data required for training the fraud detection model, focusing on labels and features. Labels are derived from chargebacks, specifically the reason codes indicating suspected fraud. Features include raw data from transactions (amount, location, bank name) and engineered features (hour of the day, distance between merchant and buyer, average spending in the last 10 days). The scale of the data can be very large, involving billions of transactions, which necessitates distributed computing.
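A minimal sketch of how engineered features are derived from raw transaction fields; the column names and values are illustrative assumptions, not data from the talk.

```python
# Sketch: deriving engineered features (hour of day, night-time flag)
# from raw transaction fields. Column names are illustrative.
import pandas as pd

txns = pd.DataFrame({
    "amount": [25.0, 980.0, 12.5],
    "timestamp": pd.to_datetime(
        ["2024-01-01 02:15", "2024-01-01 14:40", "2024-01-02 23:05"]),
    "merchant": ["grocer", "electronics", "cafe"],
})

# Raw features pass through as-is; engineered features are derived.
txns["hour_of_day"] = txns["timestamp"].dt.hour
txns["is_night"] = txns["hour_of_day"].isin(range(0, 6))
print(txns[["amount", "hour_of_day", "is_night"]])
```

At billions of rows, the same transformations would run on a distributed engine (e.g. Spark) rather than a single pandas process.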
Quantifying Typical Behavior
Quantifying typical behavior is crucial for identifying anomalous transactions. This involves engineering features that capture a buyer's or merchant's historical transaction patterns. Window aggregation features are used, which consider a time window of history and engineer features based on that window. These features have four dimensions: aggregate (average, sum, count, percentile), entity (buyer, merchant), filter (clothing stores, successful transactions), and window (last 30 days). The feature space grows rapidly as these dimensions multiply, requiring significant computational resources.
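One cell of that feature space can be sketched with a pandas trailing time window: the entity is the buyer, the aggregate is the mean, and the window is 10 days. The data and column names are illustrative assumptions.

```python
# Sketch of a window-aggregation feature: per-buyer trailing 10-day
# average spend. Data is fabricated for illustration.
import pandas as pd

txns = pd.DataFrame({
    "buyer_id": ["a", "a", "a", "b"],
    "timestamp": pd.to_datetime(
        ["2024-01-01", "2024-01-05", "2024-01-20", "2024-01-03"]),
    "amount": [10.0, 30.0, 100.0, 50.0],
}).sort_values("timestamp")

# Group by the entity (buyer), then aggregate (mean) over the window
# (trailing 10 days, current row included).
feat = (txns.set_index("timestamp")
            .groupby("buyer_id")["amount"]
            .rolling("10D").mean()
            .rename("avg_spend_10d"))
print(feat)
```

Swapping the aggregate (`sum`, `count`, `quantile`), the window (`"30D"`), or adding a pre-filter multiplies the number of features, which is why the space grows so quickly.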
Data ETL
Data ETL (Extract, Transform, Load) is a critical MLOps process. Data is extracted from various sources like databases and CSV files, then transformed to ensure consistency (e.g., converting amounts to USD, timestamps to UTC) and to create features. Finally, the transformed data is loaded into storage solutions like S3 or data warehouses like Snowflake, making it accessible for model training.
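The three ETL stages can be sketched end to end. Everything here is an illustrative assumption: the CSV source stands in for a production database, a dict stands in for S3 or Snowflake, and the FX rates are made up.

```python
# Minimal ETL sketch: extract from a CSV source, normalize currency
# (to USD) and timezone (to UTC) in the transform step, then load
# into a stand-in warehouse. Rates and columns are illustrative.
import io
import pandas as pd

RAW_CSV = """amount,currency,ts
100,EUR,2024-01-01T10:00:00+01:00
250,USD,2024-01-01T06:30:00-05:00
"""
FX_TO_USD = {"EUR": 1.10, "USD": 1.0}  # assumed static rates

def extract(source):
    return pd.read_csv(source)

def transform(df):
    out = df.copy()
    out["amount_usd"] = out["amount"] * out["currency"].map(FX_TO_USD)
    out["ts_utc"] = pd.to_datetime(out["ts"], utc=True)
    return out[["amount_usd", "ts_utc"]]

def load(df, store):
    store["transactions"] = df  # stand-in for S3 / Snowflake
    return store

warehouse = load(transform(extract(io.StringIO(RAW_CSV))), {})
print(warehouse["transactions"])
```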
Training and Experiment Tracking
For training classification models, appropriate metrics are essential. Given the imbalanced nature of fraud data (fraud rates less than 1%), accuracy is not a good metric. Precision and recall are better, but they depend on selecting a probability threshold. A precision-recall curve and the area under it are used to compare model performance. Training resources must be scalable, often requiring large cloud machines or distributed training. Experiment tracking tools like MLflow help teams collaborate by automatically tracking parameters, models, and metrics from each experiment run.
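The point about accuracy versus PR-AUC can be demonstrated on synthetic data with a roughly 1% positive rate; the "model scores" below are fabricated for illustration.

```python
# Sketch: why PR-AUC beats accuracy on imbalanced fraud data.
# The labels and model scores are synthetic.
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% fraud rate
# A stand-in "model" whose scores are higher for fraud, plus noise.
scores = y_true * 2.0 + rng.normal(size=y_true.size)

# Predicting "never fraud" is ~99% accurate but catches nothing.
trivial_accuracy = (y_true == 0).mean()
# Area under the precision-recall curve needs no threshold choice.
pr_auc = average_precision_score(y_true, scores)
# An experiment tracker (e.g. MLflow's mlflow.log_metric) would
# record pr_auc, parameters, and the model artifact for each run.
print(f"trivial accuracy={trivial_accuracy:.3f}, PR-AUC={pr_auc:.3f}")
```

A random classifier's PR-AUC equals the fraud prevalence (~0.01 here), so any meaningfully higher value reflects real ranking power, whereas accuracy near 0.99 reflects nothing.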
Model Serving
Model serving involves packaging the trained model with its dependencies and weights into a deployable format like Docker. Considerations include model size, latency, and throughput. Depending on the model size, it can be packaged within the same node as the service or deployed at an endpoint accessed via an API. Online features (available at transaction time) and offline features (pre-computed historical data) must be accessed efficiently to meet latency requirements.
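The serving flow can be sketched as a single handler function of the kind that would sit behind an HTTP endpoint in a Docker-packaged service. The feature names, store layout, threshold, and stand-in model are all illustrative assumptions.

```python
# Sketch of a serving handler: combine online features from the
# request with offline features from a pre-computed store, score,
# and decide. Names and threshold are illustrative.
def predict_handler(request, predict_proba, offline_store, threshold=0.5):
    # Online features arrive with the transaction itself.
    online = [request["amount"], request["hour_of_day"]]
    # Offline features are looked up by key (e.g. buyer id), with a
    # default for buyers who have no history yet.
    offline = offline_store.get(request["buyer_id"], [0.0])
    p_fraud = predict_proba(online + offline)
    return {"decision": "decline" if p_fraud >= threshold else "approve",
            "p_fraud": p_fraud}

# Stand-ins for the sketch: a fake scorer and a tiny feature store.
def fake_model(feats):
    return min(1.0, feats[0] / 1000.0)

store = {"b1": [42.0]}
resp = predict_handler(
    {"amount": 900.0, "hour_of_day": 3, "buyer_id": "b1"},
    fake_model, store)
print(resp)
```

Keeping the offline lookup to a single keyed read is what makes the latency budget achievable; computing those aggregates at request time would not.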
Online and Offline Features
Online features are those available at the time of the transaction, such as amount and time of day. Offline features, like average spending in the last 10 days, are pre-computed and stored due to the computational cost. Online transformations, such as taking the log of the amount or calculating the distance between buyer and merchant, are performed on the fly.
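Both online transformations mentioned above are cheap enough to compute per request. A minimal sketch, using the standard haversine great-circle formula for the buyer-merchant distance:

```python
# Sketch of two on-the-fly ("online") transformations: log of the
# transaction amount, and buyer-merchant great-circle distance.
import math

def log_amount(amount):
    # log1p keeps zero-amount transactions well-defined.
    return math.log1p(amount)

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance on a sphere of Earth's mean radius.
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = (math.sin(dp / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

# Example coordinates (New York City to Los Angeles).
dist = haversine_km(40.7128, -74.0060, 34.0522, -118.2437)
print(f"distance: {dist:.0f} km")
```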
Performance Monitoring
Performance monitoring measures the model's performance on online data. This includes logging latency and throughput, and tracking chargebacks as fraud indicators. Since chargebacks are lagging indicators, rapid signals are needed, such as analyzing chargebacks within seven days. Data drift (changes in feature distributions) and concept drift (changes in how data indicates fraud) are tracked to determine when to retrain the model.
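Data-drift tracking on a single feature can be sketched with a population stability index (PSI), a common drift statistic (one of several reasonable choices; the talk does not name a specific one). The distributions below are synthetic.

```python
# Sketch of a PSI data-drift check: compare a feature's live
# distribution against its training baseline. A common rule of
# thumb flags PSI > 0.2 as significant drift.
import numpy as np

def psi(baseline, live, bins=10):
    # Quantile bin edges from the baseline; compare bucket shares.
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    base_p = np.histogram(baseline, edges)[0] / len(baseline)
    live_p = np.histogram(live, edges)[0] / len(live)
    base_p = np.clip(base_p, 1e-6, None)
    live_p = np.clip(live_p, 1e-6, None)
    return float(np.sum((live_p - base_p) * np.log(live_p / base_p)))

rng = np.random.default_rng(0)
train_amounts = rng.lognormal(3.0, 1.0, 50_000)   # training baseline
fresh_amounts = rng.lognormal(3.0, 1.0, 50_000)   # same distribution
drifted_amounts = rng.lognormal(3.5, 1.0, 50_000) # spending shifted up

psi_same = psi(train_amounts, fresh_amounts)
psi_drift = psi(train_amounts, drifted_amounts)
print(f"same-distribution PSI={psi_same:.3f}, drifted PSI={psi_drift:.3f}")
```

A sustained high PSI on key features is one of the signals that would trigger the retraining decision described above.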
Continuous Integration and Delivery (CI/CD)
Continuous integration and continuous delivery (CI/CD) practices are applied to ML systems to enable continuous updates with new data, features, and models. This includes data schema testing, feature definition monitoring, and data versioning. Model versioning allows for A/B testing and tapered activation. Autotraining pipelines can be set up to automatically retrain and deploy models based on performance metrics.
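The data schema testing mentioned above can be sketched as a validation gate that fails the pipeline early on bad input; the schema itself (columns, types, range checks) is an illustrative assumption.

```python
# Sketch of a data-schema test for a CI pipeline: reject rows that
# violate expected columns, types, or ranges. Schema is illustrative.
SCHEMA = {
    "amount": (float, lambda v: v >= 0),          # non-negative amount
    "currency": (str, lambda v: len(v) == 3),     # ISO 4217-style code
}

def validate_row(row, schema=SCHEMA):
    errors = []
    for col, (typ, check) in schema.items():
        if col not in row:
            errors.append(f"missing column: {col}")
        elif not isinstance(row[col], typ):
            errors.append(f"{col}: expected {typ.__name__}")
        elif not check(row[col]):
            errors.append(f"{col}: failed range/format check")
    return errors

print(validate_row({"amount": 12.5, "currency": "USD"}))   # clean row
print(validate_row({"amount": -1.0, "currency": "USDX"}))  # two errors
```

In a real pipeline this kind of check runs in CI on every data or feature-definition change, so schema breaks surface before a model is retrained or deployed.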
MLOps Tools
Various tools are used in MLOps for different purposes. Data ETL involves tools for data sources, feature engineering, and orchestration (e.g., Airflow). Experiment tracking uses tools like MLflow and Weights & Biases. Distributed training can be done with Amazon SageMaker or Ray. Model serving utilizes Docker and Kubernetes. Monitoring involves tools like Jupyter notebooks, Grafana, and specialized model monitoring platforms.
Key Takeaways
The three main takeaways are: ML models are a small part of an ML system, MLOps is the infrastructure and processes that turn prototype models into reliable production systems, and asking questions early about scale, latency, and throughput is crucial for success.
Q&A: Feature Integration and Model Comparison
In response to a question about feature integration, Gupta clarifies the difference between online and offline features. He also explains how a lack of data versioning prevents fair comparison between a previous model and a new one: if a feature used by the old model is dropped from the new dataset, the old model can no longer be evaluated on the same data, hindering performance comparisons.
Q&A: Best Model for Fraudulent Issues
Regarding the best model for fraud detection, Gupta notes that latency and throughput requirements favor simpler models like random forests and XGBoost. These models are smaller and faster than neural networks. While neural networks can eliminate feature engineering, decision tree models often perform well with tabular data.
Q&A: MLOps Engineer's Role
The self-driven nature of an MLOps engineer's role depends on the company. Larger companies may have established systems, while smaller companies require more system creation. MLOps engineers must stay updated on the latest tools and integrate them to benefit ML engineers.
Q&A: MLOps vs. Data Science
MLOps focuses on building ML systems, while data science involves extracting insights from data. Data scientists analyze data and create dashboards, while ML and MLOps engineers focus on machine learning components and building end-to-end systems.
Q&A: Model Accuracy and Latency
A model with high accuracy but long processing time may not be the best choice if latency requirements are strict. Sacrificing some performance for a model that meets latency requirements is common in industry.
Q&A: Models in Medical Images
Classification models can be applied to medical images, but decision tree models may not be suitable because tabular features cannot readily be engineered from raw pixels. Convolutional neural networks (CNNs) can extract features from medical images directly, often using pre-trained models.
Conclusion and Thank You
Nishant thanks Abined Gupta for the insightful session. Gupta expresses his appreciation for the engaged audience and encourages them to reach out with further questions. He wishes them success in their hackathon projects, urging them to consider latency, scale, throughput, and data collection.

