ML Model as a Service: Containerizing CNN Models for Production Deployment
A comprehensive guide to transforming machine learning models into production-ready REST APIs using containerization, Seldon Core, and Kubernetes-native deployment strategies.
The Production Gap in Data Science
The business value of data science is often measured by model accuracy, but the real value lies in taking models to production. At Gonnect, we've observed a critical disconnect: organizations invest heavily in model development while treating deployment as an afterthought. The truth is, it's not just about the model - it's about the entire ML pipeline.
A production-ready ML system encompasses:
- Data transformation
- Feature extraction and preprocessing
- The ML model itself
- Prediction transformation and post-processing
Each component must be deployable, scalable, and maintainable. This is where the traditional data science workflow falls short.
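To make these stages concrete, here is a minimal sketch of how they compose into a single serving pipeline. The class and helper names are illustrative only, not part of the project:

```python
import numpy as np

class PredictionPipeline:
    """Illustrative composition of the four stages (names are hypothetical)."""

    def __init__(self, model):
        self.model = model  # any object exposing predict(X)

    def transform(self, raw):
        # Data transformation: scale raw pixel values into [0, 1]
        return np.asarray(raw, dtype=np.float32) / 255.0

    def extract_features(self, x):
        # Feature extraction/preprocessing: reshape to the model's input tensor
        return x.reshape(-1, 28, 28, 1)

    def postprocess(self, probs):
        # Prediction transformation: map class probabilities to a label
        return int(np.argmax(probs, axis=1)[0])

    def predict(self, raw):
        return self.postprocess(
            self.model.predict(self.extract_features(self.transform(raw))))
```

Each stage is a deployment unit in its own right, which is exactly what the serving platform needs to manage.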
The Framework Fragmentation Challenge
Modern ML teams work with a diverse ecosystem of frameworks:
| Framework | Primary Use Case | Language |
|---|---|---|
| scikit-learn | Classical ML algorithms | Python |
| TensorFlow | Deep learning at scale | Python/C++ |
| Keras | High-level neural networks | Python |
| PyTorch | Research and production DL | Python |
| Spark ML | Distributed ML pipelines | Scala/Python |
The challenge is enabling data scientists to use their framework of choice while maintaining consistent deployment patterns. The solution? Containerization and language-agnostic serving.
Architecture Overview
The MachineLearningAsService project demonstrates a production-grade approach to serving CNN models as microservices. The architecture leverages Seldon Core for model serving and Source-to-Image (S2I) for containerization.
(Figure: MLOps pipeline)
The Seldon Core Advantage
Seldon Core is an open-source platform for deploying ML models on Kubernetes. Its key strength is zero-code microservice generation - you provide the model, Seldon handles the REST API, health checks, and Kubernetes integration.
How Seldon Works
- Model Wrapping: Seldon provides language-specific base images that know how to load and serve models
- S2I Build Process: Source-to-Image compiles your model into a production container
- Standardized API: All models expose the same prediction interface
- Kubernetes Native: Seldon deployments are first-class Kubernetes resources
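For the Python wrapper, an S2I-ready project typically looks like the layout below; the file names follow Seldon's Python examples, and your project may differ:

```
deepmnist/
├── DeepMnist.py      # model class exposing the Seldon predict() interface
├── model.h5          # trained weights loaded at startup
├── requirements.txt  # Python dependencies installed at build time
├── .s2i/
│   └── environment   # tells the S2I builder which class to serve
└── contract.json     # API contract used for testing
```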
CNN Model Serving: MNIST Example
The project demonstrates serving a Convolutional Neural Network trained on the MNIST handwritten digit dataset. While MNIST is a canonical example, the patterns apply to any CNN architecture.
Model Architecture
```python
# Typical CNN structure for MNIST classification
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

class DeepMnist:
    def __init__(self):
        self.model = self._build_model()
        self.model.load_weights('model.h5')

    def _build_model(self):
        model = Sequential([
            Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
            MaxPooling2D((2, 2)),
            Conv2D(64, (3, 3), activation='relu'),
            MaxPooling2D((2, 2)),
            Flatten(),
            Dense(64, activation='relu'),
            Dense(10, activation='softmax')
        ])
        return model

    def predict(self, X, features_names=None):
        """Seldon-compatible prediction method"""
        return self.model.predict(X)
```
The key insight is the predict method signature. Seldon expects this specific interface, enabling seamless integration without custom API code.
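Before containerizing, it is worth sanity-checking this interface locally. A minimal check, assuming the weights file is present, might look like:

```python
import numpy as np

# Local smoke test of the Seldon-compatible interface
model = DeepMnist()
probs = model.predict(np.zeros((1, 28, 28, 1), dtype=np.float32))
print(probs.shape)  # expected: (1, 10) class probabilities
```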
API Contract Definition
Seldon uses contracts to define the expected input/output format:
```json
{
  "features": [
    {
      "name": "image",
      "dtype": "FLOAT",
      "ftype": "continuous",
      "range": [0, 1],
      "shape": [1, 28, 28, 1]
    }
  ],
  "targets": [
    {
      "name": "digit",
      "dtype": "INT",
      "ftype": "categorical",
      "range": [0, 9]
    }
  ]
}
```
This contract enables automatic input validation, documentation generation, and client SDK creation.
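As an illustration of the kind of contract-driven validation this makes possible, consider the helper below; it is hypothetical, not part of Seldon:

```python
import json
import numpy as np

def validate_input(payload, contract_path="contract.json"):
    """Hypothetical helper: check a request against the contract's first feature."""
    with open(contract_path) as f:
        feature = json.load(f)["features"][0]
    arr = np.asarray(payload, dtype=np.float32)
    if list(arr.shape) != feature["shape"]:
        raise ValueError(f"expected shape {feature['shape']}, got {list(arr.shape)}")
    lo, hi = feature["range"]
    if arr.min() < lo or arr.max() > hi:
        raise ValueError(f"values must lie in [{lo}, {hi}]")
```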
Containerization Workflow
The containerization process transforms a Python model into a production-ready Docker image in a single command.
(Figure: MLOps pipeline)
Build Command
```bash
# Build the model container using Seldon's S2I builder
s2i build . seldonio/seldon-core-s2i-python36:0.4 deepmnist:0.1
```
This single command:
- Pulls the Seldon Python 3.6 base image
- Copies your model code into the container
- Installs Python dependencies from `requirements.txt`
- Configures the Seldon prediction server
- Tags the resulting image
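The builder discovers the model class through the small `.s2i/environment` file in the project root; for this example it would contain:

```
MODEL_NAME=DeepMnist
API_TYPE=REST
SERVICE_TYPE=MODEL
PERSISTENCE=0
```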
Running the Service
```bash
# Start the prediction service
docker run --name "mnist_prediction_service" --rm -p 5000:5000 deepmnist:0.1
```
The service immediately exposes:
- `POST /predict` - Model inference endpoint
- `GET /health` - Liveness check
- Prometheus metrics for monitoring
Testing the Deployment
Seldon provides a built-in testing utility that validates the API against the contract:
```bash
# Test the prediction endpoint using the contract
seldon-core-tester contract.json 0.0.0.0 5000 -p
```
This generates sample inputs matching the contract schema and verifies the response format.
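You can also exercise the endpoint directly. A sketch using `requests` follows; note that older versions of the Seldon Python wrapper expect the payload as a `json` form field rather than a raw JSON body:

```python
import numpy as np
import requests

payload = {"data": {"ndarray": np.zeros((1, 28, 28, 1)).tolist()}}

# Raw JSON body; with older wrappers use data={"json": json.dumps(payload)}
response = requests.post("http://0.0.0.0:5000/predict", json=payload)
print(response.json())
```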
API Design Patterns
The REST API exposed by Seldon follows cloud-native best practices:
Prediction Request
`POST /predict`

```json
{
  "data": {
    "ndarray": [[[[0.0, 0.1, 0.2, ...]]]]
  }
}
```
Prediction Response
```json
{
  "data": {
    "ndarray": [[0.01, 0.02, 0.85, 0.01, 0.02, 0.03, 0.02, 0.02, 0.01, 0.01]]
  },
  "meta": {
    "puid": "unique-prediction-id",
    "requestPath": {"deepmnist": "deepmnist:0.1"}
  }
}
```
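Given this response shape, a client can recover the predicted digit and the trace id with a few lines (a sketch, not project code):

```python
import numpy as np

def parse_prediction(resp: dict):
    """Extract the argmax class and the prediction trace id (puid)."""
    probs = np.asarray(resp["data"]["ndarray"][0])
    return int(np.argmax(probs)), resp["meta"]["puid"]
```

For the response above, this returns `(2, 'unique-prediction-id')` - the model is 85% confident the image is a 2.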
Key API Characteristics
- Stateless: Each request is independent, enabling horizontal scaling
- Idempotent: Same input always produces same output
- Traceable: Every prediction includes a unique ID for debugging
- Versioned: Model version is embedded in response metadata
Kubernetes Deployment
Once containerized, the model can be deployed to any Kubernetes cluster:
(Figure: microservices architecture)
Seldon Deployment Manifest
```yaml
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: mnist-model
spec:
  predictors:
  - name: default
    replicas: 3
    graph:
      name: deepmnist
      type: MODEL
    componentSpecs:
    - spec:
        containers:
        - name: deepmnist
          image: deepmnist:0.1
          resources:
            requests:
              memory: "256Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
```

Because the model is baked into the custom `deepmnist:0.1` image, the graph node simply references that container; the prepackaged-server fields (`implementation`, `modelUri`) are only needed when Seldon should load a model from object storage instead.
Language and Framework Agnosticism
A critical advantage of this approach is language independence. The same containerization pattern works for:
- Python: Keras, TensorFlow, PyTorch, scikit-learn
- Java/Kotlin: DL4J, Tribuo
- R: tidymodels, caret
- Scala: Spark ML
Data scientists can develop in their language of choice, and the deployment pipeline remains consistent.
Production Considerations
Resource Management
```yaml
resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "1Gi"
    cpu: "1000m"
    nvidia.com/gpu: 1  # Optional GPU support; extended resources are set as limits
```
Health Checks
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 5
readinessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 5
  periodSeconds: 3
```
Scaling Policies
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mnist-hpa
spec:
  scaleTargetRef:
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    name: mnist-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
Beyond MNIST: Real-World Applications
While this project uses MNIST for demonstration, the patterns apply directly to production use cases:
| Use Case | Model Type | Serving Requirements |
|---|---|---|
| Image Classification | CNN (ResNet, VGG) | GPU acceleration, batch processing |
| Object Detection | YOLO, SSD | Low latency, high throughput |
| NLP Classification | BERT, Transformers | Memory-intensive, CPU or GPU |
| Recommendation | Matrix Factorization | High concurrency, caching |
| Fraud Detection | Ensemble Models | Real-time scoring, audit logging |
The Philosophy: Empowering Data Scientists
The architecture mantra behind this approach is simple:
"Empower data scientists and engineers while using data science to add business value by providing agility with velocity and maximum utilization of their talent pool."
By abstracting away the infrastructure complexity:
- Data scientists focus on model quality, not deployment scripts
- ML engineers focus on platform reliability, not framework-specific quirks
- Organizations achieve consistent deployment patterns across diverse models
Prerequisites and Setup
To get started with this approach:
```bash
# Install Seldon Core CLI
pip install seldon-core

# Install Keras
pip install keras tensorflow

# Install S2I (macOS)
brew install source-to-image

# Verify Python version
python3 --version  # Requires Python 3.6+
```
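To confirm the toolchain is in place before building:

```bash
s2i version           # S2I CLI is on the PATH
pip show seldon-core  # Seldon tooling is installed
docker info           # Docker daemon is running
```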
Conclusion
The million-dollar question in data science has always been: "How do we take ML models to production in a repeatable, predictable manner while exposing them as services?"
The MachineLearningAsService project answers this through:
- Containerization: Models become portable, versioned artifacts
- Standardized APIs: Consistent interfaces regardless of framework
- Kubernetes-Native Deployment: Enterprise-grade scaling and reliability
- Framework Agnosticism: Freedom to choose the right tool for the job
Beyond the model class itself, not a single line of serving code is written to create the prediction microservice - the infrastructure handles everything else. This is the essence of modern ML productionization: enabling data scientists to focus on what they do best while the platform handles operational concerns.
The approach is cloud-provider agnostic, scales with Kubernetes, and provides a clear path from experimentation to production. For organizations seeking to bridge the gap between data science and production systems, containerized ML services represent the most practical and scalable solution available today.
This project is open source and available at github.com/mgorav/MachineLearningAsService. Explore the implementation and adapt these patterns for your ML deployment needs.