ML Model as a Service: Containerizing CNN Models for Production Deployment

A comprehensive guide to transforming machine learning models into production-ready REST APIs using containerization, Seldon Core, and Kubernetes-native deployment strategies.

Gaurav Malhotra
January 16, 2024 · 10 min read

Python · CNN · REST API · Docker

The Production Gap in Data Science

The business value of data science is often measured by model accuracy, but the real value lies in taking models to production. At Gonnect, we've observed a critical disconnect: organizations invest heavily in model development while treating deployment as an afterthought. The truth is, it's not just about the model - it's about the entire ML pipeline.

A production-ready ML system encompasses:

  • Data transformation
  • Feature extraction and preprocessing
  • The ML model itself
  • Prediction transformation and post-processing

Each component must be deployable, scalable, and maintainable. This is where the traditional data science workflow falls short.

The Framework Fragmentation Challenge

Modern ML teams work with a diverse ecosystem of frameworks:

Framework       Primary Use Case              Language
scikit-learn    Classical ML algorithms       Python
TensorFlow      Deep learning at scale        Python/C++
Keras           High-level neural networks    Python
PyTorch         Research and production DL    Python
Spark ML        Distributed ML pipelines      Scala/Python

The challenge is enabling data scientists to use their framework of choice while maintaining consistent deployment patterns. The solution? Containerization and language-agnostic serving.

Architecture Overview

The MachineLearningAsService project demonstrates a production-grade approach to serving CNN models as microservices. The architecture leverages Seldon Core for model serving and Source-to-Image (S2I) for containerization.

[Diagram: MLOps pipeline]

The Seldon Core Advantage

Seldon Core is an open-source platform for deploying ML models on Kubernetes. Its key strength is zero-code microservice generation - you provide the model, Seldon handles the REST API, health checks, and Kubernetes integration.

How Seldon Works

  1. Model Wrapping: Seldon provides language-specific base images that know how to load and serve models
  2. S2I Build Process: Source-to-Image compiles your model into a production container, driven by a small environment file (sketched after this list)
  3. Standardized API: All models expose the same prediction interface
  4. Kubernetes Native: Seldon deployments are first-class Kubernetes resources
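
Under the hood, the S2I builder reads that small environment file from the repository. A minimal sketch, assuming the model class is named DeepMnist as in the MNIST example below:

# .s2i/environment - tells the Seldon wrapper what to load and serve
MODEL_NAME=DeepMnist
API_TYPE=REST
SERVICE_TYPE=MODEL
PERSISTENCE=0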

CNN Model Serving: MNIST Example

The project demonstrates serving a Convolutional Neural Network trained on the MNIST handwritten digit dataset. While MNIST is a canonical example, the patterns apply to any CNN architecture.

Model Architecture

# Typical CNN structure for MNIST classification
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

class DeepMnist:
    def __init__(self):
        # Build the architecture and load pre-trained weights
        self.model = self._build_model()
        self.model.load_weights('model.h5')

    def _build_model(self):
        model = Sequential([
            Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
            MaxPooling2D((2, 2)),
            Conv2D(64, (3, 3), activation='relu'),
            MaxPooling2D((2, 2)),
            Flatten(),
            Dense(64, activation='relu'),
            Dense(10, activation='softmax')
        ])
        return model

    def predict(self, X, features_names=None):
        """Seldon-compatible prediction method"""
        return self.model.predict(X)

The key insight is the predict method signature. Seldon expects this specific interface, enabling seamless integration without custom API code.
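
Before building a container, the interface can be sanity-checked locally. A minimal sketch, where a random array stands in for a real digit image and model.h5 is assumed to exist:

# Local sanity check of the Seldon-compatible interface (illustrative input)
import numpy as np

model = DeepMnist()                                  # loads model.h5
X = np.random.rand(1, 28, 28, 1).astype('float32')   # stand-in for a real digit
probs = model.predict(X, features_names=['image'])
print(probs.shape)  # (1, 10): one softmax distribution over the ten digits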

API Contract Definition

Seldon uses contracts to define the expected input/output format:

{
    "features": [
        {
            "name": "image",
            "dtype": "FLOAT",
            "ftype": "continuous",
            "range": [0, 1],
            "shape": [1, 28, 28, 1]
        }
    ],
    "targets": [
        {
            "name": "digit",
            "dtype": "INT",
            "ftype": "categorical",
            "range": [0, 9]
        }
    ]
}

This contract enables automatic input validation, documentation generation, and client SDK creation.
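
To make "automatic input validation" concrete, here is a hypothetical helper (not part of Seldon itself) that checks an input array against the image feature spec from contract.json:

# Hypothetical validation helper illustrating what the contract encodes
import json
import numpy as np

with open('contract.json') as f:
    spec = json.load(f)['features'][0]

def validate(X):
    assert list(X.shape) == spec['shape'], f"expected shape {spec['shape']}"
    lo, hi = spec['range']
    assert X.min() >= lo and X.max() <= hi, f"values must lie in [{lo}, {hi}]"

validate(np.random.rand(1, 28, 28, 1))  # passes for MNIST-style input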

Containerization Workflow

The containerization process transforms a Python model into a production-ready Docker image in a single command.

[Diagram: containerization workflow]

Build Command

# Build the model container using Seldon's S2I builder
s2i build . seldonio/seldon-core-s2i-python36:0.4 deepmnist:0.1

This single command:

  • Pulls the Seldon Python 3.6 base image
  • Copies your model code into the container
  • Installs Python dependencies from requirements.txt (a minimal example follows this list)
  • Configures the Seldon prediction server
  • Tags the resulting image
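
The requirements.txt can be as small as the sketch below; the package set is assumed from the Keras example above, and versions should be pinned in a real build:

# requirements.txt
keras
tensorflow
numpy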

Running the Service

# Start the prediction service
docker run --name "mnist_prediction_service" --rm -p 5000:5000 deepmnist:0.1

The service immediately exposes:

  • POST /predict - Model inference endpoint
  • GET /health - Liveness check
  • Prometheus metrics for monitoring
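
With the container running, a smoke test takes a few lines of Python. A sketch, assuming the wrapper accepts a JSON body on /predict (older wrapper versions may expect a json= form field instead):

# Smoke-test the local prediction service with an illustrative input
import numpy as np
import requests

payload = {'data': {'ndarray': np.random.rand(1, 28, 28, 1).tolist()}}
resp = requests.post('http://localhost:5000/predict', json=payload)
resp.raise_for_status()
print(resp.json()['data']['ndarray'])  # ten class probabilities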

Testing the Deployment

Seldon provides a built-in testing utility that validates the API against the contract:

# Test the prediction endpoint using the contract
seldon-core-tester contract.json 0.0.0.0 5000 -p

This generates sample inputs matching the contract schema and verifies the response format.

API Design Patterns

The REST API exposed by Seldon follows cloud-native best practices:

Prediction Request

POST /predict
{
    "data": {
        "ndarray": [[[[0.0, 0.1, 0.2, ...]]]]
    }
}

Prediction Response

{
    "data": {
        "ndarray": [[0.01, 0.02, 0.85, 0.01, 0.02, 0.03, 0.02, 0.02, 0.01, 0.01]]
    },
    "meta": {
        "puid": "unique-prediction-id",
        "requestPath": {"deepmnist": "deepmnist:0.1"}
    }
}
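
Client code typically reduces the softmax vector to a label and retains the puid for tracing. A short sketch using the response shown above:

# Decode a Seldon prediction response (values from the example above)
import numpy as np

response = {
    'data': {'ndarray': [[0.01, 0.02, 0.85, 0.01, 0.02, 0.03, 0.02, 0.02, 0.01, 0.01]]},
    'meta': {'puid': 'unique-prediction-id'},
}
probs = np.array(response['data']['ndarray'][0])
digit = int(probs.argmax())             # 2, predicted with 85% confidence
print(digit, response['meta']['puid'])  # log the puid for traceability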

Key API Characteristics

  • Stateless: Each request is independent, enabling horizontal scaling
  • Idempotent: Same input always produces same output
  • Traceable: Every prediction includes a unique ID for debugging
  • Versioned: Model version is embedded in response metadata

Kubernetes Deployment

Once containerized, the model can be deployed to any Kubernetes cluster:

[Diagram: microservices architecture]

Seldon Deployment Manifest

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: mnist-model
spec:
  predictors:
    - name: default
      replicas: 3
      graph:
        name: deepmnist
        type: MODEL  # the custom image below supplies the server; no prebuilt implementation or modelUri is needed
      componentSpecs:
        - spec:
            containers:
              - name: deepmnist
                image: deepmnist:0.1
                resources:
                  requests:
                    memory: "256Mi"
                    cpu: "100m"
                  limits:
                    memory: "512Mi"
                    cpu: "500m"

Language and Framework Agnosticism

A critical advantage of this approach is language independence. The same containerization pattern works for:

  • Python: Keras, TensorFlow, PyTorch, scikit-learn
  • Java/Kotlin: DL4J, Tribuo
  • R: tidymodels, caret
  • Scala: Spark ML

Data scientists can develop in their language of choice, and the deployment pipeline remains consistent.

Production Considerations

Resource Management

resources:
  requests:
    memory: "256Mi"
    cpu: "100m"
  limits:
    memory: "1Gi"
    cpu: "1000m"
    nvidia.com/gpu: 1  # Optional GPU support; Kubernetes requires GPUs to be set via limits

Health Checks

livenessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 10
  periodSeconds: 5

readinessProbe:
  httpGet:
    path: /health
    port: 5000
  initialDelaySeconds: 5
  periodSeconds: 3

Scaling Policies

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: mnist-hpa
spec:
  scaleTargetRef:
    apiVersion: machinelearning.seldon.io/v1
    kind: SeldonDeployment
    name: mnist-model
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Beyond MNIST: Real-World Applications

While this project uses MNIST for demonstration, the patterns apply directly to production use cases:

Use Case                Model Type              Serving Requirements
Image Classification    CNN (ResNet, VGG)       GPU acceleration, batch processing
Object Detection        YOLO, SSD               Low latency, high throughput
NLP Classification      BERT, Transformers      Memory-intensive, CPU or GPU
Recommendation          Matrix Factorization    High concurrency, caching
Fraud Detection         Ensemble Models         Real-time scoring, audit logging

The Philosophy: Empowering Data Scientists

The architecture mantra behind this approach is simple:

"Empower data scientists and engineers while using data science to add business value by providing agility with velocity and maximum utilization of their talent pool."

By abstracting away the infrastructure complexity:

  • Data scientists focus on model quality, not deployment scripts
  • ML engineers focus on platform reliability, not framework-specific quirks
  • Organizations achieve consistent deployment patterns across diverse models

Prerequisites and Setup

To get started with this approach:

# Install Seldon Core CLI
pip install seldon-core

# Install Keras
pip install keras tensorflow

# Install S2I (macOS)
brew install source-to-image

# Verify Python version
python3 --version  # Requires Python 3.6+

Conclusion

The million-dollar question in data science has always been: "How do we take ML models to production in a repeatable, predictable manner while exposing them as services?"

The MachineLearningAsService project answers this through:

  1. Containerization: Models become portable, versioned artifacts
  2. Standardized APIs: Consistent interfaces regardless of framework
  3. Kubernetes-Native Deployment: Enterprise-grade scaling and reliability
  4. Framework Agnosticism: Freedom to choose the right tool for the job

Not a single line of API or serving code is written to create the prediction microservice - the infrastructure handles everything. This is the essence of modern ML productionization: enabling data scientists to focus on what they do best while the platform handles operational concerns.

The approach is cloud-provider agnostic, scales with Kubernetes, and provides a clear path from experimentation to production. For organizations seeking to bridge the gap between data science and production systems, containerized ML services represent the most practical and scalable solution available today.


This project is open source and available at github.com/mgorav/MachineLearningAsService. Explore the implementation and adapt these patterns for your ML deployment needs.