ML Model Serving as REST API Using Clipper
A comprehensive guide to deploying machine learning models as scalable REST APIs using Clipper, the low-latency prediction serving system designed for production ML workloads.
The Model Serving Challenge
Building accurate machine learning models is only half the battle. The real challenge lies in deploying these models to production where they can serve predictions at scale, with low latency, and high reliability. At Gonnect, we've seen countless organizations struggle with this transition - their models perform brilliantly in notebooks but falter when exposed to real-world traffic.
Clipper addresses this fundamental challenge by providing a prediction serving system that sits between your applications and ML models, managing the complexity of model deployment, versioning, and scaling.
What is Clipper?
Clipper is a low-latency prediction serving system developed at UC Berkeley's RISE Lab. It provides a general-purpose platform that:
- Exposes ML models as REST APIs without custom server code
- Supports multiple ML frameworks (scikit-learn, TensorFlow, PyTorch, XGBoost)
- Enables online model updates and A/B testing
- Implements intelligent caching and batching for performance
- Provides fault tolerance and model versioning
The key philosophy behind Clipper is framework agnosticism - data scientists can use any ML library they prefer, and Clipper handles the serving infrastructure.
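In practice, this agnosticism comes down to a simple contract: a deployed model is just a Python callable that takes a batch of inputs and returns a list of string outputs. The sketch below illustrates that shape with a hand-written rule standing in for a real model (the threshold and feature layout are purely illustrative, not part of Clipper's API):

```python
# Clipper deploys a plain Python closure: batch of inputs in, list of strings out.
# Any ML framework can hide behind this signature; here a trivial rule stands in
# for a trained model (the 2.0-years-of-experience threshold is made up).

def make_predict_fn(threshold=2.0):
    """Return a Clipper-style predict function closing over 'model' state."""
    def predict(inputs):
        # inputs: list of feature vectors; element 0 is years of experience here
        return ["1" if features[0] >= threshold else "0" for features in inputs]
    return predict

predict_fn = make_predict_fn()
print(predict_fn([[5.0, 3.0], [1.0, 2.0]]))  # → ['1', '0']
```

Because the serving layer only sees this callable, swapping scikit-learn for TensorFlow or PyTorch changes nothing on the infrastructure side.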
Architecture Overview
Clipper's architecture separates concerns into distinct layers:
┌─────────────────────────────────────┐
│ Application Layer │
│ (REST API / gRPC Clients) │
└─────────────────┬───────────────────┘
│
┌─────────────────▼───────────────────┐
│ Clipper Frontend │
│ - Request routing │
│ - Caching │
│ - Batching │
│ - Model selection │
└─────────────────┬───────────────────┘
│
┌────────────────────────────┼────────────────────────────┐
│ │ │
┌────────▼────────┐ ┌─────────▼─────────┐ ┌────────▼────────┐
│ Model Container │ │ Model Container │ │ Model Container │
│ (scikit-learn) │ │ (TensorFlow) │ │ (PyTorch) │
└─────────────────┘ └───────────────────┘ └─────────────────┘
Key Components
- Query Frontend: Receives prediction requests via REST API
- Model Containers: Docker containers running individual models
- Clipper Manager: Handles model deployment and lifecycle management
- Selection Policy: Routes requests to appropriate model versions
HR Hiring Prediction: A Practical Example
The Clipper project demonstrates model serving with a practical HR use case - predicting hiring decisions based on candidate attributes. This involves:
- A Decision Tree classifier trained on HR data
- REST API endpoints for real-time predictions
- Docker containerization for deployment
The HR Dataset
import pandas as pd
# Load HR training data
hr_data = pd.read_csv('HR.csv')
# Features typically include:
# - Years of experience
# - Education level
# - Technical skills assessment
# - Interview scores
# - Previous employment history
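The HR.csv file itself isn't bundled with this guide. To experiment locally, you can synthesize a stand-in with the same shape; the column names, value ranges, and labeling rule below are our assumptions based on the feature list above, not the real dataset's schema:

```python
import csv
import random

# Generate a synthetic stand-in for HR.csv (hypothetical schema; the real
# dataset's columns and distributions may differ).
COLUMNS = ["years_experience", "education_level", "skill_score",
           "interview_rating", "previous_jobs", "hired"]

def make_synthetic_hr_csv(path="HR.csv", n_rows=200, seed=42):
    """Write n_rows of plausible-looking candidate records to a CSV file."""
    rng = random.Random(seed)
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(COLUMNS)
        for _ in range(n_rows):
            exp = round(rng.uniform(0, 15), 1)
            edu = rng.randint(1, 4)
            skill = round(rng.uniform(40, 100), 1)
            interview = round(rng.uniform(1, 5), 1)
            jobs = rng.randint(0, 6)
            # Toy labeling rule so the classes are learnable; not real HR logic
            hired = int(skill > 70 and interview > 3)
            writer.writerow([exp, edu, skill, interview, jobs, hired])
    return path
```

With this file in place, the `pd.read_csv('HR.csv')` call above works unchanged.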
Training the Decision Tree Model
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import pickle
class HRDecisionTree:
    """HR Hiring Decision Tree Classifier"""

    def __init__(self):
        self.model = DecisionTreeClassifier(
            max_depth=10,
            min_samples_split=5,
            min_samples_leaf=2,
            random_state=42
        )

    def train(self, X, y):
        """Train the model on HR data"""
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42
        )
        self.model.fit(X_train, y_train)
        # Evaluate accuracy on the held-out split
        accuracy = self.model.score(X_test, y_test)
        print(f"Model accuracy: {accuracy:.4f}")
        return self

    def predict(self, features):
        """Make hiring predictions"""
        return self.model.predict(features)

    def save(self, path='hr_model.pkl'):
        """Serialize model for deployment"""
        with open(path, 'wb') as f:
            pickle.dump(self.model, f)
Connecting to Clipper
The ClipperConnection class manages the interaction between your application and the Clipper serving system:
from clipper_admin import ClipperConnection, DockerContainerManager
class ClipperModelServer:
    """Manages Clipper model deployment and serving"""

    def __init__(self, host='localhost'):
        self.clipper_conn = ClipperConnection(
            DockerContainerManager()
        )
        self.host = host

    def start_clipper(self):
        """Initialize Clipper cluster"""
        self.clipper_conn.start_clipper()
        print("Clipper started successfully")
        print(f"Query endpoint: http://{self.host}:1337")
        print(f"Management endpoint: http://{self.host}:1338")

    def register_application(self, name, input_type, default_output, slo_micros):
        """Register a new application with Clipper"""
        self.clipper_conn.register_application(
            name=name,
            input_type=input_type,
            default_output=default_output,
            slo_micros=slo_micros  # Service Level Objective in microseconds
        )
        print(f"Application '{name}' registered")

    def deploy_model(self, name, version, input_type, func, pkgs_to_install):
        """Deploy a Python model to Clipper"""
        from clipper_admin.deployers import python as python_deployer
        python_deployer.deploy_python_closure(
            self.clipper_conn,
            name=name,
            version=version,
            input_type=input_type,
            func=func,
            pkgs_to_install=pkgs_to_install
        )
        print(f"Model '{name}' version {version} deployed")

    def link_model_to_app(self, app_name, model_name):
        """Connect a model to an application"""
        self.clipper_conn.link_model_to_app(
            app_name=app_name,
            model_name=model_name
        )
        print(f"Model '{model_name}' linked to app '{app_name}'")
Deploying the HR Model
Here's a complete workflow for deploying the HR hiring prediction model:
import pickle
import numpy as np
# Initialize Clipper connection
server = ClipperModelServer()
server.start_clipper()

# Register the HR application
server.register_application(
    name='hr-hiring',
    input_type='doubles',
    default_output='-1.0',
    slo_micros=100000  # 100ms SLO
)

# Load the trained model
with open('hr_model.pkl', 'rb') as f:
    hr_model = pickle.load(f)

# Define the prediction function
def predict_hiring(inputs):
    """
    Clipper-compatible prediction function

    Args:
        inputs: List of feature arrays
    Returns:
        List of predictions as strings
    """
    predictions = hr_model.predict(inputs)
    return [str(pred) for pred in predictions]

# Deploy the model
server.deploy_model(
    name='hr-decision-tree',
    version='1',
    input_type='doubles',
    func=predict_hiring,
    pkgs_to_install=['scikit-learn']
)

# Link model to application
server.link_model_to_app('hr-hiring', 'hr-decision-tree')
print("HR Hiring model deployed and ready for predictions!")
REST API Usage
Once deployed, the model is accessible via REST API:
Making Predictions
# Predict hiring decision for a candidate
curl -X POST http://localhost:1337/hr-hiring/predict \
  -H "Content-Type: application/json" \
  -d '{"input": [5.0, 3.0, 85.0, 4.5, 2.0]}'
Response Format
{
  "query_id": 1,
  "output": "1",
  "default": false
}
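When the SLO cannot be met or the model container is unavailable, Clipper returns the application's registered default output and sets `"default": true`, so clients should check that flag before trusting a prediction. A small stdlib-only helper for this (the function name is ours, not part of any Clipper client library):

```python
import json

def parse_prediction(response_text):
    """Parse a Clipper /predict response body, distinguishing real model
    predictions from fallbacks where Clipper returned the default output."""
    body = json.loads(response_text)
    if body.get("default"):
        # Fallback: this is the app's default_output, not a model prediction
        return {"prediction": None, "fallback": body["output"]}
    return {"prediction": body["output"], "fallback": None}

print(parse_prediction('{"query_id": 1, "output": "1", "default": false}'))
# → {'prediction': '1', 'fallback': None}
```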
Python Client Example
import requests
import json
class HRPredictionClient:
    """Client for HR hiring predictions via Clipper"""

    def __init__(self, host='localhost', port=1337):
        self.base_url = f"http://{host}:{port}"

    def predict(self, candidate_features):
        """
        Predict hiring decision for a candidate

        Args:
            candidate_features: List of numeric features
                - years_experience
                - education_level
                - skill_score
                - interview_rating
                - previous_jobs
        Returns:
            dict: Prediction result
        """
        url = f"{self.base_url}/hr-hiring/predict"
        payload = {"input": candidate_features}
        response = requests.post(
            url,
            headers={"Content-Type": "application/json"},
            data=json.dumps(payload)
        )
        return response.json()

    def batch_predict(self, candidates):
        """Predict hiring decisions for multiple candidates"""
        results = []
        for candidate in candidates:
            result = self.predict(candidate)
            results.append(result)
        return results

# Usage
client = HRPredictionClient()

# Single prediction
result = client.predict([5.0, 3.0, 85.0, 4.5, 2.0])
print(f"Hiring decision: {'Hire' if result['output'] == '1' else 'No Hire'}")

# Batch predictions
candidates = [
    [5.0, 3.0, 85.0, 4.5, 2.0],
    [2.0, 2.0, 70.0, 3.0, 1.0],
    [8.0, 4.0, 95.0, 5.0, 3.0]
]
batch_results = client.batch_predict(candidates)
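Note that batch_predict issues one HTTP request per candidate, sequentially, so latency adds up linearly. Since each call is independent, a thread pool can overlap them. A sketch with a stubbed predict function standing in for the real HTTP call (the stub's scoring rule is invented for illustration):

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_batch_predict(predict_fn, candidates, max_workers=8):
    """Run independent prediction calls concurrently, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the inputs
        return list(pool.map(predict_fn, candidates))

# Stand-in for client.predict (a real call would POST to Clipper)
def fake_predict(features):
    return {"output": "1" if features[2] >= 80 else "0", "default": False}

candidates = [
    [5.0, 3.0, 85.0, 4.5, 2.0],
    [2.0, 2.0, 70.0, 3.0, 1.0],
    [8.0, 4.0, 95.0, 5.0, 3.0],
]
results = parallel_batch_predict(fake_predict, candidates)
print([r["output"] for r in results])  # → ['1', '0', '1']
```

Threads suit this workload because the calls are I/O-bound; for very large batches, Clipper's own server-side batching (covered below) is the bigger win.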
Docker Containerization
Clipper uses Docker containers to isolate and scale model serving:
Model Container Dockerfile
FROM python:3.8-slim

# Install dependencies
RUN pip install --no-cache-dir \
    scikit-learn \
    numpy \
    pandas \
    clipper_admin

# Copy model files
COPY hr_model.pkl /app/
COPY hr_prediction_service.py /app/

WORKDIR /app

# Expose Clipper's default port
EXPOSE 1337

CMD ["python", "hr_prediction_service.py"]
Docker Compose for Full Stack
version: '3.8'

services:
  clipper-query-frontend:
    image: clipper/query_frontend:latest
    ports:
      - "1337:1337"
    networks:
      - clipper-network
    environment:
      - CLIPPER_MANAGEMENT_PORT=1338

  clipper-mgmt-frontend:
    image: clipper/management_frontend:latest
    ports:
      - "1338:1338"
    networks:
      - clipper-network

  redis:
    image: redis:6-alpine
    ports:
      - "6379:6379"
    networks:
      - clipper-network

  hr-model:
    build: ./hr_model
    networks:
      - clipper-network
    depends_on:
      - clipper-query-frontend
      - redis

networks:
  clipper-network:
    driver: bridge
Model Versioning and Updates
Clipper supports seamless model updates without downtime:
def update_model(server, new_model_path):
    """Deploy a new version of the HR model"""
    # Load improved model
    with open(new_model_path, 'rb') as f:
        improved_model = pickle.load(f)

    def improved_predict(inputs):
        predictions = improved_model.predict(inputs)
        return [str(pred) for pred in predictions]

    # Deploy new version (version 2)
    server.deploy_model(
        name='hr-decision-tree',
        version='2',
        input_type='doubles',
        func=improved_predict,
        pkgs_to_install=['scikit-learn']
    )
    # Traffic automatically shifts to the new version
    print("Model updated to version 2")
A/B Testing Multiple Models
# Deploy two model variants for comparison
def setup_ab_test(server):
    """Configure A/B testing between model variants"""
    # Variant A: Decision Tree
    server.deploy_model(
        name='hr-model-dt',
        version='1',
        input_type='doubles',
        func=decision_tree_predict,
        pkgs_to_install=['scikit-learn']
    )
    # Variant B: Random Forest
    server.deploy_model(
        name='hr-model-rf',
        version='1',
        input_type='doubles',
        func=random_forest_predict,
        pkgs_to_install=['scikit-learn']
    )
    # Clipper can route traffic between models
    # based on configurable selection policies
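If you prefer to control the split yourself rather than rely on a server-side selection policy, a deterministic client-side bucketing scheme is a common alternative: hash a stable candidate ID and route a fixed fraction of traffic to each variant. A sketch (the 50/50 split, function name, and parameters are ours, not a Clipper feature):

```python
import hashlib

def choose_model(candidate_id, treatment_fraction=0.5,
                 control="hr-model-dt", treatment="hr-model-rf"):
    """Deterministically assign a candidate to a model variant.
    The same ID always lands in the same bucket, which keeps
    per-candidate behavior stable across repeated requests."""
    digest = hashlib.sha256(candidate_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return treatment if bucket < treatment_fraction else control

print(choose_model("candidate-42"))
```

The client would then POST to whichever application endpoint the chosen model is linked to, and log the assignment alongside the outcome for later comparison.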
Performance Optimization
Clipper includes several built-in optimizations:
Adaptive Batching
# Clipper automatically batches requests for efficiency
# Configure cache and replica parameters when starting Clipper
clipper_conn.start_clipper(
    cache_size=1000,          # LRU cache size
    num_frontend_replicas=2,  # Scale query frontend
)
Caching Configuration
# Enable prediction caching for repeated queries
clipper_conn.register_application(
    name='hr-hiring',
    input_type='doubles',
    default_output='-1.0',
    slo_micros=100000,
    cache=True  # Enable caching
)
Latency Monitoring
import time
def benchmark_predictions(client, num_requests=100):
    """Measure prediction latency"""
    test_input = [5.0, 3.0, 85.0, 4.5, 2.0]
    latencies = []
    for _ in range(num_requests):
        start = time.time()
        client.predict(test_input)
        latency = (time.time() - start) * 1000  # ms
        latencies.append(latency)

    avg_latency = sum(latencies) / len(latencies)
    p99_latency = sorted(latencies)[int(0.99 * len(latencies))]
    print(f"Average latency: {avg_latency:.2f}ms")
    print(f"P99 latency: {p99_latency:.2f}ms")
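The index-based p99 above is fine for round sample counts but can be off by one for small or odd-sized samples; the stdlib statistics module gives an interpolated quantile estimate instead. A sketch (the helper name is ours):

```python
import statistics

def latency_summary(latencies_ms):
    """Summarize a latency sample: mean and interpolated 99th percentile."""
    # quantiles(n=100) returns 99 cut points; the last one is the p99 estimate
    p99 = statistics.quantiles(latencies_ms, n=100)[-1]
    return {"avg_ms": statistics.fmean(latencies_ms), "p99_ms": p99}

sample = [10.0] * 98 + [50.0, 200.0]  # mostly fast, two slow outliers
summary = latency_summary(sample)
print(f"avg={summary['avg_ms']:.2f}ms p99={summary['p99_ms']:.2f}ms")
```

For latency work, the tail numbers (p99, max) matter more than the mean, since an SLO like the 100ms one registered above is judged against worst-case behavior.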
Error Handling and Resilience
Default Predictions
Clipper returns default predictions when models fail:
# Register with a meaningful default
server.register_application(
    name='hr-hiring',
    input_type='doubles',
    default_output='{"status": "pending_review", "confidence": 0.0}',
    slo_micros=100000
)
Health Monitoring
def check_clipper_health(conn):
    """Monitor Clipper cluster health"""
    apps = conn.get_all_apps()
    models = conn.get_all_models()
    containers = conn.get_all_model_replicas()

    print(f"Registered applications: {len(apps)}")
    print(f"Deployed models: {len(models)}")
    print(f"Active containers: {len(containers)}")

    # Check each container's status
    for container in containers:
        print(f"  - {container['model_name']}:{container['model_version']} "
              f"Status: {container['status']}")
Production Deployment Best Practices
1. Resource Allocation
# Kubernetes deployment for production
apiVersion: apps/v1
kind: Deployment
metadata:
  name: clipper-hr-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: clipper-hr-model
  template:
    metadata:
      labels:
        app: clipper-hr-model
    spec:
      containers:
        - name: hr-model
          image: hr-decision-tree:v1
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
2. Logging and Observability
import logging
import time

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger('clipper-hr-service')

def logged_prediction(inputs):
    """Prediction function with logging"""
    logger.info(f"Received prediction request: {len(inputs)} samples")
    start_time = time.time()
    predictions = hr_model.predict(inputs)
    inference_time = time.time() - start_time
    logger.info(f"Prediction completed in {inference_time:.4f}s")
    return [str(pred) for pred in predictions]
3. Graceful Shutdown
def cleanup_clipper(conn):
    """Clean shutdown of Clipper cluster"""
    # Unlink models from applications
    conn.unlink_model_from_app('hr-hiring', 'hr-decision-tree')
    # Stop model containers
    conn.stop_models('hr-decision-tree')
    # Stop Clipper
    conn.stop_all()
    print("Clipper shutdown complete")
Clipper vs. Other Serving Solutions
| Feature | Clipper | Seldon | TensorFlow Serving | TorchServe |
|---|---|---|---|---|
| Multi-framework | Yes | Yes | TF only | PyTorch only |
| REST API | Yes | Yes | Yes | Yes |
| gRPC | No | Yes | Yes | Yes |
| A/B Testing | Yes | Yes | Manual | Manual |
| Adaptive Batching | Yes | Yes | Yes | Yes |
| Caching | Yes | No | No | No |
| Python-native | Yes | Yes | No | Yes |
Conclusion
Clipper provides a practical solution for deploying ML models as REST APIs with minimal infrastructure code. Its key strengths include:
- Framework Agnosticism: Deploy models from any Python ML library
- Simple REST Interface: Standard HTTP endpoints for predictions
- Built-in Optimization: Automatic batching and caching
- Model Management: Version control and seamless updates
- Container-based Scaling: Docker-native deployment
The HR hiring prediction example demonstrates how to transform a scikit-learn model into a production-ready service. The same patterns apply to any ML use case - from image classification to recommendation systems.
For teams seeking to bridge the gap between data science experimentation and production deployment, Clipper offers a lightweight yet powerful approach. It handles the operational complexity of model serving, allowing data scientists to focus on what they do best - building accurate and impactful models.
The containerized approach ensures portability across environments, from local development to cloud-native Kubernetes deployments. Whether you're serving a single model or managing a fleet of ML services, Clipper provides the foundation for reliable, scalable prediction APIs.
Explore the implementation at github.com/mgorav/clipper and adapt these patterns for your ML model serving requirements.