Run Inference on a Local Sandbox
Deploy a model to a Triton inference server running in a local Michelangelo sandbox cluster.
Prerequisites
- Repository: Local checkout with
$REPOROOTpointing to the repo root - Tooling:
poetry,docker,k3d
Procedure
- Change to the Python workspace:
cd $REPOROOT/python
- Create the Michelangelo sandbox:
poetry run ma sandbox create
- Initialize the inference demo environment:
poetry run ma sandbox demo inference
This command:
- Creates an
InferenceServerCR namedinference-server-examplein thedefaultnamespace - Deploys a Triton inference server with the model-sync sidecar
- Creates the model ConfigMap for dynamic model loading
- Sets up the Gateway and HTTPRoute infrastructure
- Upload your Triton model to MinIO storage:
This step can be done manually or through a Uniflow Pipeline. Your model artifacts should be placed in the deploy-models bucket. The sandbox includes a MinIO instance accessible at http://localhost:9000 (credentials: minioadmin/minioadmin).
- Apply a Deployment CR to load the model:
apiVersion: michelangelo.api/v2
kind: Deployment
metadata:
name: bert-cola-deployment
namespace: default
labels:
app: bert-cola-example
spec:
desiredRevision:
name: bert-cola-example
namespace: default
inferenceServer:
name: inference-server-example
namespace: default
selector:
matchLabels:
environment: production
deletionSpec:
deleted: false
strategy:
rolling:
incrementPercentage: 20
definition:
type: TARGET_TYPE_INFERENCE_SERVER
subType: realtime-serving
modelFamily:
name: bert-cola-family
namespace: default
owner:
name: "user-1234"
kubectl apply -f deployment.yaml
- Run inference against the deployed model:
curl -X POST http://localhost:8080/inference-server-example/bert-cola-deployment/infer \
-H "Content-Type: application/json" \
-d '{
"inputs": [
{
"name": "input_ids",
"shape": [1, 10],
"datatype": "INT64",
"data": [101, 7592, 999, 102, 0, 0, 0, 0, 0, 0]
},
{
"name": "attention_mask",
"shape": [1, 10],
"datatype": "INT64",
"data": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
}
]
}'
Outcome:
- Sandbox cluster is running with Michelangelo controllers
- Triton inference server is deployed and healthy
- Model is loaded and serving inference requests
Note: A remote cluster solution where inference servers are hosted in clusters separate from the control plane is coming soon.