Register a Compute Cluster
Register an existing Kubernetes compute cluster with the Michelangelo control plane so that the control plane can run Ray jobs on it.
Prerequisites
- Existing Kubernetes compute cluster accessible via kubectl
- KubeRay operator installed in the compute cluster (ray-system namespace)
- Michelangelo control plane running
- Access to object storage (S3/MinIO) used by the control plane
- RBAC manifest for a service account with permissions to manage Ray jobs and Ray clusters (see the Appendix)
- Network connectivity between control plane and compute cluster (control plane must be able to reach compute cluster API server)
What gets configured
- RBAC for ray-manager ServiceAccount in default namespace
- Storage configuration in compute cluster:
  - michelangelo-config ConfigMap (S3 endpoint/credentials)
  - aws-credentials Secret
- Cluster CRD in control plane (ma-system) pointing to the compute cluster
- Secrets in control plane for compute cluster CA and client token
# Example names
COMPUTE_CLUSTER=michelangelo-compute-0
COMPUTE_CONTEXT=my-compute-cluster-context # Your kubectl context for the compute cluster
CONTROL_PLANE_CONTEXT=my-control-plane-context # Your kubectl context for the control plane
1) Configure storage in the compute cluster
Ensure Ray pods inherit the same storage configuration used by the control plane.
Create the michelangelo-config ConfigMap with your storage configuration:
cat <<EOF | kubectl --context "${COMPUTE_CONTEXT}" apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: michelangelo-config
data:
  MA_FILE_SYSTEM: s3://default
  MA_FILE_SYSTEM_S3_SCHEME: http
  AWS_ACCESS_KEY_ID: minioadmin
  AWS_SECRET_ACCESS_KEY: minioadmin
  AWS_ENDPOINT_URL: << MINIO STORAGE URL >>
EOF
Create the aws-credentials Secret for AWS CLI access (adjust values as needed for your environment):
kubectl --context "${COMPUTE_CONTEXT}" create secret generic aws-credentials \
--from-literal=AWS_ACCESS_KEY_ID=minioadmin \
--from-literal=AWS_SECRET_ACCESS_KEY=minioadmin
2) Apply RBAC for Ray management in the compute cluster
kubectl --context "${COMPUTE_CONTEXT}" apply -f docs/batchjobs/Resources/rbac-ray.yaml
This creates ServiceAccount ray-manager and grants permissions on rayclusters and rayjobs.
3) Create a token Secret for the ray-manager ServiceAccount
Production approach: Create a Secret of type kubernetes.io/service-account-token that Kubernetes will automatically populate with a token. This creates a long-lived token that persists until the Secret is deleted.
cat <<EOF | kubectl --context "${COMPUTE_CONTEXT}" apply -f -
apiVersion: v1
kind: Secret
metadata:
  name: ray-manager-token
  namespace: default
  annotations:
    kubernetes.io/service-account.name: ray-manager
type: kubernetes.io/service-account-token
EOF
# Wait for Kubernetes to populate the token (usually takes a few seconds)
kubectl --context "${COMPUTE_CONTEXT}" -n default wait --for=jsonpath='{.data.token}' --timeout=30s secret/ray-manager-token
# Extract the token
kubectl --context "${COMPUTE_CONTEXT}" -n default get secret ray-manager-token -o jsonpath='{.data.token}' | base64 -d > /tmp/${COMPUTE_CLUSTER}-token
Note for testing/development only: If you need a token with a specific expiration time for testing, you can use kubectl create token, but this is not recommended for production as it creates short-lived tokens (default 1 hour, max configurable duration varies by cluster):
# Testing only - not recommended for production
kubectl --context "${COMPUTE_CONTEXT}" -n default create token ray-manager --duration=87600h > /tmp/${COMPUTE_CLUSTER}-token
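Before copying the token to the control plane, it can help to confirm that the file really holds a token. The sketch below assumes the token is a JWT, which is the format Kubernetes uses for service account tokens, and only checks its shape (three non-empty, dot-separated segments). The helper name looks_like_jwt is hypothetical, and the check does not validate the signature.

```shell
# Hypothetical helper: succeeds if the file holds one non-empty token made of
# exactly three dot-separated segments (the JWT layout).
looks_like_jwt() {
  token_file="$1"
  [ -s "$token_file" ] || return 1          # missing or empty file
  token=$(tr -d '\n' < "$token_file")
  case "$token" in
    *.*.*.*)  return 1 ;;                   # three or more dots: too many segments
    ?*.?*.?*) return 0 ;;                   # exactly two dots, all segments non-empty
    *)        return 1 ;;
  esac
}

# Usage (not run here):
#   looks_like_jwt "/tmp/${COMPUTE_CLUSTER}-token" && echo "token looks valid"
```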
4) Extract CA data from the compute cluster
Extract the kubeconfig for the compute cluster and parse the certificate authority data.
# Get kubeconfig (adjust command based on your setup)
kubectl --context "${COMPUTE_CONTEXT}" config view --minify --raw > /tmp/${COMPUTE_CLUSTER}-kubeconfig
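# Extract clusters[0].cluster.certificate-authority-data from the saved kubeconfig,
# base64 decode it, and save it as /tmp/${COMPUTE_CLUSTER}-cadata
kubectl config view --kubeconfig=/tmp/${COMPUTE_CLUSTER}-kubeconfig --raw -o jsonpath='{.clusters[0].cluster.certificate-authority-data}' | base64 -d > /tmp/${COMPUTE_CLUSTER}-cadata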
Alternatively, if you have the kubeconfig file directly:
# Extract server URL and CA data from kubeconfig
kubectl config view --kubeconfig=/path/to/compute-cluster-kubeconfig --minify --raw > /tmp/${COMPUTE_CLUSTER}-kubeconfig
Parse clusters[0].cluster.certificate-authority-data from the kubeconfig (base64 decode) and save as /tmp/${COMPUTE_CLUSTER}-cadata.
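The parse-and-decode step can be sketched without kubectl as follows. This is a minimal sketch, assuming the minified kubeconfig contains a single inline certificate-authority-data entry on one line; the helper names extract_ca_data and is_pem_certificate are hypothetical. The second helper is a cheap sanity check that the decoded output looks like a PEM certificate.

```shell
# Hypothetical helper: print the decoded CA from the first
# certificate-authority-data entry found in a kubeconfig file.
extract_ca_data() {
  kubeconfig="$1"
  awk '$1 == "certificate-authority-data:" { print $2; exit }' "$kubeconfig" | base64 -d
}

# Hypothetical helper: succeeds if the file contains PEM certificate markers.
is_pem_certificate() {
  grep -q 'BEGIN CERTIFICATE' "$1" && grep -q 'END CERTIFICATE' "$1"
}

# Usage (not run here):
#   extract_ca_data "/tmp/${COMPUTE_CLUSTER}-kubeconfig" > "/tmp/${COMPUTE_CLUSTER}-cadata"
#   is_pem_certificate "/tmp/${COMPUTE_CLUSTER}-cadata" || echo "CA data does not look like PEM"
```

If the kubeconfig references the CA by file path (certificate-authority) instead of inline data, this sketch finds nothing and the CA file has to be copied directly.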
5) Register the compute cluster in the control plane (Cluster CRD)
First, derive the API host and port from clusters[0].cluster.server in the compute kubeconfig (e.g., https://compute-cluster.example.com:6443).
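Splitting the server URL into the host and port values can be sketched with plain shell parameter expansion. The helper name split_server_url is hypothetical. The sketch assumes the URL has a scheme, falls back to port 443 when no port is given (reasonable only for https), and does not handle a bare IPv6 address without a port.

```shell
# Hypothetical helper: print "<scheme>://<host> <port>" for a server URL.
split_server_url() {
  server="$1"
  scheme="${server%%://*}"      # text before "://"
  rest="${server#*://}"         # text after "://"
  rest="${rest%%/*}"            # drop any path component
  case "$rest" in
    *:*) host="${rest%:*}"; port="${rest##*:}" ;;
    *)   host="$rest"; port=443 ;;   # assumption: https default port
  esac
  printf '%s://%s %s\n' "$scheme" "$host" "$port"
}

# Usage (not run here):
#   server=$(kubectl config view --kubeconfig="/tmp/${COMPUTE_CLUSTER}-kubeconfig" \
#     --raw -o jsonpath='{.clusters[0].cluster.server}')
#   split_server_url "$server"
```

The first field keeps the scheme because the host value in the Cluster manifest includes it; the second field is the port, which the manifest expects as a quoted string.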
Create the Cluster CRD in the control plane:
cat <<EOF | kubectl --context "${CONTROL_PLANE_CONTEXT}" apply -f -
apiVersion: michelangelo.api/v2
kind: Cluster
metadata:
  name: ${COMPUTE_CLUSTER}
  namespace: ma-system
spec:
  kubernetes:
    rest:
      host: https://compute-cluster.example.com # replace with your compute cluster API server host
      port: "6443" # replace with your compute cluster API server port
      tokenTag: cluster-${COMPUTE_CLUSTER}-client-token
      caDataTag: cluster-${COMPUTE_CLUSTER}-ca-data
    skus: []
EOF
6) Create Secrets in the control plane for CA and token
# CA secret with key 'cadata'
kubectl --context "${CONTROL_PLANE_CONTEXT}" apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: cluster-${COMPUTE_CLUSTER}-ca-data
  namespace: default
stringData:
  cadata: "$(cat /tmp/${COMPUTE_CLUSTER}-cadata)"
EOF
# Token secret with key 'token'
kubectl --context "${CONTROL_PLANE_CONTEXT}" apply -f - <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: cluster-${COMPUTE_CLUSTER}-client-token
  namespace: default
stringData:
  token: "$(cat /tmp/${COMPUTE_CLUSTER}-token)"
EOF
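One caveat with embedding file contents in a double-quoted YAML value: YAML folds single line breaks inside a double-quoted scalar into spaces, so a multi-line PEM may not reach the Secret byte-for-byte. If the control plane expects the CA with its line breaks intact (an assumption; this guide does not say), a literal block scalar preserves them. The sketch below uses a hypothetical helper, render_secret, that prints such a manifest to stdout without applying it.

```shell
# Hypothetical helper: print a Secret manifest whose value is a YAML literal
# block scalar, so line breaks in the source file are preserved.
render_secret() {
  name="$1"; key="$2"; file="$3"
  cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: ${name}
  namespace: default
stringData:
  ${key}: |
EOF
  # Indent every line of the file so it belongs to the block scalar.
  sed 's/^/    /' "$file"
}

# Usage (not run here):
#   render_secret "cluster-${COMPUTE_CLUSTER}-ca-data" cadata "/tmp/${COMPUTE_CLUSTER}-cadata" \
#     | kubectl --context "${CONTROL_PLANE_CONTEXT}" apply -f -
```

Printing the manifest first also makes it easy to inspect before applying. The single-line token is not affected by folding, so the original form works for it.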
Verification
- RBAC present:
kubectl --context ${COMPUTE_CONTEXT} -n default get sa ray-manager
- Token Secret present and populated:
kubectl --context ${COMPUTE_CONTEXT} -n default get secret ray-manager-token
- Verify the Secret has a token key (the byte count should be > 0):
kubectl --context ${COMPUTE_CONTEXT} -n default get secret ray-manager-token -o jsonpath='{.data.token}' | wc -c
- Storage configuration present:
kubectl --context ${COMPUTE_CONTEXT} get configmap michelangelo-config
kubectl --context ${COMPUTE_CONTEXT} get secret aws-credentials
- Cluster registered in control plane:
kubectl --context ${CONTROL_PLANE_CONTEXT} -n ma-system get clusters
- Optional: Run a pipeline as in docs/batchjobs/Run Uniflow Pipeline on Compute Cluster.md.
Troubleshooting
- context mismatch: Ensure you target ${COMPUTE_CONTEXT} vs ${CONTROL_PLANE_CONTEXT} correctly.
- network connectivity: Control plane must be able to reach the compute cluster API server. Verify network connectivity and firewall rules.
- token Secret not populated: If the token Secret is not automatically populated, verify the ServiceAccount exists and check the Secret's annotations. The Secret should have the kubernetes.io/service-account.name: ray-manager annotation.
- API server access: Ensure the host and port in the Cluster CRD match the actual compute cluster API server endpoint accessible from the control plane.
- token rotation: For production, implement a token rotation policy. The Secret-based token persists until the Secret is deleted, so plan for periodic rotation.
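When planning rotation, it can be useful to see whether a given token expires at all. The sketch below reads the exp claim from the token's JWT payload; tokens issued with kubectl create token typically carry one, while Secret-based tokens typically do not. The helper name jwt_exp is hypothetical, and the sketch assumes compact JSON with no space after the "exp" key. It does not verify the signature.

```shell
# Hypothetical helper: print the "exp" claim (Unix seconds) of a JWT,
# or print nothing if the token carries no expiry.
jwt_exp() {
  # The payload is the second dot-separated segment, base64url-encoded.
  payload=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # base64url drops '=' padding; restore it so base64 -d accepts the input.
  while [ $(( ${#payload} % 4 )) -ne 0 ]; do
    payload="${payload}="
  done
  printf '%s' "$payload" | base64 -d | sed -n 's/.*"exp":\([0-9][0-9]*\).*/\1/p'
}

# Usage (not run here):
#   jwt_exp "$(cat /tmp/${COMPUTE_CLUSTER}-token)"
```

An empty result suggests a non-expiring token, which is the case where a manual rotation schedule matters most.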
Appendix
Ray RBAC Manifest
The RBAC manifest (docs/batchjobs/Resources/rbac-ray.yaml) should contain:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ray-manager
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
# ClusterRole for Spark/Ray jobs across namespaces; one federated client uses a single ServiceAccount secret to create jobs in multiple namespaces.
kind: ClusterRole
metadata:
  name: ray-manager
rules:
  - apiGroups: ["ray.io"]
    resources: ["rayclusters", "rayjobs"]
    verbs: ["create","get","list","watch","update","patch","delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: ray-manager-binding
subjects:
  - kind: ServiceAccount
    name: ray-manager
    namespace: default
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: ray-manager