Model-Based Round Robin via CRs

This feature enables model-based round-robin routing, which distributes requests across multiple LLM backends based on the configured AI models. This guide explains how to enable model-based round robin for an AI API using Kubernetes custom resources (CRs).

Step 1 – Obtain the CRs for Your API Configuration

You can find sample CRs in the following file: ai-api-model-routing.yaml.

Replace the sample secret value base64_encoded_api_key with the base64-encoded API key from your LLM service provider.
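
For reference, the secret stores the key in base64 form. Below is a minimal sketch; the secret name and data key are illustrative, so match them to whatever the sample file actually references:

# Encode the provider API key (replace sk-xxxx with your real key)
echo -n "sk-xxxx" | base64

apiVersion: v1
kind: Secret
metadata:
  name: chat-api-key          # illustrative; use the name referenced in ai-api-model-routing.yaml
  namespace: apk
type: Opaque
data:
  apiKey: <base64_encoded_api_key>   # illustrative key field; paste the output of the command above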

Alternatively, you can follow the steps in the Develop and Deploy an AI API via CRs guide to create the CRs from scratch. If you already have an apk-conf file, that guide also explains how to generate the CRs from it.

Step 2 – Configure Model-Based Round Robin in the APIPolicy CR

Model-based round-robin routing is enabled through an APIPolicy. Create or modify your APIPolicy CR to specify how requests should be distributed among models. Below is a sample APIPolicy CR:

apiVersion: dp.wso2.com/v1alpha4
kind: APIPolicy
metadata:
  name: chat-round-robin-prod-sand-api-policy
spec:
  default:
    aiProvider:
      name: "my-openai-ai-new1"
    modelBasedRoundRobin:
      onQuotaExceedSuspendDuration: 60
      productionModels:
        - model: "gpt-4o"
          backendRef: 
            group: "dp.wso2.com"
            kind: "Backend"
            name: "chat-backend-1"
          weight: 1
        - model: "o1-mini"
          backendRef: 
            group: "dp.wso2.com"
            kind: "Backend"
            name: "chat-backend-2"
          weight: 1
        - model: "gpt-4o-mini"
          backendRef: 
            group: "dp.wso2.com"
            kind: "Backend"
            name: "chat-backend-1"
          weight: 1
      sandboxModels:
        - model: "gpt-4o"
          backendRef: 
            group: "dp.wso2.com"
            kind: "Backend"
            name: "chat-backend-1"
          weight: 1
        - model: "o1-mini"
          backendRef: 
            group: "dp.wso2.com"
            kind: "Backend"
            name: "chat-backend-2"
          weight: 1
        - model: "gpt-4o-mini"
          backendRef: 
            group: "dp.wso2.com"
            kind: "Backend"
            name: "chat-backend-2"
          weight: 1
  targetRef:
    group: gateway.networking.k8s.io
    kind: API
    name: chat-service-api-prod-sand
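
Each backendRef in the policy must resolve to an existing Backend CR. The following is a minimal sketch of one such Backend; the apiVersion may differ across APK releases, and the host and port are assumptions for an OpenAI-style endpoint:

apiVersion: dp.wso2.com/v1alpha2
kind: Backend
metadata:
  name: chat-backend-1
spec:
  protocol: https
  services:
    # Illustrative upstream; point this at your LLM provider's endpoint
    - host: api.openai.com
      port: 443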

Note

Tip for Easier Configuration: APK provides a VS Code plugin with syntax highlighting and intelligent suggestions, which simplifies adding rate limits, creating new resources, and configuring security for your API. Use it to adjust the APK configuration file as needed. For more information, see Enhance Configuration with APK Config Language Support.

Note

  • The onQuotaExceedSuspendDuration parameter specifies the duration, in seconds, for which an AI model is suspended from the round-robin rotation after its quota is exceeded.
  • The productionModels parameter specifies the list of models to use in the production environment.
  • The sandboxModels parameter specifies the list of models to use in the sandbox environment.
  • The model parameter specifies the model name.
  • The backendRef parameter specifies the reference to the Backend CR that serves the model.
  • The weight parameter specifies the relative weight of the model in the round-robin rotation; see the example after this list.
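
For example, assuming weights are interpreted proportionally, routing roughly twice as much production traffic to gpt-4o as to o1-mini would look like this (a sketch using only the fields shown above):

productionModels:
  - model: "gpt-4o"
    backendRef:
      group: "dp.wso2.com"
      kind: "Backend"
      name: "chat-backend-1"
    weight: 2   # receives about twice the traffic of a weight-1 model
  - model: "o1-mini"
    backendRef:
      group: "dp.wso2.com"
      kind: "Backend"
      name: "chat-backend-2"
    weight: 1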

Step 3 – Deploy the CRs

Once you have prepared the CRs described above, the next step is to apply them to the Kubernetes API server. APK will process and deploy your API seamlessly, taking full advantage of the Kubernetes infrastructure.

kubectl apply -f <path_to_CR_files> -n apk
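
To confirm that the resources were created, you can list them; the resource names below assume the APK CRDs are installed in your cluster:

kubectl get apis.dp.wso2.com,apipolicies.dp.wso2.com,backends.dp.wso2.com -n apk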

Step 4 – Verify the API Invocation

Generate an access token and invoke the API. The request format is as follows:

curl -X POST "https://<host>:9095/<basePath>/chat/completions" \
-H "Content-Type: application/json" \
-H 'Authorization: Bearer <access-token>' \
-d <data> -k

For example:

curl -k --location 'https://default.gw.wso2.com:9095/chat-service-prod-sand/1.0/chat/completions' \
--header 'Host: default.gw.wso2.com' \
--header 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsICJ0eXAiOiJKV1QiLCAia2lkIjoiZ2F0ZXdheV9jZXJ0aWZpY2F0ZV9hbGlhcyJ9.eyJpc3MiOiJodHRwczovL2lkcC5hbS53c28yLmNvbS90b2tlbiIsICJzdWIiOiI0NWYxYzVjOC1hOTJlLTExZWQtYWZhMS0wMjQyYWMxMjAwMDIiLCAiYXVkIjoiYXVkMSIsICJleHAiOjE3MzkwNzMxOTYsICJuYmYiOjE3MzkwNjk1OTYsICJpYXQiOjE3MzkwNjk1OTYsICJqdGkiOiIwMWVmZTY5MS0wMjk4LTFmNjAtYTdjYy1kOTZiYmQyMTFhNjYiLCAiY2xpZW50SWQiOiI0NWYxYzVjOC1hOTJlLTExZWQtYWZhMS0wMjQyYWMxMjAwMDIiLCAic2NvcGUiOiJhcGs6YXBpX2NyZWF0ZSJ9.aTm7WUGWfEk8ZnSgCRZRcas9fNAvBo0Yj3zpo07o8Fq0rE2b8XNUU8GJqujo4DIRupEV6GDxk3ECKs-_BdprQQLVDidU7knUUeaSYsAk6xP0AdYFL1PhNKRS_1XbIPvILc5mLEYgeo9PRjkFbuD0FZKdKgCicY5sWze2tGyxGwiErMuxQTLrTNJdO48HACP4WlvbOIoOKg_xlxwOyYq5a0X6aTA218eZW3KKaxPmJK7kDzMDfm8r-UeEMfDP2go2IaBs4Pz0b8qsFR7jSMiOSjST_Id8ueyLy_EfbFAdd93qL3ByK-NKviqNG9JAUWEq_zbokJnn7a6IDqBq2DG7uw' \
--header 'Content-Type: application/json' \
--request POST \
-d '{
  "model": "gpt-4o-mini",
  "store": true,
  "messages": [
    {"role": "user", "content": "write a haiku about ai"}
  ]
}'

A sample response:
{
  "id": "chatcmpl-AyrkK7FERTeVJ4vHr29eaWh1laAWC",
  "object": "chat.completion",
  "created": 1739069644,
  "model": "gpt-4o-mini-2024-07-18",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Silent thoughts unfold,  \nLines of code weave minds anew,  \nDreams in circuits hum.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 20,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_bd83329f63"
}

After invoking the API multiple times, you will see responses served by the different configured models (shown in the model field of each response), verifying that model-based round-robin routing is working as expected.
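
To observe the rotation quickly, you can repeat the request and print only the model field of each response. This is a minimal sketch that assumes jq is installed; substitute your own access token:

for i in $(seq 1 6); do
  curl -sk 'https://default.gw.wso2.com:9095/chat-service-prod-sand/1.0/chat/completions' \
    --header 'Host: default.gw.wso2.com' \
    --header 'Authorization: Bearer <access-token>' \
    --header 'Content-Type: application/json' \
    --request POST \
    -d '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "write a haiku about ai"}]}' \
    | jq -r '.model'   # prints the model that served each request
done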