
Model-Based Round Robin via APK Conf

This feature enables model-based round-robin routing, which distributes requests among multiple LLM backends according to the configured AI models and their weights. This guide explains how to configure model-based round robin using the apk-conf file.

Step 1 – Add the Model-Based Round Robin Policy to the apk-conf File

Below is a sample policy configuration for model-based round robin:

apiPolicies:
  request:
    - policyName: "ModelBasedRoundRobin"
      policyVersion: v1
      parameters:
        onQuotaExceedSuspendDuration: 60
        productionModels:
          - model: "gpt-4o"
            endpoint: "https://api.openai.com/v1"
            weight: 1
          - model: "o1-mini"
            endpoint: "https://api.openai.com/v1"
            weight: 1
          - model: "gpt-4o-mini"
            endpoint: "https://api.openai.com/v1"
            weight: 1
        sandboxModels:
          - model: "gpt-4o"
            endpoint: "https://api.openai.com/v1"
            weight: 1
          - model: "o1-mini"
            endpoint: "https://api.openai.com/v1"
            weight: 1
          - model: "gpt-4o-mini"
            endpoint: "https://api.openai.com/v1"
            weight: 1

This policy distributes requests across the configured models in weighted round-robin order.

Note

To streamline the configuration process, APK provides a VS Code plugin that offers syntax highlighting and intelligent suggestions, making it easier to add rate limits, new resources, and security configurations to your API. Adapt the contents of the APK configuration file as needed. For further details, refer to the section on Enhance Configuration with APK Config Language Support.

Note

  • The onQuotaExceedSuspendDuration parameter specifies the duration, in seconds, for which a model is suspended from the round-robin rotation after its quota is exceeded.
  • The productionModels parameter specifies the list of models to be used in the production environment.
  • The sandboxModels parameter specifies the list of models to be used in the sandbox environment.
  • The model parameter specifies the model name.
  • The endpoint parameter specifies the endpoint that serves the model.
  • The weight parameter specifies the relative weight of the model in the rotation, as illustrated in the example below.
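
For instance, assuming weights act as relative request shares, a production configuration like the following would route roughly two of every three requests to gpt-4o and one of every three to gpt-4o-mini:

productionModels:
  - model: "gpt-4o"
    endpoint: "https://api.openai.com/v1"
    weight: 2
  - model: "gpt-4o-mini"
    endpoint: "https://api.openai.com/v1"
    weight: 1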

Below is a complete apk-conf file with this policy included:

name: "chat-service-api-prod-sand"
basePath: "/chat-service-prod-sand"
version: "1.0"
type: "REST"
defaultVersion: false
subscriptionValidation: false
aiProvider:
  name: "OpenAI"
  apiVersion: "v1"
endpointConfigurations:
  production:
    - endpoint: "https://api.openai.com/v1"
      endpointSecurity:
        enabled: true
        securityType:
          secretName: "open-ai-secret"
          in: "Header"
          apiKeyNameKey: "Authorization"
          apiKeyValueKey: "apiKey"
  sandbox:
    - endpoint: "https://api.openai.com/v1"
      endpointSecurity:
        enabled: true
        securityType:
          secretName: "open-ai-secret"
          in: "Header"
          apiKeyNameKey: "Authorization"
          apiKeyValueKey: "apiKey"
operations:
- target: "/chat/completions"
  verb: "POST"
  secured: true
  scopes: []
apiPolicies:
  request:
    - policyName: "ModelBasedRoundRobin"
      policyVersion: v1
      parameters:
        onQuotaExceedSuspendDuration: 60
        productionModels:
          - model: "gpt-4o"
            endpoint: "https://api.openai.com/v1"
            weight: 1
          - model: "o1-mini"
            endpoint: "https://api.openai.com/v1"
            weight: 1
          - model: "gpt-4o-mini"
            endpoint: "https://api.openai.com/v1"
            weight: 1
        sandboxModels:
          - model: "gpt-4o"
            endpoint: "https://api.openai.com/v1"
            weight: 1
          - model: "o1-mini"
            endpoint: "https://api.openai.com/v1"
            weight: 1
          - model: "gpt-4o-mini"
            endpoint: "https://api.openai.com/v1"
            weight: 1

Step 2 – Create the Secret CR to Store the LLM Service API Key

Create a Kubernetes secret containing the API key of the LLM service provider using the following command. Replace <<api key of LLM Service>> with the API key generated for your LLM service provider, and <<namespace>> with the namespace in which the API is deployed:

kubectl create secret generic open-ai-secret --from-literal=apiKey='<<api key of LLM Service>>' --namespace=<<namespace>>

For example:

kubectl create secret generic open-ai-secret --from-literal=apiKey='xxxxxxxxxxxxxxxxxxx'
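
If you want to confirm that the secret was created correctly, you can list it and decode the stored key with standard kubectl commands:

kubectl get secret open-ai-secret --namespace=<<namespace>>
kubectl get secret open-ai-secret --namespace=<<namespace>> -o jsonpath='{.data.apiKey}' | base64 -d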

Step 3 – Deploy the AI API in APK

Follow the instructions in the Develop and Deploy an AI API via REST API guide to deploy the API using the APK configuration.
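
The deployment call generally takes the shape sketched below; the exact host, port, deployer API version, and file names depend on your APK installation, so treat this as an illustrative sketch and follow the linked guide for the authoritative steps:

curl -k --location 'https://api.am.wso2.com:9095/api/deployer/<deployer-version>/apis/deploy' \
--header 'Host: api.am.wso2.com' \
--header 'Authorization: Bearer <access-token>' \
--form 'apkConfiguration=@"/path/to/chat-service.apk-conf"' \
--form 'definitionFile=@"/path/to/chat-service-definition.json"'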

Step 4 – Verify the API Invocation

Generate an access token and invoke the API using the following command:

curl -k --location 'https://default.gw.wso2.com:9095/chat-service-prod-sand/1.0/chat/completions' \
--header 'Host: default.gw.wso2.com' \
--header 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsICJ0eXAiOiJKV1QiLCAia2lkIjoiZ2F0ZXdheV9jZXJ0aWZpY2F0ZV9hbGlhcyJ9.eyJpc3MiOiJodHRwczovL2lkcC5hbS53c28yLmNvbS90b2tlbiIsICJzdWIiOiI0NWYxYzVjOC1hOTJlLTExZWQtYWZhMS0wMjQyYWMxMjAwMDIiLCAiYXVkIjoiYXVkMSIsICJleHAiOjE3MzkwNzMxOTYsICJuYmYiOjE3MzkwNjk1OTYsICJpYXQiOjE3MzkwNjk1OTYsICJqdGkiOiIwMWVmZTY5MS0wMjk4LTFmNjAtYTdjYy1kOTZiYmQyMTFhNjYiLCAiY2xpZW50SWQiOiI0NWYxYzVjOC1hOTJlLTExZWQtYWZhMS0wMjQyYWMxMjAwMDIiLCAic2NvcGUiOiJhcGs6YXBpX2NyZWF0ZSJ9.aTm7WUGWfEk8ZnSgCRZRcas9fNAvBo0Yj3zpo07o8Fq0rE2b8XNUU8GJqujo4DIRupEV6GDxk3ECKs-_BdprQQLVDidU7knUUeaSYsAk6xP0AdYFL1PhNKRS_1XbIPvILc5mLEYgeo9PRjkFbuD0FZKdKgCicY5sWze2tGyxGwiErMuxQTLrTNJdO48HACP4WlvbOIoOKg_xlxwOyYq5a0X6aTA218eZW3KKaxPmJK7kDzMDfm8r-UeEMfDP2go2IaBs4Pz0b8qsFR7jSMiOSjST_Id8ueyLy_EfbFAdd93qL3ByK-NKviqNG9JAUWEq_zbokJnn7a6IDqBq2DG7uw' \
--header 'Content-Type: application/json' \
--request POST \
-d '{
  "model": "gpt-4o-mini",
  "store": true,
  "messages": [
    {"role": "user", "content": "write a haiku about ai"}
  ]
}'

The general format of the request is:

curl -X POST "https://<host>:9095/<basePath>/chat/completions" \
-H "Content-Type: application/json" \
-H 'Authorization: Bearer <access-token>' \
-d <data> -k

A successful invocation returns a response similar to the following:

{
  "id": "chatcmpl-AyrkK7FERTeVJ4vHr29eaWh1laAWC",
  "object": "chat.completion",
  "created": 1739069644,
  "model": "gpt-4o-mini-2024-07-18",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Silent thoughts unfold,  \nLines of code weave minds anew,  \nDreams in circuits hum.",
        "refusal": null
      },
      "logprobs": null,
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 20,
    "total_tokens": 33,
    "prompt_tokens_details": {
      "cached_tokens": 0,
      "audio_tokens": 0
    },
    "completion_tokens_details": {
      "reasoning_tokens": 0,
      "audio_tokens": 0,
      "accepted_prediction_tokens": 0,
      "rejected_prediction_tokens": 0
    }
  },
  "service_tier": "default",
  "system_fingerprint": "fp_bd83329f63"
}

After making multiple requests, you will see responses served by the different configured models, indicated by the model field in each response, confirming that model-based round-robin routing is working as intended.
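
To observe the rotation, you can, for example, script a few invocations and print the model field from each response. This sketch assumes jq is installed; substitute your own gateway host and access token:

for i in $(seq 1 6); do
  curl -sk --location 'https://default.gw.wso2.com:9095/chat-service-prod-sand/1.0/chat/completions' \
    --header 'Host: default.gw.wso2.com' \
    --header 'Authorization: Bearer <access-token>' \
    --header 'Content-Type: application/json' \
    --data '{"model": "gpt-4o-mini", "messages": [{"role": "user", "content": "write a haiku about ai"}]}' \
    | jq -r '.model'
done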