AI Ratelimit Policies Overview

The AIRateLimitPolicy Custom Resource (CR) defines the rate-limiting rules for AI requests, including limits on token usage and on request count. The key fields are described below.

organization

Defines the organization or tenant for which the rate limit applies.

tokenCount

Specifies limits based on token usage within a defined time unit.

  • unit: Time unit over which the token limits apply (for example, Minute).
  • requestTokenCount: Maximum number of tokens allowed in requests within the time unit.
  • responseTokenCount: Maximum number of tokens allowed in responses within the time unit.
  • totalTokenCount: Maximum combined number of request and response tokens within the time unit.
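
As a rough illustration of how the three token limits relate to each other, the Python sketch below compares the usage accumulated in one time window against each limit. The names `TokenLimits` and `exceeded_limits` are hypothetical. The sketch assumes that all three limits are checked independently within the same window and that a limit is exceeded only when usage goes above it. The gateway's actual enforcement logic is not described here and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenLimits:
    """Mirrors the tokenCount fields of the CR (illustrative only)."""
    request_token_count: int
    response_token_count: int
    total_token_count: int


def exceeded_limits(limits: TokenLimits, request_tokens: int, response_tokens: int) -> list[str]:
    """Return the names of the limits that the window's usage exceeds.

    Assumption: usage equal to a limit is still allowed.
    """
    exceeded = []
    if request_tokens > limits.request_token_count:
        exceeded.append("requestTokenCount")
    if response_tokens > limits.response_token_count:
        exceeded.append("responseTokenCount")
    if request_tokens + response_tokens > limits.total_token_count:
        exceeded.append("totalTokenCount")
    return exceeded


# Limits taken from the example CR in this document.
limits = TokenLimits(5000, 10000, 15000)
print(exceeded_limits(limits, 4000, 9000))  # []
print(exceeded_limits(limits, 5500, 9800))  # ['requestTokenCount', 'totalTokenCount']
```

In the second call the response tokens stay within their own limit, but the combined usage (15300) still exceeds totalTokenCount, which is why a separate total limit can be useful.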

requestCount

Defines rate limits based on the number of requests.

  • requestsPerUnit: Limit on the number of requests that can be processed within the time unit.
  • unit: Time unit for the request limit.
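
One simple way to read "requests per unit" is as a fixed-window counter, sketched below in Python. This is an assumption made for illustration only: the document does not state which algorithm the gateway uses, and the `FixedWindowCounter` class is hypothetical. Only the `Minute` unit appears in this document. The `Hour` and `Day` entries are assumed.

```python
import time
from typing import Callable

# Only "Minute" appears in the source document; "Hour" and "Day" are assumed.
UNIT_SECONDS = {"Minute": 60, "Hour": 3600, "Day": 86400}


class FixedWindowCounter:
    """Allows at most `requests_per_unit` requests per window (illustrative only)."""

    def __init__(self, requests_per_unit: int, unit: str,
                 clock: Callable[[], float] = time.monotonic):
        self.limit = requests_per_unit
        self.window = UNIT_SECONDS[unit]
        self.clock = clock
        self.window_start = clock()
        self.count = 0

    def allow(self) -> bool:
        now = self.clock()
        # Start a fresh window once the current one has elapsed.
        if now - self.window_start >= self.window:
            self.window_start = now
            self.count = 0
        if self.count >= self.limit:
            return False
        self.count += 1
        return True


counter = FixedWindowCounter(requests_per_unit=2, unit="Minute")
print([counter.allow() for _ in range(3)])  # [True, True, False]
```

The injectable `clock` parameter exists only so the sketch can be exercised without waiting for a real window to elapse.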

targetRef

Specifies the backend to which the rate limit applies, identified by its kind, name, and group.

Example AIRateLimitPolicy CR:

apiVersion: dp.wso2.com/v1alpha3
kind: AIRateLimitPolicy
metadata:
  name: llm-backend-rl
  namespace: apk
spec:
  override:
    organization: default
    tokenCount:
      unit: Minute
      requestTokenCount: 5000
      responseTokenCount: 10000
      totalTokenCount: 15000
    requestCount:
      requestsPerUnit: 6000
      unit: Minute
  targetRef:
    kind: Backend
    name: backend-33eb53282e93f5fd3f26935311af727d58bd42c3-api
    group: gateway.networking.k8s.io