Your AI bill should not be a surprise you discover at month end.
A corporate AI token governance layer that sits between your users, your agents, and every LLM provider API. Intercept every call. Attribute every token. Enforce every budget. Flag every anomaly — before it costs you.
Every enterprise deploying AI at scale hits these failure modes. TokenGuard was built to close all four.
01
Agentic loops consume 10–20x expected tokens
Autonomous agents running in a loop — a misconfigured retry, an infinite planning cycle, a context window overflow that triggers repeated calls — can consume a month's token budget in hours. Monthly budget checks catch this weeks after the damage is done. You need a circuit breaker that fires in minutes, not a finance review that fires at billing time.
02
Employees use corporate AI keys for personal projects
Corporate API keys are powerful and convenient. Employees route personal coding projects, side hustles, and family document drafting through corporate AI credentials. This is a cost problem, a data exposure problem, and an acceptable-use policy violation. Without token-level attribution and anomaly detection, it is also invisible until someone notices the bill.
03
No team-level visibility means no accountability
Most enterprises share one or two LLM API keys across all teams and all applications. Finance sees a total bill. No team lead knows what their team is spending. No cost centre is accountable. Optimisation conversations cannot happen because the data to have them does not exist. TokenGuard produces the attribution layer that makes AI spend manageable.
04
Budget enforcement happens outside agent code — or not at all
Putting budget logic inside agent code is fragile: it is easy to bypass, hard to audit, and inconsistent across teams. The only reliable enforcement layer is the gateway — the single point all LLM calls pass through. Enforcement that lives outside agent code cannot be circumvented by agent code.
Five enforcement layers, one gateway pass
Every LLM call passes through the same enforcement pipeline. The synchronous layers execute within a 50ms latency budget; anomaly scoring and final cost attribution run off the request path.
01
Attribute
The gateway identifies the user or agent making the call, maps it to a cost centre and team, and estimates the token cost of the request before it is sent. Key attribution takes under 5ms. No LLM call proceeds without identity attribution.
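To make the pre-send estimate concrete, here is a minimal sketch in Python, assuming a tiktoken-style tokenizer and an illustrative price table (the model names, rates, and function are placeholders, not TokenGuard's actual lookup):

```python
import tiktoken

# Illustrative per-1K-input-token prices (assumed values, not provider quotes).
PRICE_PER_1K_INPUT = {"gpt-4o": 0.0025, "gpt-4o-mini": 0.00015}

def estimate_request_cost(model: str, messages: list[dict]) -> float:
    """Rough pre-send cost estimate from the prompt alone; output tokens
    are unknown until the provider responds."""
    enc = tiktoken.encoding_for_model(model)
    prompt_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    return prompt_tokens / 1000 * PRICE_PER_1K_INPUT[model]
```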
02
Budget check
The attributed cost is evaluated against per-request ceilings, per-session rolling budgets, and per-key monthly caps. If any limit is exceeded, the enforcement action (block, throttle, or alert) is applied before the call reaches the LLM provider. Under 10ms.
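A sketch of how the three limits might compose inside the gateway (the policy schema and names are illustrative assumptions):

```python
from dataclasses import dataclass
from enum import Enum

class Action(Enum):
    ALLOW = "allow"
    BLOCK = "block"
    THROTTLE = "throttle"
    ALERT = "alert"

@dataclass
class BudgetPolicy:
    per_request_ceiling: float  # max estimated $ per call
    session_budget: float       # rolling $ budget per session
    monthly_key_cap: float      # hard $ cap per key per month
    on_breach: Action = Action.BLOCK

def check_budget(policy: BudgetPolicy, estimated_cost: float,
                 session_spend: float, monthly_spend: float) -> Action:
    """Evaluate all three limits before the call leaves the gateway."""
    if (estimated_cost > policy.per_request_ceiling
            or session_spend + estimated_cost > policy.session_budget
            or monthly_spend + estimated_cost > policy.monthly_key_cap):
        return policy.on_breach
    return Action.ALLOW
```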
03
Circuit breaker
Rate-of-spend is evaluated against each key's rolling 7-day baseline. A key consuming tokens at 5x its baseline within a 15-minute window triggers the circuit breaker: the key is suspended, an alert is sent, and a human review is required to reinstate it. This catches runaway agent loops before they consume material budget.
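Reduced to its core, the trigger is a comparison of the current 15-minute window against the key's scaled 7-day baseline (a sketch under the stated defaults, not the production implementation):

```python
def breaker_should_fire(tokens_last_15m: int, tokens_last_7d: int,
                        multiplier: float = 5.0) -> bool:
    """Fire when the current 15-minute window exceeds `multiplier` times
    the key's average 15-minute consumption over the past 7 days."""
    windows_per_week = 7 * 24 * 4  # 672 fifteen-minute windows in 7 days
    baseline_per_window = tokens_last_7d / windows_per_week
    return tokens_last_15m > multiplier * baseline_per_window
```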
04
Anomaly score
Signal analysis runs asynchronously: time of day, volume pattern, model selection, session metadata, and role-based baseline deviation are combined into a personal-use anomaly score from 0 to 100. Scores above 70 are flagged for human review. This layer runs in parallel and does not add to request latency.
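One way to picture the combination is a weighted sum of normalised signals clipped to the 0-100 range; the weights and signal names below are illustrative assumptions, not the shipped model:

```python
# Assumed weights for illustration; each signal is normalised to 0..1 upstream.
SIGNAL_WEIGHTS = {
    "after_hours": 25,        # share of requests outside business hours
    "volume_deviation": 30,   # deviation from the role-based volume baseline
    "model_mismatch": 20,     # high-tier model use without business justification
    "session_diversity": 25,  # request patterns inconsistent with the role
}

def anomaly_score(signals: dict[str, float]) -> float:
    """Combine normalised signals into a 0-100 personal-use anomaly score."""
    raw = sum(SIGNAL_WEIGHTS[name] * value for name, value in signals.items())
    return min(100.0, max(0.0, raw))

# Scores above 70 enter the human review queue; nothing is auto-blocked.
```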
05
Log & attribute cost
Every call — forwarded or blocked — is written to an immutable audit log. After the LLM provider responds, the actual token counts and cost are recorded and attributed to the user, agent, team, and cost centre. This is the source of truth for monthly chargeback reports.
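The shape of such a record might look like the following; the field names are illustrative, and the hash-chaining shown is one common way to make an append-only log tamper-evident, assumed here rather than confirmed:

```python
import hashlib
import json
import time

def audit_record(user: str, team: str, cost_centre: str, model: str,
                 decision: str, estimated_tokens: int,
                 actual_tokens: int | None, actual_cost: float | None,
                 prev_hash: str) -> dict:
    """Build an append-only audit entry; chaining each entry to the hash
    of the previous one makes after-the-fact edits detectable."""
    body = {
        "ts": time.time(), "user": user, "team": team,
        "cost_centre": cost_centre, "model": model, "decision": decision,
        "estimated_tokens": estimated_tokens, "actual_tokens": actual_tokens,
        "actual_cost": actual_cost, "prev_hash": prev_hash,
    }
    body["hash"] = hashlib.sha256(
        json.dumps(body, sort_keys=True).encode()).hexdigest()
    return body
```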
What TokenGuard delivers
Five enforcement layers plus full observability — in a single gateway deployment.
| Capability | What it does | Enforcement layer |
| --- | --- | --- |
| Per-request ceiling | Block requests whose estimated token cost exceeds a configurable per-call ceiling | Layer 1 — per request |
| Per-session budget | Track rolling token spend within a session and enforce a session-level budget cap | Layer 2 — per session |
| Monthly key cap | Enforce a hard monthly spend limit per API key, with configurable alert thresholds (80%, 95%, 100%) | Layer 3 — per key / per month |
| Model-tier routing | Redirect requests to a lower-cost model tier when budget thresholds are approached, without blocking | Layer 4 — model routing |
| Circuit breaker | Suspend keys showing anomalous rate-of-spend vs. rolling baseline, catching runaway agent loops before they exhaust monthly budgets | Layer 5 — rate anomaly |
| Personal-use detection | Probabilistic anomaly scoring for personal-use misuse patterns; flags for human review, never auto-blocks | Async — human review queue |
| Cost attribution | Per-user, per-agent, per-team, per-cost-centre attribution of actual token costs after provider response | Audit & chargeback |
| Chargeback reports | Monthly spend reports by team and cost centre, schedulable delivery, CSV export and API access | Finance & accountability |
Three scenarios TokenGuard was built for
Real failure modes from enterprises that deployed AI without a governance layer first.
Engineering
Agentic loop consumes $40K in a weekend
Situation
An autonomous coding agent entered an infinite retry loop on a Friday afternoon. By Monday morning it had made 2.4 million LLM API calls, consuming the team's entire quarterly budget in 60 hours
Failure
No circuit breaker, no rate-of-spend monitoring, no per-session budget cap. Monthly budget check would have caught it 6 weeks later
With TokenGuard
Circuit breaker would have fired at 5x baseline within 15 minutes of loop start, suspending the key automatically and alerting the on-call engineer
Outcome prevented
$39,500 in avoidable LLM spend; Friday-to-Monday outage for other teams sharing the same key
Enterprise
30% of LLM spend traced to personal-use misuse
Situation
A financial services firm deployed TokenGuard for chargeback reporting. Anomaly scoring revealed a cluster of users with sustained high-volume after-hours usage inconsistent with their job functions
Finding
Eleven employees identified as likely personal-use misuse cases, accounting for 29% of total monthly LLM spend. HR process initiated after human review confirmed the pattern
With TokenGuard
Flagging happened within the first billing cycle. Without TokenGuard, the pattern would have been invisible — the total bill looked normal because it was spread across many users
Outcome
31% reduction in monthly LLM costs after misuse addressed; acceptable-use policy updated with enforcement mechanism; no further systematic misuse detected
Finance
CFO requests team-level AI spend accountability
Situation
CFO of a scale-up asked engineering and product leads to provide AI spend forecasts for the next quarter. No team had visibility into their own LLM spend — all calls went through shared corporate keys
Blockers
No attribution layer, no team-level reporting, no historical data to forecast from. Finance team could only see provider invoices with no breakdown
With TokenGuard
Deployed in attribution-only mode first to build historical data. Within 30 days: per-team spend reports available, cost-per-feature baseline established, Q3 forecast produced with 90% confidence
Outcome
First AI spend budget with team-level accountability; three teams voluntarily optimised model selection after seeing their per-request costs; 22% reduction in total LLM spend in Q3
Common questions
How does TokenGuard integrate with our existing LLM provider setup?
TokenGuard deploys as a transparent gateway between your users and agents, and the LLM provider APIs. If you are currently calling OpenAI, Anthropic, Google, Azure OpenAI, or Mistral directly, you point your SDK base URL at the TokenGuard gateway — one configuration change, no application logic changes. TokenGuard issues per-user or per-agent synthetic API keys that route through the gateway; the gateway translates them to your real provider credentials, which are stored in a secrets vault and never exposed to end users or agents. For LangChain, LlamaIndex, and other orchestration frameworks, we provide pre-built callback handlers and middleware that require no changes to agent logic.
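For the OpenAI Python SDK, for example, the switch would look like this; the gateway URL and the synthetic key are placeholders:

```python
from openai import OpenAI

# Before: OpenAI(api_key=CORPORATE_OPENAI_KEY), calling api.openai.com directly.
# After: same application code, one configuration change.
client = OpenAI(
    base_url="https://tokenguard.internal.example.com/v1",  # placeholder gateway URL
    api_key="tg-user-a1b2c3",  # per-user synthetic key issued by TokenGuard
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarise this ticket."}],
)
```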
What happens when a budget limit is breached?
The enforcement action is configurable per budget policy: Block (the LLM call returns an error), Throttle (the call is queued and rate-limited), or Alert (the call goes through but the budget owner and administrator are notified immediately). Most teams start with Alert on first breach and escalate to Block after a review period. Budget violations produce an immutable audit log entry with the full context of the blocked or alerted call — who made it, which model, how many tokens were estimated at request time, and the actual cost recorded after the response. Budget resets are explicit administrative actions, not automatic — preventing end-of-month budget renewal surprises.
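A policy with escalating actions might be expressed along these lines (the schema is an illustrative assumption, not TokenGuard's actual policy format):

```python
policy = {
    "team": "platform-engineering",
    "monthly_cap_usd": 5000,
    "alert_thresholds": [0.80, 0.95, 1.00],  # notify at 80%, 95%, 100% of cap
    "on_first_breach": "alert",   # call proceeds; budget owner is notified
    "on_repeat_breach": "block",  # call returns an error at the gateway
    "reset": "manual",            # resets are explicit admin actions, never automatic
}
```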
How does the circuit breaker stop runaway agent loops?
The circuit breaker monitors rate-of-spend rather than cumulative spend, catching runaway loops that would otherwise exhaust a monthly budget within hours. If a user or agent key shows token consumption exceeding a configurable multiple of its rolling baseline (default: 5x the 7-day rolling average within a 15-minute window), the circuit breaker fires: the key is suspended, an alert is sent to the administrator, and a human review is required to reinstate it. This catches the agentic-loop failure pattern, where one misconfigured agent can consume $50,000–$200,000 in tokens long before a monthly budget check would catch it. The circuit breaker is per-key, so a single runaway agent does not affect other users or teams.
How does personal-use detection work without reading message content?
TokenGuard uses probabilistic signal analysis rather than content reading. Signals include: time of request (outside business hours), request volume pattern (single-user sustained high-volume sessions characteristic of personal projects), model selection (higher-capability models with no business justification on record), session metadata (request diversity patterns inconsistent with role-based use cases), and volume baseline deviation. These signals are combined into an anomaly score from 0 to 100. Scores above 70 are flagged for human review — they are never auto-blocked, because the signals are probabilistic, not deterministic. A reviewer sees the supporting signals and decides: dismiss or confirm. This approach avoids content analysis, preserving user privacy while still detecting systematic misuse.
What do chargeback reports include?
Monthly chargeback reports are generated per billing period and include: total cost by user, team, cost centre, model, and provider; request counts and average cost per request; month-over-month trend; top consumers ranked by spend; and a detailed event log available for download as CSV or via API. Reports are available in the TokenGuard dashboard and can be scheduled for automated delivery to finance and team leads. The cost attribution model is configurable — organisations can attribute by the API key used, by the user identity propagated in request headers, or by the project or cost-centre tag set in the request metadata.
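Pulling a monthly report over the API might look like this; the endpoint path and parameters are assumptions for illustration:

```python
import requests

response = requests.get(
    "https://tokenguard.internal.example.com/api/v1/reports/chargeback",  # placeholder
    params={"period": "2025-06", "group_by": "cost_centre", "format": "csv"},
    headers={"Authorization": "Bearer ADMIN_API_TOKEN"},
    timeout=30,
)
response.raise_for_status()
with open("chargeback-2025-06.csv", "wb") as f:
    f.write(response.content)
```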
Can TokenGuard run fully self-hosted?
Yes. TokenGuard is available as a fully self-hosted deployment for organisations in regulated industries that cannot route LLM call metadata through a cloud-hosted governance product. The on-premise deployment runs on standard Linux infrastructure, requires no external service dependencies at runtime, and stores all audit logs, budget policies, and anomaly events within your environment. Deployment is supported on Kubernetes, Docker Compose, or bare-metal Linux. For organisations already using GDPR Oversight, the two products share an infrastructure layer — reducing deployment footprint when both governance capabilities are needed.
Other products in the governance stack
Compliance
GDPR Oversight
Real-time personal data egress detection, PII blocking, and Article 33-ready breach evidence for AI-connected enterprises.
Ready to put a governance layer on your LLM spend?
Tell us your LLM provider setup, your team structure, and your biggest cost concern. We will show you what TokenGuard detects in your environment and how quickly you can have team-level attribution in place.