How Inference Pulse works
A closer look at the methodology behind the numbers.
What is Inference Pulse?
Inference Pulse is an independent observability tool that tracks the real-world reliability and performance of public AI model provider APIs. We run actual LLM inference requests — not synthetic pings or health checks — from multiple geographic regions, on a regular cadence, and surface the results publicly.
Every data point you see on the dashboard comes from a real API call: we send a prompt, receive a streaming or non-streaming response, and measure what happens.
How probes work
Probes run from distinct cloud regions. Each probe is a fresh, independent HTTP request to a provider's public chat completions endpoint. No state is shared across probes, and no provider is given advance notice of when a probe will fire.
We use two categories of probe to capture different dimensions of provider performance:
- Short probes — a request capped at a small number of output tokens. These run frequently (every few minutes) and measure availability (did the request succeed with a 2xx?) and latency (time to first token). Short probes are the primary signal for the availability percentage and the latency columns.
- Mid-length probes — a request that generates a moderate stream of output tokens. These run less frequently and measure sustained throughput in tokens per second, reported in the TPS column.
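As a rough illustration of what a single probe measures, the sketch below times one streamed response. It is not Inference Pulse's actual probe code: `ProbeResult`, `run_probe`, and the `open_stream` callable are hypothetical names, and counting streamed chunks as tokens is a simplifying assumption.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ProbeResult:
    ok: bool                  # True if the response completed without error
    ttft_s: Optional[float]   # seconds from send to first token, if one arrived
    total_s: float            # seconds from send to end of response (or failure)
    output_tokens: int        # streamed chunks counted as tokens (assumption)


def run_probe(open_stream: Callable[[], Iterable[str]],
              clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Time one probe.

    `open_stream` is a hypothetical callable that sends the request and
    yields output tokens as they arrive. It is assumed to raise on non-2xx
    statuses, timeouts, and connection errors.
    """
    start = clock()
    ttft = None
    count = 0
    try:
        for _token in open_stream():
            if ttft is None:
                ttft = clock() - start   # first token observed
            count += 1
    except Exception:
        # Any failure marks the probe as failed, matching the failure rule
        # used for availability.
        return ProbeResult(False, ttft, clock() - start, count)
    return ProbeResult(True, ttft, clock() - start, count)
```

A short probe would mainly use `ok` and `ttft_s`; a mid-length probe would also use `output_tokens` and `total_s` for throughput.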
Metrics
- Availability: The fraction of short probes that returned a successful 2xx response over the rolling 24-hour window. A probe counts as failed if it returns a non-2xx status, times out, or encounters a connection error.
- Latency (TTFT): Time to first token, the interval between sending the request and receiving the first token of the response. Reported as the average and P95 over successful short probes in the rolling 24-hour window.
- Price ($/1M tokens): Estimated cost per million tokens, blended at a 7:2:1 ratio of cached input, uncached input, and output tokens, a common mix for reasoning-heavy workloads. The blended price is then multiplied by the provider's credit price ratio (e.g., 0.9 for 10% off listed prices) to reflect what you actually pay after discounts. Provider-specific pricing overrides the model-level defaults when available.
- TPS (tok/s): End-to-end output tokens per second, computed as total output tokens divided by total end-to-end latency. Measured from probes that generate at least 100 output tokens and aggregated over the rolling 24-hour window. This captures the throughput you experience in practice, including network overhead from connection setup through the last token.
- Availability timeline: Twenty-four one-hour buckets covering the past day (UTC). Each bar shows the availability ratio for that hour: green (≥98%), yellow (≥50% and <98%), red (<50%), or gray (no data).
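The metric definitions above can be sketched in a few small functions. This is an illustrative reading, not the dashboard's actual implementation. All function names are hypothetical, and the P95 method (nearest rank) and the pooling of tokens and seconds across probes for TPS are assumptions, since the text does not specify them.

```python
from typing import Optional, Sequence, Tuple


def availability(outcomes: Sequence[bool]) -> Optional[float]:
    """Fraction of probes that succeeded; None when there is no data."""
    if not outcomes:
        return None
    return sum(outcomes) / len(outcomes)


def ttft_summary(ttfts_s: Sequence[float]) -> Tuple[float, float]:
    """Average and P95 of TTFT over a non-empty list of successful probes.

    P95 uses the nearest-rank method (an assumption).
    """
    ordered = sorted(ttfts_s)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n), 1-based
    return sum(ordered) / len(ordered), ordered[rank - 1]


def blended_price(cached_in: float, uncached_in: float, output: float,
                  credit_ratio: float = 1.0) -> float:
    """$/1M tokens blended 7:2:1 (cached input : uncached input : output),
    then scaled by the provider's credit price ratio."""
    blended = (7 * cached_in + 2 * uncached_in + 1 * output) / 10
    return blended * credit_ratio


def pooled_tps(probes: Sequence[Tuple[int, float]],
               min_tokens: int = 100) -> Optional[float]:
    """Total output tokens / total end-to-end seconds, over probes that
    produced at least `min_tokens` tokens. Each probe is (tokens, seconds)."""
    kept = [(n, s) for n, s in probes if n >= min_tokens]
    total_tokens = sum(n for n, _ in kept)
    total_seconds = sum(s for _, s in kept)
    if total_seconds <= 0:
        return None
    return total_tokens / total_seconds


def bucket_color(ratio: Optional[float]) -> str:
    """Color of one hourly timeline bar."""
    if ratio is None:
        return "gray"
    if ratio >= 0.98:
        return "green"
    if ratio >= 0.50:
        return "yellow"
    return "red"
```

For example, with made-up list prices of $0.50 (cached input), $5.00 (uncached input), and $15.00 (output) per million tokens and a credit ratio of 0.9, the blend is (7 × 0.50 + 2 × 5.00 + 15.00) / 10 = $2.85, or about $2.57 after the discount.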
Probe regions & cadence
Probes originate from multiple cloud regions to surface geographic variation in latency and reliability. Short probes run every few minutes; mid-length probes run on a longer interval to keep provider load reasonable. The exact set of regions and intervals may evolve over time.
Scope & limitations
We monitor public provider endpoints only. Private deployments, fine-tuned models, and authenticated-only APIs are out of scope.
The metrics reflect the view from our probe regions at the time of each check. They are a best-effort signal, not a contractual SLA. A provider showing 100% availability here may still experience issues in regions or time windows we do not cover.
We do not send customer data, PII, or proprietary prompts through the probes. Every request uses a fixed, benign prompt designed solely for measurement.