Methodology

How PowerOut.ai measures AI model performance

What We Measure

PowerOut.ai makes real API calls to 393 AI models across 4 provider gateways. Check frequency varies by model — flagship models are checked as often as every five minutes; long-tail and budget-tier models are checked less frequently to keep total monitoring cost under control (see Check Cadence & Tiers below). Each call is a lightweight inference request — a short prompt that requires the model to generate a response. We measure the complete request lifecycle:

Response Time (TTFT) — Time from request sent to first token received. This is the latency a user experiences before the model starts "typing."
Total Time — Time from request sent to last token received.
Connect Time — TCP/TLS handshake duration. Isolates network latency from model processing time.
HTTP Status — Whether the API returned successfully, errored, or timed out.
Token Counts — Input and output token counts reported by the provider.
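As a minimal sketch of how first-token and total latency can be separated, the helper below consumes a streamed response and timestamps the first chunk and the end of the stream. The names `Timing` and `time_stream` are hypothetical, and the sketch assumes the stream is lazy (the request is only issued once iteration begins), so the start timestamp is taken just before the request goes out. Connect time would need to be measured separately at the socket level.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Timing:
    ttft_s: Optional[float]  # request sent -> first token; None if nothing arrived
    total_s: float           # request sent -> end of stream
    chunks: int              # number of streamed chunks observed


def time_stream(chunks: Iterable[str],
                clock: Callable[[], float] = time.monotonic) -> Timing:
    """Consume a lazy token stream and record first-token and total latency."""
    start = clock()          # assumption: iteration below is what triggers the request
    ttft = None
    count = 0
    for _ in chunks:
        if ttft is None:
            ttft = clock() - start   # first chunk marks time to first token
        count += 1
    return Timing(ttft_s=ttft, total_s=clock() - start, chunks=count)
```

Passing the clock as a parameter keeps the measurement logic testable without a live API call; a monotonic clock avoids errors from wall-clock adjustments.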

How We Classify Status

Each model is classified into one of four states based on consecutive check results:

Healthy — Model responded successfully with normal latency.

Degraded — 1-2 consecutive failures. Could be transient. Not classified as an outage.

Down — 3+ consecutive failures. Confirmed outage. Incident created.

Recovering — Model was down but has started responding. Needs 2 consecutive successes to return to healthy.

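Requiring consecutive failures prevents false alarms from transient network blips. A single failed check does NOT trigger an outage; it takes 3 consecutive failures, which is roughly 10–15 minutes on the 5-minute featured tier and longer on slower tiers (see Check Cadence & Tiers below).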
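The four states above can be read as a small state machine. The sketch below is one possible encoding, not our production code: the class name `ModelStatus` and its fields are hypothetical, and it assumes that a failure while recovering sends the model straight back to down, which the rules above do not spell out.

```python
from dataclasses import dataclass

DOWN_AFTER = 3     # consecutive failures that confirm an outage
HEALTHY_AFTER = 2  # consecutive successes needed to leave "recovering"


@dataclass
class ModelStatus:
    state: str = "healthy"
    failures: int = 0   # current run of consecutive failures
    successes: int = 0  # current run of consecutive successes after an outage

    def record(self, ok: bool) -> str:
        """Apply one check result and return the new state."""
        if ok:
            self.failures = 0
            if self.state in ("down", "recovering"):
                self.successes += 1
                done = self.successes >= HEALTHY_AFTER
                self.state = "healthy" if done else "recovering"
            else:
                self.state = "healthy"
        else:
            self.successes = 0
            self.failures += 1
            if self.state in ("down", "recovering"):
                # Assumption: a failure during recovery returns the model to "down".
                self.state = "down"
            else:
                confirmed = self.failures >= DOWN_AFTER
                self.state = "down" if confirmed else "degraded"
        if self.state == "healthy":
            self.successes = 0
        return self.state
```

Feeding it three failures followed by two successes walks through degraded, degraded, down, recovering, healthy.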

Check Cadence & Tiers

Each model is assigned to one of four check-cadence tiers. The tier determines how often we send a health-check request. Longer intervals lower our monitoring spend but increase the delay before an outage is detected.

featured

5 minutes · detection ≈ 10–15 min. Flagship cross-comparable models and latency-sensitive specialty-hardware providers (e.g., fast inference services whose speed is the product).

fast

15 minutes · detection ≈ 45 min. High-interest popular models.

medium

30 minutes · detection ≈ 90 min. Standard cross-provider comparable models (same base model hosted at multiple gateways).

slow

2 hours · detection ≈ 6 hours. Long-tail models and budget-tier providers. The cost/visibility tradeoff is explicit here.

The detection figures above assume the default 3-consecutive-failure threshold. For providers whose value proposition is speed (specialty-hardware gateways), we can lower that threshold so a flagship on the featured tier is detected within about 10 minutes. Budget-tier providers keep the default threshold; their 6-hour detection is part of why they cost less to monitor.
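The detection figures follow from simple arithmetic on interval and threshold. The sketch below is an illustration under stated assumptions: `detection_window_min` is a hypothetical helper, and it ignores request timeouts and scheduling jitter. The best case assumes the outage starts just before a check; the worst case assumes it starts just after a successful one.

```python
# Check interval per tier, in minutes (values from the tier list above).
TIER_INTERVAL_MIN = {"featured": 5, "fast": 15, "medium": 30, "slow": 120}


def detection_window_min(interval_min: int, threshold: int = 3):
    """Return (best, worst) minutes from outage onset to the confirming check."""
    best = interval_min * (threshold - 1)   # first failing check lands immediately
    worst = interval_min * threshold        # a full interval passes before it
    return best, worst


if __name__ == "__main__":
    for tier, interval in TIER_INTERVAL_MIN.items():
        print(tier, detection_window_min(interval))
```

With the default threshold the featured tier comes out at 10 to 15 minutes and the slow tier at up to 360 minutes (6 hours); lowering the threshold to 2 on the featured tier brings the worst case to 10 minutes.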

Provider Categories

Provider gateways are grouped into four categories so the dashboard can filter and contextualize results. The category is a property of the gateway, not the model — the same Llama checkpoint can appear under gateways of different categories at different latencies and price points.

Mainstream — Broad-catalog aggregators hosting most common open-weight models at standard pricing and latency.
Budget — Low-cost gateways with wider catalogs but generally higher latency; monitored on longer cadences to keep our own bill down.
Specialty Hardware — Gateways whose differentiator is speed (custom inference silicon or optimized runtimes). Flagship models here run on faster cadences so their speed claim is verifiable.
Web-Grounded — Gateways that return cited, live-web-sourced answers alongside generated text (e.g., search-style LLM APIs).

Gateways added before this taxonomy was introduced are shown without a category.

Error Classification

When a check fails, we classify the error to distinguish provider outages from other issues:

Credential Failure (401/403) — Our API key is invalid or expired. This is our problem, not a provider outage. These checks are quarantined from status calculations.
Rate Limited (429) — We've been throttled. Not a provider outage.
Provider Error (5xx) — Server-side issue at the provider.
Timeout — Request didn't complete within 30 seconds.
Network Error — Connection couldn't be established.
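The categories above can be expressed as a small classifier. This is a sketch with hypothetical names (`classify_check`, `counts_toward_status`); the "ok" and "other" results are added for completeness and are not part of the list. Only credential failures are quarantined here, since that is the only exclusion stated above.

```python
from typing import Optional

# Categories excluded from status calculations (only this one is stated in the text).
QUARANTINED = {"credential_failure"}


def classify_check(http_status: Optional[int],
                   timed_out: bool = False,
                   connected: bool = True) -> str:
    """Map one check outcome to an error category."""
    if not connected:
        return "network_error"        # connection could not be established
    if timed_out:
        return "timeout"              # no complete response within the 30 s limit
    if http_status in (401, 403):
        return "credential_failure"   # our API key, not the provider
    if http_status == 429:
        return "rate_limited"         # throttled, not an outage
    if http_status is not None and 500 <= http_status <= 599:
        return "provider_error"       # server-side issue at the provider
    if http_status is not None and 200 <= http_status <= 299:
        return "ok"
    return "other"


def counts_toward_status(category: str) -> bool:
    """Quarantined categories are ignored when computing a model's status."""
    return category not in QUARANTINED
```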

Network Baseline

We run baseline pings to Cloudflare (1.1.1.1), Google DNS, and Quad9 DNS alongside model checks. If the baseline pings degrade at the same time as model checks, the problem is most likely on our own network, not at the provider. This helps us distinguish "our internet is slow" from "the provider is down."
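One way to apply the baseline comparison is sketched below. The function name `attribute_failure` and the majority rule are assumptions made for illustration; the text above does not define how many baselines must degrade before a failure is attributed to our side. The addresses for Google DNS (8.8.8.8) and Quad9 (9.9.9.9) are their well-known public resolvers.

```python
from typing import Mapping

BASELINE_HOSTS = ("1.1.1.1", "8.8.8.8", "9.9.9.9")  # Cloudflare, Google DNS, Quad9


def attribute_failure(baseline_ok: Mapping[str, bool]) -> str:
    """Attribute failed model checks using baseline pings from the same window."""
    if not baseline_ok:
        return "unknown"              # no baseline data to compare against
    failed = sum(1 for ok in baseline_ok.values() if not ok)
    # Assumption: a majority of baselines failing means our own network is at fault.
    if failed * 2 > len(baseline_ok):
        return "local_network"
    return "provider"
```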

Data Pipeline

Every check result is written to disk immediately as a JSONL record with a SHA-256 checksum. A directory watcher tails these files and pushes records to MongoDB within seconds. Completed files are uploaded to S3 for durable backup. The raw data on disk is the authoritative source of truth — MongoDB can be rebuilt from it at any time.
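A checksummed JSONL record could look like the sketch below. The field name `sha256`, the helper names `seal` and `verify`, and the choice to hash a canonical (sorted-key, compact) JSON encoding of the record are all assumptions for illustration; the actual record layout is not described above.

```python
import hashlib
import json


def _canonical(record: dict) -> str:
    """Stable JSON encoding so the same record always hashes the same way."""
    return json.dumps(record, sort_keys=True, separators=(",", ":"))


def seal(record: dict) -> str:
    """Serialize one check result to a single JSONL line with a checksum field."""
    digest = hashlib.sha256(_canonical(record).encode("utf-8")).hexdigest()
    return _canonical({**record, "sha256": digest})


def verify(line: str) -> bool:
    """Recompute the checksum of a JSONL line and compare it to the stored one."""
    data = json.loads(line)
    claimed = data.pop("sha256", None)
    actual = hashlib.sha256(_canonical(data).encode("utf-8")).hexdigest()
    return claimed == actual
```

A rebuild job can run `verify` on every line before loading it into the database, so a corrupted or truncated record is caught instead of silently imported.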

AI100 Index

The AI100 is a composite score representing the health of the AI ecosystem. Each monitored model contributes a score based on its status (healthy = 100%, recovering = 75%, degraded = 50%, down = 0%), and the index is the average of those scores across all monitored models over a rolling time window.

The score is available at six time windows: 1 minute, 5 minutes, 1 hour, 1 day, 1 week, and 1 month. The 5-minute window is the default "hero" score displayed on the homepage.
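The status weighting described above reduces to a plain average. The sketch below assumes the index is the unweighted mean of per-observation scores inside one window; how observations are selected for each window is not specified above, and `ai100` is a hypothetical function name.

```python
from typing import Iterable

# Status weights from the AI100 definition, on a 0-100 scale.
WEIGHTS = {"healthy": 100.0, "recovering": 75.0, "degraded": 50.0, "down": 0.0}


def ai100(statuses: Iterable[str]) -> float:
    """Average the status scores of all observations in one time window."""
    scores = [WEIGHTS[s] for s in statuses]
    if not scores:
        raise ValueError("no observations in window")
    return sum(scores) / len(scores)
```

For example, eight healthy models, one degraded, and one down give (800 + 50 + 0) / 10 = 85.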

Independence

PowerOut.ai is not affiliated with any AI provider. We pay for our own API access and receive no compensation or preferential treatment from any provider. Our monitoring infrastructure runs on independent servers with independent network connectivity.

Open Data

Aggregate methodology and scores are published openly. Raw per-check telemetry data is retained as our proprietary dataset — this historical data is the foundation of the platform and cannot be replicated retroactively.