LLM Providers

Ghost’s AI agent needs a language model to understand traffic, generate tests, find security issues, and write bug reports. Think of the provider as Ghost’s “brain” — you choose which AI service powers it, provide your API key, and Ghost handles all the complexity of communicating with the model, managing conversation history, and staying within token limits.

Ghost supports three providers. You can switch between them at any time through Settings → AI, and the change takes effect immediately without restarting Ghost.


Anthropic

Anthropic makes Claude, Ghost’s default AI model. This is the recommended provider — Ghost’s prompt engineering is optimized for Claude, and Anthropic’s prompt caching feature significantly reduces costs for long agent sessions.

  • Default model: Claude Sonnet 4.6 (claude-sonnet-4-6)
  • Available models: Claude Opus 4.6 (claude-opus-4-6), Claude Sonnet 4.6 (claude-sonnet-4-6), Claude Sonnet 4.5 (claude-sonnet-4-5-20250514), Claude Haiku 4.5 (claude-haiku-4-5-20250514)
  • Context window: 200,000 tokens (~150,000 words of conversation history the model can “see” at once)
  • API key required: Yes — get one from console.anthropic.com (keys start with sk-ant-...)
  • Prompt caching: Yes — system prompt and tool definitions are cached across turns

Opus 4.6 is the most capable model for complex multi-step analysis, security testing, and code generation. Sonnet 4.6 is the default — it provides an excellent balance of capability and cost for most QA and traffic analysis tasks. Haiku 4.5 is the fastest and cheapest option for simple queries.

[llm]
provider = "anthropic"
api_key = "sk-ant-..."
model = "claude-sonnet-4-6" # optional — defaults to Claude Sonnet 4.6

Prompt caching explained: Every time the agent sends a message to Claude, it includes a large system prompt (describing Ghost’s capabilities, the current session, security findings, etc.) and a list of all available tools (up to 60+ tool definitions). Without caching, Anthropic would charge full price for processing these unchanged blocks on every single turn. With caching, Ghost marks the system prompt and the last tool definition with CacheControl: ephemeral, telling Anthropic “you’ve seen this before — use the cached version.” This provides approximately a 90% discount on cached input tokens, which adds up significantly during long agent sessions with many tool calls.
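The two cache markers can be sketched as follows. This is an illustrative Python sketch of the request payload shape, not Ghost’s actual code (Ghost is not written in Python, and the function name and message shapes here are hypothetical); the cache_control field itself matches Anthropic’s Messages API.

```python
def build_cached_request(system_prompt, tools, messages):
    """Mark the stable, repeated blocks of a request as cacheable."""
    system = [{
        "type": "text",
        "text": system_prompt,
        # cache the (large, unchanging) system prompt across turns
        "cache_control": {"type": "ephemeral"},
    }]
    tools = [dict(t) for t in tools]
    if tools:
        # marking only the final tool definition caches the whole
        # tool list as a prefix, so one marker covers 60+ tools
        tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {"system": system, "tools": tools, "messages": messages}
```

Because caching covers everything up to a marked block, marking only the last tool definition is enough to cache the entire tool list.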

Message merging: The Anthropic API requires strict alternation between user and assistant messages — two consecutive user messages cause a 400 error. Ghost’s Anthropic provider handles this automatically:

  • Consecutive user messages are merged into a single message with multiple text blocks
  • Consecutive tool result messages are merged into a single user message (tool results are user-role in the Anthropic API)
  • Critically, tool result messages absorb any immediately following user messages into the same Anthropic user message. Without this, a tool result followed by a user message (common when present_options pauses the agent and the user responds) would create two consecutive user-role messages, causing a "tool_use blocks must have matching tool_result blocks" error

This merging is Anthropic-specific — the OpenAI provider does not need it because OpenAI uses a separate tool role for tool results.
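The merging rules above can be sketched in a few lines. This is a Python illustration under assumed message shapes (role plus a list of content blocks); since tool results are user-role in the Anthropic API, one rule covers all three cases.

```python
def merge_for_anthropic(messages):
    """Fold consecutive user-role messages together so the API always
    sees strict user/assistant alternation. Covers user+user,
    tool_result+tool_result, and tool_result followed by a plain
    user reply (e.g. after present_options pauses the agent)."""
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == "user" and msg["role"] == "user":
            merged[-1]["blocks"].extend(msg["blocks"])
        else:
            merged.append({"role": msg["role"], "blocks": list(msg["blocks"])})
    return merged
```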

OpenAI

OpenAI makes GPT and o-series reasoning models. Ghost uses the official OpenAI SDK and supports all chat completion models.

  • Default model: GPT-4o (gpt-4o)
  • Available models: GPT-5.4 (gpt-5.4), o3 (o3), o3 Pro (o3-pro), o4-mini (o4-mini), GPT-4.1 (gpt-4.1), GPT-4.1 Mini (gpt-4.1-mini), GPT-4.1 Nano (gpt-4.1-nano), GPT-4o (gpt-4o), GPT-4o Mini (gpt-4o-mini)
  • Context window: 128,000 tokens (GPT-4o, GPT-4.1); up to 1,000,000 tokens (GPT-4.1 with extended context)
  • API key required: Yes — get one from platform.openai.com (keys start with sk-...)
  • Prompt caching: No — the OpenAI provider does not use any caching mechanism

GPT-5.4 is OpenAI’s latest flagship model. o3 and o3 Pro are reasoning models that excel at complex multi-step analysis. GPT-4.1 offers improved instruction following with up to 1M token context. o4-mini is a fast, cost-efficient reasoning model.

[llm]
provider = "openai"
api_key = "sk-..."
model = "gpt-4o" # optional — defaults to GPT-4o

Ollama

Ollama lets you run open-source AI models entirely on your own machine — no API key needed, no data sent to the cloud. This is ideal for air-gapped environments, privacy-sensitive testing, or when you want to experiment without API costs. The trade-off is that local models are typically less capable than cloud models, and they require significant GPU resources for good performance.

  • Default model: Llama 3.2 (llama3.2)
  • Available models: Llama 4 (llama4), Llama 3.3 70B (llama3.3), Llama 3.2 (llama3.2), Qwen 3 (qwen3), Qwen 2.5 (qwen2.5), DeepSeek R1 (deepseek-r1), DeepSeek Coder V2 (deepseek-coder-v2), Mistral (mistral), Mixtral 8x22B (mixtral), Phi-4 (phi4), Gemma 2 (gemma2)
  • Context window: 32,000 tokens (default for unknown models)
  • API key required: No — Ollama runs locally with no authentication
  • Default endpoint: http://localhost:11434

Llama 4 and Qwen 3 are the latest open-source flagships with strong reasoning capabilities. DeepSeek R1 excels at step-by-step reasoning tasks. DeepSeek Coder V2 is optimized for code generation and analysis. Mixtral 8x22B offers the best cost-to-performance ratio for general tasks. Any model available in Ollama’s library works — these are just the pre-configured options in the dropdown.

[llm]
provider = "ollama"
ollama_endpoint = "http://localhost:11434"
model = "llama3.2" # optional — defaults to Llama 3.2

How it works internally: Ghost connects to Ollama using the OpenAI-compatible /v1 endpoint that Ollama provides. It literally reuses the OpenAI SDK with a different base URL and a dummy API key ("ollama") since Ollama ignores authentication. This means any Ollama model that supports the OpenAI chat completion format will work with Ghost.
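The equivalent client configuration can be sketched in Python (the function name here is hypothetical; the /v1 path and the ignored API key are Ollama’s documented OpenAI-compatibility behavior):

```python
def ollama_openai_config(endpoint="http://localhost:11434"):
    """Build OpenAI-client settings that point at a local Ollama."""
    return {
        # Ollama exposes an OpenAI-compatible API under /v1
        "base_url": endpoint.rstrip("/") + "/v1",
        # dummy value: Ollama ignores authentication entirely
        "api_key": "ollama",
    }
```

With the official openai Python SDK (assuming it is installed), this would be used as `OpenAI(**ollama_openai_config())`.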

Extended retry logic: Because Ollama is a local service that might be starting up, loading a model into GPU memory, or recovering from an error, the Ollama provider retries on 7 additional error patterns beyond the standard rate limit handling:

  • connection refused: the Ollama daemon hasn’t started yet
  • connection reset: the connection was interrupted
  • 500: internal server error
  • 503: service temporarily unavailable
  • EOF: the connection closed unexpectedly
  • no such host: DNS resolution failed
  • model is loading: Ollama is loading the model into memory (can take seconds to minutes)

All of these are retried with a 2-second wait between attempts — giving Ollama time to finish starting up or loading the model.
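A minimal sketch of this pattern check, assuming a simple substring match over the error text (the names below are illustrative, not Ghost’s identifiers):

```python
OLLAMA_RETRY_PATTERNS = (
    "connection refused", "connection reset", "500", "503",
    "EOF", "no such host", "model is loading",
)

def is_retryable(err_text):
    """True when the error matches one of the extended Ollama
    patterns that warrant a retry after a short wait."""
    return any(p in err_text for p in OLLAMA_RETRY_PATTERNS)
```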


Rate Limit Handling

All three providers automatically retry when the AI service says “slow down” (HTTP 429 Too Many Requests). This is handled transparently — the agent pauses, waits, and tries again without any user intervention.

  • Max retries: 3 (up to 4 total attempts including the original request)
  • Default wait: 2 seconds between retries
  • Retry-after parsing: extracts the exact delay from error messages matching "try again in Xs" (e.g., "try again in 3.5s" → wait 3.5 seconds)

Rate limit detection: Ghost checks if the error message contains either "429" or "rate_limit". If found, it first tries to extract a specific wait duration from the message using a regex pattern. If no specific duration is found, it falls back to the 2-second default wait.
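This detect-then-parse logic can be sketched as follows (an illustrative Python version; the function name and exact regex are assumptions based on the "try again in Xs" format described above):

```python
import re

def parse_retry_delay(err_text, default=2.0):
    """Return None when the error is not a rate limit; otherwise the
    delay parsed from "try again in Xs", or the 2-second default."""
    if "429" not in err_text and "rate_limit" not in err_text:
        return None
    m = re.search(r"try again in (\d+(?:\.\d+)?)s", err_text)
    return float(m.group(1)) if m else default
```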

Streaming retry safety: When the agent is streaming a response (live text appearing in the chat), retries only happen if no data has been sent to the UI yet. If the stream started successfully and then fails mid-way through, Ghost reports the error instead of retrying — because retrying a partially-delivered response would create confusing duplicate or inconsistent output.


Context Window Management

Every AI model has a limit on how much text it can process at once — this is called the “context window,” measured in tokens (roughly 1 token per 4 characters of English text). Ghost actively manages the conversation history to stay within this limit, automatically pruning older messages when the conversation gets too long.

Ghost determines the context window based on the model name using substring matching (first match wins):

  • Contains claude-3, claude-4, claude-opus, claude-sonnet, or claude-haiku: 200,000 tokens (e.g., claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5)
  • Contains gpt-4o or gpt-4-turbo: 128,000 tokens (e.g., gpt-4o, gpt-4o-mini, gpt-4-turbo)
  • Contains gpt-4 but none of the above: 8,192 tokens (e.g., base gpt-4)
  • Contains gpt-3.5: 16,385 tokens (e.g., gpt-3.5-turbo)
  • No pattern match: falls back by provider (Anthropic: 200,000; OpenAI: 128,000; Ollama or unknown: 32,000), e.g., gpt-5.4, o3, o4-mini, llama4, qwen3, deepseek-r1
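The lookup can be sketched as a first-match-wins scan over the model name (an illustrative Python version of the table above; the function name is an assumption):

```python
def context_window(model, provider):
    """Substring matching over the model name; first match wins,
    then a per-provider fallback for unrecognized models."""
    rules = [
        (("claude-3", "claude-4", "claude-opus",
          "claude-sonnet", "claude-haiku"), 200_000),
        (("gpt-4o", "gpt-4-turbo"), 128_000),
        (("gpt-3.5",), 16_385),
        # plain gpt-4 is checked after the more specific gpt-4o/turbo rules
        (("gpt-4",), 8_192),
    ]
    for patterns, window in rules:
        if any(p in model for p in patterns):
            return window
    return {"anthropic": 200_000, "openai": 128_000}.get(provider, 32_000)
```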

Ghost uses a fast approximation instead of running each provider’s actual tokenizer (which would be slow and require provider-specific libraries):

tokens ≈ (byte length of text ÷ 4) + 1

This approximation deliberately errs on the high side for safety: English text averages about 4 characters per token, and counting bytes rather than characters inflates the estimate for any multi-byte content. Code and JSON (which are common in Ghost’s agent conversations) pack more tokens per character than prose, so the margin is thinner there. The +1 ensures even empty-seeming messages get counted.

Per-message overhead: Each message adds 4 tokens of overhead (for the role label, message delimiters, etc.), and each tool call adds 10 tokens of overhead (for the tool call structure, name, etc.).
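Putting the approximation and the overheads together (an illustrative Python sketch; the message shape with "text" and "tool_calls" keys is an assumption, not Ghost’s actual structure):

```python
def estimate_tokens(messages):
    """Fast approximation: bytes/4 + 1 per message text, plus a flat
    4-token per-message overhead and 10 tokens per tool call."""
    total = 0
    for msg in messages:
        total += 4  # role label, delimiters, etc.
        total += len(msg.get("text", "").encode("utf-8")) // 4 + 1
        total += 10 * len(msg.get("tool_calls", []))
    return total
```

For example, with a 200,000-token window and 8,192 tokens reserved for output, history is pruned once this estimate exceeds 191,808.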

Ghost reserves 8,192 tokens (the defaultMaxTokens constant) from the context window for the model’s response. This means if a model has a 200,000 token context window, Ghost uses up to 191,808 tokens for conversation history and reserves 8,192 for the model to generate its reply. This high output reservation is intentional — tool-heavy autonomous agent sessions often produce long responses with multiple tool calls.

When the conversation exceeds the available token budget, Ghost progressively prunes older messages while preserving the most important context:

What is always preserved:

  1. The system prompt (Ghost’s instructions to the AI — always first)
  2. The original user message (the task the user asked the agent to perform)
  3. The most recent exchanges (where the agent is actively working)

What gets pruned: Middle exchanges — the older back-and-forth between agent actions and tool results. An “exchange” is defined as one assistant message with tool calls followed by all the tool result messages until the next assistant message.

Progressive reduction: Ghost tries keeping 10 exchanges, then 9, then 8, all the way down to 1, checking after each reduction whether the conversation fits within the token budget. It uses the first count that fits. The absolute minimum is the system prompt + original user message + 1 exchange.

Mechanical summary: When exchanges are pruned, Ghost generates a brief summary of what was removed — without calling the AI model (which would cost tokens and time). The summary lists which tools were called and how many times, plus up to 2 key findings extracted from assistant text (first 200 characters each). This summary is inserted as a user message so the agent has some context about what happened earlier.

Anthropic alternation fix: The Anthropic API requires strict alternation between user and assistant messages — two assistant messages in a row cause an error. After pruning, Ghost runs a fixAlternation pass that inserts [Continuing from previous step.] spacer messages between consecutive assistant messages.
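A minimal sketch of that pass (illustrative Python; the message shape is assumed, and the spacer text is taken verbatim from the description above):

```python
SPACER = {"role": "user", "text": "[Continuing from previous step.]"}

def fix_alternation(messages):
    """Insert a spacer user message between consecutive assistant
    messages so the Anthropic API sees strict alternation."""
    fixed = []
    for msg in messages:
        if fixed and fixed[-1]["role"] == "assistant" and msg["role"] == "assistant":
            fixed.append(dict(SPACER))
        fixed.append(msg)
    return fixed
```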


Switching Providers at Runtime

The agent’s LLM provider can be changed at runtime through the Settings UI. When you change the provider, API key, or model:

  1. Reconfigure() is called on the agent manager
  2. If the provider is cloud-based (Anthropic or OpenAI) and the API key is empty, the agent is disabled entirely — any API call to the agent returns HTTP 503 “Service Unavailable”
  3. If credentials are valid, a new provider instance is created
  4. If an agent already exists, its provider is hot-swapped — the new provider is used for the next LLM call, but any in-flight request continues with the old provider until it completes
  5. If no agent existed yet (first-time configuration), a new agent is created

This means you can switch from Anthropic to OpenAI mid-conversation, and the next agent action will use the new provider. The conversation history is preserved — only the AI “brain” changes.

Before committing a provider change, you can test the connection:

POST /api/v1/settings/llm/validate

This sends a minimal test prompt to the configured provider with a 15-second timeout. The response indicates success or a specific error:

  • Invalid API key
  • Model not found
  • Connection refused (Ollama not running)
  • Rate limited
  • Network error

The validation always returns HTTP 200 with a JSON body containing valid: true/false and an error message — it never returns a 4xx/5xx status code itself.


Agent Constants

These are the core constants that govern the agent’s behavior, regardless of which provider is selected:

  • maxIterations (25): maximum number of plan-execute-reflect cycles the agent can run before being forced to stop; prevents runaway agent loops that could consume unlimited API tokens
  • defaultMaxTokens (8,192): maximum tokens the model can generate in a single response; set high to accommodate tool-heavy responses where the agent calls multiple tools in one turn
  • Steer channel capacity (5): how many user steering messages (“focus on X”, “stop doing Y”) can be queued; if the queue is full, new steering messages are dropped with a warning log
  • Stream channel buffer (64): per-provider buffer for streaming SSE events; all three providers use the same buffer size
  • Agent output channel buffer (128): buffer for the agent’s main output stream that feeds SSE events to the frontend

API Key Security

API keys are encrypted at rest using AES-256-GCM with a machine-specific derived key. They are never sent to the frontend — the Settings API returns a masked version showing only the last 4 characters (e.g., ****ab12). Keys with 4 or fewer characters show only ****.
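The masking rule can be sketched in a few lines (illustrative Python; the function name is an assumption):

```python
def mask_key(key):
    """Mask an API key for display: last 4 characters only, or a
    bare **** when the key is 4 characters or fewer."""
    if len(key) <= 4:
        return "****"
    return "****" + key[-4:]
```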

Settings export explicitly clears API keys, and settings import explicitly ignores them. See Config File Reference for the full encryption mechanism.