LLM Providers

Ghost’s AI agent needs a language model to understand traffic, generate tests, find security issues, and write bug reports. Think of the provider as Ghost’s “brain” — you choose which AI service powers it, provide your API key, and Ghost handles all the complexity of communicating with the model, managing conversation history, and staying within token limits.

Ghost supports three providers. You can switch between them at any time through Settings → AI, and the change takes effect immediately without restarting Ghost.


Anthropic

Anthropic makes Claude, Ghost’s default AI model. This is the recommended provider — Ghost’s prompt engineering is optimized for Claude, and Anthropic’s prompt caching feature significantly reduces costs for long agent sessions.

  • Default model: Claude Sonnet 4.6 (claude-sonnet-4-6)
  • Available models: Claude Opus 4.6 (claude-opus-4-6), Claude Sonnet 4.6 (claude-sonnet-4-6), Claude Sonnet 4.5 (claude-sonnet-4-5-20250514), Claude Haiku 4.5 (claude-haiku-4-5-20250514)
  • Context window: 200,000 tokens (~150,000 words of conversation history the model can “see” at once)
  • API key required: Yes — get one from console.anthropic.com (keys start with sk-ant-...)
  • Prompt caching: Yes — system prompt and tool definitions are cached across turns

Opus 4.6 is the most capable model for complex multi-step analysis, security testing, and code generation. Sonnet 4.6 is the default — it provides an excellent balance of capability and cost for most QA and traffic analysis tasks. Haiku 4.5 is the fastest and cheapest option for simple queries.

[llm]
provider = "anthropic"
api_key = "sk-ant-..."
model = "claude-sonnet-4-6" # optional — defaults to Claude Sonnet 4.6

Prompt caching explained: Every time the agent sends a message to Claude, it includes a large system prompt (describing Ghost’s capabilities, the current session, security findings, etc.) and a list of all available tools (up to 60+ tool definitions). Without caching, Anthropic would charge full price for processing these unchanged blocks on every single turn. With caching, Ghost marks the system prompt and the last tool definition with CacheControl: ephemeral, telling Anthropic “you’ve seen this before — use the cached version.” This provides approximately a 90% discount on cached input tokens, which adds up significantly during long agent sessions with many tool calls.
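The two cache markers can be sketched as follows. This is an illustrative Python sketch of the request payload shape, not Ghost’s actual code (Ghost is not written in Python, and the function name and message shapes here are hypothetical); the cache_control field itself matches Anthropic’s Messages API.

```python
def build_cached_request(system_prompt, tools, messages):
    """Mark the stable, repeated blocks of a request as cacheable."""
    system = [{
        "type": "text",
        "text": system_prompt,
        # cache the (large, unchanging) system prompt across turns
        "cache_control": {"type": "ephemeral"},
    }]
    tools = [dict(t) for t in tools]
    if tools:
        # marking only the final tool definition caches the whole
        # tool list as a prefix, so one marker covers 60+ tools
        tools[-1]["cache_control"] = {"type": "ephemeral"}
    return {"system": system, "tools": tools, "messages": messages}
```

Because caching covers everything up to a marked block, marking only the last tool definition is enough to cache the entire tool list.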

Message merging: The Anthropic API requires strict alternation between user and assistant messages — two consecutive user messages cause a 400 error. Ghost’s Anthropic provider handles this automatically:

  • Consecutive user messages are merged into a single message with multiple text blocks
  • Consecutive tool result messages are merged into a single user message (tool results are user-role in the Anthropic API)
  • Critically, tool result messages absorb any immediately following user messages into the same Anthropic user message. Without this, a tool result followed by a user message (common when present_options pauses the agent and the user responds) would create two consecutive user-role messages, causing a "tool_use blocks must have matching tool_result blocks" error

This merging is Anthropic-specific — the OpenAI provider does not need it because OpenAI uses a separate tool role for tool results.
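The merging rules above can be sketched in a few lines. This is a Python illustration under assumed message shapes (role plus a list of content blocks); since tool results are user-role in the Anthropic API, one rule covers all three cases.

```python
def merge_for_anthropic(messages):
    """Fold consecutive user-role messages together so the API always
    sees strict user/assistant alternation. Covers user+user,
    tool_result+tool_result, and tool_result followed by a plain
    user reply (e.g. after present_options pauses the agent)."""
    merged = []
    for msg in messages:
        if merged and merged[-1]["role"] == "user" and msg["role"] == "user":
            merged[-1]["blocks"].extend(msg["blocks"])
        else:
            merged.append({"role": msg["role"], "blocks": list(msg["blocks"])})
    return merged
```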

OpenAI

OpenAI makes GPT and o-series reasoning models. Ghost uses the official OpenAI SDK and supports all chat completion models.

  • Default model: GPT-4o (gpt-4o)
  • Available models: GPT-5.4 (gpt-5.4), o3 (o3), o3 Pro (o3-pro), o4-mini (o4-mini), GPT-4.1 (gpt-4.1), GPT-4.1 Mini (gpt-4.1-mini), GPT-4.1 Nano (gpt-4.1-nano), GPT-4o (gpt-4o), GPT-4o Mini (gpt-4o-mini)
  • Context window: 128,000 tokens (GPT-4o, GPT-4.1); up to 1,000,000 tokens (GPT-4.1 with extended context)
  • API key required: Yes — get one from platform.openai.com (keys start with sk-...)
  • Prompt caching: No — the OpenAI provider does not use any caching mechanism

GPT-5.4 is OpenAI’s latest flagship model. o3 and o3 Pro are reasoning models that excel at complex multi-step analysis. GPT-4.1 offers improved instruction following with up to 1M token context. o4-mini is a fast, cost-efficient reasoning model.

[llm]
provider = "openai"
api_key = "sk-..."
model = "gpt-4o" # optional — defaults to GPT-4o

Ollama

Ollama lets you run open-source AI models entirely on your own machine — no API key needed, no data sent to the cloud. This is ideal for air-gapped environments, privacy-sensitive testing, or when you want to experiment without API costs. The trade-off is that local models are typically less capable than cloud models, and they require significant GPU resources for good performance.

  • Default model: Llama 3.2 (llama3.2)
  • Available models: Llama 4 (llama4), Llama 3.3 70B (llama3.3), Llama 3.2 (llama3.2), Qwen 3 (qwen3), Qwen 2.5 (qwen2.5), DeepSeek R1 (deepseek-r1), DeepSeek Coder V2 (deepseek-coder-v2), Mistral (mistral), Mixtral 8x22B (mixtral), Phi-4 (phi4), Gemma 2 (gemma2)
  • Context window: 32,000 tokens (default for unknown models)
  • API key required: No — Ollama runs locally with no authentication
  • Default endpoint: http://localhost:11434

Llama 4 and Qwen 3 are the latest open-source flagships with strong reasoning capabilities. DeepSeek R1 excels at step-by-step reasoning tasks. DeepSeek Coder V2 is optimized for code generation and analysis. Mixtral 8x22B offers the best cost-to-performance ratio for general tasks. Any model available in Ollama’s library works — these are just the pre-configured options in the dropdown.

[llm]
provider = "ollama"
ollama_endpoint = "http://localhost:11434"
model = "llama3.2" # optional — defaults to Llama 3.2

How it works internally: Ghost connects to Ollama using the OpenAI-compatible /v1 endpoint that Ollama provides. It literally reuses the OpenAI SDK with a different base URL and a dummy API key ("ollama") since Ollama ignores authentication. This means any Ollama model that supports the OpenAI chat completion format will work with Ghost.
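The equivalent client configuration can be sketched in Python (the function name here is hypothetical; the /v1 path and the ignored API key are Ollama’s documented OpenAI-compatibility behavior):

```python
def ollama_openai_config(endpoint="http://localhost:11434"):
    """Build OpenAI-client settings that point at a local Ollama."""
    return {
        # Ollama exposes an OpenAI-compatible API under /v1
        "base_url": endpoint.rstrip("/") + "/v1",
        # dummy value: Ollama ignores authentication entirely
        "api_key": "ollama",
    }
```

With the official openai Python SDK (assuming it is installed), this would be used as `OpenAI(**ollama_openai_config())`.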

Extended retry logic: Because Ollama is a local service that might be starting up, loading a model into GPU memory, or recovering from an error, the Ollama provider retries on 7 additional error patterns beyond the standard rate limit handling:

  • connection refused: the Ollama daemon hasn’t started yet
  • connection reset: the connection was interrupted
  • 500: internal server error
  • 503: service temporarily unavailable
  • EOF: the connection closed unexpectedly
  • no such host: DNS resolution failed
  • model is loading: Ollama is loading the model into memory (can take seconds to minutes)

All of these are retried with a 2-second wait between attempts — giving Ollama time to finish starting up or loading the model.
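A minimal sketch of this pattern check, assuming a simple substring match over the error text (the names below are illustrative, not Ghost’s identifiers):

```python
OLLAMA_RETRY_PATTERNS = (
    "connection refused", "connection reset", "500", "503",
    "EOF", "no such host", "model is loading",
)

def is_retryable(err_text):
    """True when the error matches one of the extended Ollama
    patterns that warrant a retry after a short wait."""
    return any(p in err_text for p in OLLAMA_RETRY_PATTERNS)
```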


Rate Limit Handling

All three providers automatically retry when the AI service says “slow down” (HTTP 429 Too Many Requests). This is handled transparently — the agent pauses, waits, and tries again without any user intervention.

  • Max retries: 3 (up to 4 total attempts including the original request)
  • Default wait: 2 seconds between retries
  • Retry-after parsing: extracts the exact delay from error messages matching "try again in Xs" (e.g., "try again in 3.5s" → wait 3.5 seconds)

Rate limit detection: Ghost checks if the error message contains either "429" or "rate_limit". If found, it first tries to extract a specific wait duration from the message using a regex pattern. If no specific duration is found, it falls back to the 2-second default wait.
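This detect-then-parse logic can be sketched as follows (an illustrative Python version; the function name and exact regex are assumptions based on the "try again in Xs" format described above):

```python
import re

def parse_retry_delay(err_text, default=2.0):
    """Return None when the error is not a rate limit; otherwise the
    delay parsed from "try again in Xs", or the 2-second default."""
    if "429" not in err_text and "rate_limit" not in err_text:
        return None
    m = re.search(r"try again in (\d+(?:\.\d+)?)s", err_text)
    return float(m.group(1)) if m else default
```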

Streaming retry safety: When the agent is streaming a response (live text appearing in the chat), retries only happen if no data has been sent to the UI yet. If the stream started successfully and then fails mid-way through, Ghost reports the error instead of retrying — because retrying a partially-delivered response would create confusing duplicate or inconsistent output.


Context Window Management

Every AI model has a limit on how much text it can process at once — this is called the “context window,” measured in tokens (roughly 1 token per 4 characters of English text). Ghost actively manages the conversation history to stay within this limit, automatically pruning older messages when the conversation gets too long.

Ghost determines the context window based on the model name using substring matching (first match wins):

  • Contains claude-3, claude-4, claude-opus, claude-sonnet, or claude-haiku: 200,000 tokens (e.g., claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5)
  • Contains gpt-4o or gpt-4-turbo: 128,000 tokens (e.g., gpt-4o, gpt-4o-mini, gpt-4-turbo)
  • Contains gpt-4 but none of the above: 8,192 tokens (e.g., base gpt-4)
  • Contains gpt-3.5: 16,385 tokens (e.g., gpt-3.5-turbo)
  • No pattern match: falls back by provider (Anthropic: 200,000; OpenAI: 128,000; Ollama or unknown: 32,000), e.g., gpt-5.4, o3, o4-mini, llama4, qwen3, deepseek-r1
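The lookup can be sketched as a first-match-wins scan over the model name (an illustrative Python version of the table above; the function name is an assumption):

```python
def context_window(model, provider):
    """Substring matching over the model name; first match wins,
    then a per-provider fallback for unrecognized models."""
    rules = [
        (("claude-3", "claude-4", "claude-opus",
          "claude-sonnet", "claude-haiku"), 200_000),
        (("gpt-4o", "gpt-4-turbo"), 128_000),
        (("gpt-3.5",), 16_385),
        # plain gpt-4 is checked after the more specific gpt-4o/turbo rules
        (("gpt-4",), 8_192),
    ]
    for patterns, window in rules:
        if any(p in model for p in patterns):
            return window
    return {"anthropic": 200_000, "openai": 128_000}.get(provider, 32_000)
```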

Ghost uses a fast approximation instead of running each provider’s actual tokenizer (which would be slow and require provider-specific libraries):

tokens ≈ (byte length of text ÷ 4) + 1

This approximation deliberately errs on the high side for safety: English text averages about 4 characters per token, and counting bytes rather than characters inflates the estimate for any multi-byte content. Code and JSON (which are common in Ghost’s agent conversations) pack more tokens per character than prose, so the margin is thinner there. The +1 ensures even empty-seeming messages get counted.

Per-message overhead: Each message adds 4 tokens of overhead (for the role label, message delimiters, etc.), and each tool call adds 10 tokens of overhead (for the tool call structure, name, etc.).
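Putting the approximation and the overheads together (an illustrative Python sketch; the message shape with "text" and "tool_calls" keys is an assumption, not Ghost’s actual structure):

```python
def estimate_tokens(messages):
    """Fast approximation: bytes/4 + 1 per message text, plus a flat
    4-token per-message overhead and 10 tokens per tool call."""
    total = 0
    for msg in messages:
        total += 4  # role label, delimiters, etc.
        total += len(msg.get("text", "").encode("utf-8")) // 4 + 1
        total += 10 * len(msg.get("tool_calls", []))
    return total
```

For example, with a 200,000-token window and 8,192 tokens reserved for output, history is pruned once this estimate exceeds 191,808.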

Ghost reserves 8,192 tokens (the defaultMaxTokens constant) from the context window for the model’s response. This means if a model has a 200,000 token context window, Ghost uses up to 191,808 tokens for conversation history and reserves 8,192 for the model to generate its reply. This high output reservation is intentional — tool-heavy autonomous agent sessions often produce long responses with multiple tool calls.

When the conversation exceeds the available token budget, Ghost progressively prunes older messages while preserving the most important context:

What is always preserved:

  1. The system prompt (Ghost’s instructions to the AI — always first)
  2. The original user message (the task the user asked the agent to perform)
  3. The most recent exchanges (where the agent is actively working)

What gets pruned: Middle exchanges — the older back-and-forth between agent actions and tool results. An “exchange” is defined as one assistant message with tool calls followed by all the tool result messages until the next assistant message.

Progressive reduction: Ghost tries keeping 10 exchanges, then 9, then 8, all the way down to 1, checking after each reduction whether the conversation fits within the token budget. It uses the first count that fits. The absolute minimum is the system prompt + original user message + 1 exchange.

Mechanical summary: When exchanges are pruned, Ghost generates a brief summary of what was removed — without calling the AI model (which would cost tokens and time). The summary lists which tools were called and how many times, plus up to 2 key findings extracted from assistant text (first 200 characters each). This summary is inserted as a user message so the agent has some context about what happened earlier.

Anthropic alternation fix: The Anthropic API requires strict alternation between user and assistant messages — two assistant messages in a row cause an error. After pruning, Ghost runs a fixAlternation pass that inserts [Continuing from previous step.] spacer messages between consecutive assistant messages.
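A minimal sketch of that pass (illustrative Python; the message shape is assumed, and the spacer text is taken verbatim from the description above):

```python
SPACER = {"role": "user", "text": "[Continuing from previous step.]"}

def fix_alternation(messages):
    """Insert a spacer user message between consecutive assistant
    messages so the Anthropic API sees strict alternation."""
    fixed = []
    for msg in messages:
        if fixed and fixed[-1]["role"] == "assistant" and msg["role"] == "assistant":
            fixed.append(dict(SPACER))
        fixed.append(msg)
    return fixed
```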


Switching Providers at Runtime

The agent’s LLM provider can be changed at runtime through the Settings UI. When you change the provider, API key, or model:

  1. Reconfigure() is called on the agent manager
  2. If the provider is cloud-based (Anthropic or OpenAI) and the API key is empty, the agent is disabled entirely — any API call to the agent returns HTTP 503 “Service Unavailable”
  3. If credentials are valid, a new provider instance is created
  4. If an agent already exists, its provider is hot-swapped — the new provider is used for the next LLM call, but any in-flight request continues with the old provider until it completes
  5. If no agent existed yet (first-time configuration), a new agent is created

This means you can switch from Anthropic to OpenAI mid-conversation, and the next agent action will use the new provider. The conversation history is preserved — only the AI “brain” changes.

Before committing a provider change, you can test the connection:

POST /api/v1/settings/llm/validate

This sends a minimal test prompt to the configured provider with a 15-second timeout. The response indicates success or a specific error:

  • Invalid API key
  • Model not found
  • Connection refused (Ollama not running)
  • Rate limited
  • Network error

The validation always returns HTTP 200 with a JSON body containing valid: true/false and an error message — it never returns a 4xx/5xx status code itself.


Agent Constants

These are the core constants that govern the agent’s behavior, regardless of which provider is selected:

  • maxIterations (25): maximum number of plan-execute-reflect cycles the agent can run before being forced to stop; prevents runaway agent loops that could consume unlimited API tokens
  • defaultMaxTokens (8,192): maximum tokens the model can generate in a single response; set high to accommodate tool-heavy responses where the agent calls multiple tools in one turn
  • Steer channel capacity (5): how many user steering messages (“focus on X”, “stop doing Y”) can be queued; if the queue is full, new steering messages are dropped with a warning log
  • Stream channel buffer (64): per-provider buffer for streaming SSE events; all three providers use the same buffer size
  • Agent output channel buffer (128): buffer for the agent’s main output stream that feeds SSE events to the frontend

API Key Security

API keys are encrypted at rest using AES-256-GCM with a machine-specific derived key. They are never sent to the frontend — the Settings API returns a masked version showing only the last 4 characters (e.g., ****ab12). Keys with 4 or fewer characters show only ****.
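The masking rule can be sketched in a few lines (illustrative Python; the function name is an assumption):

```python
def mask_key(key):
    """Mask an API key for display: last 4 characters only, or a
    bare **** when the key is 4 characters or fewer."""
    if len(key) <= 4:
        return "****"
    return "****" + key[-4:]
```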

Settings export explicitly clears API keys, and settings import explicitly ignores them. See Config File Reference for the full encryption mechanism.