LLM Providers
Ghost’s AI agent needs a language model to understand traffic, generate tests, find security issues, and write bug reports. Think of the provider as Ghost’s “brain” — you choose which AI service powers it, provide your API key, and Ghost handles all the complexity of communicating with the model, managing conversation history, and staying within token limits.
Ghost supports three providers. You can switch between them at any time through Settings → AI, and the change takes effect immediately without restarting Ghost.
Providers
Anthropic (Default)
Anthropic makes Claude, Ghost’s default AI model. This is the recommended provider — Ghost’s prompt engineering is optimized for Claude, and Anthropic’s prompt caching feature significantly reduces costs for long agent sessions.
| Setting | Value |
|---|---|
| Default model | Claude Sonnet 4.6 (claude-sonnet-4-6) |
| Available models | Claude Opus 4.6 (claude-opus-4-6), Claude Sonnet 4.6 (claude-sonnet-4-6), Claude Sonnet 4.5 (claude-sonnet-4-5-20250514), Claude Haiku 4.5 (claude-haiku-4-5-20250514) |
| Context window | 200,000 tokens (~150,000 words of conversation history the model can “see” at once) |
| API key required | Yes — get one from console.anthropic.com (starts with sk-ant-...) |
| Prompt caching | Yes — system prompt and tool definitions are cached across turns |
Opus 4.6 is the most capable model for complex multi-step analysis, security testing, and code generation. Sonnet 4.6 is the default — it provides an excellent balance of capability and cost for most QA and traffic analysis tasks. Haiku 4.5 is the fastest and cheapest option for simple queries.
```toml
[llm]
provider = "anthropic"
api_key = "sk-ant-..."
model = "claude-sonnet-4-6" # optional — defaults to Claude Sonnet 4.6
```

Prompt caching explained: Every time the agent sends a message to Claude, it includes a large system prompt (describing Ghost’s capabilities, the current session, security findings, etc.) and a list of all available tools (up to 60+ tool definitions). Without caching, Anthropic would charge full price for processing these unchanged blocks on every single turn. With caching, Ghost marks the system prompt and the last tool definition with `CacheControl: ephemeral`, telling Anthropic “you’ve seen this before — use the cached version.” This provides approximately a 90% discount on cached input tokens, which adds up significantly during long agent sessions with many tool calls.
Message merging: The Anthropic API requires strict alternation between user and assistant messages — two consecutive user messages cause a 400 error. Ghost’s Anthropic provider handles this automatically:
- Consecutive user messages are merged into a single message with multiple text blocks
- Consecutive tool result messages are merged into a single user message (tool results are user-role in the Anthropic API)
- Critically, tool result messages absorb any immediately following user messages into the same Anthropic user message. Without this, a tool result followed by a user message (common when `present_options` pauses the agent and the user responds) would create two consecutive user-role messages, causing a `"tool_use blocks must have matching tool_result blocks"` error
This merging is Anthropic-specific — the OpenAI provider does not need it because OpenAI uses a separate tool role for tool results.
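The first merging rule can be sketched as a single pass over the message list. `Msg` and `mergeConsecutiveUser` are illustrative names under assumed simplified types, not Ghost’s actual code:

```go
package main

import "fmt"

// Msg is a simplified chat message; each user message can carry
// multiple text blocks after merging.
type Msg struct {
	Role string
	Text []string
}

// mergeConsecutiveUser collapses runs of user-role messages into one
// message with multiple text blocks, preserving the strict
// user/assistant alternation the Anthropic API requires.
func mergeConsecutiveUser(in []Msg) []Msg {
	var out []Msg
	for _, m := range in {
		if len(out) > 0 && m.Role == "user" && out[len(out)-1].Role == "user" {
			last := &out[len(out)-1]
			last.Text = append(last.Text, m.Text...)
			continue
		}
		out = append(out, m)
	}
	return out
}

func main() {
	msgs := []Msg{
		{Role: "user", Text: []string{"tool result"}},
		{Role: "user", Text: []string{"focus on the login flow"}},
		{Role: "assistant", Text: []string{"ok"}},
	}
	// Two consecutive user messages collapse into one.
	fmt.Println(len(mergeConsecutiveUser(msgs)))
}
```

Tool-result absorption works the same way, since tool results are user-role in the Anthropic API: the follow-up user text is appended as another block of the same merged user message.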
OpenAI
OpenAI makes GPT and o-series reasoning models. Ghost uses the official OpenAI SDK and supports all chat completion models.
| Setting | Value |
|---|---|
| Default model | GPT-4o (gpt-4o) |
| Available models | GPT-5.4 (gpt-5.4), o3 (o3), o3 Pro (o3-pro), o4-mini (o4-mini), GPT-4.1 (gpt-4.1), GPT-4.1 Mini (gpt-4.1-mini), GPT-4.1 Nano (gpt-4.1-nano), GPT-4o (gpt-4o), GPT-4o Mini (gpt-4o-mini) |
| Context window | 128,000 tokens (GPT-4o, GPT-4.1), up to 1,000,000 tokens (GPT-4.1 with extended context) |
| API key required | Yes — get one from platform.openai.com (starts with sk-...) |
| Prompt caching | No — OpenAI provider does not use any caching mechanism |
GPT-5.4 is OpenAI’s latest flagship model. o3 and o3 Pro are reasoning models that excel at complex multi-step analysis. GPT-4.1 offers improved instruction following with up to 1M token context. o4-mini is a fast, cost-efficient reasoning model.
```toml
[llm]
provider = "openai"
api_key = "sk-..."
model = "gpt-4o" # optional — defaults to GPT-4o
```

Ollama (Local)
Ollama lets you run open-source AI models entirely on your own machine — no API key needed, no data sent to the cloud. This is ideal for air-gapped environments, privacy-sensitive testing, or when you want to experiment without API costs. The trade-off is that local models are typically less capable than cloud models, and require significant GPU resources for good performance.
| Setting | Value |
|---|---|
| Default model | Llama 3.2 (llama3.2) |
| Available models | Llama 4 (llama4), Llama 3.3 70B (llama3.3), Llama 3.2 (llama3.2), Qwen 3 (qwen3), Qwen 2.5 (qwen2.5), DeepSeek R1 (deepseek-r1), DeepSeek Coder V2 (deepseek-coder-v2), Mistral (mistral), Mixtral 8x22B (mixtral), Phi-4 (phi4), Gemma 2 (gemma2) |
| Context window | 32,000 tokens (default for unknown models) |
| API key required | No — Ollama runs locally with no authentication |
| Default endpoint | http://localhost:11434 |
Llama 4 and Qwen 3 are the latest open-source flagships with strong reasoning capabilities. DeepSeek R1 excels at step-by-step reasoning tasks. DeepSeek Coder V2 is optimized for code generation and analysis. Mixtral 8x22B offers the best cost-to-performance ratio for general tasks. Any model available in Ollama’s library works — these are just the pre-configured options in the dropdown.
```toml
[llm]
provider = "ollama"
ollama_endpoint = "http://localhost:11434"
model = "llama3.2" # optional — defaults to Llama 3.2
```

How it works internally: Ghost connects to Ollama through the OpenAI-compatible /v1 endpoint that Ollama provides. It reuses the OpenAI SDK with a different base URL and a dummy API key ("ollama"), since Ollama ignores authentication. This means any Ollama model that supports the OpenAI chat completion format will work with Ghost.
Extended retry logic: Because Ollama is a local service that might be starting up, loading a model into GPU memory, or recovering from an error, the Ollama provider retries on 7 additional error patterns beyond the standard rate limit handling:
| Error Pattern | What it means |
|---|---|
| `connection refused` | Ollama daemon hasn’t started yet |
| `connection reset` | Connection was interrupted |
| `500` | Internal server error |
| `503` | Service temporarily unavailable |
| `EOF` | Connection closed unexpectedly |
| `no such host` | DNS resolution failed |
| `model is loading` | Ollama is loading the model into memory (can take seconds to minutes) |
All of these are retried with a 2-second wait between attempts — giving Ollama time to finish starting up or loading the model.
Rate Limit Handling
All three providers automatically retry when the AI service says “slow down” (HTTP 429 Too Many Requests). This is handled transparently — the agent pauses, waits, and tries again without any user intervention.
| Setting | Value |
|---|---|
| Max retries | 3 (up to 4 total attempts including the original request) |
| Default wait | 2 seconds between retries |
| Retry-after parsing | Extracts the exact delay from error messages matching "try again in Xs" (e.g., "try again in 3.5s" → wait 3.5 seconds) |
Rate limit detection: Ghost checks if the error message contains either "429" or "rate_limit". If found, it first tries to extract a specific wait duration from the message using a regex pattern. If no specific duration is found, it falls back to the 2-second default wait.
Streaming retry safety: When the agent is streaming a response (live text appearing in the chat), retries only happen if no data has been sent to the UI yet. If the stream started successfully and then fails mid-way through, Ghost reports the error instead of retrying — because retrying a partially-delivered response would create confusing duplicate or inconsistent output.
Context Window Management
Every AI model has a limit on how much text it can process at once — this is called the “context window,” measured in tokens (roughly 1 token per 4 characters of English text). Ghost actively manages the conversation history to stay within this limit, automatically pruning older messages when the conversation gets too long.
Context Window Sizes
Ghost determines the context window based on the model name using substring matching (first match wins):
| Model Pattern | Context Window | Examples |
|---|---|---|
| Contains `claude-3`, `claude-4`, `claude-opus`, `claude-sonnet`, or `claude-haiku` | 200,000 tokens | claude-opus-4-6, claude-sonnet-4-6, claude-haiku-4-5 |
| Contains `gpt-4o` or `gpt-4-turbo` | 128,000 tokens | gpt-4o, gpt-4o-mini, gpt-4-turbo |
| Contains `gpt-4` (but not the above) | 8,192 tokens | gpt-4 (base) |
| Contains `gpt-3.5` | 16,385 tokens | gpt-3.5-turbo |
| No pattern match — falls back by provider | Anthropic: 200,000 / OpenAI: 128,000 / Ollama or unknown: 32,000 | gpt-5.4, o3, o4-mini, llama4, qwen3, deepseek-r1 |
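The lookup in the table reduces to an ordered substring match with a per-provider fallback. A minimal sketch (function name is illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// contextWindow returns the context window for a model name using
// ordered substring matching (first match wins), falling back by
// provider when no pattern matches.
func contextWindow(model, provider string) int {
	m := strings.ToLower(model)
	switch {
	case strings.Contains(m, "claude-3"), strings.Contains(m, "claude-4"),
		strings.Contains(m, "claude-opus"), strings.Contains(m, "claude-sonnet"),
		strings.Contains(m, "claude-haiku"):
		return 200_000
	case strings.Contains(m, "gpt-4o"), strings.Contains(m, "gpt-4-turbo"):
		return 128_000
	case strings.Contains(m, "gpt-4"): // must come after gpt-4o / gpt-4-turbo
		return 8_192
	case strings.Contains(m, "gpt-3.5"):
		return 16_385
	}
	switch provider {
	case "anthropic":
		return 200_000
	case "openai":
		return 128_000
	default: // ollama or unknown
		return 32_000
	}
}

func main() {
	fmt.Println(contextWindow("gpt-4o-mini", "openai"), contextWindow("llama3.2", "ollama"))
}
```

Note the ordering: the bare `gpt-4` case must be checked after `gpt-4o` and `gpt-4-turbo`, or every GPT-4-family model would be clamped to 8,192 tokens.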
Token Estimation
Ghost uses a fast approximation instead of running each provider’s actual tokenizer (which would be slow and require provider-specific libraries):
```
tokens ≈ (byte length of text ÷ 4) + 1
```

This is a deliberate over-estimate for safety — English text averages about 4 characters per token, but code and JSON (which are common in Ghost’s agent conversations) tend to have more tokens per character. The +1 ensures even empty-seeming messages get counted.
Per-message overhead: Each message adds 4 tokens of overhead (for the role label, message delimiters, etc.), and each tool call adds 10 tokens of overhead (for the tool call structure, name, etc.).
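Combining the formula and the per-message overheads gives a two-line estimator. Function names here are illustrative:

```go
package main

import "fmt"

// estimateTokens approximates token count as bytes/4 + 1, a deliberate
// over-estimate for safety.
func estimateTokens(s string) int { return len(s)/4 + 1 }

// estimateMessage adds the per-message overhead (4 tokens) plus
// 10 tokens for each tool call attached to the message.
func estimateMessage(text string, toolCalls int) int {
	return estimateTokens(text) + 4 + 10*toolCalls
}

func main() {
	// Even an empty message counts at least 1 token of text.
	fmt.Println(estimateTokens(""), estimateMessage("hello world", 2))
}
```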
Output Token Reservation
Section titled “Output Token Reservation”Ghost reserves 8,192 tokens (the defaultMaxTokens constant) from the context window for the model’s response. This means if a model has a 200,000 token context window, Ghost uses up to 191,808 tokens for conversation history and reserves 8,192 for the model to generate its reply. This high output reservation is intentional — tool-heavy autonomous agent sessions often produce long responses with multiple tool calls.
Pruning Strategy
Section titled “Pruning Strategy”When the conversation exceeds the available token budget, Ghost progressively prunes older messages while preserving the most important context:
What is always preserved:
- The system prompt (Ghost’s instructions to the AI — always first)
- The original user message (the task the user asked the agent to perform)
- The most recent exchanges (where the agent is actively working)
What gets pruned: Middle exchanges — the older back-and-forth between agent actions and tool results. An “exchange” is defined as one assistant message with tool calls followed by all the tool result messages until the next assistant message.
Progressive reduction: Ghost tries keeping 10 exchanges, then 9, then 8, all the way down to 1, checking after each reduction whether the conversation fits within the token budget. It uses the first count that fits. The absolute minimum is the system prompt + original user message + 1 exchange.
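The progressive reduction is just a countdown loop. A minimal sketch, with `fits` standing in for the token-budget check (both names are assumptions, not Ghost’s actual code):

```go
package main

import "fmt"

// keepCount tries keeping 10 exchanges, then 9, ... down to 1, and
// returns the first count for which the pruned conversation fits the
// token budget.
func keepCount(fits func(nExchanges int) bool) int {
	for n := 10; n > 1; n-- {
		if fits(n) {
			return n
		}
	}
	// Absolute minimum: system prompt + original task + 1 exchange.
	return 1
}

func main() {
	// Pretend only 4 or fewer recent exchanges fit the budget.
	fmt.Println(keepCount(func(n int) bool { return n <= 4 }))
}
```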
Mechanical summary: When exchanges are pruned, Ghost generates a brief summary of what was removed — without calling the AI model (which would cost tokens and time). The summary lists which tools were called and how many times, plus up to 2 key findings extracted from assistant text (first 200 characters each). This summary is inserted as a user message so the agent has some context about what happened earlier.
Anthropic alternation fix: The Anthropic API requires strict alternation between user and assistant messages — two assistant messages in a row cause an error. After pruning, Ghost runs a fixAlternation pass that inserts [Continuing from previous step.] spacer messages between consecutive assistant messages.
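The spacer insertion can be sketched in a few lines; `Msg` is a simplified stand-in for Ghost’s real message type:

```go
package main

import "fmt"

// Msg is a simplified chat message.
type Msg struct {
	Role, Text string
}

// fixAlternation inserts a spacer user message between consecutive
// assistant messages so the Anthropic API's strict alternation
// requirement still holds after pruning.
func fixAlternation(msgs []Msg) []Msg {
	var out []Msg
	for _, m := range msgs {
		if len(out) > 0 && m.Role == "assistant" && out[len(out)-1].Role == "assistant" {
			out = append(out, Msg{Role: "user", Text: "[Continuing from previous step.]"})
		}
		out = append(out, m)
	}
	return out
}

func main() {
	fixed := fixAlternation([]Msg{
		{Role: "assistant", Text: "step 1"},
		{Role: "assistant", Text: "step 2"},
	})
	fmt.Println(len(fixed), fixed[1].Text)
}
```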
Hot Swap
Section titled “Hot Swap”The agent’s LLM provider can be changed at runtime through the Settings UI. When you change the provider, API key, or model:
- `Reconfigure()` is called on the agent manager
- If the provider is cloud-based (Anthropic or OpenAI) and the API key is empty, the agent is disabled entirely — any API call to the agent returns HTTP 503 “Service Unavailable”
- If credentials are valid, a new provider instance is created
- If an agent already exists, its provider is hot-swapped — the new provider is used for the next LLM call, but any in-flight request continues with the old provider until it completes
- If no agent existed yet (first-time configuration), a new agent is created
This means you can switch from Anthropic to OpenAI mid-conversation, and the next agent action will use the new provider. The conversation history is preserved — only the AI “brain” changes.
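One way to get this behavior is to guard the provider reference with a mutex: an in-flight call has already taken its copy of the old provider and keeps it, while the next call picks up the new one. A sketch under assumed simplified types (`Provider`, `Agent`, and the method names are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// Provider is a minimal stand-in for an LLM provider interface.
type Provider interface{ Name() string }

type named string

func (n named) Name() string { return string(n) }

// Agent holds the current provider behind a mutex so it can be
// hot-swapped at runtime.
type Agent struct {
	mu       sync.Mutex
	provider Provider
}

// SetProvider swaps in a new provider for subsequent LLM calls.
func (a *Agent) SetProvider(p Provider) {
	a.mu.Lock()
	defer a.mu.Unlock()
	a.provider = p
}

// currentProvider returns the provider to use for the next call.
func (a *Agent) currentProvider() Provider {
	a.mu.Lock()
	defer a.mu.Unlock()
	return a.provider
}

func main() {
	a := &Agent{provider: named("anthropic")}
	inFlight := a.currentProvider() // captured before the swap
	a.SetProvider(named("openai"))
	// The in-flight request keeps the old provider; the next call gets the new one.
	fmt.Println(inFlight.Name(), a.currentProvider().Name())
}
```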
Validation
Section titled “Validation”Before committing a provider change, you can test the connection:
```
POST /api/v1/settings/llm/validate
```

This sends a minimal test prompt to the configured provider with a 15-second timeout. The response indicates success or a specific error:
- Invalid API key
- Model not found
- Connection refused (Ollama not running)
- Rate limited
- Network error
The validation always returns HTTP 200 with a JSON body containing valid: true/false and an error message — it never returns a 4xx/5xx status code itself.
Agent Constants
Section titled “Agent Constants”These are the core constants that govern the agent’s behavior, regardless of which provider is selected:
| Constant | Value | What it controls |
|---|---|---|
| `maxIterations` | 25 | Maximum number of plan-execute-reflect cycles the agent can run before being forced to stop. This prevents runaway agent loops that could consume unlimited API tokens. |
| `defaultMaxTokens` | 8,192 | Maximum tokens the model can generate in a single response. Set high to accommodate tool-heavy responses where the agent calls multiple tools in one turn. |
| Steer channel capacity | 5 | How many user steering messages (“focus on X”, “stop doing Y”) can be queued. If the queue is full, new steering messages are dropped with a warning log. |
| Stream channel buffer | 64 | Per-provider buffer for streaming SSE events. All three providers use the same buffer size. |
| Agent output channel buffer | 128 | Buffer for the agent’s main output stream that feeds SSE events to the frontend. |
API Key Security
Section titled “API Key Security”API keys are encrypted at rest using AES-256-GCM with a machine-specific derived key. They are never sent to the frontend — the Settings API returns a masked version showing only the last 4 characters (e.g., ****ab12). Keys with 4 or fewer characters show only ****.
Settings export explicitly clears API keys, and settings import explicitly ignores them. See Config File Reference for the full encryption mechanism.