Agent System

Ghost’s AI agent is like a senior engineer you can pair with — it doesn’t just answer questions, it makes plans, executes them step by step, reflects on what it found, and decides when it’s done. You give it a goal (“find security vulnerabilities in this API” or “generate test cases for the checkout flow”), and it autonomously plans its approach, uses the right tools in the right order, adapts when it discovers something unexpected, and produces a structured report.

Under the hood, the agent runs a plan-execute-reflect-terminate loop — up to 25 iterations where it plans what to do, executes tools, reflects on results, and checks whether it should stop. It has access to ~60 tools (depending on mode and what’s connected), but each LLM call only sees 15-25 relevant tools — a dynamic router filters tools based on what the agent is currently doing.
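The control flow described above can be sketched as a simple loop. This is illustrative scaffolding, not Ghost's actual code: the names `runAgent` and `llmResponse` are invented for the example, and most subsystems are reduced to comments.

```go
package main

import "fmt"

const maxIterations = 25

// llmResponse is a simplified stand-in for a streamed LLM reply.
type llmResponse struct {
	text      string
	toolCalls []string
}

// runAgent drives the plan-execute-reflect-terminate loop. The real
// loop also evaluates termination signals, reads steering messages,
// routes tools, and prunes context; those steps are comments here.
func runAgent(llm func(iter int) llmResponse) (iterations int, reason string) {
	for i := 0; i < maxIterations; i++ {
		iterations = i + 1
		// 1. evaluate termination signals  2. drain steering messages
		// 3. assemble context              4. inject reflection prompts
		// 5. route tools for this step
		resp := llm(i) // 6. streaming LLM call
		// 7. handle the response
		if len(resp.toolCalls) == 0 {
			return iterations, "text-only response"
		}
		// ...execute tools, update engagement state...
		// 8. prune context if near the token budget
	}
	return iterations, "iteration budget exhausted"
}

func main() {
	// A toy LLM closure stands in for the provider: it requests a tool
	// twice, then answers with text, ending the run on iteration three.
	n, reason := runAgent(func(i int) llmResponse {
		if i < 2 {
			return llmResponse{toolCalls: []string{"search_traffic"}}
		}
		return llmResponse{text: "done"}
	})
	fmt.Println(n, reason)
}
```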

What this diagram shows — how the agent’s components work together:

The Agent Core is the brain — it runs a loop up to 25 times, checking termination conditions each iteration, reading any steering messages the user sent mid-run, and pausing for interactive choices when the agent needs user input via present_options. Each iteration, it calls the LLM Provider (Claude, GPT-4o, or Ollama) via streaming, which returns text and/or tool calls. The Tool Router filters the full registry (~60 tools) down to 15-25 relevant tools based on what the agent is currently doing — so the LLM isn’t overwhelmed with irrelevant options. Tool calls are executed by the Concurrent Executor, which runs read-only tools in parallel (up to 4 at once) and mutating tools one at a time. After execution, the Context Manager estimates token usage and prunes old messages if the conversation is approaching the LLM’s context window limit — preserving the system prompt, the user’s original question, and recent exchanges.

Each iteration (maximum 25) follows 8 steps:

Step 1 — Check termination signals

Six signals are evaluated in priority order (see Termination Signals below). If any signal fires, the agent either stops immediately or enters a “report then stop” mode where it gets 1-2 more iterations to write a summary.

Step 2 — Read steering messages

The agent reads from a buffered channel (capacity 5) without blocking. If the user sent a steering message via POST /api/v1/agent/steer, it’s injected as a [USER STEERING] block into the conversation — appearing before the agent’s next LLM call. This lets you redirect the agent mid-run: “focus on the authentication endpoints” or “skip performance testing and go to reporting.”
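A non-blocking read from a buffered channel is an idiomatic Go pattern; a minimal sketch (the names here are hypothetical, not Ghost's actual identifiers):

```go
package main

import "fmt"

// drainSteering reads any pending steering messages without blocking.
// The buffered channel (capacity 5) mirrors the description above.
func drainSteering(ch chan string) []string {
	var msgs []string
	for {
		select {
		case m := <-ch:
			msgs = append(msgs, "[USER STEERING] "+m)
		default:
			return msgs // nothing pending: don't block the agent loop
		}
	}
}

func main() {
	steer := make(chan string, 5)
	steer <- "focus on the authentication endpoints"
	fmt.Println(drainSteering(steer))
}
```

The `default` case is what makes the read non-blocking: if the channel is empty, the loop continues immediately instead of waiting for the user.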

Step 3 — Assemble context

The full message history is assembled: system prompt + conversation history + engagement state (an XML block showing the current plan progress, discovered endpoints, findings so far, and tool call counts). This engagement state acts as “Layer 2 memory” — it survives message pruning and gives the LLM awareness of overall progress even when older messages have been removed.

Step 4 — Inject reflection prompts

Two types of reflection prompts can be injected:

  • Step reflection — after a complete_step tool call, an XML block asks the agent to assess: What did we learn? Was the evidence sufficient? Any unexpected discoveries? What should the next step focus on?
  • Final reflection — when all plan steps are done, a longer prompt asks: Were all goals addressed? Quality of findings? What areas were missed? Confidence level? Recommendations for follow-up?

These reflections force the agent to pause and think critically rather than blindly executing the next step.

Step 5 — Route tools

The tool router filters the full registry down to the tools relevant for the current plan step’s category. This is crucial for LLM performance — giving the LLM 60 tools at once leads to confusion and poor tool selection. Giving it 15-25 relevant tools produces much better results.

Step 6 — Call the LLM

The provider’s StreamChat method is called with the message history, filtered tools, and MaxTokens: 8192. The response streams back as events — text chunks and tool call definitions arrive incrementally. SSE events are emitted to the frontend in real time so the user sees the agent “thinking.”

Step 7 — Handle the response

There are two paths, depending on whether the LLM made tool calls:

Text-only response (the LLM just wrote text without calling any tools):

  • If the plan is completed → stop (the agent is done)
  • If reportThenStop was set by a termination signal → stop
  • If no plan exists yet → inject a “planning nudge” asking the agent to create a plan (maximum 2 nudges before giving up)
  • Otherwise → inject a “continuation nudge” asking the agent to proceed, or stop if it shouldn’t continue

Tool call response (the LLM wants to use tools):

  • Execute the tools (parallel or sequential — see Concurrent Tool Execution)
  • Compress tool output to save context tokens
  • Update engagement state (endpoints discovered, findings, tool call counts, phase transitions)
  • Reset the consecutive text-only counter

Step 8 — Prune context

After each iteration, the context manager estimates the total token count and prunes old messages if approaching the provider’s context window limit. This prevents the conversation from exceeding the LLM’s maximum input size.

The router solves a critical problem: the agent has ~60 tools available, but sending all of them to the LLM every call wastes tokens and confuses the model. Instead, the router uses a 4-layer filtering strategy to select 15-25 tools per call.

What this diagram shows — how tool filtering works:

The full registry of ~60 tools is never sent to the LLM all at once (except during initial planning). Instead, four layers contribute tools that are merged, deduplicated, and sorted alphabetically. Layer 1 provides 16 base tools that every task needs (plan management including think and present_options, traffic search, filesystem, journey export). Layer 2 adds tools specific to the current phase — if the agent is doing reconnaissance, it gets session/journey listing tools; if it’s doing active testing, it gets fuzzing and scanning tools. Layer 3 adds any tools the plan step explicitly requested. Layer 4 adds conditional tools that depend on what’s connected — browser tools only appear when the browser extension is active, inspector tools when a mobile device is connected, Frida tools when Frida is available. The alphabetical sort is deliberate — it produces deterministic tool ordering that maximizes Anthropic’s prompt cache hits.
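The merge-dedupe-sort step can be sketched in a few lines of Go. The layer contents below are tiny illustrative samples, and `routeTools` is a hypothetical name:

```go
package main

import (
	"fmt"
	"sort"
)

// routeTools merges the four layers, deduplicates, and sorts
// alphabetically so the tool ordering is deterministic (which is
// what maximizes prompt cache hits).
func routeTools(layers ...[]string) []string {
	seen := map[string]bool{}
	var out []string
	for _, layer := range layers {
		for _, t := range layer {
			if !seen[t] {
				seen[t] = true
				out = append(out, t)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	base := []string{"create_plan", "think", "search_traffic"} // Layer 1
	phase := []string{"fuzz_endpoint", "send_http_request"}    // Layer 2
	step := []string{"fuzz_endpoint"}                          // Layer 3 (deduplicated)
	conditional := []string{"browser_click"}                   // Layer 4
	fmt.Println(routeTools(base, phase, step, conditional))
}
```

Sorting after deduplication means the same tool set always serializes identically across iterations, so cached prompt prefixes stay valid.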

Layer 1 — Base Tools (16, always available)

These tools are included in every LLM call because they’re fundamental to how the agent operates:

| Category | Tools | Purpose |
| --- | --- | --- |
| Plan management | create_plan, revise_plan, complete_step, think, present_options | Creating and progressing through the execution plan, private reasoning, and presenting interactive choices to the user |
| Traffic analysis | search_traffic, get_flow, get_flow_body, find_endpoints, get_traffic_stats | Searching and reading captured HTTP traffic |
| Flow annotation | tag_flows, annotate_flow | Marking flows with tags and notes for organization |
| Journey | record_journey, journey_export | Start/stop journey recording and export journeys as deterministic test code (always available, not phase-dependent) |
| Filesystem | fs_read, fs_write, fs_list | Reading and writing files (test output, reports, configs) |

Layer 2 — Phase Tools (category-specific)

Different plan step categories unlock different tool sets. The category is set when the agent creates its plan — each step has a Category field that maps to one of these groups:

| Step Category | Tools Added | Count |
| --- | --- | --- |
| recon | list_sessions, list_journeys, get_journey_steps, list_ws_messages, list_console_errors, list_navigations, list_storage_changes, replay_request, get_page_resources | 9 |
| analysis | detect_anomalies, detect_schema_drift, detect_regression, suggest_edge_cases, analyze_api_coverage, analyze_sequences, list_findings, map_page_apis, analyze_auth_pattern, analyze_jwt, find_sensitive_data, detect_idor, security_headers_audit | 13 |
| active_test | send_http_request, replay_request, fuzz_endpoint, compare_environments, test_form, request_approval, replay_journey, attack_request, list_wordlists, plus 10 external scanner tools (run_nuclei, run_dalfox, run_ffuf, run_sqlmap, run_katana, run_trufflehog, run_semgrep, run_nmap, run_ssl_scan, run_hydra) | 19 (incl. scanners) |
| exploit | send_http_request, replay_request, request_approval, attack_request, list_wordlists, run_sqlmap | 6 |
| report | generate_test, generate_bug_report, export_as, generate_api_docs, generate_mock_server, generate_session_report, visual_regression, list_findings | 8 |

Layer 3 — Step-Requested Tools

When the agent creates a plan, each step can include a Tools field listing specific tools. These are always included for that step, even if they don’t match the step’s category. This lets the agent plan ahead: “In step 3, I’ll need the fuzz_endpoint tool even though this is an analysis step.”

Layer 4 — Conditional Tools

These tools only appear when their corresponding capability is available. The router checks the tool registry — if the tools are registered, they’re included:

| Capability | Condition | Tools Added |
| --- | --- | --- |
| Browser extension | Extension WebSocket connected, browser tools registered | Up to 17 browser tools defined (6 actually registered: browser_read_page, browser_query_all, browser_click, browser_fill, browser_screenshot, browser_inject) |
| Mobile inspector | Device connected, inspector tools registered | 7 tools: get_device_screen, get_element_tree, find_elements, get_element_selectors, correlate_element_traffic, tap_device, type_device |
| Frida | Frida available, Frida tools registered | 6 tools: frida_attach, frida_detach, frida_trace, frida_inject, frida_list_methods, frida_check_bypass |

When no plan exists yet (the agent hasn’t created one), the router returns the full registry — all ~60 tools. This gives the agent complete awareness of its capabilities when planning. Once a plan is created, subsequent calls get the filtered 15-25 tool set.

Ghost maintains two separate tool registries — one for QA mode and one for Security mode. Each registers a different combination of tools:

The QA registry includes: core tools (13), plan tools (5), QA generation tools (7: generate_test, generate_bug_report, detect_regression, export_as, generate_api_docs, generate_mock_server, generate_test_scenarios), QA advanced analysis tools (11: fuzz_endpoint, compare_environments, test_form, detect_anomalies, detect_schema_drift, suggest_edge_cases, analyze_api_coverage, analyze_sequences, map_page_apis, visual_regression, generate_session_report), QA external tools (2: run_k6, run_hey), inspector tools (7), proxy_inject_script (1), TestRail tools (4: testrail_list_projects, testrail_get_cases, testrail_push_results, testrail_suggest_cases), and browser tools (6, conditional).

The Security registry includes: core tools (13), plan tools (5), security tools (6: list_findings, send_http_request, get_page_resources, attack_request, list_wordlists, request_approval), Frida tools (6, conditional), external scanner tools (10: run_nuclei, run_dalfox, run_ffuf, run_sqlmap, run_katana, run_trufflehog, run_semgrep, run_nmap, run_ssl_scan, run_hydra), inspector tools (7), proxy_inject_script (1), and browser tools (6, conditional).

Different tools have different timeout durations based on their expected execution time:

| Tool Pattern | Timeout | Why |
| --- | --- | --- |
| Default | 30 seconds | Most tools (search, get, list) complete quickly |
| browser_* | 60 seconds | Browser automation involves page loads and network waits |
| frida_* | 60 seconds | Frida operations involve device communication |
| frida_trace, frida_inject | 3 minutes | Tracing and injection may involve long-running scripts |
| attack_request | 5 minutes | Attacker engine sends many requests sequentially |
| run_* (external scanners) | 5 min + 10s | External tools like Nuclei or SQLMap can run for minutes |

The agent doesn’t just randomly use tools — it follows a structured methodology appropriate to its mode. Phases represent stages of a security assessment or QA test cycle.

PTES (Penetration Testing Execution Standard) is a widely-used framework for security assessments. Ghost’s security agent follows these phases:

What this diagram shows — the security assessment progression:

The agent starts by analyzing existing traffic to understand the application’s API surface. It then moves to passive detection — looking for vulnerabilities in the captured traffic without sending any new requests (missing headers, insecure cookies, information leakage). If the scan mode allows it, the agent progresses to active scanning — actually sending crafted requests to test for SQL injection, XSS, and other vulnerabilities. Confirmed vulnerabilities may lead to exploitation — proving the vulnerability is real with a proof-of-concept. Finally, the agent generates a report summarizing all findings with severity ratings and remediation advice.

| Phase | Name | Tools Available | Auto-Advance After |
| --- | --- | --- | --- |
| 1 | Traffic Analysis | Recon tools | 3 tool calls |
| 2 | Passive Detection | Analysis tools | 4 tool calls |
| 3 | Active Scanning | Active test tools + scanners | 3 tool calls |
| 4 | Exploitation | Exploit tools | 4 tool calls |
| 5 | Reporting | Report tools | — (terminal) |

What this diagram shows — the QA testing progression:

The agent starts with reconnaissance — understanding the application by examining traffic patterns, sessions, and page resources. It then performs functional testing — verifying expected behavior of API endpoints and user flows. Edge case testing explores boundary conditions and unusual inputs. Error handling testing checks how the application responds to invalid requests and error conditions. Performance testing uses tools like k6 and hey to measure response times under load. Finally, QA reporting generates test cases, bug reports, and coverage summaries.

| Phase | Name | Auto-Advance After |
| --- | --- | --- |
| 1 | QA Recon | 3 tool calls |
| 2 | Functional Testing | 3 tool calls |
| 3 | Edge Case Testing | 3 tool calls |
| 4 | Error Handling | 3 tool calls |
| 5 | Performance Testing | 2 tool calls |
| 6 | QA Reporting | — (terminal) |

Phase transitions happen automatically based on PhaseToolCalls — the number of tool calls in the current phase. When the threshold is reached, the agent advances to the next phase. The PhaseToolCalls counter resets to 0 on each transition.

This prevents the agent from spending too long in any single phase. If the agent calls 3 tools during traffic analysis, it automatically advances to passive detection — even if it hasn’t finished analyzing everything. The agent can always use tools from previous phases (they’re still available), but the phase change influences which tools appear first in the filtered set.
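The auto-advance mechanics can be sketched as a counter plus two lookup tables. Security-mode thresholds (from the phase table above) are shown; the function name and table shapes are illustrative:

```go
package main

import "fmt"

// thresholds: tool calls before auto-advancing out of a phase.
var thresholds = map[string]int{
	"traffic_analysis":  3,
	"passive_detection": 4,
	"active_scanning":   3,
	"exploitation":      4,
	// "reporting" is terminal: no entry, never advances.
}

var next = map[string]string{
	"traffic_analysis":  "passive_detection",
	"passive_detection": "active_scanning",
	"active_scanning":   "exploitation",
	"exploitation":      "reporting",
}

// recordToolCall bumps PhaseToolCalls and advances the phase when the
// threshold is reached, resetting the counter on transition.
func recordToolCall(phase string, phaseToolCalls int) (string, int) {
	phaseToolCalls++
	if limit, ok := thresholds[phase]; ok && phaseToolCalls >= limit {
		return next[phase], 0 // transition: counter resets
	}
	return phase, phaseToolCalls
}

func main() {
	phase, calls := "traffic_analysis", 0
	for i := 0; i < 3; i++ {
		phase, calls = recordToolCall(phase, calls)
	}
	fmt.Println(phase, calls) // passive_detection 0
}
```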

The agent maintains a state object that tracks everything discovered during the run:

| Field | Type | Description |
| --- | --- | --- |
| Mode | string | "qa" or "security" — determines which phases and prompt to use |
| Phase | string | Current engagement phase (e.g., "passive_detection", "functional_testing") |
| Endpoints | list | Discovered API endpoints (cap: 200). Deduplicated by a map[string]bool. |
| Findings | list | Vulnerabilities or bugs found (cap: 100). Each has an ID (VULN-001 for security, BUG-001 for QA), type, severity, and summary. |
| ToolsCalled | int | Total number of tool invocations across the entire run |
| PhaseToolCalls | int | Tool calls in the current phase — resets on phase transition, drives auto-advance |
| Iteration | int | Current loop iteration (0-24) |
| TargetHosts | list | Hosts the agent is testing |
| ActiveInjections | list | Proxy injection rules the agent has created (cap: 50) |
| Plan | TaskPlan | The structured execution plan (see below) |
| ReflectionCount | int | How many reflection prompts have been injected |
| FinalReflectionDone | bool | Whether the final reflection has been completed |
| QualityAssessment | string | The agent’s self-assessment of its work quality |
| MissedAreas | list | Areas the agent identified as not yet covered |
| StopAfterCurrentStep | bool | Set to true when the user requests a stop via the API |

This state is serialized to XML and injected into every LLM call, giving the agent persistent awareness of its progress even when older messages are pruned.

After a text-only response, the agent checks whether it should continue looping or stop. The check returns false (stop) when:

  1. Phase is "done" — the agent has completed all phases
  2. Last tool was proxy_inject_script with action "add" — the agent just injected a script and should wait for results
  3. Phase is a reporting phase ("reporting" or "qa_reporting") — the agent is writing its final report
  4. Iteration >= 23 (maxIterations - 2) — approaching the hard cap, time to wrap up

Six signals are evaluated in priority order at the start of each iteration. Higher-priority signals override lower ones:

| Priority | Signal | Condition | Action | Description |
| --- | --- | --- | --- | --- |
| 1 | Plan complete + final reflection | Plan status is "completed" AND FinalReflectionDone is true | stop | The agent finished its plan and reflected on the results — a clean exit. |
| 2 | Plan complete, no tools | Plan status is "completed" AND the last iteration had zero tool calls | stop | The agent completed its plan and the LLM responded with only text (no more tools to call) — done. |
| 3 | Loop detection | Same tool name + same input hash appears 3+ times in the last 10 tool call records | report_then_stop | The agent is stuck in a loop — calling the same tool with the same arguments repeatedly. Input is hashed with FNV-1a (32-bit) for efficient comparison. |
| 4 | Diminishing returns | Iteration >= 8 AND the last 6 tool calls are all the same tool AND the current step is "in_progress" | report_then_stop | The agent isn’t making progress — it keeps using the same tool without advancing the plan. The iteration >= 8 guard prevents false positives during early exploration. |
| 5 | Budget reservation | Iteration >= 22 (maxIterations - 3) | report_then_stop | Running out of iterations — reserve the last 3 for the agent to write a summary and complete its current step. |
| 6 | User stop | StopAfterCurrentStep is true (set via POST /api/v1/agent/stop) | report_then_stop | The user explicitly asked the agent to stop. |

When the action is report_then_stop (signals 3-6), the agent doesn’t stop immediately. Instead:

  1. A termination prompt is injected as an XML block: <termination_notice reason="...">Complete your current step with complete_step and write a final summary.</termination_notice>
  2. The reportThenStop flag is set
  3. The agent gets 1-2 more iterations to finish its current step and write a summary
  4. On the next iteration, if the LLM responds with text only (no tool calls), the agent stops

This ensures the agent always produces useful output — even when interrupted, it summarizes what it found.
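The loop-detection signal (priority 3) hinges on hashing tool inputs with FNV-1a, which Go's standard library provides in hash/fnv. A minimal sketch with hypothetical record shapes:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// toolRecord pairs a tool name with a hash of its input, so records
// are cheap to store and compare.
type toolRecord struct {
	name string
	hash uint32
}

// hashInput uses FNV-1a (32-bit) for fast, non-cryptographic
// comparison of tool arguments.
func hashInput(input string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(input))
	return h.Sum32()
}

// loopDetected reports whether any (tool, input) pair occurs 3+ times
// in the last 10 records.
func loopDetected(records []toolRecord) bool {
	start := 0
	if len(records) > 10 {
		start = len(records) - 10
	}
	counts := map[toolRecord]int{}
	for _, r := range records[start:] {
		counts[r]++
		if counts[r] >= 3 {
			return true
		}
	}
	return false
}

func main() {
	r := toolRecord{"search_traffic", hashInput(`{"q":"login"}`)}
	fmt.Println(loopDetected([]toolRecord{r, r, r}))
}
```

Hashing means the detector never stores full tool inputs, and two calls collide only when both the tool name and the exact argument bytes match.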

The agent can present structured choices to the user mid-run using the present_options tool. This creates an interactive UI element — a card with labeled option buttons and a free-text input — instead of requiring the user to type a response.

How it works:

  1. The LLM calls present_options with a question string and an array of options (each with label, description, and value)
  2. The agent emits a StreamEventOptions event containing the question and options
  3. The agent sets waitingForUserChoice = true and exits the loop cleanly — emitting metrics with termination reason "waiting for user choice"
  4. The frontend renders an OptionsPanel component with a 2-column card grid of options plus a free-text input field
  5. When the user selects an option (or types a custom response), the frontend sends a new chat message containing the selected value
  6. A new agent run starts with the user’s choice in the conversation history, and the agent continues from where it left off

The Run Summary is suppressed when the termination reason is "waiting for user choice" — the agent isn’t actually done, it’s just pausing for input. The options panel shows the question with interactive buttons, and after the user responds, the answered panel fades to 50% opacity with a check icon on the selected option.

| Field | Type | Description |
| --- | --- | --- |
| question | string | The question to present to the user (required) |
| options | array | 2-6 options, each with label (short display text), description (optional context), and value (returned when selected) |

The think tool gives the agent a private scratchpad for reasoning. When the LLM calls think with a thought string, the tool simply returns "ok" — the thought is visible to the LLM in the conversation history but is not displayed as regular text to the user.

In the frontend, the think tool renders as a ThinkIndicator — a purple-bordered card with a brain icon and bouncing dots animation while pending, then “Reasoned about the approach” when complete. It does not use the standard accordion card UI. The think tool is always available (registered as a plan tool).

The agent’s execution plan is a structured object, not free-form text. This gives the system precise control over progress tracking:

| Field | Type | Constraints | Description |
| --- | --- | --- | --- |
| Goal | string | — | High-level objective (e.g., “Find authentication vulnerabilities in the checkout API”) |
| Scope | string | — | Boundaries of the assessment (e.g., “api.example.com, POST endpoints only”) |
| Steps | list | 1-15 steps | Ordered execution steps. Minimum 1 allows focused single-step tasks (like “export journey as Cypress UI”), maximum 15 prevents over-planning. |
| CurrentStep | int | — | Index of the active step |
| Revisions | int | Max 5 | How many times the plan has been revised. Cap prevents infinite replanning. |
| Status | string | active / completed / aborted | Overall plan status |

Each step has:

| Field | Type | Values | Description |
| --- | --- | --- | --- |
| ID | int | — | Step number |
| Description | string | — | What this step will do |
| Category | string | recon, analysis, active_test, exploit, report | Maps to tool router for phase-specific tool selection |
| Tools | list | — | Specific tools this step plans to use (included via Layer 3 routing) |
| Status | string | pending, in_progress, completed, skipped, failed | Step progress |
| Result | string | Max 500 chars | Summary of what the step accomplished |
| Substeps | list | — | Finer-grained breakdown (optional) |

The plan is serialized to XML and injected into every LLM call so the agent always knows where it is:

<current_plan progress="3/7" phase="active_test">
Goal: Find authentication bypass vulnerabilities
Scope: api.example.com checkout endpoints
Steps:
[DONE] Step 1: Analyze traffic patterns (Found 12 endpoints...)
[DONE] Step 2: Check authentication headers (Missing auth on 3...)
[IN PROGRESS] Step 3: Fuzz authentication parameters
[PENDING] Step 4: Test session management
[PENDING] Step 5: Generate proof-of-concept
[PENDING] Step 6: Write security report
</current_plan>

Step results are truncated to 80 characters in the XML to save tokens.

When the LLM requests multiple tool calls in a single response, Ghost can execute them in parallel — but only if they’re safe to parallelize.

26 read-only tools (safe for parallel execution): search_traffic, get_flow, get_flow_body, find_endpoints, get_traffic_stats, list_sessions, list_journeys, list_ws_messages, list_console_errors, list_navigations, list_storage_changes, get_page_resources, detect_anomalies, detect_schema_drift, detect_regression, suggest_edge_cases, analyze_api_coverage, analyze_sequences, list_findings, map_page_apis, fs_read, fs_list, get_device_screen, get_element_tree, find_elements, get_element_selectors

All other tools (like tag_flows, send_http_request, fuzz_endpoint, attack_request) are mutating — they change state or send requests. These always execute sequentially to avoid race conditions.

Parallel execution uses a semaphore channel with capacity 4 — at most 4 read-only tools run simultaneously. This limits resource usage while still providing significant speedup when the LLM requests many searches or listings at once.
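The semaphore-channel pattern looks like this in Go (a sketch with placeholder tool functions, not Ghost's actual executor):

```go
package main

import (
	"fmt"
	"sync"
)

// runReadOnly executes read-only tools with at most 4 in flight,
// using a buffered channel as a counting semaphore.
func runReadOnly(tools []func() string) []string {
	sem := make(chan struct{}, 4) // capacity 4: max concurrency
	results := make([]string, len(tools))
	var wg sync.WaitGroup
	for i, tool := range tools {
		wg.Add(1)
		go func(i int, tool func() string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks when 4 are running)
			defer func() { <-sem }() // release the slot
			results[i] = tool()      // each goroutine writes its own index: no race
		}(i, tool)
	}
	wg.Wait()
	return results
}

func main() {
	tools := []func() string{
		func() string { return "search_traffic: 12 flows" },
		func() string { return "list_sessions: 2 sessions" },
	}
	fmt.Println(runReadOnly(tools))
}
```

Results are written by index rather than appended, so output order matches the LLM's requested order regardless of which tool finishes first.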

Tool results can be large — a traffic search might return thousands of flows, a scanner might produce megabytes of output. Before feeding results back to the LLM, Ghost compresses them to a maximum of 16,000 characters (~4,000 tokens):

| Tool | Compression Strategy |
| --- | --- |
| nuclei | Keep all critical+high findings, cap medium at 20, low at 10 |
| sqlmap | Preserve lines confirming injectable parameters |
| katana | Prioritize parameterized URLs (cap 100), plain paths (cap 50) |
| ffuf | Non-200 status codes first, cap at 50 results |
| trufflehog | Cap at 50 findings |
| semgrep | Error severity: 30, warning: 20, info: 10 |
| search_traffic | Cap JSON results at 50 flows |
| get_flow_body | Direct truncation |
| get_device_screen | Strip base64 image data (too large for context) |
| Default | UTF-8-safe truncation with summary: "...\n[Truncated — showing first N of M chars]" (150 chars reserved for suffix) |

LLMs have a maximum input size (called the context window). As the agent’s conversation grows with tool calls and results, it can exceed this limit. The context manager prevents this by estimating token usage and pruning old messages.

Ghost uses a simple heuristic: ~4 characters per token + overhead:

tokens ≈ len(text) / 4 + 1

Per-message overhead: +4 tokens for role/delimiter formatting. Tool calls add: name tokens + input tokens + 10 for JSON structure.

This is deliberately approximate — exact tokenization would require running the provider’s tokenizer, which is slow. The estimate errs on the side of overestimating (pruning slightly early is better than exceeding the context window).
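The heuristic is small enough to show directly (function names here are hypothetical):

```go
package main

import "fmt"

// estimateTokens applies the ~4-chars-per-token heuristic from the
// text: len/4 + 1 per string.
func estimateTokens(text string) int {
	return len(text)/4 + 1
}

// estimateMessage adds the per-message overhead of 4 tokens for
// role/delimiter formatting. (Tool calls would add name + input
// tokens + 10 for JSON structure.)
func estimateMessage(content string) int {
	return estimateTokens(content) + 4
}

func main() {
	msg := "find security vulnerabilities in this API"
	fmt.Println(estimateMessage(msg)) // rough token cost of a short user message
}
```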

| Provider / Model | Context Window | Output Reserved |
| --- | --- | --- |
| Claude (Sonnet 4.6, Opus, Haiku) | 200,000 tokens | 8,192 tokens |
| GPT-4o, GPT-4 Turbo | 128,000 tokens | 8,192 tokens |
| GPT-4 (original) | 8,192 tokens | 8,192 tokens |
| GPT-3.5 | 16,385 tokens | 8,192 tokens |
| Ollama (default) | 32,000 tokens | 8,192 tokens |
| Unknown/fallback | 32,000 tokens | 8,192 tokens |

Budget = context window - output reservation. For Claude: 200,000 - 8,192 = 191,808 tokens available for input.

When the estimated token count exceeds the budget:

  1. Skip if too few messages — won’t prune if there are 4 or fewer messages (need minimum context)
  2. Try progressively aggressive pruning — starts by keeping the last 10 exchanges, reduces to 9, 8, … down to 1 until the budget is met
  3. Force minimum — if even keeping 1 exchange exceeds budget, force it anyway

An “exchange” is one assistant message (with tool calls) + all its subsequent tool result messages. This keeps related tool calls and their results together.

What’s preserved (never pruned):

  • System message (index 0) — the agent’s identity and instructions
  • Original user message (index 1) — the user’s initial question/goal
  • Compact summary of pruned exchanges (synthetic)
  • Last N exchanges (the most recent work)

The summary of pruned exchanges is mechanical (no LLM call needed):

  • Lists tools used with invocation counts: "Tools used: search_traffic(3), get_flow(1)"
  • Includes at most 2 key findings (first 200 characters each)
  • Format: "[Earlier conversation context — N tool exchanges pruned... See engagement state for full progress.]"

Some providers (particularly Anthropic) require strict alternation between user and assistant messages — no two consecutive messages from the same role. The fixAlternation function inserts synthetic "[Continuing from previous step.]" user messages between consecutive assistant messages. Tool result messages are treated as user-role for this purpose.
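A minimal sketch of such an alternation fix (the real fixAlternation also handles tool-result roles; the message shape here is simplified):

```go
package main

import "fmt"

type message struct {
	role    string // "user" or "assistant" (tool results count as user here)
	content string
}

// fixAlternation inserts a synthetic user message between any two
// consecutive assistant messages so strict-alternation providers
// accept the history.
func fixAlternation(msgs []message) []message {
	var out []message
	for _, m := range msgs {
		if len(out) > 0 && out[len(out)-1].role == "assistant" && m.role == "assistant" {
			out = append(out, message{"user", "[Continuing from previous step.]"})
		}
		out = append(out, m)
	}
	return out
}

func main() {
	fixed := fixAlternation([]message{
		{"assistant", "step 1 done"},
		{"assistant", "step 2 done"},
	})
	for _, m := range fixed {
		fmt.Println(m.role+":", m.content)
	}
}
```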

Ghost supports three LLM providers, each with its own adapter:

Anthropic

  • Default model: Claude Sonnet 4.6
  • Prompt caching: System messages and the last tool definition get CacheControl: Ephemeral — Anthropic caches these across calls, reducing cost and latency for the agent’s iterative loop
  • Message merging: Consecutive user messages are merged into a single message with multiple text blocks (Anthropic API requirement). Consecutive tool results are merged similarly. Critically, tool result messages absorb any immediately following user messages into the same Anthropic user message — because tool results become user-role messages in the Anthropic API, a real user message immediately after tool results would create invalid consecutive user messages. This merging prevents 400 errors from the Anthropic API (“tool_use blocks must have matching tool_result blocks”). OpenAI does not have this issue because it uses a separate tool role.
  • Stream buffer: 64 events
OpenAI

  • Default model: GPT-4o
  • Retry logic: Up to 3 retries with 2-second base wait. Rate limit detection parses “try again in Xs” from error messages using regex.
  • Stream buffer: 64 events
  • Uses ChatCompletionAccumulator for streaming assembly
Ollama

  • Default endpoint: http://localhost:11434
  • Default model: llama3.2
  • API compatibility: Uses OpenAI-compatible API at /v1 path with dummy API key "ollama"
  • Retry resilience: Retries on connection refused, connection reset, HTTP 500/503, EOF, “no such host”, and “model is loading” — Ollama often needs time to load models into memory

The agent’s system prompt varies by mode and is assembled from multiple sections:

  1. Identity — who the agent is (“Ghost QA Agent”)
  2. Session context — current session info
  3. Strategic directives — testing strategy guidance
  4. File upload directive — how to handle uploaded files
  5. Planning protocol — rules for creating and executing plans
  6. Data parsing protocol — how to parse traffic data formats
  7. Injection rules (conditional) — included only when browser extension is connected
  8. Inspector context (conditional) — included only when a mobile device is connected
  9. Constraints — limitations and boundaries
  10. Few-shot examples — example interactions showing good agent behavior
  11. GQL reference — GraphQL-specific guidance

The security system prompt has two variants: web security and mobile security (which includes Frida context).

Three scan modes restrict what the agent can do:

| Scan Mode | Description | Key Restrictions |
| --- | --- | --- |
| passive | Observation only — analyze existing traffic, never send requests | FORBIDDEN: send_http_request, fuzz_endpoint, attack_request, all run_* scanners, all frida_* tools |
| active-safe | Non-destructive testing — send requests but no exploitation | ALLOWED: send_http_request, fuzz_endpoint, most scanners. FORBIDDEN: run_sqlmap, run_hydra, attack_request (cluster_bomb/battering_ram modes) |
| active-full | Full offensive testing — all tools available | REQUIRES: request_approval before destructive actions. All tools allowed. |

The security prompt includes vulnerability taxonomies organized by severity:

  • Critical: SQL injection, RCE, auth bypass, SSRF to internal networks, hardcoded secrets
  • High: IDOR, stored XSS, path traversal, JWT issues, mass assignment
  • Medium: CORS misconfiguration, reflected XSS, info disclosure, rate limiting, CSRF
  • Low: Verbose errors, missing headers, cookie flags, version disclosure

Mobile-specific additions (when Frida is available): certificate pinning bypass, insecure local storage, biometric bypass, IPC vulnerabilities, root/jailbreak detection bypass.

Conversations and messages are stored in SQLite, linked to sessions:

| Table | Key Fields | Notes |
| --- | --- | --- |
| conversations | id, session_id, title, created_at, updated_at | CASCADE delete with session |
| messages | id, conversation_id, role, content, tool_calls (JSON), tool_call_id | CASCADE delete with conversation |

Messages store tool calls as a JSON array — each tool call includes the tool name, input arguments, and the tool call ID. This allows conversations to be resumed: the full tool call history is preserved and can be replayed.

When loading a conversation for continuation, sanitizeConversationHistory fixes common issues: orphaned tool results (those without a preceding assistant message) and consecutive same-role messages (synthetic bridging messages are inserted), ensuring proper alternation for provider compatibility.

The agent streams events to the frontend via Server-Sent Events (SSE). Each event has a type and JSON data:

| Event Type | When Emitted | Payload |
| --- | --- | --- |
| chunk | Each text token from the LLM | Text delta string |
| tool_call | LLM requests a tool | Tool name + input JSON |
| tool_result | Tool execution completes | Tool name + output |
| assistant_message | Complete assistant message assembled | Full message content |
| plan_created | Agent creates a new plan | TaskPlan object |
| step_started | Plan step begins execution | Step ID + description |
| step_completed | Plan step finished | Step ID + result |
| plan_revised | Agent revises its plan | Updated TaskPlan |
| plan_completed | All plan steps finished | Final plan state |
| options | Agent presents interactive choices to the user | Question string + array of option objects (label, description, value) |
| steer | User steering message injected | Steering text |
| metrics | Agent run completed | RunMetrics (duration, iterations, tool calls, findings) |
| error | Error occurred | Error message |
| done | Agent run finished | — |

The agent exposes these HTTP endpoints:

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/v1/agent/chat | Start or continue an agent conversation. Body limit: 64 KB. Returns SSE stream. Creates a new conversation if no conversation_id is provided. |
| GET | /api/v1/agent/conversations | List all agent conversations (ordered by updated_at DESC) |
| GET | /api/v1/agent/conversations/{id} | Get a specific conversation with all messages |
| DELETE | /api/v1/agent/conversations/{id} | Delete a conversation and all its messages |
| POST | /api/v1/agent/steer | Send a steering message to the running agent. Injected at the next iteration. |
| POST | /api/v1/agent/stop | Request the agent to stop after completing its current step |
| POST | /api/v1/agent/upload | Upload a file for agent context. Body limit: 512 KB. Allowed: .txt, .md (100 KB), .json, .yaml, .yml (200 KB). |

When the agent completes (regardless of how it stopped), it emits a metrics event with detailed statistics:

| Metric | Description |
| --- | --- |
| Duration | Total wall-clock time |
| Iterations | How many loop iterations were used (out of 25 max) |
| ToolCalls | Total number of tool invocations |
| UniqueTools | Number of distinct tools used |
| FailedTools | How many tool calls returned errors |
| PlanSteps | Total steps in the plan |
| StepsCompleted | How many steps were completed |
| PlanRevisions | How many times the plan was revised |
| Reflections | Number of reflection prompts injected |
| TerminationReason | Why the agent stopped (plan_complete, loop_detected, budget, user_stop, waiting_for_user_choice, etc.) |
| LoopsDetected | Number of loop detection triggers |
| FindingsTotal | Total findings discovered |
| FindingsBySeverity | Breakdown: {critical: 1, high: 3, medium: 5, ...} |