Agent System

Ghost’s AI agent is like a senior engineer you can pair with — it doesn’t just answer questions, it makes plans, executes them step by step, reflects on what it found, and decides when it’s done. You give it a goal (“find security vulnerabilities in this API” or “generate test cases for the checkout flow”), and it autonomously plans its approach, uses the right tools in the right order, adapts when it discovers something unexpected, and produces a structured report.

Under the hood, the agent runs a plan-execute-reflect-terminate loop — up to 25 iterations where it plans what to do, executes tools, reflects on results, and checks whether it should stop. It has access to ~60 tools (depending on mode and what’s connected), but each LLM call only sees 15-25 relevant tools — a dynamic router filters tools based on what the agent is currently doing.
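The control flow described above can be sketched as a simple loop. This is illustrative scaffolding, not Ghost's actual code: the names `runAgent` and `llmResponse` are invented for the example, and most subsystems are reduced to comments.

```go
package main

import "fmt"

const maxIterations = 25

// llmResponse is a simplified stand-in for a streamed LLM reply.
type llmResponse struct {
	text      string
	toolCalls []string
}

// runAgent drives the plan-execute-reflect-terminate loop. The real
// loop also evaluates termination signals, reads steering messages,
// routes tools, and prunes context; those steps are comments here.
func runAgent(llm func(iter int) llmResponse) (iterations int, reason string) {
	for i := 0; i < maxIterations; i++ {
		iterations = i + 1
		// 1. evaluate termination signals  2. drain steering messages
		// 3. assemble context              4. inject reflection prompts
		// 5. route tools for this step
		resp := llm(i) // 6. streaming LLM call
		// 7. handle the response
		if len(resp.toolCalls) == 0 {
			return iterations, "text-only response"
		}
		// ...execute tools, update engagement state...
		// 8. prune context if near the token budget
	}
	return iterations, "iteration budget exhausted"
}

func main() {
	// A toy LLM closure stands in for the provider: it requests a tool
	// twice, then answers with text, ending the run on iteration three.
	n, reason := runAgent(func(i int) llmResponse {
		if i < 2 {
			return llmResponse{toolCalls: []string{"search_traffic"}}
		}
		return llmResponse{text: "done"}
	})
	fmt.Println(n, reason)
}
```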

What this diagram shows — how the agent’s components work together:

The Agent Core is the brain — it runs a loop up to 25 times, checking termination conditions each iteration, reading any steering messages the user sent mid-run, and pausing for interactive choices when the agent needs user input via present_options. Each iteration, it calls the LLM Provider (Claude, GPT-4o, or Ollama) via streaming, which returns text and/or tool calls. The Tool Router filters the full registry (~60 tools) down to 15-25 relevant tools based on what the agent is currently doing — so the LLM isn’t overwhelmed with irrelevant options. Tool calls are executed by the Concurrent Executor, which runs read-only tools in parallel (up to 4 at once) and mutating tools one at a time. After execution, the Context Manager estimates token usage and prunes old messages if the conversation is approaching the LLM’s context window limit — preserving the system prompt, the user’s original question, and recent exchanges.

Each iteration (maximum 25) follows 8 steps:

Step 1 — Check termination signals

Six signals are evaluated in priority order (see Termination Signals below). If any signal fires, the agent either stops immediately or enters a “report then stop” mode where it gets 1-2 more iterations to write a summary.

Step 2 — Read steering messages

The agent reads from a buffered channel (capacity 5) without blocking. If the user sent a steering message via POST /api/v1/agent/steer, it’s injected as a [USER STEERING] block into the conversation — appearing before the agent’s next LLM call. This lets you redirect the agent mid-run: “focus on the authentication endpoints” or “skip performance testing and go to reporting.”
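A non-blocking read from a buffered channel is an idiomatic Go pattern; a minimal sketch (the names here are hypothetical, not Ghost's actual identifiers):

```go
package main

import "fmt"

// drainSteering reads any pending steering messages without blocking.
// The buffered channel (capacity 5) mirrors the description above.
func drainSteering(ch chan string) []string {
	var msgs []string
	for {
		select {
		case m := <-ch:
			msgs = append(msgs, "[USER STEERING] "+m)
		default:
			return msgs // nothing pending: don't block the agent loop
		}
	}
}

func main() {
	steer := make(chan string, 5)
	steer <- "focus on the authentication endpoints"
	fmt.Println(drainSteering(steer))
}
```

The `default` case is what makes the read non-blocking: if the channel is empty, the loop continues immediately instead of waiting for the user.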

Step 3 — Assemble context

The full message history is assembled: system prompt + conversation history + engagement state (an XML block showing the current plan progress, discovered endpoints, findings so far, and tool call counts). This engagement state acts as “Layer 2 memory” — it survives message pruning and gives the LLM awareness of overall progress even when older messages have been removed.

Step 4 — Inject reflection prompts

Two types of reflection prompts can be injected:

  • Step reflection — after a complete_step tool call, an XML block asks the agent to assess: What did we learn? Was the evidence sufficient? Any unexpected discoveries? What should the next step focus on?
  • Final reflection — when all plan steps are done, a longer prompt asks: Were all goals addressed? Quality of findings? What areas were missed? Confidence level? Recommendations for follow-up?

These reflections force the agent to pause and think critically rather than blindly executing the next step.

Step 5 — Route tools

The tool router filters the full registry down to the tools relevant for the current plan step’s category. This is crucial for LLM performance — giving the LLM 60 tools at once leads to confusion and poor tool selection. Giving it 15-25 relevant tools produces much better results.

Step 6 — Call the LLM

The provider’s StreamChat method is called with the message history, filtered tools, and MaxTokens: 8192. The response streams back as events — text chunks and tool call definitions arrive incrementally. SSE events are emitted to the frontend in real time so the user sees the agent “thinking.”

Step 7 — Handle the response

There are two paths, depending on whether the LLM made tool calls:

Text-only response (the LLM just wrote text without calling any tools):

  • If the plan is completed → stop (the agent is done)
  • If reportThenStop was set by a termination signal → stop
  • If no plan exists yet → inject a “planning nudge” asking the agent to create a plan (maximum 2 nudges before giving up)
  • Otherwise → inject a “continuation nudge” asking the agent to proceed, or stop if it shouldn’t continue

Tool call response (the LLM wants to use tools):

  • Execute the tools (parallel or sequential — see Concurrent Tool Execution)
  • Compress tool output to save context tokens
  • Update engagement state (endpoints discovered, findings, tool call counts, phase transitions)
  • Reset the consecutive text-only counter

Step 8 — Prune context

After each iteration, the context manager estimates the total token count and prunes old messages if approaching the provider’s context window limit. This prevents the conversation from exceeding the LLM’s maximum input size.

The router solves a critical problem: the agent has ~60 tools available, but sending all of them to the LLM every call wastes tokens and confuses the model. Instead, the router uses a 4-layer filtering strategy to select 15-25 tools per call.

What this diagram shows — how tool filtering works:

The full registry of ~60 tools is never sent to the LLM all at once (except during initial planning). Instead, four layers contribute tools that are merged, deduplicated, and sorted alphabetically. Layer 1 provides 16 base tools that every task needs (plan management including think and present_options, traffic search, filesystem, journey export). Layer 2 adds tools specific to the current phase — if the agent is doing reconnaissance, it gets session/journey listing tools; if it’s doing active testing, it gets fuzzing and scanning tools. Layer 3 adds any tools the plan step explicitly requested. Layer 4 adds conditional tools that depend on what’s connected — browser tools only appear when the browser extension is active, inspector tools when a mobile device is connected, Frida tools when Frida is available. The alphabetical sort is deliberate — it produces deterministic tool ordering that maximizes Anthropic’s prompt cache hits.
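The merge-dedupe-sort step can be sketched in a few lines of Go. The layer contents below are tiny illustrative samples, and `routeTools` is a hypothetical name:

```go
package main

import (
	"fmt"
	"sort"
)

// routeTools merges the four layers, deduplicates, and sorts
// alphabetically so the tool ordering is deterministic (which is
// what maximizes prompt cache hits).
func routeTools(layers ...[]string) []string {
	seen := map[string]bool{}
	var out []string
	for _, layer := range layers {
		for _, t := range layer {
			if !seen[t] {
				seen[t] = true
				out = append(out, t)
			}
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	base := []string{"create_plan", "think", "search_traffic"} // Layer 1
	phase := []string{"fuzz_endpoint", "send_http_request"}    // Layer 2
	step := []string{"fuzz_endpoint"}                          // Layer 3 (deduplicated)
	conditional := []string{"browser_click"}                   // Layer 4
	fmt.Println(routeTools(base, phase, step, conditional))
}
```

Sorting after deduplication means the same tool set always serializes identically across iterations, so cached prompt prefixes stay valid.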

Layer 1 — Base Tools (16, always available)

These tools are included in every LLM call because they’re fundamental to how the agent operates:

| Category | Tools | Purpose |
| --- | --- | --- |
| Plan management | create_plan, revise_plan, complete_step, think, present_options | Creating and progressing through the execution plan, private reasoning, and presenting interactive choices to the user |
| Traffic analysis | search_traffic, get_flow, get_flow_body, find_endpoints, get_traffic_stats | Searching and reading captured HTTP traffic |
| Flow annotation | tag_flows, annotate_flow | Marking flows with tags and notes for organization |
| Journey | record_journey, journey_export | Start/stop journey recording and export journeys as deterministic test code (always available, not phase-dependent) |
| Filesystem | fs_read, fs_write, fs_list | Reading and writing files (test output, reports, configs) |

Layer 2 — Phase Tools (category-specific)

Different plan step categories unlock different tool sets. The category is set when the agent creates its plan — each step has a Category field that maps to one of these groups:

| Step Category | Tools Added | Count |
| --- | --- | --- |
| recon | list_sessions, list_journeys, get_journey_steps, list_ws_messages, list_console_errors, list_navigations, list_storage_changes, replay_request, get_page_resources | 9 |
| analysis | detect_anomalies, detect_schema_drift, detect_regression, suggest_edge_cases, analyze_api_coverage, analyze_sequences, list_findings, map_page_apis, analyze_auth_pattern, analyze_jwt, find_sensitive_data, detect_idor, security_headers_audit | 13 |
| active_test | send_http_request, replay_request, fuzz_endpoint, compare_environments, test_form, request_approval, replay_journey, attack_request, list_wordlists, plus 10 external scanner tools (run_nuclei, run_dalfox, run_ffuf, run_sqlmap, run_katana, run_trufflehog, run_semgrep, run_nmap, run_ssl_scan, run_hydra) | 19 (incl. scanners) |
| exploit | send_http_request, replay_request, request_approval, attack_request, list_wordlists, run_sqlmap | 6 |
| report | generate_test, generate_bug_report, export_as, generate_api_docs, generate_mock_server, generate_session_report, visual_regression, list_findings | 8 |

Layer 3 — Step-Requested Tools

When the agent creates a plan, each step can include a Tools field listing specific tools. These are always included for that step, even if they don’t match the step’s category. This lets the agent plan ahead: “In step 3, I’ll need the fuzz_endpoint tool even though this is an analysis step.”

Layer 4 — Conditional Tools

These tools only appear when their corresponding capability is available. The router checks the tool registry — if the tools are registered, they’re included:

| Capability | Condition | Tools Added |
| --- | --- | --- |
| Browser extension | Extension WebSocket connected, browser tools registered | Up to 17 browser tools defined (6 actually registered: browser_read_page, browser_query_all, browser_click, browser_fill, browser_screenshot, browser_inject) |
| Mobile inspector | Device connected, inspector tools registered | 7 tools: get_device_screen, get_element_tree, find_elements, get_element_selectors, correlate_element_traffic, tap_device, type_device |
| Frida | Frida available, Frida tools registered | 6 tools: frida_attach, frida_detach, frida_trace, frida_inject, frida_list_methods, frida_check_bypass |

When no plan exists yet (the agent hasn’t created one), the router returns the full registry — all ~60 tools. This gives the agent complete awareness of its capabilities when planning. Once a plan is created, subsequent calls get the filtered 15-25 tool set.

Ghost maintains two separate tool registries — one for QA mode and one for Security mode. Each registers a different combination of tools:

The QA registry includes: core tools (13), plan tools (5), QA generation tools (7: generate_test, generate_bug_report, detect_regression, export_as, generate_api_docs, generate_mock_server, generate_test_scenarios), QA advanced analysis tools (11: fuzz_endpoint, compare_environments, test_form, detect_anomalies, detect_schema_drift, suggest_edge_cases, analyze_api_coverage, analyze_sequences, map_page_apis, visual_regression, generate_session_report), QA external tools (2: run_k6, run_hey), inspector tools (7), proxy_inject_script (1), TestRail tools (4: testrail_list_projects, testrail_get_cases, testrail_push_results, testrail_suggest_cases), and browser tools (6, conditional).

The Security registry includes: core tools (13), plan tools (5), security tools (6: list_findings, send_http_request, get_page_resources, attack_request, list_wordlists, request_approval), Frida tools (6, conditional), external scanner tools (10: run_nuclei, run_dalfox, run_ffuf, run_sqlmap, run_katana, run_trufflehog, run_semgrep, run_nmap, run_ssl_scan, run_hydra), inspector tools (7), proxy_inject_script (1), and browser tools (6, conditional).

Different tools have different timeout durations based on their expected execution time:

| Tool Pattern | Timeout | Why |
| --- | --- | --- |
| Default | 30 seconds | Most tools (search, get, list) complete quickly |
| browser_* | 60 seconds | Browser automation involves page loads and network waits |
| frida_* | 60 seconds | Frida operations involve device communication |
| frida_trace, frida_inject | 3 minutes | Tracing and injection may involve long-running scripts |
| attack_request | 5 minutes | Attacker engine sends many requests sequentially |
| run_* (external scanners) | 5 min + 10s | External tools like Nuclei or SQLMap can run for minutes |

The agent doesn’t just randomly use tools — it follows a structured methodology appropriate to its mode. Phases represent stages of a security assessment or QA test cycle.

PTES (Penetration Testing Execution Standard) is a widely-used framework for security assessments. Ghost’s security agent follows these phases:

What this diagram shows — the security assessment progression:

The agent starts by analyzing existing traffic to understand the application’s API surface. It then moves to passive detection — looking for vulnerabilities in the captured traffic without sending any new requests (missing headers, insecure cookies, information leakage). If the scan mode allows it, the agent progresses to active scanning — actually sending crafted requests to test for SQL injection, XSS, and other vulnerabilities. Confirmed vulnerabilities may lead to exploitation — proving the vulnerability is real with a proof-of-concept. Finally, the agent generates a report summarizing all findings with severity ratings and remediation advice.

| Phase | Name | Tools Available | Auto-Advance After |
| --- | --- | --- | --- |
| 1 | Traffic Analysis | Recon tools | 3 tool calls |
| 2 | Passive Detection | Analysis tools | 4 tool calls |
| 3 | Active Scanning | Active test tools + scanners | 3 tool calls |
| 4 | Exploitation | Exploit tools | 4 tool calls |
| 5 | Reporting | Report tools | — (terminal) |

What this diagram shows — the QA testing progression:

The agent starts with reconnaissance — understanding the application by examining traffic patterns, sessions, and page resources. It then performs functional testing — verifying expected behavior of API endpoints and user flows. Edge case testing explores boundary conditions and unusual inputs. Error handling testing checks how the application responds to invalid requests and error conditions. Performance testing uses tools like k6 and hey to measure response times under load. Finally, QA reporting generates test cases, bug reports, and coverage summaries.

| Phase | Name | Auto-Advance After |
| --- | --- | --- |
| 1 | QA Recon | 3 tool calls |
| 2 | Functional Testing | 3 tool calls |
| 3 | Edge Case Testing | 3 tool calls |
| 4 | Error Handling | 3 tool calls |
| 5 | Performance Testing | 2 tool calls |
| 6 | QA Reporting | — (terminal) |

Phase transitions happen automatically based on PhaseToolCalls — the number of tool calls in the current phase. When the threshold is reached, the agent advances to the next phase. The PhaseToolCalls counter resets to 0 on each transition.

This prevents the agent from spending too long in any single phase. If the agent calls 3 tools during traffic analysis, it automatically advances to passive detection — even if it hasn’t finished analyzing everything. The agent can always use tools from previous phases (they’re still available), but the phase change influences which tools appear first in the filtered set.
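The auto-advance mechanics can be sketched as a counter plus two lookup tables. Security-mode thresholds (from the phase table above) are shown; the function name and table shapes are illustrative:

```go
package main

import "fmt"

// thresholds: tool calls before auto-advancing out of a phase.
var thresholds = map[string]int{
	"traffic_analysis":  3,
	"passive_detection": 4,
	"active_scanning":   3,
	"exploitation":      4,
	// "reporting" is terminal: no entry, never advances.
}

var next = map[string]string{
	"traffic_analysis":  "passive_detection",
	"passive_detection": "active_scanning",
	"active_scanning":   "exploitation",
	"exploitation":      "reporting",
}

// recordToolCall bumps PhaseToolCalls and advances the phase when the
// threshold is reached, resetting the counter on transition.
func recordToolCall(phase string, phaseToolCalls int) (string, int) {
	phaseToolCalls++
	if limit, ok := thresholds[phase]; ok && phaseToolCalls >= limit {
		return next[phase], 0 // transition: counter resets
	}
	return phase, phaseToolCalls
}

func main() {
	phase, calls := "traffic_analysis", 0
	for i := 0; i < 3; i++ {
		phase, calls = recordToolCall(phase, calls)
	}
	fmt.Println(phase, calls) // passive_detection 0
}
```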

The agent maintains a state object that tracks everything discovered during the run:

| Field | Type | Description |
| --- | --- | --- |
| Mode | string | "qa" or "security" — determines which phases and prompt to use |
| Phase | string | Current engagement phase (e.g., "passive_detection", "functional_testing") |
| Endpoints | list | Discovered API endpoints (cap: 200). Deduplicated by a map[string]bool. |
| Findings | list | Vulnerabilities or bugs found (cap: 100). Each has an ID (VULN-001 for security, BUG-001 for QA), type, severity, and summary. |
| ToolsCalled | int | Total number of tool invocations across the entire run |
| PhaseToolCalls | int | Tool calls in the current phase — resets on phase transition, drives auto-advance |
| Iteration | int | Current loop iteration (0-24) |
| TargetHosts | list | Hosts the agent is testing |
| ActiveInjections | list | Proxy injection rules the agent has created (cap: 50) |
| Plan | TaskPlan | The structured execution plan (see below) |
| ReflectionCount | int | How many reflection prompts have been injected |
| FinalReflectionDone | bool | Whether the final reflection has been completed |
| QualityAssessment | string | The agent’s self-assessment of its work quality |
| MissedAreas | list | Areas the agent identified as not yet covered |
| StopAfterCurrentStep | bool | Set to true when the user requests a stop via the API |

This state is serialized to XML and injected into every LLM call, giving the agent persistent awareness of its progress even when older messages are pruned.

After a text-only response, the agent checks whether it should continue looping or stop. The check returns false (stop) when:

  1. Phase is "done" — the agent has completed all phases
  2. Last tool was proxy_inject_script with action "add" — the agent just injected a script and should wait for results
  3. Phase is a reporting phase ("reporting" or "qa_reporting") — the agent is writing its final report
  4. Iteration >= 23 (maxIterations - 2) — approaching the hard cap, time to wrap up

Six signals are evaluated in priority order at the start of each iteration. Higher-priority signals override lower ones:

| Priority | Signal | Condition | Action | Description |
| --- | --- | --- | --- | --- |
| 1 | Plan complete + final reflection | Plan status is "completed" AND FinalReflectionDone is true | stop | The agent finished its plan and reflected on the results — a clean exit. |
| 2 | Plan complete, no tools | Plan status is "completed" AND the last iteration had zero tool calls | stop | The agent completed its plan and the LLM responded with only text (no more tools to call) — done. |
| 3 | Loop detection | Same tool name + same input hash appears 3+ times in the last 10 tool call records | report_then_stop | The agent is stuck in a loop — calling the same tool with the same arguments repeatedly. Input is hashed with FNV-1a (32-bit) for efficient comparison. |
| 4 | Diminishing returns | Iteration >= 8 AND the last 6 tool calls are all the same tool AND the current step is "in_progress" | report_then_stop | The agent isn’t making progress — it keeps using the same tool without advancing the plan. The iteration >= 8 guard prevents false positives during early exploration. |
| 5 | Budget reservation | Iteration >= 22 (maxIterations - 3) | report_then_stop | Running out of iterations — reserve the last 3 for the agent to write a summary and complete its current step. |
| 6 | User stop | StopAfterCurrentStep is true (set via POST /api/v1/agent/stop) | report_then_stop | The user explicitly asked the agent to stop. |

When the action is report_then_stop (signals 3-6), the agent doesn’t stop immediately. Instead:

  1. A termination prompt is injected as an XML block: <termination_notice reason="...">Complete your current step with complete_step and write a final summary.</termination_notice>
  2. The reportThenStop flag is set
  3. The agent gets 1-2 more iterations to finish its current step and write a summary
  4. On the next iteration, if the LLM responds with text only (no tool calls), the agent stops

This ensures the agent always produces useful output — even when interrupted, it summarizes what it found.
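The loop-detection signal (priority 3) hinges on hashing tool inputs with FNV-1a, which Go's standard library provides in hash/fnv. A minimal sketch with hypothetical record shapes:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// toolRecord pairs a tool name with a hash of its input, so records
// are cheap to store and compare.
type toolRecord struct {
	name string
	hash uint32
}

// hashInput uses FNV-1a (32-bit) for fast, non-cryptographic
// comparison of tool arguments.
func hashInput(input string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(input))
	return h.Sum32()
}

// loopDetected reports whether any (tool, input) pair occurs 3+ times
// in the last 10 records.
func loopDetected(records []toolRecord) bool {
	start := 0
	if len(records) > 10 {
		start = len(records) - 10
	}
	counts := map[toolRecord]int{}
	for _, r := range records[start:] {
		counts[r]++
		if counts[r] >= 3 {
			return true
		}
	}
	return false
}

func main() {
	r := toolRecord{"search_traffic", hashInput(`{"q":"login"}`)}
	fmt.Println(loopDetected([]toolRecord{r, r, r}))
}
```

Hashing means the detector never stores full tool inputs, and two calls collide only when both the tool name and the exact argument bytes match.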

The agent can present structured choices to the user mid-run using the present_options tool. This creates an interactive UI element — a card with labeled option buttons and a free-text input — instead of requiring the user to type a response.

How it works:

  1. The LLM calls present_options with a question string and an array of options (each with label, description, and value)
  2. The agent emits a StreamEventOptions event containing the question and options
  3. The agent sets waitingForUserChoice = true and exits the loop cleanly — emitting metrics with termination reason "waiting for user choice"
  4. The frontend renders an OptionsPanel component with a 2-column card grid of options plus a free-text input field
  5. When the user selects an option (or types a custom response), the frontend sends a new chat message containing the selected value
  6. A new agent run starts with the user’s choice in the conversation history, and the agent continues from where it left off

The Run Summary is suppressed when the termination reason is "waiting for user choice" — the agent isn’t actually done, it’s just pausing for input. The options panel shows the question with interactive buttons, and after the user responds, the answered panel fades to 50% opacity with a check icon on the selected option.

| Field | Type | Description |
| --- | --- | --- |
| question | string | The question to present to the user (required) |
| options | array | 2-6 options, each with label (short display text), description (optional context), and value (returned when selected) |

The think tool gives the agent a private scratchpad for reasoning. When the LLM calls think with a thought string, the tool simply returns "ok" — the thought is visible to the LLM in the conversation history but is not displayed as regular text to the user.

In the frontend, the think tool renders as a ThinkIndicator — a purple-bordered card with a brain icon and bouncing dots animation while pending, then “Reasoned about the approach” when complete. It does not use the standard accordion card UI. The think tool is always available (registered as a plan tool).

The agent’s execution plan is a structured object, not free-form text. This gives the system precise control over progress tracking:

| Field | Type | Constraints | Description |
| --- | --- | --- | --- |
| Goal | string | — | High-level objective (e.g., “Find authentication vulnerabilities in the checkout API”) |
| Scope | string | — | Boundaries of the assessment (e.g., “api.example.com, POST endpoints only”) |
| Steps | list | 1-15 steps | Ordered execution steps. Minimum 1 allows focused single-step tasks (like “export journey as Cypress UI”), maximum 15 prevents over-planning. |
| CurrentStep | int | — | Index of the active step |
| Revisions | int | Max 5 | How many times the plan has been revised. Cap prevents infinite replanning. |
| Status | string | active / completed / aborted | Overall plan status |

Each step has:

| Field | Type | Values | Description |
| --- | --- | --- | --- |
| ID | int | — | Step number |
| Description | string | — | What this step will do |
| Category | string | recon, analysis, active_test, exploit, report | Maps to tool router for phase-specific tool selection |
| Tools | list | — | Specific tools this step plans to use (included via Layer 3 routing) |
| Status | string | pending, in_progress, completed, skipped, failed | Step progress |
| Result | string | Max 500 chars | Summary of what the step accomplished |
| Substeps | list | — | Finer-grained breakdown (optional) |

The plan is serialized to XML and injected into every LLM call so the agent always knows where it is:

<current_plan progress="3/7" phase="active_test">
Goal: Find authentication bypass vulnerabilities
Scope: api.example.com checkout endpoints
Steps:
[DONE] Step 1: Analyze traffic patterns (Found 12 endpoints...)
[DONE] Step 2: Check authentication headers (Missing auth on 3...)
[IN PROGRESS] Step 3: Fuzz authentication parameters
[PENDING] Step 4: Test session management
[PENDING] Step 5: Generate proof-of-concept
[PENDING] Step 6: Write security report
</current_plan>

Step results are truncated to 80 characters in the XML to save tokens.

When the LLM requests multiple tool calls in a single response, Ghost can execute them in parallel — but only if they’re safe to parallelize.

26 read-only tools (safe for parallel execution): search_traffic, get_flow, get_flow_body, find_endpoints, get_traffic_stats, list_sessions, list_journeys, list_ws_messages, list_console_errors, list_navigations, list_storage_changes, get_page_resources, detect_anomalies, detect_schema_drift, detect_regression, suggest_edge_cases, analyze_api_coverage, analyze_sequences, list_findings, map_page_apis, fs_read, fs_list, get_device_screen, get_element_tree, find_elements, get_element_selectors

All other tools (like tag_flows, send_http_request, fuzz_endpoint, attack_request) are mutating — they change state or send requests. These always execute sequentially to avoid race conditions.

Parallel execution uses a semaphore channel with capacity 4 — at most 4 read-only tools run simultaneously. This limits resource usage while still providing significant speedup when the LLM requests many searches or listings at once.
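The semaphore-channel pattern looks like this in Go (a sketch with placeholder tool functions, not Ghost's actual executor):

```go
package main

import (
	"fmt"
	"sync"
)

// runReadOnly executes read-only tools with at most 4 in flight,
// using a buffered channel as a counting semaphore.
func runReadOnly(tools []func() string) []string {
	sem := make(chan struct{}, 4) // capacity 4: max concurrency
	results := make([]string, len(tools))
	var wg sync.WaitGroup
	for i, tool := range tools {
		wg.Add(1)
		go func(i int, tool func() string) {
			defer wg.Done()
			sem <- struct{}{}        // acquire a slot (blocks when 4 are running)
			defer func() { <-sem }() // release the slot
			results[i] = tool()      // each goroutine writes its own index: no race
		}(i, tool)
	}
	wg.Wait()
	return results
}

func main() {
	tools := []func() string{
		func() string { return "search_traffic: 12 flows" },
		func() string { return "list_sessions: 2 sessions" },
	}
	fmt.Println(runReadOnly(tools))
}
```

Results are written by index rather than appended, so output order matches the LLM's requested order regardless of which tool finishes first.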

Tool results can be large — a traffic search might return thousands of flows, a scanner might produce megabytes of output. Before feeding results back to the LLM, Ghost compresses them to a maximum of 16,000 characters (~4,000 tokens):

| Tool | Compression Strategy |
| --- | --- |
| nuclei | Keep all critical+high findings, cap medium at 20, low at 10 |
| sqlmap | Preserve lines confirming injectable parameters |
| katana | Prioritize parameterized URLs (cap 100), plain paths (cap 50) |
| ffuf | Non-200 status codes first, cap at 50 results |
| trufflehog | Cap at 50 findings |
| semgrep | Error severity: 30, warning: 20, info: 10 |
| search_traffic | Cap JSON results at 50 flows |
| get_flow_body | Direct truncation |
| get_device_screen | Strip base64 image data (too large for context) |
| Default | UTF-8-safe truncation with summary: "...\n[Truncated — showing first N of M chars]" (150 chars reserved for suffix) |

LLMs have a maximum input size (called the context window). As the agent’s conversation grows with tool calls and results, it can exceed this limit. The context manager prevents this by estimating token usage and pruning old messages.

Ghost uses a simple heuristic: ~4 characters per token + overhead:

tokens ≈ len(text) / 4 + 1

Per-message overhead: +4 tokens for role/delimiter formatting. Tool calls add: name tokens + input tokens + 10 for JSON structure.

This is deliberately approximate — exact tokenization would require running the provider’s tokenizer, which is slow. The estimate errs on the side of overestimating (pruning slightly early is better than exceeding the context window).
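The heuristic is small enough to show directly (function names here are hypothetical):

```go
package main

import "fmt"

// estimateTokens applies the ~4-chars-per-token heuristic from the
// text: len/4 + 1 per string.
func estimateTokens(text string) int {
	return len(text)/4 + 1
}

// estimateMessage adds the per-message overhead of 4 tokens for
// role/delimiter formatting. (Tool calls would add name + input
// tokens + 10 for JSON structure.)
func estimateMessage(content string) int {
	return estimateTokens(content) + 4
}

func main() {
	msg := "find security vulnerabilities in this API"
	fmt.Println(estimateMessage(msg)) // rough token cost of a short user message
}
```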

| Provider / Model | Context Window | Output Reserved |
| --- | --- | --- |
| Claude (Sonnet 4.6, Opus, Haiku) | 200,000 tokens | 8,192 tokens |
| GPT-4o, GPT-4 Turbo | 128,000 tokens | 8,192 tokens |
| GPT-4 (original) | 8,192 tokens | 8,192 tokens |
| GPT-3.5 | 16,385 tokens | 8,192 tokens |
| Ollama (default) | 32,000 tokens | 8,192 tokens |
| Unknown/fallback | 32,000 tokens | 8,192 tokens |

Budget = context window - output reservation. For Claude: 200,000 - 8,192 = 191,808 tokens available for input.

When the estimated token count exceeds the budget:

  1. Skip if too few messages — won’t prune if there are 4 or fewer messages (need minimum context)
  2. Try progressively aggressive pruning — starts by keeping the last 10 exchanges, reduces to 9, 8, … down to 1 until the budget is met
  3. Force minimum — if even keeping 1 exchange exceeds budget, force it anyway

An “exchange” is one assistant message (with tool calls) + all its subsequent tool result messages. This keeps related tool calls and their results together.

What’s preserved (never pruned):

  • System message (index 0) — the agent’s identity and instructions
  • Original user message (index 1) — the user’s initial question/goal
  • Compact summary of pruned exchanges (synthetic)
  • Last N exchanges (the most recent work)

The summary of pruned exchanges is mechanical (no LLM call needed):

  • Lists tools used with invocation counts: "Tools used: search_traffic(3), get_flow(1)"
  • Includes at most 2 key findings (first 200 characters each)
  • Format: "[Earlier conversation context — N tool exchanges pruned... See engagement state for full progress.]"

Some providers (particularly Anthropic) require strict alternation between user and assistant messages — no two consecutive messages from the same role. The fixAlternation function inserts synthetic "[Continuing from previous step.]" user messages between consecutive assistant messages. Tool result messages are treated as user-role for this purpose.
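A minimal sketch of such an alternation fix (the real fixAlternation also handles tool-result roles; the message shape here is simplified):

```go
package main

import "fmt"

type message struct {
	role    string // "user" or "assistant" (tool results count as user here)
	content string
}

// fixAlternation inserts a synthetic user message between any two
// consecutive assistant messages so strict-alternation providers
// accept the history.
func fixAlternation(msgs []message) []message {
	var out []message
	for _, m := range msgs {
		if len(out) > 0 && out[len(out)-1].role == "assistant" && m.role == "assistant" {
			out = append(out, message{"user", "[Continuing from previous step.]"})
		}
		out = append(out, m)
	}
	return out
}

func main() {
	fixed := fixAlternation([]message{
		{"assistant", "step 1 done"},
		{"assistant", "step 2 done"},
	})
	for _, m := range fixed {
		fmt.Println(m.role+":", m.content)
	}
}
```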

Ghost supports three LLM providers, each with its own adapter:

Anthropic

  • Default model: Claude Sonnet 4.6
  • Prompt caching: System messages and the last tool definition get CacheControl: Ephemeral — Anthropic caches these across calls, reducing cost and latency for the agent’s iterative loop
  • Message merging: Consecutive user messages are merged into a single message with multiple text blocks (Anthropic API requirement). Consecutive tool results are merged similarly. Critically, tool result messages absorb any immediately following user messages into the same Anthropic user message — because tool results become user-role messages in the Anthropic API, a real user message immediately after tool results would create invalid consecutive user messages. This merging prevents 400 errors from the Anthropic API (“tool_use blocks must have matching tool_result blocks”). OpenAI does not have this issue because it uses a separate tool role.
  • Stream buffer: 64 events
OpenAI

  • Default model: GPT-4o
  • Retry logic: Up to 3 retries with 2-second base wait. Rate limit detection parses “try again in Xs” from error messages using regex.
  • Stream buffer: 64 events
  • Uses ChatCompletionAccumulator for streaming assembly
Ollama

  • Default endpoint: http://localhost:11434
  • Default model: llama3.2
  • API compatibility: Uses OpenAI-compatible API at /v1 path with dummy API key "ollama"
  • Retry resilience: Retries on connection refused, connection reset, HTTP 500/503, EOF, “no such host”, and “model is loading” — Ollama often needs time to load models into memory

The agent’s system prompt varies by mode and is assembled from multiple sections:

  1. Identity — who the agent is (“Ghost QA Agent”)
  2. Session context — current session info
  3. Strategic directives — testing strategy guidance
  4. File upload directive — how to handle uploaded files
  5. Planning protocol — rules for creating and executing plans
  6. Data parsing protocol — how to parse traffic data formats
  7. Injection rules (conditional) — included only when browser extension is connected
  8. Inspector context (conditional) — included only when a mobile device is connected
  9. Constraints — limitations and boundaries
  10. Few-shot examples — example interactions showing good agent behavior
  11. GQL reference — GraphQL-specific guidance

The security system prompt has two variants: web security and mobile security (which includes Frida context).

Three scan modes restrict what the agent can do:

| Scan Mode | Description | Key Restrictions |
| --- | --- | --- |
| passive | Observation only — analyze existing traffic, never send requests | FORBIDDEN: send_http_request, fuzz_endpoint, attack_request, all run_* scanners, all frida_* tools |
| active-safe | Non-destructive testing — send requests but no exploitation | ALLOWED: send_http_request, fuzz_endpoint, most scanners. FORBIDDEN: run_sqlmap, run_hydra, attack_request (cluster_bomb/battering_ram modes) |
| active-full | Full offensive testing — all tools available | REQUIRES: request_approval before destructive actions. All tools allowed. |

The security prompt includes vulnerability taxonomies organized by severity:

  • Critical: SQL injection, RCE, auth bypass, SSRF to internal networks, hardcoded secrets
  • High: IDOR, stored XSS, path traversal, JWT issues, mass assignment
  • Medium: CORS misconfiguration, reflected XSS, info disclosure, rate limiting, CSRF
  • Low: Verbose errors, missing headers, cookie flags, version disclosure

Mobile-specific additions (when Frida is available): certificate pinning bypass, insecure local storage, biometric bypass, IPC vulnerabilities, root/jailbreak detection bypass.

Conversations and messages are stored in SQLite, linked to sessions:

| Table | Key Fields | Notes |
| --- | --- | --- |
| conversations | id, session_id, title, created_at, updated_at | CASCADE delete with session |
| messages | id, conversation_id, role, content, tool_calls (JSON), tool_call_id | CASCADE delete with conversation |

Messages store tool calls as a JSON array — each tool call includes the tool name, input arguments, and the tool call ID. This allows conversations to be resumed: the full tool call history is preserved and can be replayed.

When loading a conversation for continuation, sanitizeConversationHistory fixes common issues: orphaned tool results (those without a preceding assistant message) and consecutive same-role messages (synthetic bridging messages are inserted), ensuring proper alternation for provider compatibility.

The agent streams events to the frontend via Server-Sent Events (SSE). Each event has a type and JSON data:

| Event Type | When Emitted | Payload |
| --- | --- | --- |
| chunk | Each text token from the LLM | Text delta string |
| tool_call | LLM requests a tool | Tool name + input JSON |
| tool_result | Tool execution completes | Tool name + output |
| assistant_message | Complete assistant message assembled | Full message content |
| plan_created | Agent creates a new plan | TaskPlan object |
| step_started | Plan step begins execution | Step ID + description |
| step_completed | Plan step finished | Step ID + result |
| plan_revised | Agent revises its plan | Updated TaskPlan |
| plan_completed | All plan steps finished | Final plan state |
| options | Agent presents interactive choices to the user | Question string + array of option objects (label, description, value) |
| steer | User steering message injected | Steering text |
| metrics | Agent run completed | RunMetrics (duration, iterations, tool calls, findings) |
| error | Error occurred | Error message |
| done | Agent run finished | — |

The agent exposes these HTTP endpoints:

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/v1/agent/chat | Start or continue an agent conversation. Body limit: 64 KB. Returns SSE stream. Creates a new conversation if no conversation_id is provided. |
| GET | /api/v1/agent/conversations | List all agent conversations (ordered by updated_at DESC) |
| GET | /api/v1/agent/conversations/{id} | Get a specific conversation with all messages |
| DELETE | /api/v1/agent/conversations/{id} | Delete a conversation and all its messages |
| POST | /api/v1/agent/steer | Send a steering message to the running agent. Injected at the next iteration. |
| POST | /api/v1/agent/stop | Request the agent to stop after completing its current step |
| POST | /api/v1/agent/upload | Upload a file for agent context. Body limit: 512 KB. Allowed: .txt, .md (100 KB), .json, .yaml, .yml (200 KB). |

When the agent completes (regardless of how it stopped), it emits a metrics event with detailed statistics:

| Metric | Description |
| --- | --- |
| Duration | Total wall-clock time |
| Iterations | How many loop iterations were used (out of 25 max) |
| ToolCalls | Total number of tool invocations |
| UniqueTools | Number of distinct tools used |
| FailedTools | How many tool calls returned errors |
| PlanSteps | Total steps in the plan |
| StepsCompleted | How many steps were completed |
| PlanRevisions | How many times the plan was revised |
| Reflections | Number of reflection prompts injected |
| TerminationReason | Why the agent stopped (plan_complete, loop_detected, budget, user_stop, waiting_for_user_choice, etc.) |
| LoopsDetected | Number of loop detection triggers |
| FindingsTotal | Total findings discovered |
| FindingsBySeverity | Breakdown: {critical: 1, high: 3, medium: 5, ...} |