AI risk deep dive

The attacks we simulate and the controls we verify

This page details the specific vulnerabilities we test for and the safeguards we verify. Use it to understand our scope or to prepare your team.

Coverage areas
  • Prompt injection and jailbreaking.
  • Data leakage and retrieval attacks.
  • Output safety and policy bypass.
  • Tool/function calling abuse.

Attack simulation

Prompt injection testing

We attempt to make your AI do things it shouldn't.

Direct jailbreaking

Roleplay attacks, DAN-style prompts, and multi-turn manipulation to bypass system instructions.

System prompt extraction

Attempts to leak your system prompt, revealing business logic and security controls.

Indirect injection

Malicious instructions hidden in retrieved documents, user profiles, or external content.

Instruction hierarchy bypass

Trying to override developer instructions with user-level prompts.

Context manipulation

Exploiting long context windows to hide attacks or confuse the model about instruction source.

Encoding tricks

Base64, ROT13, and other encoding schemes to sneak harmful instructions past filters.
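
A minimal sketch of the defensive counterpart we check for: an input filter that normalizes common encodings before scanning. The `BLOCKLIST` phrase and function names here are illustrative assumptions, not a real filter's API; production guardrails use far broader rule sets.

```python
import base64
import codecs

# Toy blocklist for illustration only; real filters match far more patterns.
BLOCKLIST = {"ignore previous instructions"}

def decoded_variants(text: str) -> list[str]:
    """Return the raw text plus best-effort decodings a filter might scan."""
    variants = [text, codecs.decode(text, "rot13")]
    try:
        # validate=True rejects strings that aren't actually base64
        variants.append(base64.b64decode(text, validate=True).decode("utf-8"))
    except Exception:
        pass  # not valid base64 (or not UTF-8); nothing extra to scan
    return variants

def looks_malicious(text: str) -> bool:
    """Flag input if any decoded variant contains a blocked phrase."""
    return any(
        phrase in variant.lower()
        for variant in decoded_variants(text)
        for phrase in BLOCKLIST
    )
```

A filter that only scans the raw string misses the base64-wrapped payload; scanning decoded variants catches it.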

Attack simulation

Data leakage testing

We try to extract information your AI shouldn't reveal.

Cross-tenant retrieval

Attempting to access other users' documents through crafted queries in multi-tenant RAG systems.
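
The control we verify here is that tenant isolation is enforced server-side, not left to the prompt. A toy sketch, with hypothetical names, of keyword retrieval that filters by tenant before ranking; a real system would push this filter down into the vector store query itself:

```python
from dataclasses import dataclass

@dataclass
class Document:
    tenant_id: str
    text: str

def retrieve(query: str, tenant_id: str, index: list[Document]) -> list[Document]:
    """Toy keyword retrieval with tenant isolation applied in code.

    Because the tenant filter runs before any matching, no crafted query
    can widen the candidate set to another tenant's documents."""
    candidates = [d for d in index if d.tenant_id == tenant_id]
    return [d for d in candidates if query.lower() in d.text.lower()]
```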

PII extraction

Probing for personal data that may have been embedded or retrieved unintentionally.

Training data extraction

Attempts to recover verbatim training examples or fine-tuning data from model responses.

Context window leakage

Extracting previous conversation content or system context from the current session.

Metadata exposure

Leaking file paths, database schemas, or internal identifiers through model outputs.

Credential disclosure

Checking if API keys, passwords, or secrets can be extracted from prompts or retrieved content.
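
One mitigation we look for is an output scanner that flags responses matching known credential shapes. A minimal sketch with toy patterns (real scanners use much larger, provider-specific rule sets):

```python
import re

# Illustrative patterns only; not an exhaustive credential rule set.
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),        # generic "sk-..." API key shape
    re.compile(r"AKIA[0-9A-Z]{16}"),           # AWS access key ID shape
    re.compile(r"(?i)password\s*[:=]\s*\S+"),  # inline password assignment
]

def contains_secret(output: str) -> bool:
    """Flag model output that appears to echo a credential."""
    return any(p.search(output) for p in SECRET_PATTERNS)
```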

Attack simulation

Output safety testing

We verify your AI refuses harmful requests and doesn't generate dangerous content.

Harmful content generation

Testing for generation of illegal, violent, or explicitly harmful content.

Policy bypass

Attempts to generate content that violates your stated usage policies.

XSS and injection payloads

Checking if model outputs can contain executable code or injection attacks.
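
The root cause is treating model output as trusted markup. The basic safeguard, sketched here with Python's standard `html.escape`, is to escape output before embedding it in a page so any script or attribute-injection payload renders as inert text:

```python
import html

def render_model_output(raw: str) -> str:
    """Escape model output before embedding it in HTML.

    quote=True also escapes quote characters, which blocks
    attribute-injection payloads, not just <script> tags."""
    return html.escape(raw, quote=True)
```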

Malicious links

Testing if the AI can be tricked into generating phishing URLs or malware links.

Unintended tool execution

For agentic systems: testing if attackers can trigger dangerous function calls.
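
A common control we check in agentic systems is an explicit allowlist between the model and tool execution: any function call the model requests that is not registered gets refused rather than run. A minimal sketch (the dispatcher and registry names are hypothetical):

```python
def dispatch_tool_call(name: str, arguments: dict, registry: dict):
    """Execute a model-requested tool call only if it is allowlisted.

    The registry maps approved tool names to their implementations;
    anything outside it raises instead of executing."""
    if name not in registry:
        raise PermissionError(f"model requested non-allowlisted tool: {name!r}")
    return registry[name](**arguments)
```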

Social engineering outputs

Checking if the AI can be used to craft scam messages or impersonate others.

Control verification

Safeguards we check

Beyond attack simulation, we verify that the following controls are in place:

  • Input filtering: Guardrails that detect and block malicious prompts.
  • Output filtering: Post-processing that catches harmful responses.
  • Access controls: RAG retrieval respects user permissions.
  • Logging: Prompts and responses are logged for incident response.
  • Kill switches: Ability to disable AI features quickly.
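
Of these, the kill switch is the simplest to illustrate: it can be as small as a runtime flag checked on every request, so AI features can be disabled without a deploy. A sketch assuming a hypothetical `AI_FEATURES_ENABLED` environment variable:

```python
import os

def ai_features_enabled() -> bool:
    """Kill switch: flipping AI_FEATURES_ENABLED=0 at runtime disables
    the AI integration for all subsequent requests."""
    return os.environ.get("AI_FEATURES_ENABLED", "1") != "0"
```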

What you get

  • Vulnerability findings with severity ratings
  • Reproduction steps for each finding
  • Remediation guidance
  • Control gap analysis
  • Executive summary for customers

Want the full technical scope?

We can walk through exactly which tests apply to your system architecture.

Schedule a technical call