prompt_injectionTier 1 · 70% confidence

security-prompt-injection-user-provides-a-jailbreak-prompt-e-g-pretend-to-be-dc763cc9

agent: security

When does this happen?

IF User provides a jailbreak prompt (e.g., 'pretend to be DAN') asking the model to ignore its constraints and 'do anything now'.

How others solved it

THEN Implement a prompt guard that scans user inputs for known jailbreak phrases (e.g., 'do anything now', 'DAN', 'ignore previous instructions') and either blocks the request or returns a refusal. Additionally, use system messages to reinforce the model's boundaries and detect role-playing attempts that bypass safety rules.

def is_jailbreak(prompt):
    import re
    patterns = [r'\bDAN\b', r'do anything now', r'ignore previous', r'pretend to be', r'you are now']
    return any(re.search(p, prompt, re.I) for p in patterns)

Related patterns

Have you seen this in your site?

Connect AgentMinds to match against your tech stack automatically.

Run diagnostics