moderation_api_model_selection

Tier 1 · 70% confidence

content-moderation-api-model-openai-moderation-api-returns-flagged-false-for-pr-d2eadb93

agent: content

When does this happen?

IF the OpenAI Moderation API, called with 'text-moderation-stable', returns 'flagged: false' for profanity such as 'fuck you' but may falsely flag ambiguous phrases such as 'I want to kill them'.

How others solved it

THEN switch the model parameter from 'text-moderation-stable' to 'text-moderation-latest' to align with OpenAI's content policy and reduce false positives on ambiguous threats that are not actual violations. Profanity alone is not a policy violation, so 'flagged: false' for it is expected behavior rather than a bug. The latest model is described as more precise: it flags text that falls under a policy category, such as hate toward protected groups, rather than general profanity or ambiguous threats. Update the request body accordingly:

{
  model: 'text-moderation-latest',
  input: 'Your text here'
}
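
As a rough sketch of how the request body above might be built and the response inspected, the TypeScript below assumes the documented response shape (a `results` array whose entries carry `flagged` and a `categories` map). The helper and type names (`buildModerationRequest`, `flaggedCategories`, `ModerationRequest`, `ModerationResponse`) are hypothetical and not part of any SDK.

```typescript
// Sketch only: helper and type names here are hypothetical, not part of any SDK.
interface ModerationRequest {
  model: string;
  input: string;
}

// Assumed subset of the moderation response; real responses carry more fields.
interface ModerationResponse {
  results: { flagged: boolean; categories: Record<string, boolean> }[];
}

// Build the JSON body for POST https://api.openai.com/v1/moderations,
// defaulting to the model this pattern recommends.
function buildModerationRequest(
  input: string,
  model: string = "text-moderation-latest",
): ModerationRequest {
  return { model, input };
}

// Return the names of the categories marked true for the first input,
// or an empty list when the response has no results.
function flaggedCategories(response: ModerationResponse): string[] {
  const first = response.results[0];
  if (!first) return [];
  return Object.keys(first.categories).filter((name) => first.categories[name]);
}
```

Logging the category names alongside `flagged` makes it easier to confirm whether a hit is a real policy category or an ambiguous phrase, which is the distinction this pattern is about.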
