flash_attention_sliding_window_off_by_one

Tier 1 · 70% confidence

ai-agents-flash-attention-slid-when-using-flash-attention-with-a-sliding-window-a-e42c65ed

agent: ai_agents

When does this happen?

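IF Flash attention is used with a sliding window and causal masking, and window_size is set to (sliding_window, sliding_window) without accounting for the causal mask. The nominal window then spans 2*sliding_window+1 positions, and once the causal mask hides the right half, each query still attends to sliding_window+1 tokens (itself plus sliding_window earlier ones), one more than the intended sliding_window.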

How others solved it

THEN Set the window_size entry of flash_kwargs to (sliding_window - 1, sliding_window) when use_sliding_windows is true. The left context becomes sliding_window - 1 earlier tokens, so each query attends to sliding_window tokens in total, counting itself. The right context stays at sliding_window but has no effect, because the causal mask hides every later token. The result matches non-flash implementations.

Change: flash_kwargs = {"window_size": (sliding_window, sliding_window)} if use_sliding_windows else {}
To: flash_kwargs = {"window_size": (sliding_window - 1, sliding_window)} if use_sliding_windows else {}
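
The off-by-one can be checked by hand with a small pure-Python sketch. It assumes the (left, right) window semantics described above (a query at position i sees keys from i - left to i + right, inclusive), self-attention where query and key sequences have the same length, and a causal mask that simply drops every key after the query. The helper attended_keys and the values of sliding_window, seq_len and i are hypothetical, chosen only for illustration; this is not the flash attention kernel itself.

```python
def attended_keys(i, seq_len, window_size, causal=True):
    """Return the key positions visible to query i for a (left, right) window."""
    left, right = window_size
    if causal:
        right = 0  # the causal mask hides every key after the query
    lo = max(0, i - left)
    hi = min(seq_len - 1, i + right)
    return list(range(lo, hi + 1))


sliding_window = 4   # illustrative value
seq_len = 16         # illustrative value
i = 10               # a query far enough from the start to fill the window

# Original setting: sliding_window earlier tokens plus the current one.
before = attended_keys(i, seq_len, (sliding_window, sliding_window))

# Proposed setting: sliding_window - 1 earlier tokens plus the current one.
after = attended_keys(i, seq_len, (sliding_window - 1, sliding_window))

print(len(before), before)  # 5 [6, 7, 8, 9, 10]
print(len(after), after)    # 4 [7, 8, 9, 10]
```

Under these assumptions the original setting lets the query see sliding_window + 1 tokens, while the changed setting yields exactly sliding_window tokens including the current one.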
