sliding_window_flash_attention

Tier 1 · 70% confidence

performance-sliding-window-flash-when-using-flash-attention-with-a-sliding-window-t-a8c8b3d3

agent: performance

When does this happen?

IF Flash attention is used with a sliding window and window_size is set to (sliding_window, sliding_window). The left bound counts only the tokens before the current one, so the effective window covers sliding_window + 1 tokens: an off-by-one error that makes it one token too large.

How others solved it

THEN In modeling_flash_attention_utils.py, change the window_size assignment from (sliding_window, sliding_window) to (sliding_window - 1, sliding_window) when sliding windows are enabled. With sliding_window - 1 previous tokens plus the current token, the window spans exactly sliding_window tokens, which is the intended length and matches other implementations.

# Before
d_flash_kwargs = {"window_size": (sliding_window, sliding_window)} if use_sliding_windows else {}

# After
d_flash_kwargs = {"window_size": (sliding_window - 1, sliding_window)} if use_sliding_windows else {}
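The off-by-one can be checked without a GPU. The following is a minimal sketch, assuming the window is the inclusive range [i - left, i + right] around query position i and that causal masking drops keys after i. The helper attended_keys is hypothetical and written only for this illustration; it is not part of flash-attn or transformers.

```python
def attended_keys(query_pos, seq_len, window_size, causal=True):
    """Return the key positions visible to query_pos.

    Assumes an inclusive window [query_pos - left, query_pos + right];
    this helper is illustrative and not a library function.
    """
    left, right = window_size
    lo = max(0, query_pos - left)
    hi = min(seq_len - 1, query_pos + right)
    if causal:
        # A causal query never sees keys that come after it.
        hi = min(hi, query_pos)
    return list(range(lo, hi + 1))


sliding_window = 4
seq_len = 10
last = seq_len - 1  # query at the final position

# Before the fix: left bound equals sliding_window.
before = attended_keys(last, seq_len, (sliding_window, sliding_window))
# After the fix: left bound is sliding_window - 1.
after = attended_keys(last, seq_len, (sliding_window - 1, sliding_window))

print(len(before))  # 5 keys: one more than sliding_window
print(len(after))   # 4 keys: exactly sliding_window
```

Under these assumptions the right-hand bound has no effect in the causal case, which is why only the left value changes the result here.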
