moe_wna16_kernel_alignment

Tier 1 · 70% confidence

performance-moe-wna16-kernel-ali-runtimeerror-size-k-must-divisible-by-block-size-k-c832c88f

agent: performance

When does this happen?

IF a RuntimeError with the message "size_k must divisible by BLOCK_SIZE_K" is raised during model warm-up while running an AWQ-quantized MoE model with tensor parallelism.

How others solved it

THEN pad the K dimension of the input activation tensor and of the weight tensors (B, B_scale, B_zp) up to the next multiple of BLOCK_SIZE_K before calling moe_wna16_gemm. Pad the activations with torch.nn.functional.pad in fused_moe.py. Pad the weight tensors once at load time so the padding cost is not paid on every forward pass.

# Paraphrase: In the fused_moe kernel, before the gemm call, pad the K dimension:
# pad_k = (BLOCK_SIZE_K - (k_size % BLOCK_SIZE_K)) % BLOCK_SIZE_K
# A_padded = F.pad(A, (0, pad_k), 'constant', 0)
# Similarly pad weight tensors' K dimension and store them.
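
The padding arithmetic in the paraphrase above can be checked without a GPU. The following is a minimal sketch in plain Python, not the actual fused_moe.py change: pad_amount and pad_rows are hypothetical helper names, nested lists stand in for tensors, and the block size of 64 is an assumed example value. In real code the padding itself would be done with torch.nn.functional.pad, as described above.

```python
def pad_amount(size_k: int, block_size_k: int) -> int:
    """Return how many zero columns bring size_k up to a multiple of block_size_k."""
    if block_size_k <= 0:
        raise ValueError("block_size_k must be positive")
    # The outer modulo makes the result 0 when size_k is already aligned.
    return (block_size_k - size_k % block_size_k) % block_size_k


def pad_rows(rows, block_size_k, fill=0.0):
    """Right-pad every row (the K dimension) with `fill`; a stand-in for F.pad(A, (0, pad_k))."""
    if not rows:
        return []
    pad_k = pad_amount(len(rows[0]), block_size_k)
    return [list(row) + [fill] * pad_k for row in rows]


# Hypothetical example: K = 96 per rank, assumed BLOCK_SIZE_K = 64.
activations = [[1.0] * 96, [2.0] * 96]
padded = pad_rows(activations, block_size_k=64)
print(pad_amount(96, 64), len(padded[0]))
```

Because the padded activation columns are zero, they add nothing to the dot products along K as long as the padded weight entries are finite. One thing to verify against the real tensor layout: B_scale and B_zp are usually indexed per quantization group rather than per K element, so their padded length would likely differ from that of B.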
