moe_kernel_misalignment · Tier 1 · 70% confidence

performance-moe-kernel-misalignm-runtimeerror-size-k-must-divisible-by-block-size-k-180862ea

agent: performance

When does this happen?

IF the error "RuntimeError: size_k must divisible by BLOCK_SIZE_K" is raised when running an AWQ-quantized MoE model (e.g., Qwen3-30B-A3B-AWQ) with tensor parallelism.
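A likely reading of the trigger is that tensor parallelism splits a weight dimension across ranks, so the per-rank K can stop being a multiple of the kernel block size even when the unsharded size is. The sketch below only illustrates that arithmetic. The sizes (FULL_K = 768, BLOCK_SIZE_K = 64) and the helper name shard_is_aligned are assumptions chosen for illustration, not values read from any model config or kernel.

```python
def shard_is_aligned(full_k: int, tp_size: int, block_size_k: int) -> bool:
    """Return True if the per-rank K is a multiple of block_size_k.

    Hypothetical helper: assumes K is split evenly across tp_size ranks.
    """
    if full_k % tp_size != 0:
        raise ValueError("full_k must split evenly across ranks")
    per_rank_k = full_k // tp_size
    return per_rank_k % block_size_k == 0


FULL_K = 768        # illustrative unsharded K, an assumption
BLOCK_SIZE_K = 64   # illustrative kernel block size, an assumption

for tp_size in (1, 2, 4, 8):
    per_rank_k = FULL_K // tp_size
    status = "aligned" if shard_is_aligned(FULL_K, tp_size, BLOCK_SIZE_K) else "misaligned"
    print(f"tp={tp_size}: per-rank K={per_rank_k} -> {status}")
```

With these illustrative numbers, only the 8-way split produces a per-rank K (96) that is not a multiple of 64, which is the situation the error message describes.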

How others solved it

THEN pad the K dimension of the input activations and the weight tensors to a multiple of BLOCK_SIZE_K before calling the MoE WNA16 GEMM kernel. For the input activation tensor A, apply torch.nn.functional.pad where the fused MoE kernel is invoked. For the weight tensors (B, B_scale, B_zp), pad them once, either offline or during model loading, so the cost is not paid on every forward pass.

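# Inside fused_moe.py, before calling the kernel.
# A is the activation tensor, shape [num_tokens, hidden_size]; its last
# dimension (K) must be divisible by BLOCK_SIZE_K.
import torch.nn.functional as F

BLOCK_SIZE_K = 64  # or whatever block size the kernel is launched with
# Columns needed to round K up to the next multiple (0 if already aligned).
pad_amount = (BLOCK_SIZE_K - (A.size(-1) % BLOCK_SIZE_K)) % BLOCK_SIZE_K
if pad_amount > 0:
    # Zero-pad only the right side of the last dimension.
    A = F.pad(A, (0, pad_amount))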
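Zero-padding is safe for the result because the extra K entries contribute nothing to each dot product. The following is a minimal pure-Python sketch of that property, using dense float lists rather than real tensors. The helper names (pad_amount_for, pad_row, dot) and the block size of 4 are hypothetical. Real AWQ weights are packed and carry per-group scales and zero points, so the padded shapes of B, B_scale and B_zp depend on a layout that this sketch does not model.

```python
def pad_amount_for(k: int, block_size_k: int) -> int:
    """Zeros needed to round k up to a multiple of block_size_k."""
    return (block_size_k - k % block_size_k) % block_size_k


def pad_row(row: list[float], block_size_k: int) -> list[float]:
    """Append zeros so len(row) becomes a multiple of block_size_k."""
    return row + [0.0] * pad_amount_for(len(row), block_size_k)


def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product over two equal-length rows."""
    return sum(x * y for x, y in zip(a, b))


BLOCK = 4                    # illustrative block size, an assumption
a = [1.0, 2.0, 3.0]          # one activation row, K = 3
w = [4.0, 5.0, 6.0]          # one dense weight column, K = 3

a_padded = pad_row(a, BLOCK)
w_padded = pad_row(w, BLOCK)

# Padding changes the length but not the product.
assert len(a_padded) == len(w_padded) == 4
assert dot(a, w) == dot(a_padded, w_padded)
```

The same formula is used for both operands, so activations and weights always end up with matching padded K.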
