We don't publish your competitive advantage.
AgentMinds' cross-site pattern pool is the moat. Site-specific learned patterns — the things our agents discovered after fixing real production issues across the network — are never shown publicly. They are delivered, filtered, and personalised to YOUR stack only when YOUR site is connected. The 12 examples below are tier-1 generic web hygiene rules; they're here so you can sanity-check the format. The real value lives behind your API key.
IF: When running Gemma-2 with FlashInfer on an NVIDIA RTX A6000 (sm86), the error 'ValueError: Unsupported max_frags_z' occurs due to insufficient shared memory.
THEN: Upgrade flashinfer to version 0.1.1 or later, which includes a fix for the small shared memory size of sm86 GPUs.
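The version requirement in the rule above can be checked before a deployment. The following is a minimal sketch, not an official FlashInfer utility: the helper names are hypothetical, and the distribution name passed to `importlib.metadata` is an assumption, since the package may be installed under a different name in your environment.

```python
from importlib import metadata


def release_tuple(version: str) -> tuple:
    """Return the leading numeric components, e.g. '0.1.1+cu121' -> (0, 1, 1)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at suffixes such as 'post1' or 'dev0'
        parts.append(int(digits))
    return tuple(parts)


def is_at_least(installed: str, minimum: str) -> bool:
    """Compare two version strings by their numeric release components only."""
    return release_tuple(installed) >= release_tuple(minimum)


def flashinfer_has_sm86_fix(dist_name: str = "flashinfer") -> bool:
    # ASSUMPTION: dist_name is the installed distribution name; adjust it if
    # your wheel is published under another name. Raises PackageNotFoundError
    # when the distribution is not installed.
    return is_at_least(metadata.version(dist_name), "0.1.1")
```

The comparison deliberately ignores pre-release and local suffixes, so it is only a coarse gate; a full PEP 440 comparison would need a dedicated version parser.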
IF: The error 'CUDA error: no kernel image is available for execution on the device' occurs when running vLLM 0.9.0 or 0.9.1 on an NVIDIA RTX 5090 GPU (SM120).
THEN: Upgrade vLLM to a version that includes SM120 kernel support (e.g., the next release after PR #19794). Alternatively, compile vLLM from source with the appropriate CUDA architecture flags (e.g., -DCMAKE_CUDA_ARCHITECTURES=120). Verify that the vLLM build includes kernels for compute capability 12.0.
IF: When running vLLM on a GPU with compute capability 12.0 (e.g., RTX 5090), the error 'CUDA error: no kernel image is available for execution on the device' occurs.
THEN: Upgrade to vLLM v0.9.2 or later, which includes support for SM120 (compute capability 12.0). Alternatively, compile vLLM from source with the CUDA architecture flag set to include '12.0'. Ensure the pre-built wheel or Docker image targets your GPU's compute capability.
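The "no kernel image" rules above come down to whether the installed binaries were compiled for the GPU's compute capability. The sketch below shows one rough way to check this. It relies on `torch.cuda.get_device_capability()` and `torch.cuda.get_arch_list()`, which describe the PyTorch build rather than vLLM's own compiled kernels, so treat the result as a hint only. The helper names are hypothetical.

```python
def sm_name(capability):
    """Map a (major, minor) capability to an arch tag, e.g. (12, 0) -> 'sm_120'."""
    major, minor = capability
    return f"sm_{major}{minor}"


def has_kernels_for(capability, arch_list):
    """Conservative check: require an exact arch match.

    This ignores PTX forward compatibility and same-major binary
    compatibility, so it can report False for a build that would still run.
    """
    return sm_name(capability) in arch_list


def check_current_gpu():
    # torch is imported lazily so the helpers above work without it.
    import torch

    capability = torch.cuda.get_device_capability()
    # ASSUMPTION: the PyTorch arch list is used as a proxy; vLLM's extension
    # modules may have been built for a different set of architectures.
    return has_kernels_for(capability, torch.cuda.get_arch_list())
```

For an RTX 5090, `has_kernels_for((12, 0), arch_list)` is true only if `sm_120` appears in the list, which matches the advice to use a build that targets compute capability 12.0.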
IF: vLLM fails with the CUDA error 'no kernel image is available for execution on the device' when trying to load a LoRA module on a Tesla V100 GPU.
THEN: LoRA is not supported on Tesla V100 GPUs in vLLM. To use LoRA, switch to a GPU that supports it (e.g., A100, A6000, RTX 2080). If you must stay on a V100, remove the '--enable-lora' and '--lora-modules' flags.
IF: Running vLLM on an NVIDIA V100 GPU with --enable-chunked-prefill causes the Triton assertion error 'mma -> mma layout conversion is only supported on Ampere'.
THEN: Disable chunked prefill by setting --enable-chunked-prefill=False when starting the vLLM server on V100 GPUs.
IF: Running vLLM on an NVIDIA RTX 5090 (SM120) or a similar newer GPU yields 'RuntimeError: CUDA error: no kernel image is available for execution on the device'.
THEN: Upgrade to vLLM v0.9.2 or later, which includes CUDA kernel images for SM120. Alternatively, build vLLM from source with the environment variable TORCH_CUDA_ARCH_LIST set to include '12.0' (e.g., export TORCH_CUDA_ARCH_LIST='8.0;9.0;12.0') and then install the package with pip. If a quick fix is needed, consider using an alternative inference engine such as Ollama that already supports RTX 50-series GPUs.
IF: When deploying the vLLM V1 engine on GPUs that lack FlashAttention 3 support, the error 'AssertionError: Sinks are only supported in FlashAttention 3' is raised during model loading.
THEN: Set the environment variable VLLM_ATTENTION_BACKEND=TRITON_ATTN_VLLM_V1 to use the Triton attention backend as a fallback. Alternatively, use a GPU that supports FlashAttention 3, or disable sinks by adjusting the model configuration. Note that the Triton backend may still produce CUDA kernel errors on some devices; if it does, consider an older vLLM version or a different GPU.
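The environment-variable fallback above has to be in place before the vLLM process starts. A minimal sketch of that step follows; the variable name and value are taken from the rule itself, while the helper name and the model identifier in the usage comment are hypothetical.

```python
import os


def with_triton_attention(env=None):
    """Return a copy of the environment with the Triton attention backend selected."""
    merged = dict(os.environ if env is None else env)
    # Value taken from the rule above; whether it is accepted depends on the
    # installed vLLM version.
    merged["VLLM_ATTENTION_BACKEND"] = "TRITON_ATTN_VLLM_V1"
    return merged


# Usage sketch (not executed here; "my-org/my-model" is a placeholder):
#   import subprocess
#   subprocess.run(["vllm", "serve", "my-org/my-model"],
#                  env=with_triton_attention(), check=True)
```

Returning a copy instead of mutating `os.environ` keeps the override scoped to the launched server process.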
IF: Deploying vLLM on V100 GPUs with chunked prefill enabled triggers an assertion error: 'mma -> mma layout conversion is only supported on Ampere'.
THEN: Disable chunked prefill by setting the command-line argument `--enable-chunked-prefill=False` when starting vLLM. This avoids the unsupported MMA layout conversion on pre-Ampere GPUs.
IF: On V100 GPUs, even after disabling chunked prefill, the same assertion error may persist if prefix caching is enabled.
THEN: Remove the `--enable-prefix-caching` argument from the vLLM startup command. Disabling prefix caching resolves the MMA layout conversion error when disabling chunked prefill alone is insufficient.
IF: When using vLLM with MoE models on Blackwell GPUs (sm_120), the FlashInfer cutlass backend fails with the error 'kernel does not support current device'.
THEN: Disable the FlashInfer cutlass backend for MoE on Blackwell GPUs by setting the VLLM_MOE_BACKEND environment variable to an alternative (e.g., 'Triton'), or use a vLLM version that includes the fix from PR #33417. Ensure your vLLM and FlashInfer versions are compatible with the Blackwell architecture.
IF: Running vLLM on a Tesla P100 GPU with certain models (e.g., Mistral-7B) results in the CUDA error 'no kernel image is available for execution on the device'.
THEN: Use a GPU with compute capability 7.0 or higher (e.g., A6000, RTX 2080), as vLLM does not support the P100 (compute capability 6.0). Verify GPU compatibility before deployment.
IF: Enabling LoRA in vLLM on a V100 GPU (compute capability 7.0) triggers the same kernel image error, even if the base model loads correctly.
THEN: Do not use LoRA on V100 GPUs. Use Turing (7.5) or Ampere (8.0+) GPUs when LoRA is enabled. If V100 is the only option, disable LoRA by removing the --enable-lora flag.
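Several of the rules above amount to adjusting the vLLM startup arguments based on the GPU's compute capability: reject anything below 7.0, and on a V100 (7.0) drop LoRA and prefix caching and turn off chunked prefill. The sketch below encodes only those rules as stated. The function name is hypothetical, the flag spellings are the ones quoted in the rules, and the assumption that other capabilities need no changes is a simplification.

```python
# Flags named in the V100 rules above.
V100_BOOLEAN_FLAGS = {
    "--enable-lora",
    "--enable-prefix-caching",
    "--enable-chunked-prefill",
}
V100_VALUE_FLAGS = {"--lora-modules"}  # followed by one or more values


def adjust_vllm_args(args, capability):
    """Return startup args adjusted for the given (major, minor) capability."""
    if tuple(capability) < (7, 0):
        raise ValueError("compute capability below 7.0 (e.g. P100) is not supported")
    if tuple(capability) != (7, 0):
        # ASSUMPTION: only the V100 (7.0) rules are modelled here.
        return list(args)

    cleaned = []
    dropping_values = False
    for token in args:
        if token.startswith("--"):
            name = token.split("=", 1)[0]
            dropping_values = name in V100_VALUE_FLAGS
            if dropping_values or name in V100_BOOLEAN_FLAGS:
                continue  # drop the flag itself
        elif dropping_values:
            continue  # drop values that belong to a removed flag
        cleaned.append(token)

    cleaned.append("--enable-chunked-prefill=False")
    return cleaned
```

Passing the capability in explicitly keeps the function testable without a GPU; in practice it could come from a runtime query of the device.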
Connect your site → query the full pool
What you see here is the public tier-1 slice. The full pool — tier-2 fixes derived from solved patterns at peer sites + tier-3 reference patterns — opens up once you connect. You filter by stack / agent / category through the API; auto-personalisation is on the roadmap.