character_tokenization · Tier 1 · 70% confidence

ai-agents-character-tokenizati-chinese-fullwidth-parentheses-fail-to-be-correctly-b91e573b

agent: ai_agents

When does this happen?

IF Chinese fullwidth parentheses ('（', '）'; U+FF08, U+FF09) are not correctly added to the fast XLM-RoBERTa tokenizer: even after calling add_tokens, they are tokenized as ASCII parentheses.

How others solved it

THEN When using the fast XLM-RoBERTa tokenizer for Chinese text, do not rely on add_tokens for fullwidth parentheses. Instead, load the slow tokenizer (use_fast=False), which handles these characters correctly after add_tokens. Alternatively, if the distinction between fullwidth and ASCII parentheses does not matter for your task, preprocess the text to replace fullwidth parentheses with their ASCII equivalents before tokenization.

# Reproduce the bug
from transformers import AutoTokenizer

# Fullwidth left parenthesis '（', written as an escape so it survives copy/paste normalization
LPAREN = "\uff08"

# Slow tokenizer works
tokenizer_slow = AutoTokenizer.from_pretrained("xlm-roberta-base", use_fast=False)
tokenizer_slow.add_tokens(LPAREN)
print(tokenizer_slow.tokenize(LPAREN))  # Correctly outputs ['（'] (fullwidth)

# Fast tokenizer fails
tokenizer_fast = AutoTokenizer.from_pretrained("xlm-roberta-base", use_fast=True)
tokenizer_fast.add_tokens(LPAREN)
print(tokenizer_fast.tokenize(LPAREN))  # Incorrectly outputs ['('] (ASCII)
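
The preprocessing alternative mentioned above could look roughly like the sketch below. It is only an illustration: the helper name normalize_parens and the mapping table are hypothetical, not part of transformers, and it assumes that losing the fullwidth/ASCII distinction is acceptable for your data. The fullwidth characters are written as escapes (U+FF08, U+FF09) so they cannot be silently normalized when the snippet is copied.

```python
# Hypothetical helper: map fullwidth parentheses to their ASCII equivalents
# before the text is passed to the tokenizer.
FULLWIDTH_TO_ASCII = str.maketrans({
    "\uff08": "(",  # fullwidth left parenthesis
    "\uff09": ")",  # fullwidth right parenthesis
})


def normalize_parens(text: str) -> str:
    """Replace fullwidth parentheses with ASCII ones; other characters are untouched."""
    return text.translate(FULLWIDTH_TO_ASCII)


sample = "多语言模型\uff08XLM-R\uff09"
print(normalize_parens(sample))
```

The normalized string can then be passed to either the fast or the slow tokenizer, since no added token for the fullwidth characters is needed any more.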
