Japanese Text Processing in JavaScript: A Complete Guide
From Unicode code points to Intl.Segmenter — how to correctly handle Japanese text in JS.
A guide to handling Japanese text correctly in modern JavaScript.
Character Type Detection
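Classify a character by checking which Unicode range its code point falls in. The ranges below cover the Hiragana and Katakana blocks, CJK Unified Ideographs plus Extension A, and ASCII letters and digits. They do not cover half-width katakana or kanji outside the Basic Multilingual Plane.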
function getCharType(ch) {
  // Classify by the first code point of the string.
  const cp = ch.codePointAt(0);
  if (cp === undefined) return 'other'; // empty string
  if (cp >= 0x3040 && cp <= 0x309F) return 'hiragana'; // Hiragana block
  if (cp >= 0x30A0 && cp <= 0x30FF) return 'katakana'; // Katakana block
  if ((cp >= 0x4E00 && cp <= 0x9FFF) ||   // CJK Unified Ideographs
      (cp >= 0x3400 && cp <= 0x4DBF)) {   // CJK Extension A
    return 'kanji';
  }
  if ((cp >= 0x0041 && cp <= 0x005A) ||   // A-Z
      (cp >= 0x0061 && cp <= 0x007A)) {   // a-z
    return 'alpha';
  }
  if (cp >= 0x0030 && cp <= 0x0039) return 'digit'; // 0-9
  return 'other';
}
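As an alternative to hand-written ranges, the same classification can be sketched with Unicode property escapes in regular expressions (the `u` flag, ES2018). This is a minimal sketch, not a drop-in replacement: `getCharTypeByScript` is a hypothetical name, and its results differ from the range version in a few cases. For example, the prolonged sound mark ー (U+30FC) has the script property Common, so it comes back as 'other' here, while `Script=Han` also matches kanji outside the Basic Multilingual Plane.

```javascript
// Sketch: classify a character by Unicode script property instead of
// hand-written code point ranges. `getCharTypeByScript` is a hypothetical name.
function getCharTypeByScript(ch) {
  // Look only at the first code point, mirroring codePointAt(0).
  const first = [...ch][0];
  if (first === undefined) return 'other'; // empty string
  if (/^\p{Script=Hiragana}$/u.test(first)) return 'hiragana';
  // Script=Katakana also matches half-width katakana such as U+FF71.
  if (/^\p{Script=Katakana}$/u.test(first)) return 'katakana';
  // Script=Han also matches supplementary-plane kanji such as U+20BB7.
  if (/^\p{Script=Han}$/u.test(first)) return 'kanji';
  if (/^[A-Za-z]$/.test(first)) return 'alpha';
  if (/^[0-9]$/.test(first)) return 'digit';
  return 'other';
}

console.log(getCharTypeByScript('あ'));        // hiragana
console.log(getCharTypeByScript('\u{20BB7}')); // kanji
```

If marks like ー should count as kana, `Script_Extensions` can be used in place of `Script`, at the cost of some characters matching more than one script.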
Accurate Character Count
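JavaScript's string.length returns the number of UTF-16 code units, not characters. Characters outside the Basic Multilingual Plane, such as most emoji and some rare kanji, are stored as surrogate pairs and count as 2. Spreading the string iterates by code point, which counts each of them once. Note that this counts code points, not user-perceived characters: a combining sequence still counts as more than 1.
const text = '😀hello';
console.log(text.length); // 7 (UTF-16 code units; the emoji counts as 2)
console.log([...text].length); // 6 (code points)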
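Counting code points still overcounts when one visible character is built from several code points, such as an emoji joined with zero-width joiners or a kana followed by a combining voiced sound mark. A minimal sketch of counting user-perceived characters with Intl.Segmenter at grapheme granularity follows; `countGraphemes` is a hypothetical helper name, and the strings are written as escapes so the invisible code points are explicit.

```javascript
// Sketch: count user-perceived characters (grapheme clusters).
// `countGraphemes` is a hypothetical helper name.
function countGraphemes(text, locale = 'ja') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _ of segmenter.segment(text)) count++;
  return count;
}

// Family emoji: man + ZWJ + woman + ZWJ + girl.
const family = '\u{1F468}\u200D\u{1F469}\u200D\u{1F467}';
// Decomposed "が": か followed by the combining voiced sound mark U+3099.
const ga = '\u304B\u3099';

console.log(family.length, [...family].length, countGraphemes(family)); // 8 5 1
console.log(ga.length, [...ga].length, countGraphemes(ga));             // 2 2 1
```

Which count is appropriate depends on the use case: storage limits are often in code units or bytes, while a visible character counter usually wants graphemes.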
Word Segmentation with Intl.Segmenter
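Japanese is written without spaces between words, so splitting on whitespace does not work. Intl.Segmenter with word granularity finds word boundaries using the engine's built-in locale data. Filtering on isWordLike drops punctuation and whitespace segments.
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const sentence = '東京は日本の首都です。';
const words = [...segmenter.segment(sentence)]
  .filter(s => s.isWordLike)   // drop punctuation such as '。'
  .map(s => s.segment);
// e.g. ['東京', 'は', '日本', 'の', '首都', 'です']
// Exact boundaries depend on the engine's segmentation data.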
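Each segment also carries its start offset in the input, which is useful for highlighting or for mapping tokens back to the source string. The following is a minimal sketch under that assumption; `segmentWithOffsets` is a hypothetical helper name, and the exact word boundaries it returns may vary between engines.

```javascript
// Sketch: return every segment with its UTF-16 start offset and word flag.
// `segmentWithOffsets` is a hypothetical helper name.
function segmentWithOffsets(text, locale = 'ja') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  return Array.from(
    segmenter.segment(text),
    ({ segment, index, isWordLike }) => ({
      segment,                          // the substring for this segment
      index,                            // start offset in UTF-16 code units
      isWordLike: Boolean(isWordLike),  // false for punctuation and spaces
    })
  );
}

const sample = '東京は日本の首都です。';
for (const { segment, index, isWordLike } of segmentWithOffsets(sample)) {
  console.log(index, segment, isWordLike);
}
```

Because no segment is filtered out here, concatenating the `segment` values reproduces the input string exactly.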
Summary
- Use [...str].length for a code-point-based character count
- Use Unicode ranges for hiragana/katakana/kanji detection
- Use Intl.Segmenter for word segmentation without a morphological analyzer