Japanese Text Processing in JavaScript: A Complete Guide

From Unicode code points to Intl.Segmenter — how to correctly handle Japanese text in JS.

A guide to handling Japanese text correctly in modern JavaScript.

Character Type Detection

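Classify a character by checking which Unicode block its code point falls in. The ranges below cover hiragana, katakana, the main CJK Unified Ideographs block plus Extension A, and ASCII letters and digits. Kanji outside the Basic Multilingual Plane (Extension B and later) and full-width alphanumerics fall through to 'other'.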

function getCharType(ch) {
  // Only the first code point of the string is inspected.
  const cp = ch.codePointAt(0);
  if (cp === undefined) return 'other'; // empty string
  if (cp >= 0x3040 && cp <= 0x309F) return 'hiragana'; // Hiragana block
  if (cp >= 0x30A0 && cp <= 0x30FF) return 'katakana'; // Katakana block
  if ((cp >= 0x4E00 && cp <= 0x9FFF) ||                // CJK Unified Ideographs
      (cp >= 0x3400 && cp <= 0x4DBF)) return 'kanji';  // CJK Extension A
  if ((cp >= 0x0041 && cp <= 0x005A) ||                // ASCII A-Z
      (cp >= 0x0061 && cp <= 0x007A)) return 'alpha';  // ASCII a-z
  if (cp >= 0x0030 && cp <= 0x0039) return 'digit';    // ASCII 0-9
  return 'other';
}
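
As an alternative to hand-maintained ranges, Unicode property escapes in regular expressions (ES2018) can classify by script. The following is a minimal sketch rather than a drop-in replacement: the name getScriptType is hypothetical, and it only distinguishes the three Japanese scripts. Because Script=Han covers every CJK ideograph block, it also recognizes kanji outside the Basic Multilingual Plane such as 𠮷 (U+20BB7).

```javascript
// Hypothetical helper: classify the first code point of `ch` by Unicode script.
// The `u` flag is required for \p{...} property escapes.
const HIRAGANA = /^\p{Script=Hiragana}/u;
const KATAKANA = /^\p{Script=Katakana}/u;
const KANJI = /^\p{Script=Han}/u;

function getScriptType(ch) {
  if (HIRAGANA.test(ch)) return 'hiragana';
  if (KATAKANA.test(ch)) return 'katakana';
  if (KANJI.test(ch)) return 'kanji';
  return 'other';
}

console.log(getScriptType('あ')); // hiragana
console.log(getScriptType('ア')); // katakana
console.log(getScriptType('漢')); // kanji
console.log(getScriptType('𠮷')); // kanji (U+20BB7, a surrogate pair in UTF-16)
```

The two approaches do not agree on every character. The prolonged sound mark ー (U+30FC), for example, sits in the Katakana block but is assigned Script=Common, so the range check calls it katakana while this sketch would report 'other'.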

Accurate Character Count

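JavaScript's string.length counts UTF-16 code units, not characters. Any character outside the Basic Multilingual Plane, including most emoji and rare kanji such as 𠮷 (U+20BB7), is stored as a surrogate pair and counts as 2. Spreading the string iterates by code point instead.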

const text = '😀hello';
console.log(text.length);      // 7 (UTF-16 code units; 😀 counts as 2)
console.log([...text].length); // 6 (code points)
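
Code points are still not always what a reader perceives as one character. A kana followed by a combining voiced sound mark, or an emoji built from a ZWJ sequence, spans several code points. When user-perceived characters matter (for example, a length limit in a form), counting grapheme clusters with Intl.Segmenter is one option. The sketch below assumes a runtime that supports Intl.Segmenter; countGraphemes is a hypothetical helper name.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(str, locale = 'ja') {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  let count = 0;
  for (const _segment of segmenter.segment(str)) count++;
  return count;
}

// か (U+304B) followed by the combining dakuten (U+3099) renders as one character.
const decomposedGa = '\u304B\u3099';
console.log(decomposedGa.length);          // 2 UTF-16 code units
console.log([...decomposedGa].length);     // 2 code points
console.log(countGraphemes(decomposedGa)); // 1 grapheme cluster
```

Creating a segmenter has some cost, so code that counts many strings may prefer to build one instance and reuse it.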

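Word Segmentation with Intl.Segmenter

Japanese is written without spaces between words, so splitting on whitespace does not work. Intl.Segmenter, part of the ECMAScript Internationalization API, finds word boundaries using the runtime's built-in locale data. With granularity 'word', each segment carries an isWordLike flag that is false for punctuation and whitespace.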

const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const text = '東京は日本の首都です。';

const words = [...segmenter.segment(text)]
  .filter(s => s.isWordLike)
  .map(s => s.segment);
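// Typical result: ['東京', 'は', '日本', 'の', '首都', 'です']
// (the trailing 。 is dropped by the isWordLike filter; exact boundaries can vary by engine)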
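
Each segment object also carries an index, the offset of the segment in the original string in UTF-16 code units. That is useful for highlighting matches or mapping tokens back to the source text. A minimal sketch follows, with the hypothetical helper name segmentWords and a feature check for runtimes that lack Intl.Segmenter:

```javascript
// Hypothetical helper: return word-like segments with their start and end offsets.
function segmentWords(text, locale = 'ja') {
  if (typeof Intl === 'undefined' || typeof Intl.Segmenter !== 'function') {
    throw new Error('Intl.Segmenter is not available in this runtime');
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  const words = [];
  for (const { segment, index, isWordLike } of segmenter.segment(text)) {
    // Skip punctuation and whitespace segments.
    if (isWordLike) {
      words.push({ word: segment, start: index, end: index + segment.length });
    }
  }
  return words;
}

const sentence = '東京は日本の首都です。';
for (const { word, start, end } of segmentWords(sentence)) {
  // The offsets always slice back to the same word.
  console.log(word, start, end, sentence.slice(start, end) === word);
}
```

The offsets are in UTF-16 code units, so they can be passed directly to String.prototype.slice even when the text contains surrogate pairs.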

Summary

  • Use [...str].length to count code points rather than UTF-16 code units
  • Use Unicode code point ranges to detect hiragana, katakana, and kanji
  • Use Intl.Segmenter for word segmentation without a separate morphological analyzer