python - Normalize composite/decomposable/variable-length characters (unicode/python3.4)


I stumbled upon http://mortoray.com/2013/11/27/the-string-type-is-broken/

and, to my horror...

print(len('noe\u0308l')) # returns 5 not 4 

However, I found https://stackoverflow.com/a/14682498/1267259, on normalizing Unicode:

from unicodedata import normalize
print(len(normalize('NFC', 'noe\u0308l')))  # returns 4

But what do I do with Schrödinger's cats?

print(len('😸😾')) # returns 4 not 2 

(Side question: in my text editor, when I try to save, I get "utf-8 codec can't encode character x in position y: surrogates not allowed", but in the command prompt I can paste and run code with these characters. I assume this is because the cats exist on a different quantum level (SMP). How do I normalize them?)

Is there anything else I should do to make sure all characters are counted as "1"?

Your editor is producing surrogate pairs, not the actual code points, which is why you are getting that warning. Use:

'\U0001F638\U0001F63E'

to define the cats without resorting to surrogates.
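As a quick check, here is a minimal sketch (assuming Python 3.3 or later, where a str is indexed by code point) comparing the eight-digit \U escape with the four-digit \u escape and with an explicit surrogate pair. The variable names are only for illustration:

```python
# \U takes exactly eight hex digits and can name any code point.
cats = '\U0001F638\U0001F63E'
print(len(cats))          # 2
print(hex(ord(cats[0])))  # 0x1f638

# \u takes exactly four hex digits, so this is U+0001 followed by the text 'f638'.
not_a_cat = '\u0001f638'
print(len(not_a_cat))     # 5

# A surrogate pair written out by hand is stored as two separate code points.
pair = '\ud83d\ude38'
print(len(pair))          # 2
```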

If you do have a string with surrogates, you can recode these via UTF-16, allowing the surrogates to be encoded with the 'surrogatepass' error handler:

>>> # \U0001F638 is \uD83D\uDE38 when using UTF-16 surrogates
>>> '\ud83d\ude38'.encode('utf16', 'surrogatepass').decode('utf16')
'😸'
>>> len(_)
1
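The same round trip can be wrapped in a small function. This is only a sketch: merge_surrogates is a hypothetical name, and it assumes Python 3.4 or later, where the UTF-16 codec accepts the 'surrogatepass' handler:

```python
def merge_surrogates(text):
    """Hypothetical helper: combine UTF-16 surrogate pairs into real code points."""
    # Encoding with 'surrogatepass' writes each surrogate as a plain 16-bit unit;
    # decoding the result as UTF-16 then reads each pair as one code point.
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


broken = '\ud83d\ude38\ud83d\ude3e'  # two cats, written as surrogate pairs
fixed = merge_surrogates(broken)

print(len(broken), len(fixed))          # 4 2
print(fixed == '\U0001F638\U0001F63E')  # True
```

Strings without surrogates pass through unchanged. A surrogate that has no partner should be expected to fail on the decode step, so this only helps with properly paired surrogates.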

From the Error Handlers documentation:

'surrogateescape'
On decoding, replace byte with individual surrogate code ranging from U+DC80 to U+DCFF. This code will then be turned back into the same byte when the 'surrogateescape' error handler is used when encoding the data. (See PEP 383 for more.)
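Note that 'surrogateescape' solves a different problem from 'surrogatepass': it smuggles undecodable bytes through a str so they can be written back unchanged. A minimal sketch of that round trip, using a made-up Latin-1 byte string as the sample input:

```python
raw = b'caf\xe9'  # Latin-1 bytes; \xe9 on its own is not valid UTF-8

# The undecodable byte 0xE9 becomes the lone surrogate U+DCE9.
text = raw.decode('utf-8', 'surrogateescape')
print(ascii(text))  # 'caf\udce9'

# Encoding with the same handler turns the surrogate back into the original byte.
print(text.encode('utf-8', 'surrogateescape') == raw)  # True
```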