Discussion:
BUG #7999: Regexp with utf8
s***@gmail.com
2013-03-27 10:32:57 UTC
The following bug has been logged on the website:

Bug reference: 7999
Logged by: david
Email address: ***@gmail.com
PostgreSQL version: 9.1.8
Operating system: linux
Description:


\y and \Y do not behave correctly next to
multibyte UTF-8 characters; they seem to invert their senses:

Proper behaviour with ASCII e
'es'~$$\y[eɛ]s$$ => t
Inverted behaviour with epsilon
'ɛs'~$$\y[eɛ]s$$ => f
'ɛs'~$$[eɛ]\ys$$ => t
'ɛs'~$$[eɛ]\Ys$$ => f

This seems to be a case of utf8 characters not being recognised as
word-forming:

'ɛ'~$$\w$$ => f

I've checked a few other characters that take more than one byte in UTF-8.
U+00F0 counts as \w, but nothing I've tried above U+00FF matches. I wonder
if it has something to do with code points >= 256?

In case anyone else hits this bug, replacing \y with
(^|$|\s|[[:punct:]]) seems to work for me, although it's ugly.
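
One caveat with that substitution can be shown with a minimal sketch, written here in Python rather than PostgreSQL. Python's re module has no [[:punct:]], so string.punctuation (ASCII only) stands in for it, and the names PUNCT_CLASS, BOUNDARY and find_word are made up for illustration. Unlike \y, which is zero-width, the alternation consumes the delimiter, so the matched text can start with a space or punctuation mark.

```python
import re
import string

# ASCII-only stand-in for the POSIX class [[:punct:]] (an assumption of this sketch).
PUNCT_CLASS = "[" + re.escape(string.punctuation) + "]"

# Hypothetical translation of the suggested replacement for a leading \y.
BOUNDARY = r"(?:^|$|\s|" + PUNCT_CLASS + ")"


def find_word(text):
    """Return the text matched by BOUNDARY followed by [eɛ]s, or None."""
    m = re.search(BOUNDARY + "[eɛ]s", text)
    return m.group(0) if m else None


print(repr(find_word("ɛs")))    # matches at the start of the string
print(repr(find_word("xɛs")))   # no delimiter before ɛ, so no match
print(repr(find_word("a ɛs")))  # the match includes the leading space
```

If the matched text is used afterwards (for example in a replacement), the consumed delimiter has to be put back by hand, for instance by capturing it in a group.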
--
Sent via pgsql-bugs mailing list (pgsql-***@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-bugs
Tom Lane
2013-03-27 16:19:57 UTC
Post by s***@gmail.com
PostgreSQL version: 9.1.8
I've checked a few other characters that take more than one byte in UTF-8.
U+00F0 counts as \w, but nothing I've tried above U+00FF matches. I wonder
if it has something to do with code points >= 256?
Yup. This is partially resolved in PG 9.2, but will never be fixed in
older branches. From the commit log:

Also, remove the hard-wired limitation to not consider wctype.h results for
character codes above 255. It turns out that we can't push the limit as
far up as I'd originally hoped, because the regex colormap code is not
efficient enough to cope very well with character classes containing many
thousand letters, which a Unicode locale is entirely capable of producing.
Still, we can push it up to U+7FF (which I chose as the limit of 2-byte
UTF8 characters), which will at least make Eastern Europeans happy pending
a better solution. Thus, this commit resolves the specific complaint in
bug #6457, but not the more general issue that letters of non-western
alphabets are mostly not recognized as matching [[:alpha:]].

regards, tom lane
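
For readers trying to work out which side of those limits a given character falls on, here is a rough Python sketch. It only restates the two thresholds mentioned in this thread (255 for branches before 9.2, U+7FF for 9.2) and does not query PostgreSQL; the constant and function names are hypothetical.

```python
# Limits as described in the thread; all names here are made up for illustration.
OLD_LIMIT = 0xFF    # before 9.2: wctype.h results ignored above code 255
NEW_LIMIT = 0x7FF   # 9.2: limit raised to the last 2-byte UTF-8 code point


def describe(ch):
    """Return (code point label, UTF-8 byte length, coverage) for one character."""
    cp = ord(ch)
    if cp <= OLD_LIMIT:
        coverage = "pre-9.2 and 9.2"
    elif cp <= NEW_LIMIT:
        coverage = "9.2 only"
    else:
        coverage = "neither"
    return (f"U+{cp:04X}", len(ch.encode("utf-8")), coverage)


for ch in "eðɛя中":
    print(ch, *describe(ch))
```

By this reckoning the epsilon from the report (U+025B) sits between the two limits, which fits the statement that the problem is partially resolved in 9.2, while characters that need three or more UTF-8 bytes remain outside both.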