Channel: Recent Questions - Stack Overflow

Bash regex vs. diacritics

I do some regex checking in Bash to make sure that a string contains only sane characters and I encountered this strange behavior.

grep behaves the same way.

Am I doing something wrong, or is this a bug? If it is a bug, where should I report it?


Lowercase š is wrongly detected as a character between a-z:

    $ [[ 'š' =~ ^[a-z]$ ]] && echo sane || echo nope
    sane
    $ [[ 'š' =~ [a-z] ]] && echo sane || echo nope
    sane
    $ grep '^[a-z]$' <<<'š' && echo sane || echo nope
    š
    sane

Lowercase ž is correctly detected as not a character between a-z:

    $ [[ 'ž' =~ ^[a-z]$ ]] && echo sane || echo nope
    nope
    $ [[ 'ž' =~ [a-z] ]] && echo sane || echo nope
    nope
    $ grep '^[a-z]$' <<<'ž' && echo sane || echo nope
    nope

Capital Š is correctly detected as not a character between a-z:

    $ [[ 'Š' =~ ^[a-z]$ ]] && echo sane || echo nope
    nope
    $ [[ 'Š' =~ [a-z] ]] && echo sane || echo nope
    nope
    $ grep '^[a-z]$' <<<'Š' && echo sane || echo nope
    nope

Capital Š is wrongly detected as a character between A-Z:

    $ [[ 'Š' =~ ^[A-Z]$ ]] && echo sane || echo nope
    sane
    $ [[ 'Š' =~ [A-Z] ]] && echo sane || echo nope
    sane
    $ grep '^[A-Z]$' <<<'Š' && echo sane || echo nope
    Š
    sane

My Bash version:

GNU bash, version 5.1.8(1)-release (x86_64-redhat-linux-gnu)

My grep version:

grep (GNU grep) 3.6

My locale:

    $ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
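
For comparison, here is a minimal sketch of two checks that should not depend on the locale shown above. It assumes that the en_US.UTF-8 collation data is what influences the range: as far as I can tell, POSIX only defines range expressions such as [a-z] for the POSIX (C) locale and leaves other locales to the implementation. The function names are hypothetical and used only for illustration; this is not a confirmed diagnosis of the behavior.

```shell
#!/usr/bin/env bash

# Hypothetical helper: succeeds only for a single ASCII lowercase letter.
# Assumption: with LC_ALL=C in effect for the call, [a-z] is compared by
# byte value, so a multibyte character such as 'š' cannot fall in the range.
is_ascii_lower_c_locale() {
    local LC_ALL=C
    [[ $1 =~ ^[a-z]$ ]]
}

# Hypothetical helper: same intent, but without a range expression.
# Listing every letter explicitly avoids relying on collation order.
is_ascii_lower_explicit() {
    [[ $1 =~ ^[abcdefghijklmnopqrstuvwxyz]$ ]]
}

# Usage: run both checks on a plain letter and on the characters from above.
for ch in a 'š' 'ž'; do
    is_ascii_lower_c_locale "$ch" && echo "$ch: sane" || echo "$ch: nope"
    is_ascii_lower_explicit "$ch" && echo "$ch: sane" || echo "$ch: nope"
done
```

If both helpers reject 'š' while the plain [a-z] range accepts it, that would point at the locale's collation order rather than at the regex syntax itself.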
