I am doing some regex checking in Bash to make sure that a string contains only sane characters, and I encountered this strange behavior. It looks the same in grep.
Am I doing something wrong, or is it a bug? If it is a bug, where should I report it?
Lowercase š is wrongly detected as a character between a-z:

    $ [[ 'š' =~ ^[a-z]$ ]] && echo sane || echo nope
    sane
    $ [[ 'š' =~ [a-z] ]] && echo sane || echo nope
    sane
    $ grep '^[a-z]$' <<<'š' && echo sane || echo nope
    š
    sane

Lowercase ž is correctly detected as not a character between a-z:

    $ [[ 'ž' =~ ^[a-z]$ ]] && echo sane || echo nope
    nope
    $ [[ 'ž' =~ [a-z] ]] && echo sane || echo nope
    nope
    $ grep '^[a-z]$' <<<'ž' && echo sane || echo nope
    nope

Capital Š is correctly detected as not a character between a-z:

    $ [[ 'Š' =~ ^[a-z]$ ]] && echo sane || echo nope
    nope
    $ [[ 'Š' =~ [a-z] ]] && echo sane || echo nope
    nope
    $ grep '^[a-z]$' <<<'Š' && echo sane || echo nope
    nope

Capital Š is wrongly detected as a character between A-Z:

    $ [[ 'Š' =~ ^[A-Z]$ ]] && echo sane || echo nope
    sane
    $ [[ 'Š' =~ [A-Z] ]] && echo sane || echo nope
    sane
    $ grep '^[A-Z]$' <<<'Š' && echo sane || echo nope
    Š
    sane

My Bash version:
GNU bash, version 5.1.8(1)-release (x86_64-redhat-linux-gnu)
My grep version:
grep (GNU grep) 3.6
My locale:

    $ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=
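
A possible way to narrow this down (a sketch, not a confirmed diagnosis) is to rerun the same grep check with `LC_ALL=C`, where a range such as `[a-z]` is compared by byte value, and compare the result with the `en_US.UTF-8` one. `check_az` is a hypothetical helper name introduced only for this illustration.

```bash
# Hypothetical helper: test one character against ^[a-z]$ under a given locale.
# $1 = locale name, $2 = character to test
check_az() {
    if printf '%s\n' "$2" | LC_ALL="$1" grep -q '^[a-z]$'; then
        echo sane
    else
        echo nope
    fi
}

# In the C locale, š is two bytes, neither of them in a-z, so no match.
check_az C 'š'              # prints "nope"
check_az C 'a'              # prints "sane"

# Result depends on the system's locale data, so no expected output is shown.
check_az en_US.UTF-8 'š'
```

If the `en_US.UTF-8` result differs from the `C` result for the same character, that would suggest the behavior comes from locale-dependent range handling rather than from Bash or grep individually.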