String Operations in Shell

Published: 2019-05-26


These notes are partially based on lecture notes by Professor Nikolai Bezroukov at FDU.

String operators allow you to manipulate the contents of a variable without resorting to AWK or Perl. Modern shells such as bash 3.x or ksh93 support most of the standard string manipulation functions, but in a very perverse, idiosyncratic way. Still, the standard string functions are available. Strings can be concatenated by juxtaposition and by using double-quoted strings. You can ensure that variables exist (i.e., are defined and have non-null values), set default values for variables, and catch errors that result from variables not being set. You can also perform basic pattern matching. There are several basic string operations available in bash, ksh93 and similar shells:

- ${var:-bar}
- ${var:=bar}

String operators in shell use a curly-bracket syntax that is unique among programming languages. In shell any variable can be written as ${name_of_the_variable} instead of $name_of_the_variable. This notation is most often used to protect a variable name from merging with the string that comes after it. Here is an example in which it is used to separate the variable $var from the string "_string":

    $ export var='test'
    $ echo ${var}_string   # var is a variable that uses syntax ${var} and its value will be substituted
    test_string
    $ echo $var_string     # var_string is a variable that doesn't exist, so echo doesn't print anything

In the Korn shell (ksh88) this notation was extended to allow expressions inside the curly brackets, for example ${var=moo}. Each operation is encoded using a special symbol or a pair of symbols (a "digram", for example :-, :=, etc.). An argument that the operator may need is positioned after the symbol of the operation. Later this notation was extended in ksh93 and adopted by bash and other shells.

This "ksh-originated" group of operators is probably the most widely used group of string-handling operators, so it makes sense to learn them, if only to be able to modify old scripts. Bash 3.0 and later has the =~ operator with "normal" regular expressions (POSIX extended syntax, largely familiar from Perl) that can be used instead in many cases, and it is definitely preferable in new scripts that you might write. Let's say we need to establish whether variable $x appears to be a social security number:

    if [[ $x =~ ^[0-9]{3}-[0-9]{2}-[0-9]{4}$ ]]
    then
        echo "process SSN"            # process SSN here
    else
        echo "not an SSN: $x" >&2     # print error message
    fi

These operators can test for the existence of variables and allow substitution of default values under certain conditions.

Note: The colon (:) in each of these operators is actually optional. If the colon is omitted, then change "exists and isn't null" to "exists" in each definition, i.e., the operator tests for existence only.
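The effect of dropping the colon can be seen in a minimal sketch. The variable names `empty`, `missing`, `with_colon`, `without_colon` and `unset_case` are made up for this illustration and are not part of the original notes:

```shell
#!/bin/bash
# Illustrative names only: 'empty' is set but null, 'missing' is not set at all.
empty=""
unset missing

with_colon=${empty:-default}      # null counts as "not set", so the default is used
without_colon=${empty-default}    # the variable exists, so its (empty) value is kept
unset_case=${missing-default}     # the variable does not exist, so the default is used

echo "with colon:    '$with_colon'"
echo "without colon: '$without_colon'"
echo "unset:         '$unset_case'"
```

The colon-less forms are handy when an empty string is a legitimate value that should not be overridden.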

Bash and ksh also provide some (limited) regular expression functionality called pattern-matching operators.

The notation, introduced in ksh88, was and still is very idiosyncratic. In the examples below we will assume that the variable var has the value "this is a test" (as produced by executing the statement export var="this is a test").

${var#t*is}: deletes the shortest possible match from the left
```
echo ${var#t*is}     # prints: is a test
```

${var##t*is}: deletes the longest possible match from the left
```
echo ${var##t*is}    # prints: a test
```

${var%t*st}: deletes the shortest possible match from the right
```
echo ${var%t*st}     # prints: this is a
```

${var%%t*st}: deletes the longest possible match from the right
```
echo ${var%%t*st}    # prints an empty string, as the whole value starting at the first word is matched
```

Although the # and % operators look arbitrary, they have a convenient mnemonic if you use a US keyboard. The # key is to the left of the $ key and operates from the left, while % is to its right and is usually written to the right of a number, as in 100%. Also, the C preprocessor uses "#" as a prefix to identify preprocessor statements (#define, #include).

Despite shell deficiencies in this area and idiosyncrasies preserved from the 1970s, most classic string operations can be implemented in shell. You can define functions that behave almost exactly like those in Perl or another "more normal" language. In case shell facilities are not enough, you can use AWK or Perl. It's actually sad that AWK was not integrated into shell.

There are several ways to get the length of a string.

The simplest one is ${#varname}, which returns the length of the value of the variable as a character string. For example, if filename has the value fred.c, then ${#filename} would have the value 6.

The second is to use the expr command (an external utility rather than a shell built-in; its length keyword is not required by POSIX), for example
```
expr length "$string"
```
or
```
expr "$string" : '.*'
```

Additional example:

    stringZ=abcABC123ABCabc
    echo ${#stringZ}                 # 15
    echo `expr length $stringZ`      # 15
    echo `expr "$stringZ" : '.*'`    # 15

A more complex example: here is a function for validating that a string is within a given maximum length. It requires two parameters, the actual string and the maximum length the string should be.

    # check_length
    # to call: check_length string max_length_of_string
    check_length()
    {
        # check we have the right params
        if (( $# != 2 )) ; then
            echo "check_length needs two parameters: a string and max_length"
            return 1
        fi
        # fail if the string is longer than the allowed maximum
        if (( ${#1} > $2 )) ; then
            return 1
        fi
        return 0
    }

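You could call the function check_length like this:

    #!/usr/bin/bash
    # test_name (check_length must be defined earlier in this script)
    while :
    do
        echo -n "Enter customer name :"
        read NAME
        # call the function directly; its exit status drives the && test
        check_length "$NAME" 10 && break
        echo "The string $NAME is longer than 10 characters"
    done
    echo $NAME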

Determining the length of the substring matched at the beginning of a string is a rarely used capability of expr, but it can still be useful sometimes:

    expr match "$string" "$substring"

where:

$string is any variable or literal string.

$substring is a regular expression.

    my_regex=abcABC123ABCabc
    #        |------|
    echo `expr match "$my_regex" 'abc[A-Z]*.2'`   # 8
    echo `expr "$my_regex" : 'abc[A-Z]*.2'`       # 8

The index function of expr returns the position in the string, counting from one, of the first character that also occurs in the given set of characters, or 0 if none of them is found.

    expr index "$string" "$substring"

The result is the numerical position in $string of the first character in $substring that matches.

    stringZ=abcABC123ABCabc
    echo `expr index "$stringZ" C12`             # 6
                                                 # C position.
    echo `expr index "$stringZ" c`               # 3
                                                 # 'c' (in #3 position)

This is the close equivalent of strchr() in C.
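When spawning expr is undesirable, a similar result can be obtained with the pattern-matching operators alone. The following is only a sketch: the function name `str_index` is made up for this illustration, and it assumes the character set contains nothing that is special inside a bracket expression (such as `]`, `^` or `-`).

```shell
#!/bin/bash
# str_index STRING CHARS  (hypothetical helper, not a standard command)
# Prints the 1-based position of the first character of STRING that
# occurs in CHARS, or 0 if there is none (same convention as expr index).
str_index() {
    local string=$1 chars=$2
    local prefix=${string%%[$chars]*}   # everything before the first hit
    if [[ $prefix == "$string" ]]; then
        echo 0                          # nothing was removed, so no match
    else
        echo $(( ${#prefix} + 1 ))
    fi
}

str_index abcABC123ABCabc C12
str_index abcABC123ABCabc c
str_index abcABC123ABCabc z
```

For the three calls above the positions are 6, 3 and 0 respectively, matching what expr index reports for the same arguments.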

The substring function is available as one of the pattern-matching operators in shell and has the form ${param:offset[:length]}.

If offset evaluates to a number less than zero, it counts back from the end of the string held in the variable $param.

Notes:

This pattern-matching operator uses zero-based indexing.

When you specify a negative offset as a numeric literal with a minus sign in front, unexpected things can happen. Consider
```
a=12345678
echo ${a:-4}
```
intending to print the last four characters of $a. The problem is that ${param:-word} already has a special meaning in shell: it substitutes the value after the minus sign if the variable param is undefined or null. To use negative offsets that begin with a minus sign, separate the minus sign and the colon with a space.
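A short bash sketch of the three spellings side by side; the variable names `default_form`, `last_four` and `also_last_four` are illustrative only:

```shell
#!/bin/bash
a=12345678

default_form=${a:-4}       # "use default" operator: a is set, so this is just $a
last_four=${a: -4}         # space after the colon: substring, last four characters
also_last_four=${a:(-4)}   # parentheses around the offset work as well

echo "$default_form"       # 12345678
echo "$last_four"          # 5678
echo "$also_last_four"     # 5678
```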

${string:position}

Extracts the substring from $string starting at $position.

If the $string parameter is "*" or "@", then this extracts the positional parameters, starting at $position.

${string:position:length}

Extracts $length characters of substring from $string at $position.

    stringZ=abcABC123ABCabc
    #       0123456789.....
    #       0-based indexing.
    echo ${stringZ:0}                            # abcABC123ABCabc
    echo ${stringZ:1}                            # bcABC123ABCabc
    echo ${stringZ:7}                            # 23ABCabc
    echo ${stringZ:7:3}                          # 23A
                                                 # Three characters of substring.

If the $string parameter is "*" or "@", then this extracts a maximum of $length positional parameters, starting at $position.

    echo ${*:2}          # Echoes second and following positional parameters.
    echo ${@:2}          # Same as above.
    echo ${*:2:3}        # Echoes three positional parameters, starting at second.

    expr substr $string $position $length

Extracts $length characters from $string starting at $position.

The first character has index one.

    stringZ=abcABC123ABCabc
    #       123456789......
    #       1-based indexing.
    echo `expr substr $stringZ 1 2`              # ab
    echo `expr substr $stringZ 4 3`              # ABC

You can search and replace substring in a variable using ksh syntax:

    alpha='This is a test string in which the word "test" is replaced.'
    beta="${alpha/test/replace}"

The string "beta" now contains an edited version of the original string in which the first occurrence of the word "test" has been replaced by "replace". To replace all occurrences, not just the first, use this syntax:

beta="${alpha//test/replace}"

Note the double "//" symbol.

Here is an example in which we replace one string with another in a multi-line block of text:

    list="cricket frog cat dog"
    poem="I wanna be a x\n\
    A x is what I'd love to be\n\
    If I became a x\n\
    How happy I would be.\n"
    for critter in $list; do
        echo -e ${poem//x/$critter}
    done

Strings can be concatenated by juxtaposition and using double quoted strings. For example

PATH="$PATH:/usr/games"

A double-quoted string in shell is almost identical to a double-quoted string in Perl and performs expansion of all variables in it. The minor difference is the treatment of escaped characters. If you want backslash escapes interpreted as in Perl, you can use $'string' (note that variables are not expanded inside it):

    #!/bin/bash
    # String expansion. Introduced with version 2 of Bash.
    # Strings of the form $'xxx' have the standard escaped characters interpreted.
    echo $'Ringing bell 3 times \a \a \a'
         # May only ring once with certain terminals.
    echo $'Three form feeds \f \f \f'
    echo $'10 newlines \n\n\n\n\n\n\n\n\n\n'
    echo $'\102\141\163\150'   # Bash
                               # Octal equivalent of characters.
    exit 0

In bash 3.1, a string append operator (+=) was added:

    PATH+=":$HOME/bin"   # $HOME is used because a tilde is not expanded inside double quotes
    echo "$PATH"

Using the wildcard character (?), you can imitate the Perl chop function (which removes the last character of a string) quite easily:

    test="~/bin/"
    trimmed_last=${test%?}
    trimmed_first=${test#?}
    echo "original='$test', trimmed_first='$trimmed_first', trimmed_last='$trimmed_last'"

The first character of a string can also be obtained with printf:

printf -v char "%c" "$source"

Conditional chopping, as in the Perl chomp function or the REXX trim function, can be done using a while loop, for example:

    function trim
    {
        target=$1
        while :   # this is an infinite loop
        do
            case $target in
                ' '*) target=${target#?} ;;   ## if $target begins with a space remove it
                *' ') target=${target%?} ;;   ## if $target ends with a space remove it
                *) break ;;                   # no more leading or trailing spaces, so exit the loop
            esac
        done
        # return accepts only a numeric status, so print the result instead
        echo "$target"
    }

A more Perl-style method to trim trailing blanks would be

    spaces=${source_var##*[! ]}           ## get the trailing blanks in var $spaces
    trimmed_var=${source_var%"$spaces"}   ## remove them from the right end of the string

The same trick can be used for removing leading spaces.
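A sketch of the leading-space variant, combined with the trailing one so that both ends are trimmed. The names `leading`, `no_leading`, `trailing` and `trimmed` and the sample value are made up for this illustration:

```shell
#!/bin/bash
source_var="   some text   "

# Leading blanks: delete the longest suffix that starts with a non-space,
# which leaves only the run of spaces at the front.
leading=${source_var%%[! ]*}
no_leading=${source_var#"$leading"}

# Trailing blanks: delete the longest prefix that ends with a non-space,
# which leaves only the run of spaces at the end.
trailing=${no_leading##*[! ]}
trimmed=${no_leading%"$trailing"}

echo "[$trimmed]"   # [some text]
```

Quoting "$leading" and "$trailing" inside the expansion makes the shell treat them as literal text rather than as patterns.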

The operator ${var:-bar} is useful for giving a variable a default value. It works in the following way: if $var exists and is not null, return $var. If it doesn't exist or is null, return bar.

Example:

    $ export var=""
    $ echo ${var:-one}
    one
    $ echo $var


More complex example:

sort -nr $1 | head -${2:-10}

A typical use is when you need to check whether arguments were passed to the script and, if not, assign some default values:

    #!/bin/bash
    # the defaults are left unquoted so that the tilde is expanded
    export FROM=${1:-~root/.profile}
    export TO=${2:-~my/.profile}
    cp -p "$FROM" "$TO"

An additional modification allows you to set a variable if it is not defined. This is done with the operator ${var:=bar}.

It works as follows: if $var exists and is not null, return $var. If it doesn't exist or is null, set $var to bar and return bar.

Example:

    $ export var=""
    $ echo ${var:=one}
    one

Results:

    $ echo $var
    one
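A common way to use := for script defaults is to pair it with the no-op command `:`, so that the expansion is evaluated only for its side effect. This is a sketch; `LOGDIR` and `RETRIES` are hypothetical configuration variables:

```shell
#!/bin/bash
# LOGDIR and RETRIES are hypothetical names used only for illustration.
unset LOGDIR
RETRIES=5

: "${LOGDIR:=/tmp/logs}"   # not set, so it is assigned the default
: "${RETRIES:=3}"          # already set, so it is left alone

echo "LOGDIR=$LOGDIR RETRIES=$RETRIES"   # LOGDIR=/tmp/logs RETRIES=5
```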

There are two types of pattern matching in shell:

Old ksh-style pattern matching, which uses very idiosyncratic regular expressions with prefix positioning of metasymbols.

Classic, Perl-style pattern matching with the =~ operator (implemented in bash 3.0), which uses extended regular expressions.

Unless you need to modify old scripts, it does not make sense to use old ksh-style regex in bash.

(partially borrowed from)

Since version 3 (released in 2004), bash implements extended regular expressions, which are mostly compatible with Perl regex. They are also called POSIX regular expressions, as they are defined by the POSIX standard (which you should read and understand to use the full power provided). Extended regular expressions are also used in egrep, so they are well known to most system administrators. Please note that Perl regular expressions are equivalent to extended regular expressions with a few additional features:

Perl supports noncapturing parentheses.

The order of multiple options within parentheses can be important when substrings come into play.

Perl allows you to include a literal square bracket anywhere within a character class by preceding it with a backslash.

Perl adds a number of additional switches that are equivalent to certain special characters and character classes.

Perl supports a broader range of modifiers.

Extended regular expressions support a set of predefined character classes. When used between brackets, these define commonly used sets of characters. The POSIX character classes implemented in extended regular expressions include:

[:alnum:]: all alphanumeric characters (a-z, A-Z, and 0-9).

[:alpha:]: all alphabetic characters (a-z, A-Z).

[:blank:]: all whitespace within a line (spaces or tabs).

[:cntrl:]: all control characters (ASCII 0-31 and 127).

[:digit:]: all digits (0-9).

[:graph:]: all alphanumeric or punctuation characters.

[:lower:]: all lowercase letters (a-z).

[:print:]: all printable characters (the opposite of [:cntrl:]; the same as [:graph:] plus the space character).

[:punct:]: all punctuation characters.

[:space:]: all whitespace characters (space, tab, newline, carriage return, form feed, and vertical tab).

[:upper:]: all uppercase letters (A-Z).

[:xdigit:]: all hexadecimal digits (0-9, a-f, A-F).
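Note that a class name goes inside a bracket expression, so the brackets are doubled in practice. A minimal bash sketch follows; the helper name `is_identifier` and the sample words are made up for this illustration:

```shell
#!/bin/bash
# is_identifier is a hypothetical helper: it succeeds if $1 looks like a
# C-style identifier (a letter or underscore, then letters, digits, underscores).
is_identifier() {
    local re='^[[:alpha:]_][[:alnum:]_]*$'   # keep the regex in a variable, unquoted when used
    [[ $1 =~ $re ]]
}

for word in total_2 2fast "has space"; do
    if is_identifier "$word"; then
        echo "'$word' is an identifier"
    else
        echo "'$word' is not an identifier"
    fi
done
```

Of the three sample words, only total_2 is accepted.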

Modifiers are by and large similar to those in Perl:

Extended regex	Perl regex
`a+`	`a+`
`a?`	`a?`
`a|b`	`a|b`
`(expression1)`	`(expression1)`
`{m,n}`	`{m,n}`
`{,n}`	`{,n}`
`{m,}`	`{m,}`
`{m}`	`{m}`

The =~ operator returns 0 (success) if the regular expression matches the string; otherwise it returns 1 (failure).

In addition to doing simple matching, bash regular expressions support sub-patterns surrounded by parenthesis for capturing parts of the match. The matches are assigned to an array variable BASH_REMATCH. The entire match is assigned to BASH_REMATCH[0], the first sub-pattern is assigned to BASH_REMATCH[1], etc..
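As a small illustration of BASH_REMATCH, the sketch below splits a date into its parts. The sample value and the variable names (`date_str`, `re`, `whole`, `year`, `month`, `day`) are assumptions made for this example:

```shell
#!/bin/bash
# Split an ISO-style date into its parts; the sample value is made up.
date_str="2019-05-26"
re='^([0-9]{4})-([0-9]{2})-([0-9]{2})$'

if [[ $date_str =~ $re ]]; then
    whole=${BASH_REMATCH[0]}   # the entire match
    year=${BASH_REMATCH[1]}    # first parenthesized sub-pattern
    month=${BASH_REMATCH[2]}   # second sub-pattern
    day=${BASH_REMATCH[3]}     # third sub-pattern
    echo "matched '$whole': year=$year month=$month day=$day"
else
    echo "'$date_str' is not a date"
fi
```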

The following example script takes a regular expression as its first argument and one or more strings to match against. It then cycles through the strings and outputs the results of the match process:

    #!/bin/bash
    if [[ $# -lt 2 ]]; then
        echo "Usage: $0 PATTERN STRINGS..."
        exit 1
    fi
    regex=$1
    shift
    echo "regex: $regex"
    echo
    while [[ $1 ]]
    do
        if [[ $1 =~ $regex ]]; then
            echo "$1 matches"
            i=1
            n=${#BASH_REMATCH[*]}
            while [[ $i -lt $n ]]
            do
                echo "  capture[$i]: ${BASH_REMATCH[$i]}"
                let i++
            done
        else
            echo "$1 does not match"
        fi
        shift
    done

Assuming the script is saved in "bashre.sh", the following sample shows its output (the script must be run with bash, since [[ ... =~ ... ]] is not available in a plain sh):

    # bash bashre.sh 'aa(b{2,3}[xyz])cc' aabbxcc aabbcc
    regex: aa(b{2,3}[xyz])cc

    aabbxcc matches
      capture[1]: bbx
    aabbcc does not match

Pattern-matching operators were introduced in ksh88 in a very idiosyncratic way. The notation is different from that used by Perl or utilities such as grep. That's a shame, but that's how it is. Life is not perfect. The operators are hard to remember, but there is a handy mnemonic tip: # matches the front because number signs precede numbers; % matches the rear because percent signs follow numbers.

There are two kinds of pattern matching available: matching from the left and matching from the right.

The operators, with their functions and an example, are shown in the following table:

Operator	Meaning	Example
`${var#t*is}`	Deletes the shortest possible match from the left: If the pattern matches the beginning of the variable's value, delete the shortest part that matches and return the rest.	`export var="this is a test"` `echo ${var#t*is}` `is a test`
`${var##t*is}`	Deletes the longest possible match from the left: If the pattern matches the beginning of the variable's value, delete the longest part that matches and return the rest.	`export var="this is a test"` `echo ${var##t*is}` `a test`
`${var%t*st}`	Deletes the shortest possible match from the right: If the pattern matches the end of the variable's value, delete the shortest part that matches and return the rest.	`export var="this is a test"` `echo ${var%t*st}` `this is a`
`${var%%t*st}`	Deletes the longest possible match from the right: If the pattern matches the end of the variable's value, delete the longest part that matches and return the rest.	`export var="this is a test"` `echo ${var%%t*st}` (empty string)

While the # and % identifiers may not seem obvious, they have a convenient mnemonic. The # key is on the left side of the $ key on the keyboard and operates from the left. The % key is on the right of the $ key and operates from the right.

These operators can be used to do a variety of things. For example, the following script changes the extension of all .html files to .htm.

    #!/bin/bash
    # quickly convert html filenames for use on a dossy system
    # only handles file extensions, not filenames
    for i in *.html; do
        # ${i%l} is the name with the trailing "l" removed
        if [ -f "${i%l}" ]; then
            echo "${i%l} already exists"
        else
            mv "$i" "${i%l}"
        fi
    done

The classic use for pattern-matching operators is stripping off components of pathnames, such as directory prefixes and filename suffixes. With that in mind, here is an example that shows how all of the operators work. Assume that the variable path has the value /home/billr/mem/long.file.name; then:

    Expression         Result
    ${path##/*/}       long.file.name
    ${path#/*/}        billr/mem/long.file.name
    $path              /home/billr/mem/long.file.name
    ${path%.*}         /home/billr/mem/long.file
    ${path%%.*}        /home/billr/mem/long
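The same operators give rough equivalents of the basename and dirname utilities without starting a new process. This is a sketch using the sample path from the text; the variable names `file`, `dir`, `ext` and `stem` are made up for this illustration:

```shell
#!/bin/bash
path=/home/billr/mem/long.file.name

file=${path##*/}   # strip the longest prefix ending in "/"
dir=${path%/*}     # strip the shortest suffix starting with "/"
ext=${file##*.}    # text after the last dot
stem=${file%.*}    # file name without its last extension

echo "dir=$dir file=$file stem=$stem ext=$ext"
```

Here the script prints dir=/home/billr/mem file=long.file.name stem=long.file ext=name.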

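Example (${var#t*is}, shortest match from the left):

    $ export var="this is a test"
    $ echo ${var#t*is}
    is a test

Example (${var##t*is}, longest match from the left):

    $ export var="this is a test"
    $ echo ${var##t*is}
    a test

Example (${var%t*st}, shortest match from the right):

    $ export var="this is a test"
    $ echo ${var%t*st}
    this is a

Renaming .html files with the same operator:

    for i in *.htm*; do
        if [ -f "${i%l}" ]; then
            echo "${i%l} already exists"
        else
            mv "$i" "${i%l}"
        fi
    done

Example (${var%%t*st}, longest match from the right; the result is an empty string):

    $ export var="this is a test"
    $ echo ${var%%t*st}
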

A shell regular expression can contain regular characters, standard wildcard characters, and additional operators that are more powerful than wildcards. Each such operator has the form x(exp), where x is the particular operator and exp is any regular expression (often simply a regular string). The operator determines how many occurrences of exp a string that matches the pattern can contain.

Operator	Meaning
`*`(`exp`)	0 or more occurrences of `exp`
`+`(`exp`)	1 or more occurrences of `exp`
`?`(`exp`)	0 or 1 occurrences of `exp`
`@`(`exp1`|`exp2`|...)	`exp1` or `exp2` or...
`!`(`exp`)	Anything that doesn't match `exp`

Expression	Matches
`x`	`x`
`*`(`x`)	Null string, `x`, `xx`, `xxx`, ...
`+`(`x`)	`x`, `xx`, `xxx`, ...
`?`(`x`)	Null string, `x`
`!`(`x`)	Any string except `x`
`@`(`x`)	`x` (see below)

The following section compares Korn shell regular expressions to analogous features in awk and egrep. If you aren't familiar with these, skip to the section entitled "Pattern-matching Operators."

Shell	egrep/awk	Meaning
`*`(`exp`)	`exp*`	0 or more occurrences of `exp`
`+`(`exp`)	`exp+`	1 or more occurrences of `exp`
`?`(`exp`)	`exp?`	0 or 1 occurrences of `exp`
`@`(`exp1`|`exp2`|...)	`exp1`|`exp2`|...	`exp1` or `exp2` or...
`!`(`exp`)	(none)	Anything that doesn't match `exp`

These equivalents are close but not quite exact. Actually, an exp within any of the Korn shell operators can be a series of exp1|exp2|... alternates. But because the shell would interpret an expression like dave|fred|bob as a pipeline of commands, you must use @(dave|fred|bob) for alternates.

For example:

@(dave|fred|bob) matches dave, fred, or bob.

*(dave|fred|bob) means "0 or more occurrences of dave, fred, or bob". This expression matches strings like the null string, dave, davedave, fred, bobfred, bobbobdavefredbobfred, etc.

+(dave|fred|bob) matches any of the above except the null string.

?(dave|fred|bob) matches the null string, dave, fred, or bob.

!(dave|fred|bob) matches anything except dave, fred, or bob.

It is worth re-emphasizing that shell regular expressions can still contain standard shell wildcards. Thus, the shell wildcard ? (match any single character) is equivalent to . in egrep or awk, and the shell's character set operator [...] is the same as in those utilities. For example, the expression +([0-9]) matches a number, i.e., one or more digits. The shell wildcard character * is equivalent to the shell regular expression *(?).
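In bash these ksh-style operators are available once the extglob shell option is switched on. The sketch below assumes bash; the function name `classify` and its labels are made up for this illustration:

```shell
#!/bin/bash
shopt -s extglob   # bash needs this for the ksh-style operators; ksh does not

# classify is a hypothetical helper used only for this illustration.
classify() {
    case $1 in
        +([0-9]))          echo "number" ;;       # one or more digits
        @(dave|fred|bob))  echo "known name" ;;   # exactly one of the alternates
        ?(-)+([a-z]))      echo "word" ;;         # optional dash, then lowercase letters
        *)                 echo "other" ;;
    esac
}

classify 2019
classify fred
classify -verbose
classify "a b"
```

The four calls print number, known name, word and other. Because case takes the first pattern that matches, fred is reported as a known name even though it would also match the word pattern.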

A few egrep and awk regexp operators do not have equivalents in the Korn shell. These include: