This page teaches Bash scripting with a data-driven, task-oriented approach. Regional weather data from NOAA, covering about two weeks, was downloaded in advance. The goal of the commands on this page is to parse that data using simple scripting techniques.

These lessons assume you have a Biowulf account and are working within a temporary, local scratch disk on a node. Any changes made to the downloaded files during the session will be lost unless the files are copied to your own /home or /data directory!

Getting started

Log into Biowulf:

ssh user@biowulf.nih.gov

Make sure you are running Bash:

echo $0

Allocate an interactive session with local scratch space so that I/O operations on the data are as fast as possible. This command assumes you are already logged into Biowulf. By default, you will have access to 2 CPUs and 1.5GB of RAM. The session will last for 8 hours.

sinteractive --gres=lscratch:5

Copy the data into the local scratch space.

cd /lscratch/$SLURM_JOB_ID
mkdir bash_class
cd bash_class
cp -R /data/classes/bash/* .
tar -C data -x -z -f data/weather_data.tgz

If you are not on Helix or Biowulf, you can download the data using either wget or curl.

wget

wget https://hpc.nih.gov/training/handouts/BashScripting.tgz

curl

curl https://hpc.nih.gov/training/handouts/BashScripting.tgz > BashScripting.tgz

Untar the data

tar -x -v -f BashScripting.tgz
tar -C data -x -z -f data/weather_data.tgz

Have a look at the data. There should be about 63MB of data, 71MB including examples and scripts.

ls
ls data/weather-2017-10-09-00-00-01/
ls data/weather-2017-10-09-00-00-01/md/
file data/weather-2017-10-09-00-00-01/md/mdz009.txt
stat data/weather-2017-10-09-00-00-01/md/mdz009.txt
tree data
cat data/weather-2017-10-09-00-00-01/md/mdz009.txt

Get down and dirty with the data

Run some simple and piped commands to display the temperature, humidity, and air pressure of a given city at a given time.

Look at all the data

cat data/weather-2017-10-10-06-00-01/md/mdz009.txt

Pull out only the lines that are for Gaithersburg

grep GAITHERSBURG data/weather-2017-10-10-06-00-01/md/mdz009.txt

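Case insensitive search this time. The lowercase pattern alone finds nothing, because grep is case-sensitive by default; add --ignore-case (or -i)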

grep gaithersburg data/weather-2017-10-10-06-00-01/md/mdz009.txt

grep --ignore-case gaithersburg data/weather-2017-10-10-06-00-01/md/mdz009.txt
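To see the difference without touching the class data, here is a minimal sketch that runs both searches against a throwaway file. The two sample lines are ones quoted on this page; the variable names are only for illustration.

```bash
# Two lines quoted elsewhere on this page, saved to a throwaway file.
sample=$(mktemp)
printf '%s\n' \
    'GAITHERSBURG*  FOG       65  65 100 CALM' \
    'ANNAPOLIS      CLEAR     65  48  54 VRB7      29.65F' > "$sample"

# grep -c prints the number of matching lines instead of the lines themselves.
exact=$(grep -c gaithersburg "$sample")                  # case-sensitive
relaxed=$(grep -c --ignore-case gaithersburg "$sample")  # ignores case

echo "case-sensitive matches:   $exact"
echo "case-insensitive matches: $relaxed"
rm -f "$sample"
```

The first count is zero because the file spells the city in capitals.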

Get all the timepoints for the data, using --recursive or -r

grep --recursive --ignore-case gaithersburg data/*/md

Get rid of the file information

grep --no-filename --recursive --ignore-case gaithersburg data/*/md

Sort the data

grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort

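Sort based on temperature. The first attempt fails because -k3 compares the third field as text; -n makes the comparison numeric and -r reverses the order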

grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k3

grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -nk3

grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -rnk3
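The three sort variants are easier to compare on a tiny input. This is a sketch with made-up rows (they are not real observations), and LC_ALL=C is set only so the text ordering is the same on every machine.

```bash
# Made-up rows: name, sky, temperature (not real observations).
rows=$'CITY_A SUNNY 9\nCITY_B FOG 65\nCITY_C CLEAR 100\nCITY_D CLEAR 83'

# Helper for this demo only: print the third column of stdin on one line.
temps() { cut -d' ' -f3 | paste -sd' ' -; }

as_text=$(printf '%s\n' "$rows" | LC_ALL=C sort -k3 | temps)
as_number=$(printf '%s\n' "$rows" | LC_ALL=C sort -nk3 | temps)
descending=$(printf '%s\n' "$rows" | LC_ALL=C sort -rnk3 | temps)

echo "sort -k3   : $as_text"
echo "sort -nk3  : $as_number"
echo "sort -rnk3 : $descending"
```

As text, 100 sorts before 65 because the comparison goes character by character; with -n the values are compared as numbers.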

Sort using character position, rather than delimiter. First use cut to check that columns 26-27 hold the temperature, then sort on those columns

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | cut -c1-14,26-27

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.26,1.27

Reverse the sort -- change ascending to descending

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.26,1.27r

Sort ascending by relative humidity -- fail because of leading space -- fix by treating as numeric

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n
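A small sketch of the same idea on two lines quoted on this page: cut shows what columns 33-35 contain, and the n modifier on the key makes sort compare them as numbers. How a plain text comparison treats the leading space can depend on the locale, so the numeric form is the safer choice here.

```bash
lines=$'GAITHERSBURG*  FOG       65  65 100 CALM\nGAITHERSBURG*  MOSUNNY   83  64  52 S15       30.02F'

# Check what columns 33-35 hold before sorting on them.
columns=$(printf '%s\n' "$lines" | cut -c33-35)
printf '%s\n' "$columns"

# Sort on characters 33-35 of the line, compared as numbers.
least_humid=$(printf '%s\n' "$lines" | sort -k 1.33,1.35n | head -n 1)
echo "least humid: $least_humid"
```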

Multi-column sort

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n -k 1.26,1.27r

Isolate the most humid moment in time

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n | tail

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n | tail -n 1

grep --recursive "FOG       65  65 100 CALM" data/*/md/*

grep --recursive "GAITHERSBURG*  FOG       65  65 100 CALM" data/*/md/*

grep --recursive "GAITHERSBURG\*  FOG       65  65 100 CALM" data/*/md/*
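The middle command finds nothing because, in a regular expression, * means "zero or more of the preceding character", so it never matches the literal asterisk after the city name. A minimal sketch of the three options, using a line quoted above; --fixed-strings is an alternative to escaping that is not used elsewhere on this page.

```bash
line='GAITHERSBURG*  FOG       65  65 100 CALM'

# * is a quantifier here: "G repeated zero or more times", then two spaces.
unescaped=$(printf '%s\n' "$line" | grep -c 'GAITHERSBURG*  FOG')

# \* matches a literal asterisk.
escaped=$(printf '%s\n' "$line" | grep -c 'GAITHERSBURG\*  FOG')

# --fixed-strings treats the whole pattern as plain text.
fixed=$(printf '%s\n' "$line" | grep -c --fixed-strings 'GAITHERSBURG*  FOG')

echo "unescaped: $unescaped  escaped: $escaped  fixed: $fixed"
```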

Isolate the least humid moment in time

grep --no-filename --recursive --ignore-case gaithersburg data/*/md  | sort -k 1.33,1.35n | head -n 1

grep --recursive "GAITHERSBURG\*  MOSUNNY   83  64  52 S15       30.02F" data/*/md/*

grep --recursive 'GAITHERSBURG\*  MOSUNNY   83  64  52 S15       30.02F' data/*/md/*

Get the average temperature

grep  -h -r -i gaithersburg data/*/md | cut -c26-27 | awk '{sum+=$1; num++ } END { print sum,num,sum/num}'
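Here is the same averaging pipeline as a sketch on two lines quoted on this page, with a guard for empty input added (an assumption on my part; the original one-liner would divide by zero if grep found nothing). Note that cut -c26-27 only captures two-character temperatures.

```bash
lines=$'GAITHERSBURG*  FOG       65  65 100 CALM\nGAITHERSBURG*  MOSUNNY   83  64  52 S15       30.02F'

# Columns 26-27 hold the temperature; awk sums them and divides by the count.
average=$(printf '%s\n' "$lines" | cut -c26-27 |
    awk '{ sum += $1; num++ }
         END { if (num > 0) printf "%.1f\n", sum / num; else print "no data" }')

echo "average temperature: $average"
```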

Scripting

Had enough of grep? Had enough of typing again and again? Let's create a script. nano is a simple text-based file editor that is intuitive and easy to use.

nano scripts/script_01a.sh

  GNU nano 2.3.1                              File: scripts/script_01a.sh                                                       

grep -h -r -i -B100 gaithersburg data/* | grep 'M EDT'
grep -h -r -i gaithersburg data/*







^G Get Help         ^O WriteOut         ^R Read File        ^Y Prev Page        ^K Cut Text         ^C Cur Pos
^X Exit             ^J Justify          ^W Where Is         ^V Next Page        ^U UnCut Text       ^T To Spell

Now run the script.

bash scripts/script_01a.sh

This prints a long list: every timestamp, followed by every Gaithersburg line. We want the timestamps pasted onto the weather data, so redirect the output of each command to a file and add the paste command.

nano scripts/script_01b.sh

grep -h -r -i -B100 gaithersburg data/* | grep 'M EDT' > 1
grep -h -r -i gaithersburg data/* > 2
paste 1 2

Now it looks nicer.

bash scripts/script_01b.sh
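The script leaves two scratch files named 1 and 2 behind. One possible alternative, sketched here, is Bash process substitution, which lets paste read directly from two commands. The functions stamps and obs are hypothetical stand-ins for the two grep commands, and the timestamps are invented for the demo.

```bash
# Hypothetical stand-ins for the two grep commands in the script.
stamps() { printf '%s\n' '600 AM EDT TUE OCT 10 2017' '700 AM EDT TUE OCT 10 2017'; }
obs() {
    printf '%s\n' \
        'GAITHERSBURG*  FOG       65  65 100 CALM' \
        'GAITHERSBURG*  MOSUNNY   83  64  52 S15       30.02F'
}

# <(command) behaves like a file holding the command's output.
joined=$(paste <(stamps) <(obs))
printf '%s\n' "$joined"
```

paste joins line N of the first input to line N of the second with a tab, so both inputs must list the records in the same order.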

Dealing with adversity

Use conditionals to handle unknowns and odd issues.

Only grab the data below "CITY"

nano scripts/script_04c.sh

while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
    fi
    if [[ -n $good ]]; then
        echo " ---${line}---- "
    fi
done

bash scripts/script_04c.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt
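The same flag-and-loop pattern can be written without starting a grep process for every line. This is a sketch, not the class script: below_city is a hypothetical name, the report text is a mock-up, and IFS= read -r is used so that leading spaces and backslashes in the data survive.

```bash
# Print everything from the line starting with CITY onwards.
below_city() {
    local line good=""
    while IFS= read -r line; do
        # [[ ... == CITY* ]] is a shell pattern match, no external command needed.
        if [[ $line == CITY* ]]; then
            good=1
        fi
        if [[ -n $good ]]; then
            printf '%s\n' "$line"
        fi
    done
}

# Mock-up of a report, for illustration only.
result=$(printf '%s\n' \
    'Expires:201710101100' \
    '' \
    'CITY           SKY/WX    TMP DP  RH WIND' \
    'GAITHERSBURG*  FOG       65  65 100 CALM' \
    '$$' | below_city)
printf '%s\n' "$result"
```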

Don't print "CITY"

nano scripts/script_04d.sh

while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        echo " ---${line}---- "
    fi
done

bash scripts/script_04d.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt

Get rid of remainder

nano scripts/script_04e.sh

while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if ( echo "$line" | grep -q "^$" ) ; then
            exit
        else
            echo " ---${line}---- "
        fi
    fi
done

bash scripts/script_04e.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt

Yuck, get rid of '$$' as well

nano scripts/script_04f.sh

this won't work, because inside double quotes $$ expands to the process ID of the shell

        if ( echo "$line" | grep -q "$$" ) ; then
            exit
        else

this will, but we're still left with blank lines

        if ( echo "$line" | grep -q '$$' ) ; then
            exit
        else

more cleverness

        if ( echo "$line" | grep -q '$$' ) ; then
            exit
        elif ( echo "$line" | grep -q "^$" ) ; then
            exit
        else

even more cleverness

        if  [[ "$line" =~ [[:alpha:]] ]] ; then
            echo " ---${line}---- "
        fi
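Two points from the snippets above, shown as a small sketch: how quoting changes the meaning of $$, and how the [[:alpha:]] test separates data lines from separators and blank lines. The function name has_letters is made up for the demo.

```bash
pid_text="$$"   # double quotes: $$ expands to the shell's process ID
literal='$$'    # single quotes: two literal dollar signs

echo "double-quoted: $pid_text"
echo "single-quoted: $literal"

# True when the argument contains at least one letter.
has_letters() { [[ $1 =~ [[:alpha:]] ]]; }

for candidate in 'GAITHERSBURG*  FOG       65  65 100 CALM' '$$' ''; do
    if has_letters "$candidate"; then
        echo "keep: [$candidate]"
    else
        echo "skip: [$candidate]"
    fi
done
```

Because neither the $$ separator nor an empty line contains a letter, one test replaces both grep checks.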

now walk all the files

for file in data/weather-2017-10-10-00-00-02/md/* ; do bash scripts/script_04f.sh < $file ; done

Wot? Repeats?

for file in data/weather-2017-10-10-00-00-02/md/* ; do bash scripts/script_04f.sh < $file ; done | sort -u

Regular expressions

Huh? Rolling around through the data shows some repeats

---ANNAPOLIS      CLEAR     65  48  54 VRB7      29.65F----
---ANNAPOLIS      CLEAR     80  77  90 CALM      29.98R----

---BWI AIRPORT    MOCLDY    70  49  47 S12G20    29.63F----
---BWI AIRPORT    PTCLDY    75  75 100 CALM      29.98R----

---MD SCIENCE CTR   N/A     60  46  59 MISG      29.64F----
---MD SCIENCE CTR   N/A     79  75  87 MISG      29.97R----

Use the Expires tag to filter out very old data:

nano scripts/script_05a.sh

regex="^Expires:([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})"
cutoff=201710010000

while read line
do

    if [[ "$line" =~ $regex ]]; then
        echo year=${BASH_REMATCH[1]}
        echo month=${BASH_REMATCH[2]}
        echo day=${BASH_REMATCH[3]}
        echo hour=${BASH_REMATCH[4]}
        echo minute=${BASH_REMATCH[5]}
        if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}${BASH_REMATCH[5]}" ]]; then
            exit
        fi

    fi

    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if  [[ "$line" =~ [[:alpha:]] ]] ; then
            echo " ---${line}---- "
        fi
    fi
done

for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -1000) ; do bash scripts/script_05a.sh < $file ; done | sort -u
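To see what BASH_REMATCH holds after a match, here is a sketch that applies the same regular expression to a single header line. The header value is hypothetical, chosen only to fit the pattern.

```bash
regex='^Expires:([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})'
cutoff=201710010000
header='Expires:201710101100'   # hypothetical header line, for illustration only

if [[ $header =~ $regex ]]; then
    # Index 0 is the whole match; 1..5 are the parenthesised groups in order.
    year=${BASH_REMATCH[1]}
    month=${BASH_REMATCH[2]}
    day=${BASH_REMATCH[3]}
    stamp="${BASH_REMATCH[1]}${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}${BASH_REMATCH[5]}"
    echo "whole match: ${BASH_REMATCH[0]}"
    echo "date: $year-$month-$day"
    if [[ $cutoff -gt $stamp ]]; then
        verdict="too old"
    else
        verdict="recent enough"
    fi
    echo "$stamp is $verdict"
fi
```

The regex is kept in a variable and used unquoted on the right of =~; quoting it would make Bash treat it as a literal string.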

Simpler:

nano scripts/script_05b.sh

regex="^Expires:([0-9]{12})"
cutoff=201710010000

while read line
do

    if [[ "$line" =~ $regex ]]; then
        if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
            exit
        fi
    fi

    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if  [[ "$line" =~ [[:alpha:]] ]] ; then
            echo " ---${line}---- "
        fi
    fi
done

for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -1000) ; do bash scripts/script_05b.sh < $file ; done | sort -u
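The for file in $(find ...) loop works for the class data because none of the file names contain spaces. As a hedged aside, a loop that also copes with such names can be sketched with find -print0 and read -d ''. The directory and file names below are created just for the demo.

```bash
workdir=$(mktemp -d)
mkdir -p "$workdir/md"
printf 'first\n'  > "$workdir/md/mdz009.txt"
printf 'second\n' > "$workdir/md/name with spaces.txt"

# Names are separated by NUL bytes, so spaces inside a name are preserved.
count=0
while IFS= read -r -d '' file; do
    count=$((count + 1))
    printf 'processing %s\n' "$file"
done < <(find "$workdir/md" -type f -print0)
echo "files seen: $count"

# The word-splitting loop sees each space-separated piece as its own "file".
naive=0
for file in $(find "$workdir/md" -type f); do
    naive=$((naive + 1))
done
echo "pieces seen by the naive loop: $naive"

rm -rf "$workdir"
```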

Automation

Create a function to automate what we did:

nano scripts/script_06a.sh

cutoff=201710010000

function extract_data
{
    regex="^Expires:([0-9]{12})"
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                exit
            fi
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${line}
                else
                    value=${value}$'\n'${line}
                fi
            fi
        fi
    done < $1
    echo "$value"
}

extract_data $1

for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -1000) ; do bash scripts/script_06a.sh $file ; done | sort -u

That's good, but what if we want all the data for a given time and state?

nano scripts/script_06b.sh

cutoff=201710010000

function extract_data
{
    regex="^Expires:([0-9]{12})"
    unset good
    unset value
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if  [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${line}
                else
                    value=${value}$'\n'${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u

bash scripts/script_06b.sh data/weather-2017-10-10-00-00-02/az
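script_06b.sh calls the function many times from one shell, which is why exit became return: exit ends the whole shell, while return only leaves the function. A minimal sketch of the difference, with hypothetical function names and made-up items:

```bash
# Skips "old" items and carries on with the next call.
process_with_return() {
    [[ $1 == old* ]] && return
    echo "processed $1"
}

# Ends the enclosing shell at the first "old" item.
process_with_exit() {
    [[ $1 == old* ]] && exit
    echo "processed $1"
}

# Each loop runs inside $( ), so exit only ends that subshell, not this script.
with_return=$(for item in new1 old2 new3; do process_with_return "$item"; done)
with_exit=$(for item in new1 old2 new3; do process_with_exit "$item"; done)

echo "return: $with_return"
echo "exit:   $with_exit"
```

With exit, new3 is never processed. The unset lines at the top of the function serve a similar purpose: they stop state from one file leaking into the next call.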

Selectivity and format

Getting back to time stamps. How can we format time?

nano scripts/script_07a.sh

cutoff=201710010000

function month_str_to_num {
# Convert month_3char to numeric
    case $1 in
        "JAN") echo 1 ;;
        "FEB") echo 2 ;;
        "MAR") echo 3 ;;
        "APR") echo 4 ;;
        "MAY") echo 5 ;;
        "JUN") echo 6 ;;
        "JUL") echo 7 ;;
        "AUG") echo 8 ;;
        "SEP") echo 9 ;;
        "OCT") echo 10 ;;
        "NOV") echo 11 ;;
        "DEC") echo 12 ;;
        *) { echo Bad month format; exit 1; } ;;
    esac
}

function timestr_to_clock {

    local c=$1
    local ap=$2
    local h=""
    local m=""

    if [[ ${#c} == 3 ]]; then
        h=${c:0:1}
        m=${c:1:2}
    elif [[ ${#c} == 4 ]]; then
        h=${c:0:2}
        m=${c:2:2}
    else
        { echo Bad time format; exit 1; }
    fi

  if [[ "$ap" == "PM" ]]; then
      ((h+=12))
  fi

  echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}

function parse_time_stamp {

    local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"

    if [[ "$1" =~ $regex ]]; then
        local hour_min=${BASH_REMATCH[1]}
        local am_pm=${BASH_REMATCH[2]}
        local timezone=${BASH_REMATCH[3]}
        local day_of_the_week=${BASH_REMATCH[4]}
        local month_3char=${BASH_REMATCH[5]}
        local day=${BASH_REMATCH[6]}
        local year=${BASH_REMATCH[7]}

        local month=$(month_str_to_num $month_3char)
        local clock=$(timestr_to_clock $hour_min $am_pm)

        echo "${year}-${month}-${day}T${clock}"

    else

        { echo Bad timestamp format; exit 1; }

    fi
}

function extract_data
{
    regex="^Expires:([0-9]{12})"
    timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    unset good
    unset value
    unset timestr
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ $timestamp ]]; then
            timestr=$(parse_time_stamp "$line")
        fi
        if  [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${timestr}$' '${line}
                else
                    value=${value}$'\n'${timestr}$' '${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u

bash scripts/script_07a.sh data/weather-2017-10-06-12-49-01/nc
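One detail of the month lookup: it prints single-digit months without a leading zero, which is harmless for October data but would give dates like 2017-1-06 in January. As a sketch of an alternative (month_to_num is a hypothetical name, not one of the class functions), the month number can be derived from the position of the abbreviation in a string and zero-padded with printf.

```bash
month_to_num() {
    local months="JANFEBMARAPRMAYJUNJULAUGSEPOCTNOVDEC"
    # Everything before the first occurrence of the abbreviation.
    local prefix=${months%%"$1"*}
    # Reject unknown input and matches that straddle two month names.
    if [[ ${#1} -ne 3 || $prefix == "$months" ]] || (( ${#prefix} % 3 != 0 )); then
        echo "Bad month format" >&2
        return 1
    fi
    printf '%02d\n' $(( ${#prefix} / 3 + 1 ))
}

month_to_num OCT
month_to_num JAN
```

Errors go to standard error and use return, so a caller using $(month_to_num ...) can test the exit status instead of receiving the error text as data.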

Simplify. The date command can parse time, somewhat, so the month lookup function is no longer needed:

nano scripts/script_07b.sh

cutoff=201710010000

function timestr_to_clock {

    local c=$1
    local ap=$2
    local h=""
    local m=""

    if [[ ${#c} == 3 ]]; then
        h=${c:0:1}
        m=${c:1:2}
    elif [[ ${#c} == 4 ]]; then
        h=${c:0:2}
        m=${c:2:2}
    else
        { echo Bad time format; exit 1; }
    fi

  if [[ "$ap" == "PM" ]]; then
      ((h+=12))
  fi

  echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}

function parse_time_stamp {

    local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"

    if [[ "$1" =~ $regex ]]; then
        local hour_min=${BASH_REMATCH[1]}
        local am_pm=${BASH_REMATCH[2]}
        local timezone=${BASH_REMATCH[3]}
        local day_of_the_week=${BASH_REMATCH[4]}
        local month_3char=${BASH_REMATCH[5]}
        local day=${BASH_REMATCH[6]}
        local year=${BASH_REMATCH[7]}

        local clock=$(timestr_to_clock $hour_min $am_pm)

        echo $(date -d "$day_of_the_week $month_3char $day $clock $timezone $year" +"%FT%T")

    else

        { echo Bad timestamp format; exit 1; }

    fi
}

function extract_data
{
    regex="^Expires:([0-9]{12})"
    timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    unset good
    unset value
    unset timestr
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ $timestamp ]]; then
            timestr=$(parse_time_stamp "$line")
        fi
        if  [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${timestr}$' '${line}
                else
                    value=${value}$'\n'${timestr}$' '${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u

bash scripts/script_07b.sh data/weather-2017-10-06-12-49-01/nc

Stupid bug: 12 PM comes out as hour 24. Only add 12 when the hour is not already 12

nano scripts/script_07c.sh


cutoff=201710010000

function timestr_to_clock {

    local c=$1
    local ap=$2
    local h=""
    local m=""

    if [[ ${#c} == 3 ]]; then
        h=${c:0:1}
        m=${c:1:2}
    elif [[ ${#c} == 4 ]]; then
        h=${c:0:2}
        m=${c:2:2}
    else
        { echo Bad time format; exit 1; }
    fi

    if [[ "$ap" == "PM" ]]; then
        if [[ $h != 12 ]]; then
            ((h+=12))
        fi
    fi

  echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}

function parse_time_stamp {

    local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"

    if [[ "$1" =~ $regex ]]; then
        local hour_min=${BASH_REMATCH[1]}
        local am_pm=${BASH_REMATCH[2]}
        local timezone=${BASH_REMATCH[3]}
        local day_of_the_week=${BASH_REMATCH[4]}
        local month_3char=${BASH_REMATCH[5]}
        local day=${BASH_REMATCH[6]}
        local year=${BASH_REMATCH[7]}

        local clock=$(timestr_to_clock $hour_min $am_pm)

        echo $(date -d "$day_of_the_week $month_3char $day $clock $timezone $year" +"%FT%T")

    else

        { echo Bad timestamp format; exit 1; }

    fi
}

function extract_data
{
    regex="^Expires:([0-9]{12})"
    timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
    unset good
    unset value
    unset timestr
    while read line
    do
        if [[ "$line" =~ $regex ]]; then
            if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
                return
            fi
        fi
        if [[ "$line" =~ $timestamp ]]; then
            timestr=$(parse_time_stamp "$line")
        fi
        if  [[ "$line" =~ '$$' ]] ; then
            unset good
        fi
        if ( echo "$line" | grep -q ^CITY ) ; then
            good=1
            continue
        fi
        if [[ -n $good ]]; then
            if  [[ "$line" =~ [[:alpha:]] ]] ; then
                if [[ -z $value ]]; then
                    value=${timestr}$' '${line}
                else
                    value=${value}$'\n'${timestr}$' '${line}
                fi
            fi
        fi
    done < $1
    [[ -n $value ]] && echo "$value"
}

for file in $(find $1 -type f)
do
    extract_data $file
done | sort -u

bash scripts/script_07c.sh data/weather-2017-10-06-12-49-01/nc

What?

diff scripts/script_07c.sh scripts/script_07b.sh

20,24c20,22
<     if [[ "$ap" == "PM" ]]; then
<         if [[ $h != 12 ]]; then
<             ((h+=12))
<         fi
<     fi
---
>   if [[ "$ap" == "PM" ]]; then
>       ((h+=12))
>   fi
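The fix covers 12 PM. Two other edge cases look worth guarding against, as far as I can tell: 12 AM should become hour 00, and printf '%02d' rejects 08 and 09 as invalid octal numbers. The sketch below handles both; to_24h is a hypothetical name, and this is one possible approach rather than the version in the class scripts.

```bash
to_24h() {
    local digits=$1 ampm=$2 h m
    case ${#digits} in
        3) h=${digits:0:1}; m=${digits:1:2} ;;
        4) h=${digits:0:2}; m=${digits:2:2} ;;
        *) echo "Bad time format" >&2; return 1 ;;
    esac
    # 10# forces base 10, so 08 and 09 are not read as (invalid) octal.
    h=$((10#$h))
    m=$((10#$m))
    if [[ $ampm == PM && $h -ne 12 ]]; then
        h=$((h + 12))          # 1 PM .. 11 PM become 13 .. 23
    elif [[ $ampm == AM && $h -eq 12 ]]; then
        h=0                    # 12 AM is midnight
    fi
    printf '%02d:%02d:00\n' "$h" "$m"
}

to_24h 1249 PM
to_24h 108 PM
to_24h 1205 AM
```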

Compartmentalization

Our script is getting out of hand. So, create a separate file to hold the functions, then source it:

nano scripts/script_08a.sh

cutoff=201710010000

source scripts/function.sh

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d) ; do
    for file in $(find $collection/md/ -type f) ; do
        extract_data $file
    done | sort -u
done

Run it:

bash scripts/script_08a.sh
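Why source and not bash? Running a file with bash defines its functions in a child shell that then exits, while source reads the file into the current shell. A small sketch with a throwaway file; shout is a hypothetical helper standing in for the real functions.

```bash
lib=$(mktemp)
cat > "$lib" <<'EOF'
# Hypothetical helper standing in for the functions kept in scripts/function.sh.
shout() { printf '%s\n' "$1" | tr '[:lower:]' '[:upper:]'; }
EOF

# Running the file in a child shell defines shout there, then the child exits.
bash "$lib"
if declare -F shout > /dev/null; then defined_after_run=yes; else defined_after_run=no; fi

# Sourcing reads the file into the current shell, so the function stays defined.
source "$lib"
if declare -F shout > /dev/null; then defined_after_source=yes; else defined_after_source=no; fi

echo "after bash:   $defined_after_run"
echo "after source: $defined_after_source"
shout gaithersburg
rm -f "$lib"
```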

Kind of messy, add a sort step:

nano scripts/script_08b.sh

cutoff=201710010000

source scripts/function.sh

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f) ; do
        extract_data $file
    done | sort -u
done

Run it:

bash scripts/script_08b.sh

Parallelization

It's kind of slow to parse each file, one after another. Instead, let's parse them in parallel:

nano scripts/script_09a.sh

cutoff=201710010000

source scripts/function.sh

file_array=()

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f -name "*.txt") ; do
        file_array+=($file)
    done
done

parallel --max-procs 8 extract_data ::: ${file_array[*]} | sort -u

Run it:

bash scripts/script_09a.sh

... except that this fails. parallel runs each job in a new shell, and that shell knows nothing about our functions or the cutoff variable. We need to export them to the environment:

nano scripts/script_09b.sh

export cutoff=201710010000

source scripts/function.sh
export -f timestr_to_clock
export -f parse_time_stamp
export -f extract_data

file_array=()

for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f -name "*.txt") ; do
        file_array+=($file)
    done
done

parallel --max-procs 8 extract_data ::: ${file_array[*]} | sort -u

Run it:

bash scripts/script_09b.sh
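The effect of export and export -f can be seen without parallel, by starting a child bash by hand. This is a minimal sketch; greet is a made-up function, and the child shell plays the role of the shells that parallel starts.

```bash
greet() { echo "hello from $1"; }
cutoff=201710010000

# A child bash does not inherit unexported functions or variables.
before=$(bash -c 'greet child' 2>/dev/null || echo "greet not found")

# export -f puts the function definition into the environment; export does the
# same for the variable.
export -f greet
export cutoff
after=$(bash -c 'greet child; echo "cutoff=$cutoff"')

echo "before export: $before"
echo "after export:  $after"
```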

How much speed up do we get?

time bash scripts/script_08b.sh > /dev/null

real    1m7.579s
user    0m37.302s
sys     0m46.612s

time bash scripts/script_09b.sh > /dev/null

real    0m9.910s
user    0m39.404s
sys     0m49.298s

Not quite 8-fold speed up, but pretty good nonetheless:

echo "scale=2;67.58/9.91" | bc
6.81
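If bc is not installed, the same ratio can be worked out with awk. The sketch below converts the two "real" values reported by time into seconds and divides the slower by the faster; to_seconds is a hypothetical helper. awk rounds to two decimals where bc with scale=2 truncates, so the last digit differs.

```bash
# Convert a value such as 1m7.579s into seconds.
to_seconds() {
    local t=${1%s}            # drop the trailing "s"
    local minutes=${t%%m*}    # text before the "m"
    local seconds=${t#*m}     # text after the "m"
    awk -v m="$minutes" -v s="$seconds" 'BEGIN { printf "%.3f\n", m * 60 + s }'
}

slower=$(to_seconds 1m7.579s)
faster=$(to_seconds 0m9.910s)
speedup=$(awk -v a="$slower" -v b="$faster" 'BEGIN { printf "%.2f\n", a / b }')

echo "slower run: ${slower}s, faster run: ${faster}s, speed up: ${speedup}x"
```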