This page teaches Bash scripting through a data-driven, task-oriented approach. Regional weather data was downloaded from NOAA over a period of about two weeks. The goal of the commands on this page is to parse that data using simple scripting techniques.
These lessons assume you have a Biowulf account and are working within a temporary, local scratch disk on a node. Any changes made to the downloaded files during the session will be lost unless the files are copied to your own /home or /data directory!
Getting started
Log into Biowulf:
ssh user@biowulf.nih.gov
Make sure you are running Bash:
echo $0
Allocate an interactive session with local scratch space so that I/O operations on the data are as fast as possible. This command assumes you are already logged into Biowulf. By default, you will have access to 2 CPUs and 1.5GB of RAM. The session will last for 8 hours.
sinteractive --gres=lscratch:5
Copy the data into the local scratch space.
cd /lscratch/$SLURM_JOB_ID
mkdir bash_class
cd bash_class
cp -R /data/classes/bash/* .
tar -C data -x -z -f data/weather_data.tgz
If you are not on Helix or Biowulf, you can download the data using either wget or curl.
wget https://hpc.nih.gov/training/handouts/BashScripting.tgz
curl https://hpc.nih.gov/training/handouts/BashScripting.tgz > BashScripting.tgz
Untar the data
tar -x -v -f BashScripting.tgz
tar -C data -x -z -f data/weather_data.tgz
Have a look at the data. There should be about 63MB of data, 71MB including examples and scripts.
ls
ls data/weather-2017-10-09-00-00-01/
ls data/weather-2017-10-09-00-00-01/md/
file data/weather-2017-10-09-00-00-01/md/mdz009.txt
stat data/weather-2017-10-09-00-00-01/md/mdz009.txt
tree data
cat data/weather-2017-10-09-00-00-01/md/mdz009.txt
Get down and dirty with the data
Run some simple and piped commands to display the temperature, humidity, and air pressure of a given city at a given time.
Look at all the data
cat data/weather-2017-10-10-06-00-01/md/mdz009.txt
Pull out only the lines that are for Gaithersburg
grep GAITHERSBURG data/weather-2017-10-10-06-00-01/md/mdz009.txt
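Search in lowercase this time. The first command should find nothing, because grep is case sensitive by default; --ignore-case fixes that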
grep gaithersburg data/weather-2017-10-10-06-00-01/md/mdz009.txt
grep --ignore-case gaithersburg data/weather-2017-10-10-06-00-01/md/mdz009.txt
Get all the timepoints for the data, using --recursive or -R
grep --recursive --ignore-case gaithersburg data/*/md
Get rid of the file information
grep --no-filename --recursive --ignore-case gaithersburg data/*/md
Sort the data
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort
Sort based on temperature - fail
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k3
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -nk3
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -rnk3
Sort using character position, rather than delimiter
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | cut -c1-14,26-27
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.26,1.27
Reverse the sort -- change ascending to descending
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.26,1.27r
Sort ascending by relative humidity -- fail because of leading space -- fix by treating as numeric
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n
Multi-column sort
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n -k 1.26,1.27r
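The character-position keys above depend on the fixed-width layout of the weather lines. Here is a minimal sketch, assuming GNU sort and three made-up rows laid out so that the temperature sits in columns 26-27 and the relative humidity in columns 33-35. The sample values and variable names are illustrative and are not taken from the data set.

```shell
#!/bin/bash
# Build three fixed-width rows: city (15), sky (10), temp (2), dew point (3), humidity (3).
rows=$(
    printf '%-15s%-10s%2s %3s %3s\n' 'GAITHERSBURG*' FOG     65 65 100
    printf '%-15s%-10s%2s %3s %3s\n' 'GAITHERSBURG*' MOSUNNY 83 64  52
    printf '%-15s%-10s%2s %3s %3s\n' 'GAITHERSBURG*' CLOUDY  70 60  71
)

# Sort on characters 33-35 (humidity). The "n" suffix compares them as numbers,
# so " 52" sorts before "100" despite the leading space.
by_humidity=$(printf '%s\n' "$rows" | sort -k 1.33,1.35n)

driest=$(printf '%s\n' "$by_humidity" | head -n 1)
most_humid=$(printf '%s\n' "$by_humidity" | tail -n 1)
printf 'driest:     %s\nmost humid: %s\n' "$driest" "$most_humid"
```

Without the n suffix the three characters are compared as text, which is the failure shown in the first humidity sort.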
Isolate the most humid moment in time
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n | tail
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n | tail -n 1
grep --recursive "FOG 65 65 100 CALM" data/*/md/*
grep --recursive "GAITHERSBURG* FOG 65 65 100 CALM" data/*/md/*
grep --recursive "GAITHERSBURG\* FOG 65 65 100 CALM" data/*/md/*
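The three searches above differ only in how the asterisk is treated: unescaped, it is a regular-expression quantifier rather than a literal character. A small sketch on a single made-up line (single spaces are used for brevity, and the variable names are illustrative). It also tries grep's -F option, which treats the whole pattern as a fixed string, as an alternative to escaping.

```shell
#!/bin/bash
line='GAITHERSBURG* FOG 65 65 100 CALM'

# Unescaped: "G*" means "zero or more G", so the literal "*" in the text is never matched.
as_regex=$(printf '%s\n' "$line" | grep -c 'GAITHERSBURG* FOG' || true)

# Escaped: "\*" matches a literal asterisk.
escaped=$(printf '%s\n' "$line" | grep -c 'GAITHERSBURG\* FOG')

# Fixed string: -F switches off regular-expression syntax altogether.
fixed=$(printf '%s\n' "$line" | grep -c -F 'GAITHERSBURG* FOG')

printf 'unescaped: %s  escaped: %s  fixed string: %s\n' "$as_regex" "$escaped" "$fixed"
```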
Isolate the least humid moment in time
grep --no-filename --recursive --ignore-case gaithersburg data/*/md | sort -k 1.33,1.35n | head -n 1
grep --recursive "GAITHERSBURG\* MOSUNNY 83 64 52 S15 30.02F" data/*/md/*
grep --recursive 'GAITHERSBURG\* MOSUNNY 83 64 52 S15 30.02F' data/*/md/*
Get the average temperature
grep -h -r -i gaithersburg data/*/md | cut -c26-27 | awk '{sum+=$1; num++ } END { print sum,num,sum/num}'
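The awk one-liner above divides by the number of lines read, so a search that matches nothing would end in a division by zero. A small sketch with a guard follows; the function name average and the three sample temperatures are made up for illustration.

```shell
#!/bin/bash
# Read one number per line on standard input and report the count and the mean.
average() {
    awk '{ sum += $1; num++ }
         END {
             if (num > 0) {
                 printf "%d readings, mean %.2f\n", num, sum / num
             } else {
                 print "no readings"
             }
         }'
}

printf '65\n83\n70\n' | average
printf '' | average
```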
Scripting
Had enough of grep? Had enough of typing again and again? Let's create a script. nano is a simple text-based file editor that is intuitive and easy to use.
nano scripts/script_01a.sh
grep -h -r -i -B100 gaithersburg data/* | grep 'M EDT'
grep -h -r -i gaithersburg data/*
Now run the script.
bash scripts/script_01a.sh
This should print a long list: all the timestamps first, then all the weather lines. We want to paste the timestamps onto the weather data. Open the file again and edit it, redirecting the output of each command to a file and adding the paste command.
nano scripts/script_01b.sh
grep -h -r -i -B100 gaithersburg data/* | grep 'M EDT' > 1
grep -h -r -i gaithersburg data/* > 2
paste 1 2
Now it looks nicer.
bash scripts/script_01b.sh
Reduce, reuse, recycle
Rather than edit script files, it would be easier to pass the name of a city into the script to get the data.
nano scripts/script_02a.sh
grep -h -r -i -B100 $1 data/* | grep 'M EDT' > 1
grep -h -r -i $1 data/* > 2
paste 1 2
Now pass a city name as an argument to the script.
bash scripts/script_02a.sh gaithersburg
bash scripts/script_02a.sh leesburg
bash scripts/script_02a.sh manassas
bash scripts/script_02a.sh dulles
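An unquoted $1 works for one-word cities, but a name with a space would be split into a pattern and a file name. A hedged sketch of a more defensive variant follows. The function name find_city, the usage message and the throwaway sample file are assumptions made for this example and are not part of the class scripts.

```shell
#!/bin/bash
# Print every line under DIR that mentions CITY (case insensitive).
find_city() {
    local city=${1:?usage: find_city CITY DIR}   # stop with a message if no city is given
    local dir=${2:-data}                         # default to the data directory
    # Quoting keeps a two-word city such as "bwi airport" as one pattern.
    grep -h -r -i -- "$city" "$dir"
}

# Throwaway demonstration data (made up, not the NOAA files).
demo=$(mktemp -d)
printf 'BWI AIRPORT    MOCLDY    70\nANNAPOLIS      CLEAR     65\n' > "$demo/sample.txt"

match=$(find_city "bwi airport" "$demo")
printf '%s\n' "$match"
rm -rf "$demo"
```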
Walking the data
Use for loops to generate tables of the data.
nano scripts/script_03a.sh
for city in gaithersburg leesburg manassas dulles
do
    grep -h -r -i -B100 $city data/* | grep 'M EDT' > 1
    grep -h -r -i $city data/* > 2
    paste 1 2
done
bash scripts/script_03a.sh | sort -k6,6n -k2,2 -k1,1n
nano scripts/script_03b.sh
for city in {gaithersburg,leesburg,manassas,dulles}
do
    grep -h -r -i -B100 $city data/* | grep 'M EDT' > 1
    grep -h -r -i $city data/* > 2
    paste 1 2
done
bash scripts/script_03b.sh | sort -k6,6n -k2,2 -k1,1n
Parsing the data
A while read loop is the usual way to walk through a single file one line at a time. The file is redirected into the loop's standard input (STDIN).
while read line ; do echo $line ; done < data/weather-2017-10-09-00-00-01/md/mdz009.txt
while read line ; do echo "$line" ; done < data/weather-2017-10-09-00-00-01/md/mdz009.txt
while read line ; do echo "--- $line ---" ; done < data/weather-2017-10-09-00-00-01/md/mdz009.txt
Create a script to do this.
nano scripts/script_04a.sh
while read line
do
    echo "--- $line ---"
done
nano scripts/script_04b.sh
while read line
do
    echo "---${line}---"
done
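One further detail of these loops: a plain read strips leading and trailing whitespace and treats backslashes as escape characters. Where the exact content of a line matters, IFS= read -r is commonly used instead. A small sketch comparing the two on a made-up two-line file (the sample text and variable names are illustrative):

```shell
#!/bin/bash
sample=$(mktemp)
# Line 1 has leading spaces; line 2 contains a backslash.
printf '   GAITHERSBURG*  FOG\nC:\\temp\n' > "$sample"

# Plain read: surrounding whitespace is trimmed and the backslash is consumed.
plain=$(while read line; do printf '[%s]\n' "$line"; done < "$sample")

# IFS= keeps surrounding whitespace; -r keeps backslashes as they are.
exact=$(while IFS= read -r line; do printf '[%s]\n' "$line"; done < "$sample")
rm -f "$sample"

printf 'read:\n%s\nIFS= read -r:\n%s\n' "$plain" "$exact"
```

For the weather files the trimming is harmless and sometimes even convenient, which is why the scripts on this page keep the shorter form.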
Dealing with adversity
Use conditionals to handle unknowns and odd issues.
Only grab the data below "CITY"
nano scripts/script_04c.sh
while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
    fi
    if [[ -n $good ]]; then
        echo " ---${line}---- "
    fi
done
bash scripts/script_04c.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt
Don't print "CITY"
nano scripts/script_04d.sh
while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        echo " ---${line}---- "
    fi
done
bash scripts/script_04d.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt
Get rid of remainder
nano scripts/script_04e.sh
while read line
do
    if ( echo "$line" | grep -q ^CITY ) ; then
        good=1
        continue
    fi
    if [[ -n $good ]]; then
        if ( echo "$line" | grep -q "^$" ) ; then
            exit
        else
            echo " ---${line}---- "
        fi
    fi
done
bash scripts/script_04e.sh < data/weather-2017-10-16-06-00-01/md/mdz009.txt
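The same state-flag logic can be written with Bash's own pattern matching, which avoids starting an echo and a grep for every line. This is a sketch of an alternative, not the class's script. The function name rows_below_city and the sample text are made up, and break replaces exit so that only the loop ends.

```shell
#!/bin/bash
# Print the lines between the "CITY" header and the next blank line.
rows_below_city() {
    local line good=""
    # Plain "read" trimming is kept on purpose so whitespace-only lines count as blank.
    while read -r line; do
        if [[ $line == CITY* ]]; then
            good=1                  # header found: start printing from the next line
            continue
        fi
        [[ -n $good ]] || continue  # still above the header
        [[ -z $line ]] && break     # first blank line ends the table
        printf '%s\n' "$line"
    done
}

sample='Some preamble text
CITY           SKY/WX    TMP
ANNAPOLIS      CLEAR     65
BWI AIRPORT    MOCLDY    70

Trailing remarks'

table=$(rows_below_city <<< "$sample")
printf '%s\n' "$table"
```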
Yuck, get rid of '$$' as well
nano scripts/script_04f.sh
this won't work
if ( echo "$line" | grep -q "$$" ) ; then
exit
else
this will, but we're still left with blank lines
if ( echo "$line" | grep -q '$$' ) ; then
exit
else
more cleverness
if ( echo "$line" | grep -q '$$' ) ; then
exit
elif ( echo "$line" | grep -q "^$" ) ; then
exit
else
even more cleverness
if [[ "$line" =~ [[:alpha:]] ]] ; then
echo " ---${line}---- "
fi
now walk all the files
for file in data/weather-2017-10-10-00-00-02/md/* ; do bash scripts/script_04f.sh < $file ; done
Wot? Repeats?
for file in data/weather-2017-10-10-00-00-02/md/* ; do bash scripts/script_04f.sh < $file ; done | sort -u
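The reason the double-quoted "$$" test does not work is that Bash expands $$ to the process ID of the running shell before grep ever sees it. A short sketch of the difference (the variable names are illustrative). It ends with a literal string comparison, which sidesteps regular-expression rules altogether.

```shell
#!/bin/bash
double_quoted="$$"   # expands to the process ID of the running shell
single_quoted='$$'   # stays as two literal dollar signs

printf 'double quotes: %s\nsingle quotes: %s\n' "$double_quoted" "$single_quoted"

# Compare a line against the literal two-character string.
line='$$'
if [[ $line == '$$' ]]; then
    verdict="marker"
else
    verdict="ordinary"
fi
printf 'line is: %s\n' "$verdict"
```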
Regular expressions
Huh? Looking through the output shows some stations listed more than once, with different readings:
---ANNAPOLIS CLEAR 65 48 54 VRB7 29.65F----
---ANNAPOLIS CLEAR 80 77 90 CALM 29.98R----
---BWI AIRPORT MOCLDY 70 49 47 S12G20 29.63F----
---BWI AIRPORT PTCLDY 75 75 100 CALM 29.98R----
---MD SCIENCE CTR N/A 60 46 59 MISG 29.64F----
---MD SCIENCE CTR N/A 79 75 87 MISG 29.97R----
Use the Expires tag to filter out very old data:
nano scripts/script_05a.sh
regex="^Expires:([0-9]{4})([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})"
cutoff=201710010000
while read line
do
if [[ "$line" =~ $regex ]]; then
echo year=${BASH_REMATCH[1]}
echo month=${BASH_REMATCH[2]}
echo day=${BASH_REMATCH[3]}
echo hour=${BASH_REMATCH[4]}
echo minute=${BASH_REMATCH[5]}
if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}${BASH_REMATCH[2]}${BASH_REMATCH[3]}${BASH_REMATCH[4]}${BASH_REMATCH[5]}" ]]; then
exit
fi
fi
if ( echo "$line" | grep -q ^CITY ) ; then
good=1
continue
fi
if [[ -n $good ]]; then
if [[ "$line" =~ [[:alpha:]] ]] ; then
echo " ---${line}---- "
fi
fi
done
for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -1000) ; do bash scripts/script_05a.sh < $file ; done | sort -u
Simpler:
nano scripts/script_05b.sh
regex="^Expires:([0-9]{12})"
cutoff=201710010000
while read line
do
if [[ "$line" =~ $regex ]]; then
if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
exit
fi
fi
if ( echo "$line" | grep -q ^CITY ) ; then
good=1
continue
fi
if [[ -n $good ]]; then
if [[ "$line" =~ [[:alpha:]] ]] ; then
echo " ---${line}---- "
fi
fi
done
for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -1000) ; do bash scripts/script_05b.sh < $file ; done | sort -u
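The regular-expression test can be pulled into small helpers to see what BASH_REMATCH holds after a match. This is a sketch only. The function names expires_stamp and is_stale are hypothetical, and the sample lines are made up to fit the pattern used above rather than copied from the data.

```shell
#!/bin/bash
# Print the 12-digit stamp from an "Expires:" line; fail if the line does not match.
expires_stamp() {
    local regex='^Expires:([0-9]{12})'
    if [[ $1 =~ $regex ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"   # [0] is the whole match, [1] the first group
    else
        return 1
    fi
}

# Succeed only when the line carries a stamp older than the cutoff.
is_stale() {
    local stamp cutoff=${2:-201710010000}
    stamp=$(expires_stamp "$1") || return 1
    (( 10#$stamp < 10#$cutoff ))             # 10# forces base 10
}

for line in 'Expires:201709300415;;000000' 'Expires:201710100415;;000000' 'CITY SKY/WX'; do
    if is_stale "$line"; then
        printf 'stale: %s\n' "$line"
    else
        printf 'keep:  %s\n' "$line"
    fi
done
```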
Automation
Create a function to automate what we did:
nano scripts/script_06a.sh
cutoff=201710010000
function extract_data
{
regex="^Expires:([0-9]{12})"
while read line
do
if [[ "$line" =~ $regex ]]; then
if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
exit
fi
fi
if ( echo "$line" | grep -q ^CITY ) ; then
good=1
continue
fi
if [[ -n $good ]]; then
if [[ "$line" =~ [[:alpha:]] ]] ; then
if [[ -z $value ]]; then
value=${line}
else
value=${value}$'\n'${line}
fi
fi
fi
done < $1
echo "$value"
}
extract_data $1
for file in $(find data/weather-2017-10-10-00-00-02/md/ -type f -mtime -1000) ; do bash scripts/script_06a.sh $file ; done | sort -u
That's good, but what if we want all the data for a given time and state?
nano scripts/script_06b.sh
cutoff=201710010000
function extract_data
{
regex="^Expires:([0-9]{12})"
unset good
unset value
while read line
do
if [[ "$line" =~ $regex ]]; then
if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
return
fi
fi
if [[ "$line" =~ '$$' ]] ; then
unset good
fi
if ( echo "$line" | grep -q ^CITY ) ; then
good=1
continue
fi
if [[ -n $good ]]; then
if [[ "$line" =~ [[:alpha:]] ]] ; then
if [[ -z $value ]]; then
value=${line}
else
value=${value}$'\n'${line}
fi
fi
fi
done < $1
[[ -n $value ]] && echo "$value"
}
for file in $(find $1 -type f)
do
extract_data $file
done | sort -u
bash scripts/script_06b.sh data/weather-2017-10-10-00-00-02/az
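Note that script_06b.sh uses return where script_06a.sh used exit. The function is now called once per file inside a loop, and exit would end the whole loop at the first expired file, whereas return only leaves the function. A minimal sketch of that behaviour, with a hypothetical function keep_if_recent and made-up stamps:

```shell
#!/bin/bash
cutoff=201710010000
kept=()

# Append the stamp to the "kept" array unless it is older than the cutoff.
keep_if_recent() {
    local stamp=$1
    if (( stamp < cutoff )); then
        return 0      # leaves this function only; the caller's loop carries on
    fi
    kept+=("$stamp")
}

for stamp in 201709300415 201710100415 201710110415; do
    keep_if_recent "$stamp"
done

printf 'kept %d of 3: %s\n' "${#kept[@]}" "${kept[*]}"
```

Replacing return with exit in this sketch would stop the script at the first stamp, before the printf line is reached.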
Selectivity and format
Getting back to time stamps. How can we format time?
nano scripts/script_07a.sh
cutoff=201710010000
function month_str_to_num {
# Convert month_3char to numeric
case $1 in
"JAN") echo 01 ;;
"FEB") echo 02 ;;
"MAR") echo 03 ;;
"APR") echo 04 ;;
"MAY") echo 05 ;;
"JUN") echo 06 ;;
"JUL") echo 07 ;;
"AUG") echo 08 ;;
"SEP") echo 09 ;;
"OCT") echo 10 ;;
"NOV") echo 11 ;;
"DEC") echo 12 ;;
*) { echo Bad month format; exit 1; } ;;
esac
}
function timestr_to_clock {
local c=$1
local ap=$2
local h=""
local m=""
if [[ ${#c} == 3 ]]; then
h=${c:0:1}
m=${c:1:2}
elif [[ ${#c} == 4 ]]; then
h=${c:0:2}
m=${c:2:2}
else
{ echo Bad time format; exit 1; }
fi
if [[ "$ap" == "PM" ]]; then
((h+=12))
fi
echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}
function parse_time_stamp {
local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
if [[ "$1" =~ $regex ]]; then
local hour_min=${BASH_REMATCH[1]}
local am_pm=${BASH_REMATCH[2]}
local timezone=${BASH_REMATCH[3]}
local day_of_the_week=${BASH_REMATCH[4]}
local month_3char=${BASH_REMATCH[5]}
local day=${BASH_REMATCH[6]}
local year=${BASH_REMATCH[7]}
local month=$(month_str_to_num $month_3char)
local clock=$(timestr_to_clock $hour_min $am_pm)
echo "${year}-${month}-${day}T${clock}"
else
{ echo Bad timestamp format; exit 1; }
fi
}
function extract_data
{
regex="^Expires:([0-9]{12})"
timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
unset good
unset value
unset timestr
while read line
do
if [[ "$line" =~ $regex ]]; then
if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
return
fi
fi
if [[ "$line" =~ $timestamp ]]; then
timestr=$(parse_time_stamp "$line")
fi
if [[ "$line" =~ '$$' ]] ; then
unset good
fi
if ( echo "$line" | grep -q ^CITY ) ; then
good=1
continue
fi
if [[ -n $good ]]; then
if [[ "$line" =~ [[:alpha:]] ]] ; then
if [[ -z $value ]]; then
value="${timestr} ${line}"
else
value=${value}$'\n'${timestr}$' '${line}
fi
fi
fi
done < $1
[[ -n $value ]] && echo "$value"
}
for file in $(find $1 -type f)
do
extract_data $file
done | sort -u
bash scripts/script_07a.sh data/weather-2017-10-06-12-49-01/nc
Simplify. The date command can parse time, somewhat:
nano scripts/script_07b.sh
cutoff=201710010000
function timestr_to_clock {
local c=$1
local ap=$2
local h=""
local m=""
if [[ ${#c} == 3 ]]; then
h=${c:0:1}
m=${c:1:2}
elif [[ ${#c} == 4 ]]; then
h=${c:0:2}
m=${c:2:2}
else
{ echo Bad time format; exit 1; }
fi
if [[ "$ap" == "PM" ]]; then
((h+=12))
fi
echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}
function parse_time_stamp {
local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
if [[ "$1" =~ $regex ]]; then
local hour_min=${BASH_REMATCH[1]}
local am_pm=${BASH_REMATCH[2]}
local timezone=${BASH_REMATCH[3]}
local day_of_the_week=${BASH_REMATCH[4]}
local month_3char=${BASH_REMATCH[5]}
local day=${BASH_REMATCH[6]}
local year=${BASH_REMATCH[7]}
local clock=$(timestr_to_clock $hour_min $am_pm)
echo $(date -d "$day_of_the_week $month_3char $day $clock $timezone $year" +"%FT%T")
else
{ echo Bad timestamp format; exit 1; }
fi
}
function extract_data
{
regex="^Expires:([0-9]{12})"
timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
unset good
unset value
unset timestr
while read line
do
if [[ "$line" =~ $regex ]]; then
if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
return
fi
fi
if [[ "$line" =~ $timestamp ]]; then
timestr=$(parse_time_stamp "$line")
fi
if [[ "$line" =~ '$$' ]] ; then
unset good
fi
if ( echo "$line" | grep -q ^CITY ) ; then
good=1
continue
fi
if [[ -n $good ]]; then
if [[ "$line" =~ [[:alpha:]] ]] ; then
if [[ -z $value ]]; then
value=${timestr}$' '${line}
else
value=${value}$'\n'${timestr}$' '${line}
fi
fi
fi
done < $1
[[ -n $value ]] && echo "$value"
}
for file in $(find $1 -type f)
do
extract_data $file
done | sort -u
bash scripts/script_07b.sh data/weather-2017-10-06-12-49-01/nc
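Stupid bug: adding 12 to every PM hour turns 12 PM into hour 24. Skip the addition when the hour is already 12: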
nano scripts/script_07c.sh
cutoff=201710010000
function timestr_to_clock {
local c=$1
local ap=$2
local h=""
local m=""
if [[ ${#c} == 3 ]]; then
h=${c:0:1}
m=${c:1:2}
elif [[ ${#c} == 4 ]]; then
h=${c:0:2}
m=${c:2:2}
else
{ echo Bad time format; exit 1; }
fi
if [[ "$ap" == "PM" ]]; then
if [[ $h != 12 ]]; then
((h+=12))
fi
fi
echo "$(printf '%02d' $h):$(printf '%02d' $m):00"
}
function parse_time_stamp {
local regex="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
if [[ "$1" =~ $regex ]]; then
local hour_min=${BASH_REMATCH[1]}
local am_pm=${BASH_REMATCH[2]}
local timezone=${BASH_REMATCH[3]}
local day_of_the_week=${BASH_REMATCH[4]}
local month_3char=${BASH_REMATCH[5]}
local day=${BASH_REMATCH[6]}
local year=${BASH_REMATCH[7]}
local clock=$(timestr_to_clock $hour_min $am_pm)
echo $(date -d "$day_of_the_week $month_3char $day $clock $timezone $year" +"%FT%T")
else
{ echo Bad timestamp format; exit 1; }
fi
}
function extract_data
{
regex="^Expires:([0-9]{12})"
timestamp="^([0-9]{3}[0-9]*) ([AP]M) ([A-Z]{3}) ([A-Z]{3}) ([A-Z]{3}) ([0-9]{2}) ([0-9]{4})$"
unset good
unset value
unset timestr
while read line
do
if [[ "$line" =~ $regex ]]; then
if [[ "${cutoff}" -gt "${BASH_REMATCH[1]}" ]]; then
return
fi
fi
if [[ "$line" =~ $timestamp ]]; then
timestr=$(parse_time_stamp "$line")
fi
if [[ "$line" =~ '$$' ]] ; then
unset good
fi
if ( echo "$line" | grep -q ^CITY ) ; then
good=1
continue
fi
if [[ -n $good ]]; then
if [[ "$line" =~ [[:alpha:]] ]] ; then
if [[ -z $value ]]; then
value=${timestr}$' '${line}
else
value=${value}$'\n'${timestr}$' '${line}
fi
fi
fi
done < $1
[[ -n $value ]] && echo "$value"
}
for file in $(find $1 -type f)
do
extract_data $file
done | sort -u
bash scripts/script_07c.sh data/weather-2017-10-06-12-49-01/nc
What changed?
diff scripts/script_07c.sh scripts/script_07b.sh
20,24c20,22
< if [[ "$ap" == "PM" ]]; then
< if [[ $h != 12 ]]; then
< ((h+=12))
< fi
< fi
---
> if [[ "$ap" == "PM" ]]; then
> ((h+=12))
> fi
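The diff shows the fix for 12 PM. Two related edge cases may still be worth guarding against. First, 12 AM (midnight) would come out as 12:00:00 rather than 00:00:00. Second, a minute value such as 08 or 09 is read by printf '%02d' as an invalid octal number. Here is a hedged sketch of a stricter variant, under the hypothetical name to_24h so that it is not confused with the class's timestr_to_clock:

```shell
#!/bin/bash
# Convert "HMM" or "HHMM" plus AM/PM into HH:MM:00 on a 24-hour clock.
to_24h() {
    local c=$1 ap=$2 h m
    case ${#c} in
        3) h=${c:0:1}; m=${c:1:2} ;;
        4) h=${c:0:2}; m=${c:2:2} ;;
        *) echo "Bad time format: $c" >&2; return 1 ;;
    esac
    # Force base 10 so that "08" and "09" are not rejected as octal numbers.
    h=$((10#$h))
    m=$((10#$m))
    if [[ $ap == PM && $h -ne 12 ]]; then
        h=$((h + 12))     # 1 PM to 11 PM become 13 to 23
    elif [[ $ap == AM && $h -eq 12 ]]; then
        h=0               # 12 AM is midnight
    fi
    printf '%02d:%02d:00\n' "$h" "$m"
}

for t in "1200 AM" "1200 PM" "108 PM" "959 AM"; do
    # $t is left unquoted on purpose so it splits into the two arguments.
    printf '%-8s -> %s\n' "$t" "$(to_24h $t)"
done
```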
Compartmentalization
Our script is getting out of hand. So, create a separate file to hold the functions, then source it:
nano scripts/script_08a.sh
cutoff=201710010000
source scripts/function.sh
for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d) ; do
    for file in $(find $collection/md/ -type f) ; do
        extract_data $file
    done | sort -u
done
Run it:
bash scripts/script_08a.sh
Kind of messy, add a sort step:
nano scripts/script_08b.sh
cutoff=201710010000
source scripts/function.sh
for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f) ; do
        extract_data $file
    done | sort -u
done
Run it:
bash scripts/script_08b.sh
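Looping over $(find ...) splits the output on whitespace. That is fine for the class data, whose paths contain no spaces, but it would break on names that do. A sketch of a more defensive pattern, assuming GNU find and sort, follows. The temporary directory and its file names are invented for the demonstration.

```shell
#!/bin/bash
# Throwaway tree with spaces in the directory names.
demo=$(mktemp -d)
mkdir -p "$demo/weather b/md" "$demo/weather a/md"
touch "$demo/weather a/md/mdz009.txt" "$demo/weather b/md/mdz009.txt"

# NUL-separated names survive spaces; sort -z keeps the NUL separators.
files=()
while IFS= read -r -d '' file; do
    files+=("$file")
done < <(find "$demo" -type f -name '*.txt' -print0 | sort -z)

printf 'found %d files\n' "${#files[@]}"
printf '  %s\n' "${files[@]#"$demo"/}"   # print paths relative to the demo directory
rm -rf "$demo"
```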
Parallelization
It's kind of slow to parse each file, one after another. Instead, let's parse them in parallel:
nano scripts/script_09a.sh
cutoff=201710010000
source scripts/function.sh
file_array=()
for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f -name "*.txt") ; do
        file_array+=("$file")
    done
done
parallel --max-procs 8 extract_data ::: "${file_array[@]}" | sort -u
Run it:
bash scripts/script_09a.sh
... except that this fails. The shells started by parallel cannot see functions or variables that exist only inside our script, so we need to export them to the environment:
nano scripts/script_09b.sh
export cutoff=201710010000
source scripts/function.sh
export -f timestr_to_clock
export -f parse_time_stamp
export -f extract_data
file_array=()
for collection in $(find data/ -maxdepth 1 -mindepth 1 -type d | sort) ; do
    for file in $(find $collection/md/ -type f -name "*.txt") ; do
        file_array+=("$file")
    done
done
parallel --max-procs 8 extract_data ::: "${file_array[@]}" | sort -u
Run it:
bash scripts/script_09b.sh
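The effect of export -f can be seen without GNU parallel by starting a child shell directly. In this sketch the function parse_one is a made-up stand-in for extract_data. The last step uses xargs -P as a substitute for parallel, purely so that the example runs where parallel is not installed.

```shell
#!/bin/bash
parse_one() { printf 'parsed %s\n' "$1"; }

# A child bash does not inherit ordinary functions ...
before=$(bash -c 'parse_one a.txt' 2>/dev/null || echo "not found")

# ... but it does inherit exported ones.
export -f parse_one
after=$(bash -c 'parse_one a.txt')
printf 'before export: %s\nafter export:  %s\n' "$before" "$after"

# Run two children at a time; sort puts the output back in a stable order.
files=(a.txt b.txt c.txt)
result=$(printf '%s\0' "${files[@]}" | xargs -0 -n 1 -P 2 bash -c 'parse_one "$1"' _ | sort)
printf '%s\n' "$result"
```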
How much speed up do we get?
time bash scripts/script_08b.sh > /dev/null
real 1m7.579s
user 0m37.302s
sys 0m46.612s
time bash scripts/script_09b.sh > /dev/null
real 0m9.910s
user 0m39.404s
sys 0m49.298s
Not quite 8-fold speed up, but pretty good nonetheless:
echo "scale=2;67.58/9.91" | bc
6.81
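The same arithmetic can be done straight from the "real" values that time prints, which avoids retyping the numbers. This is a sketch that takes the longer wall-clock time as the serial run. The function name to_seconds is hypothetical. Note that bc with scale=2 truncates the quotient, while printf rounds it, so the last digit differs.

```shell
#!/bin/bash
# Convert a "real" value such as 1m7.579s into seconds.
to_seconds() {
    local regex='^([0-9]+)m([0-9.]+)s$'
    [[ $1 =~ $regex ]] || return 1
    awk -v m="${BASH_REMATCH[1]}" -v s="${BASH_REMATCH[2]}" \
        'BEGIN { printf "%.3f\n", m * 60 + s }'
}

serial=$(to_seconds 1m7.579s)
parallel_time=$(to_seconds 0m9.910s)
speedup=$(awk -v a="$serial" -v b="$parallel_time" 'BEGIN { printf "%.2f\n", a / b }')

printf 'serial %ss, parallel %ss, speed up %sx\n' "$serial" "$parallel_time" "$speedup"
```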