Introduction to Shell String Processing: A Practical Example with OONI Probe
One of the most powerful features of the command line is that you can quickly put together scripts for relatively complex text processing. Given the right commands, you can reshape a program's output into whatever format you like.
In this tutorial, we'll use a command line program called OONI Probe to test which websites are blocked by your ISP, and then use `grep`, `awk`, and regular expressions to parse the output and make a list of all the blocked URLs.
Install OONI Probe
The Open Observatory of Network Interference (OONI) produces a tool called OONI Probe, which tests the blocking of various websites and apps. It has an app version and a much more comprehensive CLI version. We’ll be using the latter.
Note: If you live in a country with an authoritarian regime, or somewhere that you may be persecuted for the websites you visit, you should not run OONI Probe. The built-in tests will test connection to pornographic, gambling, messaging, and LGBT sites. If accessing these would get you in trouble, you may want to run the test in a way that can’t be traced to you, or not run it at all. You probably shouldn’t run it at work if your internet connection is monitored.
Install OONI Probe CLI using the instructions on their site. This requires a Unix-like system of some sort, so Windows users will have to find an alternative solution.
Run the test and view results
Run the default "websites" test using `ooniprobe run websites`. This will test a very long list of approximately 1700 websites, and takes just over an hour. After the test is done, run `ooniprobe list` to list the results of all tests. Look at the number of the most recent one — it will have a header that says something like `#1 — 31 October 2023 10:00 UTC`.
To view the detailed results of that test, run `ooniprobe list <number>`, where `<number>` is the index number of your test. This will print out a list detailing each website, what category the website is in, whether it was blocked, and sometimes how it was blocked. For example,
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
│ #1799 │
│ http://www.runescape.com/ (GAME) │
│ web_connectivity ok: ✓ │
│ success: ✓ uploaded: ❌ │
│ {"accessible": true, │
│ "blocking": ""} │
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
In the above example, you can see that website #1799 is http://www.runescape.com/, and that `(GAME)` indicates that it's in the game category. According to these results, it connected with no issues, which is denoted by `"accessible": true`. Unfortunately, if you scroll up, you probably won't be able to see all of the results — they almost certainly exceed your terminal's scrollback length. While you could use `ooniprobe list X > file.txt` to redirect the output to a file, the results would still be massive. In this case, we don't care about which sites are allowed; we only care about which ones are blocked.
Preliminary Filtering with grep
`grep` is a command line utility that can search a file or an input stream for a pattern or string. It prints out every line that matches the pattern. Searching a file is simple: just `grep <pattern> <file>`. For example, `grep password file.txt` will print out every line in `file.txt` with the string "password" in it. You can add `-i` to make the search case insensitive, like `grep -i password file.txt`.
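To see the difference `-i` makes, you can try it on a throwaway file; the filename and contents below are made up for illustration:

```shell
# Create a small sample file (contents are made up for this example)
printf 'user: alice\nPASSWORD: hunter2\nnotes: none\n' > sample.txt

# A case-insensitive search matches the line even though the case differs
grep -i password sample.txt   # prints: PASSWORD: hunter2
```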
Searching an input stream may not be something you're used to, but it's an extremely useful command line feature that lets you "pipe" the output of one command into the input of another. If you don't specify a file, `grep` will expect input from stdin (read as "standard in"). Piping is done by using the pipe operator `|` (hold shift and then type backslash on most US keyboards) between two commands: `command1 | command2` will pipe the output of command 1 into command 2. Command 1 writes its output to stdout ("standard out"), and that output goes into command 2's stdin.
For example, using `cat` to print the contents of a file to stdout, we can use `cat file.txt | grep password` to have `grep` search the input stream from `cat` for the string "password".
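You can see piping in action without any file at all — `echo` writes to stdout, and the pipe feeds that straight into `grep`'s stdin:

```shell
# echo writes to stdout; the pipe sends that into grep's stdin
echo "my password is secret" | grep password   # prints: my password is secret
```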
Now, we want to use `grep` to find only the lines of OONI Probe results pertaining to blocked websites. Since the results include either `"accessible": true` or `"accessible": false` based on whether the website is accessible, and there are no other booleans printed in the results, we can just search for "false" to find the lines pertaining to blocked websites.
ooniprobe list X | grep false
This will only list the exact lines containing “false”, which is not very useful.
│ {"accessible": false, │
│ {"accessible": false, │
│ {"accessible": false, │
Thankfully, it's quite easy to fix. `grep` accepts the optional flags `-A <number>` and `-B <number>`, which specify the number of lines to print after and before each matching line, respectively. If we modify our command to `ooniprobe list X | grep false -A 2 -B 5`, requesting five lines before and two lines after each match, we get the full entries for each blocked site.
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
│ #1683 │
│ http://www.delicates.co.uk/ (PROV) │
│ web_connectivity ok: ✓ │
│ success: ✓ uploaded: ❌ │
│ {"accessible": false, │
│ "blocking": ""} │
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
--
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
│ #1725 │
│ http://www.usacasino.com/ (GMB) │
│ web_connectivity ok: ✓ │
│ success: ✓ uploaded: ❌ │
│ {"accessible": false, │
│ "blocking": ""} │
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
--
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
│ #1800 │
│ http://www.securityfocus.com/ (HACK) │
│ web_connectivity ok: ❌ │
│ success: ✓ uploaded: ❌ │
│ {"accessible": false, │
│ "blocking": "dns"} │
┗━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┛
Now, to put this cleaned-up output into a text file, all we have to do is add `> file.txt` to the end of our command. If you want to append the information to an existing file, use `>>` instead of `>`.
ooniprobe list X | grep false -A 2 -B 5 > file.txt
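The difference between the two redirect operators is easy to see with a throwaway file (the filename and strings here are made up for illustration):

```shell
# ">" truncates (or creates) the file; ">>" appends to whatever is there
echo "first run" > results.txt
echo "second run" >> results.txt
cat results.txt
# first run
# second run
```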
Further Filtering with grep, awk, and regex
It may be useful to have a list of just the blocked URLs, for example if you wanted to run `ooniprobe` on just those URLs to check whether they're still blocked. We can accomplish this easily enough with further usage of `grep` and a regular expression (regex), as well as using `awk`.
First, we'll search the file with the blocked sites for the lines that have links. We can search for the lines with "http" using `grep http file.txt`.
│ https://www.orangewebsite.com/ (HOST) │
│ "blocking": "http-failure"} │
│ http://www.upci.org/ (REL) │
│ http://kremlin.ru/ (GOVT) │
│ http://www.delicates.co.uk/ (PROV) │
│ http://www.usacasino.com/ (GMB) │
│ http://www.securityfocus.com/ (HACK) │
This gets us the lines with links in them, but it has also printed lines indicating a blocking type of "http-failure", which we don't want. We could search for "http:" instead, but that would exclude secure "https:" links. We can solve this with a regular expression.
Regular expressions are a way to search for strings that match a certain pattern, rather than containing an exact substring. For example, you can use brackets to indicate a set of characters that can be matched at a certain position: `[bcf]at` will match any three-character substring starting with b, c, or f, immediately followed by "at". It will match bat, cat, and fat, but also batter, because batter contains the substring "bat". In this case, we want to use `?`, which in regex indicates that the preceding character is optional. We can construct the regex `https?:`, which makes the "s" character optional, so it will match both "https:" and "http:".
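Before wiring the regex into the pipeline, you can check it against a few sample lines; the URLs and the "http-failure" line below are made up to mimic the OONI output:

```shell
# The first two lines contain links; the third is a "blocking" line that
# contains "http-" but no "http:"/"https:", so the regex should skip it
printf 'http://example.org/\nhttps://example.com/\n"blocking": "http-failure"}\n' \
  | grep -E "https?:"
# http://example.org/
# https://example.com/
```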
We can pass this regex to `grep` with the `-E` flag, which enables extended regular expressions, to make the command `grep -E "https?:" file.txt`. (Don't forget to enclose the regex in quotes so that the shell passes it to `grep` intact.)
│ https://www.orangewebsite.com/ (HOST) │
│ http://www.upci.org/ (REL) │
│ http://kremlin.ru/ (GOVT) │
│ http://www.delicates.co.uk/ (PROV) │
│ http://www.usacasino.com/ (GMB) │
│ http://www.securityfocus.com/ (HACK) │
We've successfully filtered for only the lines with links in them, but these lines still have the box-drawing borders, which we want to remove to leave only the links. To do that, we'll use `awk`.
`awk` is another scanning and pattern matching utility, but it provides more granularity than `grep`. Most importantly, `awk` makes it easy for us to subdivide individual lines. You provide it with a field separator and a pattern-action statement. The field separator is a string or regular expression that's used to subdivide each line into "fields": the line is split at each substring matching the separator, the separator itself is removed, and the resulting pieces become numbered fields that the action can reference as `$1`, `$2`, and so on.
The default field separator is whitespace, so if you didn't specify a field separator, `awk` would divide the string "cat bat" into the fields "cat" and "bat". (Notice that the space used to separate the strings is not present in either field.) You could instead use the `-F` argument to specify a field separator, for example `-F a` to specify "a" as the separator. Then, "cat bat" would be divided into "c", "t b", and "t".
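You can confirm that splitting behavior directly on the command line; here the action just prints the second field:

```shell
# Splitting "cat bat" on the letter "a" yields the fields "c", "t b", "t"
echo "cat bat" | awk -F a '{ print $2 }'   # prints: t b
```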
The pattern-action statement uses the pattern to select matching lines, and then performs the action on those matching lines. Pattern-action statements in `awk` are in the format `'pattern { action }'`. If you want to use a regex for the pattern, it must be enclosed in forward slashes, like `'/regex/ { action }'`. For example, the pattern-action statement `'/[a-z]at/ { print $1 }'` will take lines matching the regex `[a-z]at` (containing any substring of a lowercase letter followed by "at"), split each matching line at whitespace, and print the first field. A line `cot bat` would match the regex (via "bat"), so `awk` would print `cot`, because it's the first space-separated value in the line.
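That example can be run as-is; the sample lines below are made up, with only the first and third containing a match for the pattern:

```shell
# Only lines containing a lowercase letter followed by "at" trigger the
# action, which prints the first whitespace-separated field of the line
printf 'cot bat\ndog run\nmat pad\n' | awk '/[a-z]at/ { print $1 }'
# cot
# mat
```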
Our usage of `awk` is simple and will not require a regular expression. We have every line with a blocked link, and just need to clean off the border characters and whitespace on either side. Since the URL is the second whitespace-separated field on each line (the first is the border character), we can put the output of our previous `grep` command through `awk '{ print $2 }'`, like so: `grep -E "https?:" file.txt | awk '{ print $2 }'`
Finally, we have the intended output: a list of blocked links from the `ooniprobe` test.
https://www.orangewebsite.com/
http://www.upci.org/
http://kremlin.ru/
http://www.delicates.co.uk/
http://www.usacasino.com/
http://www.securityfocus.com/
Write this to a file by appending `> output_file.txt` to the command.
Put it all together
If you want to avoid writing the full `ooniprobe` blocked-site results to a text file as an intermediate step, all of this can be done with a one-line command by piping the output of each command into the next.
ooniprobe list X | grep false -A 2 -B 5 | grep -E "https?:" | awk '{ print $2 }' > output.txt
Thanks for reading. I hope this helps you get more comfortable with command line text processing.