
Daily Data Cleaning Notes: Awk FTW 2

June 13th 2019

Today’s bloggable tiny task was to write “news-[country]” into some SQL table entries that just had “news”. There was a lot more to it, actually, but let’s take just this small slice.

The starting data is a file like this; many will recognize it as default psql output from a database SELECT query:

 31654|economicconfidential.com|["news"]
 31668|nationalfisherman.com|["news"]
 21432|turfshowtimes.com|["news"]
 31683|constructionnews.co.uk|["news"]
 31689|buckhead.com|["news"]
 21618|newstage.com.ng|["news"]

All we have to do is change the “news” tag to “news-ng”, “news-uk”, and so forth, for those domains that end in a country code. Simple, right? The file is called out.wstags, by the way.

One should always have a look at the data first, so something like

awk -F '|' '{sub(/.*\./,"",$2); c[$2]++} END{for (i in c) print c[i],i}' out.wstags | sort -nr | less

will probe the suffixes. Sorry about the paucity of spacing but one-liners tend to be compact. There are a few more spaces that could be taken out, but it’s good to have habits of leaving them in for readability, IMHO.

17125 com
1916 org
929 uk
635 net
546 [rows with no domain name in the second field]
502 au
373 edu
301 ca
149 gov
138 in
120 ie
113 za
92 us
87 nz
79 co
70 it
68 tv
68 no
66 fr
61 eu
49 news
47 de
45 ch
43 nl
39 es
35 fi
31 pt
31 hu
30 info
29 ph

etc.

Just as one might expect, except that there are a lot of suffixes, even two-letter ones, that are not countries. .tv is actually for Tuvalu, .eu is debatable as a “country”, and .us is OK in this example (in the real task it was not OK per the spec, which is why we restrict to approved countries below).
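For anyone following along without the real table dump, the same one-liner can be exercised on just the six sample lines via a here-document (a hypothetical stand-in for out.wstags):

```shell
# Count domain suffixes in the six sample rows shown earlier.
awk -F '|' '{sub(/.*\./,"",$2); c[$2]++} END{for (i in c) print c[i],i}' <<'EOF' | sort -nr
31654|economicconfidential.com|["news"]
31668|nationalfisherman.com|["news"]
21432|turfshowtimes.com|["news"]
31683|constructionnews.co.uk|["news"]
31689|buckhead.com|["news"]
21618|newstage.com.ng|["news"]
EOF
# "4 com" sorts first; the singletons uk and ng follow in sort's tie order
```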

So the script is pretty easy — some would be tempted to do it in one line. But I would advise not to, just because experienced teachers of these tools preach simplicity. And sometimes simplicity argues in favor of several small steps that are bug-free, transparent, auditable, repeatable, and readable. Cleverness is so over-rated in real production environments.

BEGIN {
  FS = "|"                        # fields are pipe-separated, as in the psql dump
  while ((getline < "out.wstags") > 0) {
    # 21432|turfshowtimes.com|["news"]
    # 31683|constructionnews.co.uk|["news"]
    thisid = $1
    dsuffix = $2; sub(/.*\./, "", dsuffix)
    if (length(dsuffix) != 2) continue
    newtags = $3; gsub(/news/, "news-" dsuffix, newtags)
    if (newtags != "") print "UPDATE workspaces SET tags = " newtags " WHERE id=" thisid ";"
  }
}

and some of that output looks like this:

 UPDATE workspaces SET tags = ["news-au"] WHERE id=18613;
 UPDATE workspaces SET tags = ["news-es-glob", "news-es"] WHERE id=33008;
 UPDATE workspaces SET tags = ["news-nl-glob", "news-nl"] WHERE id=33441;

Now, I know some people who use awk a lot are screaming about a few things, but trust me, I know the shortcuts and choose not to use them. Some will ask why I don’t just use -F '|' with the usual pattern { action } construct. Awk curmudgeons hate the while (getline < filename) { } construct, because part of the language’s original appeal is that everything can be written as a stream-processing pattern-action list. Whatever that is, right? But I always argue that the more verbose form is easier to stretch into more complicated logic. It’s much easier for newbies to read. It’s much easier for converting perl and bash and even python scripting people (and I would much rather read awk than the first two). Some would argue that the filename is now hard-coded; I would counter that the filename can now be remembered. In fact, I try to show a few lines of the file at the top of the while block so readers can understand the code: $2 is much easier to understand when one can see what the second field is likely to look like.

Yes, it’s true that doing things this way requires naming the script and invoking it at the command line: awk -f newstagxform.awk, for example.
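To make the invocation concrete, here is a sketch that fabricates a two-row out.wstags (rows borrowed from the sample above) and runs the script body inline; only the .uk row survives the two-letter-suffix filter:

```shell
# Fabricate a tiny out.wstags so the script has something to chew on.
printf '%s\n' '21432|turfshowtimes.com|["news"]' \
              '31683|constructionnews.co.uk|["news"]' > out.wstags

awk 'BEGIN {
  FS = "|"
  while ((getline < "out.wstags") > 0) {
    thisid = $1
    dsuffix = $2; sub(/.*\./, "", dsuffix)
    if (length(dsuffix) != 2) continue          # .com is skipped here
    newtags = $3; gsub(/news/, "news-" dsuffix, newtags)
    if (newtags != "") print "UPDATE workspaces SET tags = " newtags " WHERE id=" thisid ";"
  }
}'
# -> UPDATE workspaces SET tags = ["news-uk"] WHERE id=31683;
```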

I tested if (newtags != "") because if (newtags) is dangerous if newtags could hold the value 0, or even 0foo in some early implementations. Normally one knows a field like this is a string (in fact, an array of strings), not a number, but if there is any doubt, the explicit comparison is definitely the safer form. On the other hand, one is far more likely to know the data type well enough to take that shortcut than to know the spec well enough to give up the flexibility of the while (getline < filename) { } form.
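The danger is easy to demonstrate at the shell: a field read from input is a “string number” in awk, so a bare truth test treats “0” numerically, while the explicit comparison does not (a minimal sketch, not tied to the workspace data):

```shell
# $1 holds the strnum 0, so the bare truth test fails...
printf '0\n' | awk '{ if ($1) print "bare test: taken"; else print "bare test: skipped" }'
# -> bare test: skipped

# ...while the explicit comparison sees a non-empty field.
printf '0\n' | awk '{ if ($1 != "") print "explicit test: taken"; else print "explicit test: skipped" }'
# -> explicit test: taken
```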

What matters, of course, is speed of task completion and correctness, and most people are smart enough to make good decisions once they see the options.

The strongest reason to use BEGIN { } to contain the entire script is that multiple files and pipes will usually be opened, which is harder to do in a serial list of pattern-actions operating on a single data stream. In this case, we might bring in a list of approved foreign country codes and test against it.

BEGIN {
  FS = "|"
  while ((getline < "countrycodes.list") > 0) cc[$1]++
  while ((getline < "out.wstags") > 0) {
    # 21432|turfshowtimes.com|["news"]
    # 31683|constructionnews.co.uk|["news"]
    thisid = $1
    dsuffix = $2; sub(/.*\./, "", dsuffix)
    if (!(dsuffix in cc)) continue
    newtags = $3; gsub(/news/, "news-" dsuffix, newtags)
    if (newtags != "") print "UPDATE workspaces SET tags = " newtags " WHERE id=" thisid ";"
  }
}

Here, the length test has been changed to a test of dsuffix in cc. One could test if (cc[dsuffix]), but realize that merely referencing cc[dsuffix] adds an index to cc at dsuffix with the value “”. This can be a source of subtle errors in large scripts, so again, caveat programmor. Just choose one form or the other and be consistent; awk’s auto-initialization and type coercion can bite anyone this way.
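A quick standalone illustration of that auto-initialization trap, with made-up keys:

```shell
awk 'BEGIN {
  cc["uk"] = 1
  if (cc["xx"]) dummy = 1            # merely reading cc["xx"] creates the key...
  msg = ("xx" in cc) ? "created" : "absent"; print "xx: " msg
  if ("yy" in cc) dummy = 1          # ...while the in operator leaves cc alone
  msg = ("yy" in cc) ? "created" : "absent"; print "yy: " msg
}'
# -> xx: created
#    yy: absent
```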

We’d still hear cries that the dictionary could be read in a BEGIN { } section and the second while () { } construct avoided, but that’s not really the goal of programming, is it? Brevity for its own sake? To be the one who can be most clever? No.

More likely, the spec will evolve so that multiple-file i/o and multiple stream transformations interleave. In my experience, saving a few keystrokes here and there usually means error and brittleness down the road, even in throwaway scripts, because, as some people warn, “half of your throwaway scripts don’t actually get thrown away”. I’d say maybe a quarter, or even half of that, but that’s still a lot. At the very least, I’d rather do process forensics on scripts that contain filenames than be at the mercy of volatile command-line histories, unless of course one plans to put those histories into a script of their own.
