tonybaldwin | blog

non compos mentis

Web Word Count

leave a comment »

Web Word Count: Get the word count for a list of webpages on a website.

A colleague asked what the easiest way was to get the word count for a list of pages on a website (for estimation purposes for a translation project).

This is what I came up with:

#!/bin/bash
# add up wordcounts for website

total=0 # initialize variable for total

# scan through a list of pages
# strip the html elements and count the words
# append the count to wordcount.txt

for i in $(cat pagelist.txt); do
     curl -s  $i |  sed -ne '{s/]*>//g;s/^[ \t]*//;p}' | wc -w >> wordcount.txt
done

# this is for purely aesthetic purposes, 
# but we're merging the list of pages with the wordcount file:
paste pagelist.txt wordcount.txt > pagewordcount.list

# for each number in the wordcount.txt file, add it to the previous number (get a total)
for t in $(cat wordcount.txt); do 
	total=$((total + $t))
done

# append the total to the end of the merged pagelist+wordcount file:
echo "Total word count = $total" >> pagewordcount.list

# read back the file:
cat pagewordcount.list

# ciao
exit

I ssh-ed to my server and did
ls -1 *.html > pagelist.txt
which lallowed me to feed the script this list.

baldwinlinguas.com/index.html
baldwinlinguas.com/esp.html
baldwinlinguas.com/fran.html
baldwinlinguas.com/port.html
baldwinlinguas.com/empregos.html
baldwinlinguas.com/transquote.html

So, then I ran the script on this list of the pages, and this is the output:

baldwinlinguas.com/index.html 535
baldwinlinguas.com/esp.html 342
baldwinlinguas.com/fran.html 295
baldwinlinguas.com/port.html 337
baldwinlinguas.com/empregos.html 662
baldwinlinguas.com/transquote.html 244
Total word count = 2415

So, it works. Someone with better bash fu could likely find a shorter path to this result.

Now, this is simple, of course, for a simple website, like baldwinlinguas.com.
On the other hand, if you have some huge wordpress installation, like this blog, and have tonso public php pages, rather than html, and eve more php files in the backend, you have to do a bit of sorting, I imagine.

Were I to attempt that with the baldwinsoftware wiki, I would probably just go to the Sitemap and grab that list of pages, using their URLs, of course.

./tony

Advertisements

Written by tonybaldwin

September 21, 2011 at 5:25 am

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: