tonybaldwin | blog

non compos mentis

Posts Tagged ‘web

Web Word Count – count the words on a website with bash, lynx, curl, wget, sed, and wc

with one comment

Web Word Count: Get the word count for a list of webpages on a website.

A colleague asked what the easiest way was to get the word count for a list of pages on a website (for estimation purposes for a translation project).

This is what I came up with:


# get word counts and generate estimated price for localization of a website
# by tony baldwin /
# with help from the linuxfortranslators group on yahoo!
# released according to the terms of the Gnu Publi License, v. 3 or later

# collecting necessary data:
read -p "Please enter the per word rate (only numbers, like 0.12): " rate
read -p "Enter currency (letters only, EU, USD, etc.): " cur
read -p "Enter domain (do not include http://www, just, for example, " url

# if we've run this script in this dir, old files will mess us up
for i in pagelist.txt wordcount.txt plist-wcount.txt; do
	if [[ -f $i ]]; then
		echo removing old $i
		rm $i

echo "getting pages ...  this could take a bit ... "

wget -m -q -E -R jpg,tar,gz,png,gif,mpg,mp3,iso,wav,ogg,ogv,css,zip,djvu,js,rar,mov,3gp,tiff,mng $url
find . -type f | grep html > pagelist.txt

echo "okay, counting words...yeah...we're counting words..."

for file in $(cat pagelist.txt); do
	lynx -dump -nolist  $file | wc -w >> wordcount.txt
paste pagelist.txt wordcount.txt > plist-wcount.txt

echo "adding up totals...almost there..."
for t in $(cat wordcount.txt); do
	total=$((total + t))

echo "calculating price ... "
price=`echo "$total * $rate" | bc`

echo -e "\n-------------------------------\nTOTAL WORD COUNT = $total" >> plist-wcount.txt
echo -e "at $rate, the estimated price is $cur $price
------------------------------" >> plist-wcount.txt

echo "Okay, that should just about do it!"
echo  -------------------------------
sed 's/\.\///g' plist-wcount.txt > $url.estimate.txt
rm plist-wcount.txt
cat $url.estimate.txt
echo This information is saved in $url.estimate.txt

So, then I ran the script on my site,, with a rate of US$012/word, and this is the final output:

—————————————- 38 38 52 38 52 322 774 494 382 289 618 93 1205 133 56 38 38 38 85 65 38 671 2027 2027 96 82

at 0.12, the estimated price is USD 1174.68

Now, this is simple, of course, for a simple website, like, which is largely all static html pages. Sites with dynamic content are going to be an entirely different story, of course.

The comments explain what’s going on here, but I explain in greater detail here on the baldwinsoftware wiki.

Now, if you just want the wordcount for one page, try this:


# add up wordcounts for one webpage

if [[ ! $* ]]; then
    read -p "Please enter a webpage url: " ur
 read -p "How much to you charge per word? " rate
 count=`lynx -dump -nolist $url | wc -w`
 price=`echo "$count * $rate" | bc`
 echo -e "$url has $count words. At $rate, the price would be US\$$price."

Special thanks to out to the Linux 4 Translator list for some assistance with this script.




Written by tonybaldwin

September 20, 2011 at 10:31 pm