tonybaldwin | blog

non compos mentis

Posts Tagged ‘lynx’

Web Word Count – count the words on a website with bash, lynx, curl, wget, sed, and wc

Web Word Count: Get the word count for a list of webpages on a website.

A colleague asked what the easiest way was to get the word count for a list of pages on a website (for estimation purposes for a translation project).

This is what I came up with:

#!/bin/bash

# get word counts and generate estimated price for localization of a website
# by tony baldwin / baldwinsoftware.com
# with help from the linuxfortranslators group on yahoo!
# released according to the terms of the GNU General Public License, v. 3 or later

# collecting necessary data:
read -p "Please enter the per word rate (only numbers, like 0.12): " rate
read -p "Enter currency (letters only, EU, USD, etc.): " cur
read -p "Enter domain (do not include http://www, just, for example, somedomain.com): " url

# if we've run this script in this dir, old files will mess us up
for i in pagelist.txt wordcount.txt plist-wcount.txt; do
	if [[ -f $i ]]; then
		echo removing old $i
		rm $i
	fi
done

echo "getting pages ...  this could take a bit ... "

wget -m -q -E -R jpg,tar,gz,png,gif,mpg,mp3,iso,wav,ogg,ogv,css,zip,djvu,js,rar,mov,3gp,tiff,mng "$url"
# list the downloaded pages; match the .html suffix, not any path that merely contains "html"
find . -type f -name '*.html' > pagelist.txt

echo "okay, counting words...yeah...we're counting words..."

# read one path per line so file names with spaces survive
while IFS= read -r file; do
	lynx -dump -nolist "$file" < /dev/null | wc -w >> wordcount.txt
done < pagelist.txt
paste pagelist.txt wordcount.txt > plist-wcount.txt

echo "adding up totals...almost there..."
total=0
for t in $(cat wordcount.txt); do
	total=$((total + t))
done

echo "calculating price ... "
price=$(echo "$total * $rate" | bc)

echo -e "\n-------------------------------\nTOTAL WORD COUNT = $total" >> plist-wcount.txt
echo -e "at $rate, the estimated price is $cur $price
------------------------------" >> plist-wcount.txt

echo "Okay, that should just about do it!"
echo  -------------------------------
# strip only the leading ./ that find puts in front of each path
sed 's|^\./||' plist-wcount.txt > "$url.estimate.txt"
rm plist-wcount.txt
cat "$url.estimate.txt"
echo "This information is saved in $url.estimate.txt"
exit

So, then I ran the script on my site, tonybaldwin.net, with a rate of US$0.12/word, and this is the final output:

-------------------------------
tonybaldwin.net/log/archives/environment/index.html 38
tonybaldwin.net/log/archives/cuisine/index.html 38
tonybaldwin.net/log/archives/music/index.html 52
tonybaldwin.net/log/archives/philosophy/index.html 38
tonybaldwin.net/log/archives/nanoblogger-help/index.html 52
tonybaldwin.net/log/archives/2011/09/11/911/index.html 322
tonybaldwin.net/log/archives/2011/09/index.html 774
tonybaldwin.net/log/archives/2011/09/01/mit_intro_to_cs_and_programming_assignment_1/index.html 494
tonybaldwin.net/log/archives/2011/08/26/come_on_irene/index.html 382
tonybaldwin.net/log/archives/2011/08/26/welcome_to_nanoblogger_3_4_2/index.html 289
tonybaldwin.net/log/archives/2011/08/26/here_we_roll_again/index.html 618
tonybaldwin.net/log/archives/2011/08/27/couldnt_stand_the_weather/index.html 93
tonybaldwin.net/log/archives/2011/08/index.html 1205
tonybaldwin.net/log/archives/2011/index.html 133
tonybaldwin.net/log/archives/technology/index.html 56
tonybaldwin.net/log/archives/politic/index.html 38
tonybaldwin.net/log/archives/religion/index.html 38
tonybaldwin.net/log/archives/art/index.html 38
tonybaldwin.net/log/archives/index.html 85
tonybaldwin.net/log/archives/personal/index.html 65
tonybaldwin.net/log/archives/health/index.html 38
tonybaldwin.net/log/articles/about/index.html 671
tonybaldwin.net/log/index.html 2027
tonybaldwin.net/log.1.html 2027
tonybaldwin.net/index.html 96
tonybaldwin.net/social.html 82

-------------------------------
TOTAL WORD COUNT = 9789
at 0.12, the estimated price is USD 1174.68
------------------------------
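
If you want to double-check the totals at the bottom of an estimate, something like the following should do it. This is only a rough sketch, not part of the script above: the function name recount is made up for illustration, it uses awk instead of bc, and it assumes every page line is exactly "path count" with no spaces in the path. The sample data is invented, too.

```shell
#!/bin/bash
# Re-check total and price from the per-page counts (illustrative sketch).
# Assumption: page lines look like "<path> <count>"; everything else is skipped.
recount() {
    # $1 = estimate file, $2 = per-word rate
    awk -v rate="$2" '
        NF == 2 && $2 ~ /^[0-9]+$/ { total += $2 }
        END { printf "%d words, price %.2f\n", total, total * rate }
    ' "$1"
}

# made-up sample in the same layout as the output above
sample=$(mktemp)
printf '%s\n' \
    'example.com/index.html 96' \
    'example.com/social.html 82' \
    '-------------------------------' \
    'TOTAL WORD COUNT = 178' > "$sample"
recount "$sample" 0.12
rm -f "$sample"
```

For the two sample pages this prints "178 words, price 21.36". Note that the price is always rounded to two decimals here, which bc does not do on its own.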

Now, this is simple, of course, for a simple website like tonybaldwin.net, which is almost entirely static HTML pages. Sites with dynamic content are going to be an entirely different story.

The comments explain what’s going on here, but I explain in greater detail here on the baldwinsoftware wiki.
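
One thing the script does not do is check what gets typed at the rate prompt. An answer like 0,12, or just hitting Enter, will most likely make bc complain instead of giving a price. Here is a minimal sketch of a check that could run right after the prompt. The function name valid_rate is hypothetical, and it only accepts plain decimals like 0.12 or 12.

```shell
#!/bin/bash
# valid_rate: succeed only for plain decimal numbers such as 0.12 or 12
# (hypothetical helper, not part of the original script)
valid_rate() {
    case "$1" in
        ''|.|*[!0-9.]*|*.*.*) return 1 ;;   # empty, lone dot, stray characters, two dots
        *) return 0 ;;
    esac
}

# try it on a few answers someone might type
for r in 0.12 12 0,12 abc 1.2.3; do
    if valid_rate "$r"; then
        echo "$r ok"
    else
        echo "$r rejected"
    fi
done
```

In the script itself, the read for the rate could sit in a loop that repeats until valid_rate succeeds.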

Now, if you just want the wordcount for one page, try this:

#!/bin/bash

# add up wordcounts for one webpage

# take the url from the command line, or ask for it
if [[ $# -eq 0 ]]; then
    read -p "Please enter a webpage url: " url
else
    url=$1
fi
read -p "How much do you charge per word? " rate
# dump the rendered page as plain text and count the words
count=$(lynx -dump -nolist "$url" | wc -w)
price=$(echo "$count * $rate" | bc)
echo "$url has $count words. At $rate, the price would be US\$$price."
exit

Special thanks go out to the Linux 4 Translators list for some assistance with this script.

Enjoy!

./tony

Written by tonybaldwin

September 20, 2011 at 10:31 pm

green is groOvy…

I got tired of all the black-n-blueness and opted for a cheerful, springy green theme.
I needed a little cheering up.
(that is one of my photos in the wallpaper)
I’m not sure how long that will last, though…
A dark screen is easier on my eyes when I put in 15 or 18 hour days translating and/or hacking…
Notice, of course, that I’m still deeply entrenched in a terminal,
as I am making this post with lynx.

Note how well Tickle Text fits in with this theme…Of course, it fits with any theme, since it’s completely themable.
You know…I wrote in a module with some preset themes, like about 10, and never released that.
I’ve been meaning to work more on it, anyway.
I’ve been meaning to add the means to save some user configurations (like a color theme, preferred browser, preferred terminal, etc.).
Since I already made such features available to TclScreen and TclUP, I know how to do it.
I just haven’t put the time in…
On that note, I do think I will build TclUP right in for uploading scripts and pages, etc…Why not.
Heck, if I could figure out where to learn the protocol for uploading LJ posts, I’d build that in, too.
Of course, I believe they have to be formatted in XML or something, which complicates the matter a bit…Or, at least makes it time consuming.
I could just keep adding stuff…By the time I reach 60 or so, I could have a text editor nearly as bloated as Emacs…
Well, that might be a stretch…

kthnxbye

Written by tonybaldwin

September 6, 2008 at 9:16 pm

deeper and deeper…

I’ve just totally sunk into a world of dark, blackness….
with pretty colored text.

I find myself using my terminal for anything I can get away with now…

chatting on irc with irssi
for instance…
or

listening to tunes while chatting on irc in irssi

the other tabs on my terminal are doing other stuff, too.

Heck, my last post…the long, long list o’ tunes, well, I did
ls -R /mnt/storage/tunes > tuneslist.txt
to redirect the list output from the terminal to a text file.
Then I did use a gui text editor, my Tickle Text
to edit the rest of the html to make a web page with it…uploaded it with my TclUP
Before you know it, I’ll be editing html pages in nano in terminal and ftp-ing them to my site in a wish terminal (since tcl has good ftp tools)…
I’ll be writing lj-posts via e-mail with mutt, and reading my friend’s page with Lynx
I’m hopeless…

Written by tonybaldwin

September 2, 2008 at 2:32 pm

Lynx, a text based browser

I am making this post using the Lynx text based browser, in my terminal, Sakura.

Here is a screenshot:

Pretty cool, I suppose.
Yes…I am a geek.
I’m not ashamed.
(Funny how I amuse myself with my first morning coffee, no?)
Now, on to more important things…

Written by tonybaldwin

August 19, 2008 at 6:28 am