tonybaldwin | blog

non compos mentis

Posts Tagged ‘script’

Web Word Count


Web Word Count: Get the word count for a list of webpages on a website.

A colleague asked what the easiest way was to get the word count for a list of pages on a website (for estimation purposes for a translation project).

This is what I came up with:

# add up wordcounts for website

total=0 # initialize variable for total

# scan through a list of pages
# strip the html elements and count the words
# append the count to wordcount.txt

for i in $(cat pagelist.txt); do
     curl -s $i | sed -e 's/<[^>]*>//g;s/^[ \t]*//' | wc -w >> wordcount.txt
done

# this is for purely aesthetic purposes, 
# but we're merging the list of pages with the wordcount file:
paste pagelist.txt wordcount.txt > pagewordcount.list

# for each number in the wordcount.txt file, add it to the previous number (get a total)
for t in $(cat wordcount.txt); do
	total=$((total + t))
done

# append the total to the end of the merged pagelist+wordcount file:
echo "Total word count = $total" >> pagewordcount.list

# read back the file:
cat pagewordcount.list

# ciao

I ssh-ed to my server and did
ls -1 *.html > pagelist.txt
which allowed me to feed the script this list.
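If you want to try that listing step without a server, the same command works in a local scratch directory (the file names below are made up):

```shell
# fake two pages in a temp dir and build the page list, as on the server
dir=$(mktemp -d)
cd "$dir"
touch about.html index.html
ls -1 *.html > pagelist.txt
cat pagelist.txt
```

That prints the two names, one per line, exactly the shape the script expects.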

So, then I ran the script on this list of pages, and this is the output: 535 342 295 337 662 244
Total word count = 2415

So, it works. Someone with better bash fu could likely find a shorter path to this result.
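The heavy lifting is the sed pipeline that strips the tags before wc counts words; here's a self-contained sanity check on a made-up snippet:

```shell
# strip html tags from a sample page and count what's left
sample='<html><body><p>three little words</p></body></html>'
count=$(printf '%s\n' "$sample" | sed -e 's/<[^>]*>//g;s/^[ \t]*//' | wc -w)
echo "word count: $((count + 0))"
# prints: word count: 3
```

The arithmetic expansion at the end just trims the padding some wc builds add to their output.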

Now, this is simple, of course, for a simple, mostly static website.
On the other hand, if you have some huge wordpress installation, like this blog, and have tons of public php pages, rather than html, and even more php files in the backend, you have to do a bit of sorting, I imagine.

Were I to attempt that with the baldwinsoftware wiki, I would probably just go to the Sitemap and grab that list of pages, using their URLs, of course.


Written by tonybaldwin

September 21, 2011 at 5:25 am

Web Word Count – count the words on a website with bash, lynx, curl, wget, sed, and wc

with one comment

Web Word Count: Get the word count for a list of webpages on a website.

A colleague asked what the easiest way was to get the word count for a list of pages on a website (for estimation purposes for a translation project).

This is what I came up with:


# get word counts and generate estimated price for localization of a website
# by tony baldwin
# with help from the linuxfortranslators group on yahoo!
# released according to the terms of the GNU Public License, v. 3 or later

# collecting necessary data:
read -p "Please enter the per word rate (only numbers, like 0.12): " rate
read -p "Enter currency (letters only, EU, USD, etc.): " cur
read -p "Enter domain (do not include http://www, just, for example, " url

# if we've run this script in this dir, old files will mess us up
for i in pagelist.txt wordcount.txt plist-wcount.txt; do
	if [[ -f $i ]]; then
		echo removing old $i
		rm $i
	fi
done

echo "getting pages ...  this could take a bit ... "

wget -m -q -E -R jpg,tar,gz,png,gif,mpg,mp3,iso,wav,ogg,ogv,css,zip,djvu,js,rar,mov,3gp,tiff,mng $url
find . -type f | grep html > pagelist.txt

echo "okay, counting words...yeah...we're counting words..."

for file in $(cat pagelist.txt); do
	lynx -dump -nolist $file | wc -w >> wordcount.txt
done
paste pagelist.txt wordcount.txt > plist-wcount.txt

echo "adding up totals...almost there..."
total=0
for t in $(cat wordcount.txt); do
	total=$((total + t))
done

echo "calculating price ... "
price=`echo "$total * $rate" | bc`

echo -e "\n-------------------------------\nTOTAL WORD COUNT = $total" >> plist-wcount.txt
echo -e "at $rate, the estimated price is $cur $price
------------------------------" >> plist-wcount.txt

echo "Okay, that should just about do it!"
echo  -------------------------------
sed 's/\.\///g' plist-wcount.txt > $url.estimate.txt
rm plist-wcount.txt
cat $url.estimate.txt
echo This information is saved in $url.estimate.txt

So, then I ran the script on my site, with a rate of US$0.12/word, and this is the final output:

-------------------------------
38 38 52 38 52 322 774 494 382 289 618 93 1205 133 56 38 38 38 85 65 38 671 2027 2027 96 82

at 0.12, the estimated price is USD 1174.68

Now, this is simple, of course, for a simple website that is largely all static html pages. Sites with dynamic content are going to be an entirely different story, of course.

The comments explain what’s going on here, but I explain in greater detail here on the baldwinsoftware wiki.

Now, if you just want the wordcount for one page, try this:


# add up wordcounts for one webpage

if [[ ! $* ]]; then
    read -p "Please enter a webpage url: " url
else
    url=$1
fi
read -p "How much do you charge per word? " rate
count=`lynx -dump -nolist $url | wc -w`
price=`echo "$count * $rate" | bc`
echo -e "$url has $count words. At $rate, the price would be US\$$price."

Special thanks go out to the Linux 4 Translators list for some assistance with this script.



Written by tonybaldwin

September 20, 2011 at 10:31 pm

Posting to Posterous with curl


Okay, so I DID successfully post with curl (thanks to some clarifications from John Tucker). Now, I have written a script that will allow me to write a post in vim and fire it off. It looks like this:

# post to posterous from bash cli with curl 

# I could do this with read -p "Enter ur username d0od: " username 
# kind of thing, but I just hardwired my info in. 
# edit accordingly for your info. 

# creates a date stamp for naming the post file 
filedate=$(date +%m%d%y%H%M%S) 

# set post title
read -p "Enter a post title: " ptitle 

# write post in vim 
vim $filedate.ppost
pbody="$(cat $filedate.ppost)" 

# send post to posterous with curl 

# $apiurl is the posterous posting endpoint; hardwire it with your info above
if [[ $(curl -X POST -u $username:$password -d "api_token=$apitok" \
-d "post[title]=$ptitle" -d "post[body]=$pbody" \
$apiurl | grep error) ]]; then 
	echo "Too bad, do0d...FAIL!" 
else
	echo "Success! Posted to Posterous!" 
fi

mv $(pwd)/$filedate.ppost ~/Documents/fposts/ 
# moved post to dir for safekeeping. 
# you can use different dir, d00d 


This means that I have conquered the posterous API, and will now be able to add posterous support to Xpostulate.
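The timestamped file names the script relies on are easy to check on their own; %m%d%y%H%M%S always yields twelve digits (the .ppost extension is just this script's convention):

```shell
# build a post file name from the current time, e.g. 092111174530.ppost
filedate=$(date +%m%d%y%H%M%S)
echo "$filedate.ppost"
```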

Written by tonybaldwin

September 19, 2011 at 6:48 pm

Fren.Tcl and Frendi.Sh

with 3 comments



So, those who know me know that I’ve been playing on Friendika, a decentralized, federated, free/open source, privacy protecting, and, well, pretty amazing Social Networking application.

Friendika is pretty awesome in various ways. First, you have complete control over who can or cannot see your content; you own your content, and your privacy is completely yours to control.  Also, you can follow contacts from many other networks, including twitter, Diaspora, and Facebook, plus rss feeds, even, so it becomes sort of a social networking aggregator.  Not only that, but it has friend groups similar to Diaspora Aspects or Google+ Circles.  These groups are very handy.  I follow my Diaspora and Facebook contacts, plus a large number of twitter accounts on my friendika, and have them grouped into local friends, family, haxors (fellow foss hackers, tech blogs, etc.), friends (not local, people I met online), tradus (translation colleagues, work related, polyglots), and one more group for news, which includes mostly twitter feeds from a number of news outlets (Al Jazeera, BBC, NPR, Alternet, etc.).  So, it has really helped me to organize my social networking.

So, these past couple of days I, being the geek that I am, have been playing with means of posting to Friendika remotely, first from the bash cli.  Now, I had posted earlier a quick-n-dirty update type script, but I have one now that will toggle cross-posting to various other services (statusnet, twitter, facebook), and will open an editor (vim) to allow you to write longer posts.  I posted it on the wiki here, but will also include the code in this post:


# update friendika from bash with curl
# I put this in my path as "frendi"

# here you enter your username and password
# and other relevant variables, such as whether or not
# you'd like to cross post to statusnet, twitter, or farcebork

read -p "Please enter your username: " uname
read -p "Please enter your password: " pwrd
read -p "Cross post to statusnet? (1=yes, 0=no): " snet
read -p "Cross post to twitter? (1=yes, 0=no): " twit
read -p "Cross post to Farcebork? (1=yes, 0=no): " fb
read -p "Enter the domain of your Friendika site (i.e. " url

# if you did not enter text for update, the script asks for it

if [[ ! $* ]]; then
	read -p "Enter your update text: " ud
else
	ud="$*"
fi

# and this is the curl command that sends the update to the server

if [[ $(curl -u $uname:$pwrd  -d "status=$ud&statusnet_enable=$snet&twitter_enable=$twit&facebook_enable=$fb"  $url/api/statuses/update.xml | grep error) ]]; then

# what does the server say?

	echo "Error"
else
	echo "Success!"
	echo $ud
fi

# this next is optional, but I made a dir in ~/Documents to keep the posts.
# You can comment it out, you can change where it is storing them (the dir path)
# or, even, if you don't want to save the posts (they will pile up), you could
# change this to simply
# rm $filedate.fpost or rm -rf *.fpost, or some such thing.

mv $filedate.fpost ~/Documents/fposts
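For clarity, this is the form body that curl -d assembles; each 1/0 toggle turns a cross-post on or off (the values here are invented):

```shell
# show the request body sent to the friendika status api
ud="hello from bash"
snet=1; twit=0; fb=0
echo "status=$ud&statusnet_enable=$snet&twitter_enable=$twit&facebook_enable=$fb"
# prints: status=hello from bash&statusnet_enable=1&twitter_enable=0&facebook_enable=0
```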

But I have also now written a graphical application in tcl/tk to write posts to Friendika, Fren.Tcl


Fren.Tcl - tcl/tk Friendika posting application

Find me on Friendika here.


Written by tonybaldwin

September 14, 2011 at 8:27 am

Image UP


Image Up

a quick-n-dirty script to copy an image (or other file) to your server. (wiki page for this script)

I basically use this to upload screenshots for display here on this wiki and my blog, etc., so I have the images directory “hardwired” in the script, but this could easily be customized to choose a different directory and use with any manner of files.


# script to upload images to my server
# by tony baldwin

if [ ! $1 ]; then
        # If you didn't tell it which file, it asks here
	read -p "Which image, pal? " img
else
        img=$1
fi

# using scp to copy the file to the server
scp $img username@server_url_or_IP_address:/path/to/directory/html/images/
# you will be asked for your password, of course.  This is a function of scp, so not written into the script.

echo "Your image is now at$img."
read -p "Would you like to open it in your a browser now? (y/n): " op

if [ $op = "y" ]; then
	# you can replace xdg-open with with your favorite browser, but this should choose your default browser, anyway.
        # if you chose yes, the browser will open the image.
        # Otherwise, it won't, but you have the url, so you can copy/paste to a browser or html document, blog entry, tweet, etc., at will.


This image was uploaded with the above script:

(editing website with Tcltext)

This script, of course, assumes you are in the same directory as your image file, too.
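If you'd rather not cd first, basename can split a path into the part scp needs and the bare file name for the url (the path below is hypothetical):

```shell
# scp takes the full relative or absolute path;
# the url only wants the bare file name
img="shots/screen1.png"
base=$(basename "$img")
echo "$base"
# prints: screen1.png
```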



EDIT: What would be cool is if I could make your filemanager allow this in a right-click action. Like, I use PCManFM. If I could just right-click an image and choose this, then pop-up the url with zenity, or, perhaps, even just automatically run the xdg-open…Hmmmm…One can probably work this out with some filemanagers more easily than others.

With some work, I could rewrite the script so that it chooses a clicked image and auto-opens it with the browser, and then just choose the script with “right-click > open with …”, perhaps…

Of course, I can just F4 (open dir in terminal), then bang off the script.

Written by tonybaldwin

September 4, 2011 at 1:31 pm

search wiktionary from the bash cli

with 2 comments

Last week I posted several scripts for searching google, google translate, google dictionary, reverso, and wikipedia from the bash command line.

Today I wrote another script, this time for searching Wiktionary, the multilingual, user-edited online dictionary:


# get definitions from wiktionary

if [ ! $1 ]; then
	read -p "Enter 2 letter language code: " lang
	read -p "Enter search term: " sterm
	lynx -dump https://$lang.wiktionary.org/wiki/$sterm | less
else
	lynx -dump https://$1.wiktionary.org/wiki/$2 | less
fi

I tucked this into my PATH as simply “wikt”, and usage is thus:
you@machine:~$ wikt en cows
or, if you neglect to use the language code and search term, of course, the script asks for them:
you@machine:~$ wikt
Enter 2 letter language code: en
Enter search term: cows
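The page address the script hands to lynx is just the language code dotted onto wiktionary.org:

```shell
# assemble a wiktionary page url from a language code and a search term
lang=en
sterm=cows
echo "https://$lang.wiktionary.org/wiki/$sterm"
# prints: https://en.wiktionary.org/wiki/cows
```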



Written by tonybaldwin

May 16, 2011 at 7:11 am

search google, wikipedia, reverso from the bash terminal



searching in bash

searching in bash


Okay, so, I like to use my bash terminal. Call me a geek all you like; it matters not to me. I wear that badge with pride.

The bash terminal is quick and efficient for doing a lot of stuff that one might otherwise use some bloated, cpu sucking, eye-candied, gui monstrosity to do. So, when I find ways to use it for more stuff, more stuff I do with it.

Now, for my work (recall, I am professionally a translator) I must often do research, some of which entails heavy lifting, and, otherwise, often simply searching for word definitions and translations. I use TclDict, which I wrote, frequently, but, I also use a lot of online resources that I never programmed TclDict to access, and would generally use a browser for that stuff. Unless, of course, I can do it in my terminal!

For precisely such purposes, here are a couple of handy scripts I use while working.

First, let’s look up terms at dict.org:


if [[ ! $* ]]; then
	read -p "Enter your search term: " searchterm
else
	searchterm=$1
fi

read -p "choose database (enter 'list' to list all): " db

if [ "$db" = list ]; then
	curl dict://dict.org/show:db | less
	read -p "choose database, again: " db
fi

curl dict://dict.org/d:$searchterm:$db | less
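curl speaks the DICT protocol natively; the two url shapes it accepts here are show:db to list the databases and d:word:db to define a term (wn, WordNet, is just an example database):

```shell
# the two dict:// url shapes: list the databases, then define a term
word=cow
db=wn
echo "dict://dict.org/show:db"
echo "dict://dict.org/d:$word:$db"
```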



Now, let’s search google from the command line:

if [[ ! $* ]]; then
	read -p "Enter your search term: " searchterm
else
	searchterm="$*"
fi
lynx -accept_all_cookies "http://www.google.com/search?q=$searchterm"
# I accept all cookies to go direct to search results without having to approve each cookie.
# you can disable that, of course.


I saved that in ~/bin/goose # for GOOgle SEarch
and just do
goose $searchterm

Or, search the google dictionary to translate a term:

echo -e "Search google dictionary.\n"
read -p "Source language (two letters): " slang
read -p "Target language (two letters): " tlang
read -p "Search term: " sterm
lynx -dump "$slang|$tlang&q=$sterm" | less

Note: For a monolingual search, just use the same language for source and target. Don’t leave either blank.
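The langpair parameter is just the two codes joined by a pipe, which is why a monolingual search simply repeats the code:

```shell
# build the langpair query fragment for a monolingual english lookup
slang=en
tlang=en
sterm=cows
echo "langpair=$slang|$tlang&q=$sterm"
# prints: langpair=en|en&q=cows
```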


if [ ! $3 ]; then
echo -e "usage requires 3 parameters: source language, target language, search term. \n
Thus, I have this as ~/bin/googdict, and do \n
googdict en es cows \n
to translate \"cows\" to Spanish. \n
For monolingual search, enter the language twice. \n
As indicated, use the two letter code: \n
\"en\" for English, \"fr\" for French, etc."
exit
fi

lynx -dump "http://www.google.com/dictionary?langpair=$1|$2&q=$3" | less

For the above, I have it in ~/bin/gd, usage being simply “gd $sourcelanguage $targetlanguage $searchterm”.
me@machine:~$ gd en es cow
Searches the English to Spanish dictionary for “cow”.

We can use similar principles to search reverso:

#search reverso
read -p "Enter the source language: " slang
read -p "Enter target language: " tlang
read -p "Enter your search term: " searchterm
lynx -dump http://dictionary.reverso.net/$slang-$tlang/$searchterm | less

With the google dictionary, you use the two-letter language code (i.e., “en” for English, “fr” for French, etc.). With reverso, you have to spell out the language (“english” for English, etc.).

With all of the above, I’ve used the program, less, to display the results, rather than spitting it all out to the terminal at once. Click here to learn how to use less, if needed.

Additionally, most of the above require Lynx Browser, which is generally available for any gnu/linux distribution via your favorite package manager (apt, synaptic, aptitude, yum, portage, pacman, etc.). For the script, I used cURL (also part of most gnu/linux distributions and installable with your favorite package manager).

Google Translate can also be accessed, but for this, we’ll use a bit of python magic (I know, I pick on google translate, a lot, but it can be useful):

#!/usr/bin/env python
from urllib2 import urlopen
from urllib import urlencode
import sys

# The google translate API can be found here:

# usage: <sourcelang> <targetlang> <text ...>
lang1=sys.argv[1]
lang2=sys.argv[2]
langpair='%s|%s' % (lang1, lang2)
text=' '.join(sys.argv[3:])
base_url='http://ajax.googleapis.com/ajax/services/language/translate?'
params=urlencode( (('v', 1.0),
                   ('q', text),
                   ('langpair', langpair),) )
url=base_url+params
content=urlopen(url).read()
start_idx=content.find('"translatedText":"')+18
translation=content[start_idx:]
end_idx=translation.find('"}, "')
print translation[:end_idx]

Originally found that here, on the ubuntuforums.

And now for Wikipedia we have a couple of options.
First, we have this awesome little handy script, tucked into my $PATH as “define”:

dig +short txt $1.wp.dg.cx

I use it simply with “define $searchterm”, and it gives a short definition from wikipedia.  I originally found it here.
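dig hands TXT records back wrapped in quotes; if you want them clean, sed can strip the wrapping (the record text below is invented):

```shell
# strip the surrounding quotes from a sample txt record
txt='"A cow is a domesticated bovine."'
echo "$txt" | sed 's/^"//;s/"$//'
# prints: A cow is a domesticated bovine.
```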

Another extremely handy tool is Wikipedia2Text, which I simply installed from the debian repos via aptitude. When I use this, I also pipe it to less:

if [[ ! $* ]]; then
	read -p "Enter your search term: " searchterm
else
	searchterm="$*"
fi

wikipedia2text $searchterm | less

I have that tucked into ~/bin/wikit; thus, I simply do wikit $searchterm to get my results.


All code here that I have written is free and released according to the GPL v. 3. Check the links for code I borrowed for licensing information (pretty sure it’s all GPL-ed, too).


Written by tonybaldwin

May 3, 2011 at 12:52 am