tonybaldwin | blog

non compos mentis

Posts Tagged ‘translation’

Web Word Count – count the words on a website with bash, lynx, curl, wget, sed, and wc

with one comment

Web Word Count: Get the word count for a list of webpages on a website.

A colleague asked what the easiest way was to get the word count for a list of pages on a website (for estimation purposes for a translation project).

This is what I came up with:

#!/bin/bash

# get word counts and generate estimated price for localization of a website
# by tony baldwin / baldwinsoftware.com
# with help from the linuxfortranslators group on yahoo!
# released according to the terms of the GNU General Public License, v. 3 or later

# collecting necessary data:
read -p "Please enter the per word rate (only numbers, like 0.12): " rate
read -p "Enter currency (letters only, EU, USD, etc.): " cur
read -p "Enter domain (do not include http://www, just, for example, somedomain.com): " url

# if we've run this script in this dir, old files will mess us up
for i in pagelist.txt wordcount.txt plist-wcount.txt; do
	if [[ -f $i ]]; then
		echo removing old $i
		rm $i
	fi
done

echo "getting pages ...  this could take a bit ... "

wget -m -q -E -R jpg,tar,gz,png,gif,mpg,mp3,iso,wav,ogg,ogv,css,zip,djvu,js,rar,mov,3gp,tiff,mng $url
find . -type f -name '*.html' > pagelist.txt

echo "okay, counting words...yeah...we're counting words..."

# read the list line by line, so paths containing spaces don't get split
while IFS= read -r file; do
	lynx -dump -nolist "$file" < /dev/null | wc -w >> wordcount.txt
done < pagelist.txt
paste pagelist.txt wordcount.txt > plist-wcount.txt

echo "adding up totals...almost there..."
total=0
for t in $(cat wordcount.txt); do
	total=$((total + t))
done

echo "calculating price ... "
price=$(echo "$total * $rate" | bc)

echo -e "\n-------------------------------\nTOTAL WORD COUNT = $total" >> plist-wcount.txt
echo -e "at $rate, the estimated price is $cur $price
------------------------------" >> plist-wcount.txt

echo "Okay, that should just about do it!"
echo  -------------------------------
sed 's/\.\///g' plist-wcount.txt > $url.estimate.txt
rm plist-wcount.txt
cat $url.estimate.txt
echo This information is saved in $url.estimate.txt
exit

So, then I ran the script on my site, tonybaldwin.net, with a rate of US$0.12/word, and this is the final output:

-------------------------------
tonybaldwin.net/log/archives/environment/index.html 38
tonybaldwin.net/log/archives/cuisine/index.html 38
tonybaldwin.net/log/archives/music/index.html 52
tonybaldwin.net/log/archives/philosophy/index.html 38
tonybaldwin.net/log/archives/nanoblogger-help/index.html 52
tonybaldwin.net/log/archives/2011/09/11/911/index.html 322
tonybaldwin.net/log/archives/2011/09/index.html 774
tonybaldwin.net/log/archives/2011/09/01/mit_intro_to_cs_and_programming_assignment_1/index.html 494
tonybaldwin.net/log/archives/2011/08/26/come_on_irene/index.html 382
tonybaldwin.net/log/archives/2011/08/26/welcome_to_nanoblogger_3_4_2/index.html 289
tonybaldwin.net/log/archives/2011/08/26/here_we_roll_again/index.html 618
tonybaldwin.net/log/archives/2011/08/27/couldnt_stand_the_weather/index.html 93
tonybaldwin.net/log/archives/2011/08/index.html 1205
tonybaldwin.net/log/archives/2011/index.html 133
tonybaldwin.net/log/archives/technology/index.html 56
tonybaldwin.net/log/archives/politic/index.html 38
tonybaldwin.net/log/archives/religion/index.html 38
tonybaldwin.net/log/archives/art/index.html 38
tonybaldwin.net/log/archives/index.html 85
tonybaldwin.net/log/archives/personal/index.html 65
tonybaldwin.net/log/archives/health/index.html 38
tonybaldwin.net/log/articles/about/index.html 671
tonybaldwin.net/log/index.html 2027
tonybaldwin.net/log.1.html 2027
tonybaldwin.net/index.html 96
tonybaldwin.net/social.html 82

-------------------------------
TOTAL WORD COUNT = 9789
at 0.12, the estimated price is USD 1174.68
------------------------------
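As a sanity check on the arithmetic, the totaling and pricing steps can be reproduced on a few lines of the output above. This is only a sketch: it uses awk rather than the script's loop and bc, the function names sum_counts and price_for are made up for illustration, and the sample file holds just three of the pages listed above.

```shell
#!/bin/bash
# Sketch: total per-page word counts and price them with awk.
# sum_counts and price_for are illustrative names, not part of the original script.

# Sum the last whitespace-separated field of every non-empty line in a file.
sum_counts() {
    awk 'NF { total += $NF } END { print total + 0 }' "$1"
}

# Multiply a word count by a per-word rate, always printing two decimal places.
price_for() {
    awk -v words="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", words * rate }'
}

# Three lines taken from the sample output above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
tonybaldwin.net/log/index.html 2027
tonybaldwin.net/index.html 96
tonybaldwin.net/social.html 82
EOF

total=$(sum_counts "$sample")
echo "TOTAL WORD COUNT = $total"
echo "at 0.12, the estimated price is USD $(price_for "$total" 0.12)"
rm -f "$sample"
```

On those three lines the total comes to 2205 words, or USD 264.60 at 0.12, and price_for 9789 0.12 gives 1174.68, matching the estimate above. Unlike plain bc, the printf always prints exactly two decimal places, which matters for rates like 0.125.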

Now, this is simple, of course, for a simple website like tonybaldwin.net, which is almost entirely static html pages. Sites with dynamic content are going to be an entirely different story.
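One thing to watch in the list above: log/index.html and log.1.html both come in at 2027 words, which suggests (it isn't confirmed here) that the same page was saved twice and counted twice. If that happens, a rough way to drop exact duplicates before counting might look like the sketch below. It assumes duplicates are byte-for-byte identical, compares files with cksum, and uses a hypothetical helper named unique_pages; none of this is in the original script.

```shell
#!/bin/bash
# Sketch: drop byte-for-byte duplicate files from a page list before counting.
# unique_pages is a hypothetical helper, not part of the original script.

# Reads file names on stdin (one per line) and prints the first file seen
# for each distinct cksum value, preserving the input order.
# Assumes file names contain no tabs or newlines.
unique_pages() {
    local file
    while IFS= read -r file; do
        printf '%s\t%s\n' "$(cksum < "$file")" "$file"
    done | awk -F'\t' '!seen[$1]++ { print $2 }'
}

# Demo with throwaway files standing in for mirrored pages.
dir=$(mktemp -d)
printf 'same text\n'  > "$dir/index.html"
printf 'same text\n'  > "$dir/index.1.html"
printf 'other text\n' > "$dir/about.html"

# index.1.html is dropped because its content matches index.html.
printf '%s\n' "$dir/index.html" "$dir/index.1.html" "$dir/about.html" | unique_pages
rm -rf "$dir"
```

In the script, something like `unique_pages < pagelist.txt > pagelist.unique.txt` just before the counting loop would be the place to try it. Pages whose HTML differs but whose visible text is the same would still be counted twice.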

The comments explain what’s going on here, but I explain in greater detail here on the baldwinsoftware wiki.

Now, if you just want the wordcount for one page, try this:

#!/bin/bash

# add up wordcounts for one webpage

# take the url from the command line, or ask for it
if [[ ! $* ]]; then
    read -p "Please enter a webpage url: " url
else
    url=$*
fi
read -p "How much do you charge per word? " rate
count=$(lynx -dump -nolist "$url" | wc -w)
price=$(echo "$count * $rate" | bc)
echo -e "$url has $count words. At $rate, the price would be US\$$price."
exit

Special thanks go out to the Linux 4 Translators list for some assistance with this script.
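Both scripts hand whatever is typed at the rate prompt straight to bc, so an entry like 0,12 or US$0.12 is likely to produce an error rather than a price. A small check along these lines could catch that first. It is a sketch under assumptions: is_valid_rate is an invented name, and it accepts only digits with an optional decimal part, as the prompt in the first script asks for.

```shell
#!/bin/bash
# Sketch: accept only plain decimal rates such as 0.12 or 12
# (no currency symbols, no commas, no leading dot).
# is_valid_rate is an illustrative name, not something from the original scripts.
is_valid_rate() {
    local re='^[0-9]+([.][0-9]+)?$'
    [[ $1 =~ $re ]]
}

# Try a few entries a user might plausibly type.
for candidate in 0.12 12 .12 0,12 'US$0.12' ''; do
    if is_valid_rate "$candidate"; then
        echo "ok:       '$candidate'"
    else
        echo "rejected: '$candidate'"
    fi
done
```

With that in place, the rate prompt could be wrapped in a loop such as `until is_valid_rate "$rate"; do read -p "Please enter the per word rate: " rate; done`, so the script keeps asking until it gets a usable number.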

Enjoy!

./tony

Written by tonybaldwin

September 20, 2011 at 10:31 pm

How NOT to apply to Baldwin Linguas

with 3 comments

I get a lot of letters like this:

Dear Sirs/To whom it concerns:

I am a native (Language X) speaker. I am available for any potential project you have.
My resume is attached.
Feel free to contact me with any questions.

(name).

These letters are NOT effective.
I delete lots of them, sadly.
I do the same with those who send me a rambling, 3-page life story that includes little of relevance.
Even worse are those who send HTML formatted mail, with flashing images, and about 5 gigabytes of attachments.
I always make certain to report those as SPAM.

On the whole, I rather enjoy “meeting” new providers, colleagues in the field (as I am a translator). I particularly enjoy the various consummate professionals with whom I already work. They are intelligent, fascinating people. Their letters of application reflected those qualities. They were concise, but thorough enough to capture my attention. The letter itself must get my attention. I won’t open an attachment if the letter doesn’t give me a reason to do so. On the whole, I’d rather not receive attachments, anyway. I’d much rather receive a link to your proz profile (be sure it is complete and up to date). They also took the time to see if their language pairs and/or expertise meet my currently expressed needs. (I.e., if I have announced I am seeking translators working ONLY in EN, FR, PT, and ES, why do you write me telling me you work in Pali, Lithuanian or Navajo?)

What I really want to see in a letter of application is more like:

Prezado Sr. Baldwin,

I am an experienced translator of Language X & Y to Language Z.
I have experience in the translation of documents pertaining to (area of expertise), having worked for Company H…I have provided interpretation for jkl…
[i.e., insert a brief description of experience.]
I acquired a BA in (field, e.g., electrical engineering and/or language X, translation, etc.) at University de Fulano Tal in Cidade Bela in 1992, going on to complete an MA in Translation Studies at Cerebro College…
[i.e., a brief description of academic background.]

I work with Openoffice.org, OmegaT and other open source tools, having the capacity to work with all major MSOffice document formats, .pdf, and .html files. [i.e., description of technological capacity.]

My rates for translation are US$0.xx/word, and I can translate a volume of n words per day.
I accept payment via paypal/moneybookers (very key to Baldwin Linguas).

You may learn more about me and my services on my proz profile (insert link).
I look forward to working with you.
Thank you for your time.

Atenciosamente,
Name

I.e., get my attention, tell me what I need to know, and tell me no more.

(Originally posted here, on the Proz.com forums; the thread contains additional commentary and discussion.)

The best means of applying, of course, would be to read our website, find the appropriate link thereon for application, and follow the instructions therein. It’s always best to familiarize yourself with a company before applying to work with them.

Tony

Written by tonybaldwin

September 11, 2010 at 7:03 am

Tux Trans: Linux for Translators released today!

with one comment

This morning I awoke to find an announcement in my inbox of the release of Tux Trans, a gnu/linux distribution based on Ubuntu Linux.

TuxTrans - gnu/linux for translators

Tuxtrans includes all of the software any professional needs for their usual office and communications needs, including web browsers, e-mail clients, VoIP and chat, the fully featured OpenOffice office suite (word processing, spreadsheets, etc.), tools for multimedia, pdf file manipulation and creation, and other desktop publishing tools, plus additional programs specifically useful to translators, including CAT (Computer Aided Translation) software, text alignment tools, software localization tools, and even video subtitling tools.

With these tools, any professional translator is fully equipped to conquer the industry. Seriously.
The underlying system, Ubuntu gnu/Linux, of course, is a solid, fully featured, and very popular gnu/linux distribution (I have Ubuntu on my laptop and my netbook, but Debian on my desktops).
Tuxtrans can be tried without affecting your current system, being a LiveCD distribution (it can run from a CD-ROM without being installed to, or affecting, your hard drive, while installation is, of course, an option once you’ve tried it).

Kudos to Peter Sandrini for putting this all together!

Written by tonybaldwin

May 5, 2010 at 4:06 am

exorcising bad translations

This, my friends, is why Professional Translators are still a necessity.

Il Foglio, an Italian newspaper, has come out criticizing the NY Times, which (OMGSTFUBBQ…can’t believe they did this!) used a computer-generated translation of an article regarding the Vatican’s response to sexual abuse complaints.

The failure to translate led the American newspaper to argue that Cardinal Joseph Ratzinger was protecting a sexually abusive priest from Milwaukee.

The article, titled “New York Times does not translate,” starts by saying, “New York Times columnist Maureen Dowd returned to attack the Pope. Commenting on the words of exorcist Gabriele Amorth, who said that behind pedophile priests is the devil, Dowd suggested a way for the Catholic church to solve the problem: hire a ‘sexorcist.'” 1

Learn from this, kiddies.
When the text is important, neither Google Translate, nor Yahoo! BabelFish is truly your friend.

Go to Proz.com and find a real, professional translator.
Of course, if your text requires translation from any of French, Portuguese or Spanish to American English, I’ve got you covered, right here.

tony


posted with Xpostulate

Written by tonybaldwin

April 13, 2010 at 8:20 pm

This technology can make the language barrier is gone

with 4 comments

Just for grins…

Engrish Mastars

English Mastary made simple....

First, let me state, for the millionth time, that I ❤ GOOGLE!

I use tons o’ google stuff…google search, gmail, google calendar (lifesaver!), google reader, google code, google groups, google plumbing, you name it…Google’s got it, I’m using it. So, I’m not doing this to pick on Google. Even so, a guy has to protect his own interests, no? So, in the interest of demonstrating precisely why even the great Google will not supplant professional, human translators, I took yesterday’s NYTimes article on Google Translate, and ran it through Google Translate. First, I translated it to French, then to Spanish, then back to English.

Now, I have to confess, the result is not unintelligible. Most readers will be able to make some coherent sense of most of the resulting text. Nonetheless, there will be confusion (and laughter). Now, imagine, if you will, the potential confusion, and quite possibly rather dire consequences, were this method of translation used for, say, the instructions on your medication, international treaties, safety regulations, medical device instruction manuals, and a whole smattering of other complex and important textual materials.

There’s going to be confusion

That, folks, is why I still have a job.

And now, for your reading pleasure, the resultant text:


MOUNTAIN VIEW, Calif. – In a meeting with Google in 2004, the discussion focused on an e-mail the company had received from a fan in South Korea. Sergey Brin, one of the founders of Google, ran the message through an automatic translation service that the company had a license.

The message says that Google is a search engine of your choice, but the result is as follows: “The footwear of sliced raw fish you want. Google the green onion!”

Mr. Brin said Google should be able to do better. Six years later, its free Google Translate supports 52 languages, more than any other similar system, and use hundreds of millions of times a week to translate web pages and other texts.

“What you see on Google Translate is the state of the art in computer translation is not limited to a particular area,” said Alon Lavie, research associate professor in the Language Technologies Institute at Carnegie Mellon University.

Google’s efforts to expand beyond Web search has been uneven. Your digital book project, was hanged in the courtyard, and the introduction of its social network, Buzz, has raised fears of intimacy. The model suggests that this can sometimes stumble when it comes to challenge the traditions and conventions of cultural enterprise.

However, Google’s rapid growth to higher levels of translation is a reminder of what can happen when Google releases its power of brute force calculation of complex problems.

The network of data centers built to search the web, now, when united, the biggest team in the world. Google uses this machine to push the limits of translation technology. Last month, for example, said he was working to combine your translation tool with image analysis, allowing a person, for example, taking a photo of a German phone menu and get the machine translation into English.

“Machine translation is one of the best examples that demonstrates the vision of Google, said Tim O’Reilly, founder and CEO of tech publisher O’Reilly Media.” This is not something that someone no one takes seriously. However, Google understands something about the data that nobody understands and is willing to make the investments needed to address these types of complex problems ahead of the market. “

Creating a machine translation has been considered one of the toughest challenges in artificial intelligence. For decades, scientists tried using a team approach standards – teaching language regime of both languages and dictionaries give necessary.

But in half of the 1990s, researchers began to promote a statistical approach. They found that if they feed thousands or millions of computers and their human translations generated parts, you can learn to make assumptions about the exact form to translate new texts.

It turns out that this technique, which requires huge amounts of data and lots of computing power, Google has increased.

“Our infrastructure is well suited to this” Vic Gundotra, Google engineering vice president, said. “We can not adopt approaches that others can only dream.

Machine translation systems are far from perfect, and even Google’s human translators will not work soon. Experts say it is extremely difficult for a team to break a sentence into two parts, and then bring them back.

But the Google service is good enough to convey the essence of a news article, and became a source for quick translations for millions of people. “If you need a rough-and-ready translation is the place to go,” said Philip Resnik, an expert in machine translation and associate professor of linguistics at the University of Maryland, College Park.

Like its competitors in the field, including Microsoft and IBM, Google has promoted its translation engine transcripts of the United Nations, which are translated by the man in six languages, and the European Parliament, which resulted in 23 . This material is used to form systems most commonly used languages.

However, Google has traveled the Web text, and data from their project to digitize books and other sources to go beyond these languages. For more obscure languages, published a guide to help users with translations, then add the text in its database.

Offer Google could make a big hole in the translation business sale software companies like IBM, but machine translation is not likely to be a great Moneymaker, at least not by the standards of advertising google. But Google’s efforts could bear fruit in several ways.

Because the ads are online everywhere, while making it easier for people to use the Web to benefit society. And the system could have interesting applications. Last week, the company said that using speech recognition to generate English language subtitles for videos from YouTube, which could then be translated into 50 languages.

This technology can make the language barrier is gone,” said Franz Och, Google’s chief scientist who heads the team of the automatic translation company. This would allow anyone to communicate with anyone else. “

Mr. Och, a German researcher who previously worked at the University of Southern California, said he was reluctant to join Google, fearing that it would be the translation as a side project. Larry Page, Google’s other founder, called to reassure him.

“I just said is something that is very important to Google,” he recalled recently by Mr. Och. Mr. Och signed in 2004 and quickly was able to bring the promise of Mr. Page in the test.

While many translation systems such as using Google for one billion words of text to create a model of a language, Google has gone much more: hundreds of billions of few words in English. “The models are getting better the process rather than text,” said Och.

The effort was worth it. A year later, Google has won a competition run by the government that proof of sophisticated translation systems.

Google has used a similar approach – computing power, mounds of data and statistics – to address other complex issues. In 2007, for example, began offering 800-GOOG-411, directory assistance calls free interpretation of spoken. It has allowed Google to get the votes of millions of people who do better in the English speech recognition.

A year later, Google launched a search for the voice system that was as good as the other companies that have taken years to build.

And last year, Google launched a service called glasses, which analyzes the image of the phone, which is an online database of more than one billion images, including pictures of her taken to the streets Street View service.

Mr. Och has acknowledged that the Google translation still needs improvement, but he said he feels better quickly. “The curve of the current quality improvement is still very strong,” he said.

http://www.nytimes.com/2010/03/09/technology/09translate.html

This article was translated by Google, the English, then French, Spanish, then back to English.

TRANSLATORS domain of man!*

🙂


Tony

*(this phrase was “Human Translators Rule!!” prior to the above treatment)

Just for fun, I ran that article through Simplified Chinese, then Czech, then back to English, again.

here is that result

Written by tonybaldwin

March 10, 2010 at 10:42 am

Machine Translations, Google, and my job…

with 2 comments

I just thought I’d share this, quickly: Google’s Computing Power Refines Translation Tool

Google’s efforts to expand beyond searching the Web have met with mixed success. Its digital books project has been hung up in court, and the introduction of its social network, Buzz, raised privacy fears. The pattern suggests that it can sometimes misstep when it tries to challenge business traditions and cultural conventions.

But Google’s quick rise to the top echelons of the translation business is a reminder of what can happen when Google unleashes its brute-force computing power on complex problems.


Being both a computer technology geek and a professional HUMAN translator, I have mixed feelings, of course, about MT, or Machine Translation. Personally, I don’t think MT will ever replace humans. Ever. Language is just too complex.
The internet is riddled with humorous examples of bad machine translation. Just take a look at Engrish.com, or, here’s a lovely example right here: Lost in Translation, Seriously.
Funny stuff.

Computers, of course, are a very powerful and useful tool in translation. I would never deny that. Computer technology has brought about a great many changes in the translation industry over the past several years. Many translators feel threatened by that technology. I prefer to embrace it, frankly. I see it as a tool, not a threat. I confess, I use Google Translate sometimes. You already know I ❤ Google. Moreover, OmegaT, my preferred CAT (computer aided translation) tool, now has an optional integrated Google Translate feature, so that, while I am translating a document, OmegaT will show me the Google Translate result for each segment. I have to say that instances in which I can simply insert that result without editing it are few. Perhaps 15 to 20%. I suppose that’s not too bad, really, considering the success of earlier attempts at MT, but it is also a clear indication that, without MY intervention, the translation would come out terribly. Sometimes this Google Translate feature is helpful, speeds things up, makes my work more efficient. I have found, however, that if I use Google Translate to translate an entire document, the revision process thereafter often becomes so cumbersome that the job becomes more work than it would have been had I simply translated the document on my own. Or with OmegaT, with Google Translate at my side. Using OmegaT with Google Translate, I have access to the utility in Google’s tool, using its results only when appropriate, and my work does become more efficient. This becomes a sort of ménage à trois of Computer Aided Translation, Machine Aided Human Translation, and, of course, Human Translation. Or, we could just call it “Human Aided Machine Translation” (not a new term). No matter what you call it, machine translations will always, in my opinion, require human intervention. So, as I see it, Machine Translation is a useful tool. But it will never, ever take the place of professional, human translators.
Language, and the human brain, are simply too complex.


Relevant links:

Written by tonybaldwin

March 9, 2010 at 11:05 am

New TransProCalc manual

Hi!

I just wanted to post quickly and let everyone know that, thanks to technical writer Anindita Basu, we now have a brand new, shiny TransProCalc Manual available for the TransProCalc project, both in an online version and for download as a pdf document.

I still have plans to step up development on the project and add all kinds of useful features…stay tuned.

Written by tonybaldwin

March 7, 2010 at 5:16 pm