tonybaldwin | blog

non compos mentis

Posts Tagged ‘conversion’

DjVu: Free alternative to PDF (and a script to convert plain text to DjVu)

with 14 comments

[Screenshot: this article, as .djvu, in djview4]

First, a bit of ranting about open standards and free file formats:
Okay, you know I’m always harping about using Open Document Formats.
So, on the LibreOffice user list today there was discussion of a viable Free/Open alternative to .pdf files. After all, PDF is, indeed, a proprietary format, owned by Adobe, and it is ubiquitous, and there really should (must, perhaps) be a free, open alternative. Someone on the list mentioned DjVu, which, frankly, I’d never looked at before (I had heard of it, but knew not what it was). It’s a free/open file format that was initially created for scanned documents, from what I gather, has been around since the late 90s, is still maintained by its original authors, and is now used for all kinds of groovy stuff.
I did a bit of research, googling, apt-cache searching, and poking around. Eventually, I aptitude installed djview4 and djvulibre and experimented a little. I have concluded that DjVu is an excellent candidate, and for many reasons a better option, for the purpose .pdf currently serves (a portable document format that preserves formatting, essentially). Works great.

But there IS a rather glaring drawback…
The one big drawback is that conversion tools are lacking.
One cannot, for instance, simply write a DjVu file in any kind of document editor, the way you can write a pdf with many different editors, web browsers, most office software, LaTeX editors, basic text editors such as tcltext, and, frankly, even from a command line interface.
To create DjVu, you can only convert other files to DjVu.
And, in general, and this is what most irritates me, it seems you have to convert from non-free formats. There are no tools, for instance, to convert directly from plain text, LaTeX (.tex), OpenDocument (.odt), .png, or even html files to a .djvu file. What’s worse is that all of your Free and/or open source browsers, document editors, etc., will export or print a file to .pdf, but not to .djvu. OpenOffice.org will write a .pdf. LibreOffice and Abiword will write a .pdf. LaTeX editors will write a .pdf. Everybody will write a .pdf, but nobody has written code to write a file directly to .djvu. In my opinion, that needs changing. We need to use open standards and free/open file formats (all kinds of reasons for that are discussed in this entry to this blog).

That said, today I wrote a script to convert a plain text file to DjVu (but, yes, I had to round-trip it through .pdf, darn it).
This script was written on a Debian/Stable (lenny at the time of this writing) system, on AMD64 arch, using only tools available in the lenny repos.
It requires (obvious when you read the script) enscript, ps2pdf, and pdf2djvu (which builds on djvulibre), plus file and iconv for the encoding check.
The script first converts your text file to postscript with enscript, then from postscript to pdf, with, surprise, ps2pdf, and, then, the final step of converting to .djvu.

The script looks like this:
#!/bin/bash

# Converting a text file to a DjVu file
# copyright © tony baldwin / tony@baldwinsoftware.com
# release according to the terms of the GNU Public License, v. 3 or later

# first, make sure you named a file. duh.

if [[ -n "$*" ]]; then
text="$*"
else
echo "try again, and include the file name..hello!" >&2
exit 1
fi

# okay, enscript likes ASCII best, so let's test our file encoding
# if we have anything other than ASCII, we will convert with iconv
# (note: this replaces the original file with its ASCII version)

enc="$(file --brief --mime-encoding "$text")"
echo "This file is encoded as $enc"

if [ "$enc" != us-ascii ] ; then
echo We need to convert to ascii first.
echo Converting text encoding now ...
iconv -f "$enc" --to-code=ascii//TRANSLIT "$text" > tempy
mv tempy "$text"
newenc="$(file --brief --mime-encoding "$text")"
echo "Ok, now we have $newenc encoding and can proceed with conversion to djvu ..."
fi

# from here, things are fairly self-explanatory

echo "converting $text to $text.ps"

enscript "$text" -q -B -p "$text.ps"

echo "converting $text.ps to $text.pdf"

ps2pdf "$text.ps" "$text.pdf"

echo "converting $text.pdf to $text.djvu"

pdf2djvu "$text.pdf" -o "$text.djvu"

echo renaming ...

rename.ul .txt.djvu .djvu "$text.djvu"

echo cleaning up ...

rm "$text.ps" "$text.pdf"

echo all done

# here, we take the variable $text, which is $filename.txt, and strip it down to $filename
# so we can append .djvu and open the resulting file in djview4

ntx="${text%.*}"

djview4 "$ntx.djvu" &

exit

# This program was written by anthony baldwin - tony@baldwinsoftware.com
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.

What is it doing? The first thing the script does is check the encoding of the file in question. Enscript seems to play nice with ASCII, but not utf8 or some other encodings, so we convert to ASCII before doing anything else (note that this overwrites the original text file). Then the script converts the text file to postscript with enscript, then from postscript to pdf with, surprise, ps2pdf, and, finally, to .djvu with pdf2djvu. At the end, the script cleans up the directory, removing the .ps and .pdf files, and opens the result in djview4. I have commented the script accordingly.
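
For what it’s worth, the same pipeline can be arranged so that the original text file is never modified. This is only a sketch, not the script above: the function names djvu_name and txt2djvu are made up for illustration, the intermediate files go into a temporary directory, and it assumes file, iconv, enscript, ps2pdf, and pdf2djvu are all installed.

```bash
#!/bin/bash
# Sketch only: text -> PostScript -> PDF -> DjVu without touching the input.
# djvu_name and txt2djvu are hypothetical names, not part of the original script.

# Derive the output name, e.g. notes.txt -> notes.djvu
djvu_name() {
    local in="$1"
    case "${in##*/}" in
        *.*) printf '%s.djvu\n' "${in%.*}" ;;   # strip the last extension
        *)   printf '%s.djvu\n' "$in" ;;        # no extension to strip
    esac
}

txt2djvu() {
    local in="$1" out tmp enc
    [ -f "$in" ] || { echo "no such file: $in" >&2; return 1; }
    out="$(djvu_name "$in")"
    tmp="$(mktemp -d)" || return 1

    # Work on an ASCII copy inside the temp directory.
    enc="$(file --brief --mime-encoding "$in")"
    if [ "$enc" = us-ascii ]; then
        cp "$in" "$tmp/in.txt"
    else
        iconv -f "$enc" -t ASCII//TRANSLIT "$in" > "$tmp/in.txt"
    fi &&
    enscript -q -B -p "$tmp/in.ps" "$tmp/in.txt" &&
    ps2pdf "$tmp/in.ps" "$tmp/in.pdf" &&
    pdf2djvu -o "$out" "$tmp/in.pdf"
    local status=$?

    rm -rf "$tmp"   # intermediate .ps and .pdf go away with the temp dir
    return "$status"
}
```

Calling `txt2djvu notes.txt` should then leave notes.djvu next to an unchanged notes.txt.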

I actually turned the script on itself, and created a DjVu file of this text, available here.
With this, I may very well add the capacity to export a .djvu file to tcltext. Why not? It’s just a shame, imho, that such an export is not direct, and that it has to cross into proprietary territory via .pdf in order to be accomplished.

Also, as a gift to my fellow freedom fighters, foss hackers, and open standards supporters, I have created a DjVu of my poetry here, which contains all the poems published in my recent book (but not the paintings and photographs).

And, this full article in djvu format here. This last was fun, because I ended up having to change the text encoding first. Apparently enscript doesn’t like utf8. I had copy/pasted the article into tcltext, which generates utf8 here (system default). The first .djvu I made had all these weird character substitutions (like /200a#blahblah for a quotation mark?). This is why I updated the script to convert the text encoding before running enscript.
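
If you want to see whether a text file is going to give enscript this kind of trouble before converting anything, counting the bytes outside the 7-bit ASCII range is one quick check. A small sketch, assuming only the standard tr and wc tools; non_ascii_bytes is a name invented here. A UTF-8 curly quotation mark, for example, takes three bytes, none of which are ASCII.

```bash
#!/bin/bash
# Sketch only: count the bytes in a file that are not 7-bit ASCII.
# non_ascii_bytes is a hypothetical helper name.

non_ascii_bytes() {
    # Delete every byte in the range 0-127 and count whatever is left.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

A result of 0 means the file is already plain ASCII; anything else means the iconv step is needed.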

Now, if you use firefox or some other mozilla derivative, there’s actually a plugin for viewing such files in your browser, included in the djvulibre packages. Otherwise, you’ll need a djvu viewer, such as djview or evince.

Anyway,
Enjoy.

./tony


Este artículo en español: http://www.gnewbook.org/pg/blog/tonybaldwin/read/83736/djvu-excelente-substituto-al-formato-nolibre-de-pdf-pero
Esse artigo em português: http://softwarelivre.org/tonybaldwin/blog/djvu-otimo-substituto-ao-formato-nao-livre-de-pdf-mas…

Written by tonybaldwin

January 19, 2011 at 3:43 pm

Convert an .html file to .pdf in gnu/linux

There are various options for converting .html files to .pdf on a gnu/linux operating system. Your choice of method will depend on the complexity of the file you wish to convert, and your familiarity with the tools a gnu/linux system provides.

What you’ll need:

  • Gnu/linux operating system
  • Html file
  • Web browser

Optional:

  • Openoffice.org office suite
  • wget
  • html2ps
  • ps2pdf

Simply “Print to file”
One very simple option for creating a .pdf file from an .html file is to open the file in your browser and choose “Print”. When the print dialog appears, choose “Print to File”, and indicate “PDF”. This will write the html file out in pdf format.
[Screenshot: html to pdf conversion: print to file]

Here is a pdf of this article generated in this fashion: converthtml2pdfgnulinux.pdf

OpenOffice.org

“Print to File” works well for basic html files with simple text and some images. If the html file in question has more complex formatting, this option may not always produce the best results. Luckily, other options exist.

Save the html file to your computer (if you haven’t already done so), and open it with OpenOffice.org’s html editor (ooweb). Then simply go to the “File” menu, and choose “Export”. OpenOffice.org will then offer you the usual options for saving a file, such as choosing where to save it and what name to give it, and, presto-magico, will produce a .pdf file from your .html file.

Command Line

Of course, no linux how-to article would be complete without instructions on how to accomplish your task using only the magical Bash command line interface. For those so inclined, then, the following is a complete process for acquiring an .html file and converting it to a .pdf file. In order to proceed with this method, the following software must be installed on your computer: wget, html2ps, and ps2pdf (which comes with Ghostscript). These programs are either already a part of most gnu/linux distributions by default, or can be easily acquired with your favorite package manager (apt, yum, pacman, portage, etc.).
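
Before going any further, it may be worth checking that those three programs are actually there. A minimal sketch, using only the shell’s own `command -v`; missing_tools is a helper name invented for this example.

```bash
#!/bin/bash
# Sketch only: print the name of each requested command that is not installed.
# missing_tools is a hypothetical helper name.

missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools wget html2ps ps2pdf` prints nothing when everything is installed, and otherwise lists what still needs to be fetched with your package manager.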

First, let’s save the file to your computer:
wget http://www.somesite.com/yourfile.html

Next, let’s convert the .html file to a postscript or .ps file:
html2ps yourfile.html > yourfile.ps

Then, we’ll convert the postscript file, finally, to a .pdf file:
ps2pdf yourfile.ps

Voila!
You should now have “yourfile.pdf”.

This could, of course, all be scripted.

#!/bin/bash

# convert webpages to pdf files
# get url
echo "Enter the url of the page to be converted:"
read -r page
#download page
wget "$page"

file=$(basename "$page")
#convert to postscript
html2ps "$file" > "$file.ps"
#convert to pdf (writes $file.pdf)
ps2pdf "$file.ps" "$file.pdf"
#clean up extraneous files
rm -f "$file"
rm -f "$file.ps"
#clean up file name (perl rename; dots escaped so they match literally)
rename 's/\.html\.pdf$/.pdf/' *.pdf

echo "done"

exit

Here is a pdf of this article, generated via this command line method: convertweb2pdflinux.pdf
Notice that it is different from the above pdf created with “Print to file”. One difference, which may be either advantageous or undesired depending on your goals, is that text in this file can be selected and copied, which is not true of the first file.
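
The script above fixes the file name at the end with rename, which touches every .pdf in the directory. One alternative is to work out the final name up front and keep the intermediate files in a temporary directory. The following is a sketch under assumptions, not a replacement for the script: pdf_name_for_url and url2pdf are invented names, and because html2ps is given a local copy of the page, images referenced with relative paths may not come through.

```bash
#!/bin/bash
# Sketch only: name the PDF up front instead of renaming it afterwards.
# pdf_name_for_url and url2pdf are hypothetical helper names.

pdf_name_for_url() {
    local url="$1" name
    url="${url%%[?#]*}"    # drop any query string or fragment
    url="${url%/}"         # drop one trailing slash
    name="${url##*/}"      # keep the last path component
    name="${name%.*}"      # strip its extension, if it has one
    printf '%s.pdf\n' "${name:-index}"   # fall back to index.pdf if nothing is left
}

url2pdf() {
    local url="$1" out tmp
    out="$(pdf_name_for_url "$url")"
    tmp="$(mktemp -d)" || return 1

    wget -q -O "$tmp/page.html" "$url" &&
    html2ps "$tmp/page.html" > "$tmp/page.ps" &&
    ps2pdf "$tmp/page.ps" "$out"
    local status=$?

    rm -rf "$tmp"   # the downloaded page and the .ps file are discarded here
    return "$status"
}
```

With this, `url2pdf http://www.somesite.com/yourfile.html` should leave yourfile.pdf in the current directory and nothing else.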

XHTML2PDF

In many cases, you may wish to create a pdf file from a complex .html or .xhtml file that includes .css (cascading style sheets) or other elements that the above methods will not render the way the page appears on the Internet.

For those cases, there is a program called xhtml2pdf. This program is less likely to be a part of most gnu/linux distributions by default, or to be available from said distributions’ repositories. As such, you may have to download and install it by hand. Thankfully, the site for this program is easily enough found at http://www.xhtml2pdf.com/, and, of course, the program is free, open source software.
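
Once it is installed, a conversion might look something like the sketch below. This assumes the package provides a command named xhtml2pdf that takes a source file followed by a destination file; do check `xhtml2pdf --help` on your own system, since I am not vouching for every version. The wrapper name xhtml_to_pdf is invented for this example.

```bash
#!/bin/bash
# Sketch only: convert one local (x)html file to pdf with xhtml2pdf.
# Assumes the command-line form "xhtml2pdf SOURCE DEST"; verify with --help.
# xhtml_to_pdf is a hypothetical wrapper name.

xhtml_to_pdf() {
    local src="$1"
    local dest="${2:-${src%.*}.pdf}"   # default: same name with a .pdf extension
    [ -f "$src" ] || { echo "no such file: $src" >&2; return 1; }
    command -v xhtml2pdf >/dev/null 2>&1 ||
        { echo "xhtml2pdf is not installed" >&2; return 1; }
    xhtml2pdf "$src" "$dest"
}
```

So `xhtml_to_pdf yourfile.html` would aim to produce yourfile.pdf, and a second argument picks a different output name.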

And, of course, here is a pdf of this article generated with xhtml2pdf: xhtml2pdfconversion.pdf

There’s more!

Yet other methods exist for generating .pdf files from .html files, of course, and an attempt to compile an exhaustive list, with instructions for each, would be beyond the scope of this article.

Written by tonybaldwin

May 18, 2010 at 3:00 pm

Posted in free software, gnu/linux

Oggify (improved)

with one comment

Aside from any argument over which audio file format is best (I like FOSS, thus ogg), I reworked my oggify script, attempting to implement what I learned from comments at

Oggify (improved)

Written by tonybaldwin

September 4, 2008 at 11:56 pm