Deep Thoughts by Robert Felty

thoughts on wordpress, latex, cooking et alia

Archive for the 'perl' Category

Monday, September 15th, 2008

Why doesn’t Mac update standard UNIX utilities?

I am currently teaching a course on programming for linguists. We are using python, but for the first few classes, I have been going over some standard UNIX utilities like cd, ls and such, plus using regular expressions with grep and sed. I actually don’t use sed that much. I tend to reach for perl, since I know it better, and it can do pretty much all the same stuff that sed can plus much more. But sed is simpler than perl, and I basically just wanted to use it for doing substitutions.

Today I got an e-mail from a student asking why the following did not seem to be working:

echo abcd123 | sed 's/\([a-z]*\).*/\U\1/'

The student reported the following output: “Uabcd”. (The expected output is “ABCD”, which is what I get on Linux)

I tried it, and it worked fine for me. Then I thought: maybe this is a Mac/Linux problem. Sure enough, when I look at the man page for my Fedora 7 box, it tells me that my version of sed is GNU 4.1.5, from June 2006. Mac Leopard (10.5) is using BSD sed from July 2004. Leopard came out in 2007, as did Fedora 7. Why is it 2 years behind? Why is it still using python 2.4? Why doesn’t it come with useful utilities like dos2unix? Mac has done a great job of making a nice GUI, with some pretty cool applications like iLife. It is falling behind when it comes to the command line utilities though.
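
Part of the explanation is that `\U` in the replacement is a GNU sed extension, so a BSD sed treats it as a literal `U` no matter how recent it is. Here is a minimal sketch of a workaround that should behave the same on both, assuming only POSIX `sed` and `tr`; the function name `upper_prefix` is made up for illustration.

```shell
#!/bin/sh
# upper_prefix STRING
# Print the leading run of lowercase letters in STRING, uppercased.
# The case change happens in tr, so no GNU-only \U escape is needed.
upper_prefix() {
  printf '%s\n' "$1" | sed 's/\([a-z]*\).*/\1/' | tr '[:lower:]' '[:upper:]'
}

upper_prefix abcd123   # prints ABCD
```

The substitution itself is unchanged from the student's command; only the uppercasing moved out of sed.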

Wednesday, September 3rd, 2008

Perl slurping

It seems like whenever I go to slurp in a whole file into a string in Perl, I have to search around to remember the exact syntax. So I decided to put it here for myself, so I won’t have to search any further than my own site. In this particular instance, I am trying to remove any <span> blocks from a file. I can simply do the following:

perl -e '$string = do {local( $/ ); <> }; $string=~s/<span>.*?<\/span>//gs; print $string;' < infile > outfile

Monday, November 5th, 2007

100 yootles bounty for solution to nested loop rounding error

I am working on some Monte Carlo simulations. I want to do a particular manipulation n times, but I want to constrain what I do based on three parameters, x, y, and z, which are probability distributions coded as arrays. For example, if I want to run this simulation 1000 times, then 24 should be xayaza, 24 xayazb, 72 xayazc and so on. My code seems to work right when everything is an integer, but not when some of the numbers are non-integer reals (always positive). I have tried a couple of different strategies of rounding, ceiling, and floor, but it seems to always be off by a few. The correct solution should output $total = $trials;

I will offer 100 yootles to the first person who finds a solution for me. (I am including the perl code I have been playing around with, but my problem lies in the algorithm, not in the syntax.)

#!/usr/bin/perl -w
use strict;
use POSIX qw(floor ceil);
my $trials = @ARGV ? shift : 795;  # default to 795 when no argument is given
my @x = (.4,.4,.2);
my @y = (.3,.5,.2);
my @z = (.2,.2,.6);

my $trial =0;
my $total =1;
my $a=0;
while ($a<scalar @x && $trial <= round($trials*$x[$a])) {
  my $b=0;
  while ($b< scalar @y && $trial <= round($trials*$x[$a]*$y[$b])) {
    my $c=0;
    while ($c< scalar @z && $trial <= round($trials*$x[$a]*$y[$b]*$z[$c])) {
      $total++;
      $trial++;
      if ($trial >= round($trials*$x[$a]*$y[$b]*$z[$c])) {
        $c++;
        $trial=0;
      }
    }
    $b++;
    $trial=0;
  }
  $a++;
  $trial=0;
}
print "trials = $trials, total = $total\n";

sub round {
    my($number) = shift;
    return int($number + .5 * ($number <=> 0));
}

Update

My friend Danny Reeves, along with some help from David Yang, solved my problem in a completely different way. Here is Danny’s solution:

#!/usr/bin/perl
# Rob's monstrosity that is surely the solution to the wrong problem.
# But for 100 yootles, we'll just do as we're told.
# This would be much nicer in a properly functional-style language!

my $trials = 795;

my @x = (.4,.4,.2);
my @y = (.3,.5,.2);
my @z = (.2,.2,.6);

@tuples = cross(\@x,\@y,\@z);

# compute idealized tuple counts:
@counts = map { $trials*prod(@$_) } @tuples;

@ic = map { int($_) } @counts;  # floors of counts.
@fc = map { $_-int($_) } @counts;  # fractional parts.
$f = sum(@fc);  # sum of fractional parts, to redistribute.

# redistribute...
for($i=1; $i<=$f; $i++) {
   $ic[posmax(deltas(\@counts,\@ic))]++;
}

$total = 0;
for($i=0; $i<scalar(@ic); $i++) {
   ($x, $y, $z) = @{$tuples[$i]};
   for($j=0; $j<$ic[$i]; $j++) {
     print "do something with ($x, $y, $z)\n";
     $total++;
   }
}

print "trials = $trials, total = $total\n";


# Return a cross product from its arguments. Arguments are array refs.
# Result is a list of array refs.  [found this on the web; damn slick]
# (note that this returns the tuples in not quite canonical order)
sub cross {
   my @r = [];
   @r = map {my $s = $_; map {[@$_ => $s]} @r} @$_ for @_;
   @r
}

# Return sum of arguments.  Ie, reduce(+, args, 0).
sub sum { my $x = 0;  for(@_) { $x += $_; }  $x }

# Return product of arguments.  Ie, reduce(*, args, 1).
sub prod { my $x = 1;  for(@_) { $x *= $_; }  $x }

# Return a list of differences between 2 lists, passed as refs.
# (assumes the lists have the same length)
sub deltas {
   my @ans;
   for(my $i=0; $i<scalar(@{$_[0]}); $i++) {
     push(@ans, $_[0]->[$i] - $_[1]->[$i]);
   }
   @ans
}

# Takes list, return the position of the largest element.
sub posmax {
   if (scalar(@_)==0) { return -1; }
   my $x = 0;  # index of best so far.
   for(my $i=0; $i<scalar(@_); $i++) {
     if($_[$i] > $_[$x]) { $x = $i; }
   }
   $x
}

And because Danny is a huge fan of Mathematica, he also included a Mathematica version:

trials = 795;
x = {.4, .4, .2};
y = {.3, .5, .2};
z = {.2, .2, .6};

(* index of the largest element; i'm sure there's an adorable one-liner
    for this if i thought hard enough *)
posmax[{}] = -1;
posmax[l_] := Module[{i, x = 1}, (* x is index of best so far *)
   For[i = 1, i <= Length[l], i++, If[l[[i]] > l[[x]], x = i]];
   x]

tuples = Tuples[{x, y, z}];
counts = (trials*Times @@ # &) /@ tuples;
ic = IntegerPart /@ counts;
fc = FractionalPart /@ counts;
f = Total[fc];
For[i=1, i<=f, i++, ic[[posmax[counts - ic]]]++];
total = 0;
MapThread[Do[{"do something with ", #1}; total++, {#2}]&, {tuples,ic}];
total
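
The redistribution step in both versions resembles what is usually called the largest-remainder method: floor every ideal count, then hand the leftover trials to the cells with the biggest fractional parts. Below is a rough sketch of that idea in shell with awk, not a port of Danny's code. It assumes the per-cell probabilities have already been multiplied out (one argument per cell) and that they sum to 1. The name `apportion` is hypothetical. It computes the leftover as trials minus the sum of the floors, rather than summing the fractional parts, so floating-point drift cannot change the number of extra trials handed out.

```shell
# apportion TRIALS P1 P2 ...
# Print one integer count per probability; the counts sum to TRIALS
# as long as the probabilities sum to 1.
apportion() {
  trials=$1
  shift
  printf '%s\n' "$@" | awk -v n="$trials" '
    {
      c[NR] = int(n * $1)        # floor of the ideal count
      r[NR] = n * $1 - c[NR]     # fractional part left over
      used += c[NR]
    }
    END {
      left = n - used            # whole trials still to hand out
      for (k = 0; k < left; k++) {
        best = 1
        for (i = 2; i <= NR; i++) if (r[i] > r[best]) best = i
        c[best]++
        r[best] = -1             # each cell gets at most one extra
      }
      for (i = 1; i <= NR; i++) print c[i]
    }'
}

apportion 10 0.5 0.25 0.25   # prints 5, 3, 2 (one per line)
```

For the three-array case, each argument would be one product such as 0.4*0.3*0.2, in whatever order the tuples are generated.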

Wednesday, April 25th, 2007

Picasa, JAlbum, and null bytes

I have recently been trying to transition from Mac to Linux, with much success for the most part, but a few hiccups as well, as is to be expected. One of the important uses of the computer for me is photo editing and sharing, especially since we got our Canon Rebel XT last year, which takes absolutely beautiful pictures. I had developed a fairly nice workflow on my Mac for photo editing and sharing, consisting of:

  1. Import pictures from camera to iPhoto
  2. Delete bad pictures
  3. Create new albums
  4. Edit some photos for lighting, cropping etc.
  5. Add titles and comments to photos
  6. Use caption buddy to write iPhoto comments to IPTC tags
  7. Use an AppleScript to rename all the images in an album with a meaningful name followed by an automatically increasing number
  8. Export the images from iPhoto
  9. Use rsync to upload the pictures to my webserver
  10. The webserver has a cron script which looks for new pictures, and then runs JAlbum to create new web albums for my pictures

It sounds like a lot of steps, but actually it was going quite quickly for me, and without many hitches. The biggest issue was with the IPTC tags. Using caption buddy is kind of a hack to get around that. It would be nice if iPhoto just wrote tags to the files by default. Oh well.

To get the same results with Linux, there are a number of changes. Obviously there is no iPhoto for Linux. The next best thing (actually better in some regards) is Picasa. Picasa is a Windows program put out by Google (they acquired it several years ago). As of about a year ago, there is now a Linux version as well. The Linux version requires WINE, which allows one to run Windows programs on other platforms (mostly Linux, but there is also a Mac version). Especially considering that Picasa is not a native Linux application, it runs remarkably fast and is very stable. It has some nice editing features, including an “I’m Feeling Lucky” button, which seems to do much better than iPhoto’s “enhance” button. It also includes a unique scrollbar, which changes scrolling speed, depending on how much you move it. Also, unlike iPhoto, it writes captions as IPTC captions, which are embedded in the file, which is handy, especially since that is how JAlbum knows about them. JAlbum will extract the IPTC tags, along with EXIF tags which include information about the settings of the camera for each picture. It will then make some nicely compressed and small images suitable for web-viewing. JAlbum is also highly customizable, which I like very much.

All that being said, both Picasa and JAlbum seem to have a few bugs.

Bug #1 — Null bytes

When uploading some pictures I had processed with Picasa recently, I noticed a strange character at the end of each caption. Viewing the album with Firefox on a Mac, this showed up as a question mark, making it look as if I were unsure about all of my captions. After some reading on the JAlbum forum, I saw a post claiming that this was a null byte character. That was helpful. I tried looking at the image files in my favorite text editor, vim, and saw a bunch of gobbledygook, along with some captions that I could recognize. There did seem to be some extra characters after the caption, but I couldn’t figure out which one was the null byte character. After some more searching about vim and control characters, I found out that the null byte character shows up as ^@ in vim, and that if I want to type one, I have to type Ctrl-V Ctrl-J. That meant I could remove the null byte character in vim! Well, I tried this with the image files, and that corrupted them. Bummer. Then I started looking at the html that JAlbum was generating, and indeed, the null byte character was still there. I tried doing a search and replace with vim, and that worked wonderfully. Unfortunately though, using vim to hand-edit a bunch of files was not acceptable to me, so I looked for a perl solution. After yet more searching, I discovered I could do a search and replace with perl, and that perl represents the null byte character as \0. Finally I had a solution. I simply added the following line to the shell script that runs JAlbum (after it is done processing with JAlbum):

for file in *.php; do cat $file|perl -pe 's/\0//g'>$file.tmp; mv $file.tmp $file; done
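
A slightly more defensive variant of that loop, as a sketch: it quotes the file names, only replaces a file if the filter succeeded, and uses `tr` in place of perl to delete the null bytes. The function name `strip_nulls` is an assumption of mine, not part of the JAlbum script.

```shell
# strip_nulls FILE...
# Remove NUL bytes from each named file, going through a temporary file
# so that a failed run leaves the original untouched.
strip_nulls() {
  for file in "$@"; do
    tmp="$file.tmp"
    tr -d '\000' < "$file" > "$tmp" && mv "$tmp" "$file"
  done
}

# Example call from the directory JAlbum wrote to:
#   strip_nulls *.php
```

Like the original one-liner, this should only be pointed at the generated text files, never at the images themselves.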

Bug #2 incorrectly ordered metadata

This seemed to be working fine, until I noticed that JAlbum was not able to process several of my pictures. Instead of getting a picture with a caption, I only got a caption, and JAlbum would return an error that it failed to process several pictures.

JFIF APP0 must be first marker after SOI

This had actually been happening for a while with 2 of the 1000 or so images I have. Since it was only 2, I did not worry about it. But now this was happening with many of the new images I had just uploaded. It quickly became apparent that images that I had tweaked (color, cropping, etc.) with Picasa were the ones that were not getting processed correctly. After searching a bunch more, I have come to the conclusion that it has something to do with the ordering of metadata in images, and how Java processes that metadata. It seems that Java expects the metadata to be in a very particular order, and a different one from what Picasa outputs. That being said, it seems that many other programs can read in the files that Picasa produces. So it is not really clear to me what program is at fault. I just want a solution.

Solution #1 — Reprocess with JAlbum

After quite a bit of searching around the interweb, I found a post on the JAlbum forum that mentioned this bug. The solution: “Turn off the EXIF info”. Sure enough, this did the trick. However, that means that I was losing valuable information, and that was not acceptable for me. However, I noticed that if I processed the images with the EXIF info off, then turned it on, both my images and my captions would show up, in spite of the error message the second time around. So I decided to process all of the photos twice. This was not ideal, but it seemed like a solution.

Solution #2 — Reprocess with ImageMagick

I then started thinking some more, and I recalled a LaTeX problem I had back when I was just learning. I was learning how to import images, and I was having a problem getting the right bounding box on the .eps file I was trying to import. A more experienced TeXnician told me to use eps2eps on the image, and that that often corrected bounding boxes, when the program that produced the image in the first place had screwed it up. I found it very odd that there was a program to convert eps to eps, but that is what that program does. Sure enough, it worked. So I started thinking if I could use a similar technique here. I had also read on the JAlbum forum that someone tried simply opening the image with Photoshop and resaving it. That sounded like a good idea, but I needed a solution which could be automated. So I tried using convert from the ImageMagick suite, and that did the trick. Convert produces a new file though, which I did not want. So instead, I tried mogrify, which changes the original file. That worked! To reiterate the conundrum, JAlbum does not like the ordering of metadata in files generated with Picasa, but ImageMagick inputs them just fine, and outputs a format that JAlbum likes. Strange.

UPDATE: I just discovered that ImageMagick version 6.3.2-6.3.3 has a bug with IPTC captions, which simply deletes them altogether. Make sure your ImageMagick version is either newer or older than this range.
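
One way to guard against that version range before touching any files is sketched below. Both function names are hypothetical. The sed pattern assumes that `mogrify -version` prints a line beginning with "Version: ImageMagick " followed by the version number, which is worth checking against your own install.

```shell
# iptc_safe_version VERSION
# Succeed unless VERSION (e.g. "6.3.2" or "6.3.3-4") falls in the
# 6.3.2-6.3.3 range that deletes IPTC captions.
iptc_safe_version() {
  case "$1" in
    6.3.2 | 6.3.2-* | 6.3.3 | 6.3.3-*) return 1 ;;
    *) return 0 ;;
  esac
}

# mogrify_all FILE...
# Rewrite each file in place, but only when the installed ImageMagick
# passes the check above.
mogrify_all() {
  version=$(mogrify -version | sed -n 's/^Version: ImageMagick \([^ ]*\).*/\1/p')
  if ! iptc_safe_version "$version"; then
    echo "ImageMagick $version may delete IPTC captions; skipping" >&2
    return 1
  fi
  for file in "$@"; do
    mogrify "$file"
  done
}
```

A call such as `mogrify_all *.jpg` would then stand in for the bare mogrify loop.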

Okay then. Now to present my new photo editing and sharing workflow:

  1. Import pictures from camera to Desktop/originals (Picasa automatically adds them to its database)
  2. Delete bad pictures
  3. Create new albums
  4. Edit some photos for lighting, cropping etc.
  5. Add captions to photos (Picasa only has one IPTC tag available for editing)
  6. Export the images from Picasa to Desktop/modified
  7. for file in *.jpg; do mogrify "$file"; done
  8. Run shell script to rename pictures to meaningful names with auto-incrementing numbers (see the script below)
  9. Use rsync to upload the pictures to my webserver
  10. The webserver has a cron script which looks for new pictures, and then runs JAlbum to create new web albums for my pictures

#!/bin/bash
#renamePics
# this script renames pictures based on user input, and automatically numbers
# them, including 0 padding
dir=$1
base=$2
ext='jpg'
iter=1
# glob directly instead of parsing ls, so names with spaces survive
for file in "${dir}"/*."${ext}"; do
  [[ -e $file ]] || continue  # the glob matched nothing
  if [[ $iter -lt 10 ]]; then
    newpic="${dir}/${base}00${iter}.${ext}"
  elif [[ $iter -lt 100 ]]; then
    newpic="${dir}/${base}0${iter}.${ext}"
  else
    newpic="${dir}/${base}${iter}.${ext}"
  fi
  mv -f "$file" "$newpic"
  let "iter = $iter + 1"
done
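
The three-way padding test can also collapse into a single `printf` with a `%03d` format. Here is a sketch of that variant wrapped in a function; `rename_pics` is a made-up name, and it assumes the new base name does not collide with any of the existing file names.

```shell
# rename_pics DIR BASE
# Rename DIR/*.jpg to DIR/BASE001.jpg, DIR/BASE002.jpg, ... in glob order.
rename_pics() {
  dir=$1
  base=$2
  ext=jpg
  iter=1
  for file in "$dir"/*."$ext"; do
    [ -e "$file" ] || continue   # the glob matched nothing
    # %03d pads the counter with zeros to three digits
    newpic=$(printf '%s/%s%03d.%s' "$dir" "$base" "$iter" "$ext")
    mv -f -- "$file" "$newpic"
    iter=$((iter + 1))
  done
}

# Example: rename_pics ~/Desktop/modified vacation
```

Changing `%03d` to `%04d` is all it takes if an album ever grows past 999 pictures.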

Tuesday, November 21st, 2006

web + print frustrations

Sometimes one wants to write a document that can be viewed both in print form and on the web. From my experiences so far, there does not yet exist a good way to do this from one document. I have tried several different methods, each with its own set of complications. I have not given up hope yet. Perhaps someone else can suggest a better method.

Firstly, why would someone want to do this? Well, I think there are a few cases where documents should be available in both formats:

  1. Program documentation
  2. Short tutorials
  3. Curriculum Vitae

The last one is what I have been focusing on. Almost all academics have their CV online nowadays. Some have it in html format. Many have it in pdf format, and of course some only have it in a terrible format like .doc or something. Personally, if I am going to view a CV online, I prefer it to be in html format. For very complex documents with lots of figures, graphics and such, pdf is usually better. A CV usually is not that complicated though, and it is much nicer to read them in html format. However, if one is applying for jobs (as I am), it is crucial to send a copy of your CV, and this should be either in hard copy or pdf format. Next I will describe the two different approaches I have taken to this conundrum.

starting with latex

As you may have gleaned, I like LaTeX quite a bit. It is quite simple in many ways, yet can also handle some really complicated documents. It produces really nice postscript and pdf files. As I write this, I am printing off a 3′ x 5′ poster that I made with LaTeX (I’ll make a separate post about making posters with LaTeX). LaTeX was designed well before the world wide web existed, so it was not designed with the web in mind. It was designed with print in mind. That being said, there are some packages and utilities that do a decent job of converting LaTeX to html. The two that I have found best so far are latex2html and tex4ht. They both have their advantages and disadvantages:

latex2html
This program takes a direct latex to html approach. Essentially it tries to do all the same things standard LaTeX can do, but instead of producing a dvi, it produces html. The drawback of this approach is that it does not always handle all the packages that standard LaTeX does. It also produces some pretty ugly, out-of-date html. The latest html you can specify is 4.01. This is not very satisfactory if you try to adhere to web standards. It also uses some strange heading code. Using the article class, sections are coded with <h1> tags. Normally h1 is used as the title of a page, and only once, though see this article from A List Apart about why using h1 for the title text may be a bad idea.

tex4ht
This program uses an entirely different approach. It acts more like other standard drivers, by converting dvi to html. The advantage of this is that it should work with any LaTeX package imaginable, as long as parseable dvi is still produced. This is nice, but the output is still too focused on print. It defines all sorts of extra classes, and specifies all sorts of font sizes, when this should really be done with CSS.

I recently read a very poorly written article about why to use LaTeX, which spurred me on to write my own, and as a proof of concept, I decided to write it in LaTeX, and convert it to html. I ended up using the tex4ht approach, and then wrote a perl script to clean up some of the code it produced. You can view the article on why to use LaTeX on my University of Michigan homepage.

starting with html + css

It is possible to get pretty nice printed output using css today, but there are still some things missing. My wife has also been working on her résumé lately, and one thing I like about hers is that she has a header on each page with her name and info. This seems quite handy. Imagine especially if someone is printing off your CV from your webpage, and the pages get mixed up or something. All of a sudden they have lost a page of your CV! That is no good. Using CSS 2.1, the current standard, one cannot specify such things. The same goes for page numbers. Most browsers will let you choose whether you want page numbers or not, but we don’t want to make more work for the user.

CSS 3 offers a glimpse of hope. CSS 3 offers many more possibilities, especially in terms of dealing with alternative media such as print or aural media. Unfortunately, it seems that CSS 3 support is still quite a ways off for most browsers.

Enter Prince.

Prince is a program that implements many features from CSS 3 and is designed to make quality pdf output from xml (or xhtml) documents. It can do some pretty neat stuff. I first learned about it from A List Apart, in an article about printing an entire book with xhtml and css. It has one major drawback though. It is not open-source. In fact it is quite expensive for a normal full license. It does offer a restricted license, which has full functionality but also sticks in a default page about Prince on the first page of your document. So this is not ideal. I have tried it out, and one can see the results of my experiment by looking at my CV.
What you get in your browser is the normal stuff. If you click on the button at the bottom you get the nicely formatted version from Prince. If you click on the print this page button, you get a very similar version using CSS 2 rules. It is nice too, but lacking the headers and page numbers.

The future

For my dissertation, I am definitely sticking to LaTeX, and will not attempt to convert it to html. I don’t think it is well suited for that. For most of my websites, I will stick to html. For my CV, and other documents for which I want the best of both worlds, I would like to go the LaTeX route, but I think that the tools available need some updating. Maybe I will work on that.