A-  A  A+ RSS Feed

Deep Thoughts by Robert Felty

thoughts on wordpress, latex, cooking et alia

Archive for the 'linguistics' Category

Thursday, September 24th, 2009

Vim regex-fu for LaTeX

When writing a beamer presentation with LaTeX, I organize my presentation into sections and subsections. Frequently, the title of the first frame (slide) in a subsection has the same name as the subsection. Let’s say I start off with the following structure:

\section[corpora]{Accessing text corpora}
\subsection[gutenberg]{The Gutenberg Corpus}
\subsection[chat]{The web and chat Corpus}
\subsection[brown]{The Brown Corpus}
\subsection[reuters]{The Reuters Corpus}
\subsection[inaugural]{The Inaugural address Corpus}
\subsection[annotated]{Annotated corpora}
\subsection[foreign]{Corpora in other languages}
\subsection[DIY]{Loading your own corpora}

For each subsection, I want to put in one frame, with the name of the subsection being the name of the frame. Regular expressions to the rescue! In vim, all I have to is use V to select each line with subsection, then I hit :, which allows me to operate on those lines only.

'<,'>

is automatically inserted after the colon, which stands for “from the beginning of the highlighted section to the end of it”. Then I use s to perform my substitution. \r inserts a new line.

:'<,'>s/{\(.*\)}/{\1}\r\\begin{frame}\r\\frametitle<presentation>{\1}\r\\end{frame}/

The result is:

\section[corpora]{Accessing text corpora}
\begin{frame}
\frametitle<presentation>{Accessing text corpora}
\end{frame}
\subsection[gutenberg]{The Gutenberg Corpus}
\begin{frame}
\frametitle<presentation>{The Gutenberg Corpus}
\end{frame}
\subsection[chat]{The web and chat Corpus}
\begin{frame}
\frametitle<presentation>{The web and chat Corpus}
\end{frame}
\subsection[brown]{The Brown Corpus}
\begin{frame}
\frametitle<presentation>{The Brown Corpus}
\end{frame}
\subsection[reuters]{The Reuters Corpus}
\begin{frame}
\frametitle<presentation>{The Reuters Corpus}
\end{frame}
\subsection[inaugural]{The Inaugural address Corpus}
\begin{frame}
\frametitle<presentation>{The Inaugural address Corpus}
\end{frame}
\subsection[annotated]{Annotated corpora}
\begin{frame}
\frametitle<presentation>{Annotated corpora}
\end{frame}
\subsection[foreign]{Corpora in other languages}
\begin{frame}
\frametitle<presentation>{Corpora in other languages}
\end{frame}
\subsection[DIY]{Loading your own corpora}
\begin{frame}
\frametitle<presentation>{Loading your own corpora}
\end{frame}
Monday, September 15th, 2008

Why doesn’t Mac update standard UNIX utilities?

I am currently teaching a course on programming for linguists. We are using python, but for the first few classes, I have been going over some standard UNIX utilities like cd, ls and such, plus using regular expressions with grep and sed. I actually don’t use sed that much. I tend to reach for perl, since I know it better, and it can do pretty much all the same stuff that sed can plus much more. But sed is simpler than perl, and I basically just wanted to use it for doing substitutions.

Today I got an e-mail from a student asking why the following did not seem to be working:

echo abcd123 | sed 's/\([a-z]*\).*/\U\1/'

The student reported the following output: “Uabcd”. (The expected output is “ABCD”, which is what I get on Linux)

I tried it, and it worked fine for me. Then I thought: maybe this is a Mac/Linux problem. Sure enough, when I look at the man page for my Fedora 7 box, it tells me that my version of sed is GNU 4.1.5, from June 2006. Mac Leopard (10.5) is using BSD sed from July 2004. Leopard came out in 2007, as did Fedora 7. Why is it 2 years behind? Why is it still using python 2.4? Why doesn’t it come with useful utilities like dos2unix? Mac has done a great job of making a nice GUI, with some pretty cool applications like iLife. It is falling behind when it comes to the command line utilities though.

Tuesday, July 15th, 2008

Bash one-liners to the rescue

I recently find myself using handy bash one-liners more all the time. I think that this is where unix/linux can really start to shine. There are so many little programs that just do one thing, and one thing well. But the ability to combine these together through pipes means you have extremely flexible and powerful tools at the ready.

I have been working on a new project at work to come up with some lists for testing speech recognition. We decided to use the TIMIT database, which contains recordings of many different sentences from many different speakers all around America. I first wrote a perl script to generate some basic stats on the sentences, like how many words were in each sentence, and what the word frequency for those words is. Then I wrote a perl script to randomly select some of the sentences, and create several different lists of sentences. Finally, I wrote an R script which took the original .wav files, and mixed in signal-dependent noise in one channel, so that we can vary the signal to noise ratio during presentation of the stimuli by adjusting the balance on our sound system.

Along the way, I ran into a couple problems with the original sound files. It turns out that 446 of the 6300 sound files were clipped, and highly distorted. I noticed this on my own in listening to a few of the files I had generated with R. I could have gone through all 6300 files manually, and removed the distorted ones, but that would have taken a long time. Instead, I used the program sox, which is a low-level, powerful audio processing program. I first used the find command to find all .wav files in the directory I was interested in (including sub-directories), then I passed each file to sox, and told sox not to play the output , but instead just give me some stats (-n stat). After some testing with a few clipped, and non-clipped files, I realized that for clipped files, the output from sox ended with a line that said either “Try: blah blah”, or “Can’t determine type”. I then later discovered that there might still be clipped files, and these would have a maximum amplitude of 1 or minimum or -1. So I knew that any clipped file would produce this output. So I passed the results from sox to grep (notice I had to redirect STDERR to STDOUT 2>&1), and then if the output contained a line starting with “Try:” or “Can’t”, then I moved that file the $file.clipped.

for file in `find . -name "*.wav" -print`; do
  if [[ `sox $file -n stat 2>&1 | grep -E "^(Try:|Can't|(Minimum|Maximum) amplitude:\s+-?1\.00)"` ]]; then
    echo "$file CLIPPED";
    mv $file $file.clipped;
  fi;
done

After doing this, I simply amended my perl script which randomly generated lists to make sure that the wav file actually existed. Clipped files now ended in .clipped, instead of .wav.

There was an additional problem I had previously discovered with these sound files. They seemed to have some non-standard headers in them, which meant that the R script I was using to add noise to them couldn’t read the files. However, passing the files through sox made the files readable by R. (Windows Media Player on a Windows box couldn’t read the files either.) I only wanted to process the files I was actually going to add noise to, so I used another handy little bash one-liner. This one cuts a column of the file which contains all the sentences I am going to use, and then for each filename, processes the file through sox, and outputs it to the destination directory of my choosing.

for file in `cut -f 18 -d $'\t' timitLists2.txt`;
  do sox $file ~/R/work/timit/clean/`basename $file`;
done

Note that I have expanded the code into one more line, but pretty much they are one-liners. I think technically a one-liner doesn’t involve successive commands, which the first example does, but the first command is just an echo, to make sure I know what it is doing.

Thursday, April 3rd, 2008

Working on making fancy graphs with R / fixed versus random babble

I have been working on learning R for several months now, and continue to get better at it and enjoy it more all the time. I am currently working on a spoken word recognition project at work. The task we are using is quite simple. Participants listen to words that have been mixed with multi-talker babble (kind of like background conversation at a cocktail party), and type in what they hear. We are analyzing the errors they make to try to discover what sorts of words are activated in the brain, and how people organize the words they know in their brains.

We started running subjects in the summer of 2007, and after running 48 subjects, we made a few methodological changes. One change is that we switched from using the same segment of babble as background noise to using a random segment of babble. We also re-leveled the sound files after adding the babble so that all the words were presented at about the same volume. I recently analyzed the data from these two groups. It turns out the the results from the first 48 subjects performed significantly better than subjects 49-96. However, I realized that these are different subjects, so it could be due to the different subjects, and not the change in methods. Therefore I split each group into 2 subgroups. Now I had 4 groups — 1-24, 24-48, 49-72, and 73-96. The subgroups did not differ statistically from one another, but the main groups did. This seemed to indicate that listeners were becoming habituated to the noise, and could therefore tune it out a bit better, and thus hear the words better.

I originally displayed this in a table, but it was pretty cumbersome, so I decided to make a bar chart instead. I have seen some people produce bar charts such as these where they indicate which groups are statistically different using brackets to group bars, and an asterisk to mark them as statistically different. I also wanted to add error bars in. Adam Krawitz (another post-doc at IU) gave me some code to get started on the error bars, and I found some code from a post to the R-help list about making brackets. First here is the figure:
R plot using the Brack function with error bars and statistics displayed
And here is the code:

# make a nice bar plot
# first we find the mean of each group
bardata <- c(mean(items1to24$pc,na.rm=T),
  mean(items25to48$pc,na.rm=T),
  mean(items49to72$pc,na.rm=T),
  mean(items73to96$pc,na.rm=T)
)

#Then we do t-tests to calculate the 95% confidence intervals for our error bars
items1to24T=t.test(items1to24$pc)
items25to48T=t.test(items25to48$pc)
items49to72T=t.test(items49to72$pc)
items73to96T=t.test(items73to96$pc)
lowerConf=c(items1to24T$conf.int[1],
  items25to48T$conf.int[1],
  items49to72T$conf.int[1],
  items73to96T$conf.int[1]
)
upperConf=c(items1to24T$conf.int[2],
  items25to48T$conf.int[2],
  items49to72T$conf.int[2],
  items73to96T$conf.int[2]
)

# we set up some nice colors
barcolors=c('#FF3333','#FF3333','#4444FF','#4444FF')
bar <- barplot(bardata, ylim = c(0,1), col=barcolors, names.arg=c("1-24","25-48","49-72", "73-96"),ylab='proportion correct', xlab='subject group',xlim=c(0,1), width=.24,tck=.04)
#we use the arrows function to do the error bars
arrows(bar, upperConf, bar, lowerConf,
       length = 0.05, # width of the arrowhead
       angle = 90,
       code = 3
       )


# function to draw curly braces in red
# x1...y2 are the ends of the brace
# for upside down braces, x1 > x2 and y1 > y2
# taken from R help
# http://finzi.psych.upenn.edu/R/Rhelp02a/archive/112010.html
library(grid) #requires the grid library
Brack <- function(x1,y1,x2,y2,h)
{
x2 <- x2-x1; y2 <- y2-y1
v1 <- viewport(x=x1,y=y1,width=sqrt(x2^2+y2^2),
               height=h,angle=180*atan2(y2,x2)/pi,
               just=c("left","bottom"),gp=gpar(col="black"))
pushViewport(v1)
grid.curve(x2=0,y2=0,x1=.125,y1=.5,curvature=.5)
grid.move.to(.125,.5)
grid.line.to(.375,.5)
grid.curve(x1=.375,y1=.5,x2=.5,y2=1,curvature=.5)
grid.curve(x2=1,y2=0,x1=.875,y1=.5,curvature=-.5)
grid.move.to(.875,.5)
grid.line.to(.625,.5)
grid.curve(x2=.625,y2=.5,x1=.5,y1=1,curvature=.5)
popViewport()
}

# Now we use the brack function
# brackY1 and brack Y2 determine the y-value for the placement of the brackets.
# I had to fudge this a bit to get it to look right
brackY1=1.05*max(upperConf[1:2])
brackY2=1.21*max(upperConf[3:4])
Brack(bar[1]+.07,brackY1,bar[2]-.01,brackY1,.07)
Brack(bar[3]-.09,brackY2,bar[4]-.17,brackY2,.11)

# And now we also make some straight line groupings
arrows(mean(bar[1:2]),1.4*upperConf[1],mean(bar[3:4]), 1.4*upperConf[1],length=0)
arrows(mean(bar[1:2]),1.4*upperConf[1],mean(bar[1:2]), 1.35*upperConf[1],length=0)
arrows(mean(bar[3:4]),1.4*upperConf[1],mean(bar[3:4]), 1.35*upperConf[1],length=0)
arrows(mean(bar[2:3]),1.4*upperConf[1],mean(bar[2:3]), 1.45*upperConf[1],length=0)
# we label the groupings
text(mean(bar[1:2]),1.25*upperConf[1],"fixed babble",cex=.9,col=barcolors[1])
text(mean(bar[3:4]),1.25*upperConf[1],"random babble",cex=.9,col=barcolors[3])

# we add the stats in
text(mean(bar[2:3]),1.55*upperConf[1],paste("t =", formatC(itemBetweenT$statistic,digits=3), ", p", prettyPval(itemBetweenT$p.value)),cex=.9)

# and we add the values on the bars as well
text(bar[1],.7*bardata[1],formatC(bardata[1],digits=3),col='white')
text(bar[2],.7*bardata[2],formatC(bardata[2],digits=3),col='white')
text(bar[3],.7*bardata[3],formatC(bardata[3],digits=3),col='white')
text(bar[4],.7*bardata[4],formatC(bardata[4],digits=3),col='white')

# and then I print it to a pdf file
dev.print(pdf,"fig/48v96.pdf",height=2.8,width=4,pointsize=11)

While these results were suggestive of a difference, it is also possible that the difference was due to the re-leveling, and not the fixed vs. random babble. If the listeners were really habituating to the fixed babble, this should show up as an improvement over the course of the experiment. Therefore I also compared the learning rates of these two groups. It is normal for participants to get better as they get more familiar with a task, so we will always expect some learning. However if the participants are becoming habituated to a particular aspect of the stimuli, (in this case the fact that the same segment of babble is being used over and over again), then we would expect more learning in this group. This figure displays the percent correct over a moving 30 trial window. That is, the first point represents trials 1-30, the second point 2-31 and so on. It looks like the fixed-babble group is learning more. To test this statistically, I subtracted the fixed babble values from the random babble values, and performed a correlation on these values against the trial window. If the learning rates are the same, then there should be no correlation. If the fixed babble group is learning more rapidly however, we should see the difference between the two groups increase over the course of the experiment, which is what we found. Here is the nice figure:
R figure with multiple y axes
And here is the code:

#compute the percent correct for each window
numPoints=min(length(unique(Trials1$Trial)),length(unique(Trials1$Trial)))-trialWindow
  pc1=c(1:numPoints)
  pc2=c(1:numPoints)
  xvalues=c(1:numPoints)
  for (i in 0:numPoints) {
    upperTrials1=subset(Trials1,Trials1$Trial>i)
    theseTrials1=subset(upperTrials1,upperTrials1$Trial<=i+trialWindow)
    corr1 = (nrow(subset(theseTrials1, theseTrials1$correct=="CORRECT")))
    wrong1 = (nrow(subset(theseTrials1,theseTrials1$correct=="WRONG")))
    missing1 = (nrow(subset(theseTrials1,theseTrials1$correct=="MISSING")))
    nonword1 = (nrow(subset(theseTrials1,theseTrials1$correct=="NONWORD")))
    pc1[i] = corr1/(corr1+wrong1+nonword1+missing1)

    upperTrials2=subset(Trials2,Trials2$Trial>i)
    theseTrials2=subset(upperTrials2,upperTrials2$Trial<=i+trialWindow)
    corr2 = (nrow(subset(theseTrials2, theseTrials2$correct=="CORRECT")))
    wrong2 = (nrow(subset(theseTrials2,theseTrials2$correct=="WRONG")))
    missing2 = (nrow(subset(theseTrials2,theseTrials2$correct=="MISSING")))
    nonword2 = (nrow(subset(theseTrials2,theseTrials2$correct=="NONWORD")))
    pc2[i] = corr2/(corr2+wrong2+nonword2+missing2)

    xvalues[i]=i;
  }
# compute the differences between the two groups
diffs=pc1-pc2

#set up some graphical paramets
    par(mar=c(3,3,0.3,3),
        fin=c(4.1,2.9),
        fig=c(0,1,0,1),
        mgp=c(2,1,0),
        #usr=c(min(xdata),max(xdata),0,1),
        lab=c(10,10,3),
        #ps=11,
        family='serif'
       )
# plot the first group
plot(xvalues,pc1,col='#EE0000',pch="+",ylim=c(.4,.65),axes=F,xlab='trial window', ylab='proportion correct',cex=.8)
# add the second group
points(xvalues,pc2,col='#0000EE',pch="x",ylim=c(.4,.65),cex=.8)

# how much to add to the fitline y values to make it fit on the plot
fudge=.42
# plot the slope of the line we fitted to the differences
fitline=lsfit(xvalues,diffs+fudge)
abline(fitline,col='#000F00')
# do a correlation test
slope=cor.test(xvalues,diffs);
# the stats to put on the graph
stats=paste('r= ', formatC(slope$estimate,digits=3), ", p ", prettyPval(slope$p.value),sep='')

# put the values on the axes
axis(1) #bottom
axis2values=c(.4,.45,.5,.55,.6,.65)
axis(2,at=axis2values) #left
axis(4,at=axis2values,lab=formatC(axis2values-fudge,digits=2)) #right
box() #draw a box around the figure

# put the stats on the graph
text(280,.53,stats,cex=.75)

#label the right axis
mtext("difference between groups", side=4,line=2)

# label the points
text(80,.64,"fixed babble",col='#EE0000', cex=.8)
text(275,.44,"random babble",col='#0000EE', cex=.8)
# print as pdf
dev.print(pdf,"fig/learning.pdf",height=2.8,width=4,pointsize=11)
Monday, November 5th, 2007

100 yootles bounty for solution to nested loop rounding error

I am working on doing some monte carlo simulations. I want to do a particular manipulation n times, but I want to constrain what I do based on three parameters, x, y, and z, which are probability distributions coded as arrays. For example, if I want to run this simulation 1000 times, then 24 should be xayaza, 24 xayazb, 72 xaybzc and so on. My code seems to work right when everything is integers, but not when some of the numbers are non-integers reals (always positive). I have tried a couple different strategies of rounding, ceiling, and floor, but it seems to always be off by a few. The correct solution should output $total = $trials;

I will offer 100 yootles to the first person who finds a solution for me. (I am including the perl code I have been playing around with, but my problem lies in the algorithm, not in the syntax.)

#!/usr/bin/perl -w
use strict;
use POSIX qw(floor ceil);
my $trials=795;
$trials = shift;
my @x = (.4,.4,.2);
my @y = (.3,.5,.2);
my @z = (.2,.2,.6);

my $trial =0;
my $total =1;
my $a=0;
while ($a<scalar @x && $trial <= round($trials*$x[$a])) {
  my $b=0;
  while ($b< scalar @y && $trial <= round($trials*$x[$a]*$y[$b])) {
    my $c=0;
    while ($c< scalar @z && $trial <= round($trials*$x[$a]*$y[$b]*$z[$c])) {
      $total++;
      $trial++;
      if ($trial >= round($trials*$x[$a]*$y[$b]*$z[$c])) {
        $c++;
        $trial=0;
      }
    }
    $b++;
    $trial=0;
  }
  $a++;
  $trial=0;
}
print "trials = $trials, total = $total\n";

sub round {
    my($number) = shift;
    return int($number + .5 * ($number <=> 0));
}

Update

My friend Danny Reeves, along with some help from David Yang solved my problem in a completely different way. Here is Danny’s solution:

#!/usr/bin/perl
# Rob's monstronsity that is surely the solution to the wrong problem.
# But for 100 yootles, we'll just do as we're told.
# This would be much nicer in a properly functional-style language!

my $trials = 795;

my @x = (.4,.4,.2);
my @y = (.3,.5,.2);
my @z = (.2,.2,.6);

@tuples = cross(\@x,\@y,\@z);

# compute idealized tuple counts:
@counts = map { $trials*prod(@$_) } @tuples;

@ic = map { int($_) } @counts;  # floors of counts.
@fc = map { $_-int($_) } @counts;  # fractional parts.
$f = sum(@fc);  # sum of fractional parts, to redistribute.

# redistribute...
for($i=1; $i<=$f; $i++) {
   $ic[posmax(deltas(\@counts,\@ic))]++;
}

$total = 0;
for($i=0; $i<scalar(@ic); $i++) {
   ($x, $y, $z) = @{$tuples[$i]};
   for($j=0; $j<$ic[$i]; $j++) {
     print "do something with ($x, $y, $z)\n";
     $total++;
   }
}

print "trials = $trials, total = $total\n";


# Return a cross product from its arguments. Arguments are array refs.
# Result is a list of array refs.  [found this on the web; damn slick]
# (note that this returns the tuples in not quite canonical order)
sub cross {
   my @r = [];
   @r = map {my $s = $_; map {[@$_ => $s]} @r} @$_ for @_;
   @r
}

# Return sum of arguments.  Ie, reduce(+, args, 0).
sub sum { my $x = 0;  for(@_) { $x += $_; }  $x }

# Return product of arguments.  Ie, reduce(*, args, 1).
sub prod { my $x = 1;  for(@_) { $x *= $_; }  $x }

# Return a list of differences between 2 lists, passed as refs.
# (assumes the lists have the same length)
sub deltas {
   my @ans;
   for(my $i=0; $i<scalar(@{$_[0]}); $i++) {
     push(@ans, $_[0]->[$i] - $_[1]->[$i]);
   }
   @ans
}

# Takes list, return the position of the largest element.
sub posmax {
   if (scalar(@_)==0) { return -1; }
   my $x = 0;  # index of best so far.
   for(my $i=0; $i<scalar(@_); $i++) {
     if($_[$i] > $_[$x]) { $x = $i; }
   }
   $x
}

And because Danny is a huge fan of Mathematica, he also included a Mathematica version

trials = 795;
x = {.4, .4, .2};
y = {.3, .5, .2};
z = {.2, .2, .6};

(* index of the largest element; i'm sure there's an adorable one-liner
    for this if i thought hard enough *)
posmax[{}] = -1;
posmax[l_] := Module[{i, x = 1}, (* x is index of best so far *)
   For[i = 1, i < Length[l], i++, If[l[[i]] > l[[x]], x = i]];
   x]

tuples = Tuples[{x, y, z}];
counts = (trials*Times @@ # &) /@ tuples;
ic = IntegerPart /@ counts;
fc = FractionalPart /@ counts;
f = Total[fc];
For[i=1, i<=f, i++, ic[[posmax[counts - ic]]]++];
total = 0;
MapThread[Do[{"do something with ", #1}; total++, {#2}]&, {tuples,ic}];
total