robfelty.com


UNIX tip of the day – grep -P is slow

Posted in perl, regex, UNIX

Quarter note = 09282016 robfelty
Unless you really need advanced regular expression features that only PCRE supports, using POSIX regular expressions with grep is usually an order of magnitude faster. That’s because grep’s default engine uses finite automata, whereas PCRE uses a backtracking algorithm (the main features you gain from the backtracking algorithm are lookahead/lookbehind and backreferences). Here’s a small example:
$ time grep -E 'post:content.*facebook' a_bunch_of_files* | wc -l
1643
real 0m2.643s
user 0m1.304s
sys 0m1.306s
$ […] (Read more)

Unicode block names in regular expressions

Posted in bash, java, perl, python, regex

Quarter note = 11032014 robfelty
Frequently, I find myself wanting to do some simple language detection. For Chinese, Japanese, and Korean, this can easily be done by looking at the types of characters in some text. The simplest and most robust way to do this is to use Unicode block names. It is very simple to write a regular expression which will test if a character is contained in a certain block. For all the different possible blocks, see here: Unicode block names for use […] (Read more)
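
As a rough illustration of the block-based test described above, here is a minimal Python sketch. Python's standard `re` module does not accept `\p{...}` block names, so the sketch assumes explicit code point ranges for four blocks instead. The function names `blocks_in` and `guess_language`, and the crude "Hangul means Korean, kana means Japanese, otherwise Han means Chinese" rule, are illustrative assumptions and not taken from the post.

```python
import re

# Code point ranges for a few Unicode blocks.
# Only these four blocks are covered; extend the table as needed.
BLOCKS = {
    "CJK Unified Ideographs": (0x4E00, 0x9FFF),
    "Hiragana": (0x3040, 0x309F),
    "Katakana": (0x30A0, 0x30FF),
    "Hangul Syllables": (0xAC00, 0xD7AF),
}

# One compiled character class per block, e.g. "[\u3040-\u309f]".
BLOCK_PATTERNS = {
    name: re.compile("[%s-%s]" % (chr(lo), chr(hi)))
    for name, (lo, hi) in BLOCKS.items()
}


def blocks_in(text):
    """Return the names of the blocks that have at least one character in text."""
    return {name for name, pat in BLOCK_PATTERNS.items() if pat.search(text)}


def guess_language(text):
    """Very rough guess (hypothetical heuristic): 'ko', 'ja', 'zh', or None."""
    found = blocks_in(text)
    if "Hangul Syllables" in found:
        return "ko"
    if "Hiragana" in found or "Katakana" in found:
        return "ja"
    if "CJK Unified Ideographs" in found:
        return "zh"
    return None


if __name__ == "__main__":
    for sample in ["こんにちは", "你好", "안녕하세요", "hello"]:
        print(sample, guess_language(sample))
```

Kana is checked before Han ideographs because Japanese text normally mixes both, so a Han-only match is treated as Chinese.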

Why doesn’t Mac update standard UNIX utilities?

Posted in linguistics, linux, mac osx, perl

Quarter note = 09152008 robfelty
I am currently teaching a course on programming for linguists. We are using Python, but for the first few classes, I have been going over some standard UNIX utilities like cd, ls and such, plus using regular expressions with grep and sed. I actually don’t use sed that much. I tend to reach for Perl, since I know it better, and it can do pretty much everything sed can, plus much more. But sed is simpler […] (Read more)

Perl slurping

Posted in perl

Quarter note = 09032008 robfelty
It seems like whenever I go to slurp a whole file into a string in Perl, I have to search around to remember the exact syntax. So I decided to put it here for myself, so I won’t have to search any further than my own site. In this particular instance, I am trying to remove any <span> blocks from a file. I can simply do the following: perl -e '$string = do {local( $/ ); <>}; $string=~s/<span>.*?<\/span>//gs; print […] (Read more)

100 yootles bounty for solution to nested loop rounding error

Posted in linguistics, perl

Quarter note = 11052007 robfelty
I am working on some Monte Carlo simulations. I want to do a particular manipulation n times, but I want to constrain what I do based on three parameters, x, y, and z, which are probability distributions coded as arrays. For example, if I want to run this simulation 1000 times, then 24 should be xayaza, 24 xayazb, 72 xaybzc and so on. My code seems to work right when everything is an integer, but not when some of the […] (Read more)
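
One common way to split n runs across combinations when the ideal counts are not whole numbers is largest-remainder rounding: round every count down, then hand the leftover runs to the combinations with the largest fractional parts. The Python sketch below is only an illustration of that idea, not the solution from the post. The function name `allocate` and the three distributions are hypothetical, chosen so that n = 1000 reproduces the 24 / 24 / 72 counts mentioned above, and each distribution is assumed to sum to exactly 1.

```python
from fractions import Fraction
from itertools import product
from math import floor


def allocate(n, *dists):
    """Split n trials across every combination of outcomes (one per distribution).

    Each distribution is a list of probabilities assumed to sum to 1.
    Returns a dict mapping index tuples, e.g. (0, 1, 2), to integer counts
    that add up to exactly n.
    """
    # Exact ideal count for each combination; Fraction(str(p)) turns a
    # float such as 0.3 into exactly 3/10 and so avoids float rounding noise.
    ideal = {}
    for combo in product(*(range(len(d)) for d in dists)):
        p = Fraction(1)
        for dist, i in zip(dists, combo):
            p *= Fraction(str(dist[i]))
        ideal[combo] = n * p

    # Round everything down first.
    counts = {combo: floor(v) for combo, v in ideal.items()}

    # Give the leftover trials to the largest fractional remainders.
    leftover = n - sum(counts.values())
    by_remainder = sorted(ideal, key=lambda c: ideal[c] - counts[c], reverse=True)
    for combo in by_remainder[:leftover]:
        counts[combo] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical distributions, not taken from the post.
    x = [0.4, 0.6]
    y = [0.3, 0.3, 0.4]
    z = [0.2, 0.2, 0.6]
    counts = allocate(100, x, y, z)
    print(sum(counts.values()), counts[(0, 0, 0)])
```

Every count ends up within one of its ideal value, and the total always equals n, which is the property a nested loop with independent rounding at each level tends to lose.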