robfelty.com


UNIX tip of the day – grep -P is slow

Posted in perl, regex, UNIX

Posted on 09/28/2016 by robfelty
Unless you really need advanced regular expression features that only PCRE supports, using POSIX regular expressions with grep is usually an order of magnitude faster. That’s because grep’s default engine uses finite automata, whereas PCRE uses a backtracking algorithm. (The main features you gain from the backtracking algorithm are lookahead/lookbehind and backreferences.) Here’s a small example:

    $ time grep -E 'post:content.*facebook' a_bunch_of_files* | wc -l
    1643
    real 0m2.643s
    user 0m1.304s
    sys 0m1.306s
    $ […] (Read more)

Unicode block names in regular expressions

Posted in bash, java, perl, python, regex

Posted on 11/03/2014 by robfelty
Frequently, I find myself wanting to do some simple language detection. For Chinese, Japanese, and Korean, this can easily be done by looking at the types of characters in some text. The simplest and most robust way to do this is to use Unicode block names. It is very simple to write a regular expression which will test if a character is contained in a certain block. For all the different possible blocks, see here: Unicode block names for use […] (Read more)
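
As a rough illustration of the block-name idea in Java, the sketch below guesses a language from the Unicode blocks found in a string. The class name `ScriptGuess`, the method `guess`, the order of the checks, and the returned language codes are assumptions made for this example and are not taken from the post. Java accepts block names with the `In` prefix, such as `\p{InHiragana}`. Note that Japanese text written only in kanji would be reported as Chinese by this heuristic.

```java
import java.util.regex.Pattern;

class ScriptGuess {
    // Hangul syllables are specific to Korean.
    private static final Pattern HANGUL =
        Pattern.compile("\\p{InHangulSyllables}");
    // Hiragana and katakana are specific to Japanese.
    private static final Pattern KANA =
        Pattern.compile("[\\p{InHiragana}\\p{InKatakana}]");
    // CJK ideographs are shared by Chinese and Japanese, so check them last.
    private static final Pattern CJK =
        Pattern.compile("\\p{InCJKUnifiedIdeographs}");

    // Hypothetical helper: returns "ko", "ja", "zh", or "unknown".
    static String guess(String text) {
        if (HANGUL.matcher(text).find()) return "ko";
        if (KANA.matcher(text).find()) return "ja";
        if (CJK.matcher(text).find()) return "zh";
        return "unknown";
    }

    public static void main(String[] args) {
        System.out.println(guess("\uC548\uB155"));  // Korean sample
        System.out.println(guess("\u3053\u3093"));  // Japanese (hiragana) sample
        System.out.println(guess("\u4F60\u597D"));  // Chinese sample
        System.out.println(guess("hello"));
    }
}
```

Using `find()` rather than `matches()` means a single character from the block anywhere in the text is enough to trigger a guess.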

Java anchored regex

Posted in java, regex

Posted on 04/03/2014 by robfelty
I just discovered this today when doing some regex in Java. When I first started doing regex in Java, I was surprised to learn that Java seems to treat all regular expressions as anchored. That is, if you have the string foobar and search for “foo”, it will not match. This is different from grep, perl, and other tools. In other words, for Java, the following regexes are equivalent: "foo" and "^foo$". If you want to find foo within foobar you […] (Read more)
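
The excerpt is cut off before the fix, so the following is only a sketch of the behaviour it describes and may not be the approach the post goes on to recommend. In `java.util.regex`, `Matcher.matches()` (and `String.matches()`) succeeds only when the whole input matches, while `Matcher.find()` looks for the pattern anywhere in the input. The class and method names here are made up for the example.

```java
import java.util.regex.Pattern;

class AnchoredRegexDemo {
    // matches() succeeds only if the entire input matches the pattern.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // find() succeeds if the pattern matches any substring of the input.
    static boolean findsAnywhere(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole("foo", "foobar"));    // false
        System.out.println(matchesWhole("^foo$", "foobar"));  // false
        System.out.println(findsAnywhere("foo", "foobar"));   // true
        // Padding the pattern is another way to match a substring.
        System.out.println("foobar".matches(".*foo.*"));      // true
    }
}
```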