robfelty.com


UNIX tip of the day – duplicate and replace lines with awk

Posted in linguistics, UNIX

01/18/2019 by robfelty
Today I got some data I wanted to add to my machine learning training datasets for named entity recognition. My system is designed to be used with output from automatic speech recognition (ASR). It is often hard to know whether ASR output will contain hyphens (e.g. email vs. e-mail), so I frequently include both variants to be robust. I was able to add these variants automatically with a quick awk one-liner: awk '/-/ {print; gsub("-", " ")} […] (Read more)
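The one-liner in the excerpt is cut off, so here is a minimal sketch of how the duplicate-and-replace idea could be completed. It assumes the goal is to keep every line and to follow each hyphenated line with a copy in which hyphens become spaces. The function name add_hyphen_variants and the sample input are hypothetical, and the trailing `1` is my guess at the missing part, not necessarily what the original post used.

```shell
# Hypothetical helper; the name is not from the original post.
add_hyphen_variants() {
  # For lines containing a hyphen: print the original, then replace
  # hyphens with spaces in the current record.
  # The bare "1" pattern is always true, so every line (modified or
  # untouched) is printed once more by the default action.
  awk '/-/ {print; gsub("-", " ")} 1' "$@"
}

# Made-up sample input: one hyphenated line, one plain line.
printf 'send an e-mail\nhello world\n' | add_hyphen_variants
```

Lines without a hyphen pass through once. Lines with a hyphen come out twice: first as written, then with spaces in place of the hyphens.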

UNIX tip of the day – grep -P is slow

Posted in perl, regex, UNIX

09/28/2016 by robfelty
Unless you really need some advanced regular expressions only supported by PCRE, using POSIX regular expressions with grep is usually an order of magnitude faster. That's because the default grep engine uses finite automata, whereas PCRE uses a backtracking algorithm (the main features you gain from backtracking are lookahead/lookbehind and backreferences). Here's a small example:
$ time grep -E 'post:content.*facebook' a_bunch_of_files* | wc -l
1643
real 0m2.643s
user 0m1.304s
sys 0m1.306s
$ […] (Read more)
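To try the comparison yourself, something like the following sketch may help. It generates a throwaway file (the contents are made up, not the data from the post) and runs the same pattern through both engines. It assumes a grep that supports -P, which in practice means GNU grep built with PCRE support. No timings are shown because they depend heavily on the machine and the data.

```shell
# Hypothetical sample data: 1000 matching lines and 1000 non-matching ones.
tmp=$(mktemp)
awk 'BEGIN {
  for (i = 1; i <= 1000; i++) {
    print "post:content line " i " mentions facebook"
    print "unrelated line " i
  }
}' > "$tmp"

# Same pattern, two engines: -E is POSIX extended, -P is PCRE.
# -c prints the number of matching lines.
count_e=$(grep -cE 'post:content.*facebook' "$tmp")
count_p=$(grep -cP 'post:content.*facebook' "$tmp")
echo "ERE matches:  $count_e"
echo "PCRE matches: $count_p"

rm -f "$tmp"
```

Both counts should agree. To compare speed, prefix each grep with `time` in an interactive bash session and point it at a much larger set of files.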

UNIX tip – xargs with multiple commands

Posted in UNIX

04/01/2015 by robfelty
xargs is an extremely powerful complement to the awesome find command. One downside is that you usually need a single pipeline: by default you can't string together a bunch of commands that are not piped. However, it is possible to call a shell with xargs. That way you can execute multiple commands in the shell, while from xargs's point of view it is calling a single command, the shell interpreter. More details here: bash – xargs […] (Read more)
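A minimal sketch of the shell-inside-xargs trick, assuming a find and xargs that support -print0 and -0 (GNU and BSD both do). The directory, file names, and the two commands run per file are all hypothetical and only there to show the mechanics.

```shell
# Hypothetical demo directory with two small files.
dir=$(mktemp -d)
printf 'one\ntwo\n' > "$dir/a.txt"
printf 'three\n' > "$dir/b.txt"

# xargs sees exactly one command: sh. Inside the quoted script, sh runs
# two commands per file (print the name, then count its lines).
# The "_" fills $0, so the file name appended by xargs arrives as "$1".
result=$(find "$dir" -name '*.txt' -print0 |
  xargs -0 -n 1 sh -c 'printf "%s: " "$(basename "$1")"; wc -l < "$1"' _)
echo "$result"

rm -rf "$dir"
```

Passing the file name as a positional argument, rather than splicing it into the script text with -I {}, keeps odd file names from being interpreted as shell code.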

Using awk to sum rows of numbers

Posted in bash, linux, UNIX

11/14/2013 by robfelty
I have a script which takes a tab-delimited file for regression tests and converts it to XML. I want to do a sanity check to make sure that the number of utterances in my XML files matches the number in the tab-delimited .txt file. I can do this in 2 lines in UNIX:
robert_felty$ wc -l samples2.txt
72148 samples2.txt
robert_felty$ find . -name '*.xml' | xargs grep -c " (Read more)
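The second command is truncated in the excerpt, so the following is only a sketch of how the per-file counts could be summed with awk, as the post title suggests. The search pattern `<utterance` and the demo files are assumptions; substitute whatever marks one utterance in your XML.

```shell
# Hypothetical demo files; "<utterance" is an assumed pattern, since the
# pattern in the original command is cut off.
dir=$(mktemp -d)
printf '<utterance>a</utterance>\n<utterance>b</utterance>\n' > "$dir/one.xml"
printf '<utterance>c</utterance>\n' > "$dir/two.xml"

# grep -c prints "file:count" for each file. awk splits on ":" and adds
# up the last field, then prints the total once all input is read.
total=$(find "$dir" -name '*.xml' | xargs grep -c '<utterance' |
  awk -F: '{sum += $NF} END {print sum}')
echo "$total"

rm -rf "$dir"
```

Using $NF rather than $2 keeps the sum correct even if a path happens to contain a colon. Keep in mind that grep -c counts matching lines, not matches, so this assumes one utterance per line.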