There was another interesting problem this week which I didn't immediately have a solution for. I was facing a multi-gigabyte CSV file that had some broken lines:
```
ID|foo|bar|baz|freeinputfield
1|awef|hbsrt|25.2|1235~6343~2345
2|awef|hbsrt|25.2|856~12546~9867
3|awef|hbsrt|25.2|7136~1111~9672
4|anad|sthjnk|13|7777~23
523~364
5|cbvx|srtbd|99.9|12345~12346~11111
```

Of course someone put newlines into the freeinputfield and no one had cleaned that up so far... How do you fix that while on a tight time budget and suffering from data gravity?
Vi(m) to the rescue!
One of the first things everyone learns when getting to know vi is regular search and replace, but that didn't really suffice here... To be honest, I didn't want to do a two-pass approach, joining row 5 with row 6 and replacing the additional newline in row 7. I ended up using Vim's multi-repeat commands:
```
:2,$g!/^\d\+|/-1j
:2,$v/^\d\+|/-1j
```

- From command mode (that's what you are in when Vim starts), entering a colon puts you on the ex command line
- The range 2,$ restricts the command to line 2 through the last line ($), so the header row is skipped
- g signals a global command: run an ex command on every line matching a pattern
- g! is the negated form, running the command on every line that does NOT match. It is interchangeable with v
- Then you put your regular expression between the slashes; here ^\d\+| matches the intact rows, which start with a numeric ID followed by a pipe, so the negated match selects exactly the stray fragments
- followed by an ex command; here '-1j' means 'join the previous line with the current one' (:join puts a single space at the seam)
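As it happens, this particular file allows an even shorter variant: every intact row contains pipe delimiters, while the stray fragments consist only of digits and tildes. Assuming the free input field never legitimately contains a literal pipe (which the format does not guarantee, so treat this as a sketch), you can join on that alone:

```
" join every line that contains no pipe delimiter to the line above;
" the header contains pipes itself, so no range is needed to protect it
:v/|/-1j
```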
After I finished my task I wondered: could you do this more efficiently with perl, awk, or sed?
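All three should work as streaming one-pass filters, which sidesteps the data-gravity problem of pulling a multi-gigabyte file into an editor buffer at all. A minimal awk sketch, relying on the same invariant as above (intact rows start with a numeric ID and a pipe; the filenames are placeholders):

```
# print each intact row on its own line; glue stray fragments onto the
# previous row with a single space, mirroring what Vim's :join does
awk 'NR == 1     { printf "%s", $0; next }    # header opens the first record
     /^[0-9]+\|/ { printf "\n%s", $0; next }  # a numeric ID starts a new record
                 { printf " %s", $0 }         # anything else continues the field
     END         { print "" }' broken.csv > fixed.csv
```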