Large CSV Manipulation Tricks
I sometimes have to deal with large CSV files that need some massaging but are too big for a spreadsheet program to handle. Here are some tricks I use in the terminal on Ubuntu.
To remove duplicate rows:
awk '!seen[$0]++' duped.csv > deduped.csv
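For example, given a small hypothetical duped.csv, the awk expression prints each line only the first time it is seen, so row order is preserved and the first occurrence is kept:
# duped.csv contains:
# id,name
# 1,alice
# 2,bob
# 1,alice
awk '!seen[$0]++' duped.csv
# prints:
# id,name
# 1,alice
# 2,bob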
To remove rows containing a certain string (sed edits the file in place and keeps a .bak backup; awk writes the result to a new file):
sed -i.bak '/string-to-trigger-removal/d' ./file-to-clean.csv
# or
awk '!/string-to-trigger-removal/' file-to-clean.csv > cleaned-file.csv
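As a quick sanity check, here is what that looks like on a small hypothetical orders.csv. Note that the pattern is a regular expression, so escape characters like . or * if you need a literal match:
# orders.csv contains:
# 101,book,SHIPPED
# 102,lamp,CANCELLED
# 103,mug,SHIPPED
awk '!/CANCELLED/' orders.csv > orders-clean.csv
# orders-clean.csv now holds only the two SHIPPED rows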
To sort CSV rows by a column:
# -t, sets the field delimiter to a comma.
# -n sorts numerically. Leave it out for an alphanumeric sort.
# -k3 sorts by key (column) 3.
sort -t, -nk3 file-to-sort.csv > sorted-file.csv
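One caveat: sort treats a header row like any other line. A common workaround, sketched here with a hypothetical scores.csv, is to copy the header across first and sort only the data rows:
# keep the header line as-is
head -n 1 scores.csv > sorted-scores.csv
# sort everything from line 2 onward numerically by column 3 and append
tail -n +2 scores.csv | sort -t, -nk3 >> sorted-scores.csv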