Large CSV Manipulation Tricks

I sometimes have to deal with large CSV files that need some massaging but are too big for a spreadsheet program to handle. Here are some tips I use in the terminal on Ubuntu.

To remove duplicate rows (keeps the first occurrence of each):

awk '!seen[$0]++' duped.csv > deduped.csv
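
A quick sketch of the dedupe trick on throwaway sample data (the file names and contents here are just placeholders):

# Build a small CSV with a duplicated row.
printf 'id,name\n1,alice\n2,bob\n1,alice\n' > duped.csv

# !seen[$0]++ is true only the first time a whole line ($0) is seen,
# so awk prints each distinct row once, preserving the original order.
awk '!seen[$0]++' duped.csv > deduped.csv

cat deduped.csv
# id,name
# 1,alice
# 2,bob

Because awk keeps every line it has seen in memory (the seen array), this works in a single pass without sorting, so row order is preserved.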

To remove rows with certain strings in the row and make a backup file:

sed -i.bak '/string-to-trigger-removal/d' ./file-to-clean.csv
# or
awk '!/string-to-trigger-removal/' file-to-clean.csv > cleaned-file.csv
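
A minimal example of the sed variant, using "ERROR" as a stand-in for the trigger string and made-up sample rows:

# Sample data; "ERROR" stands in for the string that triggers removal.
printf 'id,status\n1,ok\n2,ERROR\n3,ok\n' > file-to-clean.csv

# -i.bak edits the file in place and saves the original as
# file-to-clean.csv.bak; /ERROR/d deletes every line containing the pattern.
sed -i.bak '/ERROR/d' ./file-to-clean.csv

cat file-to-clean.csv       # rows 1 and 3 remain
cat file-to-clean.csv.bak   # untouched original

Note that the pattern matches anywhere in the line, so a string that can appear in more than one column may remove more rows than intended.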

To sort csv rows by a column:

# -t, sets the field separator to a comma.
# -n means numeric sort. Leave it out if you want a lexical sort.
# -k3,3 means key (column) 3 only; plain -k3 would compare from
# column 3 through the end of the line.
sort -t, -nk3,3 file-to-sort.csv > sorted-file.csv
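
The sort recipe on some small sample data (file names and values are placeholders):

# Sample data with numeric values in column 3.
printf 'a,x,30\nb,y,2\nc,z,10\n' > file-to-sort.csv

# -t, makes the comma the field separator; -k3,3n sorts on column 3
# only, numerically (2 before 10 before 30).
sort -t, -k3,3n file-to-sort.csv > sorted-file.csv

cat sorted-file.csv
# b,y,2
# c,z,10
# a,x,30

Two caveats: a header row gets sorted along with the data (strip it first if you have one), and sort splits on every comma, so quoted fields containing commas will throw the column count off.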
