Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Monday, November 10, 2014

csvkit to manipulate csv at command line, Rio to interact with R at command line

I came across csvkit long time ago, but I just began to use it and found it is very handy. It is a python utility to manipulate csv files at command line. you can install it by pip:

pip install csvkit

I played around with the iris data.
csvlook gives you a database view of the data by adding grids.
$ cat iris.csv | head | csvlook
|---------------+-------------+--------------+-------------+--------------|
|  sepal_length | sepal_width | petal_length | petal_width | species      |
|---------------+-------------+--------------+-------------+--------------|
|  5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa  |
|  4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa  |
|  4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa  |
|  4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa  |
|  5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa  |
|  5.4          | 3.9         | 1.7          | 0.4         | Iris-setosa  |
|  4.6          | 3.4         | 1.4          | 0.3         | Iris-setosa  |
|  5.0          | 3.4         | 1.5          | 0.2         | Iris-setosa  |
|  4.4          | 2.9         | 1.4          | 0.2         | Iris-setosa  |
|---------------+-------------+--------------+-------------+--------------|

csvstat can output summary statistics for specific columns with -c flag
-c can accept both the column names and the column numbers as arguments

$ cat iris.csv | csvstat -c sepal_length,sepal_width
  1. sepal_length
<type 'float'>
Nulls: False
Min: 4.3
Max: 7.9
Sum: 876.5
Mean: 5.84333333333
Median: 5.8
Standard Deviation: 0.825301291785
Unique values: 35
5 most frequent values:
5.0: 10
6.3: 9
5.1: 9
6.7: 8
5.7: 8
  2. sepal_width
<type 'float'>
Nulls: False
Min: 2.0
Max: 4.4
Sum: 458.6
Mean: 3.05733333333
Median: 3.0
Standard Deviation: 0.434410967735
Unique values: 23
5 most frequent values:
3.0: 26
2.8: 14
3.2: 13
3.4: 12
3.1: 11

Row count: 150

you can also output only the max, min, mean by adding --max, --min etc
$ cat iris.csv | csvstat -c sepal_length,sepal_width --max
  1. sepal_length: 7.9
  2. sepal_width: 4.4

csvcut can cut out specific columns and reorder the column while the unix cut command can
not reorder the column, one needs to use awk to achieve that.

$ cat iris.csv | csvcut -c5,4,2 | csvlook | head
|------------------+-------------+--------------|
|  species         | petal_width | sepal_width  |
|------------------+-------------+--------------|
|  Iris-setosa     | 0.2         | 3.5          |
|  Iris-setosa     | 0.2         | 3.0          |
|  Iris-setosa     | 0.2         | 3.2          |
|  Iris-setosa     | 0.2         | 3.1          |
|  Iris-setosa     | 0.2         | 3.6          |
|  Iris-setosa     | 0.4         | 3.9          |
|  Iris-setosa     | 0.3         | 3.4          |

csvgrep can grep based on a specific column:
$ cat iris.csv | csvgrep -c species -m setosa | wc -l
51

other commands like csvsort, csvjoin, csvsql are also useful, please refer to the link above to see more examples.

I want to plot some nice figure at command line, but I am too lazy to fire Rstudio. by using Rio, you can leverage the power of R at the command line. you can then visualize it by using the display command from imagemagic
-g to load the ggplot2 package 
-e to execute the command

update on 11/19/2014, Rio has changed the flag to load dplry and tidyr
https://github.com/jeroenjanssens/data-science-at-the-command-line/commit/66fc1fda604f2d420de9490934259093a76fb105

boxplot of sepal_length for different species :



histogram for sepal_length for all species:


 you can change the color and fill of the bars:


you can also plot histogram for different species with a facet:


scatter plot of sepal_width for different species:


command line is awesome! I love it! It brings me more power!