Creative Commons License
This blog by Tommy Tang is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

My github papge

Monday, November 10, 2014

csvkit to manipulate csv at command line, Rio to interact with R at command line

I came across csvkit long time ago, but I just began to use it and found it is very handy. It is a python utility to manipulate csv files at command line. you can install it by pip:

pip install csvkit

I played around with the iris data.
csvlook gives you a database view of the data by adding grids.
$ cat iris.csv | head | csvlook
|---------------+-------------+--------------+-------------+--------------|
|  sepal_length | sepal_width | petal_length | petal_width | species      |
|---------------+-------------+--------------+-------------+--------------|
|  5.1          | 3.5         | 1.4          | 0.2         | Iris-setosa  |
|  4.9          | 3.0         | 1.4          | 0.2         | Iris-setosa  |
|  4.7          | 3.2         | 1.3          | 0.2         | Iris-setosa  |
|  4.6          | 3.1         | 1.5          | 0.2         | Iris-setosa  |
|  5.0          | 3.6         | 1.4          | 0.2         | Iris-setosa  |
|  5.4          | 3.9         | 1.7          | 0.4         | Iris-setosa  |
|  4.6          | 3.4         | 1.4          | 0.3         | Iris-setosa  |
|  5.0          | 3.4         | 1.5          | 0.2         | Iris-setosa  |
|  4.4          | 2.9         | 1.4          | 0.2         | Iris-setosa  |
|---------------+-------------+--------------+-------------+--------------|

csvstat can output summary statistics for specific columns with -c flag
-c can accept both the column names and the column numbers as arguments

$ cat iris.csv | csvstat -c sepal_length,sepal_width
  1. sepal_length
<type 'float'>
Nulls: False
Min: 4.3
Max: 7.9
Sum: 876.5
Mean: 5.84333333333
Median: 5.8
Standard Deviation: 0.825301291785
Unique values: 35
5 most frequent values:
5.0: 10
6.3: 9
5.1: 9
6.7: 8
5.7: 8
  2. sepal_width
<type 'float'>
Nulls: False
Min: 2.0
Max: 4.4
Sum: 458.6
Mean: 3.05733333333
Median: 3.0
Standard Deviation: 0.434410967735
Unique values: 23
5 most frequent values:
3.0: 26
2.8: 14
3.2: 13
3.4: 12
3.1: 11

Row count: 150

you can also output only the max, min, mean by adding --max, --min etc
$ cat iris.csv | csvstat -c sepal_length,sepal_width --max
  1. sepal_length: 7.9
  2. sepal_width: 4.4

csvcut can cut out specific columns and reorder the column while the unix cut command can
not reorder the column, one needs to use awk to achieve that.

$ cat iris.csv | csvcut -c5,4,2 | csvlook | head
|------------------+-------------+--------------|
|  species         | petal_width | sepal_width  |
|------------------+-------------+--------------|
|  Iris-setosa     | 0.2         | 3.5          |
|  Iris-setosa     | 0.2         | 3.0          |
|  Iris-setosa     | 0.2         | 3.2          |
|  Iris-setosa     | 0.2         | 3.1          |
|  Iris-setosa     | 0.2         | 3.6          |
|  Iris-setosa     | 0.4         | 3.9          |
|  Iris-setosa     | 0.3         | 3.4          |

csvgrep can grep based on a specific column:
$ cat iris.csv | csvgrep -c species -m setosa | wc -l
51

other commands like csvsort, csvjoin, csvsql are also useful, please refer to the link above to see more examples.

I want to plot some nice figure at command line, but I am too lazy to fire Rstudio. by using Rio, you can leverage the power of R at the command line. you can then visualize it by using the display command from imagemagic
-g to load the ggplot2 package 
-e to execute the command

update on 11/19/2014, Rio has changed the flag to load dplry and tidyr
https://github.com/jeroenjanssens/data-science-at-the-command-line/commit/66fc1fda604f2d420de9490934259093a76fb105

boxplot of sepal_length for different species :



histogram for sepal_length for all species:


 you can change the color and fill of the bars:


you can also plot histogram for different species with a facet:


scatter plot of sepal_width for different species:


command line is awesome! I love it! It brings me more power!





3 comments:

  1. replica watches uk, combining elegant style and cutting-edge technology, a variety of styles of replica hublot watches, the pointer walks between your exclusive taste style.

    ReplyDelete
  2. Thank you for these csvkit using statistics. I really liked the table in which you presented the results. It would be nice if you share the secrets of creating such a laconic table with so much data. The company in which I work is also related to the big data and this knowledge would help me a lot in order to form beautiful reports. At the moment, after searching a little, I found a couple of guides for creating similar tables on Instagram, but only similar tables are created in them. And the authors of this videos using the services of to quickly increase the number of followers.

    ReplyDelete
  3. Thank you for these csvkit using statistics. I really liked the table in which you presented the results. It would be nice if you share the secrets of creating such a laconic table with so much data. The company in which I work is also related to the big data and this knowledge would help me a lot in order to form beautiful reports. At the moment, after searching a little, I found a couple of guides for creating similar tables on Instagram, but only similar tables are created in them. And the authors of this videos using the services of viplikes.net to quickly increase the number of followers.

    ReplyDelete