Tips for Reading CSV Data

Many times, CSV data is not in good shape. It may contain garbage that we don't want to import into our table.

So here are some tips for reading a CSV file:

  1. Read the CSV file in a text editor to get an idea of what the data looks like
  2. Skip blank lines
  3. Specify whether the file includes a header or not
  4. Define the variable type of each column to make sure each one gets the correct type


GDP <- read.csv("GDP.csv", header = FALSE, na.strings = "", skip = 5,
                blank.lines.skip = TRUE, nrows = 190,
                colClasses = c("character", "integer", "NULL", "character",
                               "character", rep("NULL", 5)))
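For tip 1, you don't even need to leave R: readLines shows the raw file before you commit to skip/nrows arguments. A minimal sketch (it writes a small throwaway file, preview.csv, so the example is self-contained; the real file in the post is GDP.csv):

```r
# write a tiny throwaway CSV standing in for a messy real file
writeLines(c("junk header", "", "US,1,a", "ID,2,b"), "preview.csv")

# peek at the first few raw lines before deciding on skip, nrows, etc.
head(readLines("preview.csv"), 3)
```

Once you can see where the garbage ends and the data begins, picking values for skip and nrows is straightforward.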

Wide and Long Format

There are two kinds of formats for storing data in a table. We call them (1) wide and (2) long.

Wide format has a column for each variable, while long format has one column naming the variable and another holding its value.

Here is an example of the wide format:

#   ozone   wind  temp
# 1 23.62 11.623 65.55
# 2 29.44 10.267 79.10
# 3 59.12  8.942 83.90
# 4 59.96  8.794 83.97

And below is the long format:

#    variable  value
# 1     ozone 23.615
# 2     ozone 29.444
# 3     ozone 59.115
# 4     ozone 59.962
# 5      wind 11.623
# 6      wind 10.267
# 7      wind  8.942
# 8      wind  8.794
# 9      temp 65.548
# 10     temp 79.100
# 11     temp 83.903
# 12     temp 83.968

Most people find it easier to record data in the wide format. However, many functions in R were designed for the long format.

So how do we transform wide into long and vice versa?

There are two functions in the reshape2 package that will do the job.

reshape2 is based around two key functions: melt and dcast.

melt takes wide-format data and melts it into long-format data.

dcast takes long-format data and casts it into wide-format data.
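A minimal sketch of the round trip, using a tiny table with made-up values like those above (the id column is an assumption I've added so dcast knows which long rows belong to the same original row):

```r
library(reshape2)

# tiny wide-format table; id identifies each original row
wide <- data.frame(id    = 1:2,
                   ozone = c(23.62, 29.44),
                   wind  = c(11.62, 10.27))

# melt: wide -> long; keeps id, stacks the rest into variable/value
long <- melt(wide, id.vars = "id")

# dcast: long -> wide; the formula says "one row per id, one column per variable"
back <- dcast(long, id ~ variable)
```

After the melt, `long` has four rows (two ids times two variables), and `back` matches the shape of `wide` again.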

Have fun!

Some Useful Link:

(1) Converting Data between Wide and Long

Analyzing Facebook with R

I was very excited when I found a website mentioning the possibility of accessing and analyzing Facebook using R. There is an R package called RFacebook for this purpose. Detailed steps can be found here.

The summary of the steps is:

  1. Create a new app (go to the Facebook developer page)
  2. Run the R commands below
fb_oauth <- fbOAuth(app_id = "XXXXXX", app_secret = "YYYYYY", extended_permissions = FALSE)
me <- getUsers("me", token = fb_oauth)
group <- getGroup(group_id = ZZZZZZZ, token = fb_oauth)

Unfortunately, FB has restricted access, so there is not much that can be gathered anymore. For example, I'm unable to access closed groups anymore; only open groups can be accessed.

Data Extraction

This entry records how to extract data.

Extract a column from a data.frame

> A <- head(iris[1])
> B <- head(iris[[1]])
> A
  Sepal.Length
1          5.1
2          4.9
3          4.7
4          4.6
5          5.0
6          5.4
> B
[1] 5.1 4.9 4.7 4.6 5.0 5.4
> str(A)
'data.frame': 6 obs. of 1 variable:
 $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4
> str(B)
 num [1:6] 5.1 4.9 4.7 4.6 5 5.4

A (single brackets, iris[1]) extracts the column as a one-column data.frame.

B (double brackets, iris[[1]]) extracts the column as a numeric vector.
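There are a couple of other common ways to pull a column out of a data.frame; a quick sketch of the equivalences, again using the built-in iris data:

```r
# each line pulls out the same column in a different way
a <- iris$Sepal.Length         # numeric vector, same as iris[[1]]
b <- iris[, "Sepal.Length"]    # also a numeric vector
d <- iris["Sepal.Length"]      # one-column data.frame, same as iris[1]
```

The rule of thumb: single brackets preserve the data.frame wrapper, while `$`, `[[`, and `[, "name"]` drop it and return the bare vector.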

String Operation

This entry records tips related to string operations.

Make a fixed-width string with leading zeros

command: formatC(1, width = 3, flag = "0")
output:  [1] "001"

As seen above, two leading zeros were added in front of "1".
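Base R's sprintf does the same padding with a C-style format string, which some people find more familiar:

```r
# %03d means: integer, padded with zeros to width 3
sprintf("%03d", 1)     # "001"
sprintf("%05d", 42)    # "00042"
```

sprintf is also vectorized, so it pads a whole column of numbers in one call.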

Data Observation (1)

Today I was observing the Titanic data. Before starting to use the data, I tried to see what it looks like first.

Here are my questions and answers for today:

Q: How do I find what entries are in a column? For example, I want to know how many variations of entries there are in the Sex column.

A: Use the table command.


> table(train$Sex)

female   male
   314    577

Q: How do I find the rows where a column is empty?

A: Use the subset command, keeping the rows where the column equals the empty string.


> subset(train, Embarked == "")
    PassengerId Survived Pclass                                       Name    Sex Age
62           62        1      1                        Icard, Miss. Amelie female  38
830         830        1      1 Stone, Mrs. George Nelson (Martha Evelyn) female  62
    SibSp Parch Ticket Fare Cabin Embarked
62      0     0 113572   80   B28
830     0     0 113572   80   B28

That’s all for today.

How can I become a data scientist?

Found a good post on

I copied and pasted here so that I can easily find it in the future.


——- start quote ——-

Strictly speaking, there is no such thing as “data science” (see What is data science? ). See also: Vardi, Science has only two legs:…

Here are some resources I’ve collected about working with data, I hope you find them useful  (note: I’m an undergrad student, this is not an expert opinion in any way).

1) Learn about matrix factorizations

  • Take the Computational Linear Algebra course (it is sometimes called Applied Linear Algebra, Matrix Computations, Numerical Analysis, or Matrix Analysis, and it can be either a CS or an Applied Math course). Matrix decomposition algorithms are fundamental to many data mining applications and are usually underrepresented in a standard "machine learning" curriculum. With TBs of data, traditional tools such as Matlab become unsuitable for the job; you cannot just run eig() on Big Data. Distributed matrix computation packages such as those included in Apache Mahout [1] are trying to fill this void, but you need to understand how the numeric algorithms/LAPACK/BLAS routines [2][3][4][5] work in order to use them properly, adjust for special cases, build your own, and scale them up to terabytes of data on a cluster of commodity machines. [6] Usually numerics courses are built upon undergraduate algebra and calculus, so you should be good with the prerequisites. I'd recommend these resources for self-study/reference material:
  • See Jack Dongarra : Courses and What are some good resources for learning about numerical analysis?

2) Learn about distributed computing

3) Learn about statistical analysis

  • I’ve found that learning statistics in a particular domain (e.g. Natural Language Processing) is much more enjoyable than taking Stats 101. My personal recommendation is the course by Michael Collins at Columbia (also available on Coursera).
  • You can also choose a field where the use of quantitative statistics and causality principles [7]  is inevitable, say molecular biology [8], or a fun sub-field such as cancer research [9], or even narrower domain, e.g. genetic analysis of tumor angiogenesis [10] and try answering important questions in that particular field, learning what you need in the process.

4) Learn about optimization

5) Learn about machine learning

6) Learn about information retrieval

7) Learn about signal detection and estimation

8) Master algorithms and data structures

9) Practice

If you do decide to go for a Masters degree:

10) Study Engineering

I'd go for CS with a focus on either IR or Machine Learning, or a combination of both, and take some systems courses along the way. As a "data scientist" you will have to write a ton of code and probably develop distributed algorithms/systems to process massive amounts of data. An MS in Statistics will teach you how to do modeling, regression analysis, etc., not how to build systems; I think the latter is more urgently needed these days as the old tools become obsolete with the avalanche of data. There is a shortage of engineers who can build a data mining system from the ground up. You can pick up statistics from books and experimenting with R (see item 3 above) or take some statistics classes as part of your CS studies.

Good luck.

[7] Causality: Models, Reasoning and Inference (9780521895606): Judea Pearl: Books
[8] Introduction to Biology , MIT 7.012 video lectures
[9] Hanahan & Weinberg, The Hallmarks of Cancer, Next Generation: Page on Wisc
[10] The chaotic organization of tumor-associated vasculature, from The Biology of Cancer: Robert A. Weinberg: 9780815342205: Books, p. 562