Diving into Genetics and Genomics: when NAs are not NAs

Monday, May 22, 2017

when NAs are not NAs

Using TCGAbiolinks, I downloaded RNAseq data for LUAD and LUSC

library(TCGAbiolinks)
library(SummarizedExperiment)

# query_rna_LUAD.hg38 <- GDCquery(project = "TCGA-LUAD", data.category = "Transcriptome Profiling",
#                   data.type = "Gene Expression Quantification", 
#                   workflow.type = "HTSeq - Counts")
# 
# 
# query_rna_LUSC.hg38 <- GDCquery(project = "TCGA-LUSC", data.category = "Transcriptome Profiling",
#                   data.type = "Gene Expression Quantification", 
#                   workflow.type = "HTSeq - Counts")
# 
# GDCdownload(query_rna_LUAD.hg38, method = "client")
# GDCdownload(query_rna_LUSC.hg38, method = "client")
# 
# LUAD_rna_data <- GDCprepare(query_rna_LUAD.hg38)
# LUSC_rna_data <- GDCprepare(query_rna_LUSC.hg38)

# I have saved both R objects into disk
load("~/projects/mix_histology/data/TCGA_rna/TCGA_lung_rna.rda")
# a RangedSummarizedExperiment object
LUSC_rna_data

## class: RangedSummarizedExperiment 
## dim: 57035 551 
## metadata(0):
## assays(1): HTSeq - Counts
## rownames(57035): ENSG00000000003 ENSG00000000005 ...
##   ENSG00000281912 ENSG00000281920
## rowData names(3): ensembl_gene_id external_gene_name
##   original_ensembl_gene_id
## colnames(551): TCGA-77-8009-01A-11R-2187-07
##   TCGA-34-5239-01A-21R-1820-07 ... TCGA-NK-A7XE-01A-12R-A405-07
##   TCGA-43-6773-11A-01R-1949-07
## colData names(69): patient barcode ...
##   subtype_Homozygous.Deletions subtype_Expression.Subtype

The problem is with the meta data

LUSC_coldata<- colData(LUSC_rna_data)
LUAD_coldata<- colData(LUAD_rna_data)

we will see the different representations of NAs

table(LUAD_coldata$subtype_Smoking.Status, useNA = "ifany")

## 
##      Current reformed smoker for > 15 years 
##                                          78 
## Current reformed smoker for < or = 15 years 
##                                          77 
##                              Current smoker 
##                                          47 
##                         Lifelong Non-smoker 
##                                          32 
##                             [Not Available] 
##                                          11 
##                                        <NA> 
##                                         349

table(LUSC_coldata$subtype_Smoking.Status, useNA = "ifany")

## 
##      Current reformed smoker for > 15 years 
##                                          51 
## Current reformed smoker for < or = 15 years 
##                                          87 
##                              Current smoker 
##                                          28 
##                         Lifelong Non-smoker 
##                                           7 
##                                         N/A 
##                                           6 
##                                        <NA> 
##                                         372

we see NAs are represented either as real <NA>, [Not Avaiable] or N/A. The first thing to do is to tidy the metadata changing all NAs to <NA>:

I will use stringr from the tidyverse packages.

library(stringr)
# one needs to \\ to escape [ and ]
str_replace(LUAD_coldata$subtype_Smoking.Status, "\\[Not Available\\]", NA)

## Error: `replacement` must be a character vector

This does not quite work. stringr enforces that the replacement is the same type as a character. instead:

str_replace(LUAD_coldata$subtype_Smoking.Status, "\\[Not Available\\]", NA_character_) %>% table(useNA = "ifany")

## .
## Current reformed smoker for < or = 15 years 
##                                          77 
##      Current reformed smoker for > 15 years 
##                                          78 
##                              Current smoker 
##                                          47 
##                         Lifelong Non-smoker 
##                                          32 
##                                        <NA> 
##                                         360

similarily:

str_replace(LUSC_coldata$subtype_Smoking.Status, "N/A", NA_character_) %>% table(useNA = "ifany")

## .
## Current reformed smoker for < or = 15 years 
##                                          87 
##      Current reformed smoker for > 15 years 
##                                          51 
##                              Current smoker 
##                                          28 
##                         Lifelong Non-smoker 
##                                           7 
##                                        <NA> 
##                                         378

to test if a vector contains any NAs:

# this is invalid
# LUAD_coldata$subtype_Smoking.Status == NA, instead, use is.na()
is.na(LUAD_coldata$subtype_Smoking.Status) %>% table()

## .
## FALSE  TRUE 
##   245   349

The traditional way to change a string to NA is:

myvector<- c("NA", "NA", "a", "b", "c", NA)
myvector

## [1] "NA" "NA" "a"  "b"  "c"  NA

myvector[myvector =="NA"]<- NA
myvector

## [1] NA  NA  "a" "b" "c" NA

Conclusions

Data are messy, even if the data are packaged from a big project such as TCGA.
Different representation of NA could be a devil in your data, tidy it to R’s representation is needed.
stringr is a great tool in your belt for tidying data.

1 comment:

UnknownMay 22, 2017 at 8:31 PM
Currently I use this package to download TCGA data too. It's very convenient.
ReplyDelete
Replies

Add comment

Diving into Genetics and Genomics

My github papge

Monday, May 22, 2017

when NAs are not NAs

Downloading data

we will see the different representations of NAs

Conclusions

1 comment:

Labels

My Blog List