R Script prep for GOplot from Trinity

2018-11-20 2 min read R

This post contains an adapted R script based on prep_n_run_GOplot.pl from Trinity, the denovo transcriptome assembler, for the times R cannot read in the produced EC.* files.

prep_n_run_GOplot.pl is used to produce GOplot visualizing differential expression, sorted by GO terms (see http://wencke.github.io/). It uses the DE_analysis.DE_subset and DE.subset.GOseq.enriched files to produce two other files, EC.david. and EC.genelist used in GOplot.

In most cases, this script words properly. However, sometimes when the text annotation contains special characters such as , EC.david and EC.genelist do not read properly into R when prep_n_run_GOplot.pl calles GOplot.Rscript.

Based on the perl script, I adapted a R version, prep_for_GOplot.R to produce EC.david and EC.genelist shown below.

This is a command-line version which will require the argparse package in order to read in arguments from the command line.

# to make EC. genelist, has columns ID, logFC and adj.P.val

# DE.subset equivalents are the row name, logFC and FDR

suppressPackageStartupMessages(library(argparse))

parser <- ArgumentParser()
parser$add_argument("-s", "--DE_subset", help = "the subset file from the two-way comparison for both", required=T)
parser$add_argument("-e", "--DE_go", help = "the enriched go file from the same two-way comparison", required=T)
parser$add_argument("--comparison_prefix", help = "the string to add to output files, should be the names of the two-way comparison", required=T)
# parser$print_help()

args <- parser$parse_args()

subset.genelist<-read.table(args$DE_subset, sep = "\t", header=T, quote = "")

subset.genelist$gene<-rownames(subset.genelist)

make.ec.genelist<-data.frame("ID" = subset.genelist$gene, "logFC" = subset.genelist$logFC, "adj.P.val" = subset.genelist$FDR)

write.csv(make.ec.genelist, paste0(args$comparison_prefix,".EC.genelist"),row.names = F)

# to make EC.david, has columns Category, ID, Term, GEnes, adj_pval

# DE.subset.Goseq.enriched equivalents are , , , , over_represented_FDR
subset.david<-read.table(args$DE_go, sep = "\t", header=T, quote = "")

## get all go-terms without any information

missing_go <- subset.david$category[is.na(subset.david$term)]
print(paste("go terms being skipped due to no information", as.character(missing_go)))

subset.david<-subset.david[!is.na(subset.david$term),]

make.ec.david<-data.frame("Category" = subset.david$ontology, "ID" = subset.david$category, "Term" =subset.david$term, "Genes" = subset.david$gene_ids, "adj_pval" = subset.david$over_represented_FDR)

write.csv(make.ec.david, paste0(args$comparison_prefix, ".EC.david"), row.names=F)

Run the script as Rscript prep_for_GOplot.R --help to see the help page. The script takes in three arguments, DE_analysis.DE_subset and DE.subset.GOseq.enriched , and a prefix for the file names.

Afte running this script, you will have to run GOplot manually rather than through prep_n_run_GOplot.pl. There is example code on the package website and Trinity also has the script GOplot.Rscript in the directory $TRINITY/Analysis/DifferentialExpression.

In summary, I used the argparse package to adapt parts of a perl script by porting it into R instead. The script can be run on the command line just like the other Trinity scripts.

R trinity rna

R Script prep for GOplot from Trinity

Related