Tuesday, May 15, 2018

JSON file I/O and GDC metadata download

What is JSON?

JSON is short for JavaScript Object Notation. It is a lightweight text format for structured data. Like XML, it can record nested sample information, but its syntax is shorter and simpler.

Why JSON?

I came to JSON because the metadata downloaded from the GDC website is in JSON format. I was trying to compare the samples downloaded from different data categories.

How to download data from GDC?

1. Select the samples on the website, add them to the cart, and download them directly to the local computer. However, this is a bit slow, and if the destination is not the local computer, you will spend extra time transferring the files.
2. Use the gdc-client to download the data set from a manifest file (gdc-client download). As long as a manifest file is provided, we can run the download from the terminal and save the files directly on the server.
*For Red Hat server users, special instructions are needed to install it.
*For controlled data, an authorization token is needed for gdc-client.
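
As a rough sketch of option 2, the snippet below only assembles the gdc-client command line from Python and prints it; nothing is downloaded. The file names (gdc_manifest.txt, gdc_token.txt, data) and the helper name build_gdc_download_command are placeholders made up for illustration. The --manifest, --token-file and --dir options are the ones I would expect from gdc-client download, but check gdc-client download --help on your installation.

```python
import shlex


def build_gdc_download_command(manifest_path, token_path=None, download_dir=None):
    """Return the gdc-client command as a list of arguments (nothing is run here)."""
    cmd = ["gdc-client", "download", "--manifest", manifest_path]
    if token_path is not None:
        # Only needed for controlled-access data.
        cmd += ["--token-file", token_path]
    if download_dir is not None:
        # Directory on the server where the files should be written.
        cmd += ["--dir", download_dir]
    return cmd


# Placeholder paths, for illustration only.
cmd = build_gdc_download_command(
    "gdc_manifest.txt", token_path="gdc_token.txt", download_dir="data"
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The resulting list could be passed to subprocess.run(cmd, check=True) on a machine where gdc-client is installed.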

Here is the format of the manifest file:

id filename md5 size state
a8f1ba67-bcef-45d1-b508-d4ce580d4362 TCGA-GV-A3JV-10B-01D-A221_120830_SN590_0178_AD143RACXX_s_2_rg.sorted.bam f6ee0cf24f786edb34743e1a68e0b4de 19824636256 live
c85e79ff-7813-4679-9c88-de2565ff4afa G32450.TCGA-FD-A3N5-10A-01D-A21A-08.4.bam afd2efa5c9634bd02cbf9bb4a68c5314 160495436671 live
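
A manifest in this format can be loaded into Python with the csv module. This is a minimal sketch that assumes the columns are separated by tabs; the two rows are the ones shown above, embedded as a string so the example runs on its own. The function name read_manifest is made up for illustration.

```python
import csv
import io

# The two manifest rows from above, with tab-separated columns (assumed).
MANIFEST_TEXT = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "a8f1ba67-bcef-45d1-b508-d4ce580d4362\t"
    "TCGA-GV-A3JV-10B-01D-A221_120830_SN590_0178_AD143RACXX_s_2_rg.sorted.bam\t"
    "f6ee0cf24f786edb34743e1a68e0b4de\t19824636256\tlive\n"
    "c85e79ff-7813-4679-9c88-de2565ff4afa\t"
    "G32450.TCGA-FD-A3N5-10A-01D-A21A-08.4.bam\t"
    "afd2efa5c9634bd02cbf9bb4a68c5314\t160495436671\tlive\n"
)


def read_manifest(handle):
    """Return a dict that maps each file id to its manifest row."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["id"]: row for row in reader}


manifest = read_manifest(io.StringIO(MANIFEST_TEXT))
for file_id, row in manifest.items():
    # All values are read as strings, so convert the size explicitly.
    print(file_id, row["filename"], int(row["size"]))
```

For a real manifest, replace io.StringIO(MANIFEST_TEXT) with an open file handle, for example open(path, newline="").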

As we can see, each file is described by five fields, which are used to download the specified file. But how can we map the samples downloaded from different data categories?

How to match samples from different data categories?

1. Select samples in the GDC Data Portal.
2. Add the samples to the cart.
3. Go to the cart and download the metadata.
4. Extract the useful information from the metadata, such as the TCGA sample ID, tumor project, tumor type, filename, and id (which can be used to match to the manifest file).
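
Once the TCGA sample IDs are extracted, one possible way to match samples across data categories is to compare a shared prefix of the barcode. The sketch below is only an illustration of that idea and not necessarily what read_json.py does. It assumes that the first four dash-separated fields of a TCGA barcode identify the sample, and all barcodes in the two lists are illustrative values, not real query results.

```python
def sample_key(barcode, n_fields=4):
    """Keep the first n_fields dash-separated fields of a TCGA barcode."""
    return "-".join(barcode.split("-")[:n_fields])


# Illustrative barcodes for two data categories (not real query results).
category_a = ["TCGA-HT-7602-01A-21D-2087-05", "TCGA-GV-A3JV-10B-01D-A221-08"]
category_b = ["TCGA-HT-7602-01A-11R-2090-07", "TCGA-FD-A3N5-10A-01D-A21A-08"]

keys_a = {sample_key(barcode) for barcode in category_a}
keys_b = {sample_key(barcode) for barcode in category_b}

# Samples that appear in both categories.
shared = sorted(keys_a & keys_b)
print(shared)
```

Setting n_fields=3 would match at the patient level instead of the sample level.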

Here I wrote a small Python script (read_json.py) that can be used to extract this information.

How to read a JSON file using Python?

A JSON file can be read into Python with the built-in json module (import json).
Note the difference between json.load and json.loads: the former reads from an open file object, while the latter parses a string.
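
A small example of the two functions, using a made-up record and a temporary file so it can be run anywhere:

```python
import json
import os
import tempfile

# json.loads parses a string that is already in memory.
text = '{"file_id": "abc", "size": 3}'
record = json.loads(text)

# json.load reads from an open file object instead.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.json")
    with open(path, "w") as handle:
        json.dump(record, handle)
    with open(path) as handle:
        from_file = json.load(handle)

# Both routes give the same Python dictionary.
print(record == from_file)
```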

You can download the read_json.py file above and use it to extract the information needed from the metadata file. Example:
python read_json.py --meta_file path/to/file.json --output_file path/to/output.txt --log_file path/to/log.txt

With this code, the following information will be extracted:

file_id;file_name;md5sum;entity_submitter_id;disease_type;primary_site;project;sample_type

file_id: can be used to match the id in manifest file.
file_name: can be used to match the filename in manifest file.
md5sum: can be used to match the md5 in manifest file.
entity_submitter_id: the TCGA sample barcode, such as TCGA-HT-7602-01A-21D-2087-05
disease_type: cancer type
primary_site: cancer site
project: such as TCGA-LGG
sample_type: solid primary tumor or normal, etc.
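
For readers who want to see the idea without downloading the script, here is a minimal sketch of this kind of extraction. It is not the actual read_json.py. The nested layout of the metadata (associated_entities, cases, project, samples) is my assumption about how the GDC cart metadata is organized, and every value in the sample record is a placeholder, so check both against your own file.

```python
import json

FIELDS = [
    "file_id", "file_name", "md5sum", "entity_submitter_id",
    "disease_type", "primary_site", "project", "sample_type",
]

# Hypothetical metadata with an assumed layout; all values are placeholders.
METADATA_TEXT = """
[
  {
    "file_id": "hypothetical-file-id-1",
    "file_name": "hypothetical_file_1.bam",
    "md5sum": "hypothetical-md5-1",
    "associated_entities": [
      {"entity_submitter_id": "TCGA-HT-7602-01A-21D-2087-05"}
    ],
    "cases": [
      {
        "disease_type": "hypothetical disease type",
        "primary_site": "hypothetical site",
        "project": {"project_id": "TCGA-LGG"},
        "samples": [{"sample_type": "Primary Tumor"}]
      }
    ]
  }
]
"""


def first(items):
    """Return the first element of a list, or an empty dict if there is none."""
    return items[0] if items else {}


def extract_row(record):
    """Flatten one metadata record into the eight fields listed above."""
    entity = first(record.get("associated_entities", []))
    case = first(record.get("cases", []))
    sample = first(case.get("samples", []))
    return {
        "file_id": record.get("file_id", ""),
        "file_name": record.get("file_name", ""),
        "md5sum": record.get("md5sum", ""),
        "entity_submitter_id": entity.get("entity_submitter_id", ""),
        "disease_type": case.get("disease_type", ""),
        "primary_site": case.get("primary_site", ""),
        "project": case.get("project", {}).get("project_id", ""),
        "sample_type": sample.get("sample_type", ""),
    }


records = json.loads(METADATA_TEXT)
lines = [";".join(FIELDS)]
for record in records:
    row = extract_row(record)
    lines.append(";".join(row[field] for field in FIELDS))
print("\n".join(lines))
```

Missing keys become empty strings instead of raising an error, so incomplete records still produce a full row.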



