Digital pathology image analysis requires high-quality input images. While a large number of images are available in The Cancer Genome Atlas (TCGA), the ones currently available in the data portal are frozen specimens and are *not* suitable for computational analysis. This post discusses how to download the Formalin-Fixed Paraffin-Embedded (FFPE) slides for the corresponding patients.
First, a brief introduction. The TCGA offers two types of slides: flash frozen and Formalin-Fixed Paraffin-Embedded (FFPE). Flash frozen samples are typically produced during surgery in a cryolab to help the surgeon determine whether the borders of the tumor are clean (i.e., whether the tumor has been fully resected). Flash freezing is a fast and “easy” process, but it frequently leaves the tissue damaged, giving it a Swiss cheese type appearance:
FFPE slides are the gold standard for diagnostic medicine, and are generated by fixing a specimen in formaldehyde and then embedding it in a paraffin wax block for cutting. They have a much nicer appearance, making them more amenable to computational analysis:
The TCGA has both types of slides available, so care must be taken to obtain the correct cohort and *not* mix cohorts unless mixing them is specifically part of your experimental design.
The two types can be told apart by filename: files containing “TS#” or “BS#”, where # is an integer, are frozen slides.
Files containing “DX#”, again where # is an integer, are FFPE slides.
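The naming rule above is easy to check programmatically before mixing slides into a cohort. The following is a minimal sketch, not an official TCGA utility: the function name `slide_type` is hypothetical, the example filenames are made up for illustration, and it assumes the “DX#”/“TS#”/“BS#” code is preceded by a hyphen and followed by a dot or the end of the name.

```python
import re

# Assumption: the slide code sits right before the first "." (or at the end)
# of the filename and is preceded by "-", e.g. "...-DX1.<rest>.svs".
_SLIDE_CODE = re.compile(r"-(DX|TS|BS)\d+(?=\.|$)")


def slide_type(filename: str) -> str:
    """Return "FFPE", "frozen", or "unknown" based on the slide code."""
    match = _SLIDE_CODE.search(filename)
    if match is None:
        return "unknown"
    # DX marks a diagnostic (FFPE) slide; TS and BS mark frozen slides.
    return "FFPE" if match.group(1) == "DX" else "frozen"


if __name__ == "__main__":
    # Made-up filenames, for illustration only.
    for name in (
        "TCGA-XX-0000-01Z-00-DX1.example.svs",
        "TCGA-XX-0000-01A-01-TS1.example.svs",
        "TCGA-XX-0000-01A-01-BS2.example.svs",
    ):
        print(name, "->", slide_type(name))
```

Returning "unknown" rather than guessing makes unexpected filenames easy to spot and review by hand.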
To perform the download, we need two components: (1) the TCGA download tool, gdc-client, and (2) a manifest file that lists, by precise ID, which files to download.
First, we go to the TCGA data portal, located here: https://portal.gdc.cancer.gov
Then we click on “Repository”:
Then we click on “slide image” under “Data type”.
Then we click on “Diagnostic Slide” under “Experimental Strategy”.
This produces a list of slides, all of which have the “DX#” string in their filename:
We can limit the list to a specific organ by clicking, e.g., “Cases” and then “breast”:
Now we have the 1,133 files that we would like to download. We do this by clicking “add all files to cart” (or selecting the ones we are interested in):
Lastly, we go to the cart and select “download” -> “manifest”:
This provides us with a txt file that we can feed to the gdc-client:
gdc-client download -m gdc_manifest_20180801_125430.txt
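Before starting a large download, it can be worth confirming that the manifest really contains only FFPE slides. The sketch below assumes the manifest is a tab-separated file whose header row includes a `filename` column; the function name `keep_ffpe_rows` is hypothetical, and the filtering rule is simply the “DX#” naming convention described earlier.

```python
import csv
import io
import re

# Assumption: FFPE (diagnostic) slides carry "-DX<integer>" right before the
# first "." (or at the end) of the filename.
_DX_CODE = re.compile(r"-DX\d+(?=\.|$)")


def keep_ffpe_rows(manifest_text: str) -> str:
    """Return a copy of a tab-separated manifest with only the DX# rows."""
    reader = csv.DictReader(io.StringIO(manifest_text), delimiter="\t")
    if not reader.fieldnames or "filename" not in reader.fieldnames:
        raise ValueError("manifest has no 'filename' column")

    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        # Keep the row only if its filename carries a DX# code.
        if _DX_CODE.search(row["filename"]):
            writer.writerow(row)
    return out.getvalue()


if __name__ == "__main__":
    # Made-up manifest content, for illustration only.
    sample = (
        "id\tfilename\tmd5\tsize\tstate\n"
        "aaa\tTCGA-XX-0000-01Z-00-DX1.example.svs\t0\t1\tvalidated\n"
        "bbb\tTCGA-XX-0000-01A-01-TS1.example.svs\t0\t1\tvalidated\n"
    )
    print(keep_ffpe_rows(sample))
```

The filtered text can be written to a new file and passed to `gdc-client download -m` in place of the original manifest.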