Download TCGA Digital Pathology Images (FFPE)

Digital pathology image analysis requires high-quality input images. While there are a large number of images available in The Cancer Genome Atlas (TCGA), the ones which are currently available in the data portal are frozen specimens and are *not* suitable for computational analysis. This post discusses how to download the Formalin-Fixed Paraffin-Embedded (FFPE) slides for the corresponding patients.

First, a brief introduction. The TCGA offers two types of slides: flash frozen and Formalin-Fixed Paraffin-Embedded (FFPE). Flash frozen samples are typically produced during surgery in a cryolab to help the surgeon determine if the borders of the tumor are clean (i.e., has the tumor been fully resected). Flash freezing is a fast and “easy” process, but it frequently leaves the tissue damaged, giving it a Swiss-cheese-like appearance:


FFPE slides are the gold standard for diagnostic medicine, and are generated by fixing a specimen in formaldehyde and then embedding it in a paraffin wax block for cutting. They have a much nicer appearance, making them more amenable to computational analysis:


A fuller discussion is available here and here.

The TCGA has both types of slides available, so care must be taken to obtain the correct cohort and *not* mix cohorts unless doing so is specifically part of your experimental design.

The difference can be found by looking at the filename: files containing “TS#” or “BS#”, where # is an integer, are frozen slides, like this:


Files containing “DX#”, again where # is an integer, are FFPE slides:
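The naming rule above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `slide_kind`, the assumption that the code sits between a “-” and a “.” in the filename, and the example filenames are all placeholders introduced here, not something defined by the TCGA. Codes that do not fit the “letters plus integer” pattern are reported as unknown rather than guessed.

```python
import re

# Matches the slide-type code described above: TS#, BS#, or DX#, where # is an integer.
# The surrounding "-" and "." delimiters are an assumption about TCGA filenames.
SLIDE_CODE = re.compile(r"-(TS|BS|DX)\d+\.")


def slide_kind(filename):
    """Return 'frozen', 'ffpe', or 'unknown' for a slide filename."""
    match = SLIDE_CODE.search(filename)
    if match is None:
        return "unknown"
    return "ffpe" if match.group(1) == "DX" else "frozen"


# Illustrative filenames (placeholders, not real files).
for name in [
    "TCGA-XX-0000-01Z-00-DX1.example.svs",
    "TCGA-XX-0000-01A-01-TS1.example.svs",
]:
    print(name, "->", slide_kind(name))
```

A check like this is handy for confirming that a folder of downloaded slides does not accidentally mix the two cohorts.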


To perform the download, we need two components: (1) the TCGA download tool, and (2) a manifest file that lists, by precise ID number, which files to download.

First we need to go to the TCGA data portal, located here:

Then we click on “Repository”:


Then click on “slide image” under “Data type”:


Then “Diagnostic Slide” under “Experimental Strategy”:


This produces a list of slides, all of which have the “DX#” string in their filename.
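For those who prefer scripting to clicking, the same two facets could in principle be expressed as a filter for the GDC API. The sketch below only builds the query parameters and does not contact any server. The field names (`files.data_type`, `files.experimental_strategy`), their values, and the parameter names are assumptions made here to mirror the portal labels, so they should be checked against the GDC API documentation before use.

```python
import json

# Assumed GDC API field names and values, mirroring the portal facets:
# "Data type" = slide image, "Experimental Strategy" = diagnostic slide.
filters = {
    "op": "and",
    "content": [
        {
            "op": "in",
            "content": {"field": "files.data_type", "value": ["Slide Image"]},
        },
        {
            "op": "in",
            "content": {
                "field": "files.experimental_strategy",
                "value": ["Diagnostic Slide"],
            },
        },
    ],
}

# Query parameters for a file search; names and values are assumptions as well.
params = {
    "filters": json.dumps(filters),
    "fields": "file_id,file_name",
    "format": "JSON",
    "size": "10",
}

print(json.dumps(params, indent=2))
```

The portal route described in this post remains the simplest way to get a manifest; the sketch is only meant to show what the clicks correspond to.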


We can limit the list to a specific organ group by clicking, e.g., “Cases” and then “breast”:


Now we have the 1,133 files that we would like to download. We do this by clicking “add all files to cart” (or selecting the ones we are interested in):


Lastly, we go to the cart and select “download”, then “manifest”:



This provides us with a txt file that we can feed to the gdc-client:

gdc-client download -m gdc_manifest_20180801_125430.txt
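Before starting a large download, it can be worth checking that the manifest really contains only diagnostic slides. The sketch below assumes the manifest is a tab-separated text file with a header row containing `id`, `filename`, `md5`, `size` and `state` columns. That layout, the helper name `split_manifest`, and the sample rows are assumptions for illustration; the ids, checksums and sizes are placeholders.

```python
import csv
import io

# Stand-in manifest text; ids, checksums, and sizes are placeholders.
# The column names (id, filename, md5, size, state) are assumed.
MANIFEST_TEXT = (
    "id\tfilename\tmd5\tsize\tstate\n"
    "id-0001\tTCGA-XX-0000-01Z-00-DX1.example.svs\tmd5-a\t100\treleased\n"
    "id-0002\tTCGA-XX-0000-01A-01-TS1.example.svs\tmd5-b\t200\treleased\n"
)


def split_manifest(text):
    """Split manifest rows into (diagnostic, other) using the DX filename code."""
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    diagnostic = [row for row in rows if "-DX" in row["filename"]]
    other = [row for row in rows if "-DX" not in row["filename"]]
    return diagnostic, other


diagnostic, other = split_manifest(MANIFEST_TEXT)
print(len(diagnostic), "diagnostic;", len(other), "other")
for row in other:
    print("unexpected non-DX file:", row["filename"])
```

With a real manifest, the text would be read from the downloaded txt file instead of the inline string, and any rows reported as unexpected would point to frozen slides that slipped into the cart.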

That’s it!

21 thoughts on “Download TCGA Digital Pathology Images (FFPE)”

    1. either need to make yourself or find a published paper which has used them and ask them for whatever annotations you’re interested in

  1. Is there any formal document from GDC stating that files with “TS#” or “BS#” are frozen slides, and files with “DX#” are FFPE slides? I find some files with “TSA” or “TSB”, and don’t know what they mean, so I am really confused.

    1. i don’t know of any, if you find one please let me know : ) the TS and BS stand for “top slide” and “bottom slide” and are used during surgery to ensure that the resection has clean boundaries. since the patient is still on the operating table, these are always flash frozen. “diagnostic” slides by definition are FFPE. this can be seen in the data portal under “experimental strategy”, where there are two options: “tissue slide” (frozen) and “diagnostic slide” (ffpe). not sure if there will be a formal document explicitly saying this since it’s fairly routine practice to my knowledge

  2. Hi Andrew, many thanks for this. I am interested in playing around with DL methods and have a couple of questions. How do you convert svs files to tiff (or any other format)? Which image format do you prefer to work with? How do you tile the images (and how many tiles do you create)? Thanks

    1. thank you for your questions. depending on the use case, no conversion may be necessary as it’s possible to load particular regions of the image directly using either openslide or matlab (examples of that are on this blog). ultimately, if the experiment is going to be repeated often only on particular regions of interest, i do prefer extracting those regions of interest as high quality png/tif files so that they’re easier to access. unfortunately, how to tile and how many tiles to create are very dependent on the use case and the amount of data available so there is no real hard and fast rule. in general, enough so that the DL can learn, but not so many that it takes ages to train for little added value : )

  3. Besides ‘BS’ and ‘TS’, some slides are ‘MS’. I suppose those are ‘Middle Section’ frozen samples. Is that correct?

  4. Hi Andrew,

    I was wondering if you had already tried to correlate findings from pathology and radiology on the image data from TCGA? Whereas for pathology the images are well annotated and the ID lets you link the findings from the diagnostic slides to the associated clinical data, on the radiology side we are having some trouble. Very often there are multiple visits for a patient (hence multiple series), and for each visit the dates are randomized, so we do not know which is the diagnostic visit.

    If you have tried to link the pathology to the radiology and can give us some hints that would be greatly appreciated.

    1. Thanks for your question. I think it’s an interesting avenue to pursue, but unfortunately I have not done so myself and thus don’t have any advice to give you :-\

      1. Yeah, indeed it is very interesting but also challenging for a lot of reasons. We will keep digging into the online data from TCGA …


  5. Hi,

    I’d like to download Immunohistochemistry (IHC) images from TCGA. It doesn’t seem to be trivial to search the database. I’ve been trying for days now, but no luck. Any help would be much appreciated!

