Instructions for How to Download the Raw Genomic Data

Dr. Raquel Fleskes
Jan 26, 2023

The raw genomic data generated from the Anson Street Ancestors are now available for public download and use. The downloading process requires submitting an application to the Data Access Committee for approval and signing a Data Access Agreement protecting against commercial use of the data. You will also need some familiarity with the Python programming language in order to download the data from the server.

The data available for download are raw FASTQ files. FASTQ files contain the DNA sequences (reads) generated by the DNA sequencing machines, along with a quality score for each base. Before analysis, the reads will need to be trimmed to remove the adapters and then mapped to the human genome reference sequence. Afterwards, the mapped data will need to be filtered in order to make genotype calls (DNA markers) for analysis.
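
To give a sense of what a raw read file looks like, here is a minimal Python sketch that reads records from a FASTQ file, the format named in the EGA study description. It assumes the common four-line layout (an "@" header, the sequence, a "+" separator, and one quality character per base). The function name read_fastq and the two example reads are invented for illustration and are not taken from the Anson Street data.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one record per four-line FASTQ entry from an open text handle."""
    while True:
        header = handle.readline().rstrip()
        if not header:  # end of file (or a trailing blank line)
            return
        sequence = handle.readline().rstrip()
        separator = handle.readline().rstrip()
        quality = handle.readline().rstrip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up reads, used only to show the layout.
EXAMPLE = "@read1\nACGTTGCA\n+\nIIIIHHHH\n@read2\nGGCATA\n+\nFFFFFF\n"

if __name__ == "__main__":
    for record in read_fastq(io.StringIO(EXAMPLE)):
        print(record.read_id, len(record.sequence), "bases")
```

Sequencing files are often gzip-compressed; in that case the same function should work on a handle opened with gzip.open(path, "rt"). In practice, established tools handle the trimming and mapping steps, so this sketch is only meant to show the structure of the files.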

The raw genome data are currently hosted on the European Genome-Phenome Archive (EGA), which is a secure website that many other scientists use to store the generated data from their studies. This website was chosen because it offers an option for Controlled Data Access – which means that the data of the Anson Street Ancestors are more protected than in a fully publicly accessible database where anyone, for any reason, can download them.

The Controlled Data Access was set up to protect the genomic data of the Anson Street Ancestors from commercial use. This means that potential users are not allowed to sell the genomic data of the Ancestors or financially profit from them. This arrangement was set up based on the wishes of the community connected to the Anson Street Ancestors, as determined through community meetings and conversations in Charleston, South Carolina.

If you would like to use the data for any reason other than commercial use, you will be able to download them once your application is approved. You will be required to apply for access and to sign an agreement stating that you will not use the data for commercial purposes or financial profit. Your application will be reviewed by the Data Access Committee (DAC). Once the DAC approves it, you will be allowed to download the data. Please allow a few weeks for your application to be processed.

Should you be interested in downloading the genomic data of the Anson Street Ancestors, please follow these instructions based on the EGA website (https://ega-archive.org/access/data-access). If you have any questions, please contact the EGA Helpdesk (ega-helpdesk@ebi.ac.uk).

STEP 1:

Go to the EGA website (https://ega-archive.org), and search for our study in the search bar.

Study ID: EGAS00001006693

Study Name: Whole Genome Sequences of 18th century African descended individuals from Charleston, South Carolina

Study Description: This study presents Whole Genome Sequencing results from the Anson Street African Burial Ground Project, which is a community-based initiative aimed at understanding the histories of 37 Ancestors found during construction in Charleston, South Carolina. Here we report fastq files for all 37 Ancestors. DNA was extracted at the University of Tennessee-Knoxville following Dabney et al. 2013, and dual index libraries prepared using a modified NEBNext Ultra II kit with partial USER enzyme digestion. Libraries were then enriched for human genomic DNA (MyBaits) and sequenced on Illumina Platforms.

STEP 2:

Click on the Datasets tab to find the dataset for this project. The dataset ID is EGAD00001009643.

STEP 3:

Identify the Data Access Committee (DAC) and make your application using the dataset page.

For this project, please send an email to Dr. Theodore Schurr at tgschurr@sas.upenn.edu. You will then receive a request to fill out an application. The application asks for your name, the names of anyone else who will have access to the data, and a summary of how you intend to use the data. If your application is approved by the Data Access Committee, you will also be sent a Data Use Agreement Form. This form specifies how the data can be used (protection against commercial use). If you agree to the terms, sign the document. A version co-signed by the Data Access Committee will then be sent to you for your records.

STEP 4:

Access approved: Receive your EGA account Log-in details.

Once your application is approved, a link for a single sign-on for your EGA account is sent to your registered email address, for you to set your own password. Once your password has been authorized, you will receive an email notification to confirm that your EGA account has been activated and is ready to use.

STEP 5:

Once you have logged in, a list of the datasets you have been granted access to will appear on the 'My Datasets' page.

For each dataset, you will be able to identify the samples that it contains and also download the metadata associated with the dataset. The metadata package contains mapping files for samples, experiments and files.

Click on the desired dataset that you wish to download.

STEP 6:

Download the data. Downloading requires the pyEGA3 download client (version 3), which is a Python-based command-line tool. This client is used because the files can be very large and because the data must be downloaded using a secure method.

The pyEGA3 client runs on macOS with Python 3.6 or later installed. It requires an internet connection, sufficient free space on the destination drive, and your EGA download account credentials.
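
As a rough sketch of what the download step can look like, the Python script below writes a credentials file and prints the pyEGA3 commands for listing your datasets, listing the files in this project's dataset, and fetching them. It assumes the client has been installed (for example with "pip3 install pyega3") and that it accepts a JSON credentials file with "username" and "password" keys through the -cf option, together with the datasets, files, and fetch subcommands and the --output-dir option, as described in the pyEGA3 README at the time of writing. Options can change between versions, so please confirm them with "pyega3 --help" before relying on this. The function names and the output folder name are hypothetical.

```python
import json
import os
import shlex
from pathlib import Path

DATASET_ID = "EGAD00001009643"  # dataset ID given in STEP 2


def write_credentials(path, username, password):
    """Write a JSON credentials file and restrict it to the current user."""
    path = Path(path)
    path.write_text(json.dumps({"username": username, "password": password}))
    os.chmod(path, 0o600)  # has limited effect on Windows
    return path


def pyega3_command(credentials_file, action, *args):
    """Build one pyEGA3 command as an argument list (nothing is executed)."""
    return ["pyega3", "-cf", str(credentials_file), action, *args]


def download_plan(credentials_file, dataset_id=DATASET_ID,
                  output_dir="anson_street_raw"):
    """Return the commands in the order they would usually be run."""
    return [
        # 1. Confirm which datasets your account can access.
        pyega3_command(credentials_file, "datasets"),
        # 2. List the files contained in the dataset.
        pyega3_command(credentials_file, "files", dataset_id),
        # 3. Download every file in the dataset (assumed --output-dir option).
        pyega3_command(credentials_file, "fetch", dataset_id,
                       "--output-dir", output_dir),
    ]


if __name__ == "__main__":
    # Prints the commands so they can be reviewed and run in a terminal.
    for command in download_plan("credentials.json"):
        print(" ".join(shlex.quote(part) for part in command))
```

The script only prints the commands rather than running them, so you can check each one against the client's help output first. Keep the credentials file private and do not share it or commit it to a code repository.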

More information about the downloader can be found on the EGA website: https://ega-archive.org/download/downloader-quickguide-APIv3

Or on the github page: https://github.com/EGA-archive/ega-download-client

In addition, a video tutorial demonstrating the usage of pyEGA3 from installation to file download is available: https://embl-ebi.cloud.panopto.eu/Panopto/Pages/Viewer.aspx?id=be79bb93-1737-4f95-b80f-ab4300aa6f5a