Carrying out world-class genetics is both easier and harder than ever before. While the tools exist to produce a gold mine of information, getting access to datasets is a complex legal process, and harmonising that data to compare across studies only adds further obstacles.
A new resource developed by scientists at the University of Tartu, though, aims to resolve these challenges. Crafted with partners at the European Bioinformatics Institute (EBI) in the UK, the resource, its developers hope, can sustain research and reshape the field of genetics.
Scientists recently described the tool, called the eQTL Catalogue, in a paper in Nature Genetics.
‘A real issue’
Officially, the eQTL Catalogue is a compendium that provides access to standardised quantitative trait loci (QTLs), culled from more than two dozen public studies in human subjects. QTLs are variants in the human genome that have been found to be associated with a measurable phenotype, such as a trait or disease. Scientists often discover QTLs by running genome-wide association studies, in which the genomes of many subjects are interrogated using technologies such as genotyping arrays or next-generation sequencing.
By comparing genomic data with phenotypic data, they arrive at variants of interest, which are often just coordinates in the genome adjacent to genes or to so-called splicing events, variations in how a gene's transcripts are pieced together. While a vast amount of genomic data has been generated in such studies and many variants of interest have been reported, accessing and harmonising the data to compare between studies has been a challenge, the developers argue.
“There has been a real issue in the field because nobody could reuse these published datasets,” notes Kaur Alasoo, the primary driver behind the eQTL Catalogue and one of the two corresponding authors on the new paper. “You could reuse the data but you always had to apply for access,” he says. “In this field, there are a lot of small datasets, so if you want to use all of them in a large study you have to spend two or three years applying for access.”
Alasoo is currently an assistant professor at the University of Tartu but previously worked at the Wellcome Sanger Institute, a nonprofit British genomics and genetics research institute based at its campus in Hinxton, a small English village south of Cambridge. It was through his work at the Sanger that Alasoo took part in the Open Targets Consortium, a public-private partnership between the global pharmaceutical companies Bristol Myers Squibb, GSK, and Sanofi, the EBI, and the Wellcome Sanger Institute. The eQTL Catalogue was then proposed and funded via Open Targets, and the project commenced in 2018. According to Alasoo, it took two years before the first version went live in 2020.
Since then, the feedback from users has been positive. These users are mainly other scientists interpreting findings from their own studies. Their work turns up genomic coordinates and variants of interest, and whereas previously they might have had to negotiate data access agreements with the owners of various datasets, obtain the data, and harmonise it before comparing from dataset to dataset, they can now simply look up the same variants in the eQTL Catalogue, bypassing all of that work.
“The main use case is people who have done a genome-wide association study and then want to look up what these variants do in a particular cell type or tissue and how they affect genes,” says Alasoo.
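In code, that use case amounts to filtering a harmonised association table by variant and tissue. The sketch below is purely illustrative: the column names and values are invented for this example and are not the Catalogue's actual schema.

```python
import pandas as pd

# Toy stand-in for a harmonised eQTL summary-statistics table.
# Column names and values are illustrative, not the Catalogue's real schema.
sumstats = pd.DataFrame({
    "variant": ["rs123", "rs123", "rs456"],
    "gene_id": ["ENSG000001", "ENSG000002", "ENSG000003"],
    "tissue":  ["blood", "blood", "liver"],
    "beta":    [0.42, -0.10, 0.05],
    "pvalue":  [1e-8, 0.3, 0.7],
})

def lookup(variant_id, tissue, table):
    """Return associations for a GWAS hit in the requested tissue,
    sorted from most to least significant."""
    hits = table[(table["variant"] == variant_id) & (table["tissue"] == tissue)]
    return hits.sort_values("pvalue")

print(lookup("rs123", "blood", sumstats)[["gene_id", "beta", "pvalue"]])
```

The same filter-and-rank step, repeated across cell types and tissues, is what previously required separate access agreements for each underlying dataset.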
A painstaking effort
It’s an achievement that took some tedious work to accomplish. As noted in the paper, technical variation exists between studies, ranging from the protocols and technologies used to produce the data to the informatics tools used to analyse it. Such variation can influence the estimated significance or effect size of variants of interest and how those variants appear to affect gene expression. To overcome these issues, the researchers uniformly reprocessed more than a hundred datasets from more than two dozen studies and found that effect sizes were, in general, reproducible across studies and not overly influenced by technical variation. Summary statistics and fine mapping of these variants are now available via the Catalogue.
According to Alasoo, the real challenge in constructing the eQTL Catalogue was not the data processing itself, which was time-consuming but “completely doable” in his words, but rather securing data access agreements. “We have to propose what we want to do with the data, and see if it matches the consent obtained from the original research participants,” explains Alasoo. “Most have their own procedures and there is a lot of bureaucratic work in finding out what the requirements are and filling out forms,” he says. Once access is granted, though, processing the datasets is rapid.
Here, Alasoo credited applying for access through the University of Tartu as a benefit: larger institutions tend to try to renegotiate data access agreements, whereas Tartu, in part because of its smaller size, was more flexible and could reach agreements quickly with the owners of datasets.
Currently, datasets from 29 studies have been obtained, processed, and entered into the eQTL Catalogue, and the resource has spurred a round of new studies, most of which have yet to be published. “This database is being used earlier in projects so that many projects that are using it haven’t been published yet,” says Alasoo. He notes that scientists working within the FinnGen Project, an effort to obtain genetic data from about half a million Finns by the year 2023, are using the eQTL Catalogue extensively to interpret their findings.
Alasoo is using it as well. “My own interest is using this resource to better understand the fundamental mechanisms of gene regulation,” he says. He confirmed that he will continue to be involved in updating the Catalogue as more datasets are obtained, processed, and harmonised.