Machine Intelligence Cracks Genetic Controls



Every recipe has both instructions and ingredients. So does the human genome. An error in the instructions can raise the risk for disease.



Every cell in your body reads the same genome, the DNA-encoded instruction set that builds proteins. But your cells couldn’t be more different. Neurons send electrical messages, liver cells break down chemicals, muscle cells move the body. How do cells employ the same basic set of genetic instructions to carry out their own specialized tasks? The answer lies in a complex, multilayered system that controls how proteins are made.


Most genetic research to date has focused on just 1 percent of the genome—the areas that code for proteins. But new research, published Dec. 18 in Science, provides an initial map for the sections of the genome that orchestrate this protein-building process. “It’s one thing to have the book—the big question is how you read the book,” said Brendan Frey, a computational biologist at the University of Toronto who led the new research.

Frey compares the genome to a recipe that a baker might use. All recipes include a list of ingredients—flour, eggs and butter, say—along with instructions for what to do with those ingredients. Inside a cell, the ingredients are the parts of the genome that code for proteins; surrounding them are the genome’s instructions for how to combine those ingredients.


Just as flour, eggs and butter can be transformed into hundreds of different baked goods, genetic components can be assembled into many different configurations. This process is called alternative splicing, and it’s how cells create such variety out of a single genetic code. Frey and his colleagues used a sophisticated form of machine learning to identify mutations in this instruction set and to predict what effects those mutations have.


Graphic credit: Olena Shmahalo/Quanta Magazine



The researchers have already identified possible risk genes for autism and are working on a system to predict whether mutations in cancer-linked genes are harmful. “I hope this paper will have a big impact on the field of human genetics by providing a tool that geneticists can use to identify variants of interest,” said Chris Burge, a computational biologist at the Massachusetts Institute of Technology who was not involved in the study.


But the real significance of the research may come from the new tools it provides for exploring vast sections of DNA that have been very difficult to interpret until now. Many human genetics studies have sequenced only the small part of the genome that produces proteins. “This makes an argument that the sequence of the whole genome is important too,” said Tom Cooper, a biologist at Baylor College of Medicine in Houston, Texas.


Reading the Recipe


The splicing code is just one part of the noncoding genome, the area that does not produce proteins. But it’s a very important one. Approximately 90 percent of genes undergo alternative splicing, and scientists estimate that variations in the splicing code make up anywhere between 10 and 50 percent of all disease-linked mutations. “When you have mutations in the regulatory code, things can go very wrong,” Frey said.


“People have historically focused on mutations in the protein-coding regions, to some degree because they have a much better handle on what these mutations do,” said Mark Gerstein, a bioinformatician at Yale University, who was not involved in the study. “As we gain a better understanding of [the DNA sequences] outside of the protein-coding regions, we’ll get a better sense of how important they are in terms of disease.”


Scientists have made some headway in understanding how the cell chooses a particular protein configuration, but much of the code that governs this process has remained an enigma. Frey’s team was able to decipher some of these regulatory regions in a paper published in 2010, identifying a rough code within the mouse genome that regulates splicing. Over the past four years, the quality of genetics data, particularly human data, has improved dramatically, and machine-learning techniques have become much more sophisticated, enabling Frey and his collaborators to predict how splicing is affected by specific mutations at many sites across the human genome. “Genome-wide data sets are finally able to enable predictions like this,” said Manolis Kellis, a computational biologist at MIT who was not involved in the study.


