Computers Search for Understanding in Bioscience
by David Pescovitz
Marti Hearst is also a member of the University of California, San Francisco's Graduate Program in Biological and Medical Informatics.
A key part of any scientific research is knowing what's already known. To stand on the shoulders of giants, researchers browse, search, and synthesize journal articles written by their colleagues. For bioscientists, this means frequent forays into MEDLINE, the US National Library of Medicine's bibliographic database of journal references dating back forty years. Currently, MEDLINE houses 13 million citations with half a million more added each year. How does one find the needles in this ever-growing data haystack? UC Berkeley professors Marti Hearst and Adam Arkin are developing smart search technology to help bioscientists find what they're seeking and, perhaps someday, valuable information they didn't even know they were looking for.
For example, a bioscientist might want to locate articles relevant to her theory that an enzyme dramatically increases the concentration of a certain chemical in the body. According to Hearst, it's simply not possible to ask those kinds of questions of MEDLINE and receive accurate answers using existing technology. Traditional searches can locate keywords, but finding non-obvious relationships between words and phrases in the texts such as "increases concentration of" or "binds to" is a very humanlike talent that computers are far from mastering. This is especially difficult because different authors may use different phrases to describe the same relationship.
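To make the gap between keyword search and relationship search concrete, here is a minimal sketch in Python. The phrase lists, the relation labels, and the function names are hypothetical illustrations, not part of BioText or MEDLINE. It only shows why a literal keyword can miss a sentence that states the same relationship in different words.

```python
import re

# Hypothetical surface phrasings that express the same underlying relation.
# These lists are illustrative assumptions, not taken from BioText.
RELATION_PATTERNS = {
    "increases_concentration_of": [
        r"increases (?:the )?concentration of",
        r"raises (?:the )?levels? of",
        r"elevates",
    ],
    "binds_to": [
        r"binds (?:to|with)",
        r"forms a complex with",
    ],
}


def keyword_match(sentence, keyword):
    """Plain keyword search: true only if the literal keyword appears."""
    return keyword.lower() in sentence.lower()


def find_relations(sentence):
    """Return the sorted relation labels whose phrasings occur in the sentence."""
    text = sentence.lower()
    found = set()
    for label, patterns in RELATION_PATTERNS.items():
        if any(re.search(pattern, text) for pattern in patterns):
            found.add(label)
    return sorted(found)


# Made-up example sentences, not real MEDLINE citations.
sentences = [
    "Enzyme X increases the concentration of chemical Y.",
    "Enzyme X raises levels of chemical Y in serum.",
    "Protein A forms a complex with protein B.",
]

for sentence in sentences:
    print(keyword_match(sentence, "concentration"), find_relations(sentence))
```

The second sentence never uses the word "concentration", so the keyword search skips it, while the pattern table still maps it to the same relation as the first. A hand-written table like this does not scale to the variety of phrasing in real articles, which is part of why the problem is hard.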
Adam Arkin is also a faculty scientist at Lawrence Berkeley National Laboratory's Physical Biosciences Division.
"The search algorithms we're developing as part of the BioText project are based on natural language processing," says Hearst, a professor in the School of Information Management and Systems who also holds an affiliate position with the Department of Computer Science. "So instead of focusing on artificial intelligence, we're leveraging the intelligence of the user by revealing what's in the underlying data so they can make more sophisticated queries in intuitive ways."
As BioText scours the articles, the software uses statistical techniques to automatically identify and label words and relations between entities found in the text. Combining those labels with existing databases and pre-existing categories of terms will intelligently narrow a user's search results. For instance, a search of MEDLINE might extract a relationship between a certain disease and the mention of a treatment. The tricky part for a computer is defining the nature of that relationship.
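As a rough illustration of what statistically labeling the nature of a treatment-disease relationship could look like, the sketch below trains a tiny naive Bayes classifier on bag-of-words features. Everything in it is an assumption made for illustration: the article does not say which statistical techniques BioText uses, and the training sentences, label names, and the functions `train` and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-labeled sentences; real training data would be far larger
# and drawn from actual journal text.
TRAINING = [
    ("drug cured the infection in all patients", "cures"),
    ("treatment eliminated the disease completely", "cures"),
    ("vaccine prevents infection in children", "prevents"),
    ("daily dose prevents onset of the disease", "prevents"),
    ("drug caused nausea and headache", "side_effect"),
    ("treatment caused rash in some patients", "side_effect"),
    ("drug had no effect on the disease", "ineffective"),
    ("treatment failed to improve symptoms", "ineffective"),
]


def train(examples):
    """Count words per label, documents per label, and the overall vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        words = text.lower().split()
        word_counts[label].update(words)
        label_counts[label] += 1
        vocab.update(words)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        # Log prior: fraction of training sentences carrying this label.
        score = math.log(label_counts[label] / total_docs)
        for word in text.lower().split():
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING)
print(classify("vaccine prevents disease", model))
print(classify("treatment caused nausea", model))
```

A bag-of-words model like this ignores word order, so it would struggle with the long noun phrases typical of bioscience writing. It is meant only to show the shape of the task: text in, one relationship label out.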
"Our software tries to determine if the described treatment is a cure for the disease, ineffective for the disease, causes a side effect, or can be used to prevent the disease," Hearst says.
With millions of references to search and thousands of new journal papers published each week, BioText could be a boon for scientists. Indeed, BioText, a project of the Berkeley-based Center for Information Technology Research in the Interest of Society (CITRIS), is supported by a large grant from the National Science Foundation. However, developing algorithms that can penetrate the particular grammar of bioscience texts is no simple feat.
"For me, it's like learning a different language," Hearst explains. "Bioscience texts use fewer verbs and longer strings of nouns than everyday language, so we have to develop special text analysis techniques to find the relationships in those long noun phrases."
The researchers are continuing to hone their algorithms to analyze the esoteric language of bioscience. Meanwhile, they've begun to develop a user-friendly search interface, a "front end" to the natural language processing that can connect to standard databases.
Someday, Hearst adds, the research could have applications in "text mining," discovering previously unknown nuggets of wisdom by extracting information from a variety of written resources. Using advanced natural language processing to highlight indirect links in medical data, scientists might suss out the cause of a rare disease or the undiscovered interactions between proteins encoded in the human genome. By training a computer to read more than an individual researcher could ever hope to, machines might someday answer questions scientists haven't yet asked.
"Human language was never meant for a computer to process," Hearst says. "It's exciting to try and break that code."
Marti Hearst's home page
BioText: Project Homepage
Arkin Laboratory
"What Is Text Mining?" by Marti Hearst
The Flamenco Search Interface Project
UC Berkeley School of Information Management and Systems
Center for Information Technology Research in the Interest of Society (CITRIS)
"Engineering Life" by David Pescovitz (Lab Notes, June 2004)
Lab Notes is published online by the Marketing and Communications Office of the UC Berkeley College of Engineering. The Lab Notes mission is to illuminate groundbreaking research underway today at the College of Engineering that will dramatically change our lives tomorrow.
Media contact: Teresa Moore, Lab Notes editor, Director of Marketing and Communications
Writer, Researcher: David Pescovitz
Web Manager: Michele Foley
Subscribe or send comments to the Engineering Marketing and Communications Office: lab-notes@coe.berkeley.edu.
© 2005 UC Regents.
Updated 2/1/05.