Thursday, June 26, 2008

Bill Gates Speaks Up For Semantics

Bill Gates’ criticism that pure keyword search is “just syntax and not semantics and has limits no matter how much you build those things up” came on the heels of a heated conversation I took part in at the Semantic Technology 2008 conference in San Jose. The session title was: Will Semantics Give Web Search a Face-lift?

It was clear from the outset that very different notions of semantics were in use, so a lively discussion ensued during the Q&A, where the panelists compared their own approaches to taking search to the next level. Since everyone belongs to a different school of thought, we agreed to disagree: Fernando Pereira, Research Director at Google, assumes that semantics can be captured from the use and formatting of language; ironically, he later stated that Wittgensteinian (meaning is use) or Fregean (meaning can be reduced to formal logic) approaches are futile. Google’s approach uses classic statistical machine-learning methods (robust, in the sense of a brick being a robust tool for switching off a light, but as we know non-scalable), so we know that there is no “semantics” focus. Peter Mika, a recent hire at Yahoo!, on the other hand, talked about their new SearchMonkey interface that is, inter alia, to be fed by RDF markup. Obviously, hakia’s position is rather different.

Keyword Co-occurrence Statistics as Semantics (Google)

Fernando killed the light with a rock.

If meaning is co-occurrence for you, then this sentence will be a possible answer to queries about people dying in avalanches. The structure of a webpage would not help you much here, either. Seemingly relevant words, not disambiguated as to their actual senses in the given context, will easily mislead you.
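To make the point concrete, here is a minimal sketch of pure keyword matching. It is not Google's algorithm; the stopword list, the function names, and the example query are assumptions made up for illustration. It only counts shared content words, which is enough to show how the sentence above scores as a match for a query about a rock fall.

```python
import re

# Hypothetical, tiny stopword list chosen for this illustration only.
STOPWORDS = {"a", "the", "with", "by", "in", "of", "to"}

def content_words(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def keyword_overlap(query, sentence):
    """Score a sentence by how many content words it shares with the query."""
    return len(content_words(query) & content_words(sentence))

query = "killed by a falling rock"          # assumed query about an accident
sentence = "Fernando killed the light with a rock."

# 'killed' and 'rock' match, although the sentence is about a lamp.
print(keyword_overlap(query, sentence))
```

The score is driven entirely by the surface words 'killed' and 'rock'; nothing in it records which sense of either word is meant.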

Syntax as Semantics (Powerset)

It should also be mentioned that Ron Kaplan, CSO of Powerset, made a few statements from the audience, including the very telling one that Powerset believes in the “syntactic web”, which pointedly illustrates his belief that you can get to meaning from surface syntax.

If meaning is syntax, then for you the sentence above is not distinguishable from this one:

Ron killed the program with a memory leak.

The surface structures of the sentences are identical, and some words even overlap, but killing a light is different from killing a program (not to mention killing an animate being), and the ‘rock’ is an instrument in the first case, while the ‘memory leak’ is a cause in the second. Syntax does not grant you access to any of these important differences in meaning.
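A rough sketch of the same observation in code: the part-of-speech lexicon below is hand-built and hypothetical (a real system such as Powerset's would use a full parser, which is not reproduced here), and the role labels are simply the ones named in the paragraph above. Both sentences reduce to the same surface pattern, while the roles differ.

```python
# Hypothetical hand-built lexicon; a real parser would assign these tags.
POS = {
    "fernando": "PROPN", "ron": "PROPN",
    "killed": "VERB",
    "the": "DET", "a": "DET",
    "with": "ADP",
    "light": "NOUN", "rock": "NOUN",
    "program": "NOUN", "memory": "NOUN", "leak": "NOUN",
}

def surface_pattern(sentence):
    """Map each word to its tag, collapsing noun runs into one noun slot."""
    tags = [POS[w] for w in sentence.lower().rstrip(".").split()]
    collapsed = []
    for tag in tags:
        # Treat a run of nouns ("memory leak") as a single noun phrase.
        if not (collapsed and collapsed[-1] == tag == "NOUN"):
            collapsed.append(tag)
    return collapsed

s1 = "Fernando killed the light with a rock."
s2 = "Ron killed the program with a memory leak."

# Role of the 'with' phrase, annotated by hand from the text above.
WITH_ROLE = {s1: "instrument", s2: "cause"}

print(surface_pattern(s1) == surface_pattern(s2))  # same surface structure
print(WITH_ROLE[s1] == WITH_ROLE[s2])              # different meaning
```

The pattern comparison cannot see the difference that the hand annotation records, which is the gap the paragraph describes.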

Semantic-Web Markup as Semantics (Yahoo!)

If, on the other hand, you believe in semantic-web-style markup as the solution, then the author of the sentence will have to add tags that clarify that a lamp was switched off, hopefully in the same way that another user has tagged this sentence:

Peter used his usual brick to turn off the lamp.

Semantics as Semantics (hakia)

If, finally, you have access to semantics, your constraints on the different senses of ‘kill’, ‘light’, and ‘rock’ will get you to the meaning automatically, and you will serve the sentence above only as an answer to queries about methods to switch off lamps, and not pollute your results with it otherwise. For more examples, you can read my prior blog posts.

Where we currently are in search is nice, but there is much room for improvement. Non-semantic methods have reached their ceiling. Carefully tested and appropriate semantic methods, based on understanding natural language, will get us to the next stage. We are phasing these in, beta release by beta release, and will show you the difference between real semantics and yesterday’s attempts at avoiding semantics. Stay tuned!

Tuesday, June 24, 2008

hakia Adds 10 million PubMed Articles to its Semantic Search Engine

PubMed.gov is one of the largest data aggregation points in medicine, and the only one that covers more than 4000 journal entries. We are proud to announce that hakia has QDEXed more than 10 million PubMed abstracts, and is now offering PubMed search exclusively at pubmed.hakia.com, or at hakia.com as part of a general search.

You don’t know what you are missing using PubMed’s own Search Engine

We start with the observation that PubMed’s own search engine has some serious holes in it, and the user may not realize what he/she is missing. Although we do not like making comparisons for the sake of example, in this particular case there seems to be no other way to demonstrate how important it is to search efficiently for health information on the Internet.

The first query is a simple one: Protein C deficiency

As part of a general search, hakia’s first result from PubMed is an article written by Nizzi FA Jr and Kaplan HS, from the University of Texas Southwestern Medical Center. It is all about Protein C and S deficiency. PubMed’s own search engine not only fails to return this article, but all 20 of its results are irrelevant to this query.

Protein C deficiency is not a dull subject. It causes blood clots and should be on the radar of medical doctors, nurses, medical students, researchers, and even the standard health consumer.

The next query is a bit more research-oriented: phosphorylation sites in glycine

The situation is the same. hakia’s first result from PubMed is an article written by Luca Z. et al., from Vanderbilt University School of Medicine. PubMed’s own search engine fails to return this abstract, and nothing in the first 20 results seems to be related.

Before this blog post becomes too repetitious, we will close with a third example, picked from dozens of others we have analyzed. This time, the query is a more general concept in genetics: modulation of ion channels.

hakia’s first result from PubMed is an article written by Dascal N. from Sackler School of Medicine, Tel Aviv University. PubMed’s own search engine fails to return this abstract, and the first 20 results do not look relevant.

Google Site Search for PubMed shows the same holes

So, we turned to Google site search. The query protein C deficiency fails. So do the other two queries: phosphorylation sites in glycine and modulation of ion channels. However, these failures are expected from Google due to (1) undefined coverage, and (2) limitations of the popularity algorithms.

Semantic search making a difference at the basic level

These examples are a reminder that semantic search technology can make a difference at the basic level of retrieval because of its built-in consistency, and because the technology does not depend on any statistics. We have not even discussed the semantic variations between the queries and text.

hakia’s PubMed coverage will continue on a daily basis as we utilize the power of semantic algorithms handling dynamic data (new abstracts emerge daily). Stay tuned for an update.

Wednesday, June 11, 2008

Hakia Web Search

hakia is an Internet search engine. The company has invented QDEXing technology, a new infrastructure that offers an alternative to indexing and uses the SemanticRank algorithm, a solution mix from the disciplines of ontological semantics, fuzzy logic, computational linguistics, and mathematics. Founded in 2004, the company is privately held and based in New York City.