kwc.org Photos Spare Cycles MythBusters

Talk: Application of Artificial Intelligence to Web Search

Peter Norvig, Google

Web search matters * Part of public conscious * Hummer ad: "Search Engine", thought as likely as "Markov Chain Monte Carlo Method"

Web is escape from toy worlds. larger than block world, larger than any domain. Every conceivable language, fresh. Not as big as real world, but it's more accessible for machines, predominantly symbolic.

Commonsense knowledge

Take all encyclopedic knowledge and encode it, but still missing type of knowledge that doesn't make sense writing articles about.

"Each of us has a vast storehouse of general knowledge, though we rarely talk about any of it explicity to one another; we just assume that other people already know these things. If they are included in a conversation, or an article, they confuse more than..."

  • "water flows downhill": typed into Google, found lesson plan: 1 in 2 million pages. perhaps cheating, not discovering but verification
  • "water flows uphill": typed into Google, 1/3rd less results (better until Dyson came up with fountain that made it look like water going uphill)

Applications

  1. Classifying web page spam: known categories
  2. Named entity detection and categorization: make up categories as you go
  3. Statistical machine translation
  4. Word clustering: don't have to place into categories, just group

Classifying Web page spam

Adversarial IR. Can attack relevance + authority of page (link and anchor analysis).

Relevance attacks: * e.g. "cheap airplane tickets" attacking relevance (including misspellings). Serve different version of page to Google ("cloaking"). * can look at sequence of words and decide that it's unlikely (e.g. bigram, n-gram) * Medical page: subheadline not capitalized, some of the sentences don't make sense. Spammer substituting keyword nouns in for the medical terms. Need sophisticated n-gram model to decide sequence is bad. Would pass syntactic parser. * Web version of the Turing test. Have to distinguish text coming from legitimate Web author from automated spammer

Authority attacks: * Clayton New Mexico guestbook: lots of viagra spam. * Go to all sites that offer guest feedback. * W3C mailing list spam. * Don't want to take away their authority, but have to recognize parts of site that have are bad.

Google Bombing attacks * Miserable failure: Bush, Carter, and Michael Moore popular attacks. Previously uncontested query (1 in a 100 million to 1 in a billion queries made). Not displacing information that was previously there. * Nigritude Ultramarine: person who used in first week used spammer-like attacks. Grand Prize went to Anil Dash: "link to me so I can win contest." Helped that he makes blog software. Resembles vote by author of popular content, legitimate technique.

Name Entities ("IBM")

Task is to find them in text, sometimes categorize them

What kind of learning technique? * supervised/lightly/unsupervised * language-/domain-specific/General

Can start with seed entities (e.g. IBM and Microsoft), from there learn new ones.

Will describe approach that does not require seed entities.

What sort of classification? * one/many categories * binary/probabilistic membership * flat/tree/graph * attributes/none (e.g. MB of memory) * regular expressions/grammars/other

Pasca algorithm * does not require seed entities * filter HTML, break text into sentences, and mark POS with tagger (Brants' TnT) * Match sentences against patterns like (domain independent) * C ["such as"|"including"] N (C: concept, N: noun) * "Companies such as IBM" * Retain (N isa C) if C is a simple, ends in plural noun * Plenty of source text, so only have to keep ones with high confidence * Discard noisy data; regularize over names * Plug (N isa C) pairs back into text to find new patterns. Iterate. * Combine categories based on overlap in category names and overlap in group membership

Extracted entities and categories * hybrid cars: Toyota Prius, Honda Insight * high-speed networks: ATM, Gigabit Ethernet, B-ISDN, Myrinet, Frame Relay, Fast Ethernet, etc.. * rappers: Eminem, Jay-Z, Nas, Dmx, etc... * anti-depressants: prozac, etc..

88% precision

Hard to calculate recall

Lots of data means you can use less sophisticated techniques

Banko & Brill, 2001: Effect of training corpus size * word sense disambiguation * five different types of learning algorithms * 1 million to 1 billion training words. * regardless of techniques, all increase in accuracy over training corpus size

Statistical machine translation

Large corpus technique doesn't help.

Machine translation * statistical approach: n-gram model over source, target words. statistical approach hit 71% in competition and rule based only 49%. Now everyone shifting to statistical approach. * phrase-based: * EN: [Let's meet][next Wednesday][at six thirty]. * DE: [Treffen wir uns][nachsten Mittwoch][um halb sieben] * template-based: "Wednesday" -> WEEKDAY. Don't have to learn same thing over for other weekdays. * Discriminative training: P(target | source), not the generative P(source|target)P(target) from Bayes Rule. Latter has advantage that P(target) can be calculated within the language (mono-language), but former has better results.

In DARPA challenge, human performance at 53% (Blue metric, automated metric, measure against one good translation, cheap metric, easy to implement). ISI system is at 43%, 2nd best at 29%. If give machines larger corpus, ISI and 2nd best system will approach human score (wasn't able to run on larger corpus, projected outward by generating results backwards, looked linear, then projected trend forward). ISI system still much better than 2nd best, so case of having the right algorithm is important.

Word clustering

Bayes Net clustering * good for providing negative evidence. Don't return New Mexico when searching for Mexico. * harder to propagate negative evidence in rule-based systems.

In the future

"In the future, search engines should be useful as HAL in 2001... but hopefully they won't kill people" -- Sergey Brin

Q and A

Interested in enhancing image search by actually looking at picture, but state-of-the-art is still to look at metadata

Audio search hampered by questions of ownership

How do you measure search quality? Anecdote about how AskJeeves scores themselves against Google, "I'm happy to report we're scoring 100%." Track aggregate statistics, customer loyalty/satisfaction."

Expert Googles: multiple ways. Google-in-a-Box. Restrict to site domain. Recently added site-flavored personalized search box for Web site.

tags.

related entries.

what is this?

This page contains a single entry from kwc blog posted on July 29, 2004 10:31 AM.

The previous post was *blog seizure*.

The next post is Burning poo.

Current entries can be found on the main page.