Yahoo term extraction examples

I was curious to see how well the new Yahoo term extraction API would work, so I coded up a quick script to get some results on my most recent blog entries. I was hoping that this API might make it easier to write something to auto-tag all my previous entries, or otherwise let me add some interesting bits of info to these pages, but I'm not so sure. You can see the samples in the extended entry, as well as Python code for doing it yourself.

Java/Windows hate hate hate

  • default timezone
  • java forums
  • daylight savings time
  • daylight savings
  • software programming
  • cathartic
  • occassionally
  • milliseconds
  • sarcasm
  • stupidity
  • utc
  • january 1
  • appreciate

Book: Maya Lin Boundaries

  • idea
  • yale course
  • typeface
  • architecture
  • mit medical
  • singular
  • overwhelm
  • design art
  • bembo
  • tufte
  • technical consultants

Support kwc.org

  • plagiarism
  • boring
  • banner ads
  • recuperate
  • blogging
  • outsourcing

Ode to '97

  • blinking text
  • kwc
  • nostalgic
  • homage
  • 20th century

I am a plagiarist

  • blog
  • plagiarist
  • hair school
  • haircuts
  • confess
  • accountability
  • purchase high quality

Yahoo 360 First Impressions

  • yahoo groups
  • my yahoo groups
  • yahoo address
  • yahoo e mail
  • yahoo services
  • fantasy sports
  • tv listings
  • daily basis
  • weather
  • personal calendar
  • blogging
  • address book
  • personal information
  • networking service
  • social networking
  • personal information organizer
  • information organizer
  • comics news
  • gut reaction
  • sports tv

Code

doc: http://developer.yahoo.net/content/V1/termExtraction.html

import urllib
import urllib2

USER_AGENT = 'kwccontent robot'
OPENER = urllib2.build_opener()
def doQueries():
    results = []
    for context in contexts:
        request = urllib2.Request(apiUrl)
        request.add_header('User-Agent',USER_AGENT)
        print "building query"
        dataDict = [ ('appid', appid), ('context', context)]
        queryData = urllib.urlencode(dataDict)
        request.add_data(queryData)

        print "fetching result"

        result = OPENER.open(request).read()
        print "got result: " 
        print result
        results.append(result)

    from xml.dom.minidom import parseString

    for r in results:
        doc = parseString(r)
        print doc.toprettyxml()


apiUrl = 'http://api.search.yahoo.com/ContentAnalysisService/V1/termExtraction'
appid = 'YOUR_APP_ID'  # placeholder: register for your own
contexts = []  # fill in: array of text to send to yahoo

doQueries()
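
The script only pretty-prints the raw XML. To get term lists like the samples earlier in this entry, the terms have to be pulled out of each response. What follows is a minimal sketch, assuming the service wraps each term in a Result element inside a ResultSet; that layout and the sample string are assumptions for illustration (not captured output), and extractTerms is a hypothetical helper name.

```python
from xml.dom.minidom import parseString

# Assumed response layout, written by hand for illustration;
# it is not output captured from the live service.
SAMPLE_RESPONSE = """<ResultSet>
  <Result>daylight savings time</Result>
  <Result>default timezone</Result>
  <Result>utc</Result>
</ResultSet>"""

def extractTerms(xmlText):
    """Return the text of every <Result> element, in document order."""
    doc = parseString(xmlText)
    terms = []
    for node in doc.getElementsByTagName('Result'):
        # Join the text children of this element; skip empty results.
        text = ''.join(child.data for child in node.childNodes
                       if child.nodeType == child.TEXT_NODE).strip()
        if text:
            terms.append(text)
    return terms

print(extractTerms(SAMPLE_RESPONSE))
```

If the real response differs from the assumed layout, only the tag name passed to getElementsByTagName should need to change.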

Comments (2)

bp:

That's not completely hopeless. Just label them as "auto categories" or something, and you'll probably have something semi-decent. Or union the results with hand-picked tags, if you'd prefer, just to make sure there's some signal to the noise.

kwc:

Not sure how to put it to use yet. It might be useful, as you said, in a "suggested tags" sort of setting. delicious offers suggested tags with their experimental UI, which is very useful, though they base that on the input of other users. This could try to derive the same benefits of the delicious UI, given that my blog doesn't have 100 other people tagging each blog entry.
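
Both ideas in this thread, unioning extracted terms with hand-picked tags and offering the extracted terms as "suggested tags", come down to simple set operations. A rough sketch follows; mergeTags and suggestTags are hypothetical helper names, the sample tags are made up for illustration, and lowercasing is an assumed normalization rather than anything the API requires.

```python
def mergeTags(handPicked, extracted):
    """Union hand-picked tags with extracted terms, hand-picked first."""
    merged = []
    seen = set()
    for tag in list(handPicked) + list(extracted):
        key = tag.strip().lower()  # assumed normalization
        if key and key not in seen:
            seen.add(key)
            merged.append(key)
    return merged

def suggestTags(handPicked, extracted):
    """Extracted terms that are not already hand-picked tags."""
    existing = set(tag.strip().lower() for tag in handPicked)
    return [t for t in mergeTags([], extracted) if t not in existing]

# Made-up inputs for illustration.
handPicked = ['Yahoo', 'social networking']
extracted = ['yahoo groups', 'social networking', 'fantasy sports', 'blogging']

print(mergeTags(handPicked, extracted))
print(suggestTags(handPicked, extracted))
```

Keeping the hand-picked tags first in the merged list is one way to make sure the signal comes before the noise.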

I've been considering moving to a more keywords-oriented approach to categorization. The main problem is that there is no elegant way to do this in MovableType. I may try to build a tagging system layered on top of everything, which would have the advantage that it could tag anything, not just my blog entries (e.g. internally and externally tagged photos/links). I think it would be cool if someone who clicked on a "gehry" tag on my site were shown a sampling of Gehry photos from each of my kwc.org photo albums (and Flickr photo albums), as well as links to blog entries that mention Gehry, and possibly any external links I've collected related to Gehry.
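
A tagging layer like this could be as simple as an index from tag to items grouped by kind. The sketch below is only one possible shape, not an existing MovableType feature: TagIndex, its methods, the kind names, and the item identifiers are all hypothetical.

```python
from collections import defaultdict

class TagIndex(object):
    """Maps a tag to items of any kind (entries, photos, links)."""

    def __init__(self):
        # tag -> kind -> list of item identifiers
        self.items = defaultdict(lambda: defaultdict(list))

    def add(self, tag, kind, item):
        self.items[tag.lower()][kind].append(item)

    def lookup(self, tag, perKind=3):
        """Return up to perKind items of each kind for a tag."""
        byKind = self.items.get(tag.lower(), {})
        return dict((kind, found[:perKind]) for kind, found in byKind.items())

# Hypothetical identifiers, for illustration only.
index = TagIndex()
index.add('gehry', 'photo', 'example-album/photo-1')
index.add('gehry', 'photo', 'example-album/photo-2')
index.add('Gehry', 'entry', 'example-entry')
index.add('gehry', 'link', 'http://example.com/')

print(index.lookup('gehry', perKind=1))
```

Limiting each kind to a few items is what gives the "sampling" behavior: a click on a tag shows a little of everything rather than one long list.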

This page contains a single entry from kwc blog posted on April 4, 2005 8:20 PM.
