Thursday, April 30th, 2009

YQL execute now allows you to convert scraped data with server side JavaScript

Category: Examples, JavaScript, Yahoo!

I am a big fan of YQL, a terribly easy and fuss-free way to access APIs and mix data retrieved from them in a simple, SQL-style language. Say, for example, you want photos of Paris, France from Flickr that are licensed with Creative Commons Attribution. You can get them with a single command:

  select * from flickr.photos.info where photo_id in (select id from flickr.photos.search where woe_id in (select woeid from geo.places where text='paris,france') and license=4)

Try it out here and you’ll see what I mean.
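To run a statement like this from a script, it is passed URL-encoded as the q parameter of the YQL REST endpoint. Here is a minimal sketch, assuming the public endpoint lives at http://query.yahooapis.com/v1/public/yql and accepts a format parameter; the helper name buildYqlUrl is made up for illustration.

```javascript
// Hypothetical helper: builds a request URL for the public YQL endpoint.
// The endpoint path and the "format" parameter are assumptions, not taken from the post.
function buildYqlUrl(statement, format) {
  var base = "http://query.yahooapis.com/v1/public/yql";
  return base + "?q=" + encodeURIComponent(statement) +
    "&format=" + encodeURIComponent(format || "json");
}

// The Flickr statement from above, split across lines for readability.
var statement = "select * from flickr.photos.info where photo_id in " +
  "(select id from flickr.photos.search where woe_id in " +
  "(select woeid from geo.places where text='paris,france') and license=4)";

var url = buildYqlUrl(statement);
console.log(url);
```

The resulting URL can be requested with any HTTP client; nothing in the sketch actually contacts the service.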

The next step of this interface was to open it out to the public. You can define an “Open Table” as a simple XML schema and bring your own API into this interface with that.

Something I’ve been itching to tell the world about has finally been released: YQL execute. Instead of making the YQL language itself much more complex (and thus running in circles), we now allow you to embed JavaScript in the Open Table XML. That JavaScript runs on the YQL server and lets you access other web services, authenticate, and scrape HTML with JavaScript and E4X. As Simon Willison put it:

This is nuts (in a good way). Yahoo!’s intriguing universal SQL-style XML/JSONP web service interface now supports JavaScript as a kind of stored procedure language, meaning you can use JavaScript and E4X to screen-scrape web pages, then query the results with YQL.

Using this, you can extend the original functionality of YQL to do whatever you need. For example, you could already scrape HTML with YQL using XPath, but there was no way to use CSS selectors. Using an open table that invokes James Padolsey’s css2xpath JavaScript on the server side, this is now possible.

  use 'http://yqlblog.net/samples/data.html.cssselect.xml' as data.html.cssselect; select * from data.html.cssselect where url="www.yahoo.com" and css="#news a"

Run this query in YQL
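To give a feel for what a CSS-to-XPath conversion involves, here is a toy sketch. It is not James Padolsey’s css2xpath and covers far less: only tag names, #id, .class and the descendant combinator. The function name toyCssToXpath is made up for illustration.

```javascript
// Toy converter, NOT James Padolsey's css2xpath: handles only tag names,
// #id, .class and the descendant combinator (whitespace).
function toyCssToXpath(selector) {
  return selector.trim().split(/\s+/).map(function (part) {
    // Optional tag name, optional #id, optional .class, in that order.
    var match = /^([a-zA-Z][\w-]*)?(?:#([\w-]+))?(?:\.([\w-]+))?$/.exec(part);
    if (!match || part === "") {
      throw new Error("Unsupported selector part: " + part);
    }
    // Each part becomes a descendant step; "*" stands in for a missing tag name.
    var step = "//" + (match[1] || "*");
    if (match[2]) {
      step += "[@id='" + match[2] + "']";
    }
    if (match[3]) {
      // Match one class token inside a space-separated class attribute.
      step += "[contains(concat(' ', normalize-space(@class), ' '), ' " + match[3] + " ')]";
    }
    return step;
  }).join("");
}

console.log(toyCssToXpath("#news a")); // prints //*[@id='news']//a
```

A real converter also has to deal with child and sibling combinators, attribute selectors and pseudo-classes, which is where most of the complexity sits.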

The data table is pretty easy:

  <?xml version="1.0" encoding="UTF-8" ?>
  <table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
    <meta>
      <samplequery>select * from {table} where url="www.yahoo.com" and css="#news a"</samplequery>
    </meta>
    <bindings>
      <select itemPath="" produces="XML">
        <urls>
          <url></url>
        </urls>
        <inputs>
          <key id="url" type="xs:string" paramType="variable" required="true" />
          <key id="css" type="xs:string" paramType="variable" />
        </inputs>
        <execute><![CDATA[
          // include css to xpath convert function
          y.include("http://james.padolsey.com/scripts/javascript/css2xpath.js");
          var query = null;
          if (css) {
            var xpath = CSS2XPATH(css);
            y.log("xpath "+xpath);
            query = y.query("select * from html where url=@url and xpath=\""+xpath+"\"",{url:url});
          } else {
            query = y.query("select * from html where url=@url",{url:url});
          }
          response.object = query.results;
        ]]></execute>
      </select>
    </bindings>
  </table>
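The logic inside the execute element can be followed outside YQL by stubbing the objects the server provides. The sketch below is only a local stand-in: the y, response and CSS2XPATH stubs are hypothetical replacements that record what would be queried, and the XPath value is canned rather than computed.

```javascript
// Local stand-ins for the objects YQL provides inside <execute>.
// These stubs are hypothetical; only the branching mirrors the table.
var statements = [];
var y = {
  log: function (message) { /* no-op outside YQL */ },
  query: function (statement, params) {
    // Record the statement instead of running it.
    statements.push({ statement: statement, params: params });
    return { results: "<results/>" };
  }
};
// Canned value standing in for the function loaded from css2xpath.js.
function CSS2XPATH(css) { return "//*[@id='news']//a"; }
var response = {};

function runExecute(url, css) {
  var query = null;
  if (css) {
    // A CSS selector was supplied: convert it and filter the page by XPath.
    var xpath = CSS2XPATH(css);
    y.log("xpath " + xpath);
    query = y.query("select * from html where url=@url and xpath=\"" + xpath + "\"", { url: url });
  } else {
    // No selector: return the whole page.
    query = y.query("select * from html where url=@url", { url: url });
  }
  response.object = query.results;
}

runExecute("www.yahoo.com", "#news a");
runExecute("www.yahoo.com");
console.log(statements[0].statement);
console.log(statements[1].statement);
```

Note that the XPath is concatenated straight into the statement, so a selector that converts to an expression containing double quotes would need extra handling.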

Check the official Yahoo Developer Network blog post on YQL execute for more examples, including authentication examples for Flickr and Netflix.

Posted by Chris Heilmann at 9:28 am
11 Comments

So I can use Yahoo!’s servers to screen scrape anything I want?

That’s terrific. But what prevents abuse, such as huge attacks on some poor guy’s $5 a month data-limited hosted account?

Comment by Nosredna — April 30, 2009

@Nosredna YQL access is limited to a cap that would prevent that:

YQL has the following API usage restrictions:
Per application limit (identified by your Access Key):
* 100,000 calls per day.
Per IP limits:
* /v1/public/* 1000 calls per hour
* /v1/yql/* 10000 calls per hour
All rates are subject to change. In addition, you may also be subject to the underlying rate limits of other Yahoo and 3rd party web services.

However, what prevents me from curling his page every second? I don’t need YQL to scrape people’s pages. What YQL does, though, is cache the results, which actually means fewer hits for the scraped page.

Comment by Chris Heilmann — April 30, 2009

I did not know about James Padolsey’s function, but it seems quite incomplete compared with the one I created for vice-versa.
Here the specific function via experiments and document.query.css2xpath function
Maybe James and I could collaborate to create a complete and stable function (mine at least passes every CSS selector used in the SlickSpeed test ;) )

Comment by WebReflection — April 30, 2009

Thanks for the answer Chris. Agreed that it’s always been possible to scrape. It’s the ease of doing it and the indirection through Yahoo! servers that I was thinking of.

The caching is nice.

Comment by Nosredna — April 30, 2009

yep, tested it just now and James Padolsey’s function is both incomplete and buggy (in its results as well) … James, give me a shout if you read this.

Comment by WebReflection — April 30, 2009

@Chris, Awesome work! I recommend using WebReflection’s converter though; as mentioned it’s more complete (and less buggy) than mine.

Comment by JimmyP22 — April 30, 2009

All,

One of the main reasons we made use of James’ CSS/XPath converter was to show how easy it is to plug useful JS functions and libraries into a table, to get new functionality that people want in YQL.

Why not create a better CSS selector open data table and submit it to GitHub for others to use and share? The sample ones aren’t part of the community repository (datatables.org), so that seems a good place for a better version to go.

Jonathan

Comment by JonathanT — April 30, 2009

JonathanT, I partially agree about a better version but I do not get the “should be part of datatables.org” part … I mean what’s wrong with my or James website/project? I better see a specific one out of whatever box … what do you think about?

Comment by WebReflection — April 30, 2009

@Nosredna

Three words:
YQL honors robots.txt

Comment by infosage — April 30, 2009

unfortunately you can’t safely scrape everything on the web, because there are some conversion quirks

the main problem is that YQL returns well-formed XML, but the web is often a mess of both HTML and XHTML (also note that you can only scrape what’s inside the body tag)

look at this sample page I made (valid HTML 4): http://www.playquery.it/sandbox/yql/test3.html

this is how YQL parses it

some conversion errors:

- some HTML entities are converted to the corresponding character code (nbsp and reg), and some other not (amp,lt,gt)

- an anchor with a name=”top” now has also an id=”top”

- the textarea has some whitespaces inside, but in the YQL result is empty

- the table really freaks out (some p tags added, the form on the bottom of the page is put inside a td tag, the table is moved under the main paragraph)

and, as I noted on James’ blog some days ago (http://james.padolsey.com/javascript/using-yql-with-jsonp/), if you are forced to use the JSONP format instead of XML it’s even worse

but, anyway, if you know the source of your query very well, and it’s well-formed XHTML, I think YQL could be really awesome

Comment by postream — April 30, 2009

I looked at getting Sizzle running to do the CSS selectors before we launched YQL Execute and in order to use it you need a DOM. In order to get a DOM in Rhino you need env.js which currently runs to about 8k lines of code.

This means in order to get Sizzle working you need about 9k lines of JS. CSS2XPath currently weighs in at under 100 lines of code. XPath is natively implemented in Rhino and doesn’t require any additional code.

So, from the perspective of speed it’s 9k lines of interpretation vs 100 lines, and from the perspective of the execution cycle limits YQL has you can spend them on processing data, not creating a DOM.

Comment by sh1mmer — May 1, 2009
