The Markos Giannopoulos Blog: How to read pages with PHP and Diffbot, a visually learning robot

Thursday, 8 September 2011

How to read pages with PHP and Diffbot, a visually learning robot

Say you want to copy Facebook's any-page-on-the-web sharing feature, where a title, a photo and part of the content are automagically presented to the user. Or you want to build yet another news reading service (after you've finished building your geo-location group-chat app ;)). You can now do this with Diffbot.

But what is Diffbot?

“We provide technology that allows applications to interpret web pages like a human being,” explains Tung. “We’ve discovered that the entire Internet can be classified into about 30 different page types. What that means is that even though there’s essentially an infinite number of web pages on the web, there are certain common layouts and ways humans structure web pages that are understandable.”
Rather than looking at the tags or markup within a page, Diffbot looks at things like the X and Y coordinates for different parts of a page, the amount of screen real estate that each part is given, how a certain part is positioned relative to everything else on the page, and what kind of fonts and borders are used. Diffbot is developing an API for each type of page and currently has APIs available for both front pages and article pages.

via Robert Scoble - A look at Diffbot, a visually learning robot
Read also "Diffbot lets developers navigate code the way our eyes see the world" on TNW

I've recently been using Diffbot for a project and it runs smoothly. Specifically I've been testing the Article API to extract structured content from pages and present links to them via a website. There is also a Follow API that will understand when a page has been updated, an RSS API to follow changes in RSS feeds and a Frontpage API that is customized to read a page listing news articles.

In the Article API scenario, Diffbot expects a simple HTTP GET request (a developer token, free for personal use, is required). Diffbot returns a JSON response (it can also respond in XML). Obviously, handling it with PHP is not rocket science, but since the API documentation doesn't have a PHP example, here's my code:

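[code lang="php"]// $token is your Diffbot developer token, $url is the page to read.
// urlencode() stops characters like & or ? in $url from breaking the query string.
$geturl = "http://www.diffbot.com/api/article?tags=1&token=" . urlencode($token) . "&url=" . urlencode($url);
$json = file_get_contents($geturl);
// TRUE makes json_decode() return an associative array instead of an object.
$data = json_decode($json, TRUE);
$article_title = $data['title'];
$article_author = $data['author'];
$article_date = $data['date'];
$article_text = $data['text'];
$article_tags = $data['tags'];
// Link of the first media item Diffbot found on the page.
$article_image = $data['media'][0]['link'];[/code]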

The code is quite simple, as you can see. Diffbot can also generate tags based on the text of the page. I've found that getting the correct photo does not always work but all in all Diffbot is a life saver for what it can accomplish in a very easy manner.
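Since the photo (and occasionally other fields) can be missing, a slightly more defensive version of the snippet above may be useful. What follows is only a sketch: the function names `diffbot_article_url`, `diffbot_parse_article` and `diffbot_fetch_article` are my own invention, not part of any Diffbot library, and it assumes the response has the same fields the code above reads (`title`, `author`, `date`, `text`, `tags`, `media`).

```php
<?php
// Hypothetical helpers for illustration; not an official Diffbot client.

// Build the Article API request URL, URL-encoding every query value.
function diffbot_article_url($token, $url)
{
    $query = http_build_query(array(
        'tags'  => 1,
        'token' => $token,
        'url'   => $url,
    ), '', '&');
    return 'http://www.diffbot.com/api/article?' . $query;
}

// Read one string field, falling back to '' when it is absent.
function diffbot_field($data, $key)
{
    return isset($data[$key]) ? $data[$key] : '';
}

// Turn a raw JSON body into an array with predictable keys.
// Returns null when the body does not decode to an array.
function diffbot_parse_article($json)
{
    $data = json_decode($json, true);
    if (!is_array($data)) {
        return null;
    }
    return array(
        'title'  => diffbot_field($data, 'title'),
        'author' => diffbot_field($data, 'author'),
        'date'   => diffbot_field($data, 'date'),
        'text'   => diffbot_field($data, 'text'),
        'tags'   => (isset($data['tags']) && is_array($data['tags'])) ? $data['tags'] : array(),
        // null when Diffbot returned no media item for the page.
        'image'  => isset($data['media'][0]['link']) ? $data['media'][0]['link'] : null,
    );
}

// Fetch and parse in one step; null on a network or decoding failure.
function diffbot_fetch_article($token, $url)
{
    $json = @file_get_contents(diffbot_article_url($token, $url));
    return $json === false ? null : diffbot_parse_article($json);
}
```

With this, `$article = diffbot_fetch_article($token, $url);` gives either `null` or an array where `$article['image']` is `null` whenever no photo was found, so the page can fall back to a placeholder instead of raising an undefined index notice.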

Diffbot is already being used by some big names, like AOL for its AOL Editions app. Also, check out Diffbot's own Feedbeater service, which lets you create an RSS feed or email alert for any webpage.
