How to write an RSS aggregator
From Erlang Community
Current revision: 15:26, 14 June 2007
Download xml: howto_rss_aggregator.xml (http://wiki.trapexit.erlang-consulting.com/upload/howto/howto_rss_aggregator.xml)
Category: HowTo
Author
Tobbe
How to write an RSS aggregator
Introduction
The xml.com article "What is RSS" (http://www.xml.com/pub/a/2002/12/18/dive-into-xml.html?page=2) describes the various RSS formats and builds a simple RSS aggregator in Python. Inspired by it, I decided to do the same in Erlang.
For this example I used the OTP-R10B-3 release and the jerl start script from Jungerl. The Jungerl start script automatically puts www_tools in my code path. It will probably also let the example work on older Erlang releases, since Jungerl contains a copy of xmerl from before xmerl was added to OTP.
Getting information from an RSS feed.
Let us use the RSS feed at Slashdot in this example. The RSS info can be retrieved from the URL: http://rss.slashdot.org/Slashdot/slashdot. We use the send_req function in the ibrowse package to retrieve the file. ibrowse has to be running before the request is sent, typically by calling ibrowse:start() first.

Code listing 1.1: Getting the RSS info
1> {ok,_StatusCode,_Headers,B} = ibrowse:send_req("http://rss.slashdot.org/Slashdot/slashdot", [], get).
{ok,"200",
    [{"Age","2"},
     {"Transfer-Encoding","chunked"},
     {"Date","Thu, 07 Sep 2006 13:07:50 GMT"},
     {"Content-Type","text/xml;charset=utf-8"},
     {"Server",
      "Apache/2.0.54 (Debian GNU/Linux) mod_fastcgi/2.4.2 mod_jk/1.2.15"},
     {"Last-Modified","Thu, 07 Sep 2006 12:54:15 GMT"},
     {"ETag","MiaYBqfDcpuUu6jqri59Oyhorvc"},
     {"P3P","CP=\"ALL DSP COR NID CUR OUR NOR\""}],
    "<?xml version=\"1.0\" encoding..."}

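If ibrowse is not at hand, the same request can be sketched with httpc from the inets application that ships with OTP. This is an alternative to the call above, not what this article uses. The module name rss_fetch and the function names fetch/1 and body_of/1 are made up for illustration, and an https URL would additionally need the ssl application to be started.

```erlang
-module(rss_fetch).
-export([fetch/1, body_of/1]).

%% Fetch a feed with httpc (part of inets) instead of ibrowse.
%% Returns {ok, Body} or {error, Reason}.
fetch(Url) ->
    %% Start inets if it is not running yet.
    {ok, _Started} = application:ensure_all_started(inets),
    body_of(httpc:request(get, {Url, []}, [], [])).

%% Pick the body out of an httpc result tuple.
%% Only a 200 response counts as success in this sketch.
body_of({ok, {{_Version, 200, _Reason}, _Headers, Body}}) ->
    {ok, Body};
body_of({ok, {{_Version, Status, _Reason}, _Headers, _Body}}) ->
    {error, {http_status, Status}};
body_of({error, Reason}) ->
    {error, Reason}.
```

Calling rss_fetch:fetch("http://rss.slashdot.org/Slashdot/slashdot") would then play the role of the ibrowse call, with the body taking the place of B.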
Parsing the XML content.
We continue by parsing the XML content of the retrieved file. This time we use xmerl.

Code listing 1.2: Parsing the XML content
2> {Doc,Misc} = xmerl_scan:string(B).
{#xmlElement{name = 'rdf:RDF',
             parents = [],
             pos = 1,
             attributes = [#xmlAttribute{name = 'xmlns:rdf',
                                         parents = [],
                                         pos = 1,
                                         language = [],
                                         expanded_name = [],
                                         nsinfo = [],
                                         namespace = {"xmlns","rdf"},
                                         value = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"},
                           #xmlAttribute{name = xmlns,
                                         parents = [],
.....

Note: Before calling xmerl we ran the shell command rr/1 (as in rr(xmerl_scan)). It loads the xmerl record definitions into the shell, so the output is printed in a readable record format.
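To try the parsing step without fetching anything, the following minimal sketch parses a small inline document and reads one field of the resulting record. The module name rss_parse_demo, the function names and the sample XML are assumptions made for illustration. Inside a module the record definitions come from xmerl.hrl rather than from rr/1.

```erlang
-module(rss_parse_demo).
-export([root_name/1, demo/0]).
-include_lib("xmerl/include/xmerl.hrl").

%% Parse an XML string and return the name of the root element.
%% xmerl_scan:string/1 returns the parsed document together with
%% whatever input was left over after it.
root_name(Xml) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    Doc#xmlElement.name.

%% Hypothetical sample input, shaped like the start of an RSS 1.0 feed.
demo() ->
    root_name("<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\">"
              "<item><title>Hello</title></item></rdf:RDF>").
```

Here rss_parse_demo:demo() returns the atom 'rdf:RDF', the same root element name that appears in the listing above.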
Printing out the RSS information.
Now we have to extract the information we need from the parse tree. We will write some simple code to do this. But first, let us look at what the result looks like.

Code listing 1.3: Printing out the RSS info
3> myxml:printItems(myxml:getElementsByTagName(Doc, item)).
title: United Kingdom Leads the World in TV Downloads
link: http://slashdot.org/article.pl?sid=05/02/18/0324238from=rss
description: SumDog writes "The UK is known for many things, great food, a wonderful climate and beautiful women. However, according to a story on the Guardian, a new study puts the UK ahead in one more category: it leads the world in TV piracy, accounting for 38.4% of the world's TV downloads, with Australia coming in second at 15.6% and the US in third at a pitiful 7.3%"
date: 2005-02-18T09:31:00+00:00
author: CowboyNeal

title: Skype-Ready Phones From Motorola
link: http://slashdot.org/article.pl?sid=05/02/18/0314225from=rss
description: Hack Jandy writes "Seamlessly integrating VoIP and GSM might not be a fantasy after all, as Motorola announced their decision to build cell phones and handsets that have Skype Internet Telephony integrated into the devices. Obviously, one could use Skype for outgoing calls near wi-fi hotspots (essentially free) but default on GSM for outgoing calls in areas that lack coverage."
date: 2005-02-18T06:09:00+00:00
author: CowboyNeal

title: London Nuke Plant Loses 30 Kilos of Plutonium
link: http://science.slashdot.org/article.pl?sid=05/02/18/0027246from=rss
...........

Our first function will extract all item elements. To do this we create a function getElementsByTagName/2 which takes the XML parse tree and the Tag that we want to find.
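
Code listing 1.4: getElementsByTagName/2
%% Module header for myxml; the record definitions come from xmerl.hrl.
-module(myxml).
-export([getElementsByTagName/2, printItems/1]).
-include_lib("xmerl/include/xmerl.hrl").

%% A matching element at the head of the list is kept as it is.
getElementsByTagName([H|T], Item) when H#xmlElement.name == Item ->
    [H | getElementsByTagName(T, Item)];
%% Any other element: search its children, then the rest of the list.
getElementsByTagName([H|T], Item) when is_record(H, xmlElement) ->
    getElementsByTagName(H#xmlElement.content, Item) ++
        getElementsByTagName(T, Item);
%% A single element, such as the document root: search its children.
getElementsByTagName(X, Item) when is_record(X, xmlElement) ->
    getElementsByTagName(X#xmlElement.content, Item);
%% Text nodes, comments and so on are skipped.
getElementsByTagName([_|T], Item) ->
    getElementsByTagName(T, Item);
getElementsByTagName([], _) ->
    [].
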
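The same lookup can also be sketched with xmerl_xpath, which is part of xmerl. This is an alternative to the hand-written traversal above, not a replacement for it. The module name rss_xpath_demo and the sample XML are assumptions for illustration. One difference to keep in mind: the XPath expression //item also returns item elements nested inside other item elements, while the function above does not descend into an element that already matched.

```erlang
-module(rss_xpath_demo).
-export([items/1, demo/0]).
-include_lib("xmerl/include/xmerl.hrl").

%% Return every <item> element anywhere in the parsed document.
items(Doc) ->
    xmerl_xpath:string("//item", Doc).

%% Hypothetical sample input with two items.
demo() ->
    Xml = "<channel><item><title>A</title></item>"
          "<item><title>B</title></item></channel>",
    {Doc, _Rest} = xmerl_scan:string(Xml),
    [I#xmlElement.name || I <- items(Doc)].
```

For the sample document, rss_xpath_demo:demo() returns [item, item].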
Next we want to print each entry. The function printItems/1 walks through each item, extracts and prints the info we are interested in.

Code listing 1.5: printItems/1
printItems(Items) ->
    F = fun(Item) -> printItem(Item) end,
    lists:foreach(F, Items).

printItem(Item) ->
    io:format("title: ~s~n", [textOf(first(Item, title))]),
    io:format("link: ~s~n", [textOf(first(Item, link))]),
    io:format("description: ~s~n", [textOf(first(Item, description))]),
    io:format("date: ~s~n", [textOf(first(Item, 'dc:date'))]),
    io:format("author: ~s~n", [textOf(first(Item, 'dc:creator'))]),
    io:nl().

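The last two functions to implement are first/2 and textOf/1.

Code listing 1.6: first/2 and textOf/1
%% First child element of Item named Tag.
%% hd/1 crashes if Item has no such child.
first(Item, Tag) ->
    hd([X || X <- Item#xmlElement.content,
             is_record(X, xmlElement),
             X#xmlElement.name == Tag]).

%% Concatenated text of all text nodes directly under Item.
textOf(Item) ->
    lists:flatten([X#xmlText.value || X <- Item#xmlElement.content,
                                      is_record(X, xmlText)]).
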
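Feeds do not always carry every field (an item may lack dc:creator, for instance), and first/2 crashes in that case because hd/1 is applied to an empty list. A minimal sketch of a more forgiving variant follows. The module name rss_safe, the function text_of_first/3 and the sample item are assumptions for illustration, not part of the myxml module.

```erlang
-module(rss_safe).
-export([text_of_first/3, demo/0]).
-include_lib("xmerl/include/xmerl.hrl").

%% Text of the first child element named Tag,
%% or Default when Item has no such child.
text_of_first(Item, Tag, Default) ->
    Matches = [X || #xmlElement{name = N} = X <- Item#xmlElement.content,
                    N == Tag],
    case Matches of
        [First | _] ->
            %% Keep only the text nodes and join their values.
            lists:flatten([T#xmlText.value
                           || #xmlText{} = T <- First#xmlElement.content]);
        [] ->
            Default
    end.

%% Hypothetical sample item that has a title but no dc:creator.
demo() ->
    {Item, _Rest} = xmerl_scan:string("<item><title>Hello</title></item>"),
    {text_of_first(Item, title, ""),
     text_of_first(Item, 'dc:creator', "unknown")}.
```

For the sample item, rss_safe:demo() returns {"Hello", "unknown"}.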
The RSS aggregator.
Going from here to an RSS aggregator is easy. You just have to extend the code above so that it retrieves info from several RSS feeds. You may also want to present the info in some other format, e.g. HTML via a Yaws page. This, however, is left as an exercise for the reader.
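As a starting point for that exercise, here is one possible sketch of the "several feeds" step. It is not the author's aggregator. The module name rss_aggregate, the function names and the in-memory sample feeds are assumptions for illustration. The download step is passed in as a fun, so the sketch runs without network access; in real use that fun could wrap ibrowse:send_req/3 and return {ok, Body} for a "200" reply. Item lookup uses xmerl_xpath:string/2 from xmerl.

```erlang
-module(rss_aggregate).
-export([aggregate/2, demo/0]).
-include_lib("xmerl/include/xmerl.hrl").

%% Fetch every URL with FetchFun, a fun that returns {ok, XmlString}
%% or {error, Reason}, and return {Url, Title} pairs for all items.
%% Feeds that fail to download are skipped.
aggregate(Urls, FetchFun) ->
    lists:flatmap(fun(Url) -> titles(Url, FetchFun(Url)) end, Urls).

titles(Url, {ok, Xml}) ->
    {Doc, _Rest} = xmerl_scan:string(Xml),
    [{Url, title_of(Item)} || Item <- xmerl_xpath:string("//item", Doc)];
titles(_Url, {error, _Reason}) ->
    [].

%% Text of the title child elements of an item ("" if there is none).
title_of(Item) ->
    lists:flatten([T#xmlText.value
                   || #xmlElement{name = title, content = C}
                          <- Item#xmlElement.content,
                      #xmlText{} = T <- C]).

%% Two hypothetical in-memory feeds stand in for real downloads.
demo() ->
    Feeds = #{"feed-a" => "<rss><item><title>One</title></item></rss>",
              "feed-b" => "<rss><item><title>Two</title></item></rss>"},
    Fetch = fun(Url) ->
                    case maps:find(Url, Feeds) of
                        {ok, Xml} -> {ok, Xml};
                        error -> {error, not_found}
                    end
            end,
    aggregate(["feed-a", "missing", "feed-b"], Fetch).
```

For the sample feeds, rss_aggregate:demo() returns [{"feed-a","One"},{"feed-b","Two"}]; the "missing" feed is silently dropped, which a real aggregator might prefer to log.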


