Erlang/OTP Forums

Erlang questions mailing list ~ Screen scraping

Guest
Posted: Tue Aug 29, 2006 10:16 pm
Does anyone have tools for screen scraping with Erlang?

It's a combination of HTTP client with parsing and regexp-ing through
HTML. Ruby has nice tools for this like hpricot and scrAPI and they
parse HTML into a structure and let you query for elements based on
their class, id, name, etc.

Thanks, Joel

--
http://wagerlabs.com/

Post received from mailing list
doublec
Posted: Wed Aug 30, 2006 1:07 am
User Joined: 03 Nov 2005 Posts: 17
On 8/30/06, Joel Reymont <joelr1@gmail.com> wrote:
> Does anyone have tools for screen scraping with Erlang?

I've done screen scraping with Erlang, but I don't use anything
sophisticated. I have a function that, given a tag, returns the
contents between those tags, and another that returns an array of child
tags nested within another tag.

Using that I access tabular data fairly easily (for my purposes) and
then use string search for attributes etc. I'd be interested in
anything out there that is more sophisticated. One advantage of simple
string searches is that they handle bad HTML a bit more easily.
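
For illustration only, here is a minimal sketch of what such string-search
helpers might look like. It is not the code described above: the module name
scrape_sketch and the function names between/2 and all/2 are made up for this
example, and it assumes OTP 20 or later for string:find/2 and string:split/2.

```erlang
-module(scrape_sketch).
-export([between/2, all/2]).

%% Hypothetical helpers: plain string search, not a parser.
%% Known limits: "<" ++ Tag also matches longer tag names with the
%% same prefix (Tag = "t" would match "<table"), nested tags of the
%% same name are not handled, and a ">" inside an attribute value
%% confuses the search.

%% Find the next <Tag ...>...</Tag> pair in Html.
%% Returns {Inner, After} or 'none' if the pair is incomplete.
next(Tag, Html) ->
    case string:find(Html, "<" ++ Tag) of
        nomatch ->
            none;
        FromOpen ->
            %% Drop the rest of the opening tag, attributes included.
            case string:split(FromOpen, ">") of
                [_OpenTag, Rest] ->
                    case string:split(Rest, "</" ++ Tag ++ ">") of
                        [Inner, After] -> {Inner, After};
                        [_Unclosed]    -> none
                    end;
                [_NoEnd] ->
                    none
            end
    end.

%% Contents of the first <Tag>...</Tag>, or 'none'.
between(Tag, Html) ->
    case next(Tag, Html) of
        {Inner, _After} -> Inner;
        none            -> none
    end.

%% Contents of every <Tag>...</Tag> found in sequence, in order.
all(Tag, Html) ->
    case next(Tag, Html) of
        {Inner, After} -> [Inner | all(Tag, After)];
        none           -> []
    end.
```

With a row such as "<tr><td>1</td><td>2</td></tr>", calling
scrape_sketch:all("td", Row) gives ["1", "2"]. An unclosed tag simply yields
no match instead of an error, which is the sense in which this approach
tolerates bad HTML.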

Chris.
--
http://www.bluishcoder.co.nz
Ke Han
Posted: Wed Aug 30, 2006 3:29 am
User Joined: 02 Mar 2005 Posts: 107 Location: Shanghai
Joel,
How about jungerl's www_tools?

Here is a snippet of its example code to show you how easy it is to
tokenize an HTML stream or file and harvest elements of interest:

%%********************************
%% Tokenise an HTML file, then collect its links and images.
file(File) ->
    Toks = html_tokenise:file2toks(File),
    analyse(Toks).

%% Each comprehension matches start tags of one kind and picks a
%% single attribute out of that tag's attribute list L.
analyse(Toks) ->
    Hrefs   = [H || {tagStart, "a", L} <- Toks, {"href", H} <- L],
    Images1 = [S || {tagStart, "img", L} <- Toks, {"src", S} <- L],
    Images2 = [S || {tagStart, "body", L} <- Toks, {"background", S} <- L],
    {remove_duplicates(Hrefs), remove_duplicates(Images1 ++ Images2)}.
%%********************************
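
The same comprehension pattern can be extended to the class-based queries
asked about in the original question. The sketch below is self-contained and
does not call www_tools; it only assumes tokens shaped like the
{tagStart, Name, Attrs} tuples in the snippet above. The module name
toks_sketch, by_class/2, and this particular remove_duplicates/1 are
hypothetical (the snippet does not show the original helper), and
string:lexemes/2 needs OTP 20 or later.

```erlang
-module(toks_sketch).
-export([links/1, by_class/2, remove_duplicates/1]).

%% One plausible remove_duplicates/1: keep the first occurrence of
%% each element and preserve order. (Assumption: the helper used in
%% the www_tools example is not shown, so this is a stand-in.)
remove_duplicates(List) ->
    remove_duplicates(List, []).

remove_duplicates([], Seen) ->
    lists:reverse(Seen);
remove_duplicates([H | T], Seen) ->
    case lists:member(H, Seen) of
        true  -> remove_duplicates(T, Seen);
        false -> remove_duplicates(T, [H | Seen])
    end.

%% All distinct href values found on <a> start tags.
links(Toks) ->
    remove_duplicates(
      [H || {tagStart, "a", Attrs} <- Toks, {"href", H} <- Attrs]).

%% Start tags whose class attribute lists Class as one of its
%% space-separated names. Tokens of any other shape are skipped
%% by the generator pattern.
by_class(Class, Toks) ->
    [{Name, Attrs} || {tagStart, Name, Attrs} <- Toks,
                      {"class", C} <- Attrs,
                      lists:member(Class, string:lexemes(C, " "))].
```

Given a token list containing
{tagStart, "a", [{"href", "/a"}, {"class", "nav main"}]}, calling
toks_sketch:by_class("nav", Toks) returns that tag's name and attributes,
while by_class("na", Toks) does not match it, because whole class names are
compared rather than substrings.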

ke han
klacke
Posted: Wed Aug 30, 2006 7:24 am
User Joined: 28 Feb 2005 Posts: 138
Chris Double wrote:
> On 8/30/06, Joel Reymont <joelr1@gmail.com> wrote:
>> Does anyone have tools for screen scraping with Erlang?
>

There is a decent and forgiving HTML parser in the Yaws source tree
written by Johan Bevemyr. I'd start with that.

/klacke



--
Claes Wikstrom -- Caps lock is nowhere and
http://www.hyber.org -- everything is under control
cellphone: +46 70 2097763
Guest
Posted: Wed Aug 30, 2006 9:05 am
Hi there!

Good advice, Ke Han.
And here is a patch for Joe's www_tools to fix some parsing problems.

cheers
Youn
anders_n
Posted: Wed Aug 30, 2006 2:13 pm
User Joined: 28 Feb 2005 Posts: 155 Location: Saltillo, Mexico
On 8/30/06, Claes Wikstrom <klacke@hyber.org> wrote:
> Chris Double wrote:
> > On 8/30/06, Joel Reymont <joelr1@gmail.com> wrote:
> >> Does anyone have tools for screen scraping with Erlang?
> >
>
> There is a decent and forgiving HTML parser in the Yaws source tree
> written by Johan Bevemyr. I'd start with that.
>

I tried to use it once and found it not so forgiving on broken
real-world(tm) HTML. If it doesn't find the end tags it expects, it chokes.

/Anders
Guest
Posted: Wed Aug 30, 2006 4:16 pm
How would you compare it to www_tools and its parser?

Have you tried that?

On Aug 30, 2006, at 3:08 PM, Anders Nygren wrote:

> I tried to use it [yaws parser] once and I found it to not be so
> forgiving on broken real-world(tm) html. If it doesn't find the end
> tags it expects it chokes.

--
http://wagerlabs.com/
anders_n
Posted: Wed Aug 30, 2006 4:23 pm
User Joined: 28 Feb 2005 Posts: 155 Location: Saltillo, Mexico
On 8/30/06, Joel Reymont <joelr1@gmail.com> wrote:
> How would you compare it to www_tools and its parser?
>
> Have you tried that?
>
No, I never tried www_tools, sorry.

/Anders


All times are GMT