| Author |
Message |
|
| Guest |
Posted: Wed Aug 15, 2007 6:57 pm |
|
|
|
Guest
|
I've been trying to learn Erlang for a while, and I recently found
what I thought would be an easy starter project. I currently have a
simple application that reads data from a couple of XML files using
SAX and inserts it using an RPC over HTTP.
I'm not sure about the terminology here (I've been stuck in OO land
for so long that everything looks like an object), but here's what I'm
thinking: one thread reads the XML files and pieces together the data,
then hands off each record to a pool of workers that issue the
HTTP requests. Or maybe the XML-reading part could just spawn a new
thread for each record it reads and ensure that at most X are running
at any time?
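A minimal sketch of the "reader hands records to at most X workers" idea. It is written in Python purely to show the shape; in Erlang the workers would be processes receiving messages rather than threads sharing a queue. The names `run_pipeline` and `handle` are made up for illustration, `handle` stands in for the HTTP call, and there is no error handling.

```python
import queue
import threading


def run_pipeline(records, handle, max_workers=4):
    """Pass each record to handle(), with at most max_workers running at once."""
    # Bounded queue: the reader blocks when the workers fall behind,
    # so a huge input is never buffered in memory all at once.
    work = queue.Queue(maxsize=max_workers * 2)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            record = work.get()
            if record is None:          # sentinel: no more work for this worker
                break
            outcome = handle(record)    # stand-in for issuing the HTTP request
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(max_workers)]
    for t in threads:
        t.start()
    for record in records:              # the "reader"; may be a lazy generator
        work.put(record)
    for _ in threads:
        work.put(None)                  # one sentinel per worker
    for t in threads:
        t.join()
    return results


# Results arrive in completion order, not input order.
doubled = run_pipeline(range(10), lambda record: record * 2, max_workers=3)
```

The fixed pool and the "spawn per record, cap at X" variant differ mainly in who enforces the limit; here the limit is simply the number of worker threads.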
The HTTP request was easy enough to get working, but I'm having
trouble reading the XML. I used xmerl_scan:file to parse the
file, but that loads the whole file into memory before it starts processing.
I took a look at Erlsom and its SAX reader examples, but those read
the entire file into a binary before passing it to the XML reader.
Thanks,
Patrik
_______________________________________________
erlang-questions mailing list
erlang-questions@erlang.org
http://www.erlang.org/mailman/listinfo/erlang-questions
Post received from mailinglist |
|
|
|
| Guest |
Posted: Wed Aug 15, 2007 7:24 pm |
|
|
|
Guest
|
I'd also be interested to hear how experienced Erlangers handle this.
I'm trying to do some heavy SAX parsing as well and it'd be nice to
not have to load the entire file into memory at once.
--Kevin
Post received from mailinglist |
|
|
|
| Guest |
Posted: Wed Aug 15, 2007 7:26 pm |
|
|
|
Guest
|
Interesting - I've been writing some new XML libraries, and handling
infinite (well, very large) streams is one of the problems I've been
thinking about.
I'll poke around tomorrow and send you some code that might help.
/Joe Armstrong
Post received from mailinglist |
|
|
|
| Guest |
Posted: Wed Aug 15, 2007 8:20 pm |
|
|
|
Guest
|
On Aug 15, 2007, at 11:23 , Patrik Husfloen wrote:
> I'm not sure about the terminology here, I've been stuck in OO land
> for so long that everything looks like an object, but here's what I'm
> thinking: One thread reading the xmls and piecing together the data,
> and then handing off each record to a pool of workers that issue the
> http requests, or, maybe the xml-reading part could just spawn a new
> thread for each record it reads, and ensure that only X are running at
> the most?
This sounds very similar to the design of my load replay tool. I've
got a tool that reads a pcap file and writes out a binary file that I
suppose is conceptually similar to XML. The playback tool reads that
file and issues HTTP requests with the same types of payload (some
contents rewritten for validity on playback) with the same timings
(to whatever scale is desirable) and logs the results. It works like
this:
1) There's an overseer process that starts all of the other
processes and facilitates communication among them.
2) One process is responsible for reading the file, sleeping as
appropriate, and sending records up to the overseer.
3) Another process is responsible for performing HTTP requests. It
receives the messages from the overseer, issues an async http request
against inets, and adds the result to a dict with a timer. When a
response comes back from inets, it looks up the request and sends the
timing, request, and results back up.
4) The logging process figures out what the request meant, on
behalf of what user it was sent, and some other stuff and logs it.
On startup, I find all available nodes and run one of the requestor
processes (#3) on each node. The overseer has a queue of these
processes and pops the next available requestor off the front, sends
it a request, and adds it to the back of the queue again.
If you want to control how many concurrent requests you're
executing, you can issue the requests synchronously and use a process
queue like I've got there.
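The requestor queue described above (pop from the front, send a request, push to the back) can be sketched roughly as follows. This is a Python illustration, not the actual tool: the `Overseer` class and `make_requestor` helper are hypothetical names, and each "requestor" is a plain callable standing in for an Erlang process on some node.

```python
from collections import deque


class Overseer:
    """Hands each record to the next requestor in round-robin order."""

    def __init__(self, requestors):
        self.requestors = deque(requestors)

    def dispatch(self, record):
        requestor = self.requestors.popleft()   # next requestor off the front
        requestor(record)                       # send it the request
        self.requestors.append(requestor)       # and put it at the back again


def make_requestor(name, log):
    """Hypothetical requestor that only records what it was asked to do."""
    def send(record):
        log.append((name, record))
    return send


log = []
overseer = Overseer([make_requestor(name, log) for name in ("a", "b", "c")])
for record in range(5):
    overseer.dispatch(record)
```

Because `dispatch` calls the requestor synchronously here, the number of requests in flight is bounded by how the requestors themselves are run, which is the concurrency-control point made in the post.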
--
Dustin Sallings
Post received from mailinglist |
|
|
|
| dmitriid |
Posted: Thu Aug 16, 2007 6:52 am |
|
|
|
User
Joined: 17 Aug 2006
Posts: 213
|
Patrik Husfloen wrote:
> I've been trying to learn erlang for a while, and I recently found
> what I thought to be an easy starter project. I currently have a
> simple application that reads data from a couple of Xml files using
> SAX, and inserts it using a rpc over http.
>
>
I guess that for large files you could do better with
http://www.codeproject.com/cpp/HTML_XML_Scanner.asp
"It is fast. We managed to reach a speed of scanning nearly 40 MB of XML
per second (depends on the hardware you have, of course)."
And bind that to Erlang... Mmmmmm...
Post received from mailinglist |
|
|
|
| Guest |
Posted: Thu Aug 16, 2007 7:34 am |
|
|
|
Guest
|
Hi,
There is already support for handling infinite streams in the xmerl application.
You should use the xmerl_eventp module and the functions there.
The documentation is here: http://www.erlang.org/doc/man/xmerl_eventp.html
We are using it ourselves and it works OK, but I admit that the
documentation is a bit sparse.
/Kenneth (Erlang/OTP team at Ericsson)
Post received from mailinglist |
|
|
|
| Guest |
Posted: Thu Aug 16, 2007 8:27 am |
|
|
|
Guest
|
Afterthought - I *won't* be sending you any code.
My XML stuff is in the middle of a big rewrite - I'll do this first.
I'm trying to make several inter-related XML processing things. I
don't believe in the one-tool-suits-all approach for manipulating XML.
Parsing XML raises a number of tricky design issues. What
do we want to do with the XML? Do we have to handle
infinite (or at least very large) inputs? Is all the input
available at the time of parsing, or does it arrive in
fragmented chunks from a stream? If the data is streamed, do we
want to handle the chunks as they come, in a re-entrant parser,
or do we want to wait until all the chunks have come and then
do the parsing? In the latter case we'll have to pre-scan the data so
that we know when to do the parsing.
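To make the "handle the chunks as they come" option concrete, here is a small sketch of a re-entrant (incremental) parse. It uses Python's standard `xml.etree.ElementTree.XMLPullParser` only as a stand-in for whatever an Erlang toolkit would offer; `parse_chunks` and the `item` tag are hypothetical names chosen for the example.

```python
import xml.etree.ElementTree as ET


def parse_chunks(chunks, tag):
    """Yield (attributes, text) for each completed <tag> element.

    The parser is fed one chunk at a time and keeps its own state between
    calls, so a chunk may end in the middle of a tag.
    """
    parser = ET.XMLPullParser(events=("end",))
    for chunk in chunks:
        parser.feed(chunk)
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield dict(elem.attrib), elem.text
                elem.clear()            # drop the element's content once handled
    parser.close()


# The first chunk deliberately stops in the middle of a closing tag.
chunks = ['<items><item id="1">a</it', 'em><item id="2">b</item></items>']
records = list(parse_chunks(chunks, "item"))
```

Note that this still builds (and then empties) tree nodes, so it answers the re-entrancy question rather than the token-level one raised further down.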
Given that we've got a tokenizer can we write a parser that
works with lists of tokens, or with streams, or do we have to
write a number of different parsers to handle the different
cases?
Do we want to write a validating parser, or a non-validating parser?
Should it be re-entrant or not?
Do we want to handle simple ASCII character sets or many different
character sets, and is the code very different in the two cases?
Do we want to *exactly reconstruct* the input, or
should the parse tree represent the logical equivalent of the
input? For example, do we want to pass tag attributes in the
same order as they appear in the input? Do we want to exactly
retain white space and tabs in places where they are not
semantically important?
These are difficult design questions and it is difficult to
write the libraries in such a way that all of these things can be
done. If we write a very general set of routines they will
probably not be very fast for a specific purpose. If we write fast
specialised routines, they will not be very general.
A lot of XML processing can be done at a token level alone - there is no
need to even have a well-formed document - here parsing and validating would
be a waste of time.
Then we have to decide on performance - a set of routines that works correctly
on GByte files will also work on small files - but if we were only processing
small files then more efficient algorithms would be possible. Do we have to
write two sets of routines (for large and small files), and can they
share common code?
Anyway - I'm trying to make a toolkit that allows you to manipulate
a document either as a stream of tokens, or as a well-formed document,
or as a validated document.
Another question I have is:
What do you want to do with an infinite document?
(here infinite means "too big to keep the parse tree in memory in
an efficient manner")
Do you want to:
a) - produce another infinite document
b) - extract a sub-set according to some filter rules
If it's a), are the things in the output document in the same order
as the things in the input document? I guess both a) and b) would be
candidates for some kind of higher-order functions that work on XML
parse trees.
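As a rough illustration of what such higher-order functions might look like, the sketch below treats a document as a lazy stream of already-parsed items and covers both cases. It is Python, not a proposal for the actual toolkit; `extract_subset`, `transform_stream`, and the `(tag, text)` item shape are all assumptions made for the example.

```python
import itertools


def extract_subset(items, keep):
    """Case b): lazily yield only the items accepted by the filter rule."""
    return (item for item in items if keep(item))


def transform_stream(items, rewrite):
    """Case a): lazily yield one rewritten item per input item, in input order."""
    return (rewrite(item) for item in items)


# An endless input: nothing is materialized until a consumer asks for it.
endless = (("n", str(i)) for i in itertools.count())
evens = extract_subset(endless, lambda item: int(item[1]) % 2 == 0)
first_three = list(itertools.islice(evens, 3))
```

Both functions preserve input order and hold one item at a time, which is what makes them usable on a document too large to keep as a full parse tree.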
Lots to think about
/Joe
Post received from mailinglist |
|
|
|
| Guest |
Posted: Thu Aug 16, 2007 5:33 pm |
|
|
|
Guest
|
Well, I think I have a decent idea of where to go from here.
I'll report back this weekend with results.
Thanks everyone,
Patrik
Post received from mailinglist |
|
|
|
| 0x6e6562 |
Posted: Fri Aug 17, 2007 6:34 am |
|
|
|
User
Joined: 12 Jul 2007
Posts: 250
|
On 8/16/07, Joe Armstrong <erlang@gmail.com> wrote:
> Another question I have is:
>
> What do you want to do with an infinite document?
>
> (here infinite means "too big to keep the parse tree in memory in
> an efficient manner")
If I understand you correctly, you are asking about real-world
scenarios where one would want to parse massive XML files in a
streaming fashion? If so, one example use case is a payments system
that I'm working on that accepts instructions in an ISO 20022 XML file,
which can contain in excess of 100,000 items in a single file (but
there is no hard limit). Each item might be on average 4k, but could
be up to 10k (this is the size of the uncompressed ASCII XML
encoding). Then if you have to process multiple input files
concurrently, you can't materialize the whole thing in memory. So
the approach is to parse and materialize in a streaming fashion and
send the resulting data objects off to a downstream processing system.
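A minimal sketch of that "parse, materialize one item, send it downstream" loop, in Python for illustration only. The element names `batch`, `payment`, `id`, and `amount` are invented and are not ISO 20022 names (real files are namespaced, so ElementTree tags would look like `{namespace}Name`); `stream_items` and `deliver` are hypothetical, and the sketch assumes each item is a direct child of the root with flat child elements.

```python
import io
import xml.etree.ElementTree as ET


def stream_items(source, item_tag, deliver):
    """Materialize each <item_tag> element as a dict and hand it to deliver()."""
    count = 0
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)        # first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag == item_tag:
            deliver({child.tag: child.text for child in elem})
            count += 1
            root.clear()                # forget items that were already delivered
    return count


xml_bytes = (
    b"<batch>"
    b"<payment><id>1</id><amount>10.00</amount></payment>"
    b"<payment><id>2</id><amount>5.50</amount></payment>"
    b"</batch>"
)
received = []
delivered = stream_items(io.BytesIO(xml_bytes), "payment", received.append)
```

Memory use stays proportional to one item rather than to the whole file, so several input files can be processed side by side.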
Anyway, apologies if I misunderstood you and this comment is
irrelevant to this conversation.
HTH,
Ben
Post received from mailinglist |
|
|
|
| lummie |
Posted: Wed Jun 18, 2008 9:56 pm |
|
|
|
Joined: 18 Jun 2008
Posts: 3
|
I have to extract information from some extremely large XML files (6GB+) as well. All signs point to xmerl_eventp, but the documentation is not exactly fleshed out, and as I've only been doing Erlang for a few weeks I am really struggling.
Did anyone go down the eventp route, and did you find any documentation? Or can anyone provide a basic example of the callback module that is required?
Any help is greatly appreciated...
regards,
Matt
P.S. Great book, Joe. I've gone from zero to attempting to code an OLAP system in 3 weeks |
|
|
|
| lummie |
Posted: Wed Jun 18, 2008 10:10 pm |
|
|
|
Joined: 18 Jun 2008
Posts: 3
|
I have to extract information from some extremely large XML files (6GB+)
as well. All signs point to xmerl_eventp, but the documentation is not
exactly fleshed out, and as I've only been doing Erlang for a few weeks I
am really struggling.
Did anyone go down the eventp route, and did you find any documentation?
Or can anyone provide a basic example of the callback module that is
required?
Any help is greatly appreciated...
regards,
Matt
P.S. Great Pragmatic book, Joe.
Post received from mailinglist |
|
|
|
| emofine |
Posted: Thu Jun 19, 2008 2:47 am |
|
|
|
User
Joined: 29 Aug 2008
Posts: 24
|
I don't know xmerl_eventp, but for such large XML files it may be prudent performance-wise to interface to a fast C SAX parser such as expat. The ejabberd project has an Erlang driver written to use the expat parser. It's also some pretty good code from a proven real-world system that one could learn a lot from. You may be able to adapt the code (I think it's called xml_stream.erl, plus the C files under c_src), although if, as you write, you have only been doing Erlang for a few weeks, it may be a bit of a leap to get into linked-in drivers.
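For a feel of what driving expat looks like, here is a sketch using Python's standard `xml.parsers.expat` binding. It is not the ejabberd driver and says nothing about its API; it only shows the callback style and chunked feeding that an Erlang wrapper around expat would also rely on. `read_chunks` and `count_elements` are hypothetical helper names.

```python
import io
import xml.parsers.expat


def read_chunks(fileobj, size=64 * 1024):
    """Yield successive blocks from a binary file object until it is exhausted."""
    while True:
        chunk = fileobj.read(size)
        if not chunk:
            break
        yield chunk


def count_elements(chunks):
    """Count start tags by name without ever building a tree."""
    counts = {}

    def start_element(name, attrs):
        counts[name] = counts.get(name, 0) + 1

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = start_element
    for chunk in chunks:
        parser.Parse(chunk, False)      # more data will follow
    parser.Parse(b"", True)             # signal the end of the document
    return counts


# A tiny chunk size forces tags to be split across chunk boundaries.
data = io.BytesIO(b"<log><entry/><entry/><entry/></log>")
totals = count_elements(read_chunks(data, size=7))
```

With a real file, `open(path, "rb")` would replace the `io.BytesIO` object, and memory use stays at roughly one chunk regardless of file size.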
Hope this helps.
Post received from mailinglist |
|
|
|
|
|
All times are GMT
|
|
|
|
|