Arc Forumnew | comments | leaders | submitlogin
Strings as lists: scanner.arc and treeparse.arc (about wiki-arc.arc's implementation)
13 points by almkglor 6127 days ago | 31 comments
Strings as lists are very useful; they made Arkani, the wiki in arc, much easier to implement. Scanners allow us to treat strings as lists; this also makes a parser combinator library, such as raymyers' treeparse.arc, very much useable for string parsing.

pg once mentioned that he might actually, some day, implement strings such that their interface would be identical to that of lists. Memory inefficiency concerns aside (I've posted a memory-cheap implementation of lists which uses arrays for much of the list run; it has all the semantics of lists but has some of the access times of arrays)[1], I've found it very useful in implementing Arkani (Arki?), the wiki in Anarki. (It's in the file wiki-arc.arc on what used to be arc-wiki.git)

Scanners[2] are an attempt to, primarily, use strings as lists. pg hasn't implemented it in ArcN yet because it's "misleading", since 'scar (= (car foo) 42) and 'scdr (= (cdr foo) 42) won't work properly on strings. However, scanners represent the realization that 'scar and 'scdr are pretty rare anyway; so you might as well create an abstract "limited" form of list, which supports only 'car and 'cdr operations. These scanners, among other things, can be used to scan into strings.

For example, in Arkani, the history list is modelled as exactly that: a list of changes between revisions of the article. However, it has to be stored on-disk, as a string of UTF-8 (it seems that mzscheme can actually handle this properly). We could store it as an Arc-readable representation of the list, but this has the drawback that it makes the metadata longer. As an example, this is how the diff between revisions might look:

  ((4 skip) delete (insert "article.\r\n\r\n") delete (68 skip))
Instead, Arkani uses its own format:

  4sd12iarticle.
  
  d68s@
(the newlines exist because of the \r\n sequence). Each one-letter command might have a number before it, representing the number of times it is executed, or in the case of `i' the number of characters to insert. An `@' ends the diff list. However, this is obviously not parseable by 'read.

In addition, most of the time we expect that users would be more interested in changes in more recent versions of the article rather than in older ones. If we were to use the built-in Arc reader, it would parse the entire history; however, scanners are inherently lazy, and won't execute the 'cdr unless you actually ask for it (and will also take the liberty of memoizing it).

Now although the scanner library I created includes scanners for strings (as lists of characters), it also allows you to create your own scanner. In the case of the Arkani history reader, it reads through the string, decomposing each history entry in the string and creating a virtual object for each history entry for us.

However, it doesn't scan through the entire string. Instead it just computes the 'car of the history list, and then adds a promise for the 'cdr - the promise being to call itself, but with the index set to after the end of the current entry.

Another part of Arkani which uses scanners is the paragraph divider. When rendering the page, Arkani first tries to figure out paragraph divisions. Similar to the way the history log scanner works, the paragraph divider first scans through the text, ignoring empty lines until it reaches a set of non-empty lines. It then ends the paragraph just prior to an empty line, and adds a promise to look at the next paragraph starting after the empty line.

However, the main advantage of scanners is really the way in which they can be used in conjuction with treeparse.arc [3]. Treeparse.arc was designed for use with lists, not strings; however, fortunately by using scanners, strings are lists (or rather, can be wrapped by something which quacks convincingly like a list).

For example, to detect links [[like this]], we have the following code in wiki-arc.arc:

        (= open-br
          (seq-str "[["))
        (= close-br
          (seq-str "]]"))
        (= p-alphadig
          (pred alphadig:car anything))
     ...
        (= plain-wiki-link
          (seq open-br
               ; should really be (many anything), however treeparse.arc
               ; currently does not do backtracking on 'many
               (sem on-plain-wiki-link (many (anything-but #\| close-br)))
               close-br
               (sem on-wiki-link-completed (many p-alphadig))))

'seq-str is an extension to 'seq, and simply wraps the string in a scanner that 'seq can understand. Basically, it simply searches for the literal sequence of characters in the given string. 'seq, of course, simply scans for the given series of sub-parsers. 'many means 0 or more instances of a parser, while 'anything-but means any element, except for the elements listed. 'pred adds a predicate function, so p-alphadig means that we add an 'alphadig predicate to 'anything.<p>'sem is used to add "semantics" or meaning. 'sem accepts a function and a parser. If the parser succeeds, 'sem passes the parsed sublist to the function; in the case above, the function 'on-plain-wiki-link stores the link destination, while 'on-wiki-link-completed prints the link's text (which includes the trailing alphanumeric characters on [[link]]s).

Using the treeparse library on scanners is quite easy, and allows us to use the same library for lists (the original intended function for treeparse) and strings (possible by the user of scanners)

[1] http://arclanguage.org/item?id=4146

[2] http://arclanguage.org/item?id=4739

[3] http://arclanguage.org/item?id=3970



4 points by raymyers 6126 days ago | link

Great to see someone using treeparse. Using scanner-string to view strings as lists is an approach I had never considered. As nex3 said, scanners are a surprisingly useful abstraction, showing yet more uses for lazy evaluation.

-----

1 point by almkglor 6126 days ago | link

That said, the formatter in wiki-arc is plenty darned slow on &amp; codes (I used a really long 'alt form). Lacking a profiler, I can't really tell whether it's treeparse or scanner-string which is slow. Any suggestions on optimization? I think one problem is that each parser returns a list of stuff, without any chance to reuse that list (especially for 'seq forms). Another problem might be that scanner-string is just too slow for easy usage.

-----

1 point by raymyers 6126 days ago | link

I haven't fully digested the wiki parser yet. As a first thought, the use of enclose-sem confuses me a bit -- seems like a reinvention of filt. I doubt that would be a performance issue of course.

Maybe the grammar should be factored, or maybe this is an opportunity to optimize treeparse.

Try this: convert the string to a normal list of characters in advance, and see if the performance improves:

  (map idfn (scanner-string "hello world"))
If that doesn't help, then scanner-string is not the problem.

-----

3 points by almkglor 6126 days ago | link

Hmm, sorry, I didn't fully digest 'filt either. I thought 'filt was for transforming the 'parsed field in the return value? 'enclose-sem is intended to act on the 'actions field in the return value (although I did end up passing the entire return value).

The main slowdown occurred on adding &amp; codes. Here's my most degenerate case:

&Agrave; &Aacute; &Acirc; &Atilde; &Auml; &Aring; &AElig; &Ccedil; &Egrave; &Eacute; &Ecirc; &Euml; &Igrave; &Iacute; &Icirc; &Iuml; &Ntilde; &Ograve; &Oacute; &Ocirc; &Otilde; &Ouml; &Oslash; &Ugrave; &Uacute; &Ucirc; &Uuml; &szlig; &agrave; &aacute; &acirc; &atilde; &auml; &aring; &aelig; &ccedil; &egrave; &eacute; &ecirc; &euml; &igrave; &iacute; &icirc; &iuml; &ntilde; &ograve; &oacute; &ocirc; &oelig; &otilde; &ouml; &oslash; &ugrave; &uacute; &ucirc; &uuml; &yuml; &iquest; &iexcl; &sect; &para; &dagger; &Dagger; &bull; &ndash; &mdash; &lsaquo; &rsaquo; &laquo; &raquo; &lsquo; &rsquo; &ldquo; &rdquo; &trade; &copy; &reg; &cent; &euro; &yen; &pound; &curren; x&sup1; x&sup2; x&sup3; &alpha; &beta; &gamma; &delta; &epsilon; &zeta; &eta; &theta; &iota; &kappa; &lambda; &mu; &nu; &xi; &omicron; &pi; &rho; &sigma; &sigmaf; &tau; &upsilon; &phi; &chi; &psi; &omega; &Gamma; &Delta; &Theta; &Lambda; &Xi; &Pi; &Sigma; &Phi; &Psi; &Omega; &int; &sum; &prod; &radic; &minus; &plusmn; &infin; &asymp; &prop; &equiv; &ne; &le; &ge; &times; &middot; &divide; &part; &prime; &Prime; &nabla; &permil; &deg; &there4; &alefsym; &oslash; &isin; &notin; &cap; &cup; &sub; &sup; &sube; &supe; &not; &and; &or; &exist; &forall;

Takes about 5-6 seconds to render; also, if I do some thing on the repl (such as searching using (help "searchstring"), or loading some random library) the parsing takes up to 12 seconds. Anyway I've added a rendering time at the lower right of each rendered page. Note also that I've added caching, to disable caching you'll need to search through wiki-arc.arc for *wiki-def and change the (cached-table) call to (cached-table 'cachetime 0).

-----

3 points by raymyers 6126 days ago | link

I think I've found the problem. Calling a many parser on a list that long (over 1000 characters) takes waaay too long.

Even something simple like this takes around 6 second on my machine:

   ((many anything) (range 1 1000)))
If treeparse is going to be feasible for parsing large strings, the basic combinators like maybe and alt will need to speed up.

-----

1 point by almkglor 6126 days ago | link

I tried the following modification on 'many:

  (def many (parser)
    "Parser is repeated zero or more times."
    (fn (remaining) (many-r parser remaining nil nil nil nil)))

  (let lastcdr (afn (p) (aif (cdr p) (self it) p))
    (def many-r (parser li acc act-acc acctl act-acctl)
      (iflet (parsed remaining actions) (parse parser li)
             (do
               (when parsed
                 ; edit: necessary, it seems that some of the other
                 ; parsers reuse the return value
                 (zap copy parsed)
                 ; end of edit
                 (if acc
                     (= (cdr acctl) parsed)
                     (= acc parsed))
                 (= acctl (lastcdr parsed)))
               (when actions
                 ; edit: necessary, it seems that some of the other
                 ; parsers reuse the return value
                 (zap copy actions)
                 ; end of edit
                 (if act-acc
                     (= (cdr act-acctl) actions)
                     (= act-acc actions))
                 (= act-acctl (lastcdr actions)))
               (many-r parser remaining
                       acc act-acc acctl act-acctl))
             (return acc li act-acc))))
Basically instead of using join, I used a head+tail form of concatenating lists. It seems to work, and the optimization above seems to drop the test:

   ((many anything) (range 1 1000)))
down to 27 msec (edited: 58msec) on my machine (it was about 7350msec on the older version)

What are your thoughts? The code now looks unprintable. Also, I'm not 100% sure of its correctness.

UPDATE: yes, it's not correct, however the edited version above seems to work now. Rendering of my "difficult" page has dropped to 1100msec.

-----

2 points by almkglor 6126 days ago | link

I've since added a 'tconc facility to Anarki. Basically tconc encapsulates away the head+tail form of list catenation; a single cons cell is used with car==head and cdr==tail.

The head of the list is the start of the list, while the tail of the list is the last cons cell:

  cons
    O->cdr
    v
    car

  the list (1 2 3 4 5):
  O->O->O->O->O->nil
  v  v  v  v  v
  1  2  3  4  5

  the tconc cell for the above list:
  tconc cell
  O-----------+
  | head      | tail
  v           v
  O->O->O->O->O->nil
  v  v  v  v  v
  1  2  3  4  5
'tconc creates a new cell and modifies the tconc cell to repoint the tail to the new tail. You can extract the currently concatenated list by using 'car on the tconc cell.

The diff between my version of treeparse and yours is now:

  --- treeparse.arc     2008-03-21 11:59:13.000000000 +0800
  +++ m_treeparse.arc   2008-03-22 23:00:51.000000000 +0800
  @@ -23,4 +23,6 @@
   ; Examples in "lib/treeparse-examples.arc"
   
  +(require "lib/tconc.arc")
  +
   (mac delay-parser (p)
     "Delay evaluation of a parser, in case it is not yet defined."
  @@ -112,12 +114,12 @@
   (def many (parser)
     "Parser is repeated zero or more times."
  -  (fn (remaining) (many-r parser remaining nil nil)))
  +  (fn (remaining) (many-r parser remaining (tconc-new) (tconc-new))))
   
   (def many-r (parser li acc act-acc)
     (iflet (parsed remaining actions) (parse parser li)
            (many-r parser remaining
  -                 (join acc parsed) 
  -                 (join act-acc actions))
  -         (return acc li act-acc)))
  +                 (nconc acc (copy parsed))
  +                 (nconc act-acc (copy actions)))
  +         (return (car acc) li (car act-acc))))
   
   (def many1 (parser)
edit: note that use of 'tconc/'nconc is slightly slower than explicitly passing around the tails. For the test, it runs at 79 msec on my machine (explicit passing ran at 58msec); this is expected since we must destructure the cons cell into the head and tail of the list under construction. Would it be perhaps better to use a macro to hide the dirty parts of the code in explicit passing of hd and tl?

-----

1 point by raymyers 6126 days ago | link

Nice optimization. I'm not so sure about the naming of nconc, though. Although it is used for a similar purpose as the traditional CL nconc, I would expect anything called nconc to behave like this:

  (def last-list (li)
    (if (or (no li) (no (cdr li))) li
        (last-list (cdr li))))

  (def nconc (li . others)
    "Same behavior as Common Lisp nconc."
    (if (no others) li
        (no li) (apply nconc others)
        (do (= (cdr (last-list li)) (apply nconc others))
            li)))

-----

1 point by almkglor 6125 days ago | link

Ah crick; let me change that to lconc, that was what I was thinking ^^

I picked up 'tconc and lconc from Cadence Skill; see:

http://www.ece.uci.edu/eceware/cadence/sklanguser/chap8.html...

Funny that CL doesn't actually have this facility ^^

Will rename this soon, in the meantime, do you think this optimization is worth putting in treeparse?

-----

4 points by raymyers 6125 days ago | link

>> do you think this optimization is worth putting in treeparse?

Certainly. At the moment you are probably 50% of the treeparse user base, so it needs to be fast enough for your use case :)

I admit that efficiency wasn't a big thought when I first wrote treeparse (besides avoiding infinite loops -- hopefully those are gone now...). I fondly remember my CL optimization days... we've gotta make ourselves one of those nifty profilers for Arc.

-----

3 points by almkglor 6125 days ago | link

>> we've gotta make ourselves one of those nifty profilers for Arc.

True, true. I was optimizing random functions in treeparse, but didn't get a boost in speed until you suggested optimizing 'many.

-----

1 point by almkglor 6125 days ago | link

It seems that 'many is the low-hanging fruit of optimization. I've since gotten an 8-paragraph lorem ipsum piece, totalling about 5k, which renders in 3-4 seconds (about around 3800msec).

Hmm. Profiler.

I'm not 100% sure but maybe the fact that nearly all the composing parsers decompose the return value of sub-parsers, then recompose the return value, might be slowing it down? Maybe have parsers accept an optional return value argument, which 'return will fill in (instead of creating its own) might reduce significantly the memory consumption (assuming it's GC which is slowing it down)?

mockup:

  (def parser-function (remaining (o retval (list nil nil nil)))
    (....)
    (return parsed li actions retval))

  (def many-r (parser remaining acc act-acc (o retval (list nil nil nil)))
      (while (parse parser remaining retval)
        ; parsed
        (lconc acc (copy (car retval)))
        ; actions
        (lconc act-acc (copy (car:cdr:cdr retval)))
        (= remaining (car:cdr scratch)))
      (return (car acc) remaining (car act-acc) retval))
Removing 'actions might help too - we can now use just a plain 'cons cell, with car == parsed and cdr == remaining.

-----

1 point by raymyers 6125 days ago | link

I tried taking out actions for the heck of it. Removing them yields roughly a 30% speed increase on this benchmark:

  (time (do ((many anything) (range 1 5000)) nil))
Using the following method, we can keep actions as a feature but still get the 30% speedup when we don't use them.

  (def many (parser)
    "Parser is repeated zero or more times."
    (fn (remaining) (many-r parser remaining (tconc-new) nil)))

  (def many-r (parser li acc act-acc)
    (iflet (parsed remaining actions) (parse parser li)
           (many-r parser remaining
                   (lconc acc (copy parsed))
                   (if actions (join act-acc actions) act-acc))
           (return (car acc) li act-acc)))
Not bad, but still not as fast as we'd want for processing wiki formatting on the fly...

ed: Yes. act-acc, not (car act-acc).

-----

1 point by almkglor 6125 days ago | link

Hmm. If you remove 'actions, how about also trying to use just a single 'cons cell:

  (iflet (parsed . remaining) (parse parser remaining)
    ...)

  (def return (parsed remaining)
    (cons parsed remaining))
?

If the speed increase is that large on that testbench, it might very well be due to garbage collection.

This might be an interesting page for our problem here ^^

http://www.valuedlessons.com/2008/03/why-are-my-monads-so-sl...

-----

1 point by raymyers 6125 days ago | link

Tried changing the list to a single cons cell. I did not see any additional performance boost.

-----

1 point by almkglor 6125 days ago | link

  (def many-r (parser li acc act-acc)
    (iflet (parsed remaining actions) (parse parser li)
           (many-r parser remaining
                   (lconc acc (copy parsed))
                   (if actions (join act-acc actions) act-acc))
           (return (car acc) li (car act-acc))))
s/(car act-acc)/act-acc maybe?

Personally I don't mind losing 'actions, it does seem that 'filt would be better ^^.

-----

1 point by almkglor 6125 days ago | link

I tested this on my 8-paragraph 5000-char lorem ipsum page, and the run dropped down to about 3400msec (from 3800 msec).

Hmm. Not sure where the slow down is now ^^

I've tried my "retval" suggestion and it's actually slower, not faster. So much for not creating new objects T.T;

-----

1 point by almkglor 6125 days ago | link

Arrg, I've built a sort-of profiler for the wiki, it's on the git, to enable just look for the line:

  (= *wiki-profiling-on nil)
And change it to t, then reload Arki to turn it on. Then use (*wiki-profile-print) to print out the profile report.

Note that turning on profiling increases time by a factor of > 5. Don't use unless desperate.

Anyway a sample run - this page was rendered in about 800msec without profiling, with profiling it took about 5150msec:

  bold: 305
  bolded-text: 914
  nowiki-e: 31
  open-br: 305
  seq-r: 2681
  italicized-text: 793
  many-format: 4830
  plain-wiki-link: 841
  alt-r: 4555
  nowiki-text: 584
  nowiki: 184
  joined-wiki-link: 413
  ampersand-coded-text: 486
  ampersand-codes: 171
  italics: 128
  many-r: 4829
  formatting: 4616
  close-br: 39
Note that the timing will not be very accurate or particularly useful IMO, since it doesn't count recursion but does count calls to other functions. Sigh. We need a real profiler ^^

-----

2 points by raymyers 6124 days ago | link

>> Sigh. We need a real profiler ^^

Maybe this'll help. http://www.arclanguage.org/item?id=5318

-----

1 point by raymyers 6125 days ago | link

Knowing a bit about the call hierarchy, maybe we can squeeze a bit more knowledge out of that. Here's what seems to be going on:

    formatting               4616
    [-] alt-r                4555
     | bolded-text            914
     | plain-wiki-link        841
     | italicized-text        793
     | nowiki-text            584
     | ampersand-coded-text   486
     | joined-wiki-link       413

-----

1 point by almkglor 6124 days ago | link

Hmm, then the total time of alt-r's children is 4031, leaving 524 msec in alt-r itself.

My test page has quite a bit of bolded text (for testing), so I suppose it's the reason why bolded-text is the highest. Hmm.

Anyway I'm thinking of adding the following parser to the top of the big 'alt structure in formatting:

  (= plain-text
    (pred [or (alphadig _) (whitec _) (in _ #\. #\,)] anything))

  (= formatting
    (alt
      plain-text
      ...))
Hmm. It seems we can't squeeze much performance out of 'alt, I can't really see a way of optimizing 'alt itself, so possibly we should optimize the grammar that uses 'alt.

-----

1 point by almkglor 6124 days ago | link

Did something highly similar to this, it reduced my 8-paragraph lorem ipsum time from about 3200msec to 2100msec.

-----

1 point by raymyers 6126 days ago | link

True, filt doesn't touch the actions field, while sem does. However, I am usually able to replace the use of actions with filters that operate on the parsed field. I prefer this, because the filter style is a more clean and functional style -- rather than relying on side-effects. Hopefully that made sense.

I don't yet know for certain whether filters could or should be used in this particular case. enclose-sem might be the right way to go after all.

-----

1 point by almkglor 6126 days ago | link

I'll have to defer to you on this one - I've only written a parser combinator type parser once before, and that was before I learned what it was. I did end up using something nearer to filters (i.e. acts on the returned value instead of having a separate 'actions field).

Edit: Perhaps part of the optimization of treeparse could be to eliminate the 'actions field. None of your samples used 'sem, and I might prefer 'filt for such enclosing stuff as ''bold'' and '''italics'''. The [[link]]s might be more difficult (because of the necessity of adding alphadigs after the closing bracket in [[link]]s to the actual link text but not the link target), but I think it's doable.

-----

2 points by raymyers 6125 days ago | link

I've thought several times of removing sem in favor of filt. As features, they are very similar but filt is usually cleaner. If I don't see a compelling use case for actions soon I may indeed remove them. This would simplify the interface considerably.

Just between us, filt is actually an analogue to Haskell's monadic lift operator. They're even anagrams! (this happened by accident.)

-----

1 point by almkglor 6125 days ago | link

Okay, I've since ported the Arki wikiformat parser to use filt instead of sem. Removing 'actions would reduce slightly the memory consumption of treeparse.

-----

1 point by almkglor 6126 days ago | link

>> Try this: ...

Doesn't help. I inserted (map idfn ...) on the value given to enformat, which is really the currying of enformat-base (the very short (fn ) at the end):

  (fn (p)
    (carry-out (parse (many formatting) (map idfn p)))))
Hmm. I tried refactoring the (apply alt (map str-seq ...)) thing to an alt-str-seq, basically I created a trie and then did alt on each key, with each sub-trie another 'alt and stored sequences converted to 'seq, and stored end-of-strings (nil keys) as 'nothing, but it was buggy and didn't improve performance T.T.

I've been doing quite a few attempts at improving performance but often I ended up getting more bugs, waa!

-----

2 points by raymyers 6126 days ago | link

Hmm. I just tried replacing the scanner. Just as you also found, it didn't help performance.

  (def scanner-string2 (s (o start 0) (o end (len s)))
    (map idfn (scanner-string s start end)))
I'll see if I can find time soon to digest the wiki-arc grammar. Possibly it could be optimized, but I suspect treeparse could be handling it better. There are probably lessons to be learned from how Parsec gets good performance.

-----

1 point by almkglor 6125 days ago | link

Found the parsec paper:

http://legacy.cs.uu.nl/daan/download/papers/parsec-paper.pdf

http://legacy.cs.uu.nl/daan/pubs.html#parsec

It seems that part of Parsec's optimization is that it is actually limited to an LL(1) grammar (whatever that means) and it's <|> ('alt) will fail immediately if the first parser ever consumed anything at all. Not sure how that translates to treeparse.

-----

1 point by raymyers 6125 days ago | link

LL(1) grammars only require one token of look-ahead to parse.

Parsec does not strictly require this, it can handle infinite look-ahead grammars. However, for good performance, it is best to use LL(1) grammars -- so there will be no backtracking required.

When using Parsec, I have often been surprised by the quick-failing behavior of <|> that you mentioned. Thus, I did not duplicate it in treeparse.

-----

1 point by almkglor 6124 days ago | link

Hmm. Apparently it's the fast way of doing it, although I'm not sure how to implement it in the first place.

-----