Arc Forum | Arc Challenge

Arc Forum

Arc Challenge

8 points by lojic 6194 days ago | 12 comments

I asked about implementing Peter Norvig's simple spelling corrector in Arc previously w/o much interest. However since then, I discovered Clojure and found an implemention in Clojure (by Rich Hickey - Clojure's inventor) that's one line shorter than Norvig's 22 line Python program, and quite a bit shorter than my Ruby translation, that shows off a little bit of Clojure.

This seems like a reasonably small problem for an Arc challenge. And with 3 versions to serve as a reference (Python, Ruby, Clojure), I wouldn't think it would be that difficult for someone fluent in Arc (I am not).

Can an Arc version match or exceed the Clojure version with respect to brevity, beauty or other criteria of your choosing? Feel free to assume the existence of re-seq and slurp - it's a language challenge, not a library challenge.

I'll paste some resource links into a comment so they're clickable:

  (defn words [text] (re-seq #"[a-z]+" (. text (toLowerCase))))

  (defn train [features]
    (reduce (fn [model f] (assoc model f (inc (get model f 1)))) 
            {} features))

  (def *nwords* (train (words (slurp "big.txt"))))

  (defn edits1 [word]
    (let [alphabet "abcdefghijklmnopqrstuvwxyz", n (count word)]
      (distinct (concat
        (for [i (range n)] (str (subs word 0 i) (subs word (inc i))))
        (for [i (range (dec n))]
          (str (subs word 0 i) (nth word (inc i)) (nth word i) (subs word (+ 2 i))))
        (for [i (range n) c alphabet] (str (subs word 0 i) c (subs word (inc i))))
        (for [i (range (inc n)) c alphabet] (str (subs word 0 i) c (subs word i)))))))

  (defn known [words nwords] (for [w words :when (nwords w)]  w))

  (defn known-edits2 [word nwords] 
    (for [e1 (edits1 word) e2 (edits1 e1) :when (nwords e2)]  e2))

  (defn correct [word nwords]
    (let [candidates (or (known [word] nwords) (known (edits1 word) nwords) 
                         (known-edits2 word nwords) [word])]
      (apply max-key #(get nwords % 1) candidates)))


  user=> (correct "misstake" *nwords*)
  "mistake"

5 points by lojic 6194 days ago | link

Norvig's original article: http://norvig.com/spell-correct.html

Norvig's article has a large file for training the corrector that you can download, or use your own.

My Ruby translation: http://lojic.com/blog/2008/09/04/how-to-write-a-spelling-cor...

Reference for the Clojure version above: http://en.wikibooks.org/wiki/Clojure_Programming#Norvig.27s_...

Previous post here: http://arclanguage.com/item?id=7906

-----

4 points by cchooper 6193 days ago | link

This is my first attempt:

  (= *nwords* (counts:tokens (downcase:w/infile f "big.txt" (tostring (drain:pr:readline f))) whitec))

  (def edits1 (word)
    (with (alphabet "abcdefghijklmnopqrstuvwxyz"
           n (len word))
      (dedup:accum add
        (forlen i word (add:+ (cut word 0 i) (cut word (+ 1 i))))
        (for i 0 (- n 2) (add:string (cut word 0 i) (word (+ 1 i)) (word i) (cut word (+ 2 i))))
        (forlen i word (each c alphabet (add:string (cut word 0 i) c (cut word (+ 1 i)))))
        (for i 0 n (each c alphabet (add:string (cut word 0 i) c (cut word i)))))))

  (def known-edits2 (word) (flat:map known:edits1 (edits1 word)))

  (def known (words) (keep [*nwords* _] words))

  (def correct (word)
    (let candidates (or (known:list word) (known:edits1 word) (known-edits2 word))
      (best (compare > [*nwords* _]) candidates)))

This achieves nothing close to 10 words per second in the worst case scenario, more like one word per minute.

-----

2 points by lojic 6193 days ago | link

Thanks! I'd say that definitely compares favorably with the Clojure version aesthetically.

The performance is a little disappointing though :( Here's the performance of my Ruby (i.e. slowest of all languages!) version:

  ~/sync/code/ruby$ irb
  irb(main):001:0> load 'spelling_corrector.rb'
  => true
  irb(main):002:0> def time_sc n
  irb(main):003:1> t1 = Time.now
  irb(main):004:1> n.times { correct "zaqxsw" }
  irb(main):005:1> puts Time.now - t1
  irb(main):006:1> end
  => nil
  irb(main):007:0> time_sc 10
  25.809097

So that's 2.6s per word.

The Clojure version is considerably faster (0.33s per word):

  user=> (time (dotimes i 10 (correct "zaqxsw" *nwords*)))
  "Elapsed time: 3254.863 msecs"

For words for which a correction was found, the Clojure version processed 900/s, Ruby 100/s

-----

1 point by cchooper 6193 days ago | link

Oddly, the Ruby version runs more slowly every time I run it. I think memory locality may be the problem, as the Ruby process grows each time (now at 85 MB). That's nothing compared to my MzScheme process which is now consuming 700MB. That's probably due to the inefficient way in which nwords is constructed. A bit of tuning there might work wonders for overall performance.

-----

1 point by babo 6188 days ago | link

Please post python's speed as a reference, it's wicked fast as far as I remember.

-----

1 point by cchooper 6193 days ago | link

This change reduces the running time by about a third:

  (def known-edits2 (word) 
    (accum add (each e (edits1 word) (map add (keep [*nwords* _] (edits1 word))))))

-----

4 points by rkts 6192 days ago | link

Probably because it's incorrect. You should have (edits1 e) at the end there.

-----

2 points by cchooper 6191 days ago | link

True. The correct version is still faster though.

-----

1 point by rkts 6186 days ago | link

Here's mine. It seems to be faster than the versions presented so far, assuming I haven't made any dumb mistakes. You'll have to put

  (xdef 'sub substring)

in your ac.scm first.

  (= wordfile "big.txt")

  (= *nwords* (table))

  (w/infile f wordfile (whilet l readline.f (counts (tokens downcase.l whitec) *nwords*)))

  (= str [coerce _ 'string])

  (= alphabet (accum a (each c "abcdefghijklmnopqrstuvwxyz" (a str.c))))

  (def edits1 (w)
    (let n len.w
      (accum a
        (for i 0 (- n 1) (a:+ (sub w 0 i) (sub w (+ i 1))))
        (for i 0 (- n 2) (a:+ (sub w 0 i) (str:w (+ i 1)) (str:w i) (sub w (+ i 2))))
        (for i 0 (- n 1) (each c alphabet (a:+ (sub w 0 i) c (sub w (+ i 1)))))
        (for i 0 n       (each c alphabet (a:+ (sub w 0 i) c (sub w i)))))))

  (def known (ws) (keep [*nwords* _] ws))

  (def known-edits2 (w)
    (accum a (each e edits1.w (each e edits1.e (if *nwords*.e a.e)))))

  (def correct (w)
    (best (compare > [*nwords* _])
          (or (known:list w) (known:edits1 w) (known-edits2 w))))

The main optimizations I made were replacing cut with sub, replacing string with +, and dropping the dedup.

-----

1 point by rkts 6186 days ago | link

Also, here's a version which uses iterators to reduce consing. It's ~2x faster on long words (again, assuming no blunders on my part).

  (= wordfile "big.txt")

  (= *nwords* (table))

  (w/infile f wordfile (whilet l readline.f (counts (tokens downcase.l whitec) *nwords*)))

  (= str [coerce _ 'string])

  (= alphabet (accum a (each c "abcdefghijklmnopqrstuvwxyz" (a str.c))))

  (def edits1 (w)
    (fn (f)
      (let n len.w
        (for i 0 (- n 1) (f:+ (sub w 0 i) (sub w (+ i 1))))
        (for i 0 (- n 2) (f:+ (sub w 0 i) (str:w (+ i 1)) (str:w i) (sub w (+ i 2))))
        (for i 0 (- n 1) (each c alphabet (f:+ (sub w 0 i) c (sub w (+ i 1)))))
        (for i 0 n       (each c alphabet (f:+ (sub w 0 i) c (sub w i)))))))

  (def editsn (n w)
    (let ws [_ w]
      (for i 1 n (= ws imappend.edits1.ws))
      (iornil:ikeep [*nwords* _] ws)))

  (def correct (w)
    (aif (or editsn.0.w editsn.1.w editsn.2.w)
      (ibest (compare > [*nwords* _]) it)))

  ; iterator utils

  (def iempty (i) (ccc (fn (c) (i [c nil]) t)))

  (def iornil (i) (and (~iempty i) i))

  (def imappend (r i) (fn (f) (i [r._ f])))

  (def ikeep (p i) (fn (f) (i [if p._ f._])))

  (def ifoldl (r q i) (i [= q r.q._]) q)

  (def ibest (gt i) (ifoldl (fn (y x) (if (or no.y gt.x.y) x y)) nil i))

-----

4 points by rkts 6160 days ago | link

Well, not that anyone cares, but it turns out that the changes I made here do close most of the performance gap between Arc and Python. Python is still faster, but it's nothing to lose sleep over. The most important change was replacing Arc's cut with MzScheme's substring, which works the same way but is implemented in C. The other changes were important too, though, especially replacing lists with iterators. So I guess the lessons are

1. Any Arc library function that has an MzScheme equivalent should be rewritten to use the MzScheme version.

2. Using iterators to reduce consing is a win. So, maybe we should make a library of common iterator utilities (map, filter, etc.)?

-----

1 point by cchooper 6186 days ago | link

This is my final attempt. It's faster, shorter and easier to read than my first attempt, but the performance is still shockingly bad.

  (= *nwords* (table))

  (w/infile f "c:/big.txt" (whilet l (readline f) (counts (tokens (downcase l) whitec) *nwords*)))

  (def edits1 (word)
    (with (alphabet "abcdefghijklmnopqrstuvwxyz"
           n (len word))
      (dedup:accum add
        (forlen i word (add:+ (cut word 0 i) (cut word (+ 1 i)))
                       (each c alphabet (add:copy word i c)))
        (for i 0 (- n 2) (add:copy word i (word (+ 1 i)) (+ 1 i) word.i))
        (for i 0 n (each c alphabet (add:string (cut word 0 i) c (cut word i)))))))

  (def known-edits2 (word edits) 
    (accum add (each e edits (map add (known (edits1 e))))))

  (def known (words) (keep [*nwords* _] words))

  (def correct (word)
    (let edits (edits1 word)
      (best (compare > [*nwords* _]) (or (known:list word) (known edits) (known-edits2 word edits)))))

-----