Arc Forum

2 points by almkglor 6442 days ago

One potential problem is splitting a string into words. It's a minor one: figuring out what counts as a "word", i.e. where the division between words falls.

Otherwise it looks like a pretty standard Bayesian analysis, which I believe pg has done already.

1 point by fallintothis 6440 days ago

Not really. It's a simple regexp that Norvig uses.

Python:

  import re
  def words(text): return re.findall('[a-z]+', text.lower())

Arc:

  (def words (text) (tokens (downcase text) [~<= #\a _ #\z]))
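
To make the "what is a word" question concrete, here is a small sketch of what that regexp treats as a word. The sample strings are made up for illustration; the only assumption is that a word is a run of the ASCII letters a-z after lowercasing, which is what the Python one-liner above does.

```python
import re

def words(text):
    # Same regexp as above: each maximal run of a-z (after lowercasing) is one word.
    return re.findall('[a-z]+', text.lower())

# Apostrophes, hyphens, digits and punctuation all act as word boundaries.
print(words("Don't split e-mail, 2nd-rate words!"))
# ['don', 't', 'split', 'e', 'mail', 'nd', 'rate', 'words']

# Non-ASCII letters are boundaries too, so accented words get cut short.
print(words("café"))
# ['caf']
```

So contractions and hyphenated words come apart, and digits are dropped. Whether that matters probably depends on the text being analyzed.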

-----