it would be great to supply tokeniser/parser hooks instead
I used to think that, but these days I'm coming around to applying the "Arc philosophy" to libraries: every implementation should contain the minimum amount of code needed to implement just what it does.
So we have one commit in git which is parser.arc, and another which is parser + table reader.
Now, this is a weird thing to do. What everyone does is write software so that it can be configured and extended without having to change the source. Like everyone else, I've done this myself: just look at my "extend" function for example.
Yet Arc is different. It doesn't try to let you configure or extend it; it just does what it does in the simplest way. And that simplicity makes it easy to hack Arc to do what you want. The usual way, with configurations and hooks and extension methods, often fails: whatever I want to do turns out not to be supported by what the original author thought of. So I end up having to hack the source anyway, which is then, ironically, made difficult by the added complexity of all the configuration and extension code!
So to make the parser easily extensible, rather than adding hooks, make sure you've removed any code duplication.
Are all those functions defined in arc-tokeniser just to keep them out of the top level namespace? Or is something else going on?
The "add hacks not options" idea is intriguing, and I adore the simplicity it both requires and begets. I'm not sure how to control damage from conflicting patches though. And I'd much rather share hacks via libraries - it's so easy to just redefine any core arc function, and your "extend" function makes that even easier and safer. I suppose there are some kinds of hacks that are harder to share this way though - especially if you're hacking ac.scm.
Are all those functions defined in arc-tokeniser just to keep them out of the top level namespace?
Yes. I've noticed that most arc code is not like this, so maybe arc-tokeniser is really bad style. What's the correct way to deal with namespace clashes? I'd like to avoid having a whole bunch of functions called arc-tokeniser-<something>.
Although upon reflection, it's true that popping these kinds of functions up into the toplevel makes them more readily hackable (by simple substitution, or 'extend-ing). I wonder if there's a way to do that without spawning a crowd of verbosely-named little functions?
I don't have answers to any of these questions - my brain is still wired mostly in java, and in the process of re-wiring I can't write well in any language ...
I'm not sure how to control damage from conflicting patches though.
We publish a commit which is a merge of the two patches. You can see an example in my arc2.testify-table0+testify-iso0 commit, which is a merge of my arc2.testify-table0 and my arc2.testify-iso0 patches. Here's the original testify from arc2:
(def testify (x)
  (if (isa x 'fn) x [is _ x]))
My arc2.testify-table0 patch makes testify treat tables like it does functions:
(def testify (x)
  (if (in (type x) 'fn 'table) x [is _ x]))
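For example (a hypothetical usage sketch, assuming the patch is loaded on top of arc2, whose "keep" runs its test argument through testify):

```arc
; with the patched testify, a table can be passed directly as a test:
; calling a table with a key returns its value, or nil if absent
(keep (obj a t b t) '(a b c))
; testify passes the table through, so 'a and 'b are kept and 'c dropped
```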
My arc2.testify-iso0 patch has testify use "iso" instead of "is":
(def testify (x)
  (if (isa x 'fn) x [iso _ x]))
And the merge of the two:
(def testify (x)
- (if (isa x 'fn) x [iso _ x]))
- (if (in (type x) 'fn 'table) x [is _ x]))
++ (if (in (type x) 'fn 'table) x [iso _ x]))
The first "-" line shows the arc2.testify-table0 patch, the second "-" line shows the arc2.testify-iso0 patch, and the "++" shows how I merged the two patches. (You can get this output by using the -c option to git-log: "git log -p -c")
and your "extend" function makes that even easier and safer
Right, I think that functions like "extend" would arise from seeing patterns in code and abstracting them out, making the code more succinct. It was more my thought process I was commenting on: I had thought, "patches are hard to deal with, so I'll make functions like extend". Now I'm thinking, "what if patches were easy?".
What's the correct way to deal with namespace clashes?
I've been wondering about that. One possibility I've wondered about is to have an abbreviation macro:
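It might look something like this (a hypothetical sketch of a prefix-abbrev style macro; the real thing could work differently):

```arc
; hypothetical sketch: for each name, make the short name refer to
; the same function as the prefixed global name
(mac prefix-abbrev (prefix . names)
  `(do ,@(map (fn (name)
                `(= ,name ,(sym (string prefix "-" name))))
              names)))

; now make-token refers to arc-tokeniser-make-token
(prefix-abbrev arc-tokeniser make-token)
```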
So the actual name of the function would be arc-tokeniser-make-token, but anyone can refer to it by make-token by using the abbreviation macro if that would be convenient for them.
This is even more speculative, but I've also wondered if maybe it needn't be your job as a library author to worry about namespace clashes. What if the user of the library said "oh, look, make-token is clashing", and could easily load it with particular symbols renamed...
I think the first step is not to worry about namespace clashes. Instead think, "oh, namespace clashes are easy to deal with, so if they happen no problem". Otherwise you end up doing work (and maybe making the code more complicated or harder to extend) to avoid a namespace clash that may never happen.
Where w/locals expands into a big (withs ...) form with key-value pairs taken from the table.
Use an alternative def to put function definitions in such a table instead of the global namespace:
(def-in tokeniser-helpers make-token (kind tok start length) ...)
This way, arc-tokeniser would be less horribly big, easier to hack, and non-namespace-polluting. As a kind of plugin system it doesn't seem terribly obtrusive, does it? The disadvantage is that you need to search further for the definitions of your helper functions.
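A sketch of how def-in and w/locals might be written (hypothetical; neither exists in arc2, and this w/locals requires the table to already be populated at macro-expansion time):

```arc
(= tokeniser-helpers (table))

; def-in: define a function into a table instead of the global namespace
(mac def-in (tbl name args . body)
  `(= (,tbl ',name) (fn ,args ,@body)))

(def-in tokeniser-helpers make-token (kind tok start length)
  (list kind tok start (+ start length)))

; w/locals: bind each key in the table as a local variable,
; expanding into a (withs ...) form
(mac w/locals (tbl . body)
  `(withs ,(mappend (fn ((k v)) (list k `(,tbl ',k)))
                    (tablist (eval tbl)))
     ,@body))

(w/locals tokeniser-helpers
  (make-token 'sym "foo" 0 3))
```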
Is this something like how your prefix-abbrev macro would work?
I think it's not just a question of worrying about clashes that may never happen - it also feels inelegant, dirty even, to have globally-accessible functions that are relevant only in a very specific context. Otherwise I would completely agree - it would be a kind of premature optimisation to worry about them.
it also feels inelegant, dirty even, to have globally-accessible functions that are relevant only in a very specific context
Yes, but how do you know that your Arc parser functions are only going to be relevant in the code you've written? Perhaps someday I'll be writing my own parser, or something completely different, and I'll find it useful to use one of your functions in a way that you didn't think of!
I suggest trying out writing your code in the simplest possible way. For example, in your original:
(def arc-tokeniser (char-stream)
  (withs (make-token (fn (kind tok start length)
                       (list kind tok start (+ start length)))
"make-token" does not use "char-stream", so we can make this simpler:
(def make-token (kind tok start length)
  (list kind tok start (+ start length)))
Now I can look at "make-token" in isolation. I can easily understand it. I know that all that other stuff in arc-tokeniser isn't affecting it in some way. And, if I'm writing my own parser and I want to use "make-token", I can do so easily.
And sure, down the road there may be some other library that also defines "make-token". At that point, it will be easy to make a change so that they work together. Perhaps by renaming one or the other, or by doing something more complicated. The advantage of waiting is that then we'll know which functions actually conflict, instead of going to a lot of work now to avoid any possibility of future conflict, the majority of which may never happen.
Now of course I'm not saying to pull every single function out of arc-tokeniser. You have some functions that depend on char-stream and token and states and so on, and it makes perfect sense to leave those inside arc-tokeniser. My claim is that today you should write the simplest possible parser.arc library, explicitly not worrying about future namespace clashes; that it is better to deal with them in the future, when they actually happen.