2. Consider this: when a process is created, a private heap is allocated for it and the "cache" of global variables is copied to the heap. When each process mutates any structures, check whether the structures are in its own heap. If not, we can raise an error.
Well, first: does reading a global variable at some point create a copy or not?
Having to do the checking on each write has the slight disadvantage of slowing down writes. For that matter, we still have to consider how heaps are organized.
For example, in my current stop-and-copy, always-copy-on-message-send, each heap has several semispaces. Semispaces conceptually belong to only one heap. One of the semispaces is the current semispace, the one that is being allocated into. The others were semispaces that were received during message sending.
When a message is sent, it is copied into its own semispace, then the semispace is added to the list of "other semispaces" held by the heap.
During a GC, the GC copies all semispaces into a single new semispace, then the heap releases all the other semispaces and sets the new semispace as the current semispace.
In theory it should be possible to reduce slightly the GC overhead by just copying some (but not all) semispaces into the newest semispace: this is basically an incremental GC
Edit: As an aside, the advantage of leaving it "undefined" is so that we can change the implementation of global variables without committing ourselves to stone. "What, you kept a table in a global variable and expected to use that to communicate with other processes? Well, sorry, but that won't necessarily work (or not work)"
Refcounting performs a lot worse than a generational gc. When dealing with many deep data structures, it becomes more worse. And a simple generational gc is not very hard to implement.
It seems that all of us have put too much emphasis on premature optimizations and fancy features... The first priority may be to simulate the whole arc's core semantics and make the compiler able to compile itself.
True. I'm trying to hack the macro stuff, to not much effect. Erk.
In fact quite a bit of arc.arc is now compileable, although you do have to transform (def ...) to (set .... (fn ...)) manually. So really what's needed now is macros. Also trying to think of how best to implement optional args and destructuring args - probably by just hacking off rest arguments (for optional args) and let's (for destructures)
Macros should be easier to implement once the compiler is able to compile itself, because this way the compiler and the compiled macro have the same internal representation of data structures, so passing arguments between the two shouldn't be too hard.
There are several uses of macros in the compiler, unfortunately. In particular the 'def macro is too much of a convenience. So in order for the compiler to easily compile itself, it first has to implement macros. Chicken, meet egg.
Ah heck, maybe I should just use 'eval now and implement a compiled 'eval interpreter later that can interpret code and yet allow interpreted code to call compiled code and vice versa.
In fact I already have a bit of a sketch for this (which is necessary if we want to allow compiled programs to use 'eval). Basically put interpreted '(fn ...) forms into a 'interpreted-fn annotated type together with surrounding environment, add an entry to the 'calls* table (via defcall, say) for 'interpreted-fn to, say, a $$interpreted-fn-apply function which binds the parameters into an environment table and calls the 'eval interpreter.
Of course this requires some changes in the base system: we need at the very least a %symeval primitive which when given a symbol will give its global binding, a %symset primitive which will modify a symbol's global binding, and obviously we need a link from the symbol to the GLOBAL() array (and dynamically create new containers for created symbols - if it's not in the GLOBAL() array then the compiled code would never read that global anyway, only the interpreted code ever will).
The rest of the interpreter is just a standard scheme interpreter, the only real support we need is to be able to call compiled-from-interpreted and interpreted-from-compiled, and the reading and binding of global symbols, including those that aren't in the GLOBAL table.
Ouch, my head hurts. And sacado's the one doing the Unicode strings. LOL
Would it be worth implementing 'def directly? This would give a lot more functionality right away. This could be temporary until macros are implemented.
Possibly. There's a bunch of "macro" transformations in xe.arc, possibly I'm just a bit too lazy to think. However I don't like depending on those transforms, I want to do it "properly"
For the global vars problem, a solution could be to associate top level values directly with the symbol, this way a symbol would consist of three values: its string representation, its global value (initially a special 'unbound value') and a property list.
The current style has an optimization where all globals are simply referenced directly from an array in O(1). I'd rather that symbols point to entries in this array, because symbol-as-global-variable lookups are expected to be completely nonexistent if 'eval isn't involved in the program anyway (who uses 'eval in a language with 'read?). Only newly created symbols must have allocated variable values, and only for the benefit of 'eval'ed code - we can already know the global variables in the compiled code, because the compiler need that info anyway.
Basically:
struct {
long type; /*T_SYM*/
char* stringform;
#ifdef EVAL_USED
obj* binding;
#endif
} symbol;
int main(){
/*compiler generated only if eval is used*/
obj sym; symbol* sympt;
sym = SYM2OBJ("globalvar0");
sympt = (symbol*) sym;
sympt->binding = &GLOBAL(0);
sym = SYM2OBJ("globalvar1");
sympt = (symbol*) sym;
sympt->binding = &GLOBAL(1);
...
}
This way the current performance is retained (global variable lookups are O(1)).
I don't know how much this solution will be once support for a dynamic load (e.g. from the REPL) will have to be implemented, because you'll have to keep an index of the last global variables created across different compilation sessions. With threads it gets even more complicated (mutex on the index?). With symbols it would be simpler to implement a dynamic load or definition of a global var from the REPL. The price paid is a slightly slower access to global variables, because 2 references to memory are necessary for every refrence to a global var. Global variables lookups are still O(1) though, e.g: sym->binding for read access and sym->binding = value for write access.
> the last global variables created across different compilation sessions
I don't understand this part. I was proposing that 'eval would be an interpreter, not a compiler. My intentions was that compiled code would be statically generated (the way it's done now), so 'eval cannot possibly compile code. It would be a compiled interpreter of Arc. arc2c is a static compiler, so 'eval won't add ever add compiled code; the best it can do is create a 'interpreted-fn object that contains an interpreted function's code (as a list) and the enclosing interpreted environment
So a dynamic load would just interpret the expressions in the file being loaded:
(set load
(fn (f)
(w/infile s f
(whilet e (read s)
(eval e)))))
'eval would be able to access the global variable table indirectly via the symbols and %symeval/%symset.
Basically, 'eval would be compiled to something like this:
(set eval
(fn (e (o env nil))
(if (isa e 'symbol)
(if env (lookup-environment env e)
(%symeval e))
(...))))
Also: if the compiled code doesn't reference it, it won't be in the GLOBAL() array. The reason is simple: the compiled code won't reference it, ever. If 'globalvar isn't in GLOBAL(), then it does not exist in the compiled code. So it doesn't matter that it's not in the GLOBAL() array - the compiled code never referenced that global, so it won't ever use an index into the GLOBAL() array to refer to it. The interpreted code might, but that's why we have an indirect reference connected to the symeval.
Also, when I say O(1), I mean O(1) with the number one, as in only one layer of indirection (an index within a table). If global bindings are kept with the symbol only, then all global accesses - even precompiled ones - need (1) to find the symbol and (2) get the binding, for a total of O(2).
In other words: 'compile-file compiles, but it creates a new executable which is never connected to the process that ran 'compile-file. 'eval just interprets, and if the interpreted code mutates a global of the program, then that global gets mutated for real, in the program (what are you doing using 'eval on untrusted coe anyway). But if the interpreted code mutates a global that is never used in the program, it just creates a new global variable, one which is never referenced by the program (by definition, because the program never used it).
I thought eval compiled code, loaded it and then executed it. I've been mistaken. With the compiled code completely static then your strategy is better than assigning values to symbols.
What closure representation does arc2c have now? Flat or nested? The former is quicker for variable lookup, but slower at set and closure creation. The latter is just on the opposite side, and it eats more memory becaust it allocates many useless frames.
For the flat closure representation, we should notice the following:
Every potentially seted local variable should be indirected by a reference. Of course you can reduce the number of them by analysising which are only seted once or not shared by other closures.
Basically instead of a cons cell (as suggested by stefano) I use a new structure, the "sharedvar", which is just a container for an obj. These also means that the actual closure objects are immutable after creation.
Only local variables that are ever set are put in sharedvar's, other local variables are kept in the flat closure. However I don't analyse for variables that are set only once, or which aren't shared by other closures yet.
Yeah, this is the approach taken by chez, mlton, and many recent compilers of functional languages.
And all the optimization stuff(unboxing of "sharedvar", inlining, type inference, known function analysis, unused variable elimination, etc) can be made agressive by only global flow analysis, which is rather time-consuming. But I'm curious to know how far a relatively conservative compiler which doesn't do any flow analysis is able to go. If stalin performs a little worse, but compiles 10x times faster, I believe that much more people would use it.
Sharedvar unboxing is a little difficult, since there are two types of local variables: those in closures, and those in parameters:
(let kept ()
(set keeper
(fn (x)
(set kept (cons x kept)))))
; vs.
(set fooer
(fn (x)
(set x (rev x))
(do-something x)))
So basically we need two types of local-variable-set primitives: one for closures-variable-set (first case above) and another for parameter-variable-set (second case)
Re: Stalin - is it that slow? Meaning an order of magnitude improvement of time is needed to make it comfortable?
Edit:
Type inference: well I can't think of a good way of getting type inference generically, but certainly it's possible for e.g. '+. '+ requires that all parameters are either numbers, or all strings, or all lists, and if we can determine that one parameter is of a specific type, we can put the checking that the other parameters are of that type and immediately bind the + to the specific type.
For example if we have %n+ for numeric addition, %s-join for string concatenation, and %l-join for list concats:
(+ x y z)
=>
(+ x y z) ; can't determine type
(+ 1 x)
=>
(%n+ 1 (let check x
(if (is (type check) 'num)
check
(err "+: type mismatch"))))
(+ (list 1 2 3) x)
=>
(%l-join (list 1 2 3)
(let check x
(if (is (type check) 'cons)
check
(err "+: type mismatch"))))
Stalin might be the most optimizing but slowest functional language compiler ever written.
Sharedvar unboxing is not an important issue because it doesn't make much difference in efficiency. Most scheme programs don't update local variables very often. The most useful optimization parts are(in my opinion): Special treatment of let and letrec, inlining and known function detection.
General ML-style type inference for scheme is impossible. What we can do is to infer as more types as possible.
1)Trasforming let and letrec to ((fn (...) ...) ...) is not efficient. First, it would allocate a closure. Second, it would perform a function call. Instead, the variable bound by let and letrec should be allocated on the stack, and no function calls are needed.
2)For example, in:
(f x)
If f is statically known, and f's environment is null or is the same as the environment of the calling site, then the function call should be a direct jump. It eliminates the cost of (1)global fetching of 'f, (2)extracting the information of the address and environment, (3)switching the environment, (4)an indirect jump.
By known function detection, tail recursive functions can be compiled to exactly the same code as loops in imperative languages do.
1) Given that everything is transformed to CPS, pretty much everything - including sequences I think - ends up being a function call. In fact, only jumps exist at all.
Not sure about how the closure-conversion works. This may be workable, would you be willing to work on this?
2) I get this now, although I'm not sure how to translate this into the current Arc2c output. I'll discuss the current arc2c calling convention and you tell me how workable your proposal is.
------
Currently, arc2c output works out like this:
1) Each function is simply a case in a large switch statement:
jump: switch(pc){
case 0:
...code for function 0...
case 1:
...code for function 1...
}
2) There exists a "stack" which is not the C-stack:
3) At the start of each function (with the exception of function 0, which is the top-level), the stack contains a [0] closure for the current function, [1] the continuation, and [2+] the Arc parameters. This is assured by the calling function.
4) Functions are passed around as closure structures. The first eleemnt of the closure structure is a non-encoded number, representing the case for that function, while the rest is simply an array of closure variables.
5) Functions simply use the stack for temporary scratch space. For example this is how (+ 1 2) would compile to:
PUSH(FIX2OBJ(1));
PUSH(FIX2OBJ(2));
ADD();
6) Just prior to calling a function, the calling function pushes the parameters in order: [0] closure (the function to call), [1] continuation [2+] arguments. The number of elements N for the call is computed by the compiler
7) Then at the function call, the calling function copies the top N elements of the stack into the bottommost N elements, and assures that sp = &stack[N]. Then it sets the C-variable pc to the closure's function field, and does a C goto jump;
Well, I only have experience of writing direct-style compilers, not CPS-style ones, so my advice needs to be adapted.
But from mechanism of the current arc2c output you showed above, I see many places for improvement:
1)In a function:
(fn (x y z ...) (g A B C D ...)),
if B doesn't rely on x, C doesn't rely on x and y, D doesn't rely on x, y and z...etc, the calling function could avoid copying elements to the bottom. Instead, it moves the stack pointer to the bottom first, and then pushes the arguments.
2)For functions having no environments, we don't have to push a full closure, we just have to push pc.
3)For known functions, we just do a C goto jump not to the jump label, but to the (case n), because C cases are in fact labels.
Finally, in my opinion, a CPS-style compiler is no longer a better choice nowadays. It complicates the source, the debugging information and the (human) analysis of the program structure. Since we are already using a separate stack that is different to C's, continuations can be implemented in direct-style compilers as easily as in CPS-style ones. And codegen for direct-style compilers is just slightly more difficult, which isn't an issue. In addition, a naive direct-style compiler performs much better than a naive CPS-style one. The latter needs a source simplifying step to eliminate unnecessary closures and function calls produced by CPS conversion.
1) personally I think this is a rare case, but I could be wrong
2) arc2c closures are very lightweight: it's just a simple array of obj(s), with the first obj being the pc. So in effect for functions having no environment, we are pushing a pointer to the pc.
That said, closures are also used to represent functions that can be passed around. Unfortunately closures are currently untyped, so we expect the current closure style to be changed.
Also we need to support the possibility that a "function" being called isn't really a function: after all table syntax is just (tb key). And this is perfectly valid Arc:
(let sometable (table)
(each k lst
(= sometable.k (generate-something k)))
(map sometable ; yes, we're passing a table as if it were a function!
foolst))
3) I was actually thinking of this too, although I haven't gotten around to it.
re: CPS: I wouldn't really know. Me, I'm just hacking around at the transformations before the CPS and Closure conversions. Because of the somewhat modular construction of arc2c, in theory you could write a drop-in replacement for CPS and Closure conversions, as well as code generator, and we can then put either CPS or the direct style as options, maybe.
1)It's not a rare case. It's important for speed improvement for most of useful programs. For example, map & foreach, which are used quite often, can be optimized by not copying data on stack.
A non-optimizing compiler leads easily to a "fast enough" executable. Without optimizations I think the compiled code would be 7x~10x slower than C.
Edit: I've tried the Fibonacci "benchmark" on a simple compiler i'm writing: it takes 0.2 seconds to compile the program and to compute the 32snd Fibonacci's number. On the current Arc interpreter it takes ~5 seconds.
Your compiler might be much slower if it's with true scheme numbers, + operator as a function(not a primitive operator) and stack overflow checking. These features are currently supported by the arc interpreter on mzscheme.
If you can correctly eliminate function calls on +, your compiler is an optimizing one, not non-optimizing...
I've tried the same example putting a function call and a test around every arithmetic operation, and execution time went from ~0.2s to ~0.26s, not a big difference, although a few optimization will probably be necessary for something more complex than fibonacci's example.
Is the function call overhead so small? I didn't realize.^^
But there are other issues: the fib example is not a very good benchmark suit, because in C, general recursion is not a common paradigm. If we compare C loops to Arc tail recursive calls generated by a simple compiler instead of comparing C recursions to Arc recursions, I believe that the difference will be much larger. Because C compiler writers have spent at least 20 years on optimizing loops...
That's absolutely true. Reaching C speed with high level languages such as Lisp it's very very difficult. CMUCL and SBCL reach roughly the speed of C, but they've been developed for a long time.
As of loops speed vs. tail recursion speed, the difference shouldn't be too big.
Stalin performs as better as C in numerical programs and many other benchmarks. The most exciting thing is that unlike CL, stalin doesn't need type declarations to guide optimizations. It would infer as much type information as possible. The problems is that it compiles too slow and it's not maintained anymore.
Naively implemented tail recursions is still not fast, because many common loop optimizations can't be directly applied to them unless you eliminate the function calls and regard them as true goto's. It's a rough task because the global flow analysis is needed for eliminating as many calls as we can.
BTW, the only country I know which has an extra holiday in this weekend is China. Could you please tell me your email address? I know this kind of topic is not proper in this forum...
Actually I'm in the Philippines. Our holiday is actually supposed to be tomorrow, but our president has a tendency to move holidays near weekends to give long weekends.
LOL. I suppose it's because the populace is mostly dissatisfied with the president, and the president is trying to appease the populace? Those are the conditions in our country anyway ^^
1)We can make a simple inliner, and name the primitives like #car#, #cdr#, etc. Then define car as (set car (fn (x) (#car# x)). In the last, we use the inliner to do the job. The inliner approach is better than an extra pass of eliminating primitive calls, because it can do more optimization.
2)Maybe writing a metacircular interpreter in compiled arc is the best way of implementing both macros and eval-when.
3)I don't know if the current unicode libs are good enough.
4)Implementing green threads via continuations should be a good start.
5)For standard I/O, use stdio. Anything else could be done by an ffi. Since arc2c is a static compiler, ffi could be portable even what we have is an ANSI-C system, because we have to deal with neither the .dll/.so stuff nor the libffi lib.
1) hmm. Interesting. Can't think of how to do inlining yet though.
As an aside, my intent was that library functions in a specially defined library file can access primitives %car etc., but not other code - user code can use %car etc. for their own purpose without clashing with the primitives, if only for compatibility with Arc.
2) Yes, this seems correct. And there's also 'eval. Yes, eval's not often used, but still...
5) The problem is using green threads with blocking I/O. Obviously in a server if one thread is blocked by I/O, other threads should still continue. It's really the threads/IO interaction that's bothering me.
Edit: which reminds me - currently closure structures are untyped, meaning we can't safely get the type of a function.
Some background first: the compiler first puts all top-level expressions as parts of a do-block. For much of the compilation run (until it reaches CPS transformation) the compiler represents the entire program in this do-block.
I intend that the library's will simply be inserted at the front of the do-block's list of subexpressions.
The inline transformation phase then iterates over the top-level elements of the topmost do-block. If a top-level element is an assignment to a global variable, we attempt to determine if the assignment is eligible for inlining.
To determine if the assignment is eligible for inlining, we check if it's assigning a function. Since this is a top-level block, the function cannot close over any variables. Then we detect if the function's parameters are referenced 0 or 1 times (if referenced more than that, we can't safely inline it without putting it in a let-block - which creates a function anyway, so no point inlining). Note that we can actually allow the function to reference itself via the global, since we won't remove the assignment to the global.
If we determine that a global is eligible for inlining, we add the symbol and its function to a table of inlinable functions.
Now here's the hard part: we also have to ensure that the global can be safely inlined. If a global is assigned to exactly once, then it could.
While scanning, we check if the global is already in the inlineable set. If it is, we add the global in the banned set. This means that redefining a global will prevent it from being inlined:
(set global
(fn () t))
(prn:global) ; t
(set global
(fn () nil))
(prn:global) ; nil
; cannot safely inline
If a top-level expression isn't an assignment to a global, we scan through its subexpressions for an assignment to a global. For each global assignment, we add the global in the banned set. This prevents us from inlining non-trivially inlineable stuff:
(let c nil
(set reader
(fn () c))
(set writer
(fn (v) (set c v))))
After this scan through, we have a set of inlinable functions and a set of banned-from-inlining. We remove from the inlineable set those that are in the banned set. Then we perform inlining.
Inlining is then done this way: We scan the entire syntax tree and search for function applications, where the function position is a reference to a global variable in our final inlineable set. If it is, we then replace the application with a copy of the function's contents (the function's contents are always placed in a do-block, incidentally). We scan through the copy and look for references to the function's parameters, replacing the parameters with the appropriate expression in the function application. For vararg inlining, we may use the %cons primitives directly to build the vararg parameter.
The assignment to the global is retained. However, we can then repeat the unused-global-removal step (or move that step after this step) to remove the actual non-inlined version if it's not being passed as a function.
1. Local functions which have enclosing environments are harder to inline. If a function's environment is different to the caller's environment, we should replace all its free variables to references to its environment. For simplicity, you can inline only the combinators(functions which have no free variables).
2. When inlining, we should rewrite the parameters only if they are free to the function body, not bound by other local functions in the body.
1. I'm not proposing yet to inline local functions, especially those that close on environments. However, what algorithm would you propose for inlining local functions?
As an aside, closure-conversion makes the creation of environments explicit. Perhaps an inlining step can be added after closure-conversion?
I see. This is actually handled implicitly in the compiler's AST structure: during the conversion from the list form to the AST form, each local variable is given a unique ID:
(fn (x y) (g x y (fn (x) (h x))))
=>
(fn (x@1 y@2) (g x@1 y@2 (fn (x@3) (h x@3))))
; approximation of the AST structure, the AST
; is really a table of properties
So mindless replacement of the inlined version will simply replace x@1, not x@3.
Hmm. Turns out this is a real issue, but for a different reason: since local variables are given unique ID's, we should actually replace local variable ID's for subfunctions when a function is inlined several times:
eval-when would be used, I think, only to differentiate between compile-time and run-time. It's not a part yet of ArcN. Hmm. Lemme think more deeply about this.
An extra hash argument can implement both optional and named args. Representing optional args as varargs would complicate the source transformation when they are used simultaneously in the same function.
It's not at all complicated: just make the last argument take the value of your temporary rest argument. Edit: This is what I did in creating the p-m macro, and it works.
I'm not sure how adding an extra hash would work well if I might pass several different functions, with different optarg names:
Edit: My approach appears to be somewhat similar to what pg did in the Scheme version - cref ac-fn, ac-complex-fn, ac-complex-args, ac-complex-opt in ac.scm
Since code is spec (grumble) I think it might be better to follow mostly what pg did, although I dunno, not sure ^^. How can passing in a hash table seamlessly emulate the same functionality while improving performance?
For that matter my concern is with the expansion of PUSH and POP:
PUSH(x); y = POP();
=>
*sp++ = x; y = *--sp;
Can gcc peephole the above, completely eliminating the assignment to the stack (which is unnecessary in our case after all)?
y = x; //desired target
Without somehow informing gcc to the contrary, gcc will assume that writing to * sp is significant, even though in our "higher-level" view the assignment to * sp is secondary to transferring the data between two locations.
Well, with full optimizations on gcc (-O3), it doesn't change anything (at least in execution time, I didn't compare generated machine codes). Wow, gcc seems really clever. Now that I know how hard it is to implement a compiler, I can only applaud :)