Arc Forumnew | comments | leaders | submitlogin
Tag, type, and rep?
3 points by waterhouse 5003 days ago | 8 comments
I'm writing a basic Arc interpreter--in Arc, of course. I'm kind of surprised that I never actually did this until now, but now I am. I have some grandiose eventual plans for it, but in the meantime, I've noticed something.

Currently, my interpreter makes an expression whose car is 'fn evaluate to itself. In other words, functions are stored as lists. (No lexical scope yet; I just added global variables.) I'm about to add the Arc "data structures are functions" feature. That is, if xs is a list, then (xs n) is interpreted as (list-ref xs n).

But what if xs is the list '(fn (x) (+ x 2))? (This might actually happen if, say, I'm reading code--or running an Arc interpreter.) This is a conflict. Either that won't work properly, or functions won't work properly. There needs to be a way to distinguish the function and the list. I'm seeing the need to tag objects, and why Paul Graham put "tag", "type", and "rep" there in the first place. [1]

But, now, wait a moment. We need "type" to work, so that polymorphic functions ("apply" here) will work. We'll want "tag" so we can create and use new types. Why do we need a "rep" function? After all, functions, numbers, and conses are plain Racket objects, but the "type" function can recognize their types.

Furthermore, at least three Lisp implementations (probably all that compile to machine code) use some of the low bits of a machine word to identify an object as a fixnum[2], a character, a pointer to a cons, and probably as a pointer to anything else[3]. It is entirely possible for the object and its tag to be the same object in an extremely literal sense. If you wanted to tag a cons as a "function" or "compiled-closure", you could just change the tagging low bits; the underlying "apply" function would use "type" to determine what kind of cons the cons was and act accordingly (likely extracting the car/cdr in the process).

(...Actually, a better idea would be to have some user-controllable slightly higher bits that the builtin car/cdr wouldn't look at; the user would add his new tag and leave the "this is a cons" bits in place; then the compiler would feel comfortable, knowing that it is a cons cell (not an arbitrary integer) and it's safe to take the car of it. This takes a lot of bits, but on a 64-bit computer, a CPU word with a pointer in even the extremely forward-looking 48-bit address space of current computers still has 16 taggable bits left over. And even if you didn't want to supply bit-tags for all user-defined types--or they wanted non-symbol types--then you could escape by having a particular bit-tag mean "this is a pointer to an object containing the full tag followed by the actual object".)

In Racket, you don't exactly have access to the low bits of a cons cell for tagging purposes. Instead we have to make a compound data structure--tag things with a vector--and we need a "rep" function to extract the object itself. But this is an implementation detail, and it can and should be hidden.

It can be hidden by making all of the builtin functions just treat vector-tagged values as though they weren't tagged. E.g. we would say, in ac.scm:

  ;old version
  (define (ar-xcar x)
    (if (or (eqv? x 'nil) (eqv? x '()))
        'nil
        (car x)))

  ;new version
  (define (ar-xcar x)
    (cond ((ar-tagged? x) (ar-xcar (ar-rep x)))
          ((or (eqv? x 'nil) (eqv? x '())) 'nil)
          (#t (car x))))
If you pass "car" a tagged cons, or you pass "+" a tagged number, it just strips off the tag, finds that this object is indeed something it knows how to add, and just adds it. If desired, you could add polymorphism--redefine + to dispatch on numbers tagged as numbers mod 7 or something, which it would discover by calling "type" on it before stripping off the tag. However, the strong default would be to treat it like the thing that it is--when "car" is called on the "fn"-tagged list, it's probably being called by "ar-apply", which knows exactly what it's doing and would like "car" to shut up and give me the car, thank you.

(...Hmm. This vector representation seems to allow, in theory, for tags to be arbitrarily nested--or for, as people mentioned[4], to be any values. An ideal compiler would probably use low bits to tag the builtins, though, and if you just used simple symbol tags, maybe you could tell it to please allocate some more bit-tags for some heavily used types in your particular program.)

The current state of affairs is, if you want to tag cons cells as polynomials or something, and you want to call car or map or anything on them, you have to use "rep" every single time you reference your polynomial, or else you have to redefine "car" and "map" to do some dispatching. It seems pretty horrible to me. The arbitrarily nested tagging is technically possible here, but I think people haven't run with it very much because the interface sucks.

I'm kind of dumbfounded by how much better my proposal seems than the current scheme, and by how no one seems to have thought of it until now. Is there something wrong with this? Am I missing something? I've googled and grepped around, but haven't found anyone mentioning it.

Hmm, one possible flaw occurs to me. If something is (tag a (tag b x)), a user-defined function won't be able to extract the "b" using 'rep. However, I don't think that's much of a problem: if the thing that tagged the (tag b x) as an "a" wanted to allow its callers to do inheritance or whatever on the "b" tag, it could instead have returned (tag (list a b) x). This seems pretty good to me, in fact; you can do anything you want with types, but the default (if you just tag things with symbols) is the very simple "everything has a single type (and a single physical representation)". And the interface is probably even pleasant to use.

TLDR: "rep" is an implementation detail that can and should be hidden, and when it is hidden, we can all be happier.

[1] http://www.paulgraham.com/ilc03.html

[2] This is hella cool. In particular, if you use 000 [however many 0's] to tag fixnums, then you can do arithmetic with little overhead. All numbers are represented as 8 * their actual values; 8x + 8y = 8(x+y), so addition can be done with just an ADD instruction; and 8x * 8y = 64xy, so multiplication can be done with a MUL instruction plus a rightward shift by 3 bits. SBCL does this (note that 40 = 8 * 5):

  * (defun meh (x) (declare (optimize (speed 3) (safety 0)) (fixnum x))
    (the fixnum (+ x 5)))
  MEH
  * (disassemble 'meh)
  ; disassembly for MEH
  ; 0404275F:       4883C228         ADD RDX, 40                ; no-arg-parsing entry point
  ;       63:       488BE5           MOV RSP, RBP
  ;       66:       F8               CLC
  ;       67:       5D               POP RBP
  ;       68:       C3               RET
  NIL
[3] http://wiki.call-cc.org/man/4/Data%20representation#immediate-objects

[4] http://arclanguage.org/item?id=4855



2 points by rocketnia 5002 days ago | link

"In Racket, you don't exactly have access to the low bits of a cons cell for tagging purposes. Instead we have to make a compound data structure--tag things with a vector--and we need a "rep" function to extract the object itself. But this is an implementation detail, and it can and should be hidden."

Isn't it the other way around? Why are "low bits" not an implementation detail?

---

"If you pass "car" a tagged cons, or you pass "+" a tagged number, it just strips off the tag, finds that this object is indeed something it knows how to add, and just adds it."

If I define a new type, it makes little sense for it to inherit the qualities of the particular old type I'm using to implement it. That idea does nothing but expose implementation details.

---

"The arbitrarily nested tagging is technically possible here, but I think people haven't run with it very much because the interface sucks."

The interface indeed sucks a bit, 'cause [annotate 'foo _] doesn't wrap something that's already a 'foo. That means if you want to make a new type which wraps a single arbitrary value, then you need to do something like [annotate 'foo list._] so that you don't get mysterious bugs wrapping things of type 'foo. I end up having to do this all the time.

A long time ago (http://arclanguage.org/item?id=12076) I didn't like 'annotate and 'rep because I found them wordy, but I got over it. :-p Ssyntax goes a long way; once I started accessing compound types using things like (let (a b c) rep.foo ...), rep.foo.1, and rep.foo!fieldname, the "rep." part didn't seem so bad. I said something similar recently at http://arclanguage.org/item?id=14373, but I don't expect you to read that thread. ^^;

---

Speaking of that thread, though, this part might be relevant:

One could go all the way down, making a (deftype wrap-foo unwrap-foo isa-foo) form that eliminates the need for 'annotate, 'rep, 'isa, and 'type altogether. In this example, 'isa-foo would be a function that indicates support for 'unwrap-foo, and 'wrap-foo would be a simple data constructor that makes an encapsulated value with nothing but innate support for 'unwrap-foo. (http://arclanguage.org/item?id=14265)

I consider that to be a great way to get rid of implementation details, since someone can extend 'unwrap-foo and 'isa-foo for a new type if they want to. However, I'd also like to leave in 'annotate and friends, just so people who'd really like to dig into the details can do so.

---

"Hmm, one possible flaw occurs to me. If something is (tag a (tag b x)), a user-defined function won't be able to extract the "b" using 'rep. However, I don't think that's much of a problem: if the thing that tagged the (tag b x) as an "a" wanted to allow its callers to do inheritance or whatever on the "b" tag, it could instead have returned (tag (list a b) x)."

As I've mentioned recently (http://arclanguage.org/item?id=14176), I only expect 'type to determine the concrete representation of something, since it naturally arises (in my experience) that a value has one and only one concrete representation.

With the way I code, saying (tag a (tag b x)) means I'm making an "a"-typed value that has a "b"-typed value that has "x". I don't consider them all to be the same value, so the flaw you're talking about is relevant to me, and your example workaround solves a different problem.

-----

1 point by waterhouse 5002 days ago | link

> Isn't it the other way around? Why are "low bits" not an implementation detail?

The implementation detail is that "you can't tag an object without wrapping it in a compound data structure that old functions will have to be told to reach inside to get the original object".

"Hiding implementation details" isn't something to be done for its own sake. You want to hide implementation details when they're uglier than the semantics you want to convey; e.g. it's impossible to use a MUL instruction to check if the result of a multiplication is a bignum and allocate one and return a tagged pointer to it if necessary, but keeping track of all that is a pain; so instead we have a generic * function that hides all this scrambling around, and we can think of all integers as a single type of object. Exposing that stuff to a user who isn't extremely concerned with performance just sucks. It's bad because the interface sucks, and we'd identify it as "exposing implementation details" because that was the reason it happened.

On the other hand, does anyone want to hide the fact that lists are implemented as conses and that you can take their car and cdr? Not me. That stuff is useful.

> If I define a new type, it makes little sense for it to inherit the qualities of the particular old type I'm using to implement it. That idea does nothing but expose implementation details.

When you first define a new type, no preexisting functions will know how to handle the new type. You have two choices for what should happen in the absence of such instructions. Either they can give errors and fail (unless you use "rep", in which case they do their job as usual), the way they do now, or they can do their job as usual, the way it would be under my proposal. I think the latter choice is clearly--I would say strictly, unless you find the error messages useful--superior.

Now, if you define a new type, the first thing you'll probably do is define some functions that handle the new type, and they will likely need to use old functions that manipulate the objects that your new type is made of. (Perhaps the functions you're defining will be intelligent, wrapped versions of the old ones.) As mentioned, either they have to use "rep", or they don't.

And if you want to define a second new type that's supposed to be a subtype of the first one, then you can still build that on top of the provided "tag" and "type". (I think this answers your last point.) For example:

  ;Single inheritance. The path of inheritances is an improper list.
  ;Builtins will have atomic types; you could give something an
  ;atomic type to pretend it's a builtin.
  (def inherit-tag (typ x)
    (tag (cons typ (type x)) x))
  (def inherit-type (x)
    (if (acons (type x))
        (car (type x))
        (type x)))
  (def inherit-rep (x) ;more like "cast to supertype"
    (if (acons (type x))
        (tag (cdr (type x)) x)
        x))
  (def is-a (typ x)
    (if (atom (type x))
        (is typ (type x))
        (or (is typ (car (type x)))
            (is-a typ (inherit-rep x)))))
If you wanted "has-a" type semantics or something, you could figure out a similar way to define those. (Also, it occurs to me that you could use "tag" to identify your tagged objects, as in (tag 'inheritance-tagged (list (list 'type1 'type2 ...) 'implementation-type x)) or something.) I think you can build things on the type system just as much as before, and meanwhile the initial implementation is cleaner.

[Hmm. Come to think of it, given that the user doesn't have access to the internal "rep", and the built-in functions don't use it, it's meaningless (and useless) for (tag 'a (tag 'b x)) to return nested vectors; I'll make it return the same thing as (tag 'a x). So in the language primitives, everything has a single type--which would likely be just a symbol, but conceivably an Arc standard library might have things that do inheritance with lists of symbols. Anyway, this doesn't change anything I've said above.]

-----

2 points by rocketnia 5001 days ago | link

"The implementation detail is that "you can't tag an object without wrapping it in a compound data structure that old functions will have to be told to reach inside to get the original object"."

When I use 'annotate, what I want is to do this: Take a value that's sufficient for the implementation of a new type, and produce a different value such that I can extract the old one using an operation (in this case 'rep) that I can't confuse for part of the type's stable API.

Any "tag" semantics that overwrites the original type is useless to me for that.

I do think "wrap" is a better name for what I want than "tag" or "annotate," and that's the word I reach for when naming a related utility or recreating 'annotate in a new language. I don't mind if you say that tagging is something different. ^_^

---

"You want to hide implementation details when they're uglier than the semantics you want to convey"

From a minimalistic point of view, any accidental feature is ugly complexity, right? But I grant that having easy access to accidental features is useful for a programmer who's writing experimental code.

---

"On the other hand, does anyone want to hide the fact that lists are implemented as conses and that you can take their car and cdr? Not me. That stuff is useful."

I consider Arc lists to be conses by design, not just by implementation. As you say, conses are a positive feature.

On the other hand, the core list utilities would be easier for custom types to support if they relied on a more generic sequence interface, rather than expecting a type of 'cons. This isn't hiding the implementation so much as deciding to rely on it as little as possible, but it's somewhat causally related: If I don't rely on the implementation, I don't care whether or not it's hidden. If the implementation kinda hides itself (the way the body of a 'fn is hidden in official Arc), then I tend to leave it that way, and it can come in handy as a sanity check.

---

"And if you want to define a second new type that's supposed to be a subtype of the first one, then you can still build that on top of the provided "tag" and "type". (I think this answers your last point.)"

No... I said this:

With the way I code, saying (tag a (tag b x)) means I'm making an "a"-typed value that has a "b"-typed value that has "x". I don't consider them all to be the same value[...]

There's no subtype relationship going on here. For comparison's sake, if I say (list (list 4)), I'm making a cons-typed value that has a cons-typed value that has 4. At no point should my conses inherit from 4, and at no point should my "a" inherit from "x". That's just not what I'm trying to do.

---

In any case, just because I like the status quo (or my own spin on it) doesn't mean I shouldn't give your approach a chance too.

So, as long as (tag 'foo (tag 'bar "hello")) makes something nobody can tell is a 'bar, it's consistent for them not to know it's a 'string either. But then what will core utilities which can handle strings and tables (like 'each) do when they get that value? Do they just pick 'string or 'table and assume that type by default? If they pick 'string, won't that make it seem strangely inconsistent when someone comes along and expects (each (k v) (tag 'baz (table)) ...) to work?

-----

2 points by aw 5002 days ago | link

Fascinating. This sounds perhaps similar in goals to http://awwx.ws/label0, though with a very different implementation.

Labels in implementation are just a hash table with weakly held references to the objects that being "labeled". The idea is that for example we could have "type" be a label that could be applied to an object and the type function would return that value when the label was present; and then (in a way somewhat similar to your idea) I could have my polymorphic functions decide to do something different with my value based on its labeled type, but the built-in Arc runtime functions continue to use it in the same way.

Where this breaks down of course is that labels can't be applied to objects such as numbers and characters that don't have a unique identity. Your approach of automatically unboxing boxed values means that we could box a number and give it a type when we wanted to.

What I like about labels is that we aren't limited to just one dimension for annotating an object like we are when using tag.

Of course, if the Arc runtime automatically unboxed values like you suggest then I could in fact store the labels with the objects themselves, and it would also work with numbers. This would make labels an extension of your idea, allowing a property list to be attached to the box instead of just a type value.

Oh, and by the way, for your interpreter: I suspect you're going to have an easier time of it if you make a meta-circular interpreter where (fn ...) evaluates to a function value. But that's just speculation, it might be better doing it your way :)

-----

1 point by akkartik 5002 days ago | link

  (cond ((ar-tagged? x) (ar-xcar (ar-rep x)))
How would a specific type override car to do something else? Wouldn't car of queues simply return the entire contents rather than the top element[1]? What if I want car of a queue to return the top element? Or am I missing something?

[1] Assuming anarki's implementation of queues: https://github.com/nex3/arc/blob/f6c78e13577fda0b826d4bcd219...

-----

2 points by rocketnia 5002 days ago | link

I think waterhouse was saying that built-in functions would auto-unwrap their arguments by default, but it would just be a default, and you could override it with custom behaviors if you wanted to.

"If desired, you could add polymorphism--redefine + to dispatch on numbers tagged as numbers mod 7 or something, which it would discover by calling "type" on it before stripping off the tag. However, the strong default would be to treat it like the thing that it is --when "car" is called on the "fn"-tagged list, it's probably being called by "ar-apply", which knows exactly what it's doing and would like "car" to shut up and give me the car, thank you."

I guess in a way, this is another suggestion to add a default behavior for something that is currently an error, like letting the function call (1 2 3) return the list (1 2 3). But then waterhouse probably doesn't have that motive in mind, considering the opinions expressed at http://arclanguage.org/item?id=12841 and http://arclanguage.org/item?id=13827, so it's probably just a side effect, for better or worse. ^_^

-----

1 point by akkartik 5002 days ago | link

Ok, just changing the default seems reasonable.

This is kinda similar to Pauan's criticism that you shouldn't have to redefine every last primitive for new types that are like say tables.

-----

1 point by Pauan 5002 days ago | link

Yes, except I don't think this solves my criticism. It still seems like a good idea to get rid of rep, though. They're orthogonal concepts, and both ideas can be used at once.

Then again, if you're going the route of attaching arbitrary tags to objects... might as well go the full message passing approach, which is what Arubic is doing.

-----