Thursday, December 20, 2007

Me, Manifesto, GWT2007

Sigh, another two weeks gone by without a post. Still, that last post was a doozy, so cut me some slack, willya? Incidentally, my readership has increased by 50% in the last two weeks according to Feedburner. I am honored to have all of you paying attention to me, and I will try to be worthy of your eyeballs and neurons.

This post is going to be a bit of a catch-all: some about me, some about my last big-ass posting, and some about the GWT conference.

First, sorry I missed last week -- I got another cold (endemic in the wintertime with two young kids in the house), and my wife kicked our moving preparations into high gear. Yes, we're selling our house, for a wide variety of reasons I will discuss in personal email if you're interested enough to send me some. (Not in the comments, sorry!) We're shooting for having it on the market by March 1, which means there is no time at all to waste. I'll be doing no hacking for the next two and a half months, that's an absolute given. I'll keep up the blogging, but weekly may be tough; a post every two weeks WILL happen if it kills me!

Which is all too bad, in a way, because there's plenty to do in hackland. My manifesto was rather well received. The comments on the post itself have a number of intriguing links. I also posted about it on Lambda the Ultimate, where it got a number of other fascinating responses. Thanks to all who engaged with it in either location. I am sorry I won't be able to touch it until March first at the earliest, but that's the nice thing about research -- interesting ideas don't go away!

I also wanted to give an update on the GWT conference. My talk was quite well received, if the consensus of the speaker feedback was anything to go by. I appreciate the appreciation, to say the least. I did hear that a significant fraction (over 50%?) of the audience felt it was too technical. Evidently a substantial quorum of GWT2007 attendees were relatively unfamiliar with GWT at any level. I apologize to anyone who wasn't experienced enough to get the most out of my talk; I hope you can find helpful material on the GWT site and/or in any of the various GWT books. I've put my slides online for anyone interested. Evidently the talk will be YouTubed sooner or later, but I've yet to get a date on that.

The last morning of the conference, we had a participatory breakfast session where Joel Webber and I led a discussion of declarative UI and declarative data binding in GWT. One thing that was clear from my talk is that there are a lot of people doing form-like web applications, and GWT underserves that population right now. Joel spent some time going over the declarative UI work he's doing; he specifically mentioned that various internal Google customers want him to ensure that you can write a GWT-XML document that lays out a GWT widget tree with HTML fragments mixed in almost arbitrarily. This seems like a win insofar as Joel's talk on performance specifically mentioned that innerHTML is your friend; the DOM is faster at inserting HTML fragments than it is at handling individual DOM element insertions. So everyone keep an eye on Joel's stuff as it hits the incubator.
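To illustrate Joel's innerHTML point: the win comes from building the markup for a whole fragment as one string and handing it to the browser in a single setInnerHTML-style call, instead of creating and appending elements one at a time. Here's a sketch in plain Java with invented helper names (not real GWT API):

```java
// Sketch of the "innerHTML is your friend" idea: build one HTML string
// for a whole fragment, then hand it to the DOM in a single call (in GWT,
// something like element.setInnerHTML(html)). Names are illustrative only.
public class BatchedHtml {
    // Build an entire table as one string instead of N element insertions.
    static String renderRows(String[][] rows) {
        StringBuilder html = new StringBuilder("<table>");
        for (String[] row : rows) {
            html.append("<tr>");
            for (String cell : row) {
                html.append("<td>").append(escape(cell)).append("</td>");
            }
            html.append("</tr>");
        }
        return html.append("</table>").toString();
    }

    // Minimal escaping so user data can't inject markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        // One string, one DOM insertion point -- rather than six.
        System.out.println(renderRows(new String[][] {{"a", "<b>"}, {"c", "d"}}));
    }
}
```

The browser's HTML parser handles the whole fragment at once, which is typically faster than a loop of createElement/appendChild calls.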

I asked the table what kinds of form-data-binding approaches would be most useful to implement. The consensus seemed to be split. A number of people were familiar with the JSF-like structure of binding to backing beans and using expression language to specify the coupling from bean fields to the UI. I mentioned JSR 295 and various people expressed interest in that. There is a recent thread on the GWT contributors forum about some community work in this very area, which I look forward to investigating in depth (in early March! :-P ). Check that thread out if you want GWT to do this. I talk at more length there about a potential GWT EL parser.
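For anyone unfamiliar with the JSF backing-bean style discussed above, here's a toy sketch of expression-language binding: resolving a dotted path like "user.name" against a bean by reflecting on its getters. The bean classes and the resolve helper are invented for illustration; real EL and JSR 295 bindings do much more (setters, conversion, change notification).

```java
import java.lang.reflect.Method;

// Toy sketch of JSF-style EL binding: walk a dotted property path
// against a backing bean via its getters. Illustrative names only.
public class BeanPath {
    public static Object resolve(Object bean, String path) {
        Object current = bean;
        try {
            for (String segment : path.split("\\.")) {
                // "name" -> getName(), "user" -> getUser()
                String getter = "get" + Character.toUpperCase(segment.charAt(0))
                        + segment.substring(1);
                Method m = current.getClass().getMethod(getter);
                current = m.invoke(current);
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("cannot resolve " + path, e);
        }
        return current;
    }

    // Hypothetical backing beans for the demo.
    public static class User {
        public String getName() { return "Ada"; }
    }
    public static class Form {
        public User getUser() { return new User(); }
    }

    public static void main(String[] args) {
        // A UI field bound to something like "#{form.user.name}"
        // would pull its value this way:
        System.out.println(resolve(new Form(), "user.name"));
    }
}
```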

There was one gentleman (whose card I got, and subsequently lost; please contact me again, sir!) who was a forceful advocate of an XForms-like approach; evidently XForms lets you declaratively specify data dependencies within your model, and allows you to lay out interactions within your UI purely in your interface specification. This was pretty much news to everyone at the table. I will definitely have to look into XForms now. However, it seems to me that a JSF-like, JSR-295-like bean-binding approach covers many of the basic use cases people want, and should certainly be a near-term project.

OK, that's it for this week. Happy, happy holidays to you all!

Tuesday, December 4, 2007

A Growable Language Manifesto

Warning: this is by far the largest and most action-packed post I've ever made. Grab a cup of [insert favored beverage here], sit back, and enjoy. If you get headcramp from reading this in narrow-column blog format, there's another full-screen version here -- but please return to this post to leave your comments!

I've posted recently about the dynamic versus static flamewar and about recent research in extensible languages. The more I think about these ideas, the more compelling they get. It seems clear to me that these ideas are synergistic and point in the direction of a new kind of programming environment -- one which could potentially offer the ease of dynamic languages, the safety of static languages, and an altogether new level of extensibility (both for enhancing the type system and for allowing safe metaprogramming).

So I want to lay out a manifesto here for a growable programming language. Or perhaps it's more like a toolbox for language construction. Or perhaps it's a framework for experimenting with extensible syntax and extensible analysis. In any case, here is the vision. At the end, I include extensive references to the research that's inspired these ideas.
  • A growable language should be opt-in. It must support a graceful spectrum of usage styles. A growable language will necessarily be multi-paradigmatic, since it will be extensible with features from various fields of programming. The language should implement a dynamically-typed core on which all static analysis is built, to allow programmers a straightforward and lightweight syntax for prototyping.

  • A growable language should be increasingly powerful. It should support a rich and expanding variety of static analyses and dynamic checking, to give maximal benefit to the programmer who wishes to leverage these features. Over time, as more program analyses become possible, a growable language should gracefully integrate them. A growable language must continually increase the expressive power of the programmer, both in terms of what the programmer can say and in terms of what the environment can help the programmer prove about their program. A growable language should have a powerful analysis framework as a foundational component, such that other analyses can share that basic capability; wherever possible, additional analyses should require only incremental effort by leveraging infrastructure already created.

  • A growable language needs a declarative metalanguage. Current languages do not define inherent mechanisms for syntax extension and for type extension. The syntax and the type system are both baked into the language's compiler. A growable language needs to lift the specification level of the language itself to be both declarative and modularly defined, so that syntax and type analyses can both be expressed and implemented as layered extensions.

  • A growable language needs partial typing. Certain analyses are necessarily incomplete or even undecidable. A growable language should permit such type systems, and should be able to fall back on dynamic typing whenever analysis is inconclusive. This lets programmers work freely without having to continually meet the demands of the type system, yet supports graceful type annotation to enhance static safety at the programmer's discretion. (Without partial typing, the opt-in goal cannot be met.) The language should ideally provide clear feedback as to when its analyses are conclusive or inconclusive, and ideally should identify the sources of inconclusiveness so the programmer can either annotate them appropriately or deliberately ignore them.

  • A growable language needs layered types. A growable language should be able to extend itself with primitive types, dependent types (e.g. data-dependent subrange types), traits or other parametric or abstraction-based genericity mechanisms, and multiple flavors of type qualifiers. Integers should be as much of a language extension as non-null types, tainted types, or range types. A growable language requires an extensible subtype lattice.

  • A growable language needs inferrable types. To avoid drowning in explicit type declarations, the language should be able to infer types wherever possible, and the programming environment should support controllable visibility for the inferred type information. Without inferred types and environmental support for controlling analysis visibility, a growable language cannot scale for users; being able to selectively ignore some or all of the (many) analyses is critical.

  • A growable language needs explicit, optional type annotations. An extensible analysis framework will be able to infer or deduce a great deal about the actual behavior of the program. But the actual behavior of the program may or may not correspond to the programmer's intent. The programmer's explicit annotations express the programmer's expectations. A programmer might look at the analyzed annotations and make some of them explicit -- perhaps they reflect properties of the code that the programmer considers important after the fact, and that the programmer now wants to enforce. Or a programmer might add explicit annotations during development, such that the system can confirm they are valid and warn when they are violated. Explicit annotations at module boundaries -- whether programmer-generated or system-generated -- are likely to aid in separate compilation and in module documentation.

  • A growable language must be efficiently implementable. As more language features are added -- more qualifiers, more types, more syntax -- the language must still be efficient and usable. This applies both to programs written in the language (which should have performance competitive at least with Java or the CLR) and to programming environments for the language. The latter requires aggressive attention to analytical optimizations and to multi-threaded analysis frameworks. As the language's analysis structure grows, the language's programming environments must be able to leverage multicore architectures to ensure consistent responsiveness for users. Moreover, the language should be continuously analyzing in the background; incremental feedback as the user edits should be available for all language features and extensions.

  • A growable language must have a unified internal representation. Concrete syntax, abstract syntax, type and name bindings, type qualifier propagation, type parameterization, dependent range type constraints, and logical queries over the program's alias structure should all leverage a single internal representation of the program. This maximizes the reusability of internal language implementation code and ensures consistency in analytical derivations. Where multiple representations are necessary, the derivation rules must be clear and consistent.

  • A growable language must promote static metaprogramming. Fully dynamic metaprogramming -- runtime extension of classes and objects with arbitrary code, or even unrestricted dynamic reflective access to arbitrary methods -- is almost impossible to analyze effectively. To give a growable language's extended type systems maximum exposure to the actual behavior of the code, static metaprogramming techniques must be definable, extensible, and compatible with the rest of the language's structure. One would hope that the very extensibility techniques that implement the language itself would be usable for static metaprogramming.

  • A growable language must support diverse analyses. Some analyses are most naturally expressed as systems of recursive equations over an abstract syntax tree. Others can best be expressed as logical queries over a program's alias graph. Ideally these could both be expressed naturally in the metalanguage.

  • A growable language must be analytically composable. This is likely the single most technically ambitious goal. Traditional compiler development involves subtle and explicit scheduling tradeoffs, layering multiple analyses through manual interleavings that can be fragile or non-optimal. A growable language with a declarative metalanguage needs an analysis engine that can automatically schedule multiple, potentially interleaved analyses, leveraging parallelism where possible to optimize non-dependent analyses. Achieving this goal will be immensely difficult, but proportionately valuable. Interestingly, here is where the metalanguage itself will require many of the same extensibility properties as the language it describes; meta-level closure -- implementing the metalanguage using the language itself -- will be a holy grail for this language design.
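To make the partial-typing goal above concrete: when an analysis is inconclusive, a value keeps a generalized Dynamic type and the check moves to runtime, roughly as in the gradual/hybrid typing work cited in the references below. Here's a toy Java sketch of that fallback (all names invented, not from any real gradual-typing library):

```java
// Toy sketch of partial typing: a value the analysis couldn't type
// statically is held as a Dynamic, and the type check happens at the
// point of use, at runtime. Illustrative only.
public class Dynamic {
    private final Object value;
    private Dynamic(Object value) { this.value = value; }

    public static Dynamic of(Object value) { return new Dynamic(value); }

    // The deferred check: succeeds if the runtime type matches,
    // fails loudly (like a cast error) if not.
    public <T> T as(Class<T> type) {
        if (!type.isInstance(value)) {
            throw new ClassCastException(value + " is not a " + type.getSimpleName());
        }
        return type.cast(value);
    }

    public static void main(String[] args) {
        Dynamic d = Dynamic.of("hello");   // analysis inconclusive: stays dynamic
        String s = d.as(String.class);     // runtime check passes
        System.out.println(s.length());
        try {
            d.as(Integer.class);           // runtime check fails
        } catch (ClassCastException expected) {
            System.out.println("caught: " + expected.getMessage());
        }
    }
}
```

In a real gradual type system the compiler inserts these casts automatically wherever the static analysis bottoms out at "?".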

Is the above language even possible? Is it too much of a stretch -- particularly the "unified internal representation" and "analytically composable" goals? Maybe so. I'm definitely not an expert at programming language implementation; the only compiler I ever wrote was back in high school. So this may be ridiculously unrealistic in whole or in part. I welcome feedback on which areas of this manifesto are more or less plausible.

Overall, consider this my stab at answering Paul Graham's challenge to ponder the hundred-year language. Given current trends in programming language development, it seems that languages of the future will transcend being "languages" as we know them and will become more like unified environments for language creation and extension. Arguably, this vision has a lot in common with intentional programming, which doesn't bode well, since the intentional guys have been in stealth mode for almost fifteen years and nothing has publicly come of it. But that doesn't mean the general direction isn't interesting, any more than the slow progress of Chandler means that a unified and flexible personal information manager isn't worth pursuing.

I promised references. Here they are:
  • opt-in -- Gilad Bracha's pluggable type systems paper is the origin of this goal. Bracha forcefully posits that static type systems are all necessarily incomplete and that dynamically typed languages have necessary flexibility. Meijer makes a similar point in his "static typing where possible, dynamic typing when needed" paper. I'm not clear that Bracha's extreme view -- that all static analysis must be kept out of the language kernel -- is the correct one, given the potential performance cost, but I suppose RPython provides an encouraging counterpoint.

  • increasingly powerful -- There are increasingly many varieties of static analysis being developed primarily for Java. One recent paper points out that its framework could straightforwardly be implemented on top of other analyses with similar intraprocedural resolution. The BDDBDDB framework has already been used for implementing taint analysis, and the JastAdd system for non-null Java type inference. In general it seems there is a lot of opportunity for shared infrastructure here. Also note that support for excellent error messages and great visibility into analysis results (and analysis failures) will be critical for usability. See Grimm's paper titled Systems Need Languages Need Systems for some forceful advocacy here.

  • a declarative metalanguage -- Some good examples of metalanguages for syntactic language description are the OMeta pattern-matching language for executable grammars, Gilad's executable grammars in Newspeak, and the Rats! extensible parsing system. A good example of an extensible language for static analysis is the JastAdd extensible Java compiler with its support for defining rewritable circular reference-attributed grammars... they implemented Java 1.5 generics as a modular declarative compiler extension, which proves their point to me, anyway!

  • partial typing -- The two best examples of this that I know of are the gradual typing work of Siek and Taha, and the hybrid typing for undecidable type systems work of Flanagan, Freund, et al. In both cases, a static type system is enhanced with a generalized type (named Dynamic or "?"), which is inferred to be a more specific static type where possible, and otherwise cast at runtime to preserve dynamic type safety.

  • layered types -- The already-mentioned JastAdd system is perhaps the best example of a structure which permits layering additional analyses. The extensible Java compiler Polyglot is another.

  • inferrable types -- My original thinking about this entire line of research originated in a blog post from a couple of months ago where I realized that some implementations of type qualifiers -- for example, a "tainted" type qualifier in C++ -- would ripple through the whole program due to mandatory explicit static typing everywhere. It's noteworthy that many type qualifier analyses for Java are not based on explicit syntax. For example, the taint analysis based on the BDDBDDB framework does not require explicit propagation of tainted or untainted declarations, yet it derives such information throughout the program's structure. An environment which made the results of that analysis visible -- and traceable -- at every program point would let the programmer see the flow of tainted values without having to explicitly declare them.

  • explicit, optional type annotations -- Programmers must also be able to add explicit qualifiers, since the programmer's intent may or may not match the analysis. The analysis may be incomplete, in which case the programmer needs to provide more information; or the analysis may be conclusive, in which case the programmer may want to confirm it and have it enforced at that location (e.g. if the program changes such that the property is no longer true there, the language would signal an error). The programmer's explicit intent and the analyzer's implicit understanding should be able to be flexibly cross-checked. I'm not aware of any inference systems that support this fully; they seem to be either purely inference-based (e.g. JQual) or purely annotation-based (e.g. a C++-based type qualifier system discussed in the "Extending Type Systems in a Library" paper from LCSD '06).

  • efficiently implementable -- This is obviously enormously difficult, insofar as analyses can in general be interdependent. There is a great tension between layering analyses (for separability) and weaving them (for mutual dependence and synergy). See the "analytically composable" goal below. In general, I wouldn't be surprised if aggressive parallelization of a language analysis / compilation framework required software transactions to support optimistic and incremental analysis by multiple threads.

  • a unified internal representation -- I mean something along the lines of Grimm's declarative extensible syntax trees, a concept proven by his weaving of Java and C typechecking into a single system. The JastAdd framework is another example; there is no separate symbol table in JastAdd, since name bindings become reference links in the abstract syntax tree (which is really more of an abstract syntax graph). Note that JastAdd's own declarative language for extending the syntax tree is fundamentally similar to open classes, in that subsequent extensions can directly modify the structure of already-defined syntax node classes. This seems to inhibit modular development of language extensions, but modular language extension development is really hard anyway.

  • promote static metaprogramming -- This goal is about ensuring the entire program text remains analyzable, and about permitting domain-specific languages to be implemented in the same structure used for extending the base language. See OMeta's SQL extensions or C# 3.0's language-integrated queries (LINQ), which reify some expressions as syntax trees visible to the runtime system. The Google Web Toolkit's support for extensible code generation is another example, as is the RapidMind system for creating parallel compute kernels from C++ code. Finally, there's the RPython system which creates a "metaprogramming startup phase" for Python programs, followed by a static compilation phase yielding 100x speedups. Interestingly, this whole goal contradicts Paul Graham's 2001-era view that "true macro systems aren't compatible with static type systems."

  • support diverse analyses -- The two best already-cited examples here are the JastAdd grammars formalism and the BDDBDDB Datalog-based analysis specification. These are radically different but potentially very synergistic. I'd be fascinated to see whether there's some deeper commonality between them....

  • analytically composable -- The JastAdd framework seems the best example of a declarative structure that supports automatic weaving of multiple analyses. For evidence to this effect, consider that the JastAdd folks claim that the Polyglot team is reimplementing their framework in JastAdd to avoid the difficulties of scheduling dozens of analysis passes.

I plan to start experimenting with some prototypes of an Eclipse plugin for an extensible language framework along these lines, likely starting with something much like OMeta and extending it with a JastAdd-like rewritable grammar formalism. This will be open source all the way. I would enthusiastically welcome all pointers to similar projects, all interest in helping with such a framework, and all critical comments on all or part of this manifesto!
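For readers who haven't seen OMeta or Rats!, the core "executable grammar" idea is that each grammar rule is just code, so rules compose and extend like any other functions. A toy sketch of that idea (nothing like full OMeta; all names invented):

```java
// Toy sketch of the "executable grammar" idea behind OMeta and Rats!:
// each grammar rule is an ordinary function from input position to a
// parse result, so grammars compose and extend like code. Illustrative only.
public class ExecutableGrammar {
    // A rule takes (input, position) and returns the new position, or -1 on failure.
    interface Rule { int match(String in, int pos); }

    // Match a single character.
    static Rule ch(char c) {
        return (in, pos) -> pos < in.length() && in.charAt(pos) == c ? pos + 1 : -1;
    }
    // Match a, then b.
    static Rule seq(Rule a, Rule b) {
        return (in, pos) -> { int p = a.match(in, pos); return p < 0 ? -1 : b.match(in, p); };
    }
    // Match a, or else b.
    static Rule alt(Rule a, Rule b) {
        return (in, pos) -> { int p = a.match(in, pos); return p >= 0 ? p : b.match(in, pos); };
    }
    // Match one or more repetitions of a.
    static Rule many1(Rule a) {
        return (in, pos) -> {
            int p = a.match(in, pos);
            if (p < 0) return -1;
            for (int q = a.match(in, p); q >= 0; q = a.match(in, p)) p = q;
            return p;
        };
    }

    public static void main(String[] args) {
        Rule digit = alt(ch('0'), alt(ch('1'), ch('2')));  // digit ::= '0' | '1' | '2'
        Rule number = many1(digit);                        // number ::= digit+
        Rule sum = seq(number, seq(ch('+'), number));      // sum ::= number '+' number
        System.out.println(sum.match("12+201", 0));        // matches the whole string
        System.out.println(sum.match("12-201", 0));        // fails: -1
    }
}
```

Because a rule is just a value, "extending the grammar" is just defining a new rule in terms of old ones, which is the property that makes this style attractive for a growable language.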

(Disclaimer: our family will also be moving out of our house in the next three months, so progress may well be slow :-)

Monday, December 3, 2007

GWT presentation tomorrow!

I've been working on a truly monster blog post, nothing short of a manifesto for a growable programming language. It's about four pages long now and roughly 90% complete. All I have to do is Google up the two dozen reference links.

But I can't quite get to it tonight, because tomorrow is my presentation at the GWT conference and I've been tweaking my slides semi-compulsively.

So, hopefully I'll see several of you tomorrow at the conference, and stay tuned later this week for my Big Fat Language Rant! Your patience is very deeply appreciated.

Monday, November 26, 2007

Gaming, Turkey, and Burnout

(Warning: relatively low technical content ahead!)

Rats, missed another blog-posting week here. A week ago I finally got my GWT conference presentation done. I was rather nervous about whether I was hitting the right technical level, so I mailed it to a few GWT team members, and Bruce Johnson (the team lead) wrote back and said, basically, "It's perfect." Thanks, Bruce! Eased my mind tremendously. The conference is coming up next week.

Then on Tuesday, Mass Effect shipped, and I was instantly lost in space. I've gone back and forth on whether to talk about gaming here, but this title pushed me over the edge. I've been a computer gamer ever since I've been a computer user -- my very first experience with a computer was playing Spacewar on a PLATO timesharing terminal at my friend's father's college. And gaming has become a true lifelong hobby. Mass Effect is a great example of why that is.

Programmers inherently love systems. A truly intrinsically motivated programmer will spend hours thinking about how something fits together, and more hours tinkering with it and seeing what happens. A great game also has a systemic internal structure that rewards tinkering and exploration. So it's natural that so many programmers are gamers.

Moreover, many programmers are thrilled by advances in technology. The more powerful our computers become, the more we can do with them. And computer games are definitely the most widely visible hardware-pushing applications. Just take a look at some of these screenshots. A $280 Xbox 360 can now do real-time facial animation that looks a lot better than ANYTHING being done five years ago (real-time or pre-rendered). And it qualitatively changes the nature of the game, when your character and the zany aliens they're talking to both have facial emotions and lip-syncing that are so much closer to realistic that you can almost lose yourself in the illusion. I'm not claiming that Bioware's crossed the uncanny valley just yet, but they've definitely taken huge strides in that direction, and they've hooked me but good.

I expect to be a gamer for the rest of my life, since games just keep getting cooler and cooler as hardware gets more and more powerful and we learn more about what to do with it. It's a perfect hobby for a systems-oriented, game-addicted technophile.

So Mass Effect took over my after-the-family's-asleep life, and stole my blogging cycles last week. Thanksgiving was also pretty hectic -- we cooked an 18-pound turkey for our church get-together, which took me all day while my wife juggled our three-month-old and our (almost-)three-year-old. That left us pretty wrecked and we laid low the rest of the weekend.

Meanwhile, on the hacking front, I have continued to feel blocked by the whole "Seam 2 broke my code" issue. Actually, technically, it's that Seam 2 broke the G4JSF code. That code was originally written by the Ajax4JSF team, but they have abandoned it. So the question for me is, how motivated am I to fix it? (And, potentially, to continue fixing it as Seam, JSF, and GWT continue to evolve?)

I originally started on this whole Seam / GWT / JSF project because I had ambitions to create a peer-to-peer distributed blogging system. I made a lot of progress on that system (at least as far as creating a Seam-based model for representing version trees with Java persistence, a la Mercurial). Then I wanted to start on the UI. Which led me to GWT, and the realization that the existing G4JSF library didn't couple gracefully with GWT RPC. So I patched GWT, and then G4JSF, and that led me to presenting to the GWT team at JavaOne last year, which then got me the opportunity to speak at the GWT conference next week. All of that was quite unexpected and quite appreciated -- particularly the speaking opportunity!

BUT, going back to my original project, the peer-to-peer distributed object system lost its mojo somewhere over the last year. It led me down the GWT / Seam / JSF path, which was really interesting and connected me to the open source world like never before... but somehow the journey became the destination, and the original goal no longer feels as interesting. Frankly, I'm not feeling very enthusiastic about being the sole owner of the G4JSF code base, given the evident cost of keeping it current and given the lack of other support from the JSF community. Maybe having a second baby also cut my energy level.

In order to avoid burnout I need to work on the projects that are most interesting to me, which right now is looking like prototyping some extensible programming language ideas that have been burning a hole in my brain for the last two months.

So, I'm going to set aside the G4JSF work and I'm going to leave the Seam/GWT/JSF integration project in its current state. I am more than willing to work with anyone (at Red Hat or elsewhere) who wants to pick it up again and fix it up to work with Seam 2 (and JSF 2 when it comes along), but I can't drive it further on my own. Anyone who was using that integration library and who is interested in fixing it, feel free to email me (rjellinghaus at gmail dot com).

This blog will shift focus away from Seam/JSF and towards programming language research and the experiments I'm doing. Hopefully that'll still be interesting to my loyal readers -- I'll do my damnedest to make it so! Because I'll tell you one thing, it's pretty fascinating to me :-)

Tuesday, November 13, 2007

Android hits the fan

Unless you've been buried under a boulder, you've heard about Android. It's nice to know what my secretive Google mobile JVM guru friend has been working on all this time :-)

The Android forum is awash in C / C++ developers ranting about how lame Android is because it doesn't support raw access to the hardware. It's too bad, really. I feel for them, and I agree: Android apps are going to be slow and clunky compared to native apps.

But Google's making a play to build a whole new kind of mobile ecosystem here. Google's betting on increasing mobile CPU power (certainly an excellent bet), and Google knows that if Android develops a personal-computer-like reputation for flakiness, it will fail and fail bigtime. I would argue that a big reason no mobile platform has really taken off yet (besides crazy licensing costs) is that native apps are simply flaky, and the more of them you run, the flakier your hardware gets. Since Google is trying to greatly increase the hardware diversity of Android devices, allowing raw native access combined with open developer access would rapidly drive the stability of the system to zero.

Native developers hate to hear this, because they spend their lives tuning for performance and chasing crash bugs. I'm sure many of them do write good solid apps. But I'm also sure that making native code crash-proof, especially across a wide spectrum of hardware, is very, very hard to do. And if you want to throw your platform open to anyone who wants to ship an app for it, you're guaranteed that many if not most of those people will lack the uber native skillz to achieve that kind of cross-platform stability.

So HOORAY GOOGLE! for making Android a fully managed platform! Note also that Microsoft sees the writing on the wall; they're working on a fully managed operating system, for crying out loud.

End rant. The most interesting thing I've yet learned about Android is that it is fundamentally trying to end-run around Sun's IP restrictions on Java ME. Stefano Mazzocchi breaks it down in exemplary fashion. Super cool! It's going to be really strange to see Sun and Google at each other's throats, but given how close the two companies are, and given how heavily Android leverages Java-the-language while totally subverting Java-the-runtime, AND given how anti-patent-war Sun is trying to be, I hope for some kind of peaceful reconciliation there. Microsoft, on the other hand, will go apeshit. Let the fireworks begin!

Wednesday, November 7, 2007

GWTJSF broken with Seam 2, sorry folks!

So I've gotten past a number of the issues I blogged about previously, and I now have a Tomcat-only no-EJB3 version of the Seam blogging example. I've tried rolling the GWTJSF code from my original example back into it, and it looks like Seam 2 has thrown another wrench in the works.

As Chris Alfonso at JBoss warned me, Seam 2 installs the SeamPhaseListener by default. The Seam/GWTJSF integration as it previously existed needed its own phase listener, as explained in my article on the topic. This delegating-PhaseListener technique seems to be impossible in Seam 2. (Unless I'm missing something.)
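For context, the delegating-PhaseListener technique is the classic delegation pattern applied to JSF lifecycle callbacks: the one listener the framework knows about forwards each callback to registered delegates. Here's a self-contained sketch of the pattern; the PhaseListener interface below is a simplified stand-in with invented names, not the real javax.faces.event.PhaseListener.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the delegating-PhaseListener technique: one registered
// listener forwards each lifecycle callback to a list of delegates.
// The interface here is a stand-in, not the real JSF API.
public class DelegatingListenerSketch {
    interface PhaseListener {
        void beforePhase(String phase);
        void afterPhase(String phase);
    }

    // The single listener the framework sees; everything else hangs off it.
    static class DelegatingPhaseListener implements PhaseListener {
        private final List<PhaseListener> delegates = new ArrayList<>();
        void addDelegate(PhaseListener l) { delegates.add(l); }
        public void beforePhase(String phase) {
            for (PhaseListener l : delegates) l.beforePhase(phase);
        }
        public void afterPhase(String phase) {
            for (PhaseListener l : delegates) l.afterPhase(phase);
        }
    }

    static List<String> demo() {
        DelegatingPhaseListener root = new DelegatingPhaseListener();
        List<String> log = new ArrayList<>();
        root.addDelegate(new PhaseListener() {  // e.g. the GWTJSF listener
            public void beforePhase(String phase) { log.add("before " + phase); }
            public void afterPhase(String phase)  { log.add("after " + phase); }
        });
        // The framework would invoke these around each lifecycle phase:
        root.beforePhase("INVOKE_APPLICATION");
        root.afterPhase("INVOKE_APPLICATION");
        return log;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The trouble described above is that Seam 2 registers its own SeamPhaseListener by default, so there's no obvious place to install the delegating root listener anymore.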

With the GWT conference breathing down my neck, I may need to sideline this for a while until I get a fallback presentation together -- I was hoping to be able to demo GWT 1.5 (if it gets done in time!) with JSF 1.2 and Seam 2, but I can't count on that at this point; there are only four weeks to go and I don't have the hacking time I'd need. So it's Plan B time. I'll take a week or so and get the presentation together with the previous version of the code, and then I can pound on Seam 2 some more.

Also, it looks like the GWTJSF code has officially been left behind in the Ajax4JSF 1.1.1 code tree, which has been rolled into the RichFaces project -- except for GWTJSF, which is nowhere to be found in the RichFaces code tree. GWT can definitely be considered a competitor to RichFaces, so I guess I see why JBoss isn't supporting GWTJSF, but still it's a shame. It'd be great to see some more synergy there, but I'm not sure how to bring it about. (other than blog about it and rock the boat a little ;-)

Anyway, sorry to those who've been wanting to use GWTJSF with Seam 2 -- it's going to be a longish road. All help very much appreciated if anyone else has cycles to hack on this with me!!!

Monday, November 5, 2007

Friendship and fraternal aid

Last week I griped about some hassles I was having with Seam 2, Java 6, Tomcat, and JBoss Embedded. I grumpily grouched about the grim gloom of an open source world where things break and people don't get along very well, and how non-fun it is.

And hey presto, what should happen but two Red Hatters read my kvetching and chimed in with some downright helpfulness. First, Jason Greene mentioned this workaround for the bizarre Java 6 ClassNotFoundException I mentioned. Seems there is a magic JVM flag you can provide to revert to the Java 5 classloading behavior if it causes you too much trouble. That's good to know, and I appreciated the clue.
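If I understand the issue correctly, the flag in question is `-Dsun.lang.ClassLoader.allowArraySyntax=true`, which restores the Java 5 behavior where a ClassLoader would accept array-descriptor names like `[Ljava.lang.String;`. Here's a small demonstration of the behavior split (on Java 6 and later, without the flag):

```java
// Demonstrates the classloading change behind the "[Ljava.lang.String;"
// error: in Java 6, ClassLoader.loadClass stopped accepting array syntax,
// while Class.forName still handles it. The JVM flag
// -Dsun.lang.ClassLoader.allowArraySyntax=true reverts to Java 5 behavior.
class ArraySyntaxDemo {
    public static void main(String[] args) throws Exception {
        // Class.forName handles array descriptors fine:
        Class<?> ok = Class.forName("[Ljava.lang.String;");
        System.out.println(ok == String[].class);   // prints "true"

        // But asking a ClassLoader directly throws ClassNotFoundException
        // on Java 6 and later (it worked on Java 5):
        try {
            ClassLoader.getSystemClassLoader().loadClass("[Ljava.lang.String;");
            System.out.println("loaded (Java 5 behavior, or flag set)");
        } catch (ClassNotFoundException e) {
            System.out.println("ClassNotFoundException (Java 6 behavior)");
        }
    }
}
```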

Second, Chris Alfonso -- who's been asking me when I'm going to get around to fixing my GWT/JSF/Seam demo to work with Seam 2 -- said that he's got a Maven 2 version of the GWTJSF demo building as an EAR that runs under JBoss with EJB3. Which is cool, because I now have a plain old Hibernate version of the Seam 2 blogging demo that deploys under vanilla Tomcat, so it's nice to have both flavors. (Chris, can you send me your version? I'd like to poke at it some -- I won't post it without your say-so.)

So that's the great thing about the open source community -- it is a community, and people offer each other helping hands. Thanks for brightening my week, guys! You put the fun back in :-)

GWT RPC presentation, by popular demand

A couple of months ago, I was honored to learn that I'll be doing a presentation at the upcoming Pearson GWT conference in San Francisco in early December. I'm quite grateful for this opportunity.

What's more, I just learned that they're giving me a discount code! Head to the site and enter code GW-JELL for $100 off the registration fee. Woo!

Originally I contacted Pearson with the suggestion that I might do a GWT/JSF talk, but there wasn't enough room on the agenda for that. Instead they'd planned to have Miguel Mendez (the author of the GWT RPC code) do a presentation about RPC in GWT. It turns out that he can't make it. So I'll be presenting on it instead. It'll be impossible for me to do as good a job as Miguel would do, but I'll give it my best shot. (Thankfully it's the area of GWT that I understand better than any other!)

So the next question is, what exactly should I discuss? I'd like to throw this open to the community to some extent. If you were going to a presentation on GWT's RPC, what would you want to learn about?

I'll start by listing the things I personally want to know more about (from most to least):
  • how to use GWT RPC with Hibernate or other Java persistence frameworks (particularly interaction with lazy loading)
  • best practices for scoping the object graphs you send in RPC (to manage problems of updating server state in a large object graph)
  • Java 5 support in GWT RPC: how generics are handled in GWT 1.5
  • organizing your RPC code for best readability and easy sequencing
  • integrating with JSF, Spring, and other server frameworks
Obviously this is way too much already for a mere hour-long presentation, but that's my problem, not yours :-)

I'd appreciate it if people could post their favorite topics. If you can list them in order from most urgently interesting to least, that would help me.

I'll be working with Miguel to refine the presentation over the next couple of months (thanks, Miguel!). There'll also be some blog posts coming out of it. And I'll see you in San Francisco in December!

Thursday, November 1, 2007

Finger-pointing and frustration

One of the tough things about open source is that the world never stands still. Actually, it's not just open source, it's software in general, but open source exacerbates it. See, one of the main motivations for open source developers is fun and the enjoyment of sharing something cool you did. So you'd like to be able to do something cool and then have it stay cool.

But the problem is, once you get something cool working, it's almost guaranteed that it's going to stop working soon. Why? Because everyone else is doing other cool things. Including all the developers whose projects yours depends on. So the odds approach 100% that before long your cool thing is going to break with the latest version of FooBarLib. And broken things are no longer cool.

And fixing your cool thing, that broke for no fault of your own and for reasons you don't agree with, is the absolute opposite of fun.

Even large developers are vulnerable to this. Right now I'm banging my head against trying to get my Seam + GWT example working with the latest 2.0 release of Seam. But Seam made a number of changes that are frustrating me rather deeply right now. For one thing, they changed the packaging for their embeddable EJB support such that you have to install the JBoss embedded EJB container directly onto your Tomcat. No more delivering a nice simple WAR file that uses Seam and EJB3, now it's a multi-step process that leaves your Tomcat fairly JBossified.

The nominal rationale for doing this is that it lets the EJB3 services be shared by all the apps inside that Tomcat instance. So you might think, great, now my Seam+EJB3 WAR file is 6MB instead of 20MB. (Which is indeed a problem with the demo version on my page -- it's stinkin' HUGE.) But, BUT, it turns out you can't actually run multiple Seam apps under the Embedded EJB3 container!

Why not? Well, because Bill Burke and Gavin King, both of whom work at JBoss and both of whom are fairly bullheaded, can't agree on whose problem it is that multiple Seam apps collide in the Embedded EJB3 component registry. Not only that, but the Embedded EJB container's development has stalled. So, Tomcat is a seriously second-class citizen as far as Seam is concerned now. Of course from a JBoss perspective this is arguably good, because JBoss doesn't get support revenue from Tomcat installations. For me, though, it sucks, because I don't want my Seam / GWT example to require JBoss.

And the icing on the cake is that the Embedded JBoss container doesn't run under Java 6. I like to run on the latest Java. It's just a funny preference I have, I don't know why. But JBoss does classloader tricks that Sun apparently considers invalid loopholes, which Sun closed in Java 6. This results in charming errors like:
ERROR 01-11 21:56:07,921 ( -Error installing to Instantiated: name=DeploymentFilter state=Described java.lang.IllegalStateException: Class not found: [Ljava.lang.String;

And JBoss and Sun are finger-pointing about who broke whom. So Seam isn't guaranteed to run under Java 6 at all, and JBoss and the Seam team consider it Sun's problem, not theirs.

So what am I going to do? I've got a conference presentation coming up and I wanted to use this demo with the latest Seam and latest GWT. But it's starting to look like I'm going to have to sink way too much time into mucking about with Seam for reasons I'd rather not have anything to do with. I'll probably follow the path of least resistance, which is just to roll back to Java 5 and grin and bear it. It's definitely demotivating, though. Definitely. Demotivating.

Monday, October 29, 2007

OOPSLA 2007 - Languages Gone Wild

It has come to my attention that my writing here has been a bit boring. Dry. Stuffy. Well, Kevin Bourrillion has shown me the way. Time to liven it up a bit. Just a bit, mind you. Juuuuuuuuust a bit.

And please don't give me trouble over my inability to post more than once a week. Really all I can say is ARRRRGH! Work, upcoming conferences, and raising two very young kids = JEEZ WHEN DO PEOPLE EVER GET TIME TO BLOG? And while we're on that topic, I don't see you blogging enough either, do I? Don't be throwing any stones, Mr. Glass House.

OK, where were we? Ah yes, OOPSLA 2007. What a beautiful conference. Not that I attended or anything, but who needs to attend a conference when you have the Internet?

The dynamic vs. static languages flame war has bogged down the language community over the last decade. It's great to see that, judging by some recent papers from the latest OOPSLA, the logjam has broken with a vengeance. There are all kinds of brain-bending new languages in the works, and it's frankly exhilarating. (Sorry that some of these are only available through the ACM's Digital Library... but at $99 per year, how can you afford NOT to be a member???)

First we have a great paper on RPython, a project which creates a bimodal version of Python. You can run your Python program in a fully interpreted way in a startup metaprogramming phase, and then the RPython compiler kicks in, infers static types for your entire metaprogrammed Python code base, and compiles to the CLR or to the JVM with a modest performance increase of EIGHT HUNDRED TIMES. Yes, folks, they've run benchmarks of Python apps that run 800x faster with their system than with IronPython (which is no slouch of a Python implementation AFAIK). If that isn't a great example of how you can have your dynamic cake and eat it statically, I don't know what is.

Another lovely system is OMeta, a pattern language for describing grammars. You can write everything from a tokenizer up to a full language in a really nice executable grammar structure, with productions that map directly to some underlying base language. They also have a good modularity story worked out, and support for stateful parsing. Their Javascript parser weighs in at just 177 lines -- which is not much at all!

Then there's an equally great paper on JastAdd, an extensible compiler for Java. The JastAdd compiler is built around an extensible abstract syntax tree. The abstract syntax tree is the ONLY data structure in the compiler -- there are no separate symbol tables or binding tables; everything is implemented as extensions to the abstract syntax tree. The extensions are expressed with a declarative language that lets you define dataflow equations relating different elements in the tree -- inherited elements (for referring to names bound in a parent construct, for example), or reference elements (for referring to remote type declarations, for example).

The compiler has an equation analysis engine that can process all these equations until it reaches a fixpoint, which completely avoids all the usual multi-phase scheduling hassles in compilers around interleaving type analysis with type inference, etc. It seems like The Right Thing on a number of levels, and it makes me want to hack around with building a compiler along similar declarative lines. They give examples of extending Java with non-null types and of implementing Java 5 generics purely as a declarative compiler extension. That, to me, pretty much proves their point. Bodacious! I had been thinking that executable grammars were a nice way to go, but seeing their declarative framework's power is seriously making me reconsider that idea. What would you get if you combined OMeta and JastAdd? Something beautiful. I'm not sure how you'd combine the statefulness of OMeta with the declarativeness of JastAdd, but we must ponder it deeply, because the One True AST is a goal worth seeking.
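To give a flavor of the JastAdd style, here's a tiny sketch of the idea: attributes are defined as equations on the AST itself, computed on demand and memoized, instead of being filled into side tables by scheduled compiler phases. The node names and the single "type lookup" attribute here are my own inventions for illustration, not JastAdd's actual API:

```java
import java.util.HashMap;
import java.util.Map;

abstract class Node {
    Node parent;
    // Inherited attribute: ask the enclosing construct for a name's type.
    String lookup(String name) {
        return parent == null ? null : parent.lookup(name);
    }
}

class Block extends Node {
    final Map<String, String> decls = new HashMap<>();  // name -> declared type
    String lookup(String name) {
        String t = decls.get(name);
        return t != null ? t : super.lookup(name);       // delegate upward
    }
}

class Var extends Node {
    final String name;
    private String cachedType;                           // memoized attribute
    Var(String name) { this.name = name; }
    // Synthesized attribute, defined via the inherited lookup equation:
    String type() {
        if (cachedType == null) cachedType = lookup(name);
        return cachedType;
    }
}
```

The point is that `type()` is evaluated only when someone asks for it, pulling on the inherited `lookup` equation up the tree -- no symbol table, no separate name-resolution phase to schedule.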

A truly mind-bending paper discusses breaking the two-level barrier. What's the two-level barrier? Simple: it's the class/object distinction. They point out that many kinds of modeling can't be done with a class hierarchy. What you really want is a programmer-accessible metaclass hierarchy. (And not a weenie one like Smalltalk's, either.) For example, consider an online store. You thought you knew everything about online stores? THINK AGAIN, JACKSON. Let's say you have a DVD for sale, such as Titanic. That Titanic DVD is an instance of the DVD product class. The DVD product class is conceptually a further instance of the DigitalMedia product class. I meant exactly what I said there -- in their framework, one class can be an instance of another class.

You can then state that the DigitalMedia metaclass defines a "categoryName" and a "netPrice", requiring that "categoryName" be defined by instances of DigitalMedia, and that "netPrice" be defined by instances of instances of DigitalMedia. The DVD class then defines "categoryName" to be "DVD", so ALL DVDs have the same category name. And then particular instances of DVD define their actual net prices individually. In this way, you can take the same kinds of "must provide a value for this field" constraints that exist in the class-object hierarchy, and extend them to multiple levels, where grandparent metaclasses can express requirements of their meta-grandchild instances.
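Here's a toy rendering of the multi-level idea in plain Java, where one object acts as the class of its instances and can require a field some fixed number of instantiation levels below itself. All the names here are mine, invented for illustration; the paper's actual framework is considerably richer:

```java
import java.util.HashMap;
import java.util.Map;

class Clabject {
    final String name;
    final Clabject meta;                           // the clabject this is an instance of
    final Map<String, Object> slots = new HashMap<>();
    final Map<String, Integer> required = new HashMap<>(); // field -> levels below

    Clabject(String name, Clabject meta) { this.name = name; this.meta = meta; }

    // Declare that instances N levels below must supply this field.
    void require(String field, int levelsBelow) { required.put(field, levelsBelow); }
    void set(String field, Object value) { slots.put(field, value); }

    // Check that every requirement declared by a meta-ancestor at the
    // matching distance is actually satisfied here.
    boolean wellFormed() {
        int distance = 1;
        for (Clabject m = meta; m != null; m = m.meta, distance++) {
            for (Map.Entry<String, Integer> r : m.required.entrySet())
                if (r.getValue() == distance && !slots.containsKey(r.getKey()))
                    return false;
        }
        return true;
    }
}
```

With this, DigitalMedia can require "categoryName" one level down and "netPrice" two levels down; the DVD clabject satisfies the former, and the Titanic DVD satisfies the latter.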

(They use the abysmal word "clabject" -- ClAss obJECT -- to refer to entities that can be used to instantiate objects (like classes), but that ARE themselves instantiated (like objects). I think "clobject" would have been better, or maybe "obclass" or something. "Clabject" just sounds... I don't know... disturbing. Like some kind of hapless crab that's filled with techno-malice. But the concept itself is very interesting. I think that having two orthogonal hierarchies -- the metaclass hierarchy and the subclass hierarchy -- is potentially too confusing for most programmers, including me, but it's nonetheless really thought-provoking.)

Those are just four of the highlights -- I'm only about a third of the way through reading the OOPSLA papers this year -- but I think those are the top four when it comes to language design. It's going to be a great next decade, as the whole static vs. dynamic war gives way to a myriad of bizarre hybrids and mutants, greatly enhancing (and confounding) the lives of hackers everywhere!

Monday, October 22, 2007

Conscientious Software, part 3

(Sigh, keeping on a schedule is pretty darn tough. Sorry for the week-long hiatus. Onwards! Finally finishing off this particular sub-series of posts.)

Conscientious Software, Part 3: Sick Software

Final part in a 3-part series. Part 1, Part 2.

An interesting thing happened when garbage collection came into wider use. Back when I was doing mostly C++ programming, a memory error often killed the program dead with a seg fault. You'd deallocate something and then try to use a dangling pointer, and whammo, your program core dumps. Fast hard death. Even once you fixed the dangling pointer bugs, if your program had a memory leak, you'd use more and more memory and then run out -- and whammo, your program would crash, game over, man!

Once garbage collection came along, I remember being delighted. No more running out of heap! No more memory leaks! Whoa there -- not so fast. You could still leak memory if your program had a cache that was accumulating stale data. But now, instead of killing your program immediately, the garbage collector would just start running more and more often. Your program started running slower and slower. It was still working... kind of. But it wasn't healthy.
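The classic culprit looks something like the sketch below: an unbounded cache that pins every entry forever, versus one simple fix, a bounded LRU cache. (The names and sizes here are made up for illustration; `removeEldestEntry` is a standard `LinkedHashMap` hook.)

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

class Caches {
    // The "sick" pattern: an unbounded cache that never lets anything go.
    // The GC keeps running, but nothing is ever actually garbage.
    static final Map<Integer, byte[]> leaky = new HashMap<>();

    // One fix: a bounded LRU cache that evicts the eldest entry, so the
    // heap (and the collector) stay healthy.
    static final int MAX = 100;
    static final Map<Integer, byte[]> bounded =
        new LinkedHashMap<Integer, byte[]>(16, 0.75f, true) {
            protected boolean removeEldestEntry(Map.Entry<Integer, byte[]> eldest) {
                return size() > MAX;
            }
        };

    static void handleRequest(int key) {
        byte[] payload = new byte[1024];
        leaky.put(key, payload);     // grows without limit
        bounded.put(key, payload);   // capped at MAX entries
    }

    public static void main(String[] args) {
        for (int i = 0; i < 1000; i++) handleRequest(i);
        System.out.println(leaky.size());    // prints 1000
        System.out.println(bounded.size());  // prints 100
    }
}
```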

It was sick.

Sick software is software that is functioning poorly because of essentially autopoietic defects. It's fail-slow software, instead of fail-fast software. A computer with a faulty virus checker, infected with trojans and spyware that are consuming its resources, is sick. A memory-leaking Java program is sick. These systems still function, but poorly. And the fix can be harder to identify, because the problems, whatever they are, are more difficult (or impossible) to trace back to your own (allopoietic) code. Fail-fast environments make it clear when your code is at fault. Fail-slow environments, self-sustaining environments, are trying to work around problems but failing to do so.

Now, there is every likelihood that as systems scale up further, we will have to deal more and more with sick software. Perhaps it's an inevitability. I hear that Google, for instance, is having to cope with unbelievable amounts of link spam -- bogus sites that link to each other, often running on botnets or other synthesized / stolen resources. In some sense this is analogous to a memory leak at Internet scale -- huge amounts of useless information that somehow has to be detected as such. Except this is worse, because it's a viral attack on Google's resources, not a fault in Google's code itself. (I realize I'm conflating issues here, but like I said, this posting series is about provocative analogies, not conceptual rigor.)

I think there is a real and fundamental tension here. We want our systems to be more adaptive, more reactive, more self-sustaining. But we also want guarantees that our systems will reliably perform. Those two desires are, at the extremes, fundamentally in opposition. Speaking for myself as an engineer, and even just as a human being, I find that certainty is comforting... humans fundamentally want reassurance that Things Won't Break. Because when things stop working, especially really large systems, it causes Big Problems.

So we want our systems to be self-sustaining, but not at the cost of predictability and reliability. And as we scale up, it becomes more and more difficult to have both. Again, consider Google. Google has GFS and BigTable, which is great. But once it has those systems, it then needs more systems on top of them to track all the projects and the storage that exist in those systems. It needs garbage collection within its storage pool. The system needs more self-knowledge in order to effectively manage (maybe even self-manage) its autopoietic resources. And in the face of link spam, the system needs more ability to detect and manage maliciously injected garbage.

Returning to the original paper, they spend a fair amount of time discussing the desire for software to be aware of and compliant with its environment. They give many potential examples of applications reconfiguring their interfaces, their plugins, their overall functioning, to work more compatibly with the user's pre-existing preferences and other software. While extremely thought-provoking, I also find myself somewhat boggled by the level of coupling they seem to propose. Exactly how does their proposed computing environment describe and define the application customizations that they want to share? How are the boundaries set between the environment and the applications within the environment?

In fact, that's a fundamental question in an autopoietic system. Where are the boundaries? Here's another autopoietic/allopoietic tension. In an allopoietic system, we immediately think in terms of boundaries, interfaces, modules, layers. We handle problems by decomposition, breaking them down into smaller sub-problems. But in an autopoietic system, the system itself requires self-management knowledge, knowledge of its own structure. Effective intervention in its own operation may require a higher layer to operate on a lower layer. This severely threatens modularity, and introduces feedback loops which can either stabilize the system (if used well) or destabilize the system (if used poorly). This is one of the main reactions I have when reading the original paper -- how do you effectively control a system composed of cross-module feedback loops? How do you define proper functioning for such a system, and how do you manage its state space to promote (and if possible, even guarantee) homeostasis? This circles back to my last post, in that what we may ultimately want is a provable logic (or at least a provable mathematics) of engineered autopoietic systems.

There are some means to achieving these goals. You can characterize the valid operating envelope of a large system, and provide mechanisms for rate throttling, load shedding, resource repurposing, and other autopoietic functions that enable the system to preserve its valid operating regime. It's likely that autopoietic systems will initially be defined in terms of the allopoietic service they provide to the outside world, and that their safe operating limits will be characterized in terms of service-level agreements that can be defined clearly and monitored tractably. It's like a thermometer that, when the system starts getting overworked, tells the system that it's feverish and needs to take some time off.
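The thermometer idea can be sketched in a few lines: a service that watches its own load and sheds requests once it leaves its valid operating envelope. This is my own minimal illustration, not code from any particular system, and real load shedders are of course concurrent and far more nuanced:

```java
// A toy load-shedding service: it tracks its own in-flight request count
// and refuses new work once it exceeds its declared operating envelope.
class ShedService {
    private final int maxInFlight;   // the valid operating envelope
    private int inFlight = 0;
    int accepted = 0, shed = 0;

    ShedService(int maxInFlight) { this.maxInFlight = maxInFlight; }

    // Returns true if the request is admitted, false if shed.
    boolean tryBegin() {
        if (inFlight >= maxInFlight) {
            shed++;                  // feverish: refuse, rather than degrade
            return false;
        }
        inFlight++;
        accepted++;
        return true;
    }

    void end() { inFlight--; }       // request finished, capacity freed
}
```

The key design point is that the refusal is explicit and early: the system degrades predictably at its declared limit instead of slowly sickening under overload.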

So the destiny of large systems is clear: we need to be more deliberate about thinking of large systems as self-monitoring and, to some extent, self-aware. We need to characterize large systems in terms of their expected operating parameters, and we need to clearly define the means by which these systems cope with situations outside those parameters. We need to use strong engineering techniques to ensure as much reliability as possible wherever it is possible, and we need to use robust distributed programming practices to build the whole system in an innately decomposed and internally robust way. Over time, we will need to apply multi-level control theory to these systems, to enhance their self-control capabilities. And ultimately, we will need to further enhance our ability to describe the layered structure of these systems, to allow self-monitoring at multiple levels: the basic issues of hardware and failure management, the intermediate issues of application upgrade and service-level monitoring, and the higher-level application issues of spam control, customer provisioning, and creation of new applications along with their self-monitoring capabilities.

We've made much progress, but there's a lot more to do, and the benefits will be immense. I hope I've shown how the concept of conscientious software is both valid and very grounded in many areas of current practice and current research. I greatly look forward to our future progress!

Monday, October 15, 2007

The Best of Both Worlds

Along the lines of my post about avoiding religious wars over computer languages, I recently ran across two papers that got me seriously thinking about this whole "how to have the best of all worlds" problem.

Specifically, there was an excellent follow-on to the last OOPSLA conference, called the Library-Centric Software Design '06 conference. Josh Bloch (of Effective Java fame) was one of the co-chairs. The papers were all focused on the general problem space of "active libraries" or "metaprogramming" or "extensible languages" -- generally, the concept that your programming language can be extended by libraries that your programs use. Generally, this means that the typechecking of the language, or perhaps its very grammar, can be enhanced in an extensible way.

One of the more thought-provoking examples is a paper discussing Extending Type Systems in a Library, specifically about adding new "field types" to C++. One use for this is defining new type modifiers such as "notNull", or "positive", or "tainted". These are all in some sense type template modifiers, in that for any pointer type T you could define a modified type notNull&lt;T&gt; with type assertions preventing you from having null instances of that type. Or, for "tainted", you could have just about any type T and create a variant type tainted&lt;T&gt; indicating that the value of the base type originated from a suspect source.
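Here's roughly what such a modifier looks like, sketched in Java (the paper does this in C++ with templates). Everything here -- the `Tainted` wrapper, `readFromInternet`, `writeToDatabase` -- is an invented illustration of the pattern, not the paper's actual code:

```java
// A wrapper type that marks a value as tainted. The only way to extract
// the raw value is an explicit, checked sanitize step -- so the compiler
// forces every source-to-sink path to go through validation.
final class Tainted<T> {
    private final T value;
    Tainted(T value) { this.value = value; }

    // The single escape hatch: validate, then unwrap.
    T sanitize(java.util.function.Predicate<T> check) {
        if (!check.test(value))
            throw new IllegalArgumentException("tainted value rejected");
        return value;
    }
}

class Demo {
    // A "source": values from the network come back wrapped.
    static Tainted<String> readFromInternet() { return new Tainted<>("42"); }

    // A "sink": accepts only an unwrapped (already-sanitized) value.
    static void writeToDatabase(String untainted) { /* safe to use */ }
}
```

Note the cost this imposes, which comes up again below: every method that merely passes the value along must now mention `Tainted<String>` in its signature, whether or not it cares about taintedness.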

The "tainted" use case tweaked my memory. The other recent work I've seen on taint analysis was the paper discussing how to use the BDDBDDB system to track tainted values through a Java program (previously discussed here). In that case, the system used sophisticated alias analysis to track the movement of values from tainted sources to vulnerable sinks (e.g. from dangerous methods receiving input from the Internet, to vulnerable methods issuing output to a database or filesystem or even another server connection).

Consider how similar these are. In the first case, you're extending the language's type system to distinguish tainted from untainted values. In the latter case, you're analyzing a source base which has no such type distinctions, and statically proving whether or not the value flow does in fact permit tainted values to be used in a context that requires untainted values.

Yet also notice how different these are. In the extended-C++ case, actually using the tainted/untainted types would require you -- in my reading of the paper -- to essentially modify almost every method in your source code. All methods would need to declare their values as being either tainted or untainted, even methods which don't actually ever use the possibly-tainted values in a vulnerable way. In other words, the property of taintedness is only relevant at one place in the program -- the methods whose inputs are required to be untainted. But declaring the property as an explicit type annotation throughout the program forces your code to expose this distinction everywhere. Suddenly your language extension forces you into a truly massive refactoring!

In contrast, the BDDBDDB-based analysis essentially overlays the tainted property on the program's alias graph. The analysis may well be no less sound... but how much more convenient for the programmer! Taintedness becomes a property that is derivable only when the programmer wishes to be aware of it. The rest of the time, the programmer can avoid worrying about it. In some sense, the security analysis becomes a "pay-as-you-go" property of the system, as opposed to a "pay-up-front" property in the fully statically typed technique.

Now, consider this a bit more broadly. Isn't this exactly what dynamic language advocates hate about statically typed languages? Instead of worrying about tainted vs. untainted, let's consider basic data typing: string vs. number. If you read in a string, and then you want to treat it as a number, isn't that essentially exactly like a taint analysis? The only time you care whether the string is in fact not a number is if you actually try to use it as one. If you don't ever try to use it as one, then why worry? Conversely, if you do try to use it as one, wouldn't it be nice to have the system be able to trace its provenance back to where you originally read it, so you could decide if you wanted to insert a validity check there... without forcing you to declare its type at every intermediate program point where it flows?

It seems to me that extensible languages will have to shift a lot of their safety checking and property checking away from the static type system that's exposed directly to the programmer. Consider all the possible properties you might want to know about the values in your program: are they string? Numeric? Complex? Are they null? Never-null? Are they tainted? Untainted? Are they structurally compliant XML? Are they valid values for your object-relational database? Are their dimensions correct for the physics math you want to do with them? There is a literally infinite variety of possible types that you might want to track through your program. And if you have to declare all of these types on every variable and method you declare, you pay a "static declaration tax" equal to the number of extended type properties times the number of declarations in your program. Whereas if the system is able to derive, or trace, or deduce the flow of these properties, and if the programming environment gives you enough insight into how that derivation works and what annotations you can add (very selectively) at only the points where the distinctions are relevant, you offload that tax onto the programming environment.

Imagine an IDE for some Ruby++ language. You can write your Ruby code in the normal way. But you can also annotate various methods with fully optional types. Methods that read from an HTTP connection mark their values as tainted. Methods that write to the database require untainted values. Arithmetic methods mark their inputs as numeric. And even better, there are ways (handwaving furiously) to define new kinds of traits and types; so you can implement complex numbers in the language, and use them without needing to explicitly declare their type throughout the program, and the system can track their value flow and create optimized code for their use. The end result: an extensible language with far more ease of use than extended-Java or extended-C++, and far more static safety than basic Ruby.

Maybe it is possible to have the best of all worlds. And in fact, if we want to truly achieve the dream of extensible statically-analyzed languages with usable syntax, maybe it's not only possible but necessary.

Up with pay-as-you-go! Down with the static declaration tax! Up with extensible languages! Down with pure dynamism!

Maybe I'm a bit more religious than I thought.

What would such a language look like? Stay tuned!

...After writing the above, I surfed around for pluggable types -- which was Gilad's term for the concept -- and found several people who have suggested basically this: that static typing could be considered optional, and that extensible type systems might well require that viewpoint. There has been some real progress in the direction of viable partial type systems, and Javascript 4 is looking at adding a strict mode.

Anyway, I'm leaving the body of the post above as I originally wrote it, because the slightly uninformed enthusiasm gave it some good energy :-)

Wednesday, October 10, 2007

BDDBDDB - Pointer alias analysis comes of age

One of the most exciting program analysis techniques I've ever seen (I love having my own geek blog, where I can say things like that with a straight face) is the BDDBDDB system, from Whaley and Lam at Stanford. (I will capitalize BDDBDDB here, though they don't.)

Pointer alias analysis tracks the movement of object references through a large piece of code, across method calls and through heap references. I've been reading programming research papers for almost two decades now, and for much of that time, pointer alias analysis was a chimera. It just wasn't possible to make it scale. The number of potential paths in even a moderately large piece of code is dizzying -- numbers like 10 to the 14th power aren't uncommon. And for most of the last two decades, this has been an uncracked nut.

Well, Whaley and Lam seem to have finally cracked it, by building on other research done using binary decision diagrams (BDDs) to compactly encode the movement of values through variables in a program, in a way that maximizes the amount of shared structure in the variable flow graphs. BDDs were originally created for hardware logic analysis, to track the movement of signals through a huge circuit. So it's elegant that they can also apply to software analysis, tracking the movement of values through a large code base.

The further elegance in Whaley and Lam's technique is the query language laid on top of the BDD-based analysis. Binary decision diagrams can be used to encode relations; they can be joined and queried much like relational tables. And there is a query language, Datalog, that is well suited for expressing queries over a BDD-based relational model. Datalog is a relational query language, similar to a more abstract version of SQL, except it supports recursive queries (which SQL does not). Whaley and Lam are able to encode a Datalog-based query and relationship-generation engine on top of their BDD aliasing structures. Hence, BDDBDDB -- Binary Decision Diagram-Based Deductive DataBase.

One of the more amazing results in their paper is their discussion of how their analyses improved once they had the Datalog layer. Originally they wrote a context-specific alias analysis on top of a BDD engine without any intervening abstract query language. This took them a few months, the code was thousands of lines long, the performance was not good, and there were a number of subtle bugs. Once they wrote the Datalog engine and reformulated their analysis as Datalog formulas, the total size of the analysis dropped to around a dozen lines (!) of Datalog, and the performance was one to two orders of magnitude faster.

Once you have a query engine that can efficiently query an entire software system to track value and type flow throughout, in a reasonably accurate way, there are many systems you can build on top of it. For instance, you can write an Eclipse plugin to track the exact lines of code that expose a web application to SQL injection attacks. This can be done by notating particular methods as sources of input data from the Internet, and other methods as destinations ("sinks") of query text for databases, and then evaluating whether any values flow from source methods to sink methods. Any paths through the code that allow unfiltered input to travel from the Internet to your database are security holes. The BDDBDDB analysis can tell you exactly what code paths allow that value flow. And it can do it purely as a static analysis!
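A toy version of that source-to-sink query makes the idea tangible. BDDBDDB evaluates rules like `tainted(X) :- assigns(X, Y), tainted(Y).` over BDD-encoded relations covering a whole program; the sketch below (my own illustration, nothing from their code) just iterates the same rule to a fixpoint over small in-memory sets:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class TaintQuery {
    // The "assigns(to, from)" relation over program variables.
    final Map<String, Set<String>> assigns = new HashMap<>();

    void assign(String to, String from) {
        assigns.computeIfAbsent(to, k -> new HashSet<>()).add(from);
    }

    // Datalog-style rules, evaluated naively to a fixpoint:
    //   tainted(X) :- source(X).
    //   tainted(X) :- assigns(X, Y), tainted(Y).
    Set<String> tainted(Set<String> sources) {
        Set<String> result = new HashSet<>(sources);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Map.Entry<String, Set<String>> e : assigns.entrySet())
                for (String from : e.getValue())
                    if (result.contains(from) && result.add(e.getKey()))
                        changed = true;
        }
        return result;
    }
}
```

The security question then reduces to a set intersection: does the tainted set overlap the sink variables? The magic of BDDBDDB is doing exactly this kind of query at the scale of real programs with 10^14 paths.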

They use Java as the basis of their system, for unstated but obvious reasons. (Namely, strong static typing makes their value-flow analysis much more efficient, since it can eliminate many value flows that the type system forbids; and the absence of pointer arithmetic means the analysis can be trusted over the whole language, with no need to restrict it to some "safe" pointer-free subset.) Many people have been wooed away from Java by the flexibility of dynamic language programming, where you needn't encumber yourself with types; but those of us who remain in the static typing camp are pretty much hooked on the tools. And BDDBDDB is the harbinger of a whole new level of Java-based analysis tools that are going to take Java programming and safety checking to the next level.

One thing not mentioned in any of the BDDBDDB papers I've yet seen is reflection. I can't see how reflection can be handled in the BDDBDDB system, since tracking reflective method invocations requires not just tracking object references through variable sites, but actually tracking method values (and even specific method name strings!) through reflective invocations. Certainly their system as described doesn't address reflection in any way. This also seems to me to limit the applicability of BDDBDDB to languages like Ruby, which make heavy use of code generation (in Rails, for instance). Though I'm sure that some enterprising Rubyists will undertake to prove me wrong.

In general it's very challenging to consider the interactions of static analysis with metaprogramming frameworks. Runtime bytecode generation is becoming more common in Java systems, but BDDBDDB, as described, is a purely static, ahead-of-time analysis, with no access to bytecode generated or modified at runtime. Metaprogramming can blur the line between compile-time and runtime, and BDDBDDB is solidly on the compile-time side... at least for the moment. I say this not to denigrate the importance of this technology, but rather to indicate an even further level to which it could go.

I look forward to what's next for these analyses. Good times in the static typing world!

Religious Wars Considered Harmful

The blogging community is rife with flamewars about Java versus Ruby, static typing versus dynamic typing, Apple versus Microsoft, and on and on.

I am going to mostly avoid those debates here. Some days I feel that all these flame wars are more about personal taste than anything -- and taste, being subjective, is definitely not worth huge flamewars. Other days I feel that all the flaming is simply missing an opportunity to ponder how to make things better overall.

I would like to have a language that combined the succinctness of Ruby with the static typing and IDE-assistance of Java. I would like to have a system that combined all the best qualities of OS X and Windows (for my personal values of "best"). I think competition is great, since it leads to opportunities for learning and improvement. And besides, the less dogmatic you are, the more able you are to appreciate all the many kinds of appeal in these different approaches... Ruby's "it tells a story" syntax (well, aside from the @fields) is truly a thing of beauty, and yet who couldn't be taken with the power of a C++ image library with near 100% code efficiency?

Too much flaming can wind up backing you into a corner, where you reject everything about the other side. And that way lies calcification and stagnation. Flamewars are the cholesterol of the technology world... they clog you up and can lead to heart attacks. (Although a little flaming can add spice -- even cholesterol can be tasty in moderation! But some blogs serve up so much of it that it'd be superfluous here.)

So if you want incendiary language, there are plenty of other places to go, some of which are on my blogroll. But here I'm going to be all boring by comparison. Or... not... depending on your perspective!

And to try to move the conversation on a bit, stay tuned -- my next few posts are on that very theme.

Cough, Cough, Cough

Very sorry for the hiatus, everyone (all twenty-some of you, according to Feedburner -- it was forty-plus until Oct 1, then Feedburner's stats rolled over and HALF OF YOU DISAPPEARED!!!). I know I just blew my "every two weeks" rule. And my hoary old excuse -- "I've been sick" -- has only one redeeming quality: boy, is it ever true. Our two-year-old brought home a cold that went straight to my throat and destroyed my vocal cords on the way to my lungs, where it transformed into a baker of the worst cookies ever. EVER. Combine that with some epic work-related stress immediately prior, and you have the perfect off-the-air recipe.

I'm not 100% yet, but I've gotten well enough to be able to think about things other than child care and how sick I am. So, my apologies. I'm about to go into high gear to make up for it -- I've actually got about a dozen posts saved up and they're going to start hitting every few days for a couple of weeks. Stay tuned, there's a veritable manifesto on the way!

Ultimately I'm shooting for more of a weekly rhythm (if not more often), since biweekly is just too seldom.

Anyway, thanks for your patience, and I'll try to stay healthy for a good long while. (Though with a baby AND a toddler in the house, and winter coming, good health is a guaranteed impermanent condition....)

Thursday, September 20, 2007

Messages and Transactions, Part 1: Erlang and its Discontents

With the rise of multi-core chips, everyone's talking about how best to do concurrency. I think the two most interesting emerging paradigms are software transactional memory and message-based programming. This two-post series looks at each in turn.

(Cedric Beust's recent post about Erlang inspired me to accelerate this post a bit. Soon: the final Conscientious Software post.)

Message-oriented programming environments are structured around the dictum that all state is decomposed into modular elements that have no direct interaction. All state is local in a message-oriented system. Each stateful entity -- processes in Erlang, or vats in E, for example -- interacts with other stateful entities solely through asynchronous messaging. Entities can send messages to other entities, and receive messages from other entities. Message receipt happens one at a time; each entity is essentially a single-threaded message loop. Therefore, concurrency conflicts as such cannot happen, because no two messages can ever be operating on the same state concurrently.
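The shape of such an entity can be sketched in a few lines of Python (this is just the shape of the idea, not Erlang or E semantics): local state, a mailbox, and a loop that handles one message at a time:

```python
# Minimal sketch of a message-oriented entity: all state is local, the only
# way in is an asynchronous message, and a single loop processes messages
# one at a time, so no two messages ever touch the state concurrently.
import queue
import threading

class Entity:
    def __init__(self):
        self.state = {}                      # strictly local state
        self.mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._loop)
        self._thread.start()

    def send(self, msg):                     # asynchronous: returns immediately
        self.mailbox.put(msg)

    def stop(self):                          # drain remaining messages, then halt
        self.mailbox.put(None)
        self._thread.join()

    def _loop(self):
        while True:
            msg = self.mailbox.get()         # one message at a time
            if msg is None:
                return
            key, value = msg
            self.state[key] = value          # no locks: only this loop touches state
```

No lock appears anywhere, because none is needed: the mailbox serializes everything, which is exactly the "essentially a single-threaded message loop" property described above.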

There is much beauty to this model. Erlang heavily leverages the fault isolation properties of such a system; Joe Armstrong's Erlang thesis (which seems to be offline at the moment?) describes basic patterns of constructing structures of cooperating processes with very high reliability. The basic idea seems to be recursive restartability. A process which receives an invalid message or whose state becomes corrupted is expected to immediately throw a fatal exception and cease to exist. Other supervisor processes monitor their children and detect these failures, restarting the children as necessary.
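Here's a hedged, threadless Python caricature of that supervision pattern: the worker fails fast on a bad message, and the supervisor replaces it with a fresh instance. Real Erlang supervisors are far richer (restart strategies, escalation), but the core loop is this small:

```python
# Caricature of "let it crash" supervision: a child that receives an invalid
# message raises immediately instead of limping along with corrupt state,
# and its supervisor restarts it fresh. All names here are invented.

class Worker:
    def __init__(self):
        self.count = 0                  # the child's (restartable) state

    def handle(self, msg):
        if not isinstance(msg, int):
            # invalid message: die at once rather than continue corrupted
            raise ValueError("bad message; worker dies")
        self.count += msg

class Supervisor:
    def __init__(self):
        self.worker = Worker()
        self.restarts = 0

    def deliver(self, msg):
        try:
            self.worker.handle(msg)
        except ValueError:
            self.worker = Worker()      # restart the child with clean state
            self.restarts += 1
```

The interesting design consequence is that correctness of the child's error handling shrinks to almost nothing: it never tries to repair itself, it just dies, and recovery is the supervisor's one job.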

In E, the same ideas are used to push forward the frontier of interaction between mutually suspicious code. The core concept is that on the open internet, particularly as software becomes more sophisticated in its interactions with other software, systems need to interact via messages while remaining able to protect themselves against misbehavior by their peers. Message-oriented programming again serves to isolate, but here the isolation is not only from unintentional bugs or external failures but also from malicious behavior. Really this deserves a whole separate post, so I'll leave it at that for now.

The great thing about systems like this, in theory, is that they scale very gracefully. Add more entities (processes / vats), and you get scale. Their inherent reliance on distributed messaging means that they work better with more processing power, no matter whether that's multi-core or multi-machine. You can also upgrade a running system with new code, while maintaining full operation. Erlang is getting a lot of interest right now for this reason.

The thing that frustrates me about the current Erlang wave, however, is that most of the core Erlang papers -- such as Joe Armstrong's thesis -- do relatively little to describe how one actually takes a real-world problem and expresses it as a message-passing system. For example, the thesis describes a very high-reliability ATM switch. It gives maybe a half-page sketch of some of the core modules of the switch, and it briefly describes which of the generic Erlang patterns are used in the switch's code (which, mind you, not how). But when it comes to the details -- how requests are handled and handed off, how overload is dealt with, how failures are recovered at different levels of the system, how new features are implemented on top of existing running code -- there's basically nothing.

I really want to see some much more detailed case studies of actual Erlang systems that implement high-reliability solutions, walking in detail through the whole code and process architecture, and discussing how upgrades and extensions are implemented on top of the base product. Without that level of explanation, it's not at all obvious how an Erlang system is actually built. One small example: consider a bank account transfer, implemented as a single transaction between two account objects. How would that be built in an Erlang system? Would each bank account be a separate process? If so, how would you achieve transactional semantics for the transfer, which involves both accounts? Would you need some kind of transaction manager pattern? Are there higher-level patterns in Erlang that address these issues? Most of the Armstrong Erlang papers focus on failure, fault recovery, and supervision, but not on how distributed data structure updates are performed.

The most detailed discussion about an implemented Erlang system that I've been able to find is the overview of the Mnesia database. But even here, the focus is primarily on the ways in which the database resembles or leverages other database technologies, and not so much on the ways it leverages Erlang for fault tolerance or online upgrade. The very things that seem to make Erlang unique are the hardest things to get detailed descriptions of! Frustrating!

Cedric's post also makes a good point that Erlang seems lacking in the areas of error reporting and code structuring. Message-oriented languages in general have an uneasy relationship with traditional object-oriented reuse structures -- message-oriented languages in some sense are inherently decoupled, tending towards structural typing (in which messages are described primarily by their contents); whereas traditional class-based languages are oriented towards nominal typing (in which the inheritance lineage of a class determines its compatibility with other classes). There are some hybrid approaches to bridging the two, for example the (somewhat defunct) HydroJ messaging framework for message-oriented programming in Java, but the two paradigms are definitely tricky to couple.

Software transactions and message-oriented systems seem to fall on two ends of a language spectrum. In the transactional case, it's easy to see how to compose previously-sequential code and achieve some amount of concurrent scaling. In the messaging case, it's not so easy to know how to take a distributed problem and decompose it into messages. I look forward to much more explanation of the design process for message-oriented systems, since only thus will they be competitive with the software transactional model, which is more immediately intuitive to sequential programmers (especially those who already have database experience).

Coming soon: the software transaction end of the spectrum. (And then back to finish off conscientious software!)

Edit after original post: Looks like Joe already answered my transaction question. Solution: bundle up your updates into a single message that you send to the database server. Dang, that's a foreign concept to me, being a Java weenie used to the Hibernate wrap-your-serial-code-in-a-thread-local-transaction pattern. Erlang definitely demands you structure your code to its patterns from the ground up. A bad thing? Not necessarily... but what about composable concurrency? Better get my next post done soon!
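Here's a toy Python rendering of that answer as I understand it: one process owns both balances, and the whole transfer arrives as a single message, so the update is atomic by construction (one message, handled by one single-threaded loop). The message shape and names are my invention, not actual Erlang/Mnesia API:

```python
# Sketch of "bundle the updates into a single message": a lone account
# server owns all the balances, so a transfer is atomic simply because it
# is one message handled by one sequential loop. Names are hypothetical.

class AccountServer:
    def __init__(self, balances):
        self.balances = dict(balances)   # all account state lives here, locally

    def handle(self, msg):
        # msg = ("transfer", from_acct, to_acct, amount): one atomic unit of work
        op, src, dst, amount = msg
        assert op == "transfer"
        if self.balances[src] < amount:
            return "insufficient-funds"  # whole transfer rejected; nothing moved
        self.balances[src] -= amount
        self.balances[dst] += amount
        return "ok"
```

Note what this gives up: the two accounts can no longer live in separate processes on separate machines, which is exactly the restructure-your-design-around-the-patterns demand discussed above.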

Monday, September 17, 2007

Conscientious Software, Part 2: Provable Robustness

Second post in a series. Part 1.

Onwards! Now, what's really interesting to me is the further implications of this autopoietic / allopoietic dichotomy throughout the software stack. The Conscientious Software paper talks about potential means of implementing autopoietic software in terms of more vaguely defined membranes and other sorts of perception-oriented interfaces between software components, along the lines of Jaron Lanier's work.

To me, that is a lot harder to think about than the Google example, in which a lot of allopoietic engineering (very clear definition of subsystems, logical contracts, operational proofs, and other extremely "rigid" software engineering techniques) was used to create a system that, to a large extent, is reliably self-sustaining.

In other words, it seems to me that we want our autopoietic systems to be as reliable as we can make them. And the best techniques we have for designing something to be reliable are essentially allopoietic techniques.

Let's look at this another way. A lot of programming language research right now is devoted towards integrating proof systems into various languages. In some sense, every strongly typed language has a proof system built into it. A type system is a proof system (see Wadler's paper on Proofs As Programs). The power of these proof systems and type systems is steadily increasing -- a lot of work lately is going into logics for proving file system safety, data structure heap correctness, distributed messaging consistency, and on and on.

We are continually developing conceptual techniques to increase the accuracy and correctness with which we can describe the desired properties of a system, and prove that the system in fact has those properties as implemented.

It is a fascinating question, to me, whether all those tools can be turned to the purpose of developing autopoietic systems. Let's take one example: the above-linked work on logically proving that a file system is reliable.

The intention of this work is to model the state of the computer's memory relative to the known state of the computer's disk, in terms of what the memory believes to be true versus what the disk's contents declare to be true. Logical statements relate changes in the memory's beliefs to changes in the disk's beliefs. Overall, there is an error if at any point one can prove that the system's disk beliefs are inconsistent with its memory beliefs.

In a sense, this proof work is intended to make the file system robust in the face of any and all environmental failures. The robustness comes from construction, not necessarily from recovery. This is building reliability into the system at the very bottom.

It is not too difficult to postulate that such logics could be further developed, for example, to characterize the consistency properties of the Google File System, or of other large-scale distributed software infrastructures. (Boy, what a handwaving sentence! Yes, it's not too difficult to postulate that. Actually doing it, on the other hand, is likely to be extremely difficult indeed. But over time even hard problems can get solved....)

Another direction in which to deliver increased reliability is internal self-protection. Almost all major organisms on Earth are composed of cellular structures. A cell is defined by its boundary -- without a boundary, a cell does not exist. Some programming languages have similarly oriented themselves around the concept of bounded, encapsulated units which are composed in large numbers to create a greater whole. Possibly the most industrially tested and publicly available example of such a language is Erlang, which composes large and immensely reliable systems out of many small software components that communicate only by asynchronous message passing. There is no global state in an Erlang program, and no shared state among independent processes. This makes Erlang innately fault-tolerant, insofar as the failure of any one component cannot immediately cause other components to fail. It also allows the system to be upgraded while continuing to run. A really successful autopoietic system eventually reaches a point where it needs to provide continuous availability, with no externally visible downtime at all, ever. Systems decomposed into pure message-passing subcomponents inherently support this, since individual components can be replaced cleanly while the rest of the system continues normal operation.

Of course, autopoietic layers need to be built out of ordinary Erlang processes (a common Erlang idiom is to have a number of worker processes overseen by a smaller pool of supervisor processes -- this is exactly an autopoietic software layer), but the innate properties of the language contribute to the enforced internal modularity, which leads not only to improved stability but also to ease of distribution. (Autopoietic systems will necessarily also be distributed and parallel systems, since any individual hardware unit may fail, and if the overall system is to survive it must have redundant parallel hardware.)

Large-scale autopoietic systems currently, and in the near future, will combine these properties. They will be systems that provide a general-purpose computing environment, with well-defined and consistent contracts, which are engineered with as much built-in and provable correctness as we can give them, which are modularized into independent and well-protected components, and which provide sufficiently reliable guarantees to the allopoietic programs running in those environments.

What is especially interesting is that this view flies in the face of the original paper's claim that what's needed is more vagueness, more inconsistency if you will, more tolerance between components for partial failures or partial communication. I'm not convinced by this part of their argument, at all. I think that we currently lack the ability to engineer in that style. Perhaps ultimately we'll make greater use of artificial evolution for engineering design, and I would expect such design tools to be far better than we are at leveraging interfaces that are ambiguous or otherwise not rigidly defined.

Next: what happens if software can get sick?

Thursday, September 6, 2007

Conscientious Software, Part 1: Allopoiesis and Autopoiesis

Recently some Sun researchers wrote a very interesting paper about the future evolution of software. The paper seeks to characterize a new type of software, which they term "conscientious". I want to dig into this concept a bit.

This will be a multi-post series; this first post lays the groundwork, then the subsequent posts will delve into some further implications of the basic concepts.

This post will make much more sense if you read their paper first; otherwise, hang on :-) Their paper is very much a manifesto, in the sense that it aggressively commingles concepts from biology, engineering, and computer science to propose a very ambitious and not-fully-understood thesis. So, this post is going to do that too, and conceptual rigor be damned. It's brainstorming time.

Conscientious software might be described (paraphrasing their paper heavily) as software which has a currently unknown degree of self-awareness, in the sense that it has the ability to test itself, to analyze its own functioning, to address external or internal malfunctions appropriately, and to maintain its operation under a variety of adverse conditions. And, of course, to actually do some job that we consider valuable.

There are two kinds of systems they talk about extensively. These are allopoietic systems and autopoietic systems. The technical definitions here come from biology, but I'll paraphrase heavily and just say that an autopoietic system creates itself, sustains itself, and produces itself, whereas an allopoietic system is externally created and produces something other than itself.

An example they give is the distinction between a living amoeba and a watch. An amoeba consists of a large number of cellular structures and chemical processes. A living amoeba is a dynamic system -- its continued existence depends on its active metabolism, by which it takes in nutrients, processes them, excretes material toxic to it, seeks out hospitable environments, and avoids dangerous ones. All the metabolic cycles and other processes that constitute the amoeba are primarily oriented towards self-sustainment.

A watch, on the other hand, has almost no ability to sustain itself. Its sole purpose is to tell time, and that purpose is only useful to the users of the watch, not to the watch itself. If the watch malfunctions, it must be repaired externally.

What does this have to do with software? They claim -- and I think I agree -- that software needs to develop in a more autopoietic direction. Fragile software, with frequent malfunctions, tricky configuration, and inadequate self-monitoring and self-repair capabilities, is increasingly hard to scale to larger and more complex functions. Yet we clearly want software to be capable of more than it currently is. Autopoiesis is a very thought-provoking concept when considering how to design such self-sustaining systems.

They make another point, which is that a purely autopoietic system -- one which is only concerned with self-sustainment -- is of little value to us. We want a system which can both sustain itself and produce something valuable. Their paper draws a picture of an autopoietic system containing an allopoietic core -- the self-sustaining system devotes a large part of its resources to solving specific, externally defined problems.

This is all terribly abstract. What would such a system look like? Are we talking about some kind of muzzy fuzzy genetically engineered software? What are some examples of autopoietic software today?

They start with garbage collection as an example of an autopoietic software function. In some sense, memory is the environment in which software exists. As software runs, it populates that memory. Without lots of careful management, the memory can easily become cluttered with no-longer-used but not-yet-deallocated objects. It's no big stretch to consider garbage objects as waste products of the software's metabolism. (Well, OK, it is a stretch, but like I said, we're brainstorming here.)

Garbage collection was originally developed to enable LISP programs to be developed in the way that was conceptually cleanest -- in other words, in a largely functional, allocation-intensive style -- without complicating them with memory reclamation logic. Externalizing the garbage collection code meant that the LISP program itself could more or less just do its job, with the underlying metabolism of its environment responsible for cleaning up its waste.

Another example (mine, not theirs), and probably one of the most well-known (or at least well-documented) large-scale examples, is Google's infrastructure. Google has developed numerous systems -- the Google File System, the BigTable large-scale database, and the Chubby lock service -- that are inherently widely distributed and fault-tolerant. These systems actively monitor themselves, replicate their data around failed components, and generally self-manage on top of a widely distributed pool of faulty servers. It is even less of a stretch to think of these systems as implementing a metabolic environment for user programs, composed from cells (servers) that routinely die and are replaced, while the entire system has a dynamic, self-sustaining continuity. (I'm sure many Google admins would laugh very hard at the phrase "self-sustaining" here, but nonetheless these systems do internally monitor and repair themselves.)

The allopoietic component of the Google environment is the applications that actually run on the underlying autopoietic Google services. These applications can (mostly) ignore the details of the hardware they run on and the storage they use; they can simply consume the resources they need (in both persistent storage and CPU) to perform their job.

OK. So that's autopoiesis and allopoiesis as applied to software as we currently know it. Next post: allopoiesis as the foundation of autopoiesis.