Blah blah, not done yet

The latest version of this document may be found at:
http://parseerror.com/pizza/doc/On-Code-as-Data.html

On Code as Data

by pizza

Welcome to the 21st century. For less than a day's salary I can buy an electronic machine capable of performing more simple operations in one second than I could perform in a year. With the wide availability of fast arithmetic workhorses it is within the capabilities of an intelligent human to perform tasks of polynomial complexity nearly instantaneously. And yet, for the most part, we can't and we don't. Why is that?

Our software is hopelessly bug-ridden; large projects comprise millions of parts and have thousands of bugs. Programmers spend hours chasing down obscure syntax errors, hard-to-find runtime errors and version incompatibility issues. The more we work on software the more complex and inaccessible it becomes; the exact opposite of most other engineering fields. Why is that?

The reason is so simple that most popular contemporary languages seem, strangely, to have overlooked it: the vast majority of them can easily digest, filter, sort and modify any data except their own source. This is such a simple, fundamental concept that its importance cannot be overstated. The absence of this one seemingly esoteric feature is the cause of all the software industry's headaches. Why is that?

Overcomplexity

Like any machine, software is composed of simple individual parts, each straightforward in isolation. Software is much less limited by the physical world than its physical counterparts, however. For example, a Rube Goldberg device's charming overcomplexity is obvious because its structure and operation are clearly observable. Not so for software, whose overall structure and operation are not clearly observable. This opaqueness leads to a chronic lack of details, and that leads to uninformed decision-making, and that leads to all sorts of predictable problems.

Why are software's overall structure and operation not easily observable? It is a negative consequence of software's principal strength: its malleability. Software is valuable because it allows for modification within costs and timeframes that would be impossible in a physical machine. Contrast changing the color of your application vs. changing the color of a bridge. But with great power comes great responsibility; the perception of unlimited cheap and easy modification invariably leads to a reduction in planning ("we'll (fix it/figure it out) later").

How Software Grows

The problem of insufficient planning is compounded by another real-world factor. Like a mighty tree, software does not appear fully formed overnight; a mature product takes years of incremental growth. This is because of cost; a project able to earn income while under development requires less up-front capital and ultimately costs less. Active development also allows for customer feedback. As software grows, features are overhauled as necessary to support new features or new performance requirements. This means that meta-level diagnostics and monitoring features are almost never included in software's original design [cite Linux, etc.].

Necessary Complexity

Indeed, software is necessarily more complex than its outputs. Put another way, in the software world the solutions are more complex than the problems they solve. Which raises the question: if the original problem is complex enough to require a software solution, then how does one manage the solution? How can I be sure it's correct? How can I be sure it will always work? How can I be sure it fails properly? How can I be sure it doesn't include huge amounts of superfluous logic? How can I be sure it even runs?

Human Overhead

The traditional method of addressing these questions is good old-fashioned human mental labor. Specifically: testing, documentation and peer review in various forms. However, these methods have two limitations:

  1. they can never definitively answer all the questions we have about our software and
  2. they depend on humans, who are orders of magnitude slower than computers

Indeed, as a software project grows, the administrative overhead necessary to maintain it quickly outstrips the work on the software itself.

Code as Data

The answer to managing software is the same as the answer to the original question: software. Just as we have programs to search, sift and classify documents, products, contacts and customers, we need software to perform those tasks on software. We must allow software itself to reap the benefit

That being said, software including meta-software capabilities has existed for decades.

Lisp

The venerable Lisp family of programming languages has been able to parse itself and generate its own code since the 1950s. Most unfortunately, Lisp's "macro" features appear to be used more to modify and generate source on-the-fly than to help understand it. Sadly, this use has the opposite of the desired effect: it decreases transparency and increases software complexity.

Static source analysis

"Static checker" software such as Coverity is a step in the right direction, but against overwhelming odds. The languages Coverity supports are not designed to be treated as data, and will always contain . Even more crucially is the fact that a single company with a closed product, no matter how resourceful, will never have as much mental firepower as an open platform.

DTrace

Most promising is DTrace from Sun Microsystems. It is the software analog of tracing the flow of radioactive tracers through a patient's body, and it provides a scripting language to filter and format results. It is an excellent piece of software, but it still does not go far enough. It does solve the hardest piece of the puzzle, which is making software's behavior observable. While it excels at identifying and even diagnosing problematic behavior within a running system, it yields little insight into any given software's design or its correctness.
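
The flavor of making behavior observable can be sketched in a few lines of Python using the standard `sys.setprofile` hook. This is only an analogy under stated assumptions, not DTrace and not its scripting language: the helper `trace_calls` and the two sample functions are hypothetical names invented for this illustration, and the hook sees only the Python process it runs in.

```python
import sys


def trace_calls(func, *args, **kwargs):
    """Run func and return (result, names of Python functions called, in order)."""
    calls = []

    def profiler(frame, event, arg):
        # "call" fires when a Python function is entered; built-ins
        # written in C arrive as "c_call" events and are ignored here.
        if event == "call":
            calls.append(frame.f_code.co_name)

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)  # always remove the hook again
    return result, calls


def square(n):
    return n * n


def sum_of_squares(limit):
    total = 0
    for n in range(limit):
        total += square(n)
    return total


result, calls = trace_calls(sum_of_squares, 3)
print(result)  # 5
print(calls)   # ['sum_of_squares', 'square', 'square', 'square']
```

As with DTrace, the recorded calls show what the program did on one run, but they say nothing about whether the design is sound or the result correct.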

Each of these tools addresses a facet of the problem of making

  1. Exposing source code as data
  2. Implementing tools to analyze said exposed source in some way
    • Heuristic checking
    • Design visualization
  3. Tracing execution of

Challenge

I pose this challenge to the programming community:

Implement a source-code querying framework for an existing programming language in the language itself. Some suggestions are C, Python, Ruby or Lua, but anything goes. The interface would include the ability to:

  1. load source from a file, optionally recursively following all "included" dependencies/libraries
  2. load source from a string, optionally recursively following all "included" dependencies/libraries
  3. return a list of all functions descending by number of parameters
  4. return a list of all fully-qualified functions called from said code and which function they are called from
  5. return a list of all functions containing loops descending by the maximum depth of loop nesting present
  6. return a list of all unused functions
  7. return a list of all variable assignments
  8. return a list of all functions that directly produce output
  9. return a list of all functions that indirectly produce output
  10. return a list of all functions that never, directly or indirectly, produce output
  11. allow an xpath-style query for function nesting, such as "foo//bar", which finds any instance of function call nesting where the function "bar" is called within any function that is called by "foo"
  12. return a list of instances where a variable is potentially written to more than once
  13. return a list of instances where a variable is assigned a constant value and then potentially overwritten
  14. allow a trace of a variable's content as it is read/written in scope, passed as a parameter to other functions recursively, etc., and show where the data from a variable ends up.
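
As a starting point, a few of the items above can be sketched in Python with its standard `ast` module. This is a minimal sketch under loud assumptions, not a solution to the challenge: the class `SourceQuery`, its method names and the sample source are all hypothetical; only calls written as plain `name(...)` are resolved; `print` is assumed to be the only output primitive; and dependencies are not followed. It covers items 2, 3, 5, 6, 8 and 9.

```python
import ast

OUTPUT_CALLS = {"print"}  # assumption: the only output primitive recognized


def max_loop_depth(node):
    """Deepest nesting of for/while loops found beneath `node`."""
    deepest = 0
    for child in ast.iter_child_nodes(node):
        depth = max_loop_depth(child)
        if isinstance(child, (ast.For, ast.AsyncFor, ast.While)):
            depth += 1
        deepest = max(deepest, depth)
    return deepest


def called_names(node):
    """Names of functions called as plain `name(...)` anywhere beneath `node`."""
    return {
        n.func.id
        for n in ast.walk(node)
        if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)
    }


class SourceQuery:
    """Hypothetical query interface over a single module's source."""

    def __init__(self, source):
        # Item 2: load source from a string (dependencies are not followed).
        self.tree = ast.parse(source)
        self.functions = {
            n.name: n for n in ast.walk(self.tree) if isinstance(n, ast.FunctionDef)
        }
        self.callees = {name: called_names(fn) for name, fn in self.functions.items()}

    def by_parameter_count(self):
        # Item 3: functions in descending order of positional parameter count.
        counts = [(name, len(fn.args.args)) for name, fn in self.functions.items()]
        return sorted(counts, key=lambda pair: pair[1], reverse=True)

    def loop_depths(self):
        # Item 5: functions containing loops, deepest nesting first.
        depths = [(name, max_loop_depth(fn)) for name, fn in self.functions.items()]
        looping = [pair for pair in depths if pair[1] > 0]
        return sorted(looping, key=lambda pair: pair[1], reverse=True)

    def unused(self):
        # Item 6: functions never called by name anywhere in the module.
        return sorted(set(self.functions) - called_names(self.tree))

    def direct_output(self):
        # Item 8: functions that call an output primitive themselves.
        return sorted(n for n, callees in self.callees.items() if callees & OUTPUT_CALLS)

    def indirect_output(self):
        # Item 9: functions that reach a direct-output function through calls.
        direct = set(self.direct_output())
        reaching = set(direct)
        changed = True
        while changed:  # propagate until no new function is added
            changed = False
            for name, callees in self.callees.items():
                if name not in reaching and callees & reaching:
                    reaching.add(name)
                    changed = True
        return sorted(reaching - direct)


SAMPLE = '''
def emit(msg):
    print(msg)

def report(rows, title):
    for row in rows:
        for cell in row:
            emit(cell)

def helper(a, b, c):
    return a + b + c

report([[1]], "t")
'''

query = SourceQuery(SAMPLE)
print(query.by_parameter_count())  # [('helper', 3), ('report', 2), ('emit', 1)]
print(query.loop_depths())         # [('report', 2)]
print(query.unused())              # ['helper']
print(query.direct_output())       # ['emit']
print(query.indirect_output())     # ['report']
```

Each query is only a few lines because the parser has already turned the source into data. The harder items, such as fully-qualified call resolution and tracing where a variable's data ends up, need real name binding and data-flow analysis that this sketch does not attempt.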

References

  1. P (complexity)
  2. Eating One's Own Dogfood
  3. Rube Goldberg device
  4. Homoiconicity
  5. Kolmogorov complexity
  6. DTrace