Tejlgaard: The characteristics of data persistence

Data is structured information. Persistence is steadiness: a persistent state is a steady state, and this is particularly true when the word is used in the context of data. Here, I will be addressing persistence with regard to a particular type of data: digital text.

So... persistent data is data you expect to stay the same.

Well duh, you say. Don't. Let me go over a concrete example of how the implementation of persistence may vary:

Programming languages deal with persistence in different ways; the most elegant, without a doubt, is that of the functional programming languages, such as Scheme and F#. Written in a purely functional style, programs in these languages do not mutate data (both languages do offer escape hatches, such as set! in Scheme and the mutable keyword in F#, but idiomatic code avoids them); that is, once data has been entered, it's there for good. You may, at some point, lose the ability to address it, but you'll never be able to change the data at the end of an address.

This may seem like a bigger burden than it really is: functional programming languages also allow you to create the (seemingly) same address multiple times, so rather than changing the data at the address FOO, you can create a new address FOO and put the changed data there instead.

So even if you do not allow the data at various addresses to mutate, you are not forced to get creative in coming up with new meaningful address names; you can just reuse old ones.
This does make it trickier to get at the old data (you need to get at it by writing something similar to "FOO; THE OLD ONE" in the address field), and it does make for confusing programs when the address FOO does not always refer to the same data (remember, all references to the old FOO within your program still point to the old data, because that's where they pointed before you introduced the new FOO).
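The idea of reusing an address without touching the old data can be sketched in Java as a small append-only store. This is only an analogy, not how Scheme or F# implement name binding; the class VersionedStore and its methods bind, latest and at are hypothetical names made up for this illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical append-only store: binding a name again never overwrites
// the data that was bound to it earlier.
final class VersionedStore {
    private final Map<String, List<String>> versions = new HashMap<>();

    // Bind `name` to `value` as a new version; returns its 0-based version number.
    int bind(String name, String value) {
        List<String> history = versions.computeIfAbsent(name, k -> new ArrayList<>());
        history.add(value);
        return history.size() - 1;
    }

    // The newest data bound to `name`: what a plain "FOO" refers to.
    String latest(String name) {
        List<String> history = versions.get(name);
        if (history == null) {
            throw new IllegalArgumentException("unbound name: " + name);
        }
        return history.get(history.size() - 1);
    }

    // An earlier version: the "FOO; THE OLD ONE" form of the address.
    String at(String name, int version) {
        List<String> history = versions.get(name);
        if (history == null || version < 0 || version >= history.size()) {
            throw new IllegalArgumentException("no version " + version + " of " + name);
        }
        return history.get(version);
    }
}

class VersionedStoreDemo {
    public static void main(String[] args) {
        VersionedStore store = new VersionedStore();
        int old = store.bind("FOO", "first draft");
        store.bind("FOO", "second draft");

        System.out.println(store.latest("FOO"));  // prints "second draft"
        System.out.println(store.at("FOO", old)); // prints "first draft"
    }
}
```

Anything that remembered the old version number still reaches the old data, which is exactly the source of confusion described above: the name FOO alone no longer tells you which data you will get.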

The point is, however, that it's perfectly plausible to never erase anything, and to never change anything that anybody else has been working on, in the context of any advanced system. Contrast this with Java, a complex programming language for modeling complex systems.

Here objects are handled through references (strictly speaking, Java passes the reference itself by value, but the object it points to is shared); that is, if a Salesman object passes a Car object reference to a Customer object, and the Customer object then makes a change (mutation) to the Car object's Wheeltype attribute, that change will also be apparent to the Salesman if he still has a reference to the Car object.

This behaviour can be destructive if the Salesman object was counting on the contents of the Wheeltype attribute staying the same in the Car object: if persistence was important to the Salesman, then in this case it has been broken.
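The Salesman and Customer scenario might look roughly like this in Java. It is a minimal sketch: the wheel values "alloy" and "steel", the method swapWheels, and the immutable variant PersistentCar with its withWheelType method are all assumptions added for illustration.

```java
// Mutable car: anyone holding a reference can change it for everyone.
final class Car {
    private String wheelType;

    Car(String wheelType) { this.wheelType = wheelType; }

    String getWheelType() { return wheelType; }

    void setWheelType(String wheelType) { this.wheelType = wheelType; }
}

// Hypothetical immutable variant: a "change" produces a new object
// and leaves the original untouched.
final class PersistentCar {
    private final String wheelType;

    PersistentCar(String wheelType) { this.wheelType = wheelType; }

    String getWheelType() { return wheelType; }

    PersistentCar withWheelType(String newWheelType) {
        return new PersistentCar(newWheelType);
    }
}

final class Customer {
    // Mutates the very object the caller handed over.
    void swapWheels(Car car) { car.setWheelType("steel"); }
}

class CarSharingDemo {
    public static void main(String[] args) {
        Car salesmansCar = new Car("alloy");
        new Customer().swapWheels(salesmansCar);
        System.out.println(salesmansCar.getWheelType()); // prints "steel"

        PersistentCar original = new PersistentCar("alloy");
        PersistentCar modified = original.withWheelType("steel");
        System.out.println(original.getWheelType()); // prints "alloy"
        System.out.println(modified.getWheelType()); // prints "steel"
    }
}
```

With the mutable Car, the Salesman's view changes behind his back; with the immutable variant, the Customer's change only exists in the new object, so whoever holds the original can keep relying on it.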

Alright, now that I've given some concrete examples of persistence, and you hopefully have a grasp of why it matters (at all), let's look at the how and why of the whole thing:

There are two defining questions to data persistence:

- Why persist?
- How do we persist?

The answer to the first one is: Because you might want to use it later.
The answer to the second one is: By making sure that we are able to address the data at the point where we want to use it.

Understand, then, that the relationship between not persisting and being able to address data is crucial: If you have 100 data articles, and you choose to only persist 50 of them, then you only have to keep track of 50 data addresses. Perhaps this seems like it's not that different from keeping track of 100 data addresses, but try to consider the receipts you get after purchasing groceries in the place of data articles. If you save 100 of those rather than the 50 most important ones, when the time comes to dig out the receipt for your new television set, you'll have to rummage through twice as many articles before coming upon it.

Not persisting the grocery receipts - or the data addresses - makes it quicker and easier to discern the useful ones from the useless ones; it makes it quicker and easier to address that which you are most likely to want to address, and as a consequence, it makes your collection of data more powerful and user-friendly.

This is all well and good in a single-user environment, but in a multi-user environment, things tend to change. What's important to one person could seem useless to another, and suddenly an important article is missing from the collection.

The answer is, of course, that deletion of any kind is a very primitive type of ordering in a persistent data set. It actually amounts to hiding an article very, very well; rarely does it actually pass permanently out of existence, it just becomes so hard to get at that it's no longer worth retrieving. Deletion probably only has such appeal to humans because we can safely forget about things we delete, just like things we throw out.

The same could be accomplished by using a less effective hide operation than deletion - instead of throwing the grocery receipts out, you could toss them all in an old shoebox, so that they're there at least, even if hidden behind a decidedly user-hostile interface. But even the shoebox will represent a useless article to some, so it might get deleted eventually... unless it's impossible to delete, as it would be if it were an immutable object in a functional programming language.
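The shoebox idea, hiding instead of deleting, can be sketched as a collection that has no delete operation at all. The class ReceiptBox and its methods are hypothetical names chosen for this illustration, and receipts are represented as plain strings for simplicity.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical collection with no delete operation: unwanted receipts are
// moved out of sight into a "shoebox" instead of being destroyed.
final class ReceiptBox {
    private final List<String> visible = new ArrayList<>();
    private final List<String> shoebox = new ArrayList<>();

    void add(String receipt) { visible.add(receipt); }

    // Hide a receipt from the everyday view; returns false if it was not visible.
    boolean hide(String receipt) {
        if (!visible.remove(receipt)) {
            return false;
        }
        shoebox.add(receipt);
        return true;
    }

    // Bring a hidden receipt back; returns false if it was not in the shoebox.
    boolean restore(String receipt) {
        if (!shoebox.remove(receipt)) {
            return false;
        }
        visible.add(receipt);
        return true;
    }

    // What you rummage through day to day (a copy, so callers cannot alter the box).
    List<String> everydayView() { return new ArrayList<>(visible); }

    // Everything ever added, hidden or not.
    List<String> everything() {
        List<String> all = new ArrayList<>(visible);
        all.addAll(shoebox);
        return all;
    }
}

class ReceiptBoxDemo {
    public static void main(String[] args) {
        ReceiptBox box = new ReceiptBox();
        box.add("groceries");
        box.add("television");
        box.hide("groceries");

        System.out.println(box.everydayView()); // prints "[television]"
        System.out.println(box.everything());   // prints "[television, groceries]"
    }
}
```

The everyday view gets shorter, which is the benefit deletion was supposed to bring, while a user who does care about the grocery receipt can still restore it.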

So now that we understand that the goal of persistence is the ability to use the data later, and that deletion is a type of ordering that simply makes us more _likely_ to use a collection later at all (because it becomes more powerful), we can toss the idea of deletion out the window for good... at least when it comes to small data.

Simply put, there are other types of maintenance that yield better results than deletion, often in much less time because there's far less finality to them (i.e. they make the collection more powerful, faster), so as long as size is not a factor, there should be no deletion. Since we're dealing with digital text in this article, it's easy to verify that size is, indeed, not a factor.

This makes for a conclusion to this article: Everything should be persisted. Everything that is likely to be useful should be systematically ordered. And finally, efficient algorithms for ordering and retrieval are essential to maintain the power of a collection.

Tejlgaard

Thursday, April 2, 2009
