With ColdStore you can
- Write or run code which allocates memory into a large file (large: over a gigabyte on an Intel Linux box.)
- Construct objects within a program (structs, classes, etc) which persist, so when the program stops and is restarted the objects are there and available to the program as if they'd been there all the time: pointers and references to them still work.
- Change library code implementing objects (so long as the object layout and virtual method table don't change) without having to reconstruct the store.
- Allocate memory in `extents' or `neighborhoods' such that the allocations are clustered onto a small (not to say minimal) set of hardware pages. This means code referencing those neighborhoods is more likely to find the objects it needs already swapped into RAM from the store.
- Optionally use a whole library of classes designed or adapted to work well with extent-based allocation: arrays/lists, Tuples, dynamic strings, dictionaries (content-addressable arrays), BTrees, Symbols, Namespaces, big integers, arbitrary-precision floats, regular expressions, with more added all the time.
- Leave out the parts of the system you don't need. The system's modularly layered so it's at least possible to substitute a new class library for the ColdStore one, a different allocation scheme for qvmm, EPCKPT for persistence, and still get the functionality of the parts you choose to use. I don't know about you, but I'm tired of software libraries which come with normative `lifestyle' assumptions. We've tried (as much as possible) to minimise this with ColdStore.
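To make the `neighborhood' idea concrete, here is a minimal sketch of page-clustered allocation. It is not QVMM and does not use QVMM's interface: the names `Neighborhood`, `page_of` and `kPageSize` are invented for illustration, the page size is assumed to be 4096 bytes, and the pages live in ordinary heap memory rather than in a store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdlib.h>   // posix_memalign, free
#include <vector>

constexpr std::size_t kPageSize = 4096;  // assumed hardware page size

// Toy "neighborhood": a bump allocator that fills one page at a time, so
// objects allocated together end up packed onto the same few pages.
class Neighborhood {
public:
    Neighborhood() = default;
    Neighborhood(const Neighborhood&) = delete;
    Neighborhood& operator=(const Neighborhood&) = delete;
    ~Neighborhood() { for (void* p : pages_) free(p); }

    // Allocate `size` bytes (at most one page), rounded up to 8-byte alignment.
    void* allocate(std::size_t size) {
        size = (size + 7) & ~std::size_t{7};
        assert(size <= kPageSize);
        if (pages_.empty() || used_ + size > kPageSize) {
            void* page = nullptr;
            if (posix_memalign(&page, kPageSize, kPageSize) != 0) return nullptr;
            pages_.push_back(page);   // start a fresh page for this neighborhood
            used_ = 0;
        }
        void* out = static_cast<char*>(pages_.back()) + used_;
        used_ += size;
        return out;
    }

private:
    std::vector<void*> pages_;  // page-aligned blocks owned by this neighborhood
    std::size_t used_ = 0;      // bytes consumed in the most recent page
};

// Page number an address falls on.
inline std::uintptr_t page_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kPageSize;
}
```

Two small objects allocated from the same `Neighborhood` share a page even if allocations from a different neighborhood happen in between, which is the property that keeps the working set small.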
Rationales
What we think is needed (if not inevitable) is a way to store heterogeneous (unruly, unpredictable, complexly structured) data so it can be accessed quickly for processing by programs. The quickest way to access data for processing is to store it in the form it's used. By using mmap(), under Unix, you can keep the data on disk in precisely the form it's processed in, transparently, and persistently.
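As a rough illustration of the mmap() point, the sketch below keeps a small struct in a file in exactly its in-memory layout and updates it with an ordinary memory write. It uses only standard POSIX calls; the `Counter` struct, the `kMagic` value and the `bump_runs` function are made up for this example and are not part of ColdStore.

```cpp
#include <cassert>
#include <cstdint>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical record kept on disk in exactly its in-memory layout.
struct Counter {
    std::uint32_t magic;  // marks an initialised store
    std::uint32_t runs;   // how many times the store has been opened
};

constexpr std::uint32_t kMagic = 0xC01D5704u;  // arbitrary marker value

// Map `path` into memory, bump the run counter in place, and return the new
// value. Returns -1 if any system call fails.
long bump_runs(const char* path) {
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return -1;
    // A new file is extended with zero bytes; an existing one is unchanged.
    if (ftruncate(fd, sizeof(Counter)) != 0) { close(fd); return -1; }

    void* mem = mmap(nullptr, sizeof(Counter), PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (mem == MAP_FAILED) return -1;

    Counter* c = static_cast<Counter*>(mem);
    if (c->magic != kMagic) {  // fresh, zero-filled file: initialise once
        c->magic = kMagic;
        c->runs = 0;
    }
    long result = ++c->runs;   // plain memory write, no serialisation step

    msync(mem, sizeof(Counter), MS_SYNC);  // flush to disk before unmapping
    munmap(mem, sizeof(Counter));
    return result;
}
```

Calling `bump_runs` on the same path from successive runs of a program returns 1, 2, 3 and so on: the counter survives between runs without ever being written out or parsed back in explicitly.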
Very few programs run once, produce only what you see now, and then go away. A whole class of applications produce data which is intended to be around for a long time (relative to a given run of the program.)
Currently, conventionally, the data's explicitly written out to disk, usually in a special storage-hardened form. Of course, it has to be read back in again next time you want to process it. This has costs:
- Programmer effort is expended in structuring data to fit into homogeneous (fixed size, fixed structure) containers (such as records, tuples, indices, tables, databases).
- Program execution time is spent serialising (pickling) data for long term storage, and deserialising (parsing) it back into the program's address space so it can be operated on.
- Some data (most data) just doesn't fit well into fixed records. Even something as well understood as a Customer record conventionally reserves worst-case space for several address lines (this is called internal fragmentation in storage allocation thinking) and still can't cope if there's an even worse case nobody foresaw.
- Increasingly, data (in the old database sense) is losing importance relative to text (emails, web pages, SGML/XML documents.) Trouble is, databases containing text are hard to index and process, because text just doesn't fit well into fixed sized records.
- Textual data is becoming more highly structured: SGML/XML/HTML are examples, RFC822 formatted messages (email), MIME encapsulated files all have rich and hard to represent/store/process formats. Serialising and deserialising textual data is going to consume more processing time.
- Object Request Brokers (CORBA, ILU, etc.) focus on communication and interoperability, which may or may not be important in a given application, but which slows them down. You would never consider using CORBA to store your data for processing ... well, perhaps you would, but I wouldn't: it has to spend too much time finding it and translating it. ColdStore data is not encoded at all.
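To put numbers on the Customer record example above, here is a small sketch comparing a fixed-size record with a variable-size one. The field sizes (64-byte name, four 64-byte address lines) and all of the names are assumptions chosen for illustration, not taken from any real schema or from ColdStore's classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Fixed-size record: reserves worst-case room for every customer.
struct FixedCustomer {
    char name[64];
    char address[4][64];  // four address lines, whether needed or not
};

// Variable-size alternative: holds only what each customer actually has.
struct FlexibleCustomer {
    std::string name;
    std::vector<std::string> address;
};

// Bytes of text the customer actually carries.
std::size_t payload_bytes(const FlexibleCustomer& c) {
    std::size_t n = c.name.size();
    for (const std::string& line : c.address) n += line.size();
    return n;
}

// True if the customer can be squeezed into the fixed record at all
// (each field must leave room for a terminating NUL).
bool fits_fixed(const FlexibleCustomer& c) {
    if (c.name.size() >= 64 || c.address.size() > 4) return false;
    for (const std::string& line : c.address)
        if (line.size() >= 64) return false;
    return true;
}

// Bytes reserved by the fixed record but not used by this customer:
// the internal fragmentation. Only meaningful when the customer fits.
std::size_t wasted_in_fixed(const FlexibleCustomer& c) {
    assert(fits_fixed(c));
    return sizeof(FixedCustomer) - payload_bytes(c);
}
```

A customer named "Ann" with the address lines "1 High St" and "Hobart" carries 18 bytes of text, so the 320-byte fixed record wastes 302 of them; a customer with five address lines doesn't fit in it at all.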
Audience
This system is likely to be of use to you if you're a programmer looking for a way to store heterogeneous data and to access and process it quickly. While it's possible to present this kind of facility to an end user, it's unlikely to make much difference to them: after all, that's what users expect of everything anyway. This system is designed to help programmers deliver it with most convenience (to the programmers, of course :)
Blue Sky
Virtual Reality
Improve the performance of VRML systems by storing VRML objects without having to parse them: you can render and operate on them directly out of memory without reparsing. Store your Quake Levels the same way.
Workspaces
(Anyone else remember APL?)
Your Tcl/Python/Java/Scheme/Perl program can store over a gigabyte of data `natively', live, in variables, arrays, dictionaries.
In future versions of interpreters, programs/scripts would never terminate, but merely freeze, their current state captured in the ColdStore. Thawing such an interpreter would bring it back to the precise state it was in at the time of its freezing. EPCKPT supports this right now, but tacitly assumes that implementations will outlive objects (if .so files change, the results are undefined.) ColdStore assumes the opposite.
You don't need a database interface when everything's already got a name, or is stored in something that does. The name we give to an object in a programming language is a unique key, which translates to the data contained within it ... this is just like a database index returning a record, but there's no need for a `select', it's already there.
You don't need files
(well, except for backup.)
- Editors that never die.
- Editors that work directly on the parse-trees of the languages you're composing code for.
- Parsers that take input from those editors, feed back to them, and feed forward via directly stored Abstract Syntax Trees, to compilers generating code on the fly.
MUDs MOOs MUSHes M*
Good way to store those objects. ColdStore could almost have been designed to implement a M* :)
Document Stores
Detail
ColdStore is a gigabyte-scale persistent object store which provides:
- Extent-based allocation (for maximal spatial locality of reference, minimal working set)
- Interning of ELF symbols (so your class implementation may change without the necessity to rebuild the store.)
- A rich set of Container and Basic classes optimised with respect to the QVMM allocator.
- A toy language - Chaos, designed to provide low-level access to objects and regression testing of the store and its application classes.
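The reason symbol interning matters: a raw code address saved in the store goes stale as soon as the library is rebuilt and its routines move, whereas a symbol name can be looked up again each time the store is opened. The sketch below shows only that general idea, using the standard dlopen()/dlsym() calls as a stand-in; it is not ColdStore's own mechanism, and `StoredCallback`, `LengthFn` and `resolve` are names invented for the example. On older glibc versions you may need to link with -ldl.

```cpp
#include <cassert>
#include <cstddef>
#include <dlfcn.h>

// Hypothetical persistent record: it stores the *name* of the routine it
// needs, not the routine's address, so the address may differ between runs.
struct StoredCallback {
    char symbol[32];  // e.g. "strlen"
};

using LengthFn = std::size_t (*)(const char*);

// Resolve the stored name against the program and libraries loaded right now.
// Returns nullptr if the symbol cannot be found.
LengthFn resolve(const StoredCallback& cb) {
    void* self = dlopen(nullptr, RTLD_NOW);  // handle for the running program
    if (!self) return nullptr;
    void* addr = dlsym(self, cb.symbol);     // search it and its dependencies
    dlclose(self);
    return reinterpret_cast<LengthFn>(addr);
}
```

A record holding the name "strlen" resolves to whatever address the C library's strlen has in the current process, and a name that no loaded library exports resolves to nullptr, which the caller has to handle.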
Possible Uses
We're planning to use it to make large network-capable interoperable scripting-language workspaces restartable (possibly migratable).
You might use it as:
- A replacement for a database: there's a Dict class which provides BTree indexing.
- A cache for highly structured objects (eg. decoded Web pages)
- A back end for any program which generates large quantities of heterogeneous data which needs to stick around for a long time.
Future
The near future of ColdStore holds:
- Fine-grained Multithreading under pthreads/LinuxThreads.
- Deferred balanced BTree memory allocation (at the moment it uses a BFL.)
- Objects designed to support generic virtual machines.
- Objects designed to support structured text (SGML, XML trees/graphs).
- Network objects - handling tcp/ip connections of various kinds.
- STL port - persistent STL objects.
- At least one serious language: C--, a Self-like language
- More languages: Python, Tcl, possibly even Java.
Status
Rapidly Approaching Beta (Tue Aug 10 12:02:39 EST 1999).
Requirements
- A good C++ compiler. Preferably gcc 2.95.
- Linux (Intel) - may change - Sun's a possibility.
- glibc2.1 or later - mandatory, 'cos symbol interning is tricky.
- libelf - you can get it here.
- GNU's GMP package (I got mine from Debian, but it's also on prep.ai.mit.edu somewhere.)
- Suggest kdoc to generate detailed implementation docs.
Installation
- Get it from here.
- Unpack it: tar xzvf coldstore.tgz
- Make it: cd coldstore; make
- Try it out: cd chaos; ./chaos
Note: flex++ should have come with flex. It's sufficient to just ln flex flex++.
Contacts
More Info
We've got a Whitepaper. It goes into some detail as to design, but it's sadly out of date.
License
Short Form: It's all GPL.
All component licenses are GPL, except where noted (a couple of marked derived works.)
License terms are contained in each directory, in a file named LICENSE. All references to LICENSE in the source are accompanied by an MD5 checksum.
The complete work, considered as a portmanteau, is licensed under the terms contained in coldstore/LICENSE (MD5 f5220f8f599e5e926f37cf32efe3ab68), and Copyright Colin McCormack and Philippe Hébrais.
Acknowledgements
A lot of people have advanced the work, some unwittingly, some by merely standing and waiting (and waiting,) some actively:
Directions, Concepts and Influences.
- Colin McCormack
- Co-Founder. Obsessing, coding, obsessing some more.
- Philippe Hébrais
- Co-Founder. Turned vague rumblings about Spatial Locality of Reference into a breakthrough, designed and coded QVMM unbelievably quickly and well. Contributed his experience from other experimental language developments to the design of this toolkit.
- Jordan B. Baker
- Enthusiasm and Chaos
- Jeremy FitzHardinge
- Crucial assistance with ELF and mmap (he wrote it, after all.)
- Li2CO3
- Visual Artist in residence.
- Andrew Morton
- Gentle enquiries.
- Ryan Daum
- Suggesting we look carefully at Python. Needling us until we released.
- Nick Sweeney
- Countless conversations, numberless insights.
- Bill Drury
- The space in which we converse (xao.com 7777)
- Greg Hudson
- His epochal coldmud.
- Hans Reiser
- His epochal reiserfs.
- Guido van Rossum
- His orthogonal Python API.
- Brian Eno
- Oblique Strategies and Soundtrack