Keeping computer and technology news simple.

March 20, 2008

MapReduce: A major step backwards

On January 8, a Database Column reader asked for our views on new distributed database research efforts, and we'll begin here with our views on MapReduce. This is a good time to discuss it, since the recent trade press has been filled with news of the revolution of so-called "cloud computing." This paradigm entails harnessing large numbers of (low-end) processors working in parallel to solve a computing problem. In effect, this suggests constructing a data center by lining up a large number of "jelly beans" rather than utilizing a much smaller number of high-end servers.

For example, IBM and Google have announced plans to make a 1,000 processor cluster available to a few select universities to teach students how to program such clusters using a software tool called MapReduce [1]. Berkeley has gone so far as to plan on teaching their freshmen how to program using the MapReduce framework.

As both educators and researchers, we are amazed at the hype that the MapReduce proponents have spread about how it represents a paradigm shift in the development of scalable, data-intensive applications. MapReduce may be a good idea for writing certain types of general-purpose computations, but to the database community, it is:

  1. A giant step backward in the programming paradigm for large-scale data intensive applications

  2. A sub-optimal implementation, in that it uses brute force instead of indexing

  3. Not novel at all -- it represents a specific implementation of well known techniques developed nearly 25 years ago

  4. Missing most of the features that are routinely included in current DBMS

  5. Incompatible with all of the tools DBMS users have come to depend on

First, we will briefly discuss what MapReduce is; then we will go into more detail about our five reactions listed above.


What is MapReduce?

The basic idea of MapReduce is straightforward. It consists of two programs that the user writes called map and reduce plus a framework for executing a possibly large number of instances of each program on a compute cluster.

The map program reads a set of "records" from an input file, does any desired filtering and/or transformations, and then outputs a set of records of the form (key, data). As the map program produces output records, a "split" function partitions the records into M disjoint buckets by applying a function to the key of each output record. This split function is typically a hash function, though any deterministic function will suffice. When a bucket fills, it is written to disk. The map program terminates with M output files, one for each bucket.

In general, there are multiple instances of the map program running on different nodes of a compute cluster. Each map instance is given a distinct portion of the input file by the MapReduce scheduler to process. If N nodes participate in the map phase, then there are M files on disk storage at each of N nodes, for a total of N * M files; Fi,j, 1 ≤ i ≤ N, 1 ≤ j ≤ M.

The key thing to observe is that all map instances use the same hash function. Hence, all output records with the same hash value will be in corresponding output files.

The second phase of a MapReduce job executes M instances of the reduce program, Rj, 1 ≤ j ≤ M. The input for each reduce instance Rj consists of the files Fi,j, 1 ≤ i ≤ N. Again notice that all output records from the map phase with the same hash value will be consumed by the same reduce instance -- no matter which map instance produced them. After being collected by the map-reduce framework, the input records to a reduce instance are grouped on their keys (by sorting or hashing) and fed to the reduce program. Like the map program, the reduce program is an arbitrary computation in a general-purpose language. Hence, it can do anything it wants with its records. For example, it might compute some additional function over other data fields in the record. Each reduce instance can write records to an output file, which forms part of the "answer" to a MapReduce computation.

To draw an analogy to SQL, map is like the group-by clause of an aggregate query. Reduce is analogous to the aggregate function (e.g., average) that is computed over all the rows with the same group-by attribute.
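
To make the flow concrete, here is a minimal, in-memory sketch in Python. The word-count example, the function names, and the choice of M = 4 buckets are ours for illustration and are not part of Google's framework; a real system would also use a hash function that is stable across processes, rather than Python's built-in hash.

    from collections import defaultdict

    M = 4  # number of reduce buckets, chosen arbitrarily for the example

    def word_count_map(record):
        # Emit (key, data) pairs -- here, one pair per word in the input line.
        for word in record.split():
            yield (word, 1)

    def split(key):
        # The deterministic "split" function: assign the key to one of M buckets.
        return hash(key) % M

    def word_count_reduce(key, values):
        # Aggregate all values sharing a key -- the SQL-aggregate analogue above.
        return (key, sum(values))

    def run(records):
        buckets = [defaultdict(list) for _ in range(M)]
        for record in records:                  # map phase
            for key, data in word_count_map(record):
                buckets[split(key)][key].append(data)
        results = []
        for bucket in buckets:                  # reduce phase, one instance per bucket
            for key, values in bucket.items():
                results.append(word_count_reduce(key, values))
        return results

    print(run(["the quick brown fox", "the lazy dog"]))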

We now turn to the five concerns we have with this computing paradigm.


1. MapReduce is a step backwards in database access

As a data processing paradigm, MapReduce represents a giant step backwards. The database community has learned the following three lessons from the 40 years that have unfolded since IBM first released IMS in 1968.

  • Schemas are good.

  • Separation of the schema from the application is good.

  • High-level access languages are good.

MapReduce has learned none of these lessons and represents a throwback to the 1960s, before modern DBMSs were invented.

The DBMS community learned the importance of schemas, whereby the fields and their data types are recorded in storage. More importantly, the run-time system of the DBMS can ensure that input records obey this schema. This is the best way to keep an application from adding "garbage" to a data set. MapReduce has no such functionality, and there are no controls to keep garbage out of its data sets. A corrupted MapReduce dataset can actually silently break all the MapReduce applications that use that dataset.

It is also crucial to separate the schema from the application program. If a programmer wants to write a new application against a data set, he or she must discover the record structure. In modern DBMSs, the schema is stored in a collection of system catalogs and can be queried (in SQL) by any user to uncover such structure. In contrast, when the schema does not exist or is buried in an application program, the programmer must discover the structure by an examination of the code. Not only is this a very tedious exercise, but also the programmer must find the source code for the application. This latter tedium is forced onto every MapReduce programmer, since there are no system catalogs recording the structure of records -- if any such structure exists.
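
As a small illustration of what "schema buried in the code" means, consider a hypothetical mapper over tab-separated employee records; the field positions and the salary threshold below are invented for the example.

    def map_high_earners(line):
        # The record layout is known only to this program: field 0 is a name,
        # field 2 a department, field 3 a salary (all assumed for the example).
        fields = line.split("\t")
        name, dept, salary = fields[0], fields[2], float(fields[3])
        if salary > 10000:
            yield (dept, name)

    # With a DBMS, the same structure is discoverable by querying the system
    # catalogs (e.g., information_schema in SQL) instead of reading source code.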

During the 1970s the DBMS community engaged in a "great debate" between the relational advocates and the Codasyl advocates. One of the key issues was whether a DBMS access program should be written:

  • By stating what you want - rather than presenting an algorithm for how to get it (relational view)

  • By presenting an algorithm for data access (Codasyl view)

The result is now ancient history, but the entire world saw the value of high-level languages and relational systems prevailed. Programs in high-level languages are easier to write, easier to modify, and easier for a new person to understand. Codasyl was rightly criticized for being "the assembly language of DBMS access." A MapReduce programmer is analogous to a Codasyl programmer -- he or she is writing in a low-level language performing low-level record manipulation. Nobody advocates returning to assembly language; similarly nobody should be forced to program in MapReduce.

MapReduce advocates might counter this argument by claiming that the datasets they are targeting have no schema. We dismiss this assertion. In extracting a key from the input data set, the map function is relying on the existence of at least one data field in each input record. The same holds for a reduce function that computes some value from the records it receives to process.

Writing MapReduce applications on top of Google's BigTable (or Hadoop's HBase) does not really change the situation significantly. By using a self-describing tuple format (row key, column name, {values}) different tuples within the same table can actually have different schemas. In addition, BigTable and HBase do not provide logical independence, for example with a view mechanism. Views significantly simplify keeping applications running when the logical schema changes.


2. MapReduce is a poor implementation


All modern DBMSs use hash or B-tree indexes to accelerate access to data. If one is looking for a subset of the records (e.g., those employees with a salary of 10,000 or those in the shoe department), then one can often use an index to advantage to cut down the scope of the search by one to two orders of magnitude. In addition, there is a query optimizer to decide whether to use an index or perform a brute-force sequential search.

MapReduce has no indexes and therefore has only brute force as a processing option. It will be creamed whenever an index is the better access mechanism.
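
The contrast is easy to sketch. In the toy Python fragment below, a plain dict stands in for a hash or B-tree index, and the employee records are invented for the example; the brute-force scan touches every record, while the indexed lookup touches only the matching ones.

    employees = [("Alice", "shoe", 10000), ("Bob", "toys", 9000),
                 ("Carol", "shoe", 12000)]          # invented records

    # Brute force: examine every record (MapReduce's only option).
    scan_hits = [e for e in employees if e[1] == "shoe"]

    # Indexed access: build the index once, then touch only the matching records.
    by_dept = {}
    for e in employees:
        by_dept.setdefault(e[1], []).append(e)
    index_hits = by_dept.get("shoe", [])

    assert scan_hits == index_hits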

One could argue that the value of MapReduce is automatically providing parallel execution on a grid of computers. This feature was explored by the DBMS research community in the 1980s, and multiple prototypes were built including Gamma [2,3], Bubba [4], and Grace [5]. Commercialization of these ideas occurred in the late 1980s with systems such as Teradata.

In summary, there have been high-performance, commercial, grid-oriented SQL engines (with schemas and indexing) for the past 20 years. MapReduce does not fare well when compared with such systems.

There are also some lower-level implementation issues with MapReduce, specifically skew and data interchange.

One factor that MapReduce advocates seem to have overlooked is the issue of skew. As described in "Parallel Database Systems: The Future of High Performance Database Systems," [6] skew is a huge impediment to achieving successful scale-up in parallel query systems. The problem occurs in the map phase when there is wide variance in the distribution of records with the same key. This variance, in turn, causes some reduce instances to take much longer to run than others, resulting in the execution time for the computation being the running time of the slowest reduce instance. The parallel database community has studied this problem extensively and has developed solutions that the MapReduce community might want to adopt.
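
A small, synthetic illustration of the problem: if one key dominates the data, the reduce bucket that receives it dwarfs the others, and the job finishes no sooner than that straggler. The key distribution below is invented for the example.

    from collections import Counter

    M = 4
    # Synthetic keys: one "popular" key accounts for 90% of the records.
    keys = ["popular"] * 900 + [f"rare-{i}" for i in range(100)]

    bucket_sizes = Counter(hash(k) % M for k in keys)
    # The job finishes when the largest bucket does, not the average one.
    print(sorted(bucket_sizes.values()), "slowest bucket:", max(bucket_sizes.values()))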

There is a second serious performance problem that gets glossed over by the MapReduce proponents. Recall that each of the N map instances produces M output files -- each destined for a different reduce instance. These files are written to a disk local to the computer used to run the map instance. If N is 1,000 and M is 500, the map phase produces 500,000 local files. When the reduce phase starts, each of the 500 reduce instances needs to read its 1,000 input files and must use a protocol like FTP to "pull" each of its input files from the nodes on which the map instances were run. With 100s of reduce instances running simultaneously, it is inevitable that two or more reduce instances will attempt to read their input files from the same map node simultaneously -- inducing large numbers of disk seeks and slowing the effective disk transfer rate by more than a factor of 20. This is why parallel database systems do not materialize their split files and use push (to sockets) instead of pull. Since much of the excellent fault-tolerance that MapReduce obtains depends on materializing its split files, it is not clear whether the MapReduce framework could be successfully modified to use the push paradigm instead.
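
The file-count arithmetic in the example above works out as follows (numbers taken from the paragraph, not from any measured deployment):

    N, M = 1000, 500            # map instances, reduce instances (from the text)
    total_local_files = N * M   # 500,000 intermediate files across the cluster
    files_per_reducer = N       # each of the 500 reduce instances pulls 1,000 files
    print(total_local_files, files_per_reducer)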

Given the experimental evaluations to date, we have serious doubts about how well MapReduce applications can scale. Moreover, the MapReduce implementers would do well to study the last 25 years of parallel DBMS research literature.


3. MapReduce is not novel

The MapReduce community seems to feel that they have discovered an entirely new paradigm for processing large data sets. In actuality, the techniques employed by MapReduce are more than 20 years old. The idea of partitioning a large data set into smaller partitions was first proposed in "Application of Hash to Data Base Machine and Its Architecture" [11] as the basis for a new type of join algorithm. In "Multiprocessor Hash-Based Join Algorithms," [7], Gerber demonstrated how Kitsuregawa's techniques could be extended to execute joins in parallel on a shared-nothing [8] cluster using a combination of partitioned tables, partitioned execution, and hash based splitting. DeWitt [2] showed how these techniques could be adopted to execute aggregates with and without group by clauses in parallel. DeWitt and Gray [6] described parallel database systems and how they process queries. Shatdal and Naughton [9] explored alternative strategies for executing aggregates in parallel.

Teradata has been selling a commercial DBMS utilizing all of these techniques for more than 20 years; exactly the techniques that the MapReduce crowd claims to have invented.

While MapReduce advocates will undoubtedly assert that being able to write MapReduce functions is what differentiates their software from a parallel SQL implementation, we would remind them that POSTGRES supported user-defined functions and user-defined aggregates in the mid 1980s. Essentially, all modern database systems have provided such functionality for quite a while, starting with the Illustra engine around 1995.


4. MapReduce is missing features

All of the following features are routinely provided by modern DBMSs, and all are missing from MapReduce:

  • Bulk loader -- to transform input data in files into a desired format and load it into a DBMS

  • Indexing -- as noted above

  • Updates -- to change the data in the data base

  • Transactions -- to support parallel update and recovery from failures during update

  • Integrity constraints -- to help keep garbage out of the data base

  • Referential integrity -- again, to help keep garbage out of the data base

  • Views -- so the schema can change without having to rewrite the application program

In summary, MapReduce provides only a sliver of the functionality found in modern DBMSs.


5. MapReduce is incompatible with the DBMS tools


A modern SQL DBMS has available all of the following classes of tools:

  • Report writers (e.g., Crystal reports) to prepare reports for human visualization

  • Business intelligence tools (e.g., Business Objects or Cognos) to enable ad-hoc querying of large data warehouses

  • Data mining tools (e.g., Oracle Data Mining or IBM DB2 Intelligent Miner) to allow a user to discover structure in large data sets

  • Replication tools (e.g., Golden Gate) to allow a user to replicate data from one DBMS to another

  • Database design tools (e.g., Embarcadero) to assist the user in constructing a data base.

MapReduce cannot use these tools and has none of its own. Until it becomes SQL-compatible or until someone writes all of these tools, MapReduce will remain very difficult to use in an end-to-end task.


In Summary

It is exciting to see a much larger community engaged in the design and implementation of scalable query processing techniques. We, however, assert that they should not overlook the lessons of more than 40 years of database technology -- in particular the many advantages that a data model, physical and logical data independence, and a declarative query language, such as SQL, bring to the design, implementation, and maintenance of application programs. Moreover, computer science communities tend to be insular and do not read the literature of other communities. We would encourage the wider community to examine the parallel DBMS literature of the last 25 years. Last, before MapReduce can measure up to modern DBMSs, there is a large collection of unmet features and required tools that must be added.

We fully understand that database systems are not without their problems. The database community recognizes that database systems are too "hard" to use and is working to solve this problem. The database community can also learn something valuable from the excellent fault-tolerance that MapReduce provides its applications. Finally we note that some database researchers are beginning to explore using the MapReduce framework as the basis for building scalable database systems. The Pig[10] project at Yahoo! Research is one such effort.



References


[1] "MapReduce: Simplified Data Processing on Large Clusters," Jeff Dean and Sanjay Ghemawat, Proceedings of the 2004 OSDI Conference, 2004.

[2] "The Gamma Database Machine Project," DeWitt, et. al., IEEE Transactions on Knowledge and Data Engineering, Vol. 2, No. 1, March 1990.

[4] "Gamma - A High Performance Dataflow Database Machine," DeWitt, D, R. Gerber, G. Graefe, M. Heytens, K. Kumar, and M. Muralikrishna, Proceedings of the 1986 VLDB Conference, 1986.

[5] "Prototyping Bubba, A Highly Parallel Database System," Boral, et. al., IEEE Transactions on Knowledge and Data Engineering,Vol. 2, No. 1, March 1990.

[6] "Parallel Database System: The Future of High Performance Database Systems," David J. DeWitt and Jim Gray, CACM, Vol. 35, No. 6, June 1992.

[7] "Multiprocessor Hash-Based Join Algorithms," David J. DeWitt and Robert H. Gerber, Proceedings of the 1985 VLDB Conference, 1985.

[8] "The Case for Shared-Nothing," Michael Stonebraker, Data Engineering Bulletin, Vol. 9, No. 1, 1986.

[9] "Adaptive Parallel Aggregation Algorithms," Ambuj Shatdal and Jeffrey F. Naughton, Proceedings of the 1995 SIGMOD Conference, 1995.

[10] "Pig", Chris Olston, http://research.yahoo.com/project/90

[11] "Application of Hash to Data Base Machine and Its Architecture," Masaru Kitsuregawa, Hidehiko Tanaka, Tohru Moto-Oka, New Generation Comput. 1(1): 63-74 (1983)

[Note: Although the system attributes this post to a single author, it was written by David J. DeWitt and Michael Stonebraker]

March 19, 2008

Intel hopes to unwire the world with long-range WiFi

Improving the number of people worldwide who have access to the Internet has been a stated goal of nations, international bodies, and NGOs for years, but practical attempts to do so have often faced daunting challenges. Hostile terrain, periodic extremes of temperature and weather, limited/intermittent electrical power, and the ever-present danger of being eaten by lions, bears, or giant fire-breathing iguanas are all issues that must be solved, and that's before the human factors are dealt with. It does no good, after all, to run a copper cable into a town to provide Internet access, only to have someone dig it up next week and sell it for scrap.

Intel's Rural Connectivity Platform (RCP) is designed to push traditional 802.11 wireless connections out to much greater distances than would normally be possible, thereby reducing the amount of cable that must be laid between any two points. Using the 802.11 standard means that spectrum availability is not usually an issue, and Intel's RCP supports frequencies of 900MHz, 2.4GHz, and the 5.2-5.8GHz spectrum.

The idea of focusing and transmitting a wireless signal in order to boost its range is nothing new, but Intel's RCP uses a modified TDMA (time division multiple access) system. A standard wireless system broadcasts a message and waits for a reply; if the message is not acknowledged within a certain period of time, the station broadcasts again. Intel's RCP TDMA system does away with the acknowledgment phase, and gives each radio a set block of time to send data and a set block of time to receive it. According to Jeff Galinovsky, a senior platform manager at Intel, removing the acknowledgment phase substantially increases the amount of bandwidth that's available at any given point, thus increasing the wireless system's range. Intel's specifications for RCP deployment state that towers may be up to 60 miles (100km) apart, and towers can be deployed in several different configurations.
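
For readers unfamiliar with TDMA, the sketch below shows the fixed-slot idea in its simplest possible form. It is our own illustration, with an invented slot length and two invented radios, and it is not a description of Intel's actual RCP protocol.

    SLOT_MS = 10                      # invented slot length
    radios = ["tower-A", "tower-B"]   # invented endpoints of one long link

    def slot_owner(t_ms):
        # Channel ownership rotates on a fixed, pre-agreed schedule; no radio
        # waits for a per-frame acknowledgment before taking its next turn.
        return radios[(t_ms // SLOT_MS) % len(radios)]

    for t in range(0, 60, 10):
        print(f"{t:2d} ms: {slot_owner(t)} transmits")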

Intel isn't ready for full commercial deployment at present—sales should begin in India later this year—but the company's RCP has a number of potential advantages. At less than $500 per unit and an estimated $1,000 for two nodes and the associated backend infrastructure, the RCP should deliver Internet access at a much lower cost-per-user than more traditional access methods can manage. Battery replacement or adequate power delivery is also a non-issue—Intel's RCP only draws 5-6W at most, which means the unit's needs can be met by solar power.

The RCP's estimated 6.5Mbps of total bandwidth won't exactly turn Asian gold farmers or remote village-dwellers into BitTorrent freaks, but it should provide plenty of speed to allow for basic e-mail and Internet connectivity. Developing nations have reportedly expressed a high degree of interest in the RCP. If the Indian deployment goes well later this year, general availability is likely to follow in short order.


Source and credits.

High Capacity Color Barcode

High Capacity Color Barcode technology, developed within Microsoft Research, advances the state of the art in storing digital information on analog media.


Higher Density Storage Capability

The High Capacity Color Barcode format takes advantage of advanced computer imaging devices, along with available processing power, to enable higher-density storage of data on analog printed media. The format achieves this by using a different barcode symbol shape in combination with multiple colors per symbol.

In laboratory tests, the highest-density form has so far yielded 2,000 binary bytes, or 3,500 alphabetical characters, per square inch using eight colors and a 600dpi business card scanner. This equates to two pages of a novel. The symbol size can be changed to accommodate the differing fidelities of imaging devices. The barcode can be printed using a regular inkjet or laser printer.
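
A back-of-the-envelope check of that density figure, under the assumption that eight colors carry log2(8) = 3 bits per symbol and ignoring error-correction and framing overhead (which Microsoft does not break out here):

    import math

    bits_per_symbol = math.log2(8)                # eight colors -> 3 bits per symbol
    bytes_per_sq_inch = 2000                      # figure quoted above
    symbols_per_sq_inch = bytes_per_sq_inch * 8 / bits_per_symbol
    print(round(symbols_per_sq_inch))             # roughly 5,333 symbols per square inch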

Digital Signing Capability

The High Capacity Color Barcode has integrated digital signing features using state-of-the-art Elliptic Curve Cryptography and Public Key Infrastructure (PKI) techniques. Digital signing authenticates the integrity of the data stored in the barcode, ensuring the data is unmodified from what the original author wrote.

Microsoft's signing scheme employs industry-proven techniques, achieving RSA-1024-strength security with only 20 bytes of payload overhead.

Multiple Payloads - Small File System

Typical existing barcode formats can store just one item of data, or they require the reader software to have prior knowledge of the data stored within the barcode. The High Capacity Color Barcode can store multiple individual items of data, each compressed as optimally as possible within the physical code, essentially implementing a small file system.

Robust Mobile Device Performance

The High Capacity Color Barcode decoder runs on any processor or operating system platform. The lightweight mobile phone implementation can decode barcodes in real time from a video stream, for example in 30ms on a 200MHz ARM processor. Because the format uses half the symbols of its black-and-white counterparts, decode success rates are higher for the same physical barcode size, since the symbols are larger. The smallest-capacity barcode can be read on cellphones without modified lenses at a print size of just 3/4" square, outperforming the QRCode and Datamatrix formats. Advanced segmentation and color processing techniques ensure good decode rates across a range of image and lighting qualities, from sharp to blurred and from light to dark.



Source and credits.

Off-topic: Synesthesia

A documentary about Pat and his synesthesia, a rare neurological condition. Synesthesia essentially means "the joining of the senses". People with this condition often describe sensations in abstract ways, such as "hearing colors" or "seeing music".


March 18, 2008

DataGlyphs: Xerox's technology for embedding data into documents



Basic DataGlyphs are a pattern of forward and backward slashes representing ones and zeroes. This pattern forms an evenly textured field.


PARC DataGlyphs® are a robust and unobtrusive method of embedding computer-readable data on paper surfaces.

Unlike most barcodes, DataGlyphs are flexible in shape and size. Their structure and robust error correction also make them suitable for curved surfaces and other situations where barcodes fail.

PARC invented DataGlyphs in 1993, and has licensed the basic software patents to Microglyph Technology GmbH to form the foundation of their microglyph® code. While PARC developed DataGlyphs for document management systems, Microglyph Technology has developed additional, proprietary code structures and algorithms to enable parts-marking for the manufacturing industry, to enable the embedding of computer-readable data on surfaces such as plastic, glass, or metal.



Glyphtones, DataGlyphs of varying weights, emulate the look of a grayscale image. (Original size: 3.3" x 3.3" @ 600dpi.)



Features:

  • Flexibility -- adjustable size, shape, color
  • High data density
  • Robustness
  • Adjustable error correction
  • Compatible with cryptography

At 600dpi, DataGlyphs offer up to 1KB per square inch of data. At this density, the Gettysburg Address fits in a block the size of a small United States postage stamp.

Applications:

  • document management
  • fraud prevention
  • inventory tracking
  • ID cards
  • parts marking
  • product tagging

DataGlyphs have been used in several Xerox products, licensed to a major manufacturer of airplane parts, licensed to Progressive Casualty Insurance for use in turnaround documents, and more. Other markets include financial services, software, government, health care, and pharmaceuticals.


Color DataGlyphs provide similar functionality to Glyphtones but extend the applications to color images. (Original size: 5.9" x 4.8" @ 600dpi.)

"Invisible" DataGlyphs are fine yellow glyphs printed on white, visible to the blue channel of a scanner. This drawing shows them at 200% and 1000% enlargement.




Combining different types of DataGlyphs increases the options for encoding digital information.


Technical Overview of DataGlyphs®

PARC DataGlyphs are a robust and unobtrusive method of embedding computer-readable data on surfaces such as paper, labels, plastic, glass, or metal.

How Data Is Encoded

DataGlyphs encode information - text, data, graphics - in thousands of tiny glyphs. Each glyph consists of a 45-degree diagonal line, as short as one one-hundredth of an inch or less, depending on the resolution of the printing and scanning that's used. Each glyph represents a single binary 0 or 1, depending on whether it slopes to the right or left.

Leaning either forward or back, DataGlyphs represent the ones and zeroes of binary digital data.

The glyphs are laid down in groups on a regular, finely spaced grid forming unobtrusive, evenly textured gray areas. Even when individual glyphs are large enough to be resolved by the human eye, in groups they form a pleasing pattern that is not distracting.
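
As a toy illustration of the encoding idea (our own, not PARC's encoder), the snippet below draws a byte string as forward and backward slashes on a fixed grid; the eight-glyphs-per-row layout is an arbitrary choice.

    def render_glyphs(data: bytes, per_row: int = 8) -> str:
        # One glyph per bit: "/" for 1, "\" for 0, laid out row by row.
        bits = "".join(f"{byte:08b}" for byte in data)
        glyphs = ["/" if bit == "1" else "\\" for bit in bits]
        rows = ["".join(glyphs[i:i + per_row]) for i in range(0, len(glyphs), per_row)]
        return "\n".join(rows)

    print(render_glyphs(b"Hi"))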

Robustness and Error Correction

In addition to the data, each DataGlyph contains an embedded synchronization lattice or skeleton - a repeating fixed pattern of glyphs that marks the DataGlyph's boundaries and serves as a clocking track to improve the reliability of reading. Groups of glyphs, representing bytes of data, are laid down within this frame.

The data is grouped into blocks of a few dozen bytes each, and error-correction code is added to each block. Individual applications determine the amount of error correction necessary. Of course, higher levels of error correction require larger overall DataGlyphs for a given amount of data, but improve the reliability with which the information can be read back - often a worthwhile trade-off, especially when the DataGlyph will sustain a high level of image noise (for example, during fax transmissions) or when the glyph block will be subjected to rough handling.

For reliability, each DataGlyph contains a measure of error correction appropriate to the application. Glyphs are also randomized to sustain the integrity of the data through damage to the document and laid into a synchronization frame.

As a final step, the bytes of data are randomly dispersed across the entire area of the DataGlyph. Thus, if any part of the DataGlyph is severely damaged, the damage to any individual block of data will be slight, and the error-correcting code will easily be able to recover all the information encoded, despite the damage.
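
The dispersal step can be sketched in a few lines. The seeded shuffle below is our own stand-in for PARC's actual algorithm, but it shows the property that matters: damage to one physical region is spread thinly across many blocks, and the scattering is reversible because the reader knows the same fixed ordering.

    import random

    def disperse(payload: bytes, seed: int = 42) -> list:
        order = list(range(len(payload)))
        random.Random(seed).shuffle(order)          # same seed => reversible
        return [payload[i] for i in order]

    def gather(scattered: list, seed: int = 42) -> bytes:
        order = list(range(len(scattered)))
        random.Random(seed).shuffle(order)
        out = bytearray(len(scattered))
        for dst, src in zip(order, range(len(scattered))):
            out[dst] = scattered[src]
        return bytes(out)

    data = b"error-corrected blocks of a few dozen bytes each"
    assert gather(disperse(data)) == data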

Together, built-in error correction code and data randomization give DataGlyphs a very high level of reliability, even in the face of damage from ink marks, staples, coffee spills, and the other vicissitudes of a paper document's life.

Superior Data Density

The amount of data that can be encoded in a DataGlyph of a given size will vary with the quality of the imprinting and scanning equipment to be used.

DataGlyphs offer a data density nearly twice that of PDF417, one of the most popular forms of 2d barcodes.

For example, with one- and two-dimensional bar codes, the minimum feature size that can be used is 0.0075 inch -- three dots at 400 dpi. At that density, and with minimal height, Code 39 (the most commonly used general-purpose linear bar code) can only achieve a density of about 25 binary bytes per square inch. Code 128 can achieve about 40 bytes per square inch.

Two-dimensional bar codes, such as PDF417, do much better. PDF417 achieves a maximum data density of 2,960 bits (or 370 binary bytes) per square inch, with no error correction, at 400 dpi. But with realistic error correction of 27%, the effective data rate for PDF417 is about 270 bytes per square inch.

At the same resolution and level of error correction, DataGlyphs can carry nearly 500 bytes per square inch.

As with other visual encoding schemes, the density of DataGlyph encoding is determined by four factors:

  • The resolution at which the encoding is created and scanned. High-resolution devices such as office laser printers and document scanners permit denser marking patterns, and thus denser encoding, than low-resolution devices such as dot-matrix printers and fax machines.

  • The amount of error correction used. The process of printing and scanning unavoidably degrades image-encoded data. In high-density encoding, where the print and scan defects are likely to be large compared to the encoding feature size, more of the encoding features will be lost or misread as a result of such degradation. As a countermeasure, some system of redundancy must be used to keep the failure rate within reasonable bounds - that is, for error correction. And redundant coding consumes extra space, reducing the effective data density. Again, how much error correction must be employed will vary from application to application. But there must always be some in any real-world application of any encoding scheme. The data densities that can be achieved using no error correction are theoretical upper bounds, unlikely to be of practical use.

  • The data compression used. Data can be compressed as it's encoded. For example, if all the data is numeric, there's no need to use one byte (8 bits) per digit; 3.32 bits will suffice. When text is encoded, it can be compressed by factors of two or more by means of character encoding or other compression techniques. For example, the full text of the Gettysburg Address, often used to demonstrate high-density encoding, contains 268 words, or about 1,450 characters. But the entire speech can easily be represented in less than 900 bytes.

  • The fixed overhead of the synchronization frame and header. For DataGlyphs, the synchronization frame is a fixed proportion of the data area. DataGlyphs also have a very small fixed header.

The size of the DataGlyph required to encode 80 bytes of information depends on the device(s) to be used for printing and/or scanning. DataGlyphs for faxing are often drawn disproportionately large for added reliability in the face of the "noise" that frequently affects fax images.

DataGlyphs is a registered trademark of Palo Alto Research Center Incorporated.

Source and credits.

Online demonstration.

Technology's FAQ

March 17, 2008

Infosecurity Magazine - March 2008

Programming high-performance applications on the Cell BE processor (Playstation 3)

Programming high-performance applications on the Cell BE processor, Part 1
Programming high-performance applications on the Cell BE processor, Part 2
Programming high-performance applications on the Cell BE processor, Part 3
Programming high-performance applications on the Cell BE processor, Part 4
Programming high-performance applications on the Cell BE processor, Part 5
Programming high-performance applications on the Cell BE processor, Part 6

Hacker Uses Sony PlayStation 3 to Crack Passwords

Nick Breese, a senior security consultant at Auckland-based Security-assessment.com, has come up with a way to drastically increase the processing capability of cracking passwords, using a PS3.

By implementing common ciphers and hash functions using vector computing, Breese has pushed the current upper limit of 10-15 million cycles per second -- in Intel-based architecture -- up to 1.4 billion cycles per second.

Breese, who has been working on the project, called "Crackstation", for the past six months, used the Sony PlayStation 3 gaming console for his breakthrough research. The PS3's Cell Broadband Engine technology was created by IBM, Toshiba, and Sony. The companies collaborated to create the Cell Broadband Engine (CBE), commonly known as the Cell processor, which consists of one scalar processor and eight vector processors.

By design, the PS3 is very suitable for cryptography, says Breese. Intel processors are designed to do all kinds of complex calculations, whereas the PS3 is good at doing simple things very quickly. "And believe it or not, cryptography really is simple," he says. "Lots of simple operations being done one at a time."

The strength of a cryptography implementation is usually judged by its cracking time -- how long it would take for someone to sit down and crack it, says Breese. His discovery demonstrates that the capability for cracking encryption algorithms has multiplied by a factor of 100.

Breese's discovery "will unfortunately make cryptography cracking faster", he says. However, he hopes that his research will help drive the need for stronger cryptography to be used, and push for better implementations of cryptography.

The big implication for the industry is the fact that using Intel processors as a benchmark just is not good enough anymore, he says.

Within the PS3, in Breese's case running Linux, there are six SPU (Synergistic Processing Unit) processor cores. Each core is able to do four calculations at once -- so across all of the cores it is possible to do 24 calculations at the same time, he says. The simple design of the processor architecture also helped increase the speed, he says.

Breese was looking for a way to optimize processing to make MD5 calculations go very quickly, he says. MD5 (Message-Digest algorithm 5) is one of the most used cryptographic hash functions. The PS3 managed to conduct over 1.4 billion MD5 calculations a second, he says.

But the speed increase relates to the use of SIMD (Single Instruction, Multiple Data) computing, rather than solely the PS3, he says. "It's just that the Cell processor cluster within the Playstation 3 is very good at it," he says.

Vector, or SIMD, computing involves performing calculations against a data group, rather than against a single piece of data, which is known as scalar computing, says Breese. Using vector computing allowed him to run cryptography calculations in parallel, he says.
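
To see what the scalar baseline looks like, here is a rough Python measurement using the standard hashlib module; it only illustrates "one hash at a time" and will be far slower than the optimized native code the article's figures refer to. The SIMD approach described above evaluates several candidate inputs per instruction instead.

    import hashlib, time

    N = 1_000_000
    start = time.time()
    for i in range(N):
        hashlib.md5(str(i).encode()).digest()    # one candidate at a time: scalar
    elapsed = time.time() - start
    print(f"{N / elapsed:,.0f} MD5 calculations per second (scalar Python baseline)")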

Breese presented his findings at the Kiwicon hacker conference, held in Wellington earlier this month.

"We seem to have a world's first here, with potentially huge implications around the validity of some encryption algorithms going forward," says Security-assessment.com chief executive Peter Benson. "While we have not currently worked on distributing load across PS3s, the theory is there to increase this level of performance further."

The team rewrote some of the code to run under the vector-based methodology, and as soon as they did that they started getting "some pretty spectacular results", says Benson. It took a while before the company decided to release the numbers "because we just didn't believe them," he says.

Breese also increased the speed in x86 processors by using the same method on the x86-equivalent technology known as SSE (Streaming SIMD Extensions), but the increase was not as significant. He found that the x86 SSE2 implementation could conduct over three times as many MD5 calculations as the scalar equivalent.

Breese says the initial reason for embarking on the research project was to get the company to buy him a PS3.

Source and credit.

Tool uses Firewire to take over a 'locked' Windows machine




No screwdriver required: A researcher has released a plug-and-go physical hacking tool that uses a Firewire cable to “own” a Windows machine within seconds.

Winlockpwn, originally built two years ago, bypasses Windows’s authentication system and lets an attacker take over a “locked” Windows machine without even stealing its password. Adam Boileau, a researcher with Immunity Inc., says he decided it was finally time to make his tool publicly available. (See 'Cold Boot' Attack Tool Surfaces.)

Similar Firewire hacks have been demonstrated on Linux and OS X as well.

With Winlockpwn, the attacker connects a Linux machine to the Firewire port on the victim’s machine. The attacker then gets full read-and-write memory access and the tool deactivates Windows’s password protection that resides in local memory. Then he or she has carte blanche to steal passwords or drop rootkits and keyloggers onto the machine.

“This is just a party-trick demo script thats been lying around my homedir for two years gathering dust,” Boileau blogged this week. “I'm not releasing this because Microsoft didn't respond (they did; it’s not a bug, it's a feature, we all know this). It just seemed topical, with the RAM-freezing thing, and it's a pity to write code and have no one use it.”

Firewire’s abuse should come as no surprise, security experts say. The peripheral bus connection technology lets you read and write to memory, so the weakness is not a true vulnerability, but a feature of the technology.

“That Firewire port is, as designed, literally there to let you plug things into your laptop memory banks,” says Thomas Ptacek, principal with Matasano Security. “When you think of Firewire, you really should just think of a cable coming directly out of your system's DRAM banks. That's basically all Firewire is.”

Ptacek says this tool raises the bar in physical hacking. “People think about physical hacking as something you have to do with a screwdriver and 20 minutes, under cover of darkness. Attacks like Adam's can be done in the time it takes you to pick up a sheet of paper off the office printer,” he says.

Not all machines have Firewire ports, of course, but other researchers have already found ways to get around that, using a PCMCIA Firewire card. (See No Firewire for Hack? No Problem.) And Vista is not immune to such an attack, either: Austrian research firm SEC Consult had previously written a proof of concept for Windows Vista that disables password authentication in the default login routine, so the attacker can log in with an arbitrary password, according to the researchers.

Ptacek says the best defense is to disable Firewire. “I think that enterprises who care about security should make sure they don't issue laptops with enabled Firewire ports,” he says.

Source and credit.

An overview of the NSA's domestic spying program

In Wednesday's Wall Street Journal, Siobhan Gorman pulled together the disparate threads of reporting on what's known of the NSA's secret domestic spy program, and combined them with some of her own reporting to confirm, once again, that the NSA's program is another incarnation of the Pentagon's erstwhile Total Information Awareness program. Gorman also describes how Carnivore, the SWIFT database snooping program, and basically every other "Big Brother" database and data snooping program that the executive branch has developed over the past two administrations* feed information into the NSA's TIA-like system, which then looks for suspicious patterns in the data.

Gorman's article provides a great overview of how these programs fit together in the architecture of the modern, post-9/11 surveillance state, and it's required reading because it comes at a critical time in our national debate about privacy and the limits of executive power. However, if you've been following this topic closely then you know that most of the information in the article has been public since 2006.

In this post, I'm going to walk back through some of the previous reporting on the topic, both my own work and that of others, and offer corrections and adjustments where necessary based on the WSJ piece. My hope is that readers and reporters who are so inclined can dig through the details and links and follow up on any leads that others may have missed.

(*Note: Infamous codenames like "Carnivore" and ECHELON first cropped up in Bill Clinton's second term, and I covered them when Ars launched in mid-1998. In terms of the presidential orders he signed and the programs that were inaugurated on his watch, Clinton laid some of the groundwork for the Bush administration's pre- and post-9/11 surveillance-related lawbreaking. Or, perhaps a more accurate metaphor is that he blazed a trail that the Bush gang then paved over and turned into a six-lane highway.)

A look back at the role of the TIA in the NSA's surveillance activities

Back in December of 2005, when the NSA warrantless wiretapping story broke in the New York Times, I took a close look at what was then known about the program and suggested that the NSA's program probably shared some technological DNA with the short-lived (2002-2003) Total Information Awareness program. In the years since TIA first appeared under the Pentagon's roof, the program has moved from agency to agency in the Executive branch, as Congress catches wind of each new incarnation of it and shuts it down only to see it reemerge again with a different acronym on a different department's budget.

In April of 2006, the MIT Technology Review published a piece by Mark Williams that moved the story forward by fleshing out the relationship between TIA and the NSA's domestic spying program. Williams reported that elements of TIA had indeed been moved from the Department of Defense to the NSA, and he suggested that this technology was almost certainly in use as part of the domestic spying program that the New York Times had uncovered.

One month later, a very important article on the role of "transactional information"—a term that originally referred to the phone company's call logs but has since been stretched to fit a widening array of communication types—appeared in USA Today. The article made clear that this "communication metadata" was the real target of the NSA's vast data collection efforts. Also that May, Wired's Ryan Singel released critical technical documents that had been sealed under court order, showing some of the nuts and bolts of how the NSA snoops Internet traffic on AT&T's backbone.

In terms of my own understanding of the NSA's program, the USA Today article made clear that my initial assessment of the NYT's piece had missed the mark on an important and central point: the new surveillance technology at the heart of the NSA's warrantless wiretapping program was not, as I had conjectured, an automated voice recognition system that sampled calls looking for "hits," and then escalated of-interest calls to higher levels of scrutiny and, ultimately, to a human monitor. This call monitoring is almost surely going on somewhere in the intelligence pipeline, but the core of the NSA's program really is the aggregation and analysis of communications metadata.

Based on the USA Today piece and on a number of other sources, I suggested in "TIA (aka Topsail) unveiled: the real scope of the NSA's domestic spying program" that "the original revelations about the NSA's SIGINT vacuum were just the tip of the iceberg," and that "it appears there's probably more that we've yet to see. Much more."

I then put the pieces together and asked the following rhetorical question: "Now, does anyone seriously think that the NSA is not collecting transactional data (at a minimum) for Web, email, FTP and other IP-based communications, and/or that they're not tying all of this data to individual users?"

Gorman's WSJ piece provides sourced confirmation that the NSA is doing exactly what I and others suspected they were doing, i.e., they're collecting e-mail headers, Web surfing histories, cell phone call logs, and every other trace of the digital and analog connections that we make to the world, and they're synthesizing this into complete informational portraits of individuals.

Network effects

In my "TIA unveiled" piece, I appear to have overstated the scope of the NSA's profiling by suggesting that the agency is building such informational pictures of everyone in the US. But Gorman's article provides an important correction by suggesting that the TIA driftnet works in a much more focused fashion.

According to Gorman, counter-terrorism officials must seed the system with leads—like the name or phone number of an individual with suspected terrorist ties. The system then begins monitoring the aforementioned types of transactional data in order to build an informational profile. The system also works outward through the individual's social network by turning its information vacuum on everyone whom that person contacts, and then on their contacts in turn, in an ever-expanding web of surveillance. This way, the system can build profiles of individuals and groups, and monitor their interactions for suspicious activity.
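
The "seed and expand" pattern Gorman describes is essentially a bounded breadth-first walk over a contact graph. The sketch below is purely illustrative; the graph, the names, and the two-hop limit are invented.

    from collections import deque

    contacts = {                       # an invented contact graph
        "seed": ["a", "b"],
        "a": ["c"],
        "b": ["c", "d"],
        "c": [],
        "d": ["e"],
        "e": [],
    }

    def expand(seed, max_hops):
        seen, queue = {seed}, deque([(seed, 0)])
        while queue:
            person, hops = queue.popleft()
            if hops == max_hops:
                continue
            for contact in contacts.get(person, []):
                if contact not in seen:
                    seen.add(contact)
                    queue.append((contact, hops + 1))
        return seen

    print(expand("seed", 2))   # everyone within two hops of the original lead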

The fact that the driftnet is seeded by first giving it a single target—a target that is ostensibly drawn from some type of human-generated intelligence—makes it less of a lost cause than a massive, nationwide driftnet would be, but only marginally less so, depending on how far out in the suspect's social network the surveillance extends. As I explained in this article on why the NSA's program is a bad idea from a national security perspective, the main problem with these driftnet or "dragnet" systems is that the rate of false positives is typically so high that they produce an overwhelming flood of bogus leads that tie up law enforcement resources.

Even if the more targeted driftnet approach described by Gorman does result in fewer false positives, the constitutional, privacy, and oversight questions still remain. Let's hope that we, the people, eventually get a shot at answering those questions for ourselves.

Source and credit.

From BFS to ZFS: past, present, and future of file systems

Why we care about file systems

Linus Torvalds

Computer platform advocacy can bubble up in the strangest places. In a recent interview at a conference in Australia, Linux creator Linus Torvalds got the Macintosh community in an uproar when he described Mac OS X's file system as "complete and utter crap, which is scary."

What did he mean? What is a "file system" anyway, and why would we care why one is better than another? At first glance, it might seem that file systems are boring technical widgetry that would never impact our lives directly, but in fact, the humble file system has a huge influence on how we use and interact with computers.

This article will start off by defining what a file system is and what it does. Then we'll take a look back at the history of how various file systems evolved and why new ones were introduced. Finally, we'll take a brief glance into our temporal vortex and see how file systems might change in the future. We'll start by looking at the file systems of the past, then move on to the file systems used by individual operating systems, before considering what the future may hold.

What is a file system?

Briefly put, a file system is a clearly-defined method that the computer's operating system uses to store, catalog, and retrieve files. Files are central to everything we use a computer for: all applications, images, movies, and documents are files, and they all need to be stored somewhere. For most computers, this place is the hard disk drive, but files can exist on all sorts of media: flash drives, CD and DVD discs, or even tape backup systems.

File systems need to keep track of not only the bits that make up the file itself and where they are logically placed on the hard drive, but also store information about the file. The most important thing it has to store is the file's name. Without the name it will be nearly impossible for the humans to find the file again. Also, the file system has to know how to organize files in a hierarchy, again for the benefit of those pesky humans. This hierarchy is usually called a directory. The last thing the file system has to worry about is metadata.

Metadata

Metadata literally means "data about data" and that's exactly what it is. While metadata may sound relatively recent and modern, all file systems right from the very beginning had to store at least some metadata along with the file and file name. One important bit of metadata is the file's modification date—not always necessary for the computer, but again important for those humans to know so that they can be sure they are working on the latest version of a file. A bit of metadata that is unimportant to people—but crucial to the computer—is the exact physical location (or locations) of the file on the storage device.

Other examples of metadata include attributes, such as hidden or read-only, that the operating system uses to decide how to display the file and who gets to modify it. Multiuser operating systems store file permissions as metadata. Modern file systems go absolutely nuts with metadata, adding all sorts of crazy attributes that can be tailored for individual types of files: artist and album names for music files, or tags for photos that make them easier to sort later.

Advanced file system features

As operating systems have matured, more and more features have been added to their file systems. More metadata options are one such improvement, but there have been others, such as the ability to index files for faster searches, new storage designs that reduce file fragmentation, and more robust error-correction abilities. One of the biggest advances in file systems has been the addition of journaling, which keeps a log of changes that the computer is about to make to each file. This means that if the computer crashes or the power goes out halfway through the file operation, it will be able to check the log and either finish or abandon the operation quickly without corrupting the file. This makes restarting the computer much faster, as the operating system doesn't have to scan the entire file system to find out if anything is out of sync.
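
A minimal sketch of the journaling idea, assuming a toy log format of our own devising (no real file system journals whole file contents this way): log the intent, apply the change, record the commit, and on restart replay only what never committed.

    import json, os

    JOURNAL = "journal.log"

    def journaled_write(path, data):
        entry = {"op": "write", "path": path, "data": data}
        with open(JOURNAL, "a") as j:
            j.write(json.dumps(entry) + "\n")                           # 1. log the intent
        with open(path, "w") as f:
            f.write(data)                                               # 2. apply the change
        with open(JOURNAL, "a") as j:
            j.write(json.dumps({"op": "commit", "path": path}) + "\n")  # 3. mark it done

    def replay_journal():
        # After a crash, scan only the journal: redo any write with no commit record.
        if not os.path.exists(JOURNAL):
            return
        pending = {}
        with open(JOURNAL) as j:
            for line in j:
                entry = json.loads(line)
                if entry["op"] == "write":
                    pending[entry["path"]] = entry["data"]
                else:                                                   # "commit"
                    pending.pop(entry["path"], None)
        for path, data in pending.items():
            with open(path, "w") as f:
                f.write(data)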

Jurassic file systems

DEC PDP-11 minicomputer with DECTape

In the early days of computers, when dinosaurs roamed freely and IBM stood above the earth like a Colossus, file systems were born. Back then, they rarely had names, because they were simply considered part of the operating system that ran the computer, and in those days operating systems themselves were rather new and fancy. One of the first file systems to have a name was DECTape, named after the company that made it (Digital Equipment Corporation) and the physical system the files were stored on (those giant, whirring reel-to-reel tape recorders that you often saw in movies made in the 1960s). The tapes acted like very slow disk drives, and you could even run the whole operating system on them if you were desperate enough.

DECTape stored an astoundingly small 184 kilobytes (kilo, not mega) of data per tape on the PDP-8, DEC's popular early minicomputer. It was called a minicomputer only because, while the size of a refrigerator, it was still smaller than IBM's behemoth mainframes that took up entire rooms. Of course, the invention of the transistor and integrated circuit (or silicon chip) allowed another whole round of miniaturization. DEC slowly became extinct while the rest of the world moved to microcomputers, or as most of us call them, computers. (IBM, miraculously, survived, but nobody is entirely sure how.)

CP/M

Gary Kildall working at his computer

Gary Kildall invented CP/M in 1973 because he was lazy. He had a job programming some big dinosaur computers, but he didn't want to have to drive into work every day. So he wrote a program called "Control Program for Microcomputers" that would allow him to store files and run programs from an 8-inch Shugart floppy drive on his little computer at home.

CP/M had a file system, but it didn't have a name. If it was called anything it was the "CP/M file system." It was very simple, mostly because Gary's needs were simple. It stored files in a completely flat hierarchy with no directories. File names were limited to eight characters plus a three-character "extension" that determined the file's type. This was perfectly sensible because it was exactly the same limitation as the big computer Kildall was working with.

Gary Kildall and the company he founded to sell CP/M, Intergalactic Digital Research, soon became very wealthy. It turned out that a lot of microcomputer companies needed an operating system, and Gary had designed it in a way that separated all the computer-specific bits (called the BIOS) from the rest of the OS. Kildall also did this out of laziness because he didn't want to keep rewriting the whole of CP/M for every new computer. The moral of this story is that being lazy can sometimes be a spectacularly smart thing.

Unfortunately for Kildall, other people soon got the same idea he had. A programmer named Tim Patterson wrote his own OS called "QDOS" (for Quick and Dirty Operating System) that was a quick and dirty clone of everything CP/M did, because he needed to run an OS on a fancy new 16-bit computer, and Gary hadn't bothered to write a 16-bit version of CP/M yet. (Here's where being lazy can work against you!) QDOS had a slightly different file system than CP/M, although it did basically the same thing and didn't have directories either. Patterson's file system was based on a 1977 Microsoft program called Microsoft Disk Basic, which was basically a version of Basic that could write its files to floppy disks. It used an organization method called the File Allocation Table, so the file system itself was given the incredibly imaginative name of FAT.

Bill Gates then applied the Theory of Laziness in an even more spectacular way, bought Tim Patterson's QDOS lock, stock and barrel for $50,000, and renamed it MS-DOS. He now was able to sell it to IBM and every company making an IBM clone, and poor Gary found himself quickly escorted from the personal computing stage. From now on the world would have to deal with FAT in all its gory. Did I say gory? I meant glory.

FAT times at Ridgemont High

Given that it was originally a quick and dirty clone of a file system designed for 8-bit microcomputers in the 1970s that was itself a quick-and-dirty hack that mimicked the minicomputers of a decade earlier, FAT was not really up for very much. It retained CP/M's "8 and 3" file name limit, and the way it stored files was designed around the physical structure of the floppy disk drive, the primary storage device of the day.

The File Allocation Table described which areas of the disk were allocated to files, which were free space, and which were damaged and unusable (called "bad sectors"). Because each floppy disk had very little space (the first single-sided disks in the IBM PC could store only 160 kilobytes) the table itself needed to be very small. To keep it small, the disk was divided into clusters—groups of sectors stored next to each other on the disk. The first version of FAT was called FAT-12, because it used a 12-bit number to count the clusters (2 to the power of 12 is 4096, and each cluster was 8KB, so the maximum volume size was 32MB).

The massive storage of the 5.25 inch floppy
The original IBM PC 5.25-inch floppy disk, with superimposed tracks/sectors

FAT had no clue about directories or optimizing file storage, and just threw all the bits of files on the first part of the disk where it found any space. For a floppy disk, this didn't matter much, because you typically only stored a few files on it anyway. However, IBM was getting ready to release their PC-XT with an optional 20MB hard disk. This (for the time) enormous space meant that FAT would need a way to store files in a proper hierarchy. For MS-DOS 2.0, Microsoft added nested directories, labeled with the backslash (\) as the separator, so a file might be stored in, say, C:\MYFILES\NOTES. Why a backslash and not a forward slash, as God (and Unix) intended? Well, the forward slash was already used by some MS-DOS 1.0 programs as a modifier argument (say, FORMAT A: /S to add system files after formatting) and it was just too much work to change it now. Why C:? Well, A: and B: were reserved for the first and second floppy disk drives that everyone had to have before hard disks became standard.

Obsolescence

The introduction of hard disks soon made FAT-12 obsolete, so Microsoft came up with a 16-bit version in DOS 3.31, released in 1987. It had 32KB clusters and could access 65,536 (2 to the power of 16) of them, for an astounding 2GB maximum disk size.

And what about the problem of storage optimization? With a hard drive, a user was always copying and deleting files, and this left holes that new files were crammed into (since FAT just looked for the first space where it could store stuff), which led to real problems with fragmentation. This made hard drives work harder than ever, jumping around willy-nilly trying to find all the bits of a file that were strewn about the drive. Did Microsoft fix this problem with FAT? Not at all. They let other companies make programs like Norton Utilities and PC Tools that would defragment the whole disk in one go, giving users happy evenings to remember forever, sitting and watching the screen move little rectangles around.
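
A rough sketch of that "first space it could find" behavior, again a toy model in Python rather than anything resembling real FAT code, shows how a new file ends up split across whatever holes earlier deletions left behind:

    # A 12-cluster disk: '.' is free space, letters belong to existing files.
    disk = list("AA..BB..CC..")

    def first_fit_write(disk, label, n_clusters):
        """Drop each cluster of a new file into the first free slot found."""
        placed = []
        for i, slot in enumerate(disk):
            if slot == ".":
                disk[i] = label
                placed.append(i)
                if len(placed) == n_clusters:
                    break
        return placed

    print(first_fit_write(disk, "N", 4))   # [2, 3, 6, 7]: split across two gaps
    print("".join(disk))                   # AANNBBNNCC..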

By 1995, hard disks were getting larger than the 2GB limit, and the 8.3 file name limit seemed even more archaic than it had 20 years earlier. To solve both these problems, Microsoft introduced FAT-32, with an 8TB (terabyte) limit and a special bit of magic called VFAT that let FAT have long file names without really having them.

VFAT took a long file name (up to 255 characters) and created a short version that fit into the 8.3 straitjacket. For example, "Super-long file name.txt" would be stored as "SUPER-~1.TXT". The rest of the long name was thrown into a very strange kind of nonexistence, like living on an astral plane. It was sliced up into 13-character chunks and stored in phantom directory entries marked with the metadata attributes of Volume Label, System, Hidden, and Read-Only, a combination that confused older versions of MS-DOS so much that they flat-out ignored them. However, if a long file name was deleted from DOS, it considered the associated phantom Volume Label entries to be disposable and deleted them as well. Even stranger tricks, such as storing a checksum of the short name in each chunk, were used to check whether the long file name still matched the 8.3 one.
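
Here is a rough Python sketch of both tricks. The real rules are fussier (they handle name collisions beyond ~1, illegal characters, and the checksum linking each chunk back to its short name), so treat this as an outline of the idea with made-up helper names rather than the actual algorithm:

    def short_name(long_name):
        """Squeeze a long name into an 8.3-style form like SUPER-~1.TXT."""
        base, _, ext = long_name.rpartition(".")
        keep = "".join(c for c in base.upper() if c.isalnum() or c == "-")
        return f"{keep[:6]}~1.{ext.upper()[:3]}"

    def lfn_chunks(long_name, size=13):
        """Slice the long name into 13-character pieces, one per phantom entry."""
        return [long_name[i:i + size] for i in range(0, len(long_name), size)]

    print(short_name("Super-long file name.txt"))   # SUPER-~1.TXT
    print(lfn_chunks("Super-long file name.txt"))   # ['Super-long fi', 'le name.txt']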

Some of these tricks still haunt us today, thanks to shaky third-party implementations of VFAT. For example, while writing this article I had the fun experience of watching a directory on my USB flash drive spontaneously change from "Williams files" to "WILLIA~1" when transferring a file to my iBook.

So did Microsoft address the horrible fragmentation problems of FAT in this version? The answer is: not so much. But it did at least include a home-grown defragmenting program with Windows 95. Thanks to the power of multitasking, you could even do other things while it was defragging! Except you couldn't, because it would complain that "files had changed" and keep starting over.

This strange VFAT solution was a ghastly hack, but it did allow seamless upgrades of DOS and Windows 3.1 systems to Windows 95 and long file names, while still letting poor DOS users access these files. It also caused Macintosh users to laugh and make "MICROS~1" jokes that nobody ever understood. They probably shouldn't have been laughing, however, because the Macintosh file system had even stranger limitations.

Hello, HFS!

Most people remember the 1984 Macintosh with a kind of romantic haze, forgetting the huge limitations of the original unit. The Mac came with a single floppy drive back when PC users were starting to get used to hard disks. The original file system was called the Macintosh File System, or MFS, and had a limit of 20MB and 4,096 files. It both had directories and didn't have them. The user could create graphical "folders" and drag files into them, but they would not show up in the Open/Save dialog boxes of applications. Instead, all the file and directory information was stored in a single "Empty Folder" that would disappear if modified in any way, only to be replaced by a new "Empty Folder." This worked well for floppy disks, but really slowed down performance with hard drives. File names could be 63 characters long.

The 1984 Macintosh
The original 128K Macintosh, shown with optional second floppy drive

MFS was replaced in 1985 by a system with proper hierarchical directories. Because of this, it was called the Hierarchical File System, or HFS. For some reason, file names were now limited to 31 characters, which was just short enough to be annoying. The file, directory, and free space information was stored in a B-Tree, a balanced tree structure that allows for fast sorting and retrieval of information. HFS counted allocation blocks with a 16-bit number, so with 512KB blocks the maximum size of a drive worked out to 32GB. Later versions of the Mac OS relaxed these limits further, eventually allowing HFS volumes of up to 2TB.

MFS and HFS introduced an innovative way of handling files, called "forks." Instead of storing metadata in a separate place (such as the place directories are stored), HFS split each file into two parts: the file's contents (the "data fork") and an invisible "resource fork" that contained structured data, including information about the file, such as its icon. Resource forks were used for far more than metadata, though—for example, they held details of an application's interface and its executable code on pre-PowerPC Macs. Like prongs on a fork, the data and resource traveled around together all the time, until the file was sent to another type of computer that didn't know about forks. Fortunately, back then computers were very snobby and never talked to each other, so this was rarely a problem.
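
On a modern Mac you can still poke at the two forks yourself, because HFS+ exposes the resource fork through a special path. The sketch below assumes macOS and a file that actually carries a resource fork (the file name is hypothetical); on anything else the special path simply won't exist:

    import os

    path = "OldMacDocument"     # hypothetical file with a resource fork
    data_size = os.path.getsize(path)
    rsrc_size = os.path.getsize(path + "/..namedfork/rsrc")

    print(f"data fork: {data_size} bytes, resource fork: {rsrc_size} bytes")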

Instead of using a puny three-letter file extension to determine the file type, HFS used a massively huge four-letter "type code" plus a separate four-letter "creator code," both stored in the file system's metadata and treated as peers to information such as the file's creation date.

HFS didn't mess around with slashes or backslashes to separate directory names. Instead, it used a colon (:) and then made sure that humans would never get to see this character anywhere in the system, until they tried to include one in a file name.

All kidding aside, HFS was the first instance in history where a file system was designed specifically to adapt to the needs of the then-new graphical user interface. The whole philosophy of the Macintosh's GUI design was to hide unimportant details from the user. This "human-centric" design was intended to help people focus more on their work than the technical details of the file system.

Of course, nothing is perfect, and all systems that try to abstract away the nasty technical bits occasionally run afoul of what Joel Spolsky calls the Law of Leaky Abstractions. When something broke, such as the loss of the resource fork when a file was sent to another computer and back to the Macintosh, it was not always clear what to do to fix the problem.

HFS had some other technical limitations that could occasionally leak out. All the records for files and directories were stored in a single location called the Catalog File, and only one program could access this file at a time. There were some benefits to this approach, like very fast file searches. The first Macintoshes did not have multitasking, so this was not a problem, but when multitasking was added later, it caused problems with certain programs "hogging the system." The file could also become corrupt, which could potentially render the entire file system unusable. Other file systems stored file and directory information in separate places, so even if one directory became corrupt, the rest of the system was still accessible.

As was the case for other file systems covered so far, the number of clusters (known as "blocks") was fixed to a 16-bit number, so there could only be 65,535 blocks on a single partition, no matter what size it was. Because HFS (like most file systems) could only allocate space to files in whole blocks, the larger block size meant a lot of wasted space: even a tiny 1KB file would take up a full 16KB on a 1GB drive. For an 8GB drive the problem got eight times worse, and so on.
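
The arithmetic behind that waste is simple enough to check. This little sketch divides a volume into 65,535 blocks and sees how much of a block a 1KB file leaves unused (real HFS rounds block sizes to multiples of 512 bytes, so the numbers here are approximations):

    def hfs_waste(volume_bytes, file_bytes=1024, max_blocks=65_535):
        block = -(-volume_bytes // max_blocks)      # ceiling division
        blocks_needed = -(-file_bytes // block)
        return block, blocks_needed * block - file_bytes

    for gb in (1, 8):
        block, wasted = hfs_waste(gb * 2**30)
        print(f"{gb}GB volume: ~{block // 1024}KB blocks, "
              f"a 1KB file wastes ~{wasted // 1024}KB")
    # 1GB volume: ~16KB blocks, a 1KB file wastes ~15KB
    # 8GB volume: ~128KB blocks, a 1KB file wastes ~127KB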

Apple fixed this last problem in 1998 with HFS+, which came bundled with the release of Mac OS 8.1. HFS+ used a 32-bit number for counting blocks and also allowed 255-character file names, although versions of the "classic" Mac OS (9.2.2 and earlier) still only supported 31-character file names. Oddly, Microsoft Office for the Mac would not support names longer than 31 characters until 2004.

Amiga file systems

The Amiga, released a year after the Macintosh, had advanced multimedia capabilities that seemed to come from a decade in the future. However, due to intense time pressure to get the original AmigaOS out the door, the file system was one of its weakest parts. It came from TRIPOS, an operating system developed by MetaComCo. It used 512-byte blocks, but reserved 24 bytes of each block for metadata. The file system was dubbed OFS, for "Old File System," by the Amiga engineers, who replaced it as quickly as they could.

The Amiga 1000
The Amiga 1000

FFS, or Fast File System, was released in 1987 along with the Amiga 500, 2000, and AmigaOS 1.3. The major difference was that the metadata was moved from each individual block into a separate space. Like HFS, it was limited to 32-character file names.

To support both OFS and FFS, the Amiga operating system was redesigned so that it could accept multiple file systems in a plug-in format, and this format was documented so that anyone could write his or her own file system if desired. Many people did that, and some of the results, such as the Professional File System (PFS) and Smart File System (SFS), are still used by Amiga fans to this day. The Amiga OS4 operating system for PowerPC Amigas also supports FFS2, a minor rewrite that added the ability to support file names 255 characters long.

Unix and Linux file systems

UFS (FFS)

Unix started out its life as a pun on MULTICS, a very serious multiuser time-sharing system that didn't like being made fun of, but then went on to leave its serious rival in the dustbin of history. Unix almost completely dominated the market for scientific workstations and servers before being neatly replaced by a work-alike clone called Linux, which started out as a pun on Unix. The moral of the story? Puns can be powerful things.

Along the way, Unix set all sorts of standards for how users would store their files. The Unix File System (UFS), also known as the Berkeley Fast File System, became the standard when researchers at the University of California at Berkeley developed a much improved version of the original Unix file system.

Unix grew out of the hacker culture rather than the commercial sector, and its file system reflects those roots. For one, the system is case-sensitive, meaning that README.TXT is a completely different file from readme.txt, which is also different from Readme.Txt. Most other file systems preserve the case of a file name, but don't care how the user accesses it. This rather arcane and computer-centered view of file names has survived to this day because changing it now would break software that relies on it. Occasionally, these two worlds can collide violently: moving a web site from a Windows server to a Linux one can sometimes result in broken links when the server asks for some_file.htm and the last person who edited it saved it as Some_File.htm instead.

Aside from being picky about case, the file system exposes its hacker roots in other ways. The pointers that UFS uses to locate files on a disk are called inodes, and unlike other operating systems, Unix is quite happy to show this inode data to the end user. UFS did not start out with journaling, so a crash or power outage could leave corrupted inodes and lost files. The solution was to run a utility called fsck (for File System Check) and watch as the OS told you all its dirty inode secrets.
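
You don't need fsck to see inodes, though; any Unix-like system will show them to you on request. Here's a quick sketch using Python's standard library (the same numbers show up in ls -i):

    import os

    for name in sorted(os.listdir(".")):
        inode = os.stat(name).st_ino      # the inode number behind the name
        print(f"{inode:>10}  {name}")

A directory entry, in other words, is really just a mapping from a human-readable name to an inode number.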

As in other file systems, data is stored in discrete components called blocks. The block size kept increasing as disks became larger, eventually settling at 8KB. Because blocks that large would waste a lot of disk space on small files, BSD implemented a clever scheme called block suballocation, in which the partially-filled tail ends of several files could share a single block.
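
A back-of-the-envelope sketch of what suballocation buys (the file sizes are made up, and the 1KB fragment size matches a common classic BSD configuration rather than anything mandatory):

    BLOCK = 8 * 1024
    FRAGMENT = BLOCK // 8          # 1KB fragments

    file_sizes = [300, 1_200, 5_000, 9_500, 70_000]   # bytes, made up

    def tail_waste(size, unit):
        """Space left unused in the last allocation unit of a file."""
        return -(-size // unit) * unit - size

    whole_blocks = sum(tail_waste(s, BLOCK) for s in file_sizes)
    with_fragments = sum(tail_waste(s, FRAGMENT) for s in file_sizes)
    print(f"wasted with whole 8KB blocks: {whole_blocks} bytes")
    print(f"wasted with 1KB fragments:    {with_fragments} bytes")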

UFS also tried to minimize excess movement of the hard drive heads by storing files and their metadata close to each other in regions called cylinder groups, and it attempted to keep all the files from a single directory in adjacent cylinder groups.

Many Unix vendors implemented their own versions of UFS, often making them incompatible with other Unices. Sun added journaling to its version in Solaris 7, and NeXT created its own version of UFS for NeXTstep.

ext2

ext2 was "inspired" by the UFS design and is essentially a work-alike clone of UFS in the same way as Linux is a work-alike clone of Unix. This allowed the easy porting of many Unix utilities such as fsck, which sounds a bit like profanity if you say it fast enough. Like UFS, ext2 lacked journaling, but it also eschewed some other safety checks that UFS had in order to run faster. Because of this, saying "fsck" over and over again was sometimes necessary.

Linux, which started as a hobby project in 1991 to replace Andrew Tanenbaum's teaching OS Minix, originally had a clone of the Minix file system, but it was limited to 64MB and was quickly replaced by ext. ext2 was developed in 1993 to address some of the limitations of the original ext and survived for many years afterwards. ext2 has the same grouping scheme as UFS, but calls its cylinder groups "block groups" instead.

ReiserFS

Because ext2 lacked journaling and Linux was built on the spirit of open source and contributions from anyone, there were plenty of people who wanted to Build A Better Linux File System. One person who actually succeeded was Hans Reiser, who modestly titled his work ReiserFS.

ReiserFS added not only journaling, but attempted to improve on many other aspects of ext2. B-Tree indexing and advanced block suballocation routines sped up the file system significantly, particularly when dealing with small files on the order of 1KB.

ReiserFS garnered much praise and even major industry support from Linux distributions such as SuSE, until the wheels started to come off for reasons that were primarily nontechnical.

First, Hans Reiser decided that he was no longer going to support or update ReiserFS, preferring to work on its successor, dubbed Reiser4. The new version performed well, but it was not a clean update from ReiserFS and required users to reformat. There were some questions about the reliability and stability of Reiser4, but these could have been dealt with in time.

What really threw the community for a loop was something nobody could have foreseen. In 2006, Hans Reiser's wife, Nina, was reported missing. Blood matching her DNA profile was found on a sleeping bag in Hans' car. Hans pleaded not guilty, and his criminal trial is currently under way.

ext3

The popularity of ReiserFS had begun to wane, and in the confusion over ReiserFS's future, many of those still using it switched to the safer ext3, which was essentially ext2 with journaling support added on. While not likely to win any speed derbies, ext3 retains its predecessors' legacy of time-tested reliability.

JFS

JFS was IBM's entry into the Unix file system game, and it brought journaling with it from the start. It is an open-source reimplementation of the older proprietary system used by AIX, IBM's peculiar version of Unix. JFS used B-Trees to speed up file access and also introduced the concept of extents: contiguous runs of blocks that can be described by a single record instead of being tracked block by block. It also allocates inodes dynamically, which keeps the file system from having to dedicate fixed amounts of space for them the way ext2 and ext3 do.

XFS

XFS came from Silicon Graphics' version of Unix, dubbed Irix. First introduced in 1994 with Irix 5.3, XFS has been optimized for speed and reliability, winning many speed comparison tests. It is a 64-bit file system with a maximum volume size of 8 exabytes. It uses extents and has many advanced features such as being optimized for multithreading—multiple threads can operate on the same file system simultaneously.
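
The extent idea itself is easy to sketch, even though the sketch below is a toy and has nothing to do with JFS's or XFS's actual on-disk structures. Describing a big, mostly contiguous file as a handful of (start, length) runs is far more compact than listing every block, and finding any given block is still quick:

    # A 1,000-block file stored in three contiguous runs (made-up numbers):
    extents = [(4_096, 600), (10_000, 300), (52_000, 100)]   # (first block, count)

    # The same file described block by block, as per-block chaining would require:
    block_list = [start + i for start, count in extents for i in range(count)]
    print(len(extents), "extent records vs", len(block_list), "block pointers")

    def block_of(extents, n):
        """Find the disk block holding logical block n of the file."""
        for start, count in extents:
            if n < count:
                return start + n
            n -= count
        raise IndexError("past end of file")

    print(block_of(extents, 750))   # lands in the second run: 10150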

In 2000, SGI released the XFS source code under the GNU General Public License, and in the years since, many Linux distributions have added support for this file system.

IBM and Microsoft duke it out

OS/2 and HPFS

Most kids won't remember it today, but IBM once briefly toyed with the idea of competing directly with Microsoft for the prize of personal computer operating system dominance. Even more unusual was the fact that this competition was originally a partnership.

OS/2 Warp
IBM's OS/2 Warp (Version 3.0)

Even IBM, with its 10,000 layers of management and more bureaucracy than the Soviet Union, realized that DOS was badly in need of a replacement. IBM decided that it was going to design a successor—brilliantly named OS/2—which it would then fully own, but which Microsoft would do all the work of actually writing. Steve Ballmer, back before he was known for jumping up and down and throwing chairs, once described how IBM was viewed by the computing industry back then. "They were the bear, and you could either ride the bear, or you could be under the bear!" So Microsoft went along with this crazy plan.

OS/2 was to be a multitasking operating system, with a fancy GUI that was to be bolted on later. It took forever to arrive, had difficulty running DOS applications, and required more RAM than most computer users could afford in their lifetimes, so it went over about as well as New Coke. For version 1.2, released in 1989, IBM wanted a new file system to replace the awful FAT. Thus was born HPFS, the High Performance File System, written by a small team led by Microsoft employee Gordon Letwin.

HPFS used B-Trees, supported 255-character file names, and used extents. The root directory was stored in the middle of the disk rather than at the beginning, for faster average access times. It did not support journaling, but it did support forks and had extensive metadata abilities. This new metadata was called Extended Attributes, and it could be stored even on FAT partitions by being tucked into a file called EA_DATA.SF. Extended attributes were also supported in HFS+, but were not exposed in an Apple OS until Mac OS X 10.4.

Microsoft and IBM then went through a rather messy divorce right around the time Windows 3 was ready to be released. (IBM wanted to own that, too, and Microsoft really, really didn't want that.) Microsoft refocused its efforts on Windows after version 3 became a smash success, while IBM kept the code it had and added a bunch of extra user interface code it had lying around from various dalliances with Apple and NeXT. Thus was born OS/2 2.0 and the object-oriented Workplace Shell, which had a brief day in the sun (and was even advertised, bizarrely, at the Fiesta Bowl) before Windows 95 arrived and crushed it into the ground. IBM later ported JFS to OS/2, much to the delight of the three people who still used it.

NTFS

Windows NT 3.1 install CD.
Windows NT 3.1 install CD. Note supported platforms!

Microsoft also knew that DOS needed a replacement, but it had been soured by its experience with IBM. In Bill Gates' second spectacular application of the Theory of Laziness, he hired Dave Cutler, the architect of DEC's rock-solid VMS operating system, just as DEC was going into a downward spiral from which it would never recover.

Dave Cutler took his team with him and, despite lawsuits from the dying DEC, produced a clean-room implementation of a brand-new operating system. Along with the brand-new OS came a brand-new file system, which initially didn't have a name but was later dubbed NTFS when the OS itself was named Windows NT. NT was a marketing name that stood for New Technology, but it was still an amusing coincidence that WNT was VMS with each letter replaced by the next one.

NTFS was an all-out, balls-to-the-wall implementation of all the best ideas in file systems that Cutler's team could think of. It was a 64-bit file system with a maximum file and volume size of 2 to the power of 64 bytes (16 exabytes) that stored all file names in Unicode so that any language could be supported. Even the file date attributes were stretched to ridiculous limits: Renaissance time-travelers can happily set their file dates as early as 1601 AD, and dates as late as 60056 AD are supported as well, although if humanity is still using NTFS by that time, it will indicate something is seriously wrong with our civilization. It was first unveiled to the public with the very first release of Windows NT (called version 3.1 for perverse marketing reasons), which came out in 1993.
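
If you're wondering where 1601 and 60056 come from: NTFS timestamps are 64-bit counts of 100-nanosecond ticks starting at January 1, 1601, so the whole range is just 2 to the power of 64 ticks laid end to end. Rough arithmetic:

    TICK_SECONDS = 100e-9                     # one tick = 100 nanoseconds
    SECONDS_PER_YEAR = 365.25 * 24 * 3600

    span_years = (2**64 * TICK_SECONDS) / SECONDS_PER_YEAR
    print(f"representable span: about {span_years:,.0f} years")   # roughly 58,000 years
    print(f"latest date: around the year {1601 + span_years:,.0f}")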

NTFS used B+Trees (an enhanced and faster version of B-Trees also supported in HFS+), supported journaling from day one, had built-in transparent compression abilities, and had extremely fine-grained security settings by using Access Control Lists (ACLs were added with NT 3.5, released in 1994). It was designed so that Microsoft could add extra metadata support until the cows came home. Indeed, NTFS's support for metadata was so extensive that sometimes it took Microsoft's operating system team a while to catch up with features that were already there.

For example, super-fast indexed searching of both files and metadata had been available in NTFS since the release of Windows 2000, but it took until 2005 for Microsoft to release a graphical interface that supported it, and it didn't become a part of the operating system itself until Windows Vista in 2006. Hey, sometimes things get forgotten. Anyone up for redesigning the Add New Font dialog box?

Additional features were added to NTFS in later versions of Windows NT. Version 3.0, released with Windows 2000, added the aforementioned indexed metadata searching, along with encryption and disk quotas so that students could no longer fill up the file server with pirated MP3s.

NTFS stores all of its metadata information in separate files, hidden from the regular operations of the OS by having filenames that start with the $ character. By storing everything as a file, NTFS allows file system structures to grow dynamically. The file system also resists fragmentation by attempting to store files where there is enough contiguous space, not just in the first available space on the drive.

To ease the transition from FAT-based operating systems to NTFS ones, Microsoft provided a handy tool for users of Windows 98 and earlier that would safely convert their FAT16 and FAT32 partitions to NTFS. It wouldn't go the other way around, but honestly, would you want to?

The only thing anyone could really find to complain about with NTFS was that its design was a proprietary secret owned by Microsoft. Despite this obstacle, open-source coders were able to reverse-engineer support for reading from—and, much later, writing to—NTFS partitions from other operating systems. The NTFS-3G project allows any operating system to read and write NTFS partitions.

Dead ends and thoroughfares

BeOS and BFS

In 1990, Jean-Louis Gassée, the ebullient former head of Apple France, had an epiphany. He decided that the problem with the desktop operating systems of the day, like the Mac OS and Windows, was that they were weighed down with too much baggage from supporting legacy versions of themselves. His solution was to start from scratch, creating a brand-new hardware and operating system platform using all the best ideas that were available. He was inspired by the Amiga team—which had done the same thing back in 1982—and even got "AMIGA96" vanity license plates for his car.

The hardware team ran into troubles almost immediately when their choice of processor—the AT&T Hobbit—was discontinued by its parent company. Switching to the PowerPC, the team delivered a geek's dream: a fast, affordable, multiprocessor computer with tons of ports and slick Cylon-esque blinking lights. Unfortunately, the BeBox, which shipped in late 1995, sold fewer than 500 units in total. As Steve Jobs discovered with NeXT, the reality of the desktop market was that there was no room for a new hardware platform any more. Be, Inc. quickly ported its operating system to the Power Macintosh platform, and then to the much larger realm of x86-compatible PCs.

BeBox and BeOS
The PPC BeBox, running BeOS

The BeOS needed a file system, and its initial goals were grand indeed. The original hierarchical file system on the BeBox (dubbed OFS) linked directly to a relational database, allowing for all kinds of flexibility and power. Unfortunately, the database-driven design was too slow and there were problems keeping the database and file system in sync. With the move to PowerPC Macs the BeOS now had to support HFS as well, so the file system infrastructure needed to change. Thus was born the Be File System, or BFS.

BFS, written by Dominic Giampaolo and Cyril Meurillon, was a 64-bit file system that supported journaling and used B+Trees, just like NTFS. It also needed to support the database features that the original BeBox's OS had used, which allowed users to store any file information they wanted in a series of records and fields. This information was put into metadata, and users were allowed to add as many fields as they liked to each file.

This extensible metadata idea seemed novel at the time, but it's important to recognize that NTFS already supported basically the same thing. The major difference was in the user interface: the BeOS directory windows supported adding, editing, sorting, and indexed searching by any metadata field. This was extremely handy for organizing one's MP3 collection.
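
If you want to play with the same idea today, the closest everyday analogue is probably extended attributes. The sketch below uses Linux xattrs through Python's standard library; it is only an analogy for BFS's attributes, not BeOS's actual API, it assumes a file system mounted with xattr support, and the file name and attribute values are made up. The crucial difference is that BFS also indexed these fields so you could run live queries over them, which plain xattrs don't do.

    import os

    path = "song.mp3"                                    # hypothetical file
    os.setxattr(path, "user.artist", b"The Be Sharps")   # attach custom fields
    os.setxattr(path, "user.rating", b"5")

    for attr in os.listxattr(path):
        print(attr, "=", os.getxattr(path, attr).decode())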

Back in 1996, few people were using Windows NT and NTFS, so BFS's 64-bit support, journaling, and extensible metadata all added to the impression of BeOS as being an advanced operating system. Unfortunately for Be, Inc. the company could not find a sustainable business model for selling the OS, and after a disastrous "focus shift" to dedicated Internet Appliances like the crazily-named Sony eVilla, the company ran out of money and was sold to Palm in 2001.

Mac OS X and HFS+

One of the reasons that Be, Inc. went out of business is that Jean-Louis Gassée was banking on his old company, Apple, bailing him out by buying the BeOS to use for their next operating system. Apple's traditional Mac OS had not aged very well. Its lack of memory protection, automatic memory allocation, and preemptive multitasking was starting to hurt the company very badly, and when Apple's internal replacement, Copland, fell apart in a pile of sticky entrails, the company went shopping for a replacement. Solaris, BeOS, and even the hated Windows NT were all in contention for the prize, and BeOS was the leading candidate. The ever-colorful Gassée said he "had Apple by the balls and was going to squeeze until it hurt."

He squeezed a little too hard. Someone inside Apple made a phone call to Steve Jobs at NeXT, and the rest is history. Steve came back to Apple riding on a white charger and brought his NeXTstep team with him. After some confusion over whether Apple was going to offer both the traditional Mac OS and a high-end version called Rhapsody, Steve took full control and decided that he was going to merge the two. Thus was born Mac OS X.

Mac OS X version 10.0
Macintosh OS X 10.0, released in 2001.

The merging was not entirely pretty. NeXT's core was based on Mach, which itself had become messily entangled with BSD Unix back in its academic past. Unix lived for the command line, whereas the Mac OS had shunned the CLI as unfriendly. Unix used traditional file extensions to identify file types, whereas the Mac OS hid this information in the file's metadata. And finally, Unix used the UFS file system, whereas the Mac OS ran on HFS+. Deciding which side to support was always a battle.

For some of these battles, the NeXTies won. File extensions were championed as the "new way" to identify file types. For the first release of OS X, the user was asked to choose between UFS and HFS+ when installing the OS on a new hard drive partition. For compatibility reasons, however, HFS+ was chosen as the default. Choosing UFS was also not recommended because it was case-sensitive, and would therefore break some third-party applications that were sloppy about file name capitalization.

Over time, the influence of the NeXT people waned, and some of their hard-line decisions were revisited. UFS support was finally dropped in Leopard, the latest version of OS X. File extensions, while still supported, were no longer mandatory.

There was still the issue of bringing HFS+ up to more modern standards. In OS X 10.2 (Jaguar), journaling was added to the file system, although it was turned off by default and could only be enabled via the command line. In 10.3 (Panther) it was enabled by default. Apple also hired Dominic Giampaolo, co-creator of BFS, and he worked to add extensible metadata, journaling, the initial implementation of Spotlight, and FSEvents. Lastly, NTFS-style fine-grained file permissions were added in Mac OS X Server 10.4.

ZFS and the future of file systems

Many people wondered why Apple didn't just ditch HFS+ and replace it with something newer and sexier, like Sun's ZFS. The Zettabyte File System, announced in 2004, is a 128-bit file system that supports files of ridiculous sizes (16 exabytes) and an absolutely ludicrous limit of 256 zettabytes (2 to the power of 78 bytes) for the total accessible size of the file system. Project leader Jeff Bonwick said that "Populating 128-bit file systems would exceed the quantum limits of earth-based storage. You couldn't fill a 128-bit storage pool without boiling the oceans." In other words, merely powering enough storage to fill up the limits of a 128-bit file system would take more energy than it would take to bring the entire world's oceans to a boil. It seems unlikely that anyone is going to build a 256-bit file system any time soon.

ZFS also has a novel way of dealing with the old bugbear of multiple partitions. ZFS builds on top of virtual storage pools called zpools, so all connected drives appear to be part of a single gigantic partition. Drives can be seamlessly strung together in various virtual RAID configurations, which can then automatically "self-heal" if data on one mirror is damaged. ZFS can also automatically take snapshots of every single change made to a file, saving only the differences, so no data can ever be lost.

ZFS Storage Pools
ZFS Storage Pools. Come on in, the water's fine! It's not even boiling yet!

There are other fancy ZFS features too numerous to list here. It will basically do everything but cook you dinner, so why doesn't Apple just put it into Mac OS X?

Part of the problem is that ZFS is still maturing, and Sun is still working out the kinks. However, the greater issue is that even if all the bugs are fixed, moving a whole user base over to a new file system is an uphill task.

File systems are expected to be completely reliable, and users often keep old data lying around on drives formatted with traditional file systems. Microsoft managed to move some FAT users over to NTFS with the conversion utility built into Windows, but the shift came about primarily through attrition, as old Windows 98-era computers were thrown away and replaced by new machines with XP pre-installed. FAT still haunts us to this day, as most flash drives are formatted with FAT32. Why? Because, as one of the oldest file systems around, it is also the best understood and the easiest to implement.

Often it is easier to stick with well-established file systems, even when the cost of switching is ostensibly "free." The example of Linux, which is completely open source and allows anyone to write a new file system for it, is a useful one. Despite valiant attempts to establish ReiserFS as a new standard, and the measurable superiority of systems like XFS, most Linux users are still using ext3. ext3 is not new. It's not super fast. It's not sexy. It won't cook your dinner. But it is tried and true, and for many people, that is more important.

Microsoft recently tried to resurrect the original BeOS database-driven file system idea with WinFS, which was originally scheduled to be included with Windows Vista. However, delays in releasing the operating system caused Microsoft to pull WinFS out of the OS and reposition it as an optional part of its SQL database product. The future of WinFS remains murky, but Microsoft may try to resurrect it for a future release of Windows.

NTFS is likely to stick around for many years to come, simply out of sheer inertia. HFS+ may kick around for a few years longer as well. Even FAT may still be on our thumb drives, haunting us with the ghost of CP/M long after everyone has forgotten what that even was.

While file systems may not, by themselves, seem exciting, their history tells us the story of how computers and operating systems have evolved over the years. "By his works, shall ye know him" is true for both humans and file systems. By knowing how the OS stores a humble file, one is provided a glimpse into the limitations and aspirations of its designers.
