Distributed File Systems
A distributed file system (DFS) is a distributed implementation of the classical time-sharing model of a file system, where multiple users share files and storage resources. The purpose of a DFS is to support the same kind of sharing when the files are physically dispersed among the various sites of a distributed system.
In this section, we discuss the various ways a DFS can be designed and implemented. First, we discuss common concepts on which DFSs are based. Then we illustrate these concepts by examining example systems, including the Sprite and Locus DFSs. We take this approach to the presentation because distributed systems is an active research area, and the many design tradeoffs we shall illuminate are still being examined. By exploring these example systems, we hope to provide a sense of the considerations involved in designing an operating system, and to indicate current areas of operating-system research.
A distributed system is a collection of loosely coupled machines interconnected by a communication network. We use the term machine to denote either a mainframe or a workstation. From the point of view of a specific machine in a distributed system, the rest of the machines and their respective resources are remote, whereas the machine's own resources are referred to as local.
To explain the structure of a DFS, we need to define the terms service, server, and client. A service is a software entity running on one or more machines and providing a particular type of function to a priori unknown clients. A server is the service software running on a single machine. A client is a process that can invoke a service using a set of operations that forms its client interface. Sometimes, a lower-level interface is defined for the actual cross-machine interaction, which we refer to as the inter-machine interface.
Using the terminology we have defined above, we say that a file system provides file services to clients. A client interface for a file service is formed by a set of primitive file operations, such as create a file, delete a file, read from a file, and write to a file. The primary hardware component that a file server controls is a set of local secondary-storage devices (usually, magnetic disks), on which files are stored, and from which they are retrieved according to the client requests.
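As a rough illustration of such a client interface, the following Python sketch exposes the four primitive operations named above. The class name InMemoryFileService, the method signatures, and the use of a dictionary in place of secondary storage are all assumptions made for the example; they are not taken from any particular DFS.

```python
class InMemoryFileService:
    """Toy file service that keeps file contents in memory instead of on disk."""

    def __init__(self):
        self._files = {}  # file name -> bytearray holding the file contents

    def create(self, name):
        if name in self._files:
            raise FileExistsError(name)
        self._files[name] = bytearray()

    def delete(self, name):
        del self._files[name]  # raises KeyError if the file does not exist

    def read(self, name, offset, length):
        return bytes(self._files[name][offset:offset + length])

    def write(self, name, offset, data):
        contents = self._files[name]
        # Pad with zero bytes if the write starts past the current end of file.
        if offset > len(contents):
            contents.extend(b"\0" * (offset - len(contents)))
        contents[offset:offset + len(data)] = data
        return len(data)
```

A client would call these operations without knowing how or where the server stores the data, which is the property the rest of this section builds on.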
A DFS is a file system whose clients, servers, and storage devices are dispersed among the machines of a distributed system. Accordingly, service activity has to be carried out across the network, and instead of a single centralized data repository, there are multiple and independent storage devices.
As will become evident, the concrete configuration and implementation of a DFS may vary. There are configurations where servers run on dedicated machines, as well as configurations where a machine can be both a server and a client. A DFS can be implemented as part of a distributed operating system, or alternatively by a software layer whose task is to manage the communication between conventional operating systems and file systems. The distinctive features of a DFS are the multiplicity and autonomy of clients and servers.
Ideally, a DFS should look to its clients like a conventional, centralized file system. The multiplicity and dispersion of its servers and storage devices should be made transparent. That is, the client interface of a DFS should not distinguish between local and remote files. It is up to the DFS to locate the file and to arrange for the transport of the data. A transparent DFS facilitates user mobility by bringing over the user's environment (that is, home directory) wherever a user logs in.
The most important performance measurement of a DFS is the amount of time needed to satisfy various service requests. In conventional systems, this time consists of disk access time and a small amount of CPU processing time.
In a DFS, however, a remote access has the additional overhead attributed to the distributed structure. This overhead includes the time needed to deliver the request to a server, as well as the time for getting the response across the network back to the client. For each direction, in addition to the actual transfer of the information, there is the CPU overhead of running the communication protocol software. The performance of a DFS can be viewed as another dimension of the DFS's transparency.
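The overhead described above can be summarized in a simple additive model. The sketch below is only a back-of-the-envelope aid: the function names and parameters are hypothetical, and it assumes that both directions cost the same and that none of the steps overlap in time.

```python
def local_access_time(disk_ms, cpu_ms):
    """Time to satisfy a request in a conventional system."""
    return disk_ms + cpu_ms


def remote_access_time(disk_ms, cpu_ms, transfer_ms, protocol_cpu_ms):
    """Time to satisfy a remote request under the additive model.

    transfer_ms and protocol_cpu_ms are per-direction costs; they are counted
    twice, once for the request and once for the response.
    """
    network_overhead = 2 * (transfer_ms + protocol_cpu_ms)
    return local_access_time(disk_ms, cpu_ms) + network_overhead
```

Under this model, a DFS is transparent with respect to performance to the extent that the network overhead term is small relative to the local access time.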
The fact that a DFS manages a set of dispersed storage devices is the DFS's key distinguishing feature. The overall storage space managed by a DFS is composed of different, and remotely located, smaller storage spaces. Usually, there is correspondence between these constituent storage spaces and sets of files. We use the term component unit to denote the smallest set of files that can be stored on a single machine, independently from other units. All files belonging to the same component unit must reside in the same location.
Naming and Transparency
Naming is a mapping between logical and physical objects. For instance, users deal with logical data objects represented by file names, whereas the system manipulates physical blocks of data, stored on disk tracks. Usually, a user refers to a file by a textual name. The latter is mapped to a lower-level numerical identifier that in turn is mapped to disk blocks. This multilevel mapping provides users with an abstraction of a file that hides the details of how and where on the disk the file is actually stored.
In a transparent DFS, a new dimension is added to the abstraction: that of hiding where in the network the file is located. In a conventional file system, the range of the naming mapping is an address within a disk. In a DFS, this range is augmented to include the specific machine on whose disk the file is stored. Going one step further with the concept of treating files as abstractions leads to the possibility of file replication. Given a file name, the mapping returns a set of the locations of this file's replicas. In this abstraction, both the existence of multiple copies and their location are hidden.
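One way to picture the augmented range is as a table from file names to sets of (machine, disk address) pairs. The following Python sketch makes that concrete; the table contents, the Location type, and the function names are invented for illustration, and picking the replica with min() is an arbitrary choice that simply keeps the example deterministic.

```python
from typing import NamedTuple


class Location(NamedTuple):
    machine: str       # machine on whose disk a replica is stored
    disk_address: int  # address of the replica within that machine's disk


# Hypothetical mapping table: file name -> set of replica locations.
REPLICAS = {
    "/projects/report.txt": {
        Location("machine-a", 4096),
        Location("machine-b", 8192),
    },
}

# Hypothetical storage: what is held at each location.
STORAGE = {
    Location("machine-a", 4096): b"draft",
    Location("machine-b", 8192): b"draft",
}


def locate(name):
    """Return every location at which a replica of the named file is stored."""
    return frozenset(REPLICAS[name])


def read_file(name):
    """Client-visible call: the caller supplies only the name."""
    replica = min(locate(name))  # any replica would do
    return STORAGE[replica]
```

The caller of read_file sees neither the number of replicas nor the machine that served the request, which is the hiding the paragraph above describes.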
Naming Structures
There are two related notions regarding name mappings in a DFS that need to be differentiated:
- Location transparency: The name of a file does not reveal any hint of the file's physical storage location.
- Location independence: The name of a file does not need to be changed when the file's physical storage location changes.
Both definitions are relative to the level of naming discussed previously, since files have different names at different levels (that is, user-level textual names, and system-level numerical identifiers). A location-independent naming scheme is a dynamic mapping, since it can map the same file name to different locations at two different times. Therefore, location independence is a stronger property than is location transparency.
In practice, most of the current DFSs provide a static, location-transparent mapping for user-level names. These systems, however, do not support file migration; that is, changing the location of a file automatically is impossible. Hence, the notion of location independence is quite irrelevant for these systems.
Files are associated permanently with a specific set of disk blocks. Files and disks can be moved between machines manually, but file migration implies an automatic, operating-system-initiated action. [...] requests, without changing either the user-level names or the low-level [...]
There are a few aspects that can further differentiate location independence and static location transparency:
- Divorcing data from location, as exhibited by location independence, provides a better abstraction for files. A file name should denote the file's most significant attributes, which are its contents, rather than its location. Location-independent files can be viewed as logical data containers that are not attached to a specific storage location. If only static location transparency is supported, the file name still denotes a specific, although hidden, set of physical disk blocks.
- Static location transparency provides users with a convenient way to share data. Users can share remote files by simply naming the files in a location-transparent manner, as though the files were local. Nevertheless, sharing the storage space is cumbersome, because logical names are still statically attached to physical storage devices. Location independence promotes sharing the storage space itself, as well as the data objects. When files can be mobilized, the overall, system-wide storage space looks like a single, virtual resource. A possible benefit of such a view is the ability to balance the utilization of disks across the system.
- Location independence separates the naming hierarchy from the storage-devices hierarchy and from the inter-computer structure. By contrast, if static location transparency is used (although names are transparent), we can easily expose the correspondence between component units and machines. The machines are configured in a pattern similar to the naming structure. This may restrict the architecture of the system unnecessarily and conflict with other considerations. A server in charge of a root directory is an example of a structure that is dictated by the naming hierarchy and contradicts decentralization guidelines.
Once the separation of name and location has been completed, files residing on remote server systems may be accessed by various clients. In fact, these clients may be diskless and rely on servers to provide all files, including the operating-system kernel. Special protocols are needed for the boot sequence, however. Consider the problem of getting the kernel to a diskless workstation. The diskless workstation has no kernel, so it cannot use the DFS code to retrieve the kernel. Instead, a special boot protocol, stored in read-only memory (ROM) on the client, is invoked. It enables networking and retrieves only one special file (the kernel or boot code) from a fixed location. Once the kernel is copied over the network and loaded, its DFS makes all the other operating-system files available. The advantages of diskless clients are many, including lower cost (because no disk is needed on each machine) and greater convenience (when an operating-system upgrade occurs, only the server copy needs to be modified, rather than all the clients as well). The disadvantages are the added complexity of the boot protocols and the performance loss resulting from the use of a network rather than of a local disk.
The current trend is toward clients with local disks. Disk drives are increasing in capacity and decreasing in cost rapidly, with new generations appearing every year or so. The same cannot be said for networks, which evolve every 5 to 10 years. Overall, systems are growing more quickly than are networks, so extra efforts are needed to limit network access to improve system throughput.
Naming Schemes
There are three main approaches to naming schemes in a DFS. In the simplest approach, files are named by some combination of their host name and local name, which guarantees a unique systemwide name. In Ibis, for instance, a file is identified uniquely by the name host:local-name, where local-name is a UNIX-like path. This naming scheme is neither location transparent nor location independent. Nevertheless, the same file operations can be used for both local and remote files. The structure of the DFS is a collection of isolated component units that are entire conventional file systems. In this first approach, component units remain isolated, although means are provided to refer to a remote file.
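To see why a name of this kind is not location transparent, consider how such a name might be taken apart. The sketch below assumes that a colon separates the host from the local name; the function parse_name and the sample host name are hypothetical and are not taken from Ibis itself.

```python
def parse_name(full_name):
    """Split a host-qualified file name into (host, local_name).

    Assumes the form host:local-name, where local-name is a UNIX-like path.
    """
    host, separator, local_name = full_name.partition(":")
    if not separator or not host or not local_name:
        raise ValueError(f"not a host-qualified name: {full_name!r}")
    return host, local_name
```

Because the host is a literal part of the name, moving the file to another machine changes the name, so the scheme is not location independent either.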
Implementation Techniques
Implementation of transparent naming requires a provision for the mapping of a file name to the associated location. Keeping this mapping manageable calls for aggregating sets of files into component units, and providing the mapping on a component unit basis rather than on a single file basis. This aggregation serves administrative purposes as well. UNIX-like systems use the hierarchical directory tree to provide name-to-location mapping, and to aggregate files recursively into directories.
To enhance the availability of the crucial mapping information, we can use methods such as replication, local caching, or both. As we already noted, location independence means that the mapping changes over time; hence, replicating the mapping renders a simple yet consistent update of this information impossible. A technique to overcome this obstacle is to introduce low-level, location-independent file identifiers. Textual file names are mapped to lower-level file identifiers that indicate to which component unit the file belongs. These identifiers are still location independent. They can be replicated and cached freely without being invalidated by migration of component units. A second level of mapping, which maps component units to locations and needs a simple yet consistent update mechanism, is the inevitable price.
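The two levels of mapping can be sketched as two small tables. In the Python sketch below, the class NameService, its method names, and the representation of a file identifier as a (component unit, file number) pair are assumptions chosen for illustration; the text above fixes only that the identifier indicates the component unit.

```python
class NameService:
    """Toy two-level name mapping."""

    def __init__(self):
        self._name_to_id = {}      # level 1: textual name -> (unit, file number)
        self._unit_to_machine = {} # level 2: component unit -> current machine

    def place_unit(self, unit, machine):
        """Record, or update after migration, where a component unit lives."""
        self._unit_to_machine[unit] = machine

    def bind(self, name, unit, file_no):
        self._name_to_id[name] = (unit, file_no)

    def lookup_id(self, name):
        """Level 1: location-independent result, safe to cache or replicate."""
        return self._name_to_id[name]

    def locate(self, file_id):
        """Level 2: the only mapping that changes when a unit migrates."""
        unit, _file_no = file_id
        return self._unit_to_machine[unit]
```

A client that cached the result of lookup_id before a migration can keep using it afterward; only the much smaller unit-to-machine table has to be updated consistently:

```python
ns = NameService()
ns.place_unit("unit7", "machine-a")
ns.bind("/home/alice/notes.txt", "unit7", 42)
cached_id = ns.lookup_id("/home/alice/notes.txt")
ns.place_unit("unit7", "machine-b")  # the component unit migrates
print(ns.locate(cached_id))          # the cached identifier is still valid
```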