Distributed File System NFS AFS GFS Information Technology Essay
In the last half decade, network applications have grown tremendously, and we are experiencing an era of information explosion. As a result, large amounts of distributed data must be stored and managed. To handle this kind of data, applications use a distributed file system (DFS).
The advantages of a DFS are increased availability and efficiency. Parameters such as scalability, reliability, transparency, fault tolerance and security are generally considered when designing a DFS. Some challenges remain open, such as fault tolerance under various conditions, optimized DFS architecture, synchronization, consistency and replication.
The goal of this paper is to study the evolution of DFS from its history, to review the current state of the art in DFS design and implementation, and to propose a new architecture that can improve data throughput for file transfer.
It consists of four parts: an overview of background knowledge about distributed file systems, issues in distributed file systems, and a comparison of various distributed file systems such as Network File System (NFS), Andrew File System (AFS), Google File System (GFS), XtreemFS and Hadoop Distributed File System (HDFS).
Finally, we propose our architecture, which can improve data throughput for file transfer, together with its related work and future work.
Keywords - Distributed File System, NFS, AFS, GFS, XtreemFS, HDFS.
A distributed file system is a client/server-based application that allows clients to access and process data stored on the server.
In other words, a distributed file system consists of software residing on network servers and clients that transparently links shared folders located on different file servers into a single namespace for improved load sharing and data availability.
When the client device retrieves a file from the server, the file appears as a normal file on the client machine, and the user is able to work with the file in the same ways as if it were stored locally on the workstation.
When the user finishes working with the file, it is returned over the network to the server, which stores the now-altered file for retrieval at a later time, as shown in Figure 1.
A distributed file system organizes resources in a tree structure, starting with a root located on a base server.
From the root, you can define links to shared folders distributed throughout local or wide area networks, without regard to their physical location. Instead of seeing a physical network of dozens of file servers, each with a separate directory structure, users now see a few logical directories that include all of the important file servers and shared folders.
Each shared folder appears in the most intuitive place in the directory, no matter where the folder actually resides. A distributed file system provides a consistent naming convention and mapping for collections of servers, shared folders, and files.
Distributed file systems can be advantageous because they make it easier to distribute documents to multiple clients and they provide a centralized storage system so that client machines are not using their resources to store files.
The rest of the paper is structured as follows. Section II covers design issues of distributed file systems. Section III gives an overview of different distributed file systems. Section IV compares those file systems on the basis of the factors discussed in Section II. Section V presents related work, with findings and improvements in current work, and future work.
Finally, the conclusion of this paper is outlined.
II. Design Issues in Distributed File Systems
While designing a DFS, we need to keep in mind various issues such as scalability, reliability, flexibility, transparency, fault tolerance, security, architecture, process type, naming, synchronization, and consistency & replication.
In this section we discuss all of these issues, though, due to lack of time, not in great depth.
1. Scalability
Nowadays, more and more users or clients are connecting to the system. Hence, we require an efficient system to handle those users or clients.
For that reason, we may need to scale up our system at any point in time. If we follow a centralized architecture when designing the DFS, scaling up requires more administration, whereas a decentralized architecture can be managed easily by the administrator.
We can add more CPUs to our DFS with less administration overhead.
2. Reliability
Reliability means that data should be available when and where it is required, on time, without alteration or modification and without any errors. If we use replication to provide reliability, all replicas must have consistent content.
3. Flexibility
To achieve flexibility in the DFS, we must decide whether we want to manage activities such as memory management, process management and resource management ourselves. If we want to manage these as required, we can use a microkernel approach. Otherwise, we can use a monolithic kernel approach, in which the kernel does everything on its own.
4. Transparency
While designing a DFS, we should provide some degree of transparency in the system. Through transparency, the system can achieve increased availability, effective use of resources and a simplified file system model.
There are various ways to achieve it; some of them are discussed below.
A. Location: The client or user should be unaware of the actual file location, and should not be able to infer the location from the file name.
B. Access: The client or user should feel that distributed files are accessed locally.
C. Replication: The client or user should not be aware of file replication, which is used to provide availability when the main requested file is unavailable.
D. Migration: The client or user should not be aware of file migration, which is used to improve system performance when a file is accessed.
E. Concurrency: All clients or users have the same view of the file system. If one process is modifying a file, any other processes on the same system or on remote systems that are accessing the file will see the modifications in a coherent manner.
5. Fault Tolerance
Because hardware or software failures occur in distributed file systems, these systems have to provide a fault-tolerant capability so as to tolerate faults and try to recover from them. Techniques such as replication and Redundant Array of Inexpensive Disks (RAID) are available to provide redundancy for fault tolerance. Traditionally, distributed file systems have relied on redundancy for high availability. In general, file systems replicate at the server level, directory level, or file level to deal with processor, disk, or network failures. Redundancy allows these systems to operate easily and continuously despite partial failure, at the cost of maintaining replicas in the file system.
6. Security
To achieve security in any system, we should focus mainly on three aspects: confidentiality, integrity and availability. Confidentiality protects our system from unauthorized access, integrity identifies and protects our data against corruption, and availability avoids situations such as system failure.
To achieve confidentiality we can use authentication techniques, and we can achieve integrity using a message digest.
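As a small illustration of the message digest idea mentioned above, the following Python sketch records a digest when data is stored and recomputes it when the data is read back. The choice of SHA-256, the function names and the sample data are assumptions made for this example; the text does not prescribe a particular digest algorithm.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 message digest of a block of file data as hex."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected_digest: str) -> bool:
    """Check received data against the digest recorded when it was stored."""
    return digest(data) == expected_digest


# The writer stores the digest alongside the data (sample data is made up).
original = b"contents of a shared file"
stored_digest = digest(original)

# A reader recomputes the digest to detect corruption or tampering.
intact = verify(original, stored_digest)
corrupted = verify(b"contents of a shared fiIe", stored_digest)
```

A digest on its own only detects accidental or unauthenticated changes; protecting the digest itself from modification would need an additional mechanism such as a keyed hash or a signature.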
7. Architecture
There are mainly five types of architecture used to design a DFS, each of which reflects the main goal of the file system. If we use a client-server architecture, we can achieve high performance with low latency. Likewise, every type has its own advantages, so we can choose any one of them according to our requirements.
Client - Server Architecture: As discussed, client-server architecture is used when higher performance is required.
It provides a standardized view of the local file system.
Parallel Architecture: Data blocks are striped in parallel across multiple storage devices on multiple storage servers, so that concurrent read and write operations can be performed more efficiently.
Centralized Architecture: A single central master server manages multiple chunk servers, and files are divided into multiple chunks.
Decentralized Architecture: More than one master server maintains the metadata, and there are also multiple chunk servers.
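As a rough illustration of the striping idea behind the parallel architecture described above, the following Python sketch splits a byte string into fixed-size blocks and assigns them round-robin to a number of storage servers. The function names, the block size and the in-memory lists standing in for servers are assumptions made for illustration; real parallel file systems use their own layouts.

```python
def stripe(data: bytes, num_servers: int, block_size: int) -> list[list[bytes]]:
    """Split data into fixed-size blocks and assign them round-robin to servers."""
    servers = [[] for _ in range(num_servers)]
    for start in range(0, len(data), block_size):
        block_index = start // block_size
        servers[block_index % num_servers].append(data[start:start + block_size])
    return servers


def reassemble(servers: list[list[bytes]]) -> bytes:
    """Rebuild the original byte stream by reading blocks in round-robin order."""
    blocks = []
    longest = max((len(s) for s in servers), default=0)
    for row in range(longest):
        for server_blocks in servers:
            if row < len(server_blocks):
                blocks.append(server_blocks[row])
    return b"".join(blocks)


# Five 2-byte blocks spread over three hypothetical storage servers.
layout = stripe(b"ABCDEFGHIJ", num_servers=3, block_size=2)
restored = reassemble(layout)
```

Because consecutive blocks live on different servers, a client could in principle fetch them concurrently; the sketch only shows the placement, not the parallel I/O itself.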
8. Process Type
In a distributed file service, file server processes can be stateless or stateful. Stateless file servers do not store any session state, and every client request is treated independently. Stateful servers, by contrast, do store session state and keep track of which clients have opened which files, current read and write pointers for files, which files have been locked by which clients, etc. The main advantage of stateless servers is that they can easily recover from failure. Because there is no state that must be restored, a failed server can simply restart after a crash and immediately provide services to clients as though nothing happened. Furthermore, if clients crash, the server is not stuck with abandoned opened or locked files. Another benefit is that the server implementation remains simple because it does not have to implement the state accounting associated with opening, closing, and locking of files. The main advantage of stateful servers, on the other hand, is that they can provide better performance for clients. Because clients do not have to provide full file information every time they perform an operation, the size of messages to and from the server can be significantly decreased. Likewise, the server can make use of knowledge of access patterns to perform read-ahead and do other optimizations. Stateful servers can also offer clients extra services such as file locking, and remember read and write positions.
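The contrast between the two server styles can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class names, the in-memory FILES dictionary and the integer handles are hypothetical and do not correspond to any real file server interface.

```python
# Toy file store shared by both server styles (contents are made up).
FILES = {"/docs/a.txt": b"hello distributed world"}


class StatelessServer:
    """Every request carries the full context: path, offset and length."""

    def read(self, path: str, offset: int, length: int) -> bytes:
        return FILES[path][offset:offset + length]


class StatefulServer:
    """The server remembers open files and the read position per handle."""

    def __init__(self) -> None:
        self._next_handle = 0
        self._open = {}  # handle -> [path, current offset]

    def open(self, path: str) -> int:
        handle = self._next_handle
        self._next_handle += 1
        self._open[handle] = [path, 0]
        return handle

    def read(self, handle: int, length: int) -> bytes:
        path, offset = self._open[handle]
        data = FILES[path][offset:offset + length]
        self._open[handle][1] = offset + len(data)  # advance the stored pointer
        return data

    def crash(self) -> None:
        """Simulate a restart: all session state is lost."""
        self._open.clear()


stateless = StatelessServer()
first = stateless.read("/docs/a.txt", 0, 5)
second = stateless.read("/docs/a.txt", 6, 11)

stateful = StatefulServer()
handle = stateful.open("/docs/a.txt")
part_one = stateful.read(handle, 5)   # shorter request: no path, no offset
part_two = stateful.read(handle, 6)   # continues where the last read stopped

stateful.crash()
try:
    stateful.read(handle, 1)
    handle_survived_crash = True
except KeyError:
    handle_survived_crash = False
```

The stateless requests are larger but keep working across a server restart, while the stateful handle becomes invalid once the session table is lost, which mirrors the trade-off described above.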
9. Naming
While designing a DFS, we should consider whether all machines and processes should have exactly the same view of the directory hierarchy. We should provide location transparency, location independence and access transparency. Location transparency means that the path name of a file gives no hint as to where the file is located. For example, we may refer to a file as //server1/dir/file. The server can move anywhere without the client caring, so we have location transparency. However, if the file moves to server2, things will not work. If we have location independence, files can be moved without their names changing. Hence, if machine or server names are embedded into path names, we do not achieve location independence. It is desirable to have access transparency, so that applications and users can access remote files just as they access local files. To facilitate this, the remote file system name space should be syntactically consistent with the local name space. The solution is to use a file system mounting mechanism to overlay portions of another file system over a node in a local directory structure. Mounting is used in the local environment to construct a uniform name space from separate file systems (which reside on different disks or partitions) as well as to incorporate special-purpose file systems into the name space (e.g., /proc on many UNIX systems allows file system access to processes). A remote file system can be mounted at a particular point in the local directory tree.
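The mounting mechanism described above can be pictured as a table that maps local mount points to remote directories. The Python sketch below is a simplified model under stated assumptions: the MountTable class, the server names and the export paths are hypothetical, and real mount handling lives in the operating system rather than in application code.

```python
from typing import Optional


class MountTable:
    """Maps local mount points to (server, remote directory) pairs."""

    def __init__(self) -> None:
        self._mounts = {}

    def mount(self, mount_point: str, server: str, remote_dir: str) -> None:
        self._mounts[mount_point.rstrip("/")] = (server, remote_dir.rstrip("/"))

    def resolve(self, path: str) -> Optional[tuple[str, str]]:
        """Return (server, remote path) for the longest matching mount point,
        or None when the path is purely local."""
        best = None
        for mount_point in self._mounts:
            if path == mount_point or path.startswith(mount_point + "/"):
                if best is None or len(mount_point) > len(best):
                    best = mount_point
        if best is None:
            return None
        server, remote_dir = self._mounts[best]
        return server, remote_dir + path[len(best):]


table = MountTable()
table.mount("/home", "server1", "/export/home")
before_move = table.resolve("/home/alice/notes.txt")

# The administrator moves the data to another server; the client path is unchanged.
table.mount("/home", "server2", "/export/home")
after_move = table.resolve("/home/alice/notes.txt")
local_only = table.resolve("/tmp/scratch")
```

Because the server name appears only in the table and never in the path used by applications, the same client-visible name keeps working after the data moves, which is the property the text calls location independence.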
10. Synchronization
When two or more users share the same file at the same time, it is necessary to define the semantics of reading and writing the file to avoid consistency problems. There are various ways to provide synchronization of files. One can use a file locking system, in which administration of the locks is handled by either the client or the server; a hybrid approach can also be used. Another alternative is to use atomic transactions. To access a file or a group of files, a process first executes a begin transaction primitive to signal that all future operations will be executed indivisibly.
When the work is completed, an end transaction primitive is executed. If two or more transactions start at the same time, the system ensures that the end result is as if they were run in some sequential order. All changes have an all-or-nothing property.
11. Consistency & Replication
In a DFS, we use caching on either the server side or the client side to improve system performance, and the cached copies must be kept consistent. There are four places in a distributed system where data can be held: on the server's disk, in a cache in the server's memory, in the client's memory, and on the client's disk. The first two places are not an issue, since any interface to the server can check the centralized cache.
It is in the last two places that problems arise and we have to consider the issue of cache consistency. Several approaches may be taken:
Write-through: What if another client reads its own cached copy? All accesses would require checking with the server first (which adds network congestion) or require the server to maintain state on who has what files cached.
Delayed writes: Data can be buffered locally (where consistency suffers) but files can be updated periodically. A single bulk write is far more efficient than lots of little writes every time any file contents are modified.
Write on close: This means that the file system uses session semantics.
Centralized control: The server keeps track of who has what open in which mode. We would have to support a stateful system and deal with signaling traffic.
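Two of the approaches listed above, write-through and write on close, can be compared with a toy model. The Python sketch below is an illustration under stated assumptions: the class names and the write counter are hypothetical, and the server is just an in-memory dictionary.

```python
class Server:
    """Holds the authoritative copy of each file and counts incoming writes."""

    def __init__(self) -> None:
        self.files = {}
        self.writes_received = 0

    def store(self, name: str, data: str) -> None:
        self.files[name] = data
        self.writes_received += 1


class WriteThroughClient:
    """Every write updates the local cache and is sent to the server at once."""

    def __init__(self, server: Server) -> None:
        self.server = server
        self.cache = {}

    def write(self, name: str, data: str) -> None:
        self.cache[name] = data
        self.server.store(name, data)


class WriteOnCloseClient:
    """Writes stay in the local cache until close (session semantics)."""

    def __init__(self, server: Server) -> None:
        self.server = server
        self.cache = {}

    def write(self, name: str, data: str) -> None:
        self.cache[name] = data

    def close(self, name: str) -> None:
        self.server.store(name, self.cache[name])


server_a = Server()
write_through = WriteThroughClient(server_a)
for version in ["v1", "v2", "v3"]:
    write_through.write("report.txt", version)

server_b = Server()
write_on_close = WriteOnCloseClient(server_b)
for version in ["v1", "v2", "v3"]:
    write_on_close.write("report.txt", version)
visible_before_close = "report.txt" in server_b.files
write_on_close.close("report.txt")
```

In this model the write-through client generates one server write per update, while the write-on-close client generates a single write, at the price that other clients cannot see the changes until the file is closed.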
III. Overview of Various Distributed File Systems
Network File System (NFS)
The Network File System, developed by Sun Microsystems, is the most widely used distributed file system in the UNIX world. Upon releasing the first versions of NFS in 1985, Sun made public the NFS protocol specification [SUN94], which allowed other vendors to implement NFS servers and clients. NFS servers are, by definition, stateless, i.e., they do not store information about the state of client accesses to their files. Should a server crash, no such information is lost and recovery can be immediate. On the other hand, stateless servers cannot control concurrent accesses to their files, and therefore NFS does not assure the consistency of its file system. Different clients can have different and conflicting copies of the same file or directory in their local caches. When a file is updated by one client, the modifications may not be noticed by other clients for up to 6 seconds. When a file is created or deleted, it can take up to 60 seconds for other clients to perceive this. If coherent sharing of information throughout the distributed system is needed, some other mechanism such as message passing must be used. Also because of its stateless servers, NFS can manage neither locks nor atomic transactions. The name space on each client can be different but, from the user's point of view, it is location transparent. It is the job of the system administrator to determine how each client will view the directory structure. This can be done by editing the /etc/fstab file, which statically binds mount points to server directories, and by editing the automounter configuration files, which allow dynamic bindings and some degree of replication on read-only directories. Figure 2 below shows the architecture of the Network File System.
[Figure 2: Architecture of Network File System]
A Network File System (NFS) allows remote hosts to mount file systems over a network and interact with those file systems as though they are mounted locally. This enables system administrators to consolidate resources onto centralized servers on the network. Currently, there are three versions of NFS. NFS version 2 (NFSv2) is older and is widely supported. NFS version 3 (NFSv3) has more features, including variable size file handling and better error reporting, but is not fully compatible with NFSv2 clients. NFS version 4 (NFSv4) works through firewalls and on the Internet, no longer requires portmapper, supports ACLs, and utilizes stateful operations. All versions of NFS can use Transmission Control Protocol (TCP) running over an IP network, with NFSv4 requiring it. NFSv2 and NFSv3 can use the User Datagram Protocol (UDP) running over an IP network to provide a stateless network connection between the client and server. When using NFSv2 or NFSv3 with UDP, the stateless UDP connection under normal conditions minimizes network traffic, as the NFS server sends the client a cookie after the client is authorized to access the shared volume. This cookie is a random value stored on the server's side and is passed along with RPC requests from the client. The NFS server can be restarted without affecting the clients and the cookie remains intact. However, because UDP is stateless, if the server goes down unexpectedly, UDP clients continue to saturate the network with requests for the server. For this reason, TCP is the preferred protocol when connecting to an NFS server. The only time NFS performs authentication is when a client system attempts to mount the shared NFS resource. To limit access to the NFS service, TCP wrappers are used. TCP wrappers read the /etc/hosts.allow and /etc/hosts.deny files to determine if a particular client or network is permitted or denied access to the NFS service.
Andrew File System (AFS)
Andrew is a distributed workstation environment that has been under development at Carnegie Mellon University since 1983. The primary data-sharing mechanism is a distributed file system spanning all the workstations. Using a set of trusted servers, collectively called Vice, the Andrew file system presents a homogeneous, location transparent file name space to workstations. Clients and servers both run the 4.3 BSD version of Unix. Scalability is the dominant design consideration in Andrew. Many design decisions in Andrew are influenced by its anticipated final size of 5000 to 10000 nodes. Careful design is necessary to provide good performance at large scale and to facilitate system administration. Figure 3 shows the Andrew File System architecture.
[Figure 3: Andrew File System Architecture]
The file name space on an Andrew workstation is partitioned into a shared and a local name space. The shared name space is location transparent and is identical on all workstations. The local name space is unique to each workstation and is relatively small. It only contains temporary files or files needed for workstation initialization. Users see a consistent image of their data when they move from one workstation to another, since their files are in the shared name space. Andrew uses 96-bit file identifiers to uniquely identify files. These identifiers are not visible to application programs. Files in the shared name space are cached on demand on the local disks of workstations. A cache manager, called Venus, runs on each workstation. When a file is opened, Venus checks the cache for the presence of a valid copy. If such a copy exists, the open request is treated as a local file open. Otherwise an up-to-date copy is fetched from the custodian. Read and write operations on an open file are directed to the cached copy. No network traffic is generated by such requests. If a cached file is modified, it is copied back to the custodian when the file is closed. Cache consistency is maintained by a mechanism called callback. When a file is cached from a server, the latter makes a note of this fact and promises to inform the client if the file is updated by someone else. Callbacks may be broken at will by the client or the server. The use of callback, rather than checking with the custodian on each open, substantially reduces client-server interactions. The latter mechanism was used in the first version of Andrew. Andrew caches large chunks of files, to reduce client-server interactions and to exploit bulk data transfer protocols. Earlier versions of Andrew cached entire files. Concurrency control is provided in Andrew by emulation of the Unix flock system call. Lock and unlock operations on a file are performed directly on its custodian. If a client does not release a lock within 30 minutes, it is timed out by the server.
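The callback mechanism described above can be modelled in a few lines of Python. This is a simplified sketch under stated assumptions: the Custodian and CacheManager classes, their method names and the fetch counter are hypothetical stand-ins for the server and for the cache manager role the text calls Venus, and whole files are cached as strings.

```python
class Custodian:
    """Server holding the authoritative files and the callback promises."""

    def __init__(self) -> None:
        self.files = {}
        self.callbacks = {}  # file name -> set of clients holding a callback
        self.fetches = 0

    def fetch(self, client, name: str) -> str:
        self.fetches += 1
        # Note that this client now caches the file (the callback promise).
        self.callbacks.setdefault(name, set()).add(client)
        return self.files[name]

    def store(self, writer, name: str, data: str) -> None:
        self.files[name] = data
        # Break the callback of every other client caching this file.
        for client in self.callbacks.get(name, set()) - {writer}:
            client.break_callback(name)
        self.callbacks[name] = {writer}


class CacheManager:
    """Client-side cache manager."""

    def __init__(self, custodian: Custodian) -> None:
        self.custodian = custodian
        self.cache = {}
        self.valid = set()   # files whose callback is still intact
        self.dirty = set()   # files modified locally since the last close

    def open(self, name: str) -> str:
        if name not in self.valid:
            self.cache[name] = self.custodian.fetch(self, name)
            self.valid.add(name)
        return self.cache[name]

    def write(self, name: str, data: str) -> None:
        self.cache[name] = data  # local only, no network traffic
        self.dirty.add(name)

    def close(self, name: str) -> None:
        if name in self.dirty:
            self.custodian.store(self, name, self.cache[name])
            self.dirty.discard(name)

    def break_callback(self, name: str) -> None:
        self.valid.discard(name)


custodian = Custodian()
custodian.files["paper.txt"] = "draft 1"
client_a = CacheManager(custodian)
client_b = CacheManager(custodian)

client_a.open("paper.txt")
client_b.open("paper.txt")
client_a.open("paper.txt")            # served from cache, no fetch
fetches_before_update = custodian.fetches

client_b.write("paper.txt", "draft 2")
client_b.close("paper.txt")           # copied back; client_a's callback is broken
seen_by_a = client_a.open("paper.txt")
fetches_after_update = custodian.fetches
```

Repeated opens cost no server interaction while the callback is intact; only after another client's update does the next open go back to the custodian.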
Google File System (GFS)
GFS shares many of the same goals as previous distributed file systems, such as performance, scalability, reliability, and availability. The file system consists of hundreds or even thousands of storage machines built from inexpensive commodity parts and is accessed by a comparable number of client machines. A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients, as shown in Figure 4. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable. Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64-bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, GFS stores three replicas, though users can designate different replication levels for different regions of the file namespace. The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state. GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application.
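To make the roles of chunk handles and master metadata more concrete, the Python sketch below shows how a client-side library might turn a file name and byte offset into a chunk handle, a list of replica locations and an offset within the chunk. It is an illustration under stated assumptions: the Master class, the locate function, the handle values, the chunkserver names and the 64 MB chunk size constant are chosen for the example and are not the actual GFS interface.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # assumed chunk size; the text only says "fixed-size"


class Master:
    """Holds metadata only: file -> chunk handles, handle -> replica locations."""

    def __init__(self) -> None:
        self.file_chunks = {}
        self.chunk_locations = {}

    def lookup(self, filename: str, chunk_index: int):
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.chunk_locations[handle]


def locate(master: Master, filename: str, offset: int):
    """Translate (file name, byte offset) into (handle, replicas, offset in chunk)."""
    chunk_index, offset_in_chunk = divmod(offset, CHUNK_SIZE)
    handle, replicas = master.lookup(filename, chunk_index)
    return handle, replicas, offset_in_chunk


# Hypothetical metadata: one file with three chunks, each held by three chunkservers.
master = Master()
master.file_chunks["/logs/web.log"] = [101, 102, 103]
master.chunk_locations = {
    101: ["cs1", "cs2", "cs3"],
    102: ["cs2", "cs3", "cs4"],
    103: ["cs1", "cs3", "cs4"],
}

# A read at byte offset 70 MiB falls 6 MiB into the second chunk.
handle, replicas, offset_in_chunk = locate(master, "/logs/web.log", 70 * 1024 * 1024)
```

After this metadata lookup, the client would contact one of the listed chunkservers directly with the handle and byte range, so file data never passes through the master.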
[Figure 4: GFS Architecture]
Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.
XtreemFS
XtreemFS is an object-based file system that has been specifically designed for Grid environments as a part of the XtreemOS operating system. As an object based design, it is composed of clients, OSDs and metadata servers that are also responsible for keeping replica locations (the Metadata and Replica Catalog, MRC, see Fig 5). In addition, a directory service acts as a registry to locate servers and volumes in the system. XtreemFS manages file system volumes that represent mountable file systems. A volume's files and directories share certain default policies for replication and access. To ensure availability, volumes can be replicated to multiple MRCs. In order to be able to accommodate larger file systems on commodity hardware, volumes can also be partitioned across multiple MRCs. Given proper access rights, clients can mount XtreemFS volumes anywhere in the Grid. Volumes are registered in the directory service, where a client can look up one or more MRCs hosting the volume's metadata. XtreemFS integrates with common VO authentication methods to check a user's credentials. The user's operations are subject to access policies. These access policies can implement normal file system policies like Unix user/group rights or full POSIX ACLs. In a federated environment, policies also restrict the range of OSDs to which an MRC will replicate files, or the set of MRCs from which an OSD will accept replicas. Apart from these policy restrictions, our design allows files to be replicated to any OSD. In addition, a file's replica can be striped across a group of OSDs, which allows us to leverage the aggregate bandwidth of these storage devices by accessing them in parallel. As a fully integrated part of XtreemFS, OSDs are aware of the existing replicas of a particular file. This knowledge allows them to coordinate their operations with OSDs that are hosting other replicas of the file.
Through this coordination, XtreemFS can guarantee POSIX semantics even in the presence of concurrent accesses to the replicas. To coordinate operations on file data, OSDs negotiate leases that allow their holder to define the latest version of the particular data without further communication. OSDs also keep version numbers so that it is possible to identify which OSDs hold the latest version of the file data.
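The two ideas in this paragraph, a time-limited lease and per-replica version numbers, can be sketched in a few lines of Python. This is a simplified, single-process illustration and not the XtreemFS negotiation protocol, which runs between OSDs. The names `Lease`, `acquire` and `latest_replicas`, and the use of plain floating-point timestamps, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Lease:
    holder: str        # OSD that currently holds the lease
    expires_at: float  # time after which the lease is no longer valid

    def valid(self, now: float) -> bool:
        return now < self.expires_at


def acquire(current: Optional[Lease], osd: str, now: float, duration: float) -> Lease:
    """Grant or renew the lease for `osd` unless another OSD holds a valid one."""
    if current is None or not current.valid(now) or current.holder == osd:
        return Lease(holder=osd, expires_at=now + duration)
    return current  # someone else still holds a valid lease


def latest_replicas(versions: Dict[str, int]) -> List[str]:
    """Return the OSDs whose version number is the highest one reported."""
    newest = max(versions.values())
    return sorted(osd for osd, version in versions.items() if version == newest)
```

While its lease is valid, the holder can decide on the latest version locally; once the lease expires, any other OSD may take over, and the version numbers show which replicas are current.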
Figure 5. Architecture of XtreemFS
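Striping a file's replica across a group of OSDs, as described earlier in this section, comes down to mapping byte offsets to storage devices. The Python sketch below assumes a simple round-robin layout with fixed-size objects; the function names, the `stripe_size` parameter and the OSD names are hypothetical and are not taken from XtreemFS.

```python
def locate(offset, stripe_size, osds):
    """Return (osd, object number, offset inside the object) for a byte offset."""
    object_no, within = divmod(offset, stripe_size)
    return osds[object_no % len(osds)], object_no, within


def plan_read(offset, length, stripe_size, osds):
    """Split a byte range into per-OSD requests that could be issued in parallel."""
    requests = []
    end = offset + length
    while offset < end:
        osd, object_no, within = locate(offset, stripe_size, osds)
        size = min(end - offset, stripe_size - within)  # stay inside one object
        requests.append((osd, object_no, within, size))
        offset += size
    return requests
```

Because consecutive objects land on different OSDs, a large sequential read turns into several independent requests, which is where the aggregate bandwidth comes from.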
Policies dictate how many replicas an OSD forwards changes to before acknowledging the client's write operation. The user can, for example, choose a strict policy that always keeps at least three replicas up to date at different sites, or a looser policy that updates other replicas lazily or on demand.
The OSDs' awareness of replicas also makes it possible to create new replicas logically very quickly and reliably. From an external perspective, a new replica exists as soon as the OSD is aware of being a home for the data. Its versions of the replica's file objects are initially marked obsolete. The replica is then physically created, either on demand through a client's accesses or automatically when a policy instructs the replica to do so. Replicas are therefore always created reliably as a decentralized interaction between the OSDs, and no extra services are needed to initiate, control or monitor the transfer of the data. To create replicas when some OSDs have failed, and to remove unreachable replicas, XtreemFS uses a replica set coordination protocol that integrates with the lease coordination protocol. The replica set protocol ensures that even in the worst failure case the replicated data can never become inconsistent, while still allowing replicas to be added or removed in many failure scenarios.
This design makes new replicas available very quickly, even if the file data has not been completely copied by the system. When an OSD's client accesses only a certain part of the replica, the replica needs to keep only that particular slice. The remaining data is automatically marked as obsolete and falls behind other replicas.
Because it involves a distributed consensus process that is inherently expensive, the replica lease coordination process does not scale well. When too many OSDs per file are involved, the necessary communication increases excessively. Fortunately, a moderate number of replicas is sufficient for most purposes. If a large number of replicas is required, XtreemFS can switch the file to a read-only mode and allow an unlimited number of read-only file replicas, which fits many common Grid data management scenarios.
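The difference between a strict and a loose write policy can be made concrete with a small Python sketch. It is an illustration under stated assumptions and not XtreemFS code: the `Replica` class, the `replicated_write` function and the `min_sync` parameter (the number of replicas that must be updated before the write is acknowledged) are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Replica:
    site: str
    version: int = 0
    data: bytes = b""
    reachable: bool = True


def replicated_write(replicas: List[Replica], data: bytes,
                     min_sync: int) -> Tuple[bool, List[str], List[str]]:
    """Update `min_sync` replicas before acknowledging; leave the rest stale.

    Returns (acknowledged, sites updated now, sites to be updated lazily).
    If too few replicas are reachable, nothing is modified.
    """
    if min_sync < 1:
        raise ValueError("min_sync must be at least 1")
    reachable = [r for r in replicas if r.reachable]
    if len(reachable) < min_sync:
        return False, [], []
    new_version = max(r.version for r in replicas) + 1
    synced = reachable[:min_sync]
    for replica in synced:
        replica.data, replica.version = data, new_version
    synced_sites = [r.site for r in synced]
    stale_sites = [r.site for r in replicas if r.site not in synced_sites]
    return True, synced_sites, stale_sites
```

With `min_sync=3` the call behaves like the strict policy described above, while `min_sync=1` acknowledges after a single update and leaves the other replicas to catch up later; their lower version numbers mark them as out of date.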
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems, but the differences are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. It provides high-throughput access to application data and is suitable for applications that have large data sets.
HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. HDFS exposes a file system namespace and allows user data to be stored in files. Internally, a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The DataNodes are responsible for serving read and write requests from the file system's clients. They also perform block creation, deletion, and replication upon instruction from the NameNode.
The NameNode and DataNode are pieces of software designed to run on commodity machines, which typically run a GNU/Linux operating system. HDFS is built using the Java language, so any machine that supports Java can run the NameNode or the DataNode software, and HDFS can be deployed on a wide range of machines. A typical deployment has a dedicated machine that runs only the NameNode software. Each of the other machines in the cluster runs one instance of the DataNode software. The architecture does not preclude running multiple DataNodes on the same machine, but in a real deployment that is rarely the case. The existence of a single NameNode in a cluster greatly simplifies the architecture of the system. The NameNode is the arbitrator and repository for all HDFS metadata, and the system is designed so that user data never flows through the NameNode.
HDFS supports a traditional hierarchical file organization. A user or an application can create directories and store files inside these directories. The file system namespace hierarchy is similar to that of most other existing file systems: one can create and remove files, move a file from one directory to another, or rename a file. HDFS does not yet implement user quotas, and it does not support hard links or soft links. However, the HDFS architecture does not preclude implementing these features. Figure 6 shows the architecture of HDFS.
Figure 6. Architecture of HDFS
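How a file is split into blocks and how a block-to-DataNode mapping might be recorded can be shown with a short Python sketch. The function names are hypothetical, and the round-robin placement is a deliberately simple assumption made for illustration; it is not the placement policy that the NameNode actually applies.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the sizes of a file's blocks; only the last one may be smaller."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)
    return sizes


def place_blocks(num_blocks: int, datanodes: List[str],
                 replication: int) -> Dict[int, List[str]]:
    """Map each block index to the DataNodes holding a copy (round-robin assumption)."""
    if replication > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication")
    return {
        block: [datanodes[(block + i) % len(datanodes)] for i in range(replication)]
        for block in range(num_blocks)
    }
```

For example, a 300-byte file with a 128-byte block size yields two full blocks and one final block of 44 bytes. The mapping returned by `place_blocks` is the kind of metadata the NameNode keeps, while the block contents themselves live only on the DataNodes.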
The NameNode maintains the file system namespace. Any change to the namespace or its properties is recorded by the NameNode. An application can specify the number of replicas of a file that HDFS should maintain. This number is called the replication factor of the file, and it is stored by the NameNode. HDFS is designed to reliably store very large files across machines in a large cluster. It stores each file as a sequence of blocks; all blocks in a file except the last one are the same size. The blocks of a file are replicated for fault tolerance. The block size and replication factor are configurable per file. The replication factor can be specified at file creation time and can be changed later. Files in HDFS are write-once and have strictly one writer at any time. The NameNode makes all decisions regarding the replication of blocks. It periodically receives a Heartbeat and a Blockreport from each of the DataNodes in the cluster. Receipt of a Heartbeat implies that the DataNode is functioning properly. A Blockreport contains a list of all blocks on a DataNode.
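The way Heartbeats and Blockreports can feed replication decisions is sketched below in Python. This is a toy model and not Hadoop code: the class `ToyNameNode`, its methods and the `heartbeat_timeout` parameter are assumptions made for illustration, and times are plain numbers supplied by the caller.

```python
from typing import Dict, Set


class ToyNameNode:
    """Tracks DataNode liveness and block locations from periodic reports."""

    def __init__(self, heartbeat_timeout: float):
        self.heartbeat_timeout = heartbeat_timeout  # hypothetical setting
        self.last_heartbeat: Dict[str, float] = {}
        self.block_reports: Dict[str, Set[str]] = {}
        self.replication: Dict[str, int] = {}  # block id -> replication factor

    def heartbeat(self, datanode: str, now: float) -> None:
        self.last_heartbeat[datanode] = now

    def block_report(self, datanode: str, blocks: Set[str]) -> None:
        self.block_reports[datanode] = set(blocks)

    def live_datanodes(self, now: float) -> Set[str]:
        """DataNodes whose last Heartbeat is recent enough to count as alive."""
        return {dn for dn, seen in self.last_heartbeat.items()
                if now - seen <= self.heartbeat_timeout}

    def under_replicated(self, now: float) -> Dict[str, int]:
        """Return block id -> number of missing copies, counting live nodes only."""
        live = self.live_datanodes(now)
        missing = {}
        for block, target in self.replication.items():
            copies = sum(1 for dn in live
                         if block in self.block_reports.get(dn, set()))
            if copies < target:
                missing[block] = target - copies
        return missing
```

When a DataNode stops sending Heartbeats, the blocks it reported no longer count towards the replication factor, so the affected blocks show up in `under_replicated` and new copies can be scheduled on other DataNodes.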
IV. Comparison of Distributed File Systems
This paper studies common design issues of DFS and gives an overview of Network File System (NFS), Andrew File System (AFS), Google File System (GFS), XtreemFS and Hadoop Distributed File System (HDFS). Many more distributed file systems exist, but due to space limitations their novelty and functionality are not covered here. The comparison is summarized in Table 1.
Network File System (NFS)
Andrew File System (AFS)
Google File System (GFS)
Hadoop Distributed File System (HDFS)
Limited, More Overhead
Authentication, Chunk Server side 32-bit check summing
POSIX Permissions, XtreemOS Authentication Provider
Authentication, Client side 32-bit check summing
Central File Server
Central Metadata Server
Central Metadata Server
Central Metadata Server
Central Metadata Server
Write-once-read many, locks on object to client
Write-once-read many, locks on object to client
Write-once-read many, locks on object to client
Write-once-read many, locks on object to client
Consistency & Replication
Client side caching, Server side replication, checksum
Maintain three copies of file, Server side replication, checksum
Maintain two copies of file, Server side replication, checksum
V. Related Work
Figure: Proposed System