Name: CS111 Project 3A File System Interpretation Solved
SKU: 30282
Price: 35.00 USD
Availability: InStock

Description

5/5 - (1 vote)

INTRODUCTION:

We are all familiar with the characteristics of the files and directories in which we
store all of our data. As with many other persistent objects, their functionality,
generality, performance and robustness all derive from the underlying data
structures in terms of which they are implemented. In this project we will design
and implement a program to read the on-disk representation of a file system,
analyze it, and summarize its contents. In the next project, we will write a program
to analyze this summary for evidence of corruption.

This project can be broken into two major steps:
Understand the on-disk data format of the EXT2 file system. We will mount a
provided image file on your own Linux and explore it with familiar file
navigation commands and debugfs(8).
Write a program to analyze the file system in that image file and output a
summary to standard out (describing the super block, groups, free-lists,
inodes, indirect blocks, and directories).

The second part may involve much more code than we have written in previous
projects, but the first part is (by far) the more difficult. Once the underlying data
structures are understood, the actual code is likely to be fairly simple.

RELATION TO READING AND LECTURES:

This project more deeply explores the filesystem structures described in Arpaci
chapter 39.
The images we will be working with are EXT2 file systems, as described in
sections 40.2-40.5.

PROJECT OBJECTIVES:

Primary: reinforce the basic file system concepts of directory objects, file
objects, and free space.
Primary: reinforce the implemenation descriptions provided in the text and
lectures.
Primary: gain experience researching, examining, interpreting and processing
information in complex binary data structures.
Primary: gain experience with examining interpreted and raw hex dumps of
complex data structures as a means of developing an understanding of those
data structures.

Secondary: gain practical experience with typical on-disk file system data
formats.
●
●
○
○
○
●
●
●
●
●

DELIVERABLES:

A single tarball (.tar.gz) containing:
(at least) one C/C++ source module that compiles cleanly with no errors or
warnings).
A Makefile to build and run the deliverable program. The higher level targets
should be:
default … compile your program to produce an executable named lab3a
dist … create the deliverable tarball
clean … delete all programs and output generated by the Makefile.
a README text file containing descriptions of each of the included files and
any other information about your submission that you would like to bring to
our attention (e.g. research, limitations, features, testing methodology, use of
slip days).

PROJECT DESCRIPTION:

Historically, file systems were almost always been implemented as part of the
operating system, running in kernel mode. Kernel code is expensive to develop,
difficult to test, and prone to catastrophic failures. Within the past 15 years or so,
new developments have made it possible to implement file systems in user mode,
improving maintainability … and in some cases delivering even better performance
than could be achieved with kernel code. All of this project will be done as usermode software.

To ensure data privacy and integrity, file system disks are generally protected from
access by ordinary applications. Linux supports the creation, mounting, checking,
and debugging of file systems stored in ordinary files. In this project, we will
provide EXT2 file system images in ordinary files.

Because they are in ordinary
files (rather than protected disks) you can access/operate on those file system
images with ordinary user mode code.

PREPARATION

To perform this assignment, you may need to study a few things:
debugfs(8). a tool for exploring on-disk file system structures.
pread(2), an alternative to read(2) for random-access file processing.
a comprehensive overview of the EXT2 file system format.
a slightly simplified version of the Linux header file that defines the format of
the EXT2 file system. Please use this header file and include it in your
submission. Do not assume that the standard header file will be available on
the test system.

PART 1: Exploring an EXT2 Image

In order to mount and explore a file system (even one stored in an ordinary file)
you will need the ability to run privileged commands (e.g. mount(8)). Since you will
not have sudo(8) access on departmental servers, you will have to do this
exploration on your own personal Linux system.
Download this (~2MB) file system image, and mount it (read only) onto your own
Linux, with the following commands:
mkdir fs

sudo mount -o ro,loop ext2_image fs
The loop option means that the file system image is stored in a file rather than on
a device. The ro option means read only (to prevent you from accidentally
changing the image).
Note: this file is 200 Mbytes in size.
Now, you can navigate the file system, just like an ordinary directory, with
commands like ls(1), cat(1), and cd(1). After you are done with it, you can unmount
with the following command:
sudo umount fs

Before you start writing your C/C++ program to interpret the diskimage file, you
can explore it further using debugfs(8) (on your own Linux system). You will, in the
process of writing your code, surely encounter many questions about how to
interpret the values in various fields.

Reading the specifications is seldom enough
to enable us to fully understand complex data structures. The supplied images and
exploration tools can be used to examine real instances of super blocks, group
summaries, I-nodes, directories, etc. Learning how to complement research with
experimentation to understand a complex system is a skill that you are expected to
develop and demonstrate in this project.

Some particularly helpful debugfs(8) commands are: stats, stat, bd, testi,
and testb. While debugfs can interpret data structures for you, you may find that a
simple hex dump of the associated block provides you with more detailed
information.

Warnings

If you mount the trivial.img image read/write, even read commands (like ls(1)) will
cause changes to I-node access times. If you want to be able to compare your
analysis with the golden trivial.csv output we have provided, you must work from
an un-modified version of trivial.img.

To ensure you are correctly interpreting the file system image, we have included
●
●
●
●
●
●
●
●
●
●
many unusual things, which might not be properly handled by a naive
implementation:
sparse files
very large files

allocated data blocks full of zeroes
unallocated blocks containing valid data
files with data beyond their length
files with long names
files with syntactically strange or non-ASCII names
directories that span multiple blocks, go beyond the direct blocks, and have
obsolete entries for deleted files

All of these are completely legal and do not represent any sort of corruption.

PART 2: Summarizing an EXT2 Image

In this step, you will write a C/C++ program called lab3a that:
Reads a file system image, whose name is specified as a command line
argument. For example, we may run your program with the above file system
image using the a command like:
./lab3a EXT2_test.img

Analyzes the provided file system image and produces (to standard out) CSV
summaries of what it finds. The contents of these CSV lines described below.
Your program must output these files with exactly the same formats as shown
below. We will use sort(1) and diff(1) to compare your csv output with ours, so
a different format will make your program fail the test.

Please note that, although you cannot mount the provided image file and
run debugfs on departmental servers, your lab3a program should, like previous
assignments, be able to run on departmental servers.
There are six types of output lines that your program should produce, each
summarizing a different part of the file system. Remember, you can always check
your program’s output against debugfs’s output. All the information required for
the summary can be manually found and checked by using debugfs. We have also
included (for testing purposes) a much smaller image as well as a
correct summary.

You are free to produce additional commentary to stderr, but only file system
summary information should be logged to stdout.
superblock summary
A single new-line terminated line, comprised of eight comma-separated fields
f.
g.
h.
i.
j.
k.
l.
m.
f.
g.
h.
i.
j.
k.
l.
m.
n.
f.
g.
f.

(with no white-space), summarizing the key file system parameters:
SUPERBLOCK
total number of blocks (decimal)
total number of i-nodes (decimal)
block size (in bytes, decimal)
i-node size (in bytes, decimal)
blocks per group (decimal)
i-nodes per group (decimal)

first non-reserved i-node (decimal)
group summary
Scan each of the groups in the file system. For each group, produce a new-line
terminated line for each group, each comprised of nine comma-separated fields
(with no white space), summarizing its contents.

GROUP

group number (decimal, starting from zero)
total number of blocks in this group (decimal)
total number of i-nodes in this group (decimal)
number of free blocks (decimal)
number of free i-nodes (decimal)

block number of free block bitmap for this group (decimal)
block number of free i-node bitmap for this group (decimal)
block number of first block of i-nodes in this group (decimal)
Note that most Berkeley-derived file systms (like EXT2) support blocks, and
fragments, which may have different sizes. The block is the preferred unit of
allocation. But in some cases, fragments may be used (to reduce internal
fragmentation loss). Block addresses and the free block list entries are based on
the fragment size, rather than the block size. But, in the images we give you, the
block and fragment sizes will be the same.

It is also the case that most Berkeley-derived file systems put a copy of the
superblock and group summary at the start of each group. But, in the images we
give you, there will only be one group.
free block entries
Scan the free block bitmap for each group. For each free block, produce a newline terminated line, with two comma-separated fields (with no white space).

BFREE
number of the free block (decimal)
Take care to verify that you:
understand whether 1 means allocated or free.
g.
h.
f.
g.
f.
g.
h.
f.
g.
h.
i.
j.
k.
l.
m.
n.
fp.
ff.
fg.
f.
g.

have correctly understood the block number to which the first bit
corresponds.
do not interpret more bits than there are blocks in the group.
free I-node entries
Scan the free I-node bitmap for each group. For each free I-node, produce a newline terminated line, with two comma-separated fields (with no white space).
IFREE
number of the free I-node (decimal)
Take care to verify that you:
understand whether 1 means allocated or free.
have correctly understood the I-node number to which the first bit
corresponds.

do not interpret more bits than there are I-nodes in the group.
I-node summary
Scan the I-nodes for each group. For each valid (non-zero mode and non-zero link
count) I-node, produce a new-line terminated line, with 27 comma-separated
fields (with no white space). The first twelve fields are i-node attributes:
INODE
inode number (decimal)
file type (‘f’ for file, ‘d’ for directory, ‘s’ for symbolic link, ‘?” for anything else)
mode (low order 12-bits, octal … suggested format “0%o”)
owner (decimal)
group (decimal)
link count (decimal)

time of last I-node change (mm/dd/yy hh:mm:ss, GMT)
modification time (mm/dd/yy hh:mm:ss, GMT)
time of last access (mm/dd/yy hh:mm:ss, GMT)
file size (decimal)
number of blocks (decimal)
The next fifteen fields are block addresses (decimal, 12 direct, one indirect, one
double indirect, one tripple indirect).
directory entries
For each directory I-node, scan every data block. For each valid (non-zero I-node
number) directory entry, produce a new-line terminated line, with seven commaseparated fields (no white space).

DIRENT
parent inode number (decimal) … the I-node number of the directory that
contains this entry
h.
i.
j.
k.
l.
f.
g.
h.
i.
j.
k.
●
●
●
●
●
●

logical byte offset (decimal) of this entry within the directory
inode number of the referenced file (decimal)
entry length (decimal)
name length (decimal)
name (string, surrounded by single-quotes). Don’t worry about escaping, we
promise there will be no single-quotes or commas in any of the file names.
indirect block references

The I-node summary contains a list of all 12 blocks, and the primary single,
double, and tripple indirect blocks. We also need to know about the blocks that
are pointed to by those indirect blocks. For each file or directory I-node, scan the
single indirect blocks and (recursively) the double and tripple indirect blocks. For
each non-zero block pointer you find, produce a new-line terminated line with six
comma-separated fields (no white space).
INDIRECT

I-node number of the owning file (decimal)
(decimal) level of indirection for the block being scanned … 1 single indirect, 2
double indirect, 3 tripple
logical block offset (decimal) represented by the referenced block. If the
referenced block is a data block, this is the logical block offset of that block
within the file. If the referenced block is a single- or double-indirect block, this
is the same as the logical offset of the first data block to which it refers.

block number of the (1,2,3) indirect block being scanned (decimal) … not the
highest level block (in the recursive scan), but the lower level block that
contains the block reference reported by this entry.
block number of the referenced block (decimal)
Logical block is a commonly used term. It ignores physical file structure (where
data is actulally stored, indirect blocks, sparseness, etc) and views the data in the
file as a (logical) stream of bytes. If the block size was 1K (1024 bytes):
bytes 0-1023 would be in logical block 0

bytes 1024-2047 would be in logical block 1
bytes 2048-3071 would be in logical block 2
…
You can confirm your understanding of logical block numbers by looking at
the INDIRECT entries in the sample output.
If an I-node contains a tripple indirect block:
the tripple indirect block number would be included in the INODE summary.
INDIRECT entries (with level 3) would be produced for each double indirect
block pointed to by that tripple indirect block.

INDIRECT entries (with level 2) would be produced for each indirect block
●
●
●
●
●
pointed to by one of those double indirect blocks.
INDIRECT entries (with level 1) would be produced for each data block pointed
to by one of those indirect blocks.

Sample Output

We have provided a very simple test file system image (that you can download and
test with) along with correct summary output. Your program should be able to
generate (modulo line ordering) the same output. The grading program will run
your program on a variety of other file system images, and check whether or not
your output is identical to the golden output. Any differences (even white space or
a case error) will result in a loss of all points for that test.

SUMMARY OF EXIT CODES:

0 … analysis successful
1 … bad arguments
2 … corruption detected or other processing errors
SUBMISSION:
Your README file must include lines of the form:
NAME: your name(s)
EMAIL: your email(s)
ID: your student ID(s)

Your name, student ID, and email address should also appear as comments at the
top of your Makefile and each source file. If this is a team submission, the names,
e-mail addresses, and student IDs should be comma-separated.
Your tarball should have a name of the form lab3a-studentID.tar.gz.
You can sanity check your submission with this test script.
Projects that do not pass the test script will not be accepted!
We will test it on a departmental Linux server. You would be well advised to test
your submission on that platform before submitting it.

GRADING:

Points for this project will be awarded:
value feature
Packaging and build (10% total)
3% un-tars expected contents
3% clean build with default action (no
warnings)
2% correct clean and dist targets
2% reasonableness of README contents
Results (85% total)

3% un-tars expected contents
3% clean build with default action (no
warnings)
2% correct clean and dist targets
2% reasonableness of README contents
Results (85% total)

10% superblock summary
10% group summaries
10% free block entries
10% free I-node entries
15% I-node summaries

15% directory entries
15% block references
Code Review (5%)
5% general organization and readability

CS111 Project 3A File System Interpretation Solved