Repository Organization
Introduction
This paper evaluates techniques for organizing files in a reusable
software repository. The ubiquitous hierarchical file system directory
structures are useful for organizing information. Traditionally,
software developers have used directories to organize their
software, using the containment properties of the file system to keep
together the files belonging to a particular software project.
Unfortunately, such an organization, while convenient and familiar, does
not promote software reuse across projects. Treating hierarchical
directory structures as dependency graphs, although a bit
counter-intuitive, allows us to organize source files in a way that
keeps them independent from any particular use, and thus usable in other
project contexts.
Containment
Viewing directory structures as a containment mechanism is a common
practice among software developers. Typically, a root directory
represents the project and often contains the "main" program entry
point. The sub-directories of the project root contain subordinate
files, which contain the subroutines invoked by the main program. This
technique is probably the result of the programmer thinking about file
organization as looking something like the call-graph in a top-down view
of a program.
Unfortunately, due to the hierarchical nature of the file system, such
an organization precludes other projects from using the same files
without first copying the files into the new project. As discussed in
my rant on reuse, copying is not reusing, and thus such an organization
is inappropriate within a software reuse context.
Furthermore, a call-graph is a Directed Acyclic Graph (DAG)
structure rather than a simple hierarchical structure since multiple
software modules can reference the same subroutine. This is a common
organizational flaw, and is probably the result of our familiarity with
hierarchical file systems, and our tendency to use that golden hammer as
a solution to all of our organizational problems.
Dependency
If you've read the drivel above, you have probably guessed that I'm
leading up to something. Simply stated, the key to software reuse is
dependency management. To be used in a context other than its original,
a software module must be independent of the original project context.
Reusable software has a dependency graph with a Directed Acyclic Graph
(DAG) structure. That is to say that there are no circular dependencies
in reusable software.
As we discussed above, however, the file systems where our reusable
source code resides are hierarchical. Since a DAG is a more general
structure than a tree, it is not possible to represent the dependency
graph directly in the hierarchical file system. Rather, it is necessary
to find a way to at least prevent the file organization from creating
dependency restrictions. To do this, we will analyze the dependency
properties of hierarchical file systems.
A hierarchical file system is essentially a name space. A file is
identified uniquely by specifying its absolute path name. Relative path
names are a convenience that limit the namespace scope to a particular
sub-tree. Any file or directory contained in another directory is
dependent on the directory that contains it. As evidence, consider what
happens if you attempt to delete the parent directory of a file. Either
a warning is issued, requiring you to delete the contents of the
directory first, or the directory and its contents are deleted. Stated
again, the content of a directory cannot exist without its directory,
and thus the content of the directory is dependent on the directory.
Note that this is contrary to the way that software developers typically
organize their files, where the sub-directories of a project contain
files upon which the project depends.
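The containment rule above can be written as a small predicate. This is
only a minimal sketch: the function name and the example paths are
hypothetical, and it assumes absolute, already-normalized path names
with '/' as the separator.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Returns true if `path` is contained in (and thus, in the sense used
 * above, dependent on) the directory `dir`. Both arguments are assumed
 * to be absolute, normalized path names using '/' as the separator. */
bool depends_on_directory(const char *path, const char *dir)
{
    assert(path != NULL && dir != NULL);

    size_t n = strlen(dir);
    assert(n > 0);

    /* Ignore trailing slashes so "/repo/net/" and "/repo/net" match alike. */
    while (n > 1 && dir[n - 1] == '/')
        n--;

    if (strncmp(path, dir, n) != 0)
        return false;

    /* The root directory contains every absolute path except itself. */
    if (n == 1 && dir[0] == '/')
        return path[1] != '\0';

    /* Require a separator right after the prefix, followed by a name,
     * so that "/repo/network" is not treated as inside "/repo/net". */
    return path[n] == '/' && path[n + 1] != '\0';
}
```

With this predicate, a header at /repo/net/socket.h depends on /repo/net
and on /repo, but not on a sibling such as /repo/net/posix.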
Pathnames in #include Specifications
For various reasons, many C/C++ software organizations refrain from
using directory path specifications in "#include" statements. Some of
the restrictions (e.g. absolute paths) are reasonable, while others are
rooted in history and mythology.
The biggest problem with absolute paths in #include
specifications is that they limit the location and number of private
build trees, and are thus rightly avoided. For example, if a developer is
to have two views of the same application, with each view having a
different version of the same included file, this would be impossible if
absolute path names are used. Yes, this is possible if you use the
ClearCase virtual file system. With most other "snapshot" style
repositories, however, this is a significant problem.
It is, however, quite easy for a build environment to support paths
that are relative to a single variable root. This is easily done using a
common compiler option to specify search paths for include files (e.g.
-I /home/mike/sandbox.)
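A toy model of that lookup shows why a root-relative specification works
where an absolute path does not. This is a sketch only: real compilers
search an ordered list of roots, and the function name, the second
sandbox path, and the header name net/socket.h are all assumptions made
for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Joins one include search root (as given with -I) and the relative
 * specification from an #include statement into a candidate file name.
 * Returns 0 on success, or -1 if `out` is too small for the result. */
int resolve_include(const char *root, const char *spec,
                    char *out, size_t size)
{
    assert(root != NULL && spec != NULL && out != NULL);

    /* root + '/' + spec + terminating NUL */
    size_t needed = strlen(root) + 1 + strlen(spec) + 1;
    if (needed > size)
        return -1;

    (void)snprintf(out, size, "%s/%s", root, spec);
    return 0;
}
```

The same specification, net/socket.h, resolves to
/home/mike/sandbox/net/socket.h in one build tree and to
/home/mike/sandbox2/net/socket.h in another, so the source files stay
identical across both views and only the root changes.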
Another reason that pathnames are avoided in #include
specifications has to do with the character used for directory
separators. For example, UNIX operating systems use a forward slash '/',
Microsoft operating systems use the backward slash '\', and older Apple
operating systems use a colon ':'.
With the advent of Apple's UNIX-based OS X series of operating systems,
this issue is improving. Even so, most compilers for the Apple
Macintosh can handle forward slashes as directory separators in #include
statements. However, if I recall correctly, at least one Macintosh
compiler in the past simply ignored the directory prefix in an #include
specification.
The Microsoft operating systems remain one of the largest impediments
to compatibility. However, most modern compilers (and arguably all
significant modern compilers) are able to work with either forward slash
or backslash separators. The primary exceptions to this rule are some
compilers that target small (e.g. 8-bit) or quite specialized
processors. However, any processor supported by the GNU GCC compiler has
at least one source of hope. Thus, while I acknowledge that there are
still circumstances in which directory separators are an issue, I will
assert that for common 32-bit processors, and even for many 16-bit and
8-bit processors, directory separators in path names are no longer an
issue.
Let's have a look at what happens if we do not allow pathnames in
#include specifications. In this case, at least one of the following
must be done.
- Copy all header files to a single directory.
- Copy all header files to one of a few directories.
- Add an option to the compiler command for every directory that
has a header file.
First, we could copy all header files to a single location.
- All namespace information must then be encoded into the name of
the header file.
- Historically, we may be limited to 8-character file names. That allows
approximately (26+10+~10)^8 possible file names, not all of which are
readable, viable, or reasonable (imagine header file names made up
entirely of digits). This is a real problem when we consider integration
with third-party code whose file names we cannot control. Not everyone
is as namespace-conscious as we are.
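For a sense of scale, the estimate above can be worked out directly. The
sketch below assumes exactly 46 usable characters (26 letters, 10
digits, and the roughly 10 punctuation characters mentioned above). The
function name is hypothetical, and the result is an upper bound on
raw names, not a count of sensible ones.

```c
#include <assert.h>
#include <stdint.h>

/* Number of distinct names of exactly `length` characters drawn from an
 * alphabet of `alphabet_size` characters: alphabet_size ^ length. */
uint64_t filename_combinations(uint64_t alphabet_size, unsigned length)
{
    uint64_t total = 1;
    for (unsigned i = 0; i < length; i++) {
        /* Guard against 64-bit overflow before each multiplication. */
        assert(alphabet_size == 0 || total <= UINT64_MAX / alphabet_size);
        total *= alphabet_size;
    }
    return total;
}
```

Under that assumption, filename_combinations(26 + 10 + 10, 8) gives
20,047,612,231,936, or about 2 x 10^13. The raw count is large; the
practical limit is how few of those names a person can read and how
easily two unrelated code bases choose the same one.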
POSIX/UNIX interfaces and the Linux kernel are two major places where
pathnames are used in the source code. For example, #include
<sys/types.h> is required for the open system call.
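A minimal sketch of such path-qualified includes in use, following the
traditional synopsis for open on a POSIX system. The helper function
name is hypothetical, and the sketch assumes a POSIX environment.

```c
#include <assert.h>
#include <sys/types.h>  /* directory-qualified system headers */
#include <sys/stat.h>
#include <fcntl.h>      /* open, O_RDONLY */
#include <unistd.h>     /* close */

/* Returns 1 if `path` can be opened for reading, 0 otherwise. */
int can_open_readonly(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return 0;

    close(fd);
    return 1;
}
```

The "sys/" prefix is resolved against the compiler's system include
roots, so the name types.h does not have to be unique across every
header directory on the machine.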
As discussed above, the hierarchical file system is a name-space, which
can be useful in maintaining a repository for software reuse.
Organizational Examples
The sections that follow present common useful organizational patterns.
Alternative Interface Implementation
This example of file organization arises from a particular type of
abstraction involving simple compile-time binding. It is common, for the
purpose of module portability across platforms, for an interface to be
defined (e.g. a set of functions and types) and placed in a header file.
The interface is then implemented for various other platforms in one or
more source (e.g. .c) files.
It is a common practice and mistake to place all of the files for a
module in a single directory. However, such an organization leads to an
undesirable dependency created by the files being located in the same
directory.
There are issues with compiling implementation (.c) files in the same
directory with the header file describing the interface. Assume that
there is a build file for each directory listing the files to be
compiled that is shared by all project clients, for example a makefile
in the directory with the sources. If more than one implementation of
the interface is located in the directory, how will the build system
choose which variant to build and link with the application? Although
it may be possible to perform magic in the build environment, such
solutions quickly become untenable and will need some amount of
customization for each instance.
A better solution is to have the interface header located in one
directory, and each alternate implementation located in its own
directory. The implementation directories may be located in a
sub-directory of the directory holding the interface header file, or in
any other directory that is not in the dependency path of the interface
header file (i.e. not a parent directory of the directory holding the
interface header file.)
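The layout described above might look as follows. This is a sketch under
stated assumptions: the directory and file names in the comments, the
interface function, and the client function are all hypothetical, and
the three files are shown in one listing only so that it compiles as a
single unit.

```c
#include <assert.h>
#include <stddef.h>

/* ---- file: filesys/separator.h (the interface) ---- */
/* Returns the directory separator character of the target platform. */
char directory_separator(void);

/* ---- file: filesys/posix/separator.c (one implementation) ---- */
char directory_separator(void)
{
    return '/';
}

/* A second variant, filesys/win32/separator.c, would define the same
 * function returning a backslash. Because each variant has its own
 * directory and build file, a project selects one by choosing which
 * directory to build, with no special cases in the build system. */

/* ---- client code: depends only on filesys/separator.h ---- */
size_t count_path_components(const char *path)
{
    assert(path != NULL);

    size_t count = 0;
    int in_component = 0;
    for (; *path != '\0'; path++) {
        if (*path == directory_separator()) {
            in_component = 0;
        } else if (!in_component) {
            in_component = 1;
            count++;
        }
    }
    return count;
}
```

The client binds to an implementation at compile and link time only.
Nothing in filesys/separator.h, or in the directory that holds it,
refers to either variant.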
[FIXME: finish this thread]
[mention the notion of "module" with respect to a single directory and
reuse]
Creational Interface Segregation
[Make this a separate paper]
This is a rant about the importance of separating the creational
interfaces of a module from its operational interfaces.
Forward declared types in the operational interface, actual/concrete
types known by the creational interface.
mike@mnmoran.org