VaMP's Basic Requirements for Valid Files and Directories



April 19th, 2022 by Diana Coman

Given a directory[1] as input, VaMP will loudly complain and promptly exit if it finds any invalid files or directories in it, meaning specifically any of the following: empty files, duplicate files, non-ordinary files, files not ending with a newline, a non-hierarchical structure of directories (for instance via a symlink that creates a loop at any level), empty directories. While some might therefore find VaMP quite picky or even *too* picky about what it accepts as valid input, there are informed reasons (see below) for every item on that list of refusals, and this makes all the difference: VaMP is not picky but discerning, and in the most helpful way too, as its explicit requirements enforce and help maintain a clean structure and the clarity that comes with it.

Why no empty files?
Because empty files are worse than meaningless: they are, by definition, void of content, pure form if there ever was such a thing. First of all, if a file is empty, there is nothing to include anyway, so there's no loss to delete it, is there? Second, since VaMP is concerned with *content*, it follows quite directly that files lacking any content whatsoever are not a valid input for it.

Why no duplicate files?
Because there's no acceptable reason to duplicate *content* rather than reference it in a civilised and considerate manner. And there is a high cost to duplicating content, made all the higher for it being generally paid at a later time and by someone other than the person lazily copying instead of appropriately referencing. Yes, VaMP is by design anti-copying and pro-referencing and I firmly intend it to remain this way.
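As an illustration only, and not VaMP's own code, a duplicate check of this kind can be sketched in Python by grouping files on a hash of their content. The function name `find_duplicate_files` and the choice of SHA-256 are assumptions made for this sketch.

```python
import hashlib
import os
from collections import defaultdict


def find_duplicate_files(root):
    """Group regular files under root by a hash of their content.

    Returns a list of groups; each group lists the paths of files whose
    content is identical as judged by SHA-256 (an assumption of this
    sketch, not necessarily the hash VaMP uses).
    """
    by_digest = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Only plain regular files are hashed; symlinks are skipped.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            by_digest[digest].append(path)
    # Keep only the digests shared by more than one path.
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]
```

Any non-empty result would correspond to the "duplicate files" complaint: the same content stored under more than one path instead of being referenced.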

Why only ordinary files?
Because in the "everything is a file" model, this is the only currently available means - if a feeble and unreliable one - for marking in any way the intent of focusing on text meant to be read and comprehended by people first of all. Ideally this filter would be stronger, directly excluding binary files for instance, but the reality is that there is no reliable way for the machine to decide 100% accurately on what is "binary" and what isn't: from the machine's point of view, *all* content is binary anyway and so it's always, and out of necessity, the user's decision to make.
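For a rough idea of what such a filter can look like, here is a minimal Python sketch that flags anything that is neither a regular file nor a directory. The function name is hypothetical, and treating symlinks as non-ordinary is an assumption of the sketch rather than a statement of what VaMP does.

```python
import os
import stat


def non_ordinary_entries(root):
    """Return paths under root that are neither regular files nor directories.

    lstat is used so that a symlink is judged as a symlink and not as
    whatever it points to; FIFOs, sockets and device files are flagged too.
    """
    flagged = []
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.lstat(path).st_mode
            if not (stat.S_ISREG(mode) or stat.S_ISDIR(mode)):
                flagged.append(path)
    return sorted(flagged)
```

Note that a check like this says nothing about whether the content is text or "binary": that part remains the user's decision, as described above.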

Why only files ending with a newline?
Because a file not ending with a newline is not a valid file according to the very definition of the "file" format[2]. Moreover, and in more practical terms, VaMP always creates correctly formatted files and, as a result, whenever it applies a patch it *will* end the file with a newline, meaning that you risk ending up with a different file just because the "original" was an invalid file to start with. It's not VaMP's job to correct files, nor is it desirable that such correction occurs silently anyway.
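The check itself is cheap, since only the last byte of the file matters. A minimal Python sketch, with a hypothetical function name and not taken from VaMP:

```python
import os


def ends_with_newline(path):
    """Return True if the file is non-empty and its last byte is a newline."""
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        if handle.tell() == 0:
            # An empty file has no final newline either.
            return False
        handle.seek(-1, os.SEEK_END)
        return handle.read(1) == b"\n"
```

In this sketch an empty file fails the same test, which fits the view that it does not meet the format requirements in the first place.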

Why only strictly hierarchical directories?
Because any non-hierarchical structure in the given context is utterly broken and ultimately confusing for someone trying to read and understand the content. VaMP is firmly pushing for clarity and meaning at all levels and this starts with an unambiguous structure of the given directories, with as many levels as you require and can handle perhaps but certainly *without* any loops. If your content requires for whatever reason such horror as loops in *how the directories or files are grouped together* then I dare say you have bigger problems to sort out first before worrying about VaMP at all.
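One way to detect such loops, sketched here in Python under the assumption that symlinks to directories are followed during the walk, is to remember the resolved location of every directory visited and to report any directory that is reached a second time. The function name is hypothetical and this is not VaMP's implementation.

```python
import os


def find_revisited_dirs(root):
    """Walk root following symlinks and report directories reached twice.

    A strictly hierarchical tree reaches each directory exactly once, so
    any path listed in the result breaks the hierarchy (for instance a
    symlink pointing back at one of its own ancestors).
    """
    seen = set()
    revisited = []
    stack = [root]
    while stack:
        current = stack.pop()
        real = os.path.realpath(current)
        if real in seen:
            revisited.append(current)
            continue  # not descending again is also what stops a loop
        seen.add(real)
        with os.scandir(current) as entries:
            for entry in entries:
                if entry.is_dir(follow_symlinks=True):
                    stack.append(entry.path)
    return sorted(revisited)
```

A symlink that points to a directory outside the given root is not caught by this sketch; it only reports directories that can be reached by more than one path.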

Why no empty directories?
Because empty directories are not content and so don't belong under VaMP. Note that a directory containing only other directories but no files is still considered to be empty and thus rejected by VaMP. The reason for this is that content is in fact strictly stored in files, while the whole set of directories (hence, paths) is merely a temporary mapping of content into some structure that is designed - hopefully - for aiding comprehension. As such, a path that doesn't lead to a file is worse than meaningless as it is a burden on the reader, since it adds complexity without being justified by any part of the existing content. Certainly, it can and it does happen that the content justifying that part of the structure is simply not yet available but in this case, the correct options are either to collapse the structure until the content becomes available (thus *not* burdening the reader with it at all) or otherwise to provide at least a placeholder file with a brief description of the intended content (thus providing at least a hook for the reader to hang that bit of added structure on, until such time when the full content is available).
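The rule that a directory holding only other directories still counts as empty can be sketched in Python as follows. The function name is hypothetical and the code is an illustration of the rule as stated, not VaMP's own.

```python
import os


def dirs_without_files(root):
    """Return every directory under root (root included) that does not
    directly contain at least one regular file."""
    empty = []
    for dirpath, _dirnames, filenames in os.walk(root):
        has_file = False
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Symlinks are not counted as content in this sketch.
            if os.path.isfile(path) and not os.path.islink(path):
                has_file = True
                break
        if not has_file:
            empty.append(dirpath)
    return sorted(empty)
```

On a tree shaped like the newver directory in the example run that follows, this sketch would report newver/t2 and newver/t2/t4: both contain only a subdirectory, while t5 holds file.a and the top level holds manifest.vamp.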

More as a continuation than an ending, here's an example run with complaints:

$ vamp diff oldver newver
STATUS: Diff-ing directories oldver and newver to file listing_oldver_newver.diff with hash length -1
STATUS: Entering Walk_Dir with dirname=t1
STATUS: Entering Walk_Dir with dirname=t1/t3
STATUS: Entering Walk_Dir with dirname=t2
STATUS: Entering Walk_Dir with dirname=t2/t4
STATUS: Entering Walk_Dir with dirname=t2/t4/t5
ERROR: empty files/dirs are not acceptable, sorry. Found to be empty:
newver/empty.file
newver/t2/t4
newver/t2
ERROR: VaMP main: Directory newver contains duplicates and/or empty directories and/or special files that are not acceptable, sorry. See the list(s) provided. Kindly clean these up and run VaMP again.

$ tree oldver
oldver
├── manifest.vamp
└── t1
    ├── file.b
    └── t3
        └── file.a

2 directories, 3 files
$ tree newver
newver
├── empty.file
├── manifest.vamp
└── t2
    └── t4
        └── t5
            └── file.a

3 directories, 3 files

  1. Note that the content of this can be anything whatsoever: as long as the content and its authors are important to you, you'll benefit from having it under VaMP. "Code" is at most a *subclass* of text, not the other way around and as such, the first and foremost requirement for it should be that it's clear and readable and that its authors are people you actually trust as competent and reliable for the task, as per your definition of trust, competency and reliability. 

  2. This also means, of course, that an "empty file" is even more nonsense than mentioned already since it doesn't even match the format requirements to be called a file in the first place. 


3 Responses to “VaMP's Basic Requirements for Valid Files and Directories”

  1. Jacob Welsh says:

    Although there's no fully accurate mechanical way to decide what's "binary" - much less what one might really like to ensure, like whether a text is meaningful or correct - why not at least exclude non-printable-or-whitespace characters? Because it seems to me that those certainly cannot be text; and you have the notion of "character" already rather than just dumb bytes once special meaning is assigned to the newline.

    Are symlinks recognized? I would have thought them excluded as non-ordinary files, but the explicit forbidding of loops seems to suggest otherwise.

    What is that definition of "file" you mention? Because footnote 2 wouldn't have followed from my existing notions, eg 3.395 Text File in POSIX.1-2008 allows for zero lines.

    Is "executable" status recognized, or file mode bits more generally?

    Finally as to the stricter notion of empty directories where, I gather, each directory must *directly* contain at least one regular file, a couple examples came to mind from prior practice:

    1. Java projects containing "package" code, where the idea seemed to be to fit into some global namespace which would nonetheless never be more than partially populated in a given tree, and "import com.sun.foo" would map to "com/sun/foo". I found this annoying.

    2. Most of the TMSR V projects, where the top level was empty but for a "project name" subdirectory. Evidently I found this dubious since I didn't do it at first!

    3. musl provides at least two examples: a. the "arch" directory, which contains the architecture-specific parts of the header collection, organized into 16 subdirs (one per arch plus one generic); b. the "src" directory, organized into 43 logical submodules (perhaps reflecting different levels and areas of the standards, or things that were traditionally separate libraries altogether). The makefiles and whatnot that pull them together are at top-level only. This seems the most interesting case because I'd expect to see more projects looking like this except that most aren't as focused on tidiness as musl! I have trouble arguing that they did something wrong here or that there would be value in a dummy placeholder file just to fill the space. But what do you think?

  2. Diana Coman says:

    > why not at least exclude non-printable-or-whitespace characters?

    Mainly because I see it as a partial solution at best and so I think it belongs more as a filter than as a basic requirement like those listed here (the main difference being that filters can be selected and applied by the user as/when they see fit, while the basic requirements are fixed and simply enforced at all times). Specifically, while I don't see any case in which I would actually want to sign any patches with such characters, I don't particularly see why the diff, for instance, would refuse to run on such files: it's comparing byte with byte after all.

    > Are symlinks recognized? Is "executable" status recognized, or file mode bits more generally?

    Currently at least, not specifically, although I'm on the fence for it. The part of "what is an ordinary file" is the fuzziest at the moment and left like this until practice pushes for more clarity really. It's part of the list in this article though precisely because there are some hard limits anyway and so there's no point in promising to handle everything that is otherwise considered a file by unix standards.

    > What is that definition of "file" you mention? Because footnote 2 wouldn't have followed from my existing notions, eg 3.395 Text File in POSIX.1-2008 allows for zero lines.

    Good point, as it makes it quite clear that I really have to be more precise because the notion of "file" is so stretched (to cover everything that is pushed by that "everything is a file" model) that it's more of an umbrella term than anything specific, so not very useful in this context. What I had in mind was "source file that is not empty" (as per ISO/IEC 9899:1999 for the C programming language, section 5.1.1.2, meaning specifically "A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place."). You are right though that footnote 2 is not a direct follow-up of any of the definitions of "file" as such, since they all go to pains to specifically include the exception of the empty file, at the very least. For the record and since I'm going through it again anyway, let's preserve and see here grouped together the relevant definitions from your link:

    3.164 File
    An object that can be written to, or read from, or both. A file has certain attributes, including access permissions and type. File types include regular file, character special file, block special file, FIFO special file, symbolic link, socket, and directory. Other types of files may be supported by the implementation.
    3.194 Incomplete Line
    A sequence of one or more non-<newline> characters at the end of the file.
    3.205 Line
    A sequence of zero or more non-<newline> characters plus a terminating <newline> character.
    3.403 Text File
    A file that contains characters organized into zero or more lines. The lines do not contain NUL characters and none can exceed {LINE_MAX} bytes in length, including the <newline> character. Although POSIX.1-2008 does not distinguish between text files and binary files (see the ISO C standard), many utilities only produce predictable or meaningful output when operating on text files. The standard utilities that have such restrictions always specify "text files" in their STDIN or INPUT FILES sections.

    Even leaving aside the interesting idea of organizing characters into zero lines, there is some contradiction in a text file "containing characters" while being empty at the same time. Possibly even more interesting, an incomplete line would technically fit the bill of "organizing characters into zero lines" I suppose, opening up a whole gray area of what "zero lines text file" even means: is it one not containing any character or one containing "zero lines" aka potentially an incomplete line (since funnily enough, an incomplete line cannot contain zero characters, even though a file can contain zero lines)? Fun aside though, I think that all this muddle has deeper roots that are rather way beyond my scope here. So I'll just work with the "VaMP file" if need be: a sequence of one or more lines.

    > Finally as to the stricter notion of empty directories where, I gather, each directory must *directly* contain at least one regular file

    Correct and it brings more clarity once enforced, specifically because it points out any lacks/issues otherwise. Basically your first 2 examples there are exactly making this point: in both cases, the empty directories were pushed on by other lacks/deeper trouble but in practice they were/are a burden actually (and yes, in that context and at that time one might end up carrying that burden for lack of any better alternative or in other words for failure to identify the root cause and/or have a solution for it). As for the 3rd example, at a. the "generic" collection seems to be the exact content for the "arch" directory directly, since there isn't any "generic-specific", is there (ie each subdir in arch is some specific architecture, but the generic is just that, the part that belongs naturally at the top-level arch node). At b, there probably is some core part that again would be the exact reason of "src" to exist in the first place and as such would be more naturally found directly in src. Basically the stricter requirement wrt directories comes from seeing the directory structure as a taxonomy describing the content rather than some sort of file-cabinet/physical grouping into fixed containers or some such.

  3. [...] discussion of what are valid files and directories for VaMP was published on the 19th of April [...]
