Welcome to a reference collection of tips from my documentation reviews on the debian-l10n-english mailing list, which I'm currently updating with some extra notes for debian-doc and debian-publicity in general.
These notes could go on the Debian Wiki, if it wasn't for the fact that
typing a paragraph or two of text into my web browser is enough to
remind me that editing text is easier in a text editor.
Besides, I don't want to have to defend my notes against
well‐intentioned sabotage by people who half‐remember some piece
of mumbo‐jumbo handed down to them by their English teacher; this
may be a prescriptive style guide, but it's one primarily designed
to help people write the way competent native speakers really do
in the twenty‐first century. The idea is that next time I'm
reviewing something claiming to be “an unix software
that allows to run an own irc‐based proxy” I'll just be able
to point at prefabricated summaries of what's wrong with it.
(Yes, that's how I'm highlighting “bad example” usages.)
First I'd better get out of the way some “grammar folklore” rules with no particular basis in linguistic reality. They have never been real features of the grammar of English as used by even the most universally admired writers – they're delusions propagated by people who want to be able to look down on all the members of the general public who fail to obey their imaginary rules. We might nonetheless choose to abide by these taboos just to avoid the arguments.
Some prescriptivists insist on usages like “none of you knows
whom he would choose if he were I”, even though they're long
extinct in most brands of natively spoken English. Following
their advice is a good way of making yourself sound as if you were
brought up on the Lost Island of Snooty Robots.
English offers plenty of opportunities for picking the wrong word. Sometimes it even seems to be systematic about it; for instance, it often presents a three‐way choice between “‑ing” noun, plain noun, or “‑ation” noun, all of them more or less synonyms (“some counting”, “a count”, “a computation”). The “‑ing” words can be tricky to fit into a sentence, since they keep some of their old verbal habits, while “‑ation” words tend to be fancy and abstract.
This one crops up so often I'm putting it right at the top.
You can't “allow to” do something (as in “this
option allows to compile code”). You can say that
“this option allows you to compile code”, or “this option allows
code compilation”, or even “this option allows code to be
compiled”; but if there's no direct object noun phrase immediately
after the verb, it's almost certainly ungrammatical (and the same
goes for “permit to”). Native‐anglophone readers will know
what you mean, but they'll also suspect you've got a funny accent.
Besides, unless the software is something like PAM, how likely is it that it literally “allows” me to do something otherwise forbidden? It enables or simplifies doing things, or helps me do them, or simply does them.
Well known cases where the English word doesn't mean what speakers of most European languages expect.
beware of… | when you mean… |
actual | current |
arrive | succeed |
conscience | consciousness |
consequent | consistent |
demand | request |
especially | specifically |
eventual | random/possible/hypothetical |
experiment | experience |
few | several/a few |
funny | fun |
mention | give/specify |
pretend | claim |
relative | relevant |
respective | corresponding/appropriate |
sensible | sensitive |
Each of the following words has more than one well established idiomatic meaning, so you need to be aware of the possible misinterpretations.
By which I mean an obviously incomplete survey of syntax, morphology, and so on. If you're looking for apostrophe‐pedantry, it's filed under Orthography.
English has four basic types of relative clause.
If in doubt, don't overlook the option of cutting it into two or more separate sentences.
Definite vs. indefinite vs. nothing is far too complicated to explain here beyond the rule of thumb that the definite article “the” is for when both writer and reader can identify the thing being referred to.
The question of whether it's “the file FOO” or “the FOO file” is a similar issue of information management, since the answer is that it's either, depending mainly on what's news and what's background knowledge:
Non‐native speakers also tend to have trouble guessing whether to refer back to a previously mentioned idea with “this” or “that”. This can be really difficult (that was an example).
On the other hand, the rule for whether the indefinite article is “a” or “an” is clear‐cut as long as you ignore the spellings – what you need to know is how the following word is pronounced, and whether it begins with a consonant sound (“a laptop”, “a one‐off”, “a USB device”) or a vowel sound (“an option”, “an hour”, “an xterm”). Unfortunately, there are a few debatable cases, since some of the things we may need to refer to don't have established consensus pronunciations. Is it “a shlib” or “an shlib”? How about “fsck” or “SASL” or “URL”? Sometimes people even alternate between saying “an /etc/hosts file” (pronouncing it “etceterahosts”) and “a /etc subdirectory” (“slash‐ee‐tee‐cee”).
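As a rough illustration of why spelling alone won't do, here's a toy Python sketch. The function name and the two exception sets are hypothetical, seeded only with the examples above; a real tool would need proper pronunciation data, and this makes no attempt to settle the debatable cases.

```python
# Toy sketch: choose "a" or "an" by sound, not spelling.
# The exception sets are hypothetical and contain only examples from the text.
VOWEL_LETTERS = set("aeiou")

# Spelled with an initial vowel letter, but pronounced with a consonant sound.
CONSONANT_SOUND = {"one-off", "usb"}
# Spelled with an initial consonant letter, but pronounced with a vowel sound.
VOWEL_SOUND = {"hour", "xterm"}


def indefinite_article(phrase):
    """Return "a" or "an" for the first word of the phrase."""
    words = phrase.split()
    if not words:
        raise ValueError("empty phrase")
    first = words[0].lower()
    if first in VOWEL_SOUND:
        return "an"
    if first in CONSONANT_SOUND:
        return "a"
    # Fallback guess from the spelling; wrong whenever sound and spelling
    # disagree and the word is missing from the exception sets.
    return "an" if first[0] in VOWEL_LETTERS else "a"
```

The fallback is exactly the spelling-based guess that the rule warns against, which is why anything with an unusual pronunciation has to go into one of the exception sets by hand.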
There's no room here for a full explanation of the rules of the English tense system (besides, if I said that technically it has a grand total of two tenses I would only confuse people…) but here are some hints for the bits I see causing trouble most often.
Watch out for the subtle distinction between the simple past
tense (“this happened”), marking events as over and done with, and
the perfect construction (“this has happened”), marking them as
news with continued relevance (not quite the same thing as being
recent). There are slight differences in usage between
dialects, but basically, a warning message saying “FOO was
broken” suggests that it is now fixed; a warning message saying
“FOO has been broken” implies the opposite. To
complicate matters, there's a fairly strict rule outlawing past
datelines with present perfects, as in “we've released
a new version last week”; the fix is usually to go back
to simple past “we released a new version last week”.
English has a system of “sequence of tenses”, where past tense marking on a main clause spills over onto subclauses: “I said my name was Sam”. This can even happen when the tense mark on the main clause doesn't really indicate past time: “I could stop tomorrow if I wanted”.
The “used to” construction, as in “I always used to get this
wrong”, has the annoying quirk of lacking any present tense
equivalent – it would be logical if you could carry on
with “…and in fact even now I'm still using to get it
wrong”, but alas, natural languages don't run on logic!
What's more, even when you get the grammar right, the “used to”
construction can easily lead to confusion. For example, “the
software used to do this” would be fine in speech (since the
past‐habitual marker is pronounced “yoost” instead of “yoozd”),
but it can be ambiguous when written down.
Some dialects have complex rules for when you should use “shall” rather than “will”, but not mine – grep tells me I only use “shall” when I'm quoting something that includes the word “shall”.
All of the following things are singular in English (or at least, it would be grammatically correct to follow them with “is here” rather than “are here”):
Non‐count nouns also take singular agreement: “all software is fallible”, and so is “mathematics”, or (this side of 1950) “data”. On the other hand, “a lot of people are here”, while “Alice and/or Bob” can go either way.
Although “each politician” takes singular agreement, and faces are ordinarily distributed on a one‐per‐person basis, it's entirely non‐satirical to say “the politicians showed their faces”. “Their face” would imply it was shared.
Nouns that modify other nouns are usually unpluralisable (just
like the adjectives they resemble), so a collection of managers of
windows is a “window manager collection”, not a
“windows managers collection”. But
then again, a conference of managers of events is quite likely to
be an “events managers conference”, and I can't offer any sort of
rationale for these exceptions.
Most adjectives can occur either before the thing they describe or
after a linking verb (“lonely Jim is lonely”); a few can't (“the
lone ranger is alone” vs. “the alone ranger is
lone”). The word “own” may resemble an adjective, but
it isn't allowed to appear in either position without the support
of a possessive word. “Its own name” is fine, but “an
own name” has to become “a name of its own”.
Nouns “used as adjectives” don't behave exactly like natural
adjectives. They pile up immediately before their head noun,
never mixing in with the adjectives to participate in phrases like
“a simple, shell, useful script”.
The “dangling modifier” is another of those deprecated
constructions that native speakers get away with all the time:
“after reinstalling my PC the bug got worse!”
Interpreted pedantically, this sentence claims that the bug
performed the reinstallation…
Matters of style are essentially arguable, but if you don't want my advice, you don't have to ask for it.
When people say something is “bad grammar”, what they often mean is that it obeys the grammatical rules of the wrong dialect, which is a stylistic issue. The real reason for avoiding slangy or dialectal usages isn't that they're inherently bad, it's that they're less universally understood, especially by readers who are themselves non‐native speakers.
As you may have noticed, even though this page is itself written in my usual British‐English HTML style, the variety of English it recommends for debconf templates is the one that goes with an en_US locale. Other Debian subprojects use en_GB, or have no standard – and even in package description reviews we're often better off letting people follow whatever standard they know best rather than forcing them to adopt one they're uncomfortable with.
Educated American English isn't completely homogeneous anyway; and
where there's variation we need to avoid confusing or annoying
speakers of either variety. Take for example the
unpleasantly ambiguous phrase “in case”. For some
anglophones, the instruction “unplug your PC immediately just
in case of a short circuit” means “conditionally, if and
when a short circuit occurs, unplug your PC”; for others
(including me) it means “unconditionally, to avert a short
circuit, unplug your PC now”.
Using an informal register has the advantage that it can give a friendly impression; but there's also a risk that this chumminess may be unwelcome in a context where your readers just want you to get on with conveying information concisely and coherently. Spoken English tends to leave more things implicit, since a real‐world context normally makes what you mean instantly apparent. A classic example of a usage that's frowned upon in formal writing but taken for granted in conversation is the ambiguous use of “like”: does “options like FOO” mean “those options that resemble FOO, possibly excluding FOO itself, and certainly excluding options unlike FOO”? Or does it mean “any arbitrary option, such as FOO”?
Colloquial English often uses sequences of independent clauses, you just splice them one after another with nothing to signpost how they fit together, they're called run‐on sentences, like this, see? Constructions like that are deprecated in writing, but often all that's needed to fix them up is a few commas promoted to semicolons.
Addressing the audience directly with second person (“you/your”)
has advantages and disadvantages – it can make life
harder for translators – but first person
(“I/me/mine/
An excessively formal register should also be avoided. Convoluted uses of balanced antitheses within multi‐line relative clauses within hypothetical conditionals can be a very concise way of saying something, but they force readers to do extra work to “unpack” it. Even when your display of syntactic knotwork is technically perfect, if it bores everybody into skipping that paragraph you might as well not have written it.
Long, elaborate sentence structures can increase the risk of
scoping ambiguities: “One should not fail to avoid making a
foolish error and leave the button unpressed”.
On the other hand, once you start breaking everything up into
bite‐size chunks there's the danger you'll introduce referential
ambiguities: “There's a button above the off switch. The
off switch should be recognisable because it's red.
Press it.”
The impersonal pronoun “one” (as in “using this emulator one can play arcade games”) almost always strikes me as hopelessly formal; either replace it with generic “you” or rephrase the whole thing. Similarly, over‐reliance on passive verbs (“a test‐tube was heated…”) is generally unpopular. Contrary to its bad reputation, the passive voice sometimes provides the most natural and direct way of continuing a sentence (“walking in the door, I was greeted by my friend Pat, so I went over…”); but that's no excuse for saying “please note that it is important that the button should immediately be pressed” when you mean “press it!”
Revising a sentence to introduce or eliminate a passive construction is an opportunity for syntactic problems to creep in and leave you with your pronouns pointing at the wrong things:
NOTE that tags saying “NOTE” are a bad sign. Documentation is entirely constructed out of strings of notable points, tacked together into (preferably) coherent paragraphs. If you need to sprinkle it with labels saying READ THIS PART, that probably means it's a bit of a mess.
Gender‐neutralising by explicitly saying “he or she” is often clunky (though not as ugly as telling half the human race they don't count as people). If you want to avoid breaking the taboo against “they” with a singular, there are some alternatives that avoid the issue:
Avoid unnecessary redundancy and repetition. Even if it makes sense to refer to the same thing several times, it's considered poor style in English to use the same word repeatedly unless it's deliberate emphasis. This rule can cause a lot of trouble if you're trying to describe how users usually used to use useful userspace usage‐monitors…
This is the field where I'm most likely to be bossy, since
languages and writing systems are two different kinds of
thing. Once there's a community of mother‐tongue
English‐speakers who have grown up talking about “less items”,
complaints from people who say “fewer items” are
pointless – it's one of the ways English is spoken, so
it gets to be listed in dictionaries. But orthographies are
artificial rule‐systems propagated via schools, and have no native
speakers. If you spell it as “fiewer itoms” then
you're just failing to comply with the standard.
If you run lintian with all the optional bells and whistles turned on, it has checks for quite a few common typos.
Yes, I'm an en_GB‐er myself, but US spellings strike me as a clear improvement in the vast majority of cases. The best known difference is that en_US expands “i18n” as “internationalization”, while en_GB mostly uses “internationalisation”. However, the OED prefers “‑ize” (as did “The Times” when I was young), and there are a few words that are “‑ise” in both systems, including “advertise”, “compromise”, “exercise”, “promise”, “revise”, “supervise”, and “surprise”.
Other major categories of divergence:
GB | US | Notes |
centre | center | (but always ogre, auger) |
colour | color | (but always glamour, error) |
dialogue | dialog | (but always fugue, Prolog) |
mediaeval | medieval | (but always aerial, era) |
travelling | traveling | (but always felling, feeling) |
The un‐American spelling “programme” still exists, as a British word for TV shows and the like, but these days the computer variety is always “program”.
“Disc/disk” is a strange one: it started as a regional spelling variation but has taken root as a technical distinction between the Compact Discs and other optical media standardised by European audio companies and the hard or floppy disks standardised by the US computer industry.
Package synopses are rather like titles, but that doesn't mean they take Lots of Uppercase Letters; the Developer's Reference recommendation is not to capitalise them. This doesn't mean that you should write “gNU”, though! We have to distinguish situational capitalisation, imposed by context, from lexical capitalisation, which is part of the spelling of a word. A normal word can vary from all‐lowercase to first‐letter‐uppercase to all‐uppercase depending on factors like whether it's at the start of a sentence or whether it's in a newspaper headline. But words like “GNU” or “Linus” or “English” involve letters that are inherently uppercase, written that way regardless of context.
Words with intrinsically lowercase characters are rare outside the world of science and technology (where it can mean the difference between “millitesla” and “megaton”). But in IT, strings such as “/usr/bin/perl” or “itsupport@example.org” often have to be invoked precisely verbatim, and even strings like “https” or “usb” may need to be entered in a configuration file in lowercase. The same logic is often applied to package names such as “awk” or “gnome”, which may be left uncapitalised at the start of a sentence in documentation – after all, “apt show GNOME” won't find anything. Rather than insist on a stylistic policy for this issue that requires people to agree on some particular obscure analysis, it's safest to advise keeping package names out of sentence‐initial position where possible.
Upstream software project “brand names” are a different matter, and are upstream's decision. If they call it “FOObar” or “FooBar” we should respect the capitals, but if their website calls it “the foobar project” it's not clear whether they're leaving it unmarked or declaring it uncapitalisable. Incidentally, does anybody have any idea under what circumstances it's appropriate for Debian documentation to label brand names as registered trademarks? My own suspicion is that there's never any serious reason for us to put such labels on anything; if we were going to get sued for not saying Microsoft® Windows® it would have happened decades ago.
One context where I'm happy to see what looks like titlecase in a package synopsis is for things like cups, where including the expansion as “Common UNIX Printing System” makes it easier to see at a glance that it's doing double duty as an explanation for the name as well as a description.
Compounds like “front end” tend to become “front‐end” and then “frontend” as the term gets used more. Programmers are often early adopters of new jargon, so there's an unfortunate tendency for documentation to be written in a style that's unfamiliar and offputting for the readers who need it most. Feel free to talk in your private shorthand on the development mailing list, but try to stick to the more newbie‐friendly forms (“file system”, “web server”) when you're addressing the wider public.
I know of a couple of gotchas: being “online” isn't the same as being “on line”, “plaintext” is not the same thing as “plain text”, a “username” is not the same as a “user name”, and “userspace” isn't “user space”. You'd think the hyphenated versions would make good compromise candidates, but that rarely seems to work… instead my own rule of thumb is: if Wikipedia still treats it as two words, that's what the average reader probably expects.
Structurally complex noun phrases tend to acquire hyphenation not because they're becoming single words but just to make it easier to distinguish (e.g.) a “real‐time machine‐translation system” from a “real time‐machine translation‐system”.
Extra hyphens also occur with phrasal modifiers like “an easy‐to‐use application”, but here they serve to mark the whole thing as a unit; the hyphens aren't needed when the same phrase appears after a linking verb (“it's easy to use”). You might think the same applies to multi‐word modifiers made up of adverb plus participle, as in “an easily used application”, but since these are never structurally ambiguous a hyphen is considered redundant.
(A cover‐term for backticks, apostrophes, and opening or closing single or double quotation marks.)
The rules for apostrophe use are an obstacle course of arbitrary complexities, where errors are usually spell‐checker‐proof (and the real joke is that they almost never cause ambiguity – we could get along happily with no apostrophes anywhere). English possessive apostrophes are particularly shambolic.
There's some debate about the use of apostrophes on inflected forms of numbers, acronyms, and so on (“GUI's”, “GPL'ed”, “1990's”). Most style guides recommend leaving them out (“one OS, many OSs”), but this advice isn't widely followed.
The “logical” style of quotation mark placement, where punctuation is kept outside the bracketing quotes unless it's part of the original text, is prohibited by many US style guides… so let's ignore them in favour of the Jargon File.
And then there's the question of single vs. double quotation marks
vs. fancy Unicode ones. I personally prefer to stick to
ASCII in contexts where users are likely to want to do
command‐line searches or use copy‐and‐paste. I also use the
“"” character by default, reserving the “'” character for use as
an apostrophe or second‐level quotation mark. Although
that's what I learned at school, people tell me it's the American
style; and by happy coincidence it's also the style preferred on
d‑l‑e, but as long as a given text is
consistent I won't object particularly. (Well… not unless
you're using ``TeX'' quotation marks,
that is. Please don't; I'm sure they would get typeset into
something beautiful if only they were being post‐processed by
LaTeX, but sitting there in my terminal emulator they'll
just look rubbish.)
Some writers use single quotation marks not to indicate quotations but as an ASCII workaround for tagging verbatim strings – the sort that I'm HTMLising here in a nonproportional font. Thus for instance they might say that 'remake' is yet another "simple" replacement for 'make'; this is all very well, but trying to apply it consistently would often make text look too fussy.
Lists where some of the items are themselves slightly complex often benefit from being rephrased (and in particular re‐ordered) for clarity. For instance, “it supports FOO, BAR, and BAZ with QUUX or QUUX2” is ambiguous in a way that “it supports BAZ with QUUX or QUUX2, plus FOO and BAR” is not. Another tactic is to upgrade the separators between list items from commas to semicolons:
Where a list is organised by bullet points, d‑l‑e has developed a sort of house style.
It features:
 * leading single‐indented asterisk (or maybe dash);
 * semicolon at the end of each item;
 * final period (full stop).
However, a simpler approach, less integrated into the surrounding text, is still okay by me as long as it's self‐evident what it's a list of.
 * Independent items
 * Asterisks
 * Capitalisation
 * No other punctuation (or not much)
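For the semicolon‐terminated house style, a minimal Python sketch of a formatter might look like this. The function name and its parameters are my own invention for illustration, not anything d‑l‑e actually uses.

```python
def format_house_list(intro, items, bullet="*"):
    """Lay out a bullet list in the semicolon-terminated house style.

    Each item gets a single leading space and a bullet; every item ends
    with a semicolon except the last, which ends with a full stop.
    """
    lines = [intro]
    for index, item in enumerate(items):
        terminator = "." if index == len(items) - 1 else ";"
        lines.append(f" {bullet} {item}{terminator}")
    return "\n".join(lines)


print(format_house_list("It features:", ["FOO", "BAR", "BAZ"]))
```

Passing `bullet="-"` gives the “or maybe dash” variant; the items themselves are inserted untouched, so keeping them structurally parallel is still the writer's job.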
Lists read more smoothly if items are kept structurally “parallel” – usually all adjective phrases, all noun phrases, or all verb phrases, not a mixture.
Avoid writing them like this.
 o broken parallelisms!
 o insufficiently similar;
 o Don't go together very well
Mind you, if it's only two or three bullet points it might work
better as a plain old sentence; lists with sublists are
particularly worth flattening. And although it's important
to make it clear whether the list is exhaustive, it's easy to
overdo it – there's no need to say “some of
its features, for example, include (but are not
limited to) FOO, BAR, and BAZ,
among many others”!
This is arguably outside the remit of a localisation mailing list, but while we're reviewing a piece of documentation it makes sense to do some fact‐checking and general editing.
The setting determines where the dividing line is between things being technical jargon and general knowledge. TLAs usually ought to be expanded or explained the first time they're used – and if they aren't used more than once, why waste time introducing the abbreviation in the first place? But that doesn't mean you need to interrupt your DIY Integrated Circuits HOWTO to explain what a “P.C.” is.
See the Debconf Spec and the existing Templates Style Guide (now part of the Developer's Reference).
Debconf dialogues should almost never need to mention debconf, or even “the installer”; these are technical implementation issues that should be transparent to the user. Besides, mentioning installation in the middle of an upgrade or dpkg-reconfigure run is just confusing.
When you need to give an example hostname, don't give free advertising to myhost.com, randomword.com, or foo.com; use an RFC‐compliant one like example.org.
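A quick way of checking example hostnames is to test them against the names reserved for documentation by RFC 2606. This is only a sketch: the function name is hypothetical, and it covers the RFC 2606 reservations and nothing else.

```python
# Names reserved by RFC 2606 for documentation and testing.
RESERVED_DOMAINS = ("example.com", "example.net", "example.org")
RESERVED_TLDS = ("test", "example", "invalid", "localhost")


def is_reserved_example_host(hostname):
    """Return True if hostname falls under an RFC 2606 reserved name."""
    name = hostname.rstrip(".").lower()
    labels = name.split(".")
    # Anything under a reserved top-level domain is safe to use.
    if labels[-1] in RESERVED_TLDS:
        return True
    # Otherwise it must be a reserved second-level domain or a subdomain of one.
    return any(
        name == domain or name.endswith("." + domain)
        for domain in RESERVED_DOMAINS
    )
```

With this, “www.example.org” and “myhost.example” pass, while “foo.com” and the sneaky “notexample.org” don't.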
It isn't necessarily appropriate to ask “would you like to reconfigure your server?” if the reader might be a sysadmin reluctantly following corporate guidelines for software installations on the company's server. All you know for sure is that it's up to the reader to answer the question “should the server be reconfigured?”
See DevRef 6.2.1 to 6.2.3 (and, salvaged from the archives, some old guidelines by Colin Walters).
Questions like how the software is implemented and what standards it conforms to can wait. The basic point of a package description is to announce what this .deb is for – what can it do to solve users' problems and make their lives more fun?
The project homepage is the easiest place to get this kind of text, but don't take that to mean you should just copy it word for word off GitHub: their blurb isn't designed to convey the same information as a package description. So “diverging from upstream” isn't an issue here any more than it's a problem that the man page is different from the FAQ.
Upstream blurbs may involve confusingly divergent specialised uses of terms like “distribution” or “contrib package” or (if you're unlucky) “free software”, and may be full of hard‐sell advertising copy designed to compete with some unmentioned proprietary equivalent. Remember that the interests of our users always take priority over the developer's ego; stick to an objective summary of the software's pros and cons.
Unless it's going in “Section: (lib)devel”, you should try to avoid “developerese”; the typical user only wants to know what your application is good for, not how it's implemented. If libeg-bin is part of EGlib and provides a utility called eg_tool, don't assume that's self‐evident, and make sure that the text makes sense as a description of libeg-bin. If the significance of the name isn't obvious, the extended description is a good place to put an explanation. (If it's a TLA you may be able to get away with just using the expansion as the package synopsis.) I'm the kind of user who finds it easier to get a mental handle on a piece of software if its name has some intelligible connection to its function, so I often ask “why the name?” in d‑l‑e package reviews. There seem to be quite a few programmers out there who are content to dub their project “yix” just because that's a quick and easy key‐sequence to type on a Dvorak layout, but that label will often be the first aspect of their brainchild that people encounter as they browse through the menus. Think of it as the most basic starting point of the user interface!
Reimplementations of existing software should be careful not to live in the past, phrasing their descriptions purely in terms of how libfoo-tng was an improvement on libfoo – especially if libfoo2 might have all the same features. At best, once libfoo-tng succeeds in becoming the standard implementation and libfoo vanishes from the repositories, users will be left relying on software archaeology to work out what purpose your package serves. And I can never resist pointing out just how eighties the fad of calling things “The Next Generation” is! Beware dated content – references to boot‐floppies or X11R6 support, game reviews assuming that 3D acceleration is a novelty, and so on. In fact it's a good idea to avoid claiming that your package is notably “modern” (in ten years when it's an orphaned relic that text will be an annoyance); say what its features are (e.g. “graphical”), and let readers make up their own minds about whether that's an advantage.
Some other varieties of Too Much Information:
The balancing act between too little information and too much is particularly hard for short descriptions. One thing you should usually leave out is the programming language – it might fit in the long description, but it's a waste of space to say that python-pylibpython-mcpython (Section: python) is written in Python. Use debtags!
The Developer's Reference says that package synopses should be (articleless) noun phrases referring to the package – that is, they should fit the template “$PACKAGE provides a/the/some $SYNOPSIS” (though the alternative two‐part format popular with large families of packages also has explicit DevRef backing). They should not follow the example of the man pages that base their description line on verb phrases (“$BINARY lets you $DESCRIPTION” or “$BINARY is designed to $DESCRIPTION”). The logic of standardising on noun phrases goes like this:
Apologies for the linguistics jargon (which strictly speaking isn't even accurate – I should be talking about N‑bars, not NPs). I advise non‐syntacticians just to focus on the template approach.
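The template approach is easy to try out mechanically. Here's a minimal Python sketch, with hypothetical helper names of my own: one function drops a candidate synopsis into the template so you can read the result aloud, and the other flags a leading article. It can't tell a noun phrase from a verb phrase; that part still needs a human reader.

```python
ARTICLES = {"a", "an", "the"}


def starts_with_article(synopsis):
    """Return True if the synopsis begins with "a", "an", or "the"."""
    words = synopsis.split()
    return bool(words) and words[0].lower() in ARTICLES


def template_sentence(package, synopsis):
    """Fit the synopsis into the "$PACKAGE provides a/the/some $SYNOPSIS" template."""
    return f"{package} provides a/the/some {synopsis}"


# "foo" is a made-up package name used purely for illustration.
print(template_sentence("foo", "window manager collection"))
print(starts_with_article("a window manager collection"))
```

If the printed sentence reads naturally once you pick one of “a”, “the”, or “some”, the synopsis is probably the right shape; something like “lets you manage windows” fails the test immediately.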
I suspect my reviews on the mailing list give the impression I'm some sort of nit‐picking dimwit, so please bear in mind that the best way of spotting typos, grammatical ambiguities, missing definitions, and so on is to approach the text from the point of view of somebody who doesn't already know what it's trying to say. If you find that sort of ignorance annoying, I apologise; but this may be an indicator that you should delegate the task of writing user documentation to others.
If you disagree with me about some point of grammar or style or whatever, don't worry; at the end of the day, it's the maintainer's decision, not mine, and you're welcome to join the mailing list to provide an alternative viewpoint!