FANTASTIC STATISTICS

Analysing Classic SF

01 Oct 2021 Justin B Rye

“Ninety percent of everything is crud.”
—Theodore Sturgeon

FOREWORD

Last time I wrote a book‐reviews web page, back in the nineties, I was sampling just three items from a stock of science fiction and fantasy titles numbering in the hundreds.  My collection could have been larger, but many of my old favourites were things I'd never bothered to get personal copies of – I had long been librarian of a local SF&F society, and friend‐of‐a‐friend connections to UK fandom kept a trickle of complimentary‐copy paperbacks circulating through my sitting‐room.  (Did I ever mention that Sir Terry Pratchett OBE, RIP once spent a weekend in the spare room of my student flat, when he was barely starting out as a full‐time author?)

In the twenty‐first century I decided I didn't want a paper collection anyway – what I want is a story collection.  If I switched to ebooks then apart from a few sentimental‐value volumes of Teach Yourself Sumerian and the like the physical copies could go to the charity shop on the corner.  This proved a fortunate idea given the number of times I've needed to move house recently, but electronic texts have other advantages too – as long as I make sure that the thing I end up owning isn't some limited license to access the book via someone else's software.  I'm not interested in that sort of rip‐off, I want the old‐fashioned system where the text is my property and I can do anything I like with it short of publishing copies.  For a start, I always convert my ebooks into a consistent HTML format so they'll work in any browser (including my throwback of a mobile phone); but the part that got me writing this page is that once I've done that I can also carry out all sorts of basic text analysis from the command‐line.  And thanks to all the old magazines that are out of copyright, it's getting easier and easier to end up with a moderately comprehensive collection of the big‐name SF award‐winners of the twentieth century (and even quite a few of the ones I might actually want to re‐read).  So here are some interesting facts, or at any rate facts, about my virtual bookshelf.

You might alternatively prefer to read my audits of “Golden Age” SF authors' predictions for what used to be the future, under Retro‐Futurology; or perhaps my exhaustive listing of all known types of fictional alien, under SF Exobiology Top Tropes.


GENERAL STATISTICS

Years

Just to put some sort of meaningful frame around this, I'm only considering SF/fantasy published between 1901 and 2000 (inclusive), since I've got a comparatively representative sample of the classics of the period; so all the best works of H. G. and Martha Wells are out of scope.  For the early decades all I've got to go by are things like the SFWA's Hall of Fame plus the more basic criterion of whether I've heard of them; so I've got nothing at all for 1902, and only one or two items per year (mostly in short story collections) for the rest of that decade, rising to at least a dozen per year by the thirties.  Then there are Hugos, Nebulas, Locuses, BSFA awards… for the fifties through to the end of the eighties I've got at least two‐thirds of the winning stories, which tends to mean the works by repeat winners, but it gets patchier as the number of awards increases and my selection is increasingly determined by my own personal tastes, with the complicating factor that I still haven't replaced all my paper books of this era.

I'm dividing microfictions/short stories/novelettes from novellas/novels/novelzillas strictly in terms of wordcount, which may not always match up with how they handed out Best Novella awards, but I'm gathering a lot of Other Stories along with the gold medallists anyway – especially all those fifties short stories.
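A wordcount split of this kind can be sketched in a few lines. The page doesn't say where the line is drawn, so the cutoff here is an assumption: 17,500 words, the boundary between novelette and novella in the Hugo and Nebula rules. The function name and the sample counts are illustrative only.

```python
# Assumed cutoff: the Hugo/Nebula boundary between novelette and novella.
# The page does not state which figure it actually uses.
LONG_WORK_MIN_WORDS = 17_500

def length_class(word_count: int) -> str:
    """Return 'longer' for novella length and up, otherwise 'shorter'."""
    return "longer" if word_count >= LONG_WORK_MIN_WORDS else "shorter"

# Illustrative word counts, not taken from the real catalogue.
for count in (100, 7_400, 17_500, 120_000):
    print(count, length_class(count))
```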

Authors

Even counting the authors isn't as straightforward as it should be, since some apparent solo authors are really team‐ups (like Lewis Padgett) or vice versa (like John Wyndham “and” Lucas Parkes).  Oh well, my collection has over 160 contributors in total – the one with the most awards may be Connie Willis, or counting another way Harlan Ellison, or if you use sufficiently contrived criteria (e.g. if we're counting only Hugos and Retro‐Hugos for twentieth‐century SF) then Robert A. Heinlein pulls ahead.

Stories

The borderline between individual self‐contained tales and episodes in an overarching narrative gets especially murky, but I've got a total of more than 1600 stories, of which four‐fifths are shorter works.  Which is shortest of all, though?  If I disregard the various trick zero‐word stories I've heard of but don't have copies of (as far as I can tell), the shortest complete story I happen to own is C. J. Cherryh's A Much Briefer History of Time at 100 words – though the big prizes rarely go to anything even as short as Isaac Asimov's Robot Dreams (Locus, 2018 words) or Arthur C. Clarke's The Star (Hugo, 2380).  The longest (not counting Clarke's The Longest Science‐Fiction Story Ever Told) is J. R. R. Tolkien's The Lord of the Rings: if we recognise it as a single story sliced up by the publishers, it's over a half‐million words.  Yes, of course I'm including the linguistic appendices!

Files

Once I exclude the things that aren't twentieth‐century speculative fiction (blurring off into fantasy/horror), my personal library amounts to over 35 million words: getting on for a quarter-gigabyte of HTML, though it comes down to sixty meg with decent text compression.  Thanks to my habit of merging things into omnibuses it's only 400 or so files – the biggest is Kim Stanley Robinson's Red/Green/Blue Mars trilogy (three stories, but one file) at 4.5 megabytes, while the tiniest, beating several files that are single short stories by one‐hit wonders, is also a multi‐story compendium: my anthology of Fredric Brown short‐shorts, at less than twenty kilobytes!
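Figures like these (words, raw size, compressed size) could be gathered per file along the following lines. This is only a sketch using the Python standard library: the choice of LZMA as the "decent text compression" is an assumption, as is the word-matching pattern, and the sample string stands in for a real ebook file.

```python
import lzma
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the markup."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def html_stats(html: str) -> dict:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces is an approximation: a word split by inline
    # markup would be counted as two words.
    text = " ".join(parser.chunks)
    raw = html.encode("utf-8")
    return {
        "words": len(re.findall(r"\w+(?:['’]\w+)*", text)),
        "bytes": len(raw),
        "compressed_bytes": len(lzma.compress(raw)),
    }

sample = "<html><body><h1>The End</h1><p>It wasn't the end.</p></body></html>"
print(html_stats(sample))
```

On a sample this tiny the compressed figure is meaningless; compression only pays off on book-length input.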


STORY TITLES

Most‐Used Title

I'm told the commonest title for an SF/Fantasy story overall is Homecoming, but there's no evidence of this among the ones I've got.  Sticking to my sample, if I take the most‐reused elements and assemble them into what ought to be the averagest story title possible then what I end up with is The Time Man, which has nonetheless never been used.  Instead the title that shows up most often is The End, which has been used for stories by Jorge Luis Borges, Fredric Brown (106 words!), Franz Kafka, Ursula K. Le Guin, and Murray Leinster (among others, but Lemony Snicket is post‐2001).  To make matters worse, the Le Guin short story is also known as Things, a title it shares with a Zenna Henderson story that definitely shouldn't be confused with The Things by Peter Watts (because again it's too late).
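Both tallies (whole titles and the words inside them) are one-liners with a counter. A minimal sketch, using a handful of titles mentioned on this page as a stand-in for the real catalogue; the duplicated entry represents two different stories with the same title.

```python
from collections import Counter

# Stand-in sample: titles mentioned on this page, not the real catalogue.
titles = [
    "The End", "The End", "The Star", "The Big Time",
    "The Alley Man", "Things", "Robot Dreams", "The Last Castle",
]

# How often each complete title is reused.
title_counts = Counter(t.lower() for t in titles)

# How often each individual word turns up in any title.
word_counts = Counter(w for t in titles for w in t.lower().split())

print(title_counts.most_common(1))
print(word_counts.most_common(3))
```

Picking the commonest word for each position in the title would be the obvious next step towards an "averagest" title, but how positions are aligned is a design choice the page doesn't spell out.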

That's the winner of “most stories for one title” worked out, but “most titles for one story” has to be a Barrington J. Bayley novel that I first read in paperback.  It said Soul of a Robot on the front cover, Soul of the Robot on a sticker on the back, and The Soul of the Robot on the title page.

The Clue Is in the Name

When Brian W. Aldiss's novel Non‐Stop was retitled Starship for the US market, that only gave away something that becomes obvious early in the story.  But as I reread old short story anthologies I can't help noticing the cheeky way some of them reveal the whole twist ending right in the title, at least in retrospect.  Since it's a spoiler to name them they're all rot13ed (with mouseovers) below.

(And then there are all the ones I mentioned above that have the end in the title.)
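For anyone unfamiliar with rot13: it rotates each letter thirteen places round the alphabet, so applying it twice gets you back where you started. A minimal sketch using the codec in Python's standard library, demonstrated on a title that has already been named here in the clear:

```python
import codecs

def rot13(text: str) -> str:
    """Rotate each ASCII letter 13 places; other characters are untouched."""
    return codecs.encode(text, "rot_13")

hidden = rot13("Starship")
print(hidden)          # Fgnefuvc
print(rot13(hidden))   # Starship
```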

Famous Titles First Published Together

I did a sweep to work out where things first appeared, and which issues of any magazine or original‐anthology series had the largest number of major stories.  There were a few issues of Galaxy with a crop of short fiction that I'd still pay good money for today (wait, at cover price? I'll take a dozen), and serials like Star Science Fiction Stories and Universe were also rather impressive, but the all‐time best value for money still has to be Dangerous Visions (ed. Harlan Ellison).  Even its worst stories are bad in a deeply seventies way, which is impressive for something published in 1967.

Title Length

I have a couple of obscure short stories with two‐letter titles, but it's more impressive when the short name is attached to a full‐length novel such as Yevgeny Ivanovitch Zamyatin's dystopia We, which was named equally snappily in the original Russian.  That was written too early to win a Hugo or anything similar, other than the coveted title of “first work banned by the Soviet censors”; among big‐name prize‐winners the smallest name must be Frank Herbert's Dune, with a four‐letter, three‐phoneme (for the author) monosyllable for what was at the time considered a sprawling epic.

Finding the longest title is slightly trickier, thanks to the subtleties of subtitles – we don't treat things like Sir Arthur Conan Doyle's The Lost World Being an Account of the Recent Amazing Adventures of Professor George E. Challenger, Lord John Roxton, Professor Summerlee, and Mr. E. D. Malone of the Daily Gazette as long titles.  Setting those to one side, the longest prize‐winner (beating several strong contenders by Ellison that I hope you'll excuse me not cataloguing) is the Connie Willis short story “The Soul Selects Her Own Society”: Invasion and Repulsion: A Chronological Reinterpretation of Two of Emily Dickinson's Poems: A Wellsian Perspective.


WORD FREQUENCIES

Definite Article

Here I'm tallying uses of the word “the”, and comparing this figure to other words, or to the total word‐count.  And no, Ellison's The Prowler in the City at the Edge of the World makes a strong start but has an undistinguished overall score of one definite article for every fifteen words.  Besides, it's only barely a novelette, and I don't want to have to work out word‐frequency figures for every individual short story – all the prizes would end up going to statistical freak microfictions.  This is a handy excuse for the way I assemble shorter works into by‐author best‐ofs.
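The "one in every N words" figure is just the total word count divided by the number of hits. A sketch of that calculation follows; the tokenising pattern is an assumption (runs of letters, with internal apostrophes allowed), and a different definition of "word" would shift the numbers slightly.

```python
import re

# Assumed definition of a word: letters, optionally joined by apostrophes.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def words_per_occurrence(text: str, target: str) -> float:
    """Return N such that `target` occurs once in every N words of `text`."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    hits = words.count(target.lower())
    return len(words) / hits if hits else float("inf")

# The title alone: four definite articles in eleven words.
title = "The Prowler in the City at the Edge of the World"
print(words_per_occurrence(title, "the"))
```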

The competition for highest density of definite articles immediately runs into a different eligibility question.  Should I include The Book of Imaginary Creatures by Jorge Luis Borges (one of the things I've still got because I had the paperback as a child)?  Probably not, but it has a ratio of one “the” in every eleven words!  And it's not just that it's full of definiteness due to being full of definitions; I own some similar “dictionaries” that are nearer the other end of the scale.  Borges's fiction is highly definite too, as are works by Jules Verne and Umberto Eco, so I suspect it's an artefact of being translated from a Romance language.  However, if I'm restricting the competition to narrative fiction and disqualifying the monster manual then Tolkien takes first place with The Silmarillion, whether or not I include all the appendices.  This seems like good corroborating evidence for his claims that he was a mere translator!

The story with the sparsest “the” supply is, as expected, Heinlein's The Moon Is a Harsh Mistress – a novel written in an invented Russian‐influenced variety of English of which most obvious feature is tendency to omit definite article (mostly: at one “the” in every 106 words it's still common, and indeed still the first word in the book).  Maybe doing it deliberately for effect sounds too much like cheating – although Heinlein's normal writing style is already the most consistently anarthrous in my library; but if we disqualified that one then the prize would go to E. E. “Doc” Smith's The Galaxy Primes with its score of one in 27.  At last I know what's so weird about that book's prose style, besides sheer unreadableness!

Doc Smith may have used the definite article less than usual, but it was still the word he used most.  I only have five ebooks where the commonest word is something other than “the”:

  1. The Moon Is a Harsh Mistress has “to” as its commonest word, though not because there's anything special about it; it occurs once in every 32 words, which is enough to put it in the lead after “the” drops out.
  2. The Door into Summer, also by Heinlein, has “I” ahead by a nose, and here it does feel as if this statistic tells us something about the story's focus of attention.
  3. John Varley's Press Enter ▮ has an even higher density of first‐person subject pronouns for its shut‐in narrator (356 times the figure for H. P. Lovecraft's The Dream‐Quest of Unknown Kadath!).
  4. Anthony Burgess's A Clockwork Orange has “and” as its commonest word, which is interesting given that it's another story written in an imaginary Russian‐influenced patois, though one where the obvious changes are all in vocabulary.
  5. Just like the above, William Hope Hodgson's The Night Land is written in an artificial idiom full of “and”s – though in this case despite the far‐future dateline it's painful mock‐antique English.
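Finding the commonest word in a text is a standard counter exercise. A minimal sketch, assuming lower-cased ASCII words with internal apostrophes kept together; the sample sentence is made up to show a first-person narrator overtaking the definite article.

```python
import re
from collections import Counter

def commonest_word(text: str) -> tuple[str, int]:
    """Return the most frequent word and its count (text must not be empty)."""
    words = re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower())
    return Counter(words).most_common(1)[0]

# Made-up sample, not a quotation from any of the books listed.
sample = "I said I would go, and I went to the door and the window."
print(commonest_word(sample))
```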

Indefinite Article

For “a(n)”, the range is narrower, since Heinlein's Loonie dialect betrays its fakeness by scoring fairly normally here.  The winners in more or less a dead heat at one indefinite article in every 26 words are two quite different stories: Philip José Farmer's The Alley Man and J. B. S. Haldane's My Friend Mr Leakey.  The least indefinite by a good margin is The Silmarillion again, balancing its high use of “the” by having barely one “a(n)” in every hundred words.  It also has a high density of “and”s barely short of The Night Land, pastiching a biblical style less ponderously than Hodgson.

Mind you, I notice that calculating it in bulk by author rather than by individual text gives me a different result: apparently my two most indefinitely-articulate authors on average are both short‐story specialists, Borges (Jorge Luis) and Porges (Arthur).  But wondering whether that was a sign of some significant overall difference in word frequency patterns between shorter and longer works has led me to conclude that (a) no, it doesn't seem to be and (b) my computer is too old for this.
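The reason the two methods can disagree is that pooling an author's texts weights each one by its length, so a single article-heavy short story gets swamped by a long novel. A sketch with entirely hypothetical counts (the authors and numbers are invented for illustration):

```python
# Hypothetical (indefinite_articles, total_words) pairs for each text.
library = {
    "Author A": [(40, 1000), (300, 9000)],
    "Author B": [(35, 1000), (350, 9000)],
}

def rate(hits: int, words: int) -> float:
    return hits / words

# Score every text separately...
per_text = {
    (author, index): rate(hits, words)
    for author, texts in library.items()
    for index, (hits, words) in enumerate(texts)
}

# ...and every author on their pooled totals.
per_author = {
    author: rate(sum(h for h, _ in texts), sum(w for _, w in texts))
    for author, texts in library.items()
}

best_text = max(per_text, key=per_text.get)
best_author = max(per_author, key=per_author.get)
print(best_text, best_author)
```

Here the single most indefinite text belongs to Author A, but Author B wins on the pooled figure.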

Other Function‐Words

The work that has the most “that”s is once again that oddity The Night Land, at one “that” in every 27 words.  The opposite extreme is dominated by the works of Jack Vance, and in particular The Dying Earth, which may have a similar far‐future setting but uses the word barely a tenth as often.

Vance might in fact have been one of the Knights Who Say “Ni!”, judging by his avoidance of the word “it” – he used it less than a quarter as much as R. A. Lafferty did.  But if I hadn't automated these tests I would never have noticed this, or even the way Sir Arthur Conan Doyle used “which” fifty times as frequently as Vernor Vinge!  Lovecraft, meanwhile, was sixteen times less into “you” than Alfred Bester was; and he covered both ends of the scale for “we”, from using it once a paragraph in At the Mountains of Madness to once in total in The Dream‐Quest of Unknown Kadath.

Content‐Words

A few miscellaneous statistics:

Missing Words

In each of the following I'm looking for the longest text in my collection that doesn't contain a given common word – searching case‐insensitively, and including inflected and derived forms, so “Examples” counts as a use of the word “example”:

In case you were wondering, the longest text without “example” is Le Guin's Earthsea series – no “instance”s, either!
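A search of this kind might look something like the sketch below. Matching the stem plus any trailing letters, case-insensitively, is an assumption about what "inflected and derived forms" covers: it catches "Examples" but would miss prefixed forms such as "counterexample" and irregular inflections. The function name and the sample texts are invented.

```python
import re
from typing import Optional

def longest_without(texts: dict[str, str], stem: str) -> Optional[str]:
    """Return the title of the longest text with no word starting with `stem`."""
    pattern = re.compile(rf"\b{re.escape(stem)}\w*", re.IGNORECASE)
    candidates = [title for title, body in texts.items() if not pattern.search(body)]
    # Length is measured in whitespace-separated words.
    return max(candidates, key=lambda title: len(texts[title].split()), default=None)

# Invented sample texts.
texts = {
    "Short tale": "No such word here.",
    "Long tale": "Many words but still nothing of that kind at all.",
    "Essay": "Examples abound, for example this one.",
}
print(longest_without(texts, "example"))
```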


DRAMATIS PERSONAE

Inhuman Interest

Does my library include any stories with no speaking parts for humans, meaning members of the species Homo sapiens or relatives/derivatives/fantasyland equivalents?  Well, I've got plenty of shorter works populated entirely by talking animals, or Martians, or archangels, or robots – Asimov alone must have written a dozen.  It's trivial to omit things for the duration of a short story, maybe as buildup to a twist ending; again for it to mean anything I'll have to limit this to longer works.  Once I do that the field is suddenly very limited, and for some reason it's full of names we've seen recently: Le Guin's Ekumen stories (except they don't count because everyone's related via the Hainish), Animal Farm (nope, Mr Pilkington), Star Maker (nope, the human narrator), Watership Down (has a human‐point‐of‐view chapter)… that leaves just Giant Killer making the length requirement by a whisker, plus almost anything by Iain M. Banks!  The Culture may be full of humanoids, but they're explicitly unrelated to us Earthlings.

And yet there's one type of human being that often turns out to be wildly underrepresented in the cast list, to the point where there's a rather famous benchmark for it:

The Bechdel Test

All that's required for a story to pass this test is that it has two significant (usually meaning at least named) characters who are women and who at some point talk to one another “on‐camera” about some topic other than a man.  It should be easy, right?  The protagonist only has to introduce herself to the empress on page 24 and that's it.  Indeed, since novels often open with characters already in conversation, you might think a substantial proportion would be page one passes.  But no: test grades that good are few and far between, and almost all of the cases I can find were published in the late nineties at the earliest.  One noteworthy exception from the seventies is that Varley's Gaea trilogy is a rare instant pass from book one, line one (in Titan).

I have to say that even going into it expecting bad news I was taken aback by how many stories full of memorable female characters that you'd swear were surefire passes prove on closer inspection never to show them interacting – they just take turns talking to, and/or about, the hero.  Things have improved since the days when Jules Verne had next to no women in his stories at all, but it hasn't exactly been a matter of steady progress since then.

Now, this test shouldn't be mistaken for a simple diagnostic of writers' attitudes; for a start, John W. Campbell Jr.'s tale of square‐jawed Earthmen overthrowing a matriarchy, Cloak of Aesir, qualifies as a clear pass, while Joanna Russ's feminist classic When It Changed falls at the last hurdle – perhaps just because it's a short story.  To avoid penalising low‐wordage works, which can easily happen to be poorly supplied with dialogue opportunities without that necessarily meaning anything, I'll carry on with my strict rule of only counting longer works (though I'm being flexible enough with the categories of “woman” and “man” that for instance Animal Farm is a pass), and I'll aggregate the results by decade to get some overall rough figures:

1900s: 25 % (1/4)
L. Frank Baum's 1900 The Wonderful Wizard of Oz was technically nineteenth‐century, but I'm going by decades here.
1910s: 18 % (2/11)
Tarzan of the Apes by Edgar Rice Burroughs is a pass, though reading Jane's conversations with racist caricature Esmeralda you'll wish it wasn't.
1920s: 29 % (4/14)
Oddly high considering how often the chief's beautiful daughter is the only woman on Monster Island.
1930s: 08 % (2/25)
And this is with enough material to be a relatively reliable measure: things really have got worse since Mary Shelley's Frankenstein.
1940s: 28 % (7/25)
Things that first appeared in US magazines have a consistently worse batting average than original novels.
1950s: 26 % (15/57)
The decade's one and only Bechdel‐compliant Hugo winner is Leiber's The Big Time.
1960s: 27 % (16/59)
The proportion of contributors who are women rises markedly, not that you'd know it from this.
1970s: 22 % (13/59)
Part of the problem here is that parody sexism still scores as sexism.
1980s: 59 % (33/56)
Progress at last, but there are still so many surprise misfires!
1990s: 68 % (26/38)
If I was less completist about late works by aging Grand Masters and had more early works by rising stars it would be higher.

(This is with omnibuses de‐merged and dated by first publications, though The Lord of the Rings only scores as one failure.)
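The percentages follow directly from the pass/total counts, so the arithmetic is easy to re-run. A minimal sketch with the counts copied from the list above; the overall figure at the end is derived here rather than quoted from the page.

```python
# (passes, longer works counted) per decade, copied from the list above.
results = {
    "1900s": (1, 4),
    "1910s": (2, 11),
    "1920s": (4, 14),
    "1930s": (2, 25),
    "1940s": (7, 25),
    "1950s": (15, 57),
    "1960s": (16, 59),
    "1970s": (13, 59),
    "1980s": (33, 56),
    "1990s": (26, 38),
}

def percentage(passed: int, total: int) -> int:
    """Pass rate as a whole-number percentage."""
    return round(100 * passed / total)

for decade, (passed, total) in results.items():
    print(f"{decade}: {percentage(passed, total)} % ({passed}/{total})")

all_passed = sum(p for p, _ in results.values())
all_total = sum(t for _, t in results.values())
print(f"overall: {percentage(all_passed, all_total)} % ({all_passed}/{all_total})")
```

Over the whole century that works out at 119 passes from 348 longer works, or about a third.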

The Reverse Bechdel Test

Any time you feel any doubt as to whether the Bechdel test is measuring a real phenomenon, the cure is to consider the flipside.  What proportion of longer works manage to show two male characters discussing some topic other than a woman?  There's no good reason for this test to be any easier than the “forward” version, but the only failures I can find are stories that don't pass either the forward or reverse test, mostly due to having no quoted conversations at all (as in Hodgson's The Night Land or almost anything by Lovecraft).  Then again, speculative fiction also offers opportunities for cast lists full of characters with no definable sex – Le Guin's The Left Hand of Darkness is my only other ambidirectional fail.

I'm not aware of any litmus tests for structural racism that work anywhere near as neatly as this.  Some galactic empires seem to have populations evenly divided between blondes, brunettes, and redheads, but you can't find them as easily (and it's not the author's fault if it's the illustrators making this assumption).

The Sausage Zone

Off beyond Bechdel‐test failure is the Sausage Zone test.  Might the story be set, for all we can tell, in a parallel dimension where men (somehow) exist, but nobody and nothing is female?  Next stop: the Sausage Zone!


CHARACTER FREQUENCIES

Scrabble Highscores

Battles between K'zaakkons and Jhyqxians might be expected to result in crazy letter distributions, detectable by tallying occurrences of the letter “K”.  In fact this only makes a difference for acute cases such as Leigh Brackett, whose tales of Valkis and Jekkara use the letter almost three times as much as Olaf Stapledon – despite which A Clockwork Orange still pips her at the post thanks to Burgess's trick of telling the whole raskazz in made‐up slovos.

Doing the same calculation for “Z” mostly tells you how the publishers prefer to spell “realised”!  But correcting for that, I think I can see why H. Beam Piper's Little Fuzzy is twelve times richer in “Z”s than Russell Hoban's Riddley Walker, which is yet another tale written in an imaginary dialect of English – this time a broken‐down post‐apocalyptic one.

What about “X”, I hear you ask?  (Please stop using your mutant psi powers like that.)  Again some filtering is required, this time for contents pages full of “XXIX”s, but it's a clear win for Jack Vance.  In particular his extravagantly exotic The Last Castle has eleven times as many as Tolkien's The Silmarillion, which is also the book most sparsely supplied with “C”s, “J”s, “P”s, “U”s, or “Y”s.
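The per-letter comparisons come down to one ratio: how many of a text's letters are the one in question. A sketch follows, leaving out the spelling and Roman-numeral filtering mentioned above, since the page doesn't say how that was done. The function names are invented and the sample strings are only there to show the calculation.

```python
def letter_rate(text: str, letter: str) -> float:
    """Share of the alphabetic characters in `text` that are `letter`."""
    letters = [c for c in text.lower() if c.isalpha()]
    return letters.count(letter.lower()) / len(letters) if letters else 0.0

def times_as_frequent(text_a: str, text_b: str, letter: str) -> float:
    """How many times more often `letter` turns up in text_a than in text_b."""
    rate_b = letter_rate(text_b, letter)
    return letter_rate(text_a, letter) / rate_b if rate_b else float("inf")

exotic = "Battles between K'zaakkons and Jhyqxians"
print(letter_rate(exotic, "k"))
print(times_as_frequent(exotic, "kings and cabbages", "k"))
```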

Apostrophenia

The K'zaakkons don't make much difference for apostrophes, either, any more than all those O'Neills and d'Artagnans dominate the figures for mainstream fiction.  But apostrophes have gradually become more and more prevalent since Mary Shelley used them in possessives, quoted poetry, the word “o'clock”, and nowhere else: the early nineteenth century was the low‐tide point for apostrophes, after contractions like “'tis” went out of fashion but before negations with “‑n't” came in.  For a long time the formal literary standard even shunned phrasal contractions like “I'll/it's/you've”, which only gradually came to be accepted in dialogue, and then later in the narration as well.

Once you eliminate the references to starships' fo'c's'les (not forgetting the mere closing single quotation marks), you find that during the twentieth century that sort of complete avoidance of contractions gradually moved from being a marker of an epic style to being something reserved for comic‐relief androids.  There are a total of five ebooks in my collection that stick entirely to contractionless antique‐speak:

  1. A Princess of Mars by Burroughs (1912).
  2. The Night Land by Hodgson, yet again (also 1912).
  3. The Dream‐Quest of Unknown Kadath by Lovecraft (written 1927).
  4. Star Maker by Stapledon (1937).
  5. The Silmarillion by Tolkien (mostly written 1930s–50s).

Most of these maintain their contractionlessness by limiting directly quoted chitchat, with Hodgson and Lovecraft taking it to the level of forwards‐and‐reverse Bechdel failure.  Burroughs and Tolkien on the other hand have plentiful quasi‐archaic dialogue and do pass the Bechdel test!

A more unobtrusive avoidance of apostrophes has persisted as a feature distinguishing elevated space‐operatic discourse from everyday conversation in the vernacular, neatly demonstrated by the statistic that my Iain M. Banks collection uses apostrophes only three‐quarters as often as my Iain Banks collection.  I also note Andre Norton's space adventure yarn The Zero Stone, where colloquialisms like “that's” or “I'll” occur only rarely even in exclamations; and Chandler's Giant Killer, which always leaves phrases like that uncontracted but allows negatives like “didn't” even within its omniscient narration.

The competition for densest concentration of apostrophes overall would be an easy win for The Alley Man if Farmer had put as many flyspecks as he might have in his eye dialect.  Instead he stuck to spellings like “barkin and bitin”, so the surprise champion is Bruce Sterling's Distraction, which uses apostrophes eleven times as frequently as The Zero Stone and 167 times as frequently as The Night Land.  Taking it as far as it could possibly go in the opposite direction, Riddley Walker uses none whatsoever, even in possessives: it's set after the Great Apostrophe Catastrophe.
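Telling apostrophes apart from closing single quotes is the awkward part, since they are usually the same character. One crude heuristic, offered purely as an assumption, is to count only marks that have a letter on both sides. That catches contractions and most possessives but misses plural possessives like "starships'" and leading elisions like "'tis".

```python
import re

# A straight or curly apostrophe with a letter immediately on each side.
INNER_APOSTROPHE = re.compile(r"(?<=[^\W\d_])['’](?=[^\W\d_])")

def apostrophes_per_1000_words(text: str) -> float:
    """Word-internal apostrophes per thousand whitespace-separated words."""
    words = len(text.split())
    return 1000 * len(INNER_APOSTROPHE.findall(text)) / words if words else 0.0

# Made-up sample with single-quoted dialogue around the contractions.
sample = "‘I'll say it's the fo'c's'le,’ she didn't add."
print(len(INNER_APOSTROPHE.findall(sample)))
print(apostrophes_per_1000_words(sample))
```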

Other Punctuation

Some miscellaneous punctuation rate record‐breakers.

Whitespace, the Final Frontier

The last thing left to tally is: how frequently do spaces and line breaks occur?  The answer is that I don't care; those are silly roundabout approaches to measuring average word‐ and paragraph‐lengths and I might as well do it the sensible way instead.

Not startling me in the slightest, my most sesquipedalian ebook is Greg Egan's Diaspora, averaging 6.2 letters per word, while Riddley Walker has a radiation‐fried vocabulary, at 4.9.

Although Stapledon and Lovecraft had infrequent paragraph breaks by anglophone standards, Franz Kafka produced much bigger walls of text: The Trial averages 660 words per paragraph.  What's more, if we've decided to allow R.U.R. into the competition then the trophy for most frequent breaks will also be going to Prague, since Čapek's play averages thirteen words per paragraph!  Or then again if not, the contest is between Ray Bradbury and Roger Zelazny; in particular Bradbury's Something Wicked This Way Comes is about as close to a screenplay as novels get at twenty words per paragraph.
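Both averages are simple to compute directly. A sketch follows, with two assumptions that may not match how the figures on this page were produced: paragraphs are taken to be separated by blank lines (in HTML you would split on paragraph elements instead), and a word is a run of letters with any internal apostrophes ignored when counting letters.

```python
import re

WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")

def mean_letters_per_word(text: str) -> float:
    words = WORD_RE.findall(text)
    letters = sum(c.isalpha() for w in words for c in w)
    return letters / len(words) if words else 0.0

def mean_words_per_paragraph(text: str) -> float:
    # Assumes plain text with blank lines between paragraphs.
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    if not paragraphs:
        return 0.0
    return sum(len(WORD_RE.findall(p)) for p in paragraphs) / len(paragraphs)

sample = "Short words.\n\nSesquipedalian vocabulary here."
print(mean_letters_per_word(sample))
print(mean_words_per_paragraph(sample))
```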


AFTERWORD

The real reason for this web page is that I spent a year not being able to do any of the things that usually inspire me to write new stuff, and thought I might as well go through my bookshelf until I came up with something.  Instead, this happened.  The next step is to force‐feed all those pulp magazines through a neural net in order to have it automatically generate any required quantity of material that Campbell would have rejected as unoriginal.  But for that I'd need more computing power and less common sense.

For another variety of linguistic statistics about the genre, see the Historical Dictionary of Science Fiction; for everything else, see the Internet Speculative Fiction Database.