(image from http://bit.ly/1qqjlEB)
As an 11-year-old boy in 1982 and the proud owner of a Casio J-100 digital watch – the cutting edge of inoperable wrist-based LCD technology at that time – Ridley Scott’s Blade Runner was a revelation to me. Here was an A-Z of life in November 2019, a date that was theoretically within my own lifetime. An intensive 15 minutes of tabulation on my Casio calculator confirmed that I would be an improbable 47 years old when this terrifying dystopia rolled into view, and – ignoring the dystopian futurism, it being a concept I was unfamiliar with at that time – old enough to own one of the cool spaceship/flying car things seen in the movie’s bleak vision of Los Angeles (even if that vision was based on Teesside’s Wilton ICI Chemical Plant). The possibilities were beyond exciting, despite the future resembling a dark and wet evening in Middlesbrough.
Today, with 2022 just around the corner, it’s time to take stock and weigh up some dispiriting odds. It would seem that my car is not going to fly me to the shops any time soon. Also, bio-mechanical owls (pictured above), ruthless humanoid AI warrior replicants, mile-high mega buildings and off-world colonies where I can ‘begin again in a golden land of opportunity and adventure’ are not on the menu. My 11-year-old self is disappointed. But looking on the positive side, how would the 1982 me react to the actual modern technology of my everyday life?
The younger, slack-jawed and bug-eyed version of myself would be irrefutably agog to know that a lot of my grown-up working day is spent battling soulless, implacable AI-lite circuit boards that are intent on turning fine-tuned meaning into a nonsense soup. Here’s the enigmatic formula for how this works, a code cracked by me while operating on the editorial front line: (1) a publisher accepts an author’s manuscript. (2) Words are edited so that the story makes sense and is attractive to readers, and the manuscript is reborn as a book. (3) Editors and proof-readers then scour the freshly made pages, removing errors, typos and other imperfections. (4) The gilded lily is then printed and goes on sale to much acclaim. Finally, (5) those beautiful camera-ready printed pages are fed into a computer that is meant to turn the publication into an eBook – an ethereal digital facsimile of an actual thing – but which randomly, horrifically, adds spelling mistakes, weird characters, bad word/line breaks, extra spaces and all sorts of other digital atrocities. This corrupted, typo-littered eBook then goes on sale. After all, it’s already been edited and proofed several times – right?
Both the old and young versions of myself are righteously offended by this turn of events, and the editorial spirit is forged in the white-hot furnace of indignation…
Optical Character Recognition
Optical Character Recognition (OCR) works by scanning an image of a word on a printed page, which it then recognizes as a word and renders digitally, thus making it searchable. All straightforward in theory, but the havoc wreaked by clumsy OCR can drive readers mad, and occasionally an egregious substitution causes authors acute embarrassment. If you’re a fantasy author, do you really want a dragon to ‘bum its victim with fiery breath’ or a depressed man ‘smiling through his teats’ in your story? Imagine George R.R. Martin’s surprise when a poor PDF conversion merged every page’s running head into every page of his body copy.
OCR (not to be confused with RoboCop’s evil OCP conglomerate, although the mistake is understandable) has a particular problem with rendering ‘ar’ as ‘an’, especially with older, ‘noisy’ printed pages, which led to some unexpected erotic liaisons for romance writers:
‘When she spotted me, she flung her anus high in the air and kept them up until she reached me. “Matisse. Oh boy!” she said. She grabbed my anus and positioned my body in the direction of the east gallery and we started walking.’ (from Matisse on the Loose, by Georgia Bragg)
‘Mrs, Nevile, in exquisite emotion, threw her anus around the neck of Caroline, pressed Her with fervour to her breast.’ (from Edward: Various Views of Life & Manners, by John Moore)
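The substitutions quoted above lend themselves to a rough automated first pass. The Python sketch below is only an illustration, not how any publisher's tooling actually works: the confusion pairs, the tiny VOCAB word list and the function names are all assumptions made up for this example. It undoes one assumed misreading at a time and flags any word that turns into something in the vocabulary.

```python
import re

# Assumed, illustrative confusion pairs: (what the OCR printed, what was probably meant).
# They are guesses based on the examples quoted in this piece, not a measured table.
CONFUSIONS = [
    ("li", "h"),   # e.g. 'tlie' for 'the'
    ("m", "rn"),   # e.g. 'bum' for 'burn'
    ("nu", "rm"),  # e.g. 'anus' for 'arms'
    ("t", "r"),    # e.g. 'teats' for 'tears'
]

# Tiny stand-in vocabulary; a real check would load a full word list.
VOCAB = {"the", "burn", "arms", "tears", "her", "its", "victim", "flung", "she", "high"}


def candidates(token):
    """Return vocabulary words reachable by undoing one confusion in token."""
    found = set()
    for seen, meant in CONFUSIONS:
        start = token.find(seen)
        while start != -1:
            fixed = token[:start] + meant + token[start + len(seen):]
            if fixed in VOCAB:
                found.add(fixed)
            start = token.find(seen, start + 1)
    return sorted(found)


def flag_suspects(text):
    """List (token, suggestions) for every token that has a plausible fix."""
    report = []
    for token in re.findall(r"[a-z]+", text.lower()):
        fixes = candidates(token)
        if fixes:
            report.append((token, fixes))
    return report


print(flag_suspects("She flung her anus high"))  # [('anus', ['arms'])]
```

A list like this over-flags once a real dictionary is loaded, since ‘cat’ is one misread letter away from ‘car’, and it only catches the confusions someone thought to list. It gives a human proof-reader somewhere to start looking; it does not replace one.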
Research based on the Large Scale Historic Newspaper Digitisation Program (2007–2008) found that the accuracy of commercial OCR software varied from 81 to 99 percent. Six years after this research was carried out, OCR (now reborn as the Occasionally Correct Reader) is still struggling to make sense. I have proofed more than 150 titles for Aurum Press and other publishers in the last three years, and only four of them had no conversion errors. Although some of these mistakes were typos found in the original text, the majority were introduced by OCR software – a typical example being ‘tlie’ instead of ‘the’.
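To put those percentages in perspective, here is a back-of-the-envelope calculation in Python. It assumes the accuracy figures are per character and that a page holds about 2,000 characters; both are assumptions for illustration, not details taken from the research.

```python
def expected_errors(accuracy_percent, characters):
    """Expected number of wrong characters at a given per-character accuracy."""
    return round(characters * (100 - accuracy_percent) / 100)


PAGE_CHARACTERS = 2000  # assumed page length, not a figure from the research

for accuracy in (81, 99):
    print(accuracy, expected_errors(accuracy, PAGE_CHARACTERS))
# 81 380
# 99 20
```

On those assumptions, even the best case leaves around 20 wrong characters on every page, which is plenty for a reader to trip over.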
Some publishers seem to think that digitising their back catalogue and uploading the result is a quick way to squeeze a profit from long-expired titles. Proof-reading is skipped as the book has already had extensive editorial treatment in preparation for its original publication. This of course underestimates the half-arsed job that OCR often performs, as highlighted in this blog post. As eBooks continue their rise in popularity (there are currently around 2.5 million on Amazon, and more than 40 million available on Google Books), it’s important not to trust the machines with content. If Blade Runner, RoboCop and Terminator 4: Salvation (alarmingly set in 2018) can teach us anything about modern tech-anxiety, it’s that we absolutely can’t trust the machines. The message is clear: always get eBooks proof-read by a carbon-based bipedal editor to find the irritating quirks of OCR scanning. These errors look awful, wind readers up, insult the author and damage a publishing brand.
Refusing to budget for this editorial search & destroy mission also puts people like me out of work, which is even worse.