
Digital derring-do for the Information Manager: Consultancy principal Michael Steemson's 4,000-word case study of an eight-year project to develop an image-based news and picture library at a large British newspaper group for which he worked. |
Little more than a decade ago, an English newspaper jack with the name of a king, Shah, fired the opening shots in a revolution. He launched the first national daily newspaper in Britain to be produced entirely by computer technology. Then, an Australian newspaper king with the name of a warrior, Murdoch, built a printing fortress in London’s old East End and won the war with the print unions. The casualties were the supposedly impregnable trade unions, the thousands of print workers with "unassailable" jobs and the British newspaper industry around London’s Fleet Street -- "The Street" to generations of journalists from all over the English-speaking world -- which was scattered to the four corners of the fair city.

In the following five years, British national newspapers took to high technology like so many ducks to water. Offices bulged with humming boxes and simultaneous print runs started all across the nation. The almighty computer had arrived in The Street. It was not before time. Our country cousins, the provincial dailies, had been using computers for years. Fleet Street companies had been held back from making the leap into the technological trenches because of the inevitable battles with the over-manned print unions. It needed an Eddie Shah to light the fuses and a Rupert Murdoch to resist the blast. It is probable that it would not have happened at all without the capitalistic crusade of the then Prime Minister, Margaret Thatcher.

This "Revolution of the Nationals" discarded the last remnants of printer William Caxton’s 15th Century technology. Out with them went hot metal type-setting, the Linotype dinosaurs on the composing room floors, the clattering typewriters in the newsrooms, paper, forms, galleys, slugs and cries of "Copy boy!" In their place came air-conditioned office suites, hushed as the ancient chapels where, centuries ago, monks learned the printer’s trade long before anyone had heard of "the Media".
In their place came lasers, screens, buttons and bytes, work stations, algorithms and programs. Journalists even had to learn a new spelling of "programmes".

One of the oldest Fleet Street companies, Express Newspapers, jumped joyously into the technological Jacuzzi with the rest of them. But it jumped further and deeper than anyone else at the time with a new, high-tech imaging process for the news reference library, the OPAL system. OPAL, an acronym for Optical Automatic Library, coupled Filenet and Digital Equipment Corporation computers with some smart indexing software to store images of newspaper clippings for access by our journalists. In 1988, this was a major first for U.K. national newspapers.

At its heart was the image handler, the Optical Storage and Retrieval unit, or OSAR, which held racks of 64 twelve-inch (30cm) optical storage laser disks, four disk drives and a robotic arm. Storage on each laser disk was 2.3 gigabytes, or more than two thousand million bytes of information. Each disk carried 30,000 clipping images. When full seven years later, it contained the equivalent of two miles of clipping file shelving. But, by then, the system had been orphaned by the demise of the company that built it. The papers’ perception of "high-tech" had changed and the system could not cope with an index database of 20 million references and a clippings store of one and a half million images.

CODS for OPAL

Express systems technicians had, by then, developed a newer, swifter, more intelligent OPAL system based on an array of optical CD drives. Knowing the technologists’ love of acronyms, the mischievous, unofficial name for the new OPAL was CODS - Cheap Optical Disk Systems. How cheap? The old OPAL system cost around £2 million to buy and carried an annual maintenance burden of £200,000.
The new system cost slightly more than £200,000 to build and its annual maintenance costs are around £15,000 - an 18-month pay-back and a reduction in running costs of more than 90 percent.

The migration process from the Filenet 12in disks to the CDs was carried out smoothly. Image conversion was carried out through a bank of a dozen or so Pentium-chip PCs number-crunching away for a month or more. Each image was turned from Filenet’s own Group 3 format to an open Group 4, then trimmed of its surrounding white space, de-speckled and de-skewed. The new processes and modern medium allowed the archive to be compressed to only 57 per cent of its original size, a welcome fact that gave the clever algorithms doing the work the nick-name "Heinz".

It was exciting to watch the conversion process. The original image flicked onto a screen and in a second, the white, A4-size surround disappeared. Moments later, the image lurched to an exactly upright position. Occasionally, because of a mark or distortion on the original, the software was uncertain of the letterpress orientation and rotated the image through 360 degrees, scanning it at 90 degree intervals, seeking an identifiable plane in lines of type. If the checks failed to recognise a misalignment, it filed the image as it found it.

Concurrently, a Picture Store was being built for the newspapers’ huge library of Press photographs. Early in the work with OPAL, the potential for storing picture images had been seen, but in the late 1980s the technology at an affordable price was not up to it. But a start was made by introducing bar-codes to the Picture Library. Everything that did not locomote was bar-coded - prints, transparencies, negative sets, storage envelopes and sleeves. A number of bar-code readers were tested before settling on hand units made by the Bohemia, N.Y., manufacturer Symbol Technologies.
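The clean-up applied to each clipping image during the conversion described above can be illustrated with a toy example. This is only a minimal sketch, not the Express software: it assumes a bilevel image held as a list of rows (1 for black, 0 for white), and the function names `despeckle` and `trim_white` are hypothetical. It removes isolated specks first so that a stray mark does not widen the crop, an ordering chosen for this sketch rather than taken from the original process. De-skewing is left out.

```python
from typing import List

Bitmap = List[List[int]]  # 1 = black pixel, 0 = white pixel


def despeckle(img: Bitmap) -> Bitmap:
    """Clear black pixels that have no black neighbour (isolated specks)."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1:
                continue
            has_neighbour = any(
                img[ny][nx] == 1
                for ny in range(max(0, y - 1), min(h, y + 2))
                for nx in range(max(0, x - 1), min(w, x + 2))
                if (ny, nx) != (y, x)
            )
            if not has_neighbour:
                out[y][x] = 0
    return out


def trim_white(img: Bitmap) -> Bitmap:
    """Crop the image to the bounding box of its black pixels."""
    rows = [y for y, row in enumerate(img) if any(row)]
    cols = [x for x in range(len(img[0])) if any(row[x] for row in img)]
    if not rows:
        return []  # blank page: nothing to keep
    return [row[cols[0]:cols[-1] + 1] for row in img[rows[0]:rows[-1] + 1]]


# A tiny "page": a 2x2 block of type plus one stray speck near the edge.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0],
]
cleaned = trim_white(despeckle(page))
```

Trimming the white surround before storage is one reason a converted archive can end up smaller than the original, since the blank border no longer has to be encoded at all.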
The systems department began by creating software with which to index photo and negative hard copy and log its movement into and out of the library. The basic programming language was Cobol, old-fashioned but infinitely variable and almost plain English. In places, programmers added Pascal, Algol and, later on for the front end of the picture image system, Visual Basic. Within a few months, the indexing and logging system was up and running and, for the first time in its history, the Library was able indefinitely to keep details of who had withdrawn photograph envelopes or transparency sets.

The picture system rides on the company’s Unisys A-series mainframe computer and is driven by Sony and Taxan 486 66Mb PCs built to the company’s own configuration. The mainframe was originally intended just for the accounts department, but was extended to budgeting, pensions, circulation and other tasks as the regular, bi-annual upgrades brought greater processing power.

Photo-CD Inspiration

In 1992, the systems gurus saw that the storage of colour pictures on line for use by the picture editors was an attainable reality. The catalyst for this was the Kodak Photo-CD system which they inspected at Kodak’s UK headquarters in Hemel Hempstead, a light industrial town of some 80,000 inhabitants 20 or so miles (32km) north of London up the main arterial M1 motorway. The Kodak system could scan 35mm negatives or transparencies in seconds to produce simultaneous picture files from tiny thumbnail to massive 18 to 20 megabyte definitions. It was very expensive at that time.

So, the 30-strong systems department set about creating its own process, adapting the already-written programs for indexing hard copies. Company experts studied other developments in Britain, Europe and the United States, visiting sites in Paris, Portsmouth, Pimlico and Pasadena.
They realised that storage on-line of the really high, 20-plus megabyte images was beyond company needs, but the middle range of eight to 10 megabytes was not only storable, but ideal for reproduction on even high-quality newsprint paper. Several cheap scanners - like those from Kodak, Canon or Scitex - could provide images from 35mm colour negatives or transparencies fast enough to keep indexers busy and at a wide variety of file sizes.

The new system kept four different definitions, three of them on-line - the thumbnails, the low-resolution full-screen version for editing and the eight to 10Mb image for printing. Using the JPEG algorithms, these three were compressed to around one megabyte, without great loss of definition. The largest image was stored on laser disk, the other two on hard disks.

Two juke boxes made by the British company Reflections were linked to store the high-resolution images. The juke boxes carried fifty 5.25in (13cm) optical disks on a treadmill-style wheel which revolved to load one of the disk drives. Each box was the size of a two-drawer filing cabinet. The two juke boxes mirror each other, both carrying all the images to provide the resilience demanded by the editors. The software controlling the juke boxes loads to both identically unless a box fails. Then, it remembers which images have been omitted and completes the task when communications are restored. The 50 disks, each of one gigabyte storage capacity, carry 900 images, a total of 45,000 pictures which will take a couple of years’ work to fill.

But, what became of the fourth, 18-20Mb images? The system compressed them to around 4Mb and loaded them sequentially onto cheap, magnetic tape against the day in an estimated four years’ time when technology would provide the means to use them efficiently. The head of the systems department believes that new technology doubles all its abilities -- storage, speeds, processing power -- every two years.
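The mirrored loading of the two juke boxes described above, with its memory of omitted images, can be sketched as follows. This is an illustration under stated assumptions, not the Express control software: the classes `JukeBox` and `MirroredStore`, the `online` flag and the use of `ConnectionError` to signal a failed box are all hypothetical.

```python
class JukeBox:
    """Hypothetical stand-in for one optical juke box."""

    def __init__(self, name):
        self.name = name
        self.online = True
        self.images = {}  # image_id -> image bytes

    def store(self, image_id, data):
        if not self.online:
            raise ConnectionError(f"{self.name} is unreachable")
        self.images[image_id] = data


class MirroredStore:
    """Writes each image to both boxes and remembers any missed writes."""

    def __init__(self, box_a, box_b):
        self.boxes = [box_a, box_b]
        self.pending = {box_a.name: [], box_b.name: []}

    def store(self, image_id, data):
        for box in self.boxes:
            try:
                box.store(image_id, data)
            except ConnectionError:
                # Remember the omission so it can be completed later.
                self.pending[box.name].append((image_id, data))

    def catch_up(self):
        """Replay missed writes to any box that is back on line."""
        for box in self.boxes:
            if not box.online:
                continue
            missed, self.pending[box.name] = self.pending[box.name], []
            for image_id, data in missed:
                box.store(image_id, data)


box_a, box_b = JukeBox("A"), JukeBox("B")
store = MirroredStore(box_a, box_b)
store.store(1, b"first picture")
box_b.online = False              # box B fails
store.store(2, b"second picture")  # only box A receives this one
box_b.online = True               # communications restored
store.catch_up()                  # box B is brought level with box A
```

Keeping the list of omissions outside the failed box means the surviving box carries on serving the picture desks while the other is repaired.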
He and the company expect that within four years or so, storage and telecommunications capacity will have advanced to the point where it will be practical to put the big files on line. More than that, before the turn of the century, image picture archives will be expected to have bigger files available as a matter of course. When that time comes, the Express images will be ready to down-load, using the tapes only that one time. The images will automatically pick up the indexing references already applied to the small file-copies and they will be ready to roll out. Because all the work had been done in house, re-configuration, adjustment or re-design work will be possible indefinitely, avoiding further effects of orphaned application software.

Pivotal Indexing Protocols

The storage procedures are much the same for pictures or news clippings. Filing clippings begins with library staff cutting articles from the newspapers - all stories from the company’s newspapers The Express (formerly the Daily Express), the Express on Sunday (the Sunday Express, as was) and the Daily Star, and a selection from other national dailies, evening newspapers and magazines. Each clipping is scanned on a flat-bed machine rather like an office duplicator, then laser etched onto one of the disks. The library stores about 800 clippings a day.

Picture images come direct from the wire services through the picture desks or are scanned in the Library from negatives, transparencies or prints. The scanners automatically give each image its own identity, a unique number, and remember it. The scanner operators routinely attach the first index references, the newspaper name and date of publication. The library stores about 300 images a day.

The indexing function is pivotal to the success of the process. The archive was designed to be accessed by the users, the writers and photographers, searching for the words they used in the stories or picture captions.
Hence, library staff had to re-learn all they knew about indexing. Here was where I came in. I had been a journalist for more than 30 years. The company realised that some journalistic input was needed for the new magic, and I became the link between the library and the newsmen.

Learning curves for library staff were long and steep. The librarians had to abandon their lovingly-devised file descriptions like "Aero: Civil Airports, London (Heathrow)" or "Anatomy: Ears, adornments". They had to learn to call them simply "Heathrow" and "earrings", just like the journalists. They had to learn to think laterally, remembering to file a story about the price of oil into the petrol folder, or a mountain scene under scenery, even if the words "petrol" and "scenery" were not mentioned. They had to forecast what other words would be used to retrieve the item.

After the scanning process, library clerks call up clipping or picture images in batches to index work stations where they manually enter a mass of index references linking them to each image - names, places, events and subjects that appear in the picture or article. The system has not yet been developed to a text-reading capability. If a particular index reference -- a "folder" -- does not exist, the indexer creates a new one and the system writes a new code number to the database.

Each clipping gets an average of 13 index references. This equates to putting a paper clipping into 13 different filing envelopes and compares with the average of two copies that were actually stored in the old paper-based archive. Pictures usually get even more references. The process is very time-consuming. It takes 11 scanning and indexing shifts a day to load 800 clippings and a proportionate time for the new pictures. Back-filing of pictures will take many man-hours to complete.

Learning Picture Indexing

If the learning curves were long and steep for indexing clippings, they were really mountainous for picture indexing.
At least with news stories, indexers had the words in them as a guide. But pictures are different. Anyone who has dealt with photo-journalists, as news picture-men sometimes wish to be called, will know that the only thing they like doing is taking pictures. By the same token, the one thing they absolutely hate is writing picture captions. I have never understood why, but it’s something to do with crossing journalistic boundaries; perhaps, a touch of indolence; and, I suspect, more than a little of the temperamental artist leaving his familiar medium.

Anyway, pictures usually arrive in the Library with little or no indication as to their subject. As a result, if the pictures are unpublished, indexing can be very difficult. Worse still, a picture may be used for a dozen different reasons in the future. Unless the indexers realise these potential uses, they will not index them properly and future users will come across them only by accident.

Another problem: Many pictures show no particular news event but rather illustrate a general theme - stock pictures, as opposed to news pictures. They can be useful to writers of feature or background articles who may want a picture to illustrate an idea, or an impression, or just create an atmosphere. But these stock pictures are even more difficult to index than the news pictures. They must be given references for the ideas as well as the images they represent.

Stock picture librarians are not dismayed when asked for a "warm" picture. They discover what the client wants with a few deft questions. But, how does one get the client to enter the right concepts into an on-line archive for the "warm" pictures of lovers holding hands, cats curled up in a rug, glowing sunsets or beds of marigolds, without also giving him pictures of sweating athletes or forest fires?
Library chiefs are getting their news-orientated heads round these conceptual problems, deciding on a more controlled index thesaurus and retrieval screens with icons and buttons sign-posting the way. Concepts and general subjects will be categorised, allowing users to drill down to key-word lists. We’ve got a lot more work to do on this, but we know which way we are going. (Concepts and Feelings, Facial Expressions and Actions key words are listed in the Indexing Appendix.)

Learning curves for some journalists have been long, too. As well as absorbing the manipulatory commands, they have had to learn where to find items, usually without the help of librarians. To do this, the journalist effectively reverses the indexing process. At a work station near his desk, he enters the name of a person, place, event or situation he wishes to research. If the words are recognised, the monitor screen indicates the number of clippings or pictures available. If the "set" is too large, the user can refine it by entering further names or places, other key words to pin-point more precisely the items he wishes to see. Once satisfied with his set, the journalist can view the images, latest date first, on the work station screen. He may order copies on one of the system’s laser printers.

Designers’ Misconceptions

The original OPAL designers made one or two false assumptions about journalists. Understandably, perhaps, they relied on the belief that journalists can spell. Wrong! They also worked on the premise that, when a newsman sits down at a work station, he knows what he is looking for. Wrong again! Most newspaper men worth their salt will tell you, in private, of course, that their spelling is far from good and that what they often require from a library is inspiration rather than information.

Express programmers set about making allowance in the software for these misconceptions. The English language is filled with words carrying double letters.
Not everyone is always sure how many double letters there are in, say, "accommodation". Yes, two c’s and two m’s! Another example: North Americans often disagree with Englishmen about the number of l’s in words like "traveller" or "marvellous". So, the index system was re-designed to ignore double consonants, filing them all as if they were single, though still displaying the word correctly spelled on screens. A number of other special algorithms were built in to help a user who might be uncertain if the name he is looking for is spelled "Mc" or "Mac", or ends with "or", "ar" or "er". The software also recognises many other homonymous word endings, those that sound alike but are spelled differently.

Systems staff tailored other functions to company specifications, improving an original process linking certain folders to others to create a hierarchical "family tree" through which clippings or pictures may be filed automatically. These and other modifications and innovations have vastly improved the system and brought increasing acceptance and use by the journalists.

Mistakes were made, of course! The number of index references needed for each item to provide full coverage for all journalistic enquiries was under-estimated. This impacted on the time indexing would take and the number of staff needed to do the job. Early attempts at creating the index contained many multiple-idea folders that caused confusion. There are folders called "aircraft" and "accidents". For a time, the index also held a folder for "aircraft accidents" which led to some indexers using this while others filed into the other two. The family tree hierarchy of linked folders is still unfinished.

The Basis was Right

But the basic concepts were right. A journalist can use the system with a couple of hours’ training and get from the screens satisfactory standards of information, without having to read through envelope-fulls of unnecessary clippings or huge files of picture prints.
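To make the spelling allowances described under Designers’ Misconceptions concrete, here is a minimal sketch of a spelling-tolerant folder index. It is an assumption-laden illustration, not the Express code: the names `fold_key` and `FolderIndex` are hypothetical, and the three folding rules (doubled consonants, "Mc"/"Mac", and "or"/"ar"/"er" endings) are simplified versions of the behaviour the article describes. The `search` method also shows how entering further key words narrows a "set".

```python
import re
from collections import defaultdict

CONSONANTS = "bcdfghjklmnpqrstvwxyz"


def fold_key(word: str) -> str:
    """Reduce a word to a spelling-tolerant index key (simplified rules)."""
    key = word.lower().strip()
    # "Mc" and "Mac" prefixes are filed alike.
    key = re.sub(r"^mac(?=[a-z])", "mc", key)
    # Doubled consonants are filed as if they were single.
    key = re.sub(rf"([{CONSONANTS}])\1+", r"\1", key)
    # "or", "ar" and "er" endings are filed alike.
    key = re.sub(r"(or|ar)$", "er", key)
    return key


class FolderIndex:
    def __init__(self):
        self.folders = defaultdict(set)  # folded key -> item ids
        self.display = {}                # folded key -> correctly spelled name

    def file(self, item_id, folder_name):
        """File an item under a folder, keeping the correct spelling for display."""
        key = fold_key(folder_name)
        self.display.setdefault(key, folder_name)
        self.folders[key].add(item_id)

    def search(self, *terms):
        """Return ids filed under every term; each extra term narrows the set."""
        sets = [self.folders.get(fold_key(t), set()) for t in terms]
        return set.intersection(*sets) if sets else set()


index = FolderIndex()
index.file(1, "accommodation")
index.file(2, "accommodation")
index.file(2, "Heathrow")

hits = index.search("acommodation")                # misspelled, still found
narrowed = index.search("accomodation", "heathrow")  # refined set
```

Folding the key while storing the correct spelling separately is what lets the screen show "accommodation" however the user typed it. The price of such loose matching is the occasional false match between genuinely different words, which a production system would need to weigh.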
Once the library clerks understood the new indexing concepts, they had little difficulty filing material where the journalists can find it. After all, journalistic language is very largely the language of the people, street language that needs no degrees in grammar to comprehend.

The decision to use human brain power for entering the index references rather than relying on full-text search functions was right, at the time. No full-text system then available could sufficiently accurately emulate the trained selectivity of the human mind or the creative intuition of an experienced librarian, wise in the ways of his journalistic colleagues. But in the last few years a whole variety of software systems have come onto the market which provide more intelligent text searches. Some can automatically convert images of clippings into text for searches. Coupled with the knowledge of the indexers, these processes can be built into a very powerful force for the retrievers.

In the early days of company investigations, picture manipulating systems were dominated by Macintosh. Much of the work on the preparation of pictures for printing was done with Macs in the Express office and the platform dominates the publishing industry outside newspapers. Nonetheless, it was felt, and recent events have supported the impression, that the cheaper, IBM-compatible PCs are the way forward. That is the route that was taken. The library work stations are standard 486 PCs with 66Mb memories, video-graphic image cards and 17in screens. Picture manipulation is carried out with Microsoft Windows software backed up with interfacing software to supply the remaining Macintosh processors in-house and, when the system is made public, the Mac-users out there in the publishing world.

Most of the British newspaper groups rivalling the Express now have text-based reference libraries. A group of them, including the Daily Mail, the Daily Mirror and the Daily Telegraph, exchange text archives daily.
Others like The Times and the Daily Telegraph have plunged deeply into electronic publishing. In 1995, Rupert Murdoch said his group News International, which publishes The Times, the Sunday Times, the News of the World and the Sun, would begin publishing its newspaper titles on-line within a couple of years. News International Newspapers has the best-developed strategy with News Corporation for on-line publication of newspapers and will have some of its titles available on both business and consumer on-line services by the middle of 1996, largely through the Internet.

The Next Steps

At Express Newspapers, the next stage of development for the news library will almost certainly be the ability to load images direct from the print process to the library database, thereby eliminating the need to re-scan the clippings. Adobe's Acrobat software seems to provide all the requirements for this task, providing text search functionality but with displays that still look like the clipping images. Thumbnail displays of cutting images are possible, too, allowing a journalist for the first time to browse actual headlines before clicking on the images he wishes to read in detail.

This is where the image-based process shows its great strength. A journalist can tell a great deal from the look of a clipping. He has probably read it already, on the day of publication, and can often recall its shape and size. It is not just a screen full of words. It recreates the excitement of the news story as it was reported in the heat of the moment, with fanfaring headlines and graphic pictures. Much of this the journalist can understand without re-reading a single word of the text. He has to work at speed. All these extra impressions are invaluable extra tools available only from an image-based library.

Archivists value clippings. Clippings give them a sense of contact with the past. The typefaces can date the item almost as accurately as decorations on the face of a building date its construction.
The images have immediacy and researchers can get a real sense of the urgency and excitement of the moment the story came off the presses, or dropped through the letter boxes in the millions of British front doors each morning.

Express Newspapers began in the year 1900. If OPAL had been around then, the electronic library would now have stored the history of almost the whole of this extraordinary century - two world wars, heavier-than-air flight, the development of the internal combustion engine, modern Europe, and the rise and fall of Communism. It would not have had to throw away any of those 70 million clippings but kept them in the space of little more than half a dozen wardrobes. Perhaps more importantly, it would have contained picture images of the time, as well. It is the capturing of images, textual and photographic, that makes this sort of library so valuable, such compulsive viewing.

Technology continues to develop image storage with increasing enthusiasm and resource. Most personal computers on the market now have screens capable of displaying images. Soon, more and more archives like the Express library will be able to link to PCs anywhere in the world and sell users facilities allowing research of history as it was made. The possibilities are huge for the development of storage of vast numbers of pictures at a quality high enough for reproduction in any publication. The future prospects are enormous, and enormously exciting.

A clever man, or perhaps it was a woman, once remarked that a picture was worth a thousand words. I was a journalist, not a mathematician, so I leave it to you to calculate the value of the picture of a thousand words, or the images reflected in the millions of words stored in the Express library’s humming grey boxes.
The Caldeson Consultancy.