    Hispania [Publicaciones periódicas]. Volume 73, Number 1, March 1990
    

Electronic Text Scanning

Joseph A. Feustle, Jr.


University of Toledo


Among the claims about what computers can do for research and teaching, one fundamental point is consistently overlooked: many of the most powerful and appealing programs are all but useless unless you already have the text that you intend to work with in electronic form. All of the indexing power of Word-Cruncher and ZyIndex, the analytical capacity of MTAS and STRAP, the sophisticated comparative functions of CompareRite and UNITE, the pedagogical potential of Módulos para el análisis literario, indeed almost anything that you will want to do with hypertext is for naught unless you have the text on disk [206]. Traditionally, this has meant that, unless you could find an electronic version of your text (something highly unlikely given the negligible amount of text that has been put into electronic form in our field), you were forced to opt for one of three expensive or unsavory alternatives: purchase or rent a Kurzweil, Calera (formerly Palantir), or equivalent optical character scanner; hire a typist to key the text in; or do it yourself. Today this is no longer the case, thanks to a combination of declining prices for hardware and the rapidly increasing capabilities of a number of OCR (optical character recognition) or «text scanning» programs.

My experience with text scanning derives from a hypermedia project that started with Rubén Darío's Prosas profanas and has grown to one million bytes of information as I added poems from Azul, Cantos de vida y esperanza, and other collections of poetry to it. I am now adding the text of Darío's unfinished autobiographical novel, El oro de Mallorca, and Juan Valera's important «carta-prólogo» to Azul. Typing in the text myself has been out of the question from the start. Having a typist do it might be an acceptable alternative were it not for the fact that the price of text transcription in the Toledo area is about $12.50 per hour and, even at a most optimistic rate of seven pages per hour, the cost quickly exceeds my resources. Access to cost effective and reasonably efficient text scanning is the only thing that has made the continued expansion of my project possible.

If you have a Macintosh, a PC, PS/2 or compatible computer with a hard disk, you already possess a substantial part of the hardware required for scanning text. The next most important part is a scanner. I have worked with two models from the Hewlett-Packard line: the ScanJet and ScanJet Plus. The difference between the two is minimal from the point of view of optical character recognition, but the ScanJet Plus is the newer model and will scan images at 256 shades of grey as compared to 16 for the original ScanJet. The increased grey-scale capability is an added bonus that you can put to good use when incorporating images in your word processor files, desktop publishing documents or hypermedia. There are scanners on the market, similar in configuration and price, from many other manufacturers such as Abaton, Apple, Dest, and Microtek, to name a few. Once you have made your choice, you must be absolutely certain to purchase the scanner interface that is correct for your computer. While this poses no problem with Macintoshes (the scanner connects to either the SCSI or the serial port), it can with PCs, where you must obtain an interface card compatible with IBM's Micro Channel Architecture design if you plan to connect the scanner to any of the PS/2 line of computers. The typical university price for a ScanJet Plus with either a Macintosh or a PC-PS/2 interface is about $1,150.00. A typical retail price for this same combination is $1,400.

If you plan to scan large amounts of text from 8 1/2 by 11 inch pages you may want to consider purchasing an automatic sheet feeder. The one that I have with my ScanJet Plus will automatically feed up to 15 sheets of paper through the scanner before reloading. This convenience represents an additional $250.00.
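To put these hardware prices in perspective, the short Python sketch below compares them with the typist rates quoted earlier ($12.50 per hour at an optimistic seven pages per hour). It is only a rough illustration: the function and constant names are hypothetical, and it deliberately ignores the cost of the computer, the OCR software, and the time spent proofreading.

```python
# Rough break-even sketch; all names here are illustrative assumptions.

TYPIST_RATE = 12.50      # dollars per hour (Toledo-area figure from the text)
PAGES_PER_HOUR = 7       # optimistic typing rate from the text
SCANNER_PRICE = 1150.00  # university price, ScanJet Plus with interface
FEEDER_PRICE = 250.00    # automatic sheet feeder


def typing_cost_per_page(hourly_rate: float, pages_per_hour: float) -> float:
    """Cost of having one page keyed in by a typist."""
    return hourly_rate / pages_per_hour


def break_even_pages(equipment_cost: float, hourly_rate: float,
                     pages_per_hour: float) -> float:
    """Pages at which typing fees equal the equipment cost."""
    return equipment_cost * pages_per_hour / hourly_rate


if __name__ == "__main__":
    per_page = typing_cost_per_page(TYPIST_RATE, PAGES_PER_HOUR)
    pages = break_even_pages(SCANNER_PRICE + FEEDER_PRICE,
                             TYPIST_RATE, PAGES_PER_HOUR)
    print(f"Typing cost per page: ${per_page:.2f}")
    print(f"Scanner plus feeder pay for themselves after {pages:.0f} pages")
```

Adding the price of a scanning program to `equipment_cost` moves the break-even point accordingly.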

Hardware considerations can be further complicated by hidden factors such as the additional memory that is required to make effective use of the text scanning programs, and the added speed that you will want from your computer to quickly process the scanned text. My experience with three of the four text scanning programs, two for the Macintosh and one for the IBM PS/2, was that, while they did work as advertised on a minimally configured computer (a Macintosh with 1 megabyte or a PS/2 with 640K of memory), they scanned only half a page of text at a time until I added memory. Thus, you must actually have a computer with at least 2 megabytes of memory before you can take full advantage of these programs. Optical character recognition, with the one exception that I will detail below, makes substantial demands on the computer's microprocessor. A complex page of text and graphics may take up to 8 minutes to process. Thus you may wish to consider a faster computer such as an 80286 or 80386-based PC-PS/2 or one of the newer Macintosh SE/30, IIcx or IIci models. Some programs will only run on one of these faster machines.

Optical character recognition programs approach text scanning in two different ways: through feature extraction or through pattern matching. The former searches out the strokes, lines and curves (sometimes expressed as mathematical formulas) that are common to the formation of a particular character independent of any particular font or type style, and compares them with a built-in library of forms and characters. The latter compares the entire pattern of dots that make up a letter against its built-in library. Feature extraction programs can be further refined through the use of artificial intelligence. Some, after extracting all pertinent information, compare the characters identified against a dictionary to see if they constitute a real word. Pattern matching programs can also be enhanced through «training»: you work with them in a kind of question and answer session where they show you the character that they have scanned and how they interpret it. You either confirm the program's «best guess» or correct it, and in the process build an extensive template against which all other characters will be compared. Figure 1 shows an example from Read-It!
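The pattern matching approach, together with the dictionary check mentioned above, can be illustrated with a toy Python sketch. It is not how any of the programs reviewed here is actually implemented: the 3-by-5 bitmaps, the tiny word list, and every function name are assumptions made for illustration, and the comparison simply counts differing dots (a Hamming distance).

```python
# Toy pattern matching: compare a scanned dot pattern against a small library.
# "#" marks an inked dot. Real programs work with far larger patterns.

LIBRARY = {
    "I": ("###", ".#.", ".#.", ".#.", "###"),
    "L": ("#..", "#..", "#..", "#..", "###"),
    "T": ("###", ".#.", ".#.", ".#.", ".#."),
}


def dot_difference(a, b):
    """Count the dots that differ between two equally sized patterns."""
    return sum(pa != pb
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def match_glyph(glyph, library):
    """Return the library character whose pattern differs least from glyph."""
    return min(library, key=lambda ch: dot_difference(glyph, library[ch]))


def is_known_word(word, dictionary):
    """Dictionary check: does the recognized string form a real word?"""
    return word.lower() in dictionary


# Three slightly damaged glyphs, as a scanner might deliver them.
scanned = [
    ("#..", "#..", "#..", "#..", "##."),  # an L missing one dot
    ("###", ".#.", "##.", ".#.", "###"),  # an I with one stray dot
    ("###", ".#.", ".#.", "...", ".#."),  # a T with a gap in the stem
]

word = "".join(match_glyph(g, LIBRARY) for g in scanned)
print(word, is_known_word(word, {"lit", "tilt"}))
```

A feature extraction program would instead describe each glyph by its strokes and curves before consulting its library; that step is not shown here.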

I have worked with four programs that are typical of both approaches to text scanning and that range in price from a low of $199.00 to a high of $2,900: ReadRight International and TrueScan (IBM); and Read-It! and TextPert (Macintosh). While each will effectively read text from a document on paper and turn it into electronic form, they differ in the ease and speed with which they do so, in their capabilities for reading text from other sources such as books and newspapers, and in the demands that they make on your time to prepare them to work effectively.

ReadRight International, from OCR Systems, is a feature extraction program that my department acquired after we discovered that ReadRight itself would not recognize foreign language characters. This «international» version recognizes Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Spanish, and Swedish (see figure 2), and produces very accurate readings (between 95 and 100%) with materials that have been typed or printed in a clear monospaced format. Its speed on our PS/2 Model 50 is excellent, but it will not read text from books, articles or newspapers. It is compatible with most of the more popular scanners and can be obtained for a university price of about $200.00. OCR Systems has recently announced a new version of ReadRight that will handle more complex formats.

TrueScan, from the Calera Corporation, is a combination of software program and hardware card that requires at least a PC/AT (80286) class computer in order to operate. The model E that I use costs about $2,900 and will not yet recognize foreign language characters. But that is its only weak point, though a serious one for us. The hardware card installs in one of the expansion slots of a PC and functions as a complete, dedicated text scanning computer. It has 4 megabytes of memory and uses a Motorola 68020 microprocessor, the same as the one used in Apple's Macintosh II computer. This computational power allows TrueScan to make extensive use of artificial intelligence to automatically recognize page layouts and scan typewritten documents, pages printed on dot-matrix and laser printers, columns of text from books, journals and newspapers varying in size from 6 to 28 points, and to do so quickly (up to 100 characters per second) [see figure 3].

Pages are recognized in portrait or landscape layout, as are facsimile images. I have never had TrueScan take longer than two minutes to scan and process any type of text, even the very small print in Espasa-Calpe's Colección Austral edition of Darío's Azul. You can save the scanned text in any of two dozen major word processing formats that, for the most part, preserve any italics, bolding or underlining found in the original document. In addition to word processor formats, TrueScan will also save files in Excel, Lotus and Quattro spreadsheet formats, and all of the major graphic image formats. The lack of foreign language character recognition is more than offset by this program's speed and accuracy (up to 99%), and representatives from Calera with whom I have had the opportunity to talk indicate that foreign language capability is under consideration for future versions. Kurzweil, now a division of Xerox Imaging Systems, Inc., has a similar product for IBM and compatible systems called Discover, as does the Caere Corporation, makers of OmniPage, though this product is better known as a software-only program for Macintosh computers.

Read-It! ($199), from the Olduvai Corporation, and TextPert ($595), from CTA (Ciència i Tecnologia Aplicada, S. A., Barcelona and New York) take a different approach to optical character recognition and are least useful for scanning the odd page or two. They are best suited for longer documents such as a collection of poems, a long essay, short story or a novel where you consistently scan from the same type of text page after page. While both programs come ready to scan text automatically, they are truly effective only after you have taught them to read your material.

In the process of teaching Read-It! or TextPert to interpret text, you quickly become aware of the things that can detract from an accurate reading, exclusive of the mere clarity of text on the page: the size of the text, the use of kerning, the presence or absence of ligatures, and special formatting such as bolding and italics.

Type is usually measured in points. Seventy-two point type is one inch high. The text that you are now reading is ten point and the footnotes are set in eight point type. Some programs can read text as small as six points such as is found in this sentence. Kerning is «the adjustment of space between pairs of characters to create a visually even texture that is easy to read» [207], and ligatures result from the actual connection of certain pairs, such as «f» and «i» or «a» and «e» (ae), to cite two typical examples.
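The point measure translates directly into the number of scanner dots available for each character, which helps explain why small type is harder to read. The Python sketch below works through that arithmetic. The 300 dots-per-inch resolution is an assumption chosen for illustration and is not taken from the text; the function names are likewise hypothetical, and point size is treated as a nominal height.

```python
POINTS_PER_INCH = 72  # seventy-two point type is one inch high


def points_to_inches(points: float) -> float:
    """Nominal height of type, in inches."""
    return points / POINTS_PER_INCH


def points_to_dots(points: float, dots_per_inch: float) -> float:
    """Nominal height of type in scanner dots at a given resolution."""
    return points * dots_per_inch / POINTS_PER_INCH


ASSUMED_DPI = 300  # assumed scanner resolution, for illustration only

if __name__ == "__main__":
    for size in (6, 8, 10, 28):
        print(f"{size:>2} pt = {points_to_inches(size):.3f} in "
              f"= {points_to_dots(size, ASSUMED_DPI):.1f} dots at {ASSUMED_DPI} dpi")
```

At the assumed resolution, six point type is only 25 dots high, leaving little detail with which to tell similar letters apart.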

Read-It! and TextPert both «learn» to read your text in essentially the same way. First, the programs make a scanned picture of your text. Second, they show individual letters on the screen and prompt you to enter the corresponding correct form from the keyboard. Figure 4 shows a sample training session from TextPert.

As this process goes on, the program will guess at certain forms already «learned» and ask you to confirm or correct its guess. The end result of the «learning» process is the creation of a template that the program holds on disk or in the computer's memory against which it compares subsequent scannings and readings of the text. The third step is to correct this template by purging any errors, a «u» interpreted as an «n», for example.
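The three steps just described can be sketched in a few lines of Python. This is a simplified model and not the actual workings of Read-It! or TextPert: glyphs are represented by hypothetical labels such as `shape_u` rather than scanned dot patterns, the program only «guesses» shapes it has seen exactly once before, and the person at the keyboard is simulated by a function that mistypes «u» as «n».

```python
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for a scanned dot pattern; a real program stores bitmaps.
Glyph = str


def train(glyphs: List[Glyph],
          ask_user: Callable[[Glyph, Optional[str]], str]) -> Dict[Glyph, str]:
    """Steps one and two: show each glyph, offer a guess if any, record the answer."""
    template: Dict[Glyph, str] = {}
    for glyph in glyphs:
        guess = template.get(glyph)      # None until this shape has been learned
        template[glyph] = ask_user(glyph, guess)
    return template


def correct_template(template: Dict[Glyph, str],
                     fixes: Dict[Glyph, str]) -> Dict[Glyph, str]:
    """Step three: purge known errors from the template."""
    corrected = dict(template)
    corrected.update(fixes)
    return corrected


def read(glyphs: List[Glyph], template: Dict[Glyph, str]) -> str:
    """Interpret glyphs with the template; unknown shapes become '?'."""
    return "".join(template.get(g, "?") for g in glyphs)


def simulated_user(glyph: Glyph, guess: Optional[str]) -> str:
    """Stands in for the person at the keyboard, who keys «u» as «n» by mistake."""
    if guess is not None:
        return guess                     # confirm the program's guess
    first_answers = {"shape_q": "q", "shape_u": "n", "shape_e": "e"}
    return first_answers[glyph]


training_page = ["shape_q", "shape_u", "shape_e", "shape_u", "shape_e"]
template = train(training_page, simulated_user)

word = ["shape_q", "shape_u", "shape_e"]
before = read(word, template)
after = read(word, correct_template(template, {"shape_u": "u"}))
print(before, "->", after)
```

An uncorrected slip in the template is repeated on every later page, which is why the purging step matters.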

By training these programs, it is indeed possible to achieve a very accurate reading with certain text. Read-It! and TextPert differ, however, in the speed and ease of the learning process. Read-It! prefers to work with a template containing twenty to twenty-five examples of each scanned character while TextPert starts guessing from the moment you enter the first character. Read-It! will not read italicized or bolded text mixed in with regular text. TextPert will, and is the easier of the two to «correct». In addition to Spanish, TextPert will read thirty other European character sets from Albanian to Serbo-Croatian, Turkish and Welsh in four alphabets: Roman, Hebrew, Greek and Cyrillic.

As Phillip Robinson pointed out in his article «The Well-Read Mac», a 90 to 95% accurate reading of a page of text containing an average of 2000 characters translates into between 50 and 100 mistakes that you must correct [208]. While this is indeed true, these programs frequently make their mistakes in a consistent manner, such as «qne» for «que», and you can correct many by doing a global search and replace in your word processing program. After «training» TextPert for about an hour, I achieved a ratio of only 8 errors in 2496 characters in a scan of Rubén Darío's El oro de Mallorca, yet the same program trained for one and a half hours on Sor Juana Inés de la Cruz's Respuesta a Sor Filotea demonstrated consistent difficulties in recognizing the letters «u», «m» and «n». Thus, the degree of accuracy depends on the clarity and quality of print in each individual text.
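Both the accuracy arithmetic and the global search and replace can be sketched briefly in Python. The sketch is only an illustration of the idea; the function names, the sample sentence and the correction table are assumptions, and a word processor's own search-and-replace command does the same job.

```python
import re


def accuracy(errors: int, total_chars: int) -> float:
    """Fraction of characters read correctly."""
    return 1 - errors / total_chars


def expected_errors(total_chars: int, accuracy_rate: float) -> int:
    """Number of mistakes implied by a given accuracy on a page."""
    return round(total_chars * (1 - accuracy_rate))


def fix_consistent_errors(text: str, corrections: dict) -> str:
    """Replace whole-word misreadings that the program makes consistently."""
    for wrong, right in corrections.items():
        pattern = rf"\b{re.escape(wrong)}\b"
        text = re.sub(pattern, lambda _m, r=right: r, text)
    return text


# Hypothetical correction table built up while proofreading.
CORRECTIONS = {"qne": "que"}

sample = "El libro qne leo y la casa qne veo"
fixed = fix_consistent_errors(sample, CORRECTIONS)

if __name__ == "__main__":
    print(fixed)
    print(f"8 errors in 2496 characters: {accuracy(8, 2496):.2%} accurate")
    print(f"95% accuracy on 2000 characters: {expected_errors(2000, 0.95)} errors")
```

Matching on word boundaries keeps the replacement from touching longer words that merely contain the misread string.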

The question that you must ask yourself is: «Is text scanning worth the cost and the bother?» You still have to proof and correct the text after the computer has read it, but you also have to do that if you have a typist key in the text for you or if you type it in yourself. You may also object to having the program tie up your computer for long periods of time when you need it for other projects. I catch up on my professional reading or grade papers while waiting to turn the page in the book or insert the next set of pages. True computer multitasking (Macintosh's MultiFinder is not) will eventually eliminate this problem, and you will be able to have the OCR program read your text in the background while you write or telecommunicate at the same time.

If you have plenty of time to train it, Read-It! is a fine choice for a Macintosh. If you can afford the extra cost, TextPert is even better. The reputed best of the Macintosh text scanning programs is Caere's OmniPage, but to use it you must have at minimum a Macintosh II with a hard disk and 4 megabytes of memory. On the IBM side, ReadRight International is quite adequate for basic scanning needs, and its maker, OCR Systems, will have an even better version out shortly. TrueScan is high in price, but if you anticipate doing a lot of scanning, it is a fast, excellent, reliable choice. No matter which program you choose, though, you must exercise the same caution regarding copyrighted material when scanning text that you do when making photocopies.

Text scanning has quickly become one of the «hotter» issues in the computer industry. Today's major trade journals regularly feature articles on text scanning programs and tout their ability to read foreign language characters: «Omnifont OCR Offers Learning Capability, Reads Foreign Text» and «Foreign Language Support Added to Xerox Scanners» [209]. In his article, «Reading into the Macintosh», Lawrence Stevens used Spanish text (Gabriel García Márquez's El amor en los tiempos del cólera) to carry out part of his evaluation of several well-known text scanning programs [210]. This flurry of activity is due to increasing competition in the foreign markets, particularly in the area of desktop publishing. All of which makes it a lot easier for us to spend our time doing what we do best and, in my case especially, that is not typing text into a computer.



