Humanist Discussion Group, Vol. 33, No. 808.
Department of Digital Humanities, King's College London
Hosted by King's Digital Lab
www.dhhumanist.org
Submit to: firstname.lastname@example.org

From: C. M. Sperberg-McQueen
Subject: (Humanist 33.803) (131)

From: Henry Schaffer
Subject: Re: [Humanist] 33.803: proprietary formats (11)

From: Henry Schaffer
Subject: Re: [Humanist] 33.803: proprietary formats (9)

--------------------------------------------------------------------------
Date: 2020-04-28 20:14:27+00:00
From: C. M. Sperberg-McQueen
Subject: (Humanist 33.803) on proprietary formats

Jonothan Halfin observes quite correctly that many common proprietary formats (and formerly proprietary formats) can be read by open source software, so that "it makes little difference in any research I have done to date, whether the source material I am seeking is archived in .pdf, in .doc, in .xls, in .ppt, in .jpg, in .tif, in .png, in .gif, in .wpd, in .ps, or in .txt or .rtf. (or for that matter, any other commonly used file format.)"

A similar argument has been made within the library and archival community to the effect that one needn’t really worry about commonly used formats, because the market will ensure that there will always be decoders for them.

There are two reasons some people are unwilling to join JH in his conclusion that "whether older archived digital media is still in a usable format [is] rather a moot concern".

First, the phenomenon is largely restricted to very widely used formats which can be decoded without violation of relevant patents or other intellectual property. That many programs can handle JPEG is good, and unsurprising given that it is defined by an open, publicly available standard and has never been proprietary. One might have less luck with any of the scores of proprietary formats once heavily marketed by vendors and now abandoned. Yes, Open Office can read current Microsoft Office formats, but how well does it do on Multiplan? Or for that matter on early versions of Microsoft Word? I knew some people who were devoted users of Edix/Wordix, but the last time I pulled down an Import menu in Open Office, Edix/Wordix was not on the list of supported formats.
(And how is Gimp on Kodak Photo CDs? ouch!)

Yes, of course many readers of this list have not heard of Edix/Wordix, or of many of the other proprietary formats used by academics a few decades ago. That is the point. They were current then, and largely forgotten today. Some software and formats that are current today will still be familiar in a few decades; some will be forgotten. When it comes to making sure that data you care about is still readable in twenty or forty years, all you have to do is guess right about which is which. Who am I to say you won’t be lucky?

At the DH conference in Utrecht last summer there was a session on virtual-reality projects, with a talk I found rather poignant about some older VR projects into which their developers had sunk a great deal of time, which are now effectively inaccessible because they can be run only on older hardware and software, or in some cases on emulators which not everyone in the potential audience is likely to have lying around. All of them had been done in formats that were known to be proprietary, but also known to be quite commonly used. But despite being commonly used (by some standards, at any rate), those formats are not now easily readable.

If you use proprietary formats for work that you would like to see remain available for a while (say, while you are alive), then all will be well, as long as the format you choose is so commonly used and so commercially significant that someone (else) will write an open source decoder for it. If it turns out otherwise, well, you’ll be like a recording artist who discovers that all of their master tapes and all of their songs are owned by someone else. So you'll have plenty of company.

The second reason is that existing decoders are so often faulty.
It’s not too hard to reverse engineer some formats *in part*; a program to read WordStar files and translate them into a form other then-available software could read was one of the first programs I ever wrote for someone else to use. It lost all of the formatting information (though I think it managed to detect and preserve paragraph breaks), but the student had been so panicked by the fear that they were going to have to retype their entire thesis (or rewrite it, in the case of the chapters for which they did not have printouts) that they were grateful just to get the character stream back out.

A later program I wrote to decipher Word Perfect’s binary format and produce SGML was, I think, more successful (but then, I did have access to a running copy of Word Perfect and could run experiments to try to understand the format, which was not the case for the WordStar project), but my program was only ever intended to handle the one set of files I was interested in.

That better programmers with more time can often do better is clear. But it’s also clear that it is rare for any decoder to handle everything correctly. (Is there any evidence that it has ever happened? that it has ever happened in a non-trivial case? that it has ever happened for the particular proprietary format you would like to rely on?)

I don’t use word-processor formats much, but every year or two I write a paper which a book editor or a journal cannot handle unless I convert it into Word. So I generate HTML from my TEI-encoded source, and import it into Open Office, try to clean up some of the worst excrescences of Open Office’s import facility, and save it as a Word document. After copy editing and perhaps some reformatting by the publisher, the editor will send it back to me in Word and I will open it in Open Office. This has happened ten or twenty times in the last twenty years, and I have yet to get a document back in which the Open Office / Word interconversions have gotten everything right.
A list has been changed from a bulleted list to a numbered list, or vice versa, or screwed up in some other way. Two of the footnotes have mysteriously been changed to gibberish, or one paragraph has been truncated in the middle. If the conversion routines botch things this badly for simply formatted expository prose, I don't like to think what they do with complicated documents.

So when I hear people say that for commonly used formats there will always be satisfactory conversion programs, I always want to ask "have you ever actually *looked* at the output of those conversion programs?" (Actually what I want to ask is slightly different, but it's rather rude, so I don't want to say it in front of Willard.)

Those who are content for the work they do to be preserved for the future only in mutilated form are welcome to use proprietary formats. But do please note that when you ask for sympathy later, after the years of work you poured into that proprietary format are gone because the format is no longer supported, you may only get a shrug: Your gun, your bullet, your foot.

As for me, I have come to think that putting anything you care about into any format not defined by an openly available specification is like leaving the only copy of your newly completed manuscript lying on the desk in your study and then turning around and throwing a lit Molotov cocktail into the room before closing the door and going into the front room to watch Netflix. Enjoy the film!

********************************************
C. M. Sperberg-McQueen
Black Mesa Technologies LLC
email@example.com
http://www.blackmesatech.com
********************************************

--------------------------------------------------------------------------
Date: 2020-04-28 12:47:10+00:00
From: Henry Schaffer
Subject: Re: [Humanist] 33.803: proprietary formats

This discussion of proprietary formats nudged my memory and I found that I still have a C program (with documentation) which I wrote back in the mid 1980s to translate an ASCII file into WordStar 1.4 format. So it gives some information about that format.

Might this be of interest to anyone? I could easily distribute it (program + C source = 262 lines 10kB) via email or whatever (e.g. should I put it in GitHub?)

--henry schaffer

--------------------------------------------------------------------------
Date: 2020-04-28 13:21:05+00:00
From: Henry Schaffer
Subject: Re: [Humanist] 33.803: proprietary formats

Oh, one more item from the past which makes me grin - at the end of the doc I give my contact information - my postal mail address and also "TSCHES@TUCC.BITNET, TSCHES@TUCC.TUCC.EDU, firstname.lastname@example.org or ...mcnc!ecsvax!hes"

The last one was known as the "bang address".

--henry

_______________________________________________
Unsubscribe at: http://dhhumanist.org/Restricted
List posts to: email@example.com
List info and archives at: http://dhhumanist.org
Listmember interface at: http://dhhumanist.org/Restricted/
Subscribe at: http://dhhumanist.org/membership_form.php
Editor: Willard McCarty (King's College London, U.K.; Western Sydney University, Australia)
Software designer: Malgosia Askanas (Mind-Crafts)
This site is maintained under a service level agreement by King's Digital Lab.