Drobe :: The archives

Processing text the easy way

By Paul Beverley. Published: 7th Sep 2003, 19:35:03

Paul Beverley shows search'n'replace who's boss

Every publication, online and offline, has its own house rules on the style and presentation of content. Archive Magazine editor Paul Beverley explains how he uses a new freeware application to prepare text for publication.

Last month, "My Favourite Application" was SpamStamp. Indeed, SpamStamp still continues very high on the list, but this month a new PD program has appeared which is my latest "Favourite Application". It's still being developed, but it's most certainly already a very usable application.

The history
Some months ago, I asked on Archive-on-Line if someone could write me a program that would carry out a sequence of search and replace functions that I could specify by using a script. The particular job I had in mind was speeding up the production of the words-only version of the magazine each month. I take the Text file from inside the directory-type Impression document which contains all the text of all the articles in the magazine. This then has to be massaged by a whole string of search and replace actions. I wanted to automate this.

At the time, there were various suggestions as to how I could improve things, but no-one offered to write an application. Then, out of the blue, Paul Sprangers sent a copy of ConvText - and this is exactly what I was asking for.

[Screenshot: ConvText in action - ConvText finishes processing a text file]

The job
The script below is what is used to convert the Text file. Having selected that particular script within ConvText, I just drag the file to the ConvText iconbar icon and the output file appears a few seconds later - brilliant.

SCRIPT:Archive
{nextstory }:
ﬁ:fi
ﬂ:fl
{REMOVE SMART QUOTES}
{REMOVE MANY SPACES}
{REMOVE MANY NEW LINES}
[10][10]:[10]
[10]:[10]16.11[10]
[160]:[32]
[173]:-
[151]:-
[32][10]:[10]


I think it should be fairly self-explanatory, but I'll explain it anyway. On each line, the search element is the bit in front of the colon (:), and the thing which replaces it is after the colon. So, the first line after the SCRIPT title replaces all occurrences of "{nextstory }" with nothing - there's nothing after the colon. The two ligatures are then expanded into "fi" and "fl".
The next three lines are three of the predefined actions that Paul has created. The first changes all smart (or 'sexed') quotation marks into the straight equivalent. The second cuts down all multiple spaces into single spaces, and the final one removes multiple linefeeds, but only down to a minimum of two. This maintains a single blank line between paragraphs. In fact, in this case, I don't want blank lines between paragraphs, so I add the next line ([10][10]:[10]) which turns all pairs of ASCII 10 characters (i.e. linefeeds) into one.

In the next line, I add the volume and issue number between every paragraph and the next, and I guess the rest of it doesn't really need detailed explanation - [160] is a hard space and [32] is an ordinary space.
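
To make the rule format concrete, here is a minimal Python sketch of how such a script could be interpreted. It is not ConvText's actual implementation. The function names (`decode`, `parse_rule`, `apply_rules`) are invented for illustration, it assumes rules are applied strictly in order as plain-text replacements, and it does not model the predefined actions such as {REMOVE MANY SPACES}.

```python
import re


def decode(part: str) -> str:
    """Turn bracketed decimal codes such as [10] into the character they name."""
    return re.sub(r"\[(\d+)\]", lambda m: chr(int(m.group(1))), part)


def parse_rule(line: str) -> tuple[str, str]:
    """Split 'search:replace' at the first colon and decode both halves."""
    search, _, replace = line.partition(":")
    return decode(search), decode(replace)


def apply_rules(text: str, lines: list[str]) -> str:
    """Apply each rule in order (assumed); later rules see earlier results."""
    for line in lines:
        search, replace = parse_rule(line)
        text = text.replace(search, replace)
    return text


# A cut-down version of the Archive script, without the predefined actions.
rules = ["{nextstory }:", "[10][10]:[10]", "[160]:[32]", "[32][10]:[10]"]
sample = "{nextstory }First\xa0para \n\nSecond para\n"
result = apply_rules(sample, rules)
print(repr(result))
```

In this sketch the story marker is removed, the blank line between the paragraphs is closed up, the hard space becomes an ordinary space, and the trailing space before the linefeed is dropped.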

"I could get it to..."
As I used ConvText I began to think of other jobs I could use it for. Here are the first two that occurred to me:

SCRIPT:Single line addr
{REMOVE MANY NEW LINES}
[10][10]:zczc
[10]:,[32]
zczc:[10]

SCRIPT:Multi line addr
[10]:zczc
,[32]:[10]
zczc:[10][10]


These two convert an address list back and forth between a multi-line format for printing out labels and a single line format with commas suitable for printing the addresses on sheets of paper.
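
The trick in both scripts is the temporary marker "zczc", which protects one kind of line break while the other kind is rewritten. A rough Python equivalent follows. The function names are hypothetical, and the sketch assumes the input has already been reduced to at most one blank line between addresses, which is what {REMOVE MANY NEW LINES} is described as doing.

```python
MARKER = "zczc"  # same placeholder as in the scripts; must not occur in the data


def to_single_line(addresses: str) -> str:
    """One address per line, fields separated by comma and space."""
    text = addresses.replace("\n\n", MARKER)  # protect the gap between addresses
    text = text.replace("\n", ", ")           # join the lines of each address
    return text.replace(MARKER, "\n")         # restore one linefeed per address


def to_multi_line(addresses: str) -> str:
    """One field per line, with a blank line between addresses."""
    text = addresses.replace("\n", MARKER)    # protect the address boundaries
    text = text.replace(", ", "\n")           # split each address into lines
    return text.replace(MARKER, "\n\n")       # blank line between addresses


labels = "A Smith\n1 High St\nNorwich\n\nB Jones\n2 Low Rd\nYork"
sheet = to_single_line(labels)
print(sheet)
print(to_multi_line(sheet) == labels)
```

The round trip only works if no field contains a comma followed by a space, which is a limitation of the scripts as well as of this sketch.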

"Could you just get it to...?"
As I worked with ConvText, it pretty soon became obvious that a wildcard function would be useful, so Paul added # and * wildcards, along with a way to escape # and * just in case I wanted to do a search and replace on either of those two characters, though I could of course have used [35] and [42].
What's more, the contents of the wildcards can be used in the replace string. For example:

[32]#pounds:[32]£#
[32]##pounds:[32]£##
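
For readers more used to regular expressions, the two rules above can be approximated in Python as follows. This is only a sketch: the article does not say exactly what # matches, so it assumes each # stands for exactly one character and that the nth # in the replacement reuses the nth # in the search. The * wildcard is not modelled, and the function names are made up for the example.

```python
import re


def compile_wildcard_rule(search: str, replace: str):
    """Build a regex and a replacement template from a rule containing #."""
    # Assumption: each # in the search string matches exactly one character.
    pattern = "".join("(.)" if ch == "#" else re.escape(ch) for ch in search)
    parts = []
    group = 0
    for ch in replace:
        if ch == "#":
            group += 1
            parts.append(f"\\g<{group}>")  # reuse the nth captured character
        else:
            parts.append(ch.replace("\\", "\\\\"))  # keep backslashes literal
    return re.compile(pattern), "".join(parts)


def apply_wildcard_rule(text: str, search: str, replace: str) -> str:
    """Apply one wildcard rule to the whole text."""
    regex, template = compile_wildcard_rule(search, replace)
    return regex.sub(template, text)


text = "costs 5pounds or 25pounds"
step1 = apply_wildcard_rule(text, " #pounds", " £#")
step2 = apply_wildcard_rule(step1, " ##pounds", " £##")
print(step2)
```

The one-character rule only catches the single-digit amount, which is why a second rule with two wildcards is needed for two-digit amounts.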


Sledges and nuts
One word of warning - beware using a sledge to crack a nut. I had a drawfile and I wanted to get the text out of it for putting into an article. I knew that as long as the text in the drawfile hadn't been converted to paths, the actual text would be there inside the drawfile. So I changed the filetype to 'text' and set to work. (Are you ahead of me?) I studied the format of the files, seeing what special characters occurred before each bit of text, and I tried to work out how I could use this wonderful automatic system to extract the text. "Once I've worked out a script, I and other users of ConvText will be able to just drag-and-drop drawfiles to our hearts' content", I thought.

I laboured for hours trying to work out clever ways of doing this (managing to show up some bugs in the application as I proceeded). I was beginning to get to the point of admitting that this was a bit beyond me when I asked on Archive-on-Line if there was an easier way. "Load the drawfile into Vector and there's a menu item to 'Save text'" - and that was only one of a number of ways of doing it that were suggested.

Munging Impression files
One search and replace facility that has eluded us for years is in Impression. There's the facility to search for a particular style, but you can't replace it with another style name. I have one contributor who sends in articles for Archive using style names that are not the ones I use in a standard Archive document, so I have to change these names in the incoming Impression file before cutting and pasting it into one of my magazine files. Now I can do it automatically.

SCRIPT:Someone's articles
SubHead:Heading
ByLine:Main Heading
Heading:Box Heading
Emphasise:Italic
Together:Normal
! :![32]
. :.[32]
? :?[32]


This gentleman also uses double spaces at the end of each sentence, so those are changed to single spaces - easy-peasy!
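
One thing to watch with renaming rules like these is ordering: if rules are applied strictly one after another, text that an early rule turns into "Heading" would also be seen by the later "Heading" rule. The article does not say how ConvText handles this, so the following Python sketch simply shows one way to rename everything in a single pass so that replaced text is never rescanned. `STYLE_MAP` and `rename_styles` are illustrative names, and the `{"SubHead" on}` sample assumes style codes follow the same pattern as the `{"email" on}` code used elsewhere in this article.

```python
import re

# Old style name -> new style name, taken from the script above.
STYLE_MAP = {
    "SubHead": "Heading",
    "ByLine": "Main Heading",
    "Heading": "Box Heading",
    "Emphasise": "Italic",
    "Together": "Normal",
}


def rename_styles(text: str, mapping: dict[str, str]) -> str:
    """Rename all styles in one pass; output text is not matched again."""
    # Longest names first, so a long name wins over a shorter one it starts with.
    names = sorted(mapping, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda m: mapping[m.group(0)], text)


sample = '{"SubHead" on}x{"Heading" on}y{"ByLine" on}'
renamed = rename_styles(sample, STYLE_MAP)
print(renamed)
```

Here "SubHead" ends up as "Heading" rather than being renamed a second time.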

A real time-saver
The article I get that needs most editing (and this is no reflection on its author, Steve Knattress) to get it into Archive-standard format is Archive-on-Line OffLine. The reason is that it consists of emails dashed off quite quickly in some cases, never intended for publication, with abbreviations, etc. So here's how I save time on that.

SCRIPT:AoL
<: {"email" on}
> :{"email" off}[32]
{REMOVE MANY NEW LINES}
[10][10]:[10]
/: {"italic" on}
/ :{"italic" off}[32]
*: {"italic" on}
* :{"italic" off}[32]
RISC OS:RISC[160]OS
RISCOS:RISC[160]OS
RISC[160]OS Ltd:RISCOS Ltd
Risc OS:RISC[160]OS
RiscOS:RISC[160]OS
RO:RISC[160]OS
Risc PC:RiscPC
RISC PC:RiscPC
Disk:disc
BASIC:Basic
web site:website
CD≠ROM:CD≠ROM
Draw file:drawfile
icon bar:iconbar
Mb:MB
Gb:GB
x :×
StrongArm:StrongARM
Jpeg:JPEG
jpeg:JPEG
e-mail:email
hz:Hz
file type:filetype


I won't explain it in detail, but the first few commands are for adding styles to email addresses and sometimes URLs which are enclosed in angle brackets by Steve. Then I try to pick up where people have put asterisks (*) or slashes (/) around words or phrases to add emphasis. OK, there are places where angle brackets, slashes and asterisks might occur for other reasons, but all I do is look at the resultant Impression file and, if there are spurious styles, go back to the original text file, change it slightly and do the drag-and-drop again - quick and easy! I may have to do this a few times to account for each and every spurious style, but certainly that takes a lot less time than manually editing every single email address and every emphasis in the whole document.
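
As an aside, the spurious matches mentioned above can be reduced by only treating markers as emphasis when they come in matching pairs around a word or phrase. The Python sketch below shows that idea with a regular expression. It is an alternative technique, not what ConvText does, and the function name and the exact pairing rules are assumptions made for the example.

```python
import re

# An opening * or / must follow whitespace (or start the text) and be followed
# by a non-space; the same marker must close the phrase on the same line.
EMPHASIS = re.compile(r"(?<!\S)([*/])(\S(?:.*?\S)?)\1(?!\w)")


def mark_emphasis(text: str) -> str:
    """Wrap *phrase* or /phrase/ in italic on/off style codes."""
    return EMPHASIS.sub(r'{"italic" on}\2{"italic" off}', text)


marked = mark_emphasis("this is *very* important and /so/ is this")
untouched = mark_emphasis("see a/b/c and 2 * 3 * 4")
print(marked)
print(untouched)
```

Paths such as a/b/c and spaced-out arithmetic are left alone here, although some manual checking of the result would still be sensible.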

And the other important point to make is that each time I use this script, I'll find other corrections that I haven't covered, so I'll add them to the script ready for next month. So, month-on-month, at a single drag-and-drop, a higher and higher percentage of the corrections will be done automatically.

Other uses
In addition to those I've already mentioned, I've come up with another four uses of ConvText in the first couple of weeks of using it. I won't bore you with them all, but simply say that for someone like me, for whom words are his stock-in-trade, this is a really productive application - many thanks to Paul Sprangers for producing it.

Links


ConvText can be found in the downloads section on the website of Archive's sister publication Living with Technology. Paul's article on ConvText will appear in the next issue of Archive magazine, due out around 19th September.


Discussion


I do this sort of thing quite often, too. But I use one of Awk, Perl, or sed, depending on how I'm feeling.

nunfetishist on 8/9/03 1:35PM

I suspect this is more friendly than those tools. It took me a week to pick up Perl.

Chris, drobe.co.uk

diomus on 8/9/03 3:33PM

I think I prefer Perl for this kind of thing - the example where pairs of [10]s are changed to [10], what happens if you have 4 [10]s? Do you get 1 [10], 2[10]s (it depends on whether it starts the search from the start, or from the last change)?

With Perl, you'd do ~s:\n{2,}:\n:; which, to me, is a lot clearer as to what happens. But then I have read "Mastering Regular Expressions" quite thoroughly :)

Obviously I could try it out, but I'm in the office, and my RiscPC is sitting at home...

tribbles2 on 8/9/03 5:12PM

Actually, thinking about it, I misread the original article, and it's probably the same as ~s:\n\n:\n:g; (I thought that it was meant to remove multiple lines, whereas the 'macro' beforehand did it).

tribbles2 on 8/9/03 5:15PM

diomus: More friendly? I have to read the body of this article *very* carefully to work out wtf the scripts do!

nunfetishist on 8/9/03 5:34PM

I don't read Archive-on-Line. Maybe if I did, I might have been able to point out to Paul that an app to do the sort of thing he was asking about exists already. Ho hum. :-/

The scripts are also human readable :-p

VinceH on 8/9/03 7:48PM

VinceH: Which is?

dansguardian on 8/9/03 11:23PM

I can't tell you that. It's top secret, and I really don't have time at the mo to be running around exotic locations, being chased by and getting into battles with any of MI6's double-O agents. Especially that 007, who just damned well refuses to die. I'm far too busy working on WebChange for that sort of malarkey ATM.

VinceH on 9/9/03 6:23PM
