
Processing text the easy way

By Paul Beverley. Published: 7th Sep 2003, 19:35:03.

Paul Beverley shows search'n'replace who's boss

Every publication, online and offline, has its own house rules on the style and presentation of content. Archive Magazine editor Paul Beverley explains how he uses a new freeware application to prepare text for publication.

Last month, "My Favourite Application" was SpamStamp. Indeed, SpamStamp still ranks very high on the list, but this month a new PD program has appeared which is my latest "Favourite Application". It's still being developed, but it's most certainly already a very usable application.

The history
Some months ago, I asked on Archive-on-Line if someone could write me a program that would carry out a sequence of search and replace functions that I could specify by using a script. The particular job I had in mind was speeding up the production of the words-only version of the magazine each month. I take the Text file from inside the directory-type Impression document which contains all the text of all the articles in the magazine. This then has to be massaged by a whole string of search and replace actions. I wanted to automate this.

At the time, there were various suggestions as to how I could improve things, but no-one offered to write an application. Then, out of the blue, Paul Sprangers sent a copy of ConvText - and this is exactly what I was asking for.

Screenshot: ConvText in action, as it finishes processing a text file

The job
The script below is what is used to convert the Text file. Having selected that particular script within ConvText, I just drag the file to the ConvText iconbar icon and the output file appears a few seconds later - brilliant.

SCRIPT:Archive
{nextstory }:
ﬁ:fi
ﬂ:fl
{REMOVE SMART QUOTES}
{REMOVE MANY SPACES}
{REMOVE MANY NEW LINES}
[10][10]:[10]
[10]:[10]16.11[10]
[160]:[32]
[173]:-
[151]:-
[32][10]:[10]


I think it should be fairly self-explanatory, but I'll explain it anyway. On each line, the search element is the bit in front of the colon (:), and the thing which replaces it is after the colon. So, the first line after the SCRIPT title replaces all occurrences of "{nextstory }" with nothing - there's nothing after the colon. The two ligatures are then expanded into "fi" and "fl".
The next three lines are three of the predefined actions that Paul has created. The first changes all smart (or 'sexed') quotation marks into the straight equivalent. The second cuts down all multiple spaces into single spaces, and the final one removes multiple linefeeds, but only down to a minimum of two. This maintains a single blank line between paragraphs. In fact, in this case, I don't want blank lines between paragraphs, so I add the next line ([10][10]:[10]) which turns all pairs of ASCII 10 characters (i.e. linefeeds) into one.

In the next line, I add the volume and issue number between every paragraph and the next, and I guess the rest of it doesn't really need detailed explanation - [160] is a hard space and [32] is an ordinary space.
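
For readers more at home in a general-purpose language, the way such a script is read can be sketched in Python. This is only an illustration of the rule format described above, not ConvText's actual implementation: the function names are hypothetical, the rules are assumed to be applied in order from top to bottom, each to the whole text, and the predefined actions in curly brackets are not handled.

```python
import re

def decode(rule_part):
    """Turn bracketed decimal codes such as [10] into the characters they stand for."""
    return re.sub(r"\[(\d+)\]", lambda m: chr(int(m.group(1))), rule_part)

def parse_rule(line):
    """Split 'search:replace' at the first colon and decode both halves."""
    search, _, replace = line.partition(":")
    return decode(search), decode(replace)

def apply_rules(text, rule_lines):
    """Apply each rule in turn to the whole text (assumed order of evaluation)."""
    for line in rule_lines:
        search, replace = parse_rule(line)
        text = text.replace(search, replace)
    return text

# Three rules taken from the Archive script: drop the story marker,
# turn hard spaces into ordinary ones, strip a space before a linefeed.
rules = ["{nextstory }:", "[160]:[32]", "[32][10]:[10]"]
sample = "{nextstory }RISC\xa0OS \nnews"
result = apply_rules(sample, rules)
print(repr(result))
```

A search string that itself contains a colon would need some form of escaping, which this sketch does not attempt.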

"I could get it to..."
As I used ConvText I began to think of other jobs I could use it for. Here are the first two that occurred to me:

SCRIPT:Single line addr
{REMOVE MANY NEW LINES}
[10][10]:zczc
[10]:,[32]
zczc:[10]

SCRIPT:Multi line addr
[10]:zczc
,[32]:[10]
zczc:[10][10]


These two convert an address list back and forth between a multi-line format for printing out labels and a single line format with commas suitable for printing the addresses on sheets of paper.
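
The trick in both scripts is the temporary marker "zczc", which protects one kind of linefeed while the other kind is being changed. The same idea in Python might look like the sketch below. The function names and the sample addresses are made up for illustration, and the input is assumed to have exactly one blank line between addresses (the job {REMOVE MANY NEW LINES} does in the first script).

```python
MARKER = "zczc"  # same marker as in the scripts; it must not occur in the data

def to_single_line(text):
    """Blank-line-separated address blocks -> one comma-separated address per line."""
    text = text.replace("\n\n", MARKER)   # protect the gaps between addresses
    text = text.replace("\n", ", ")       # join the lines within each address
    return text.replace(MARKER, "\n")     # restore one linefeed per address

def to_multi_line(text):
    """One comma-separated address per line -> blank-line-separated blocks."""
    text = text.replace("\n", MARKER)     # protect the address boundaries
    text = text.replace(", ", "\n")       # split each address into lines
    return text.replace(MARKER, "\n\n")   # blank line between addresses

labels = "A. Reader\n1 High Street\nNorwich\n\nB. Writer\n2 Low Road\nYork"
single = to_single_line(labels)
print(single)
```

Going one way and then the other returns the original list, provided no line of an address contains a comma followed by a space of its own.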

"Could you just get it to...?"
As I worked with ConvText, it pretty soon became obvious that a wildcard function would be useful, so Paul added # and * wildcards, along with a way of entering a literal # or * just in case I wanted to do a search and replace on either of those two characters, though I could of course have used [35] and [42].
What's more, the contents of the wildcards can be used in the replace string. For example:

[32]#pounds:[32]£#
[32]##pounds:[32]£##
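
For comparison, here is roughly the same pair of rules written with Python regular expressions. It rests on an assumption: that each # stands for exactly one arbitrary character, which is then carried over to the # in the replacement. The function name is hypothetical.

```python
import re

def pounds_to_symbol(text):
    """Rewrite ' 5pounds' as ' £5' and ' 25pounds' as ' £25'."""
    # One captured character, like [32]#pounds:[32]£#
    text = re.sub(r" (.)pounds", r" £\1", text)
    # Two captured characters, like [32]##pounds:[32]£##
    text = re.sub(r" (..)pounds", r" £\1", text)
    return text

converted = pounds_to_symbol("It costs 5pounds or 25pounds.")
print(converted)
```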


Sledges and nuts
One word of warning - beware using a sledge to crack a nut. I had a drawfile and I wanted to get the text out of it for putting into an article. I knew that as long as the text in the drawfile hadn't been converted to paths, the actual text would be there inside the drawfile. So I changed the filetype to 'text' and set to work. (Are you ahead of me?) I studied the format of the files, seeing what special characters occurred before each bit of text, and I tried to work out how I could use this wonderful automatic system to extract the text. "Once I've worked out a script, I and other users of ConvText will be able to just drag-and-drop drawfiles to our hearts' content", I thought.

I laboured for hours trying to work out clever ways of doing this (managing to show up some bugs in the application as I proceeded). I was beginning to get to the point of admitting that this was a bit beyond me when I asked on Archive-on-Line if there was an easier way. "Load the drawfile into Vector and there's a menu item to 'Save text'", came the reply - and that was only one of a number of ways of doing it that were suggested.

Munging Impression files
One search and replace facility that has eluded us for years is in Impression. There's the facility to search for a particular style, but you can't replace it with another style name. I have one contributor who sends in articles for Archive using style names that are not the ones I use in a standard Archive document, so I have to change these names in the incoming Impression file before cutting and pasting it into one of my magazine files. Now I can do it automatically.

SCRIPT:Someone's articles
SubHead:Heading
ByLine:Main Heading
Heading:Box Heading
Emphasise:Italic
Together:Normal
! :![32]
. :.[32]
? :?[32]


This gentleman also uses double spaces at the end of each sentence, so those are changed to single spaces - easy-peasy!
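
One thing to watch with a list of renames like this: if the rules are applied one after another (an assumption, since the order of evaluation is not spelled out here), a name produced by an early rule can be picked up again by a later one. "SubHead" becomes "Heading", which the third rule would then turn into "Box Heading". The Python sketch below contrasts that behaviour with a single pass that renames everything at once. It is only an illustration and the function names are made up.

```python
import re

# The first three renames from the script, in the same order.
RENAMES = {"SubHead": "Heading", "ByLine": "Main Heading", "Heading": "Box Heading"}

def rename_sequential(text):
    """Apply the renames one after another; later rules see earlier results."""
    for old, new in RENAMES.items():
        text = text.replace(old, new)
    return text

def rename_single_pass(text):
    """Rename every style in one scan, so replaced text is never re-examined."""
    names = sorted(RENAMES, key=len, reverse=True)  # longest names tried first
    pattern = re.compile("|".join(re.escape(name) for name in names))
    return pattern.sub(lambda m: RENAMES[m.group(0)], text)

print(rename_sequential("SubHead"))   # chained through two rules
print(rename_single_pass("SubHead"))  # renamed exactly once
```

If chaining does turn out to be a problem in practice, reordering the rules so that "Heading" is dealt with before anything is renamed to it has the same effect as the single pass.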

A real time-saver
The article I get that needs most editing (and this is no reflection on its author, Steve Knattress) to get it into Archive-standard format is Archive-on-Line OffLine. The reason is that it consists of emails dashed off quite quickly in some cases, never intended for publication, with abbreviations, etc. So here's how I save time on that.

SCRIPT:AoL
<: {"email" on}
> :{"email" off}[32]
{REMOVE MANY NEW LINES}
[10][10]:[10]
/: {"italic" on}
/ :{"italic" off}[32]
*: {"italic" on}
* :{"italic" off}[32]
RISC OS:RISC[160]OS
RISCOS:RISC[160]OS
RISC[160]OS Ltd:RISCOS Ltd
Risc OS:RISC[160]OS
RiscOS:RISC[160]OS
RO:RISC[160]OS
Risc PC:RiscPC
RISC PC:RiscPC
Disk:disc
BASIC:Basic
web site:website
CD­ROM:CD­ROM
Draw file:drawfile
icon bar:iconbar
Mb:MB
Gb:GB
x :×
StrongArm:StrongARM
Jpeg:JPEG
jpeg:JPEG
e-mail:email
hz:Hz
file type:filetype


I won't explain it in detail, but the first few commands are for adding styles to email addresses and sometimes URLs which are enclosed in angle brackets by Steve. Then I try to pick up where people have put asterisks (*) or slashes (/) around words or phrases to add emphasis. OK, there are places where angle brackets, slashes and asterisks might occur for other reasons, but all I do is look at the resultant Impression file and, if there are spurious styles, go back to the original text file, change it slightly and do the drag-and-drop again - quick and easy! I may have to do this a few times to account for each and every spurious style, but certainly that takes a lot less time than manually editing every single email address and every emphasis in the whole document.
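
A different technique, for anyone doing this job in a scripting language rather than ConvText, is to match the opening and closing marks as a pair, which cuts down the number of spurious styles. The Python sketch below is a rough illustration under stated assumptions: the function names are hypothetical, only addresses containing an @ are treated as email (so URLs in angle brackets are left alone), and emphasis is only recognised when the marks hug a word on both sides.

```python
import re

def mark_emails(text):
    """Wrap <user@host> in the "email" style tags instead of angle brackets."""
    return re.sub(r"<([^<>\s]+@[^<>\s]+)>",
                  r'{"email" on}\1{"email" off}', text)

def mark_emphasis(text):
    """Wrap *word* or /word/ in the "italic" style tags.

    The opening mark must not follow a letter or digit, the closing mark
    must be the same character, and it must not be followed by one.
    """
    return re.sub(r"(?<!\w)([*/])(\w[^*/\n]*?)\1(?!\w)",
                  r'{"italic" on}\2{"italic" off}', text)

line = mark_emphasis(mark_emails("Mail <fred@example.com>, it is *really* good"))
print(line)
```

Arithmetic such as "2 * 3 * 4" passes through untouched, because the asterisks there are surrounded by spaces.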

And the other important point to make is that each time I use this script, I'll find other corrections that I haven't covered, so I'll add them to the script ready for next month. So, month-on-month, at a single drag-and-drop, a higher and higher percentage of the corrections will be done automatically.

Other uses
In addition to those I've already mentioned, I've come up with another four uses of ConvText in the first couple of weeks of using it. I won't bore you with them all, but simply say that for someone like me, for whom words are his stock-in-trade, this is a really productive application - many thanks to Paul Sprangers for producing it.

Links
ConvText can be found in the downloads section on the website of Archive's sister publication Living with Technology.
Paul's article on ConvText will appear in the next issue of Archive magazine, due out around 19th September.

Discussion


I do this sort of thing quite often, too. But I use one of Awk, Perl, or sed, depending on how I'm feeling.

nunfetishist on 8/9/03 1:35PM

I suspect this is more friendly than those tools. It took me a week to pick up Perl.

Chris, drobe.co.uk

diomus on 8/9/03 3:33PM

I think I prefer Perl for this kind of thing - the example where pairs of [10]s are changed to [10], what happens if you have 4 [10]s? Do you get 1 [10], 2[10]s (it depends on whether it starts the search from the start, or from the last change)?

With Perl, you'd do ~s:\n{2,}:\n:; which, to me, is a lot clearer as to what happens. But then I have read "Mastering Regular Expressions" quite thoroughly :)

Obviously I could try it out, but I'm in the office, and my RiscPC is sitting at home...

tribbles2 on 8/9/03 5:12PM

Actually, thinking about it, I misread the original article, and it's probably the same as ~s:\n\n:\n:g; (I thought that it was meant to remove multiple lines, whereas the 'macro' beforehand did it).

tribbles2 on 8/9/03 5:15PM

diomus: More friendly? I have to read the body of this article *very* carefully to work out wtf the scripts do!

nunfetishist on 8/9/03 5:34PM

I don't read Archive-on-Line. Maybe if I did, I might have been able to point out to Paul that an app to do the sort of thing he was asking about exists already. Ho hum. :-/

The scripts are also human readable :-p

VinceH on 8/9/03 7:48PM

VinceH: Which is?

dansguardian on 8/9/03 11:23PM

I can't tell you that. It's top secret, and I really don't have time at the mo to be running around exotic locations, being chased by and getting into battles with any of MI6's double-O agents. Especially that 007, who just damned well refuses to die. I'm far too busy working on WebChange for that sort of malarkey ATM.

VinceH on 9/9/03 6:23PM
