WgetRexx

This manual describes WgetRexx, version 2.1 (13-May-1998).

WgetRexx is an ARexx script to invoke Wget from your favourite Amiga WWW browser. It is designed so that you can easily maintain a copy of the interesting parts of the WWW on your local hard drive.

This allows you to browse them without having to start your network, without having to worry whether documents are still in the cache of the browser, without having to figure out the cryptic name of a certain document in your cache directory and without having to deal with slow, overloaded and/or unstable connections.

WgetRexx is © Thomas Aglassinger 1998 and is copyrighted freeware. Therefore you are allowed to use it without paying and can also modify it to fit your own needs. You can redistribute it as long as the whole archive and its contents remain unchanged.

The current version of this document and the program should be available from http://www.giga.or.at/~agi/wgetrexx/.

Overview

For those who refuse to read this whole manual, here are the interesting parts: the requirements tell you where to get the needed stuff, an example configuration shows what your macro menu might look like (and should scare away those who don't know how to use Wget), a chapter about troubleshooting describes some common problems, and there are also some notes on updates and support.

Legal Issues

Disclaimer

Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.

Copyright

Wget.rexx, this manual and the "Spinnankini" logo are Copyright © 1998 by Thomas Aglassinger

No program, document, data file or source code from this software package, either in whole or in part, may be included in or used by other software packages unless authorized by written permission from the author.

No Warranty

There is no warranty for this software package. Although the author has tried to prevent errors, he can't guarantee that the software package described in this document is 100% reliable. You are therefore using this material at your own risk. The author cannot be made responsible for any damage which is caused by using this software package.

Distribution

This software package is freely distributable. It may be put on any media which is used for the distribution of free software, like Public Domain disk collections, CDROMs, FTP servers or bulletin board systems.

The author cannot be made responsible if this software package has become unusable due to modifications of the archive contents or of the archive file itself.

There is no limit on the cost of distribution, e.g. for the media, like floppy disks, streamer tapes or compact discs, or for the process of duplication.

Other Material

Wget is written by Hrvoje Niksic and is Copyright © the Free Software Foundation.

Wget.rexx uses ReqTools, Copyright © Nico François and Magnus Holmgren, RexxReqTools, Copyright © Rafael D'Halleweyn and RexxDosSupport, Copyright © Hartmut Goebel.

For more details about these packages and where to obtain them from see below.

Copyright of Web Sites

Note that unless explicitly stated to the contrary, the copyright of all files on the WWW is held by the owners of the appropriate site.

If you intend to redistribute any files downloaded from the WWW please ensure that you have the permission of the copyright holder to do so.

Requirements

You will need a WWW browser with a reasonable ARexx port. AWeb and IBrowse meet this requirement. Voyager does not allow you to query the currently viewed URI by means of ARexx.

Of course you will need Wget. As it is distributed under the terms of the GNU General Public License, you can obtain its source code from ftp://prep.ai.mit.edu/pub. A compiled binary for AmigaOS is part of Geek Gadgets and is available from ftp://ftp.ninemoons.com/pub/geekgadgets/. You can also download the binary from aminet:dev/gg/wget-bin.lha.

You need reqtools.library, rexxreqtools.library and rexxdossupport.library to be installed in your libs: drawer. They are not included with the WgetRexx archive, but you can obtain them from aminet:util/libs/ReqToolsUsr.lha and aminet:util/rexx/rexxdossupport.lha.

As these libraries are very common, they may already be installed on your system. Simply check your sys:libs drawer to verify this.
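
If you are unsure, you can also query them from a CLI with the AmigaDOS version command; it reports the installed version, or complains if a library cannot be found:

version reqtools.library
version rexxreqtools.library
version rexxdossupport.library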

Installation

Installing Wget

The most difficult part of the installation by far is to make Wget work. As Wget is a tool coming from the Unix world, it is completely unusable, has 873265 command line options and a very short manual with only a few examples.

If you do not know how to install and use it, this script won't make it easier. It just makes it more efficient to invoke while browsing.

Recursive web sucking is a very powerful feature. However, used in an improper way, it can cause harm to network traffic, and therefore it should only be used by experienced people. Successfully installing Wget on an Amiga is one requirement to call yourself experienced.

Creating Your Local Web

Before you can use this script, you have to decide where your local web should reside and create a directory for it. After that, you have to create an assign named Web: pointing to this directory.

For example, if your local web should be in Work:Web, create a drawer called Web from the Workbench or enter

makedir Work:Web

into CLI. Then add the following line to your s:user-startup:

assign Web: Work:Web

Installing Wget.rexx

Now you can copy the script wherever you like. As it does not depend on a certain browser, it can make sense to store it in rexx:. This spares you from specifying a full path when invoking rx, as it will automatically look there for the script.
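
For example, assuming the extracted wget.rexx lies in the current directory, a plain AmigaDOS copy will do:

copy wget.rexx rexx: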

All you have to do now is add the script to the ARexx macro menu of your browser by changing its preferences. See the example configuration for a few suggestions.

First Steps

Although the script will later on be started from your browser, you should first check that everything works. For that, open a CLI and type:

cd ram:
wget http://www.cucug.org/amiga.html

This should invoke Wget and download the title page of the popular Amiga Web Directory and store it in ram:amiga.html.

If it does not, Wget does not work for you, and you can forget about the rest.

Assuming you've been lucky, now try this:

wget -x http://www.cucug.org/amiga.html

This downloads the document again, but now also creates a directory for the server structure. Now you can find the copy in ram:www.cucug.org/amiga.html.

And another step further:

wget -x --directory-prefix=/Web/ http://www.cucug.org/amiga.html

This time, the document ends up in Web:www.cucug.org/amiga.html, even if the current directory still is ram:.

Now finally let's see what Wget.rexx can do for you. Start your browser, go to http://www.cucug.org/ and wait until it displays the page. Switch to the CLI and enter:

rx wget.rexx

This downloads the document you are currently viewing in your browser to Web: and automatically displays the local copy after finishing. If everything worked fine, your browser should show file://localhost/Web:www.cucug.org/amiga.html.

And all of this is done by communicating with the browser via ARexx and then invoking Wget the same way as you saw before.

To be more specific, the script queries the URI your browser is currently displaying, tells Wget to download the stuff from there to your hard drive and puts it into a reasonable place. Finally the browser is made to load the local copy.
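
To give an idea of what this looks like in ARexx, here is a heavily simplified sketch. The commands GETURL and OPENURL are made-up placeholders for the browser-specific ARexx commands (IBrowse and AWeb each use their own names), and the many checks of the real script are omitted:

/* Heavily simplified sketch only -- GETURL and OPENURL are        */
/* placeholders, the real Wget.rexx uses the browser-specific      */
/* ARexx commands and performs many more checks.                   */
OPTIONS RESULTS
ADDRESS 'IBROWSE'            /* assumed port, compare Port=IBROWSE below */
'GETURL'                     /* placeholder: ask for the current URI     */
uri = RESULT

/* invoke Wget just like in the CLI examples above */
ADDRESS COMMAND 'wget -x --directory-prefix=/Web/' uri

/* finally let the browser show the local copy */
local = 'file://localhost/Web:' || SUBSTR(uri, 8)   /* strip "http://" */
'OPENURL' local              /* placeholder: load a URI */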

Also, many possible errors are handled by the script before Wget is launched, and they usually result in a requester with a descriptive message. However, once Wget is started, you are at the mercy of its non-existent or esoteric error messages. The script can only warn about errors in general if Wget returned with a non-zero exit code (which it does not for all possible errors). In such a case, analyse the output in the console for problems.

In these first few examples, only one file was downloaded. Fortunately, the underlying tool Wget can do more. With proper command line options, it can download several documents, scan them for links and images and continue downloading them recursively.

The only catch about this is: you have to know how to work with Wget. There are some example macros here, and several interesting command line options are mentioned here, but this is not enough to really know what is going on. Nearly every site needs its own command line options to get the useful part - and to prevent the useless stuff from being downloaded. Very often you will have to interrupt the process and resume with better options.

There are some hints about this here, but: read the manual for Wget. There is no way around this.

Example Configuration

Below you will find some example macros you can assign to the ARexx menu of your browser. As there are small differences between the various browsers, there is a configuration for each of them.

It is recommended to use the clipboard to copy the macro texts into the preferences requester.

Also read about the usage of the example configuration to find out how it works. For further information about passing arguments to the script, see command line options.

IBrowse

To make the script accessible from the Rexx menu, go to IBrowse's Preferences/General/Rexx and add the following entries:

Name Macro
Copy single resource wget.rexx
Copy page with images wget.rexx --recursive --level=1 --accept=png,jpg,jpeg,gif --convert-links
Copy site wget.rexx --recursive --no-parent --reject=wav,au,aiff,aif --convert-links
Copy site... wget.rexx Ask --recursive --no-parent --reject=wav,au,aiff,aif --convert-links
Copy generic... wget.rexx Ask

This assumes that the script has been installed to rexx: or the browser directory. If this is not the case, you have to specify the full path name.

To change the default console the script will send its output to, modify Preferences/General/General/Output Window.

AWeb

To make the script accessible from the ARexx menu, go to AWeb's Settings/GUI Settings/ARexx and add the following entries:

Name Macro
Copy single resource wget.rexx >CON://640/200/Wget/CLOSE/WAIT
Copy page with images wget.rexx >CON://640/200/Wget/CLOSE/WAIT --recursive --level=1 --accept=png,jpg,jpeg,gif --convert-links
Copy site wget.rexx >CON://640/200/Wget/CLOSE/WAIT --recursive --no-parent --reject=wav,au,aiff,aif --convert-links
Copy site... wget.rexx >CON://640/200/Wget/CLOSE/WAIT Ask --recursive --no-parent --reject=wav,au,aiff,aif --convert-links
Copy generic... wget.rexx >CON://640/200/Wget/CLOSE/WAIT Ask

This assumes that the script has been installed to rexx: or the browser directory. If this is not the case, you have to specify the full path name.

Note that you have to redirect the output of every macro to a console. Otherwise you would not be able to see what Wget is currently doing. Therefore this looks a bit more confusing than the example for IBrowse.

See also the problems with AWeb for some further notes.

Usage of the Example Configuration

Here is a short description of the macros used for the example configuration.

Copy single resource

This is comparable to the download function of your browser. The difference is that the file will automatically be placed in your local web, in a directory depending on the location it came from.

For example, you could point your browser to http://www.cucug.org/aminew.html and would get a single file named aminew.html stored in the directory Web:www.cucug.org/. If such a directory does not yet exist, it will be created automatically. This is the same as if you had typed

wget -x --directory-prefix=/Web/ http://www.cucug.org/aminew.html

in the CLI. The difference is that you did not have to type a single letter.

Copy page with images

This now retrieves a page with all its images. Of course it only makes sense to call it when actually viewing an HTML page. With other types of data like images it will act the same as Copy single resource.

After this operation, inline images will still work in the local copy of the downloaded page.
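
Ignoring the automatic URI query, this macro roughly boils down to a Wget call like the following, where the -x --directory-prefix part is added internally by the script and the URI is just an example:

wget -x --directory-prefix=/Web/ --recursive --level=1 --accept=png,jpg,jpeg,gif --convert-links http://www.cucug.org/amiga.html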

Copy site

This is a very powerful macro that will copy a whole site to your local web. It starts at the document currently being viewed and will download all pages, images and some other data within the same or a deeper directory level.

The option --reject has been specified to refuse often unwanted music and sound data. You might want to specify further extensions here, so that movies, archives and printable documents are also skipped, for example: --reject=mpg,mpeg,wav,au,aiff,aif,tgz,Z,gz,zip,lha,lzx,ps,pdf.

Copy site...

Same as before, but because of the Ask it will pop up a requester where you can change the options for Wget before it is actually invoked. For example, you can modify the --reject so that it will not refuse sound data, because once in a while you want to download from a music site.

You can also add options like --no-clobber to continue an aborted Copy site from before, or --exclude-directories because you know that there is only crap in /poetry/.

Copy generic...

This will simply pop up a requester where you can enter options for Wget. Except for the internal -x --directory-prefix, nothing is specified yet. It is useful for the occasional case where none of the above macros is flexible enough.

Command Line Options

As you just learned, it is possible to pass additional options to Wget.rexx. There are two different kinds of them: options that are interpreted by Wget.rexx itself, and options that are simply passed on to Wget.

The complete ReadArgs() template for Wget.rexx is:

To/K,Ask/S,Further/S,Port/K,Continue/S,Clip/K,Verbose/S,Options/F

In most cases you will not need to specify any options except those for Wget.
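
As an illustration only (the values are made up), a call like the following would be split so that To and Ask are consumed by Wget.rexx itself, while the trailing part is caught by Options/F and passed on to Wget:

wget.rexx To=ram:t Ask --recursive --no-parent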

Get Some Details

If you enable Verbose, Wget.rexx will tell you some details about what is going on.
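
For example, you could temporarily add Verbose to one of your macros, or start the script from the CLI like this (just an illustration):

rx wget.rexx Verbose --recursive --no-parent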

Note that this does not influence the output of Wget itself.

Ask for Options

One might not always be satisfied with a few standard macros and would like to pass different options to Wget. On the other hand, it does not make sense to clutter the ARexx menu of your browser with loads of only slightly different macros. Invoking the script like

wget.rexx Ask

will pop up a requester where you can enter options to be passed to Wget. If you already passed other Wget options on the command line, the requester will allow you to edit them before starting Wget:

wget.rexx Ask --recursive --no-parent

will bring up a requester where the text "--recursive --no-parent" is already available in the input field and can be extended or reduced by the user. Sometimes it may be more convenient to use

wget.rexx Ask Further --recursive --no-parent

This also brings up the requester, but this time the input field is empty. The options already specified on the command line will be passed in any case, and you can only enter additional options. If, for example, you now enter --reject=mpeg, this would be the same as if you had called

wget.rexx --recursive --no-parent --reject=mpeg

The main advantage is that the input field is not already cluttered with loads of confusing options. The drawback is that you cannot edit or remove options already passed from the command line.

Specify Options for Wget

The last part of the command line can contain additional options to be passed to Wget.

Important: You must not pass any Wget.rexx specific options after the first option for Wget. For example,

wget.rexx Ask --quiet

tells Wget.rexx to pop up the options requester and Wget not to display download information. But on the other hand,

wget.rexx --quiet Ask

will pass both --quiet and Ask to Wget, which of course does not really know what to do with the Ask.

Specify a Different Download Directory

If you do not want the downloaded data to end up in Web:, you can use To to specify a different download directory. For example:

wget.rexx To=ram:t

The value denotes a path in AmigaOS format. Internally it will be converted to ixemul style before it is passed to Wget. Fortunately you do not have to know anything about that.
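
For illustration (the directory is an arbitrary example), a call such as

wget.rexx To=Work:Downloads

would make the script hand Wget something along the lines of --directory-prefix=/Work/Downloads/, mirroring how Web: shows up as /Web/ in the earlier examples.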

With a little CLI magic, you can let a file requester ask you for the download directory:

wget.rexx To=`RequestFile DrawersOnly SaveMode NoIcons Title="Select Download Directory"`

Note that the quote after To= is a backquote, not a single quote. You can find it below Esc on your keyboard.

Select the Browser ARexx Port

Normally you do not want to do this because Wget.rexx figures out the ARexx port to use by itself.

First it assumes that the host it was started from was a browser. In such a case, it will continue to talk to this host no matter how many other browsers are running at the same time.

If this turns out to be wrong (e.g. because it was started in the CLI), it tries to find one of the supported browsers at its default port. If any such browser is running, it will use this.

If no browser is running, the script does not work at all, for obvious reasons.

The only possible problem can be that several supported browsers are running at the same time, and you do not start the script directly from one of them. In such a rare case the browser checked first will be used, which is not necessarily the one you prefer. Therefore you can use

wget.rexx Port=IBROWSE

in the CLI to make the script talk to IBrowse, even if AWeb is also running.

Continue an Interrupted Download

Especially when copying whole sites, it often happens that Wget ends up downloading stuff that you do not want. The usual procedure then is to interrupt Wget, specify some additional options to reject some stuff and restart again. For example, you found some site and started to download it:

wget --recursive --no-parent http://www.interesting.site/stuff/

But soon you notice that there are loads of redundant PDF documents which only reproduce the information you are getting in HTML anyway. Therefore you interrupt and start again with more options:

wget --recursive --no-parent --no-clobber --reject=pdf http://www.interesting.site/stuff/

To your further annoyance it turns out that the directory /stuff/crap entirely holds things you are not interested in. So you restart again:

wget --recursive --no-parent --no-clobber --reject=pdf --exclude-directories=/stuff/crap/ http://www.interesting.site/stuff/

And so on. As you can see, it can take quite some effort before you find proper options for a certain site.

So how can the above procedure be performed with Wget.rexx? Obviously, there is no history function like in the CLI, where you can switch back to the previous call and add options to it.

However, you can make Wget.rexx store the options entered into the requester (when Ask was specified) in an ARexx clip. This clip is preserved and can be read again and used as the default value in the requester. To achieve that, enable Continue.

Now that sounds confusing, but let's see how it works in practice, using an extended version of the Copy site... macro from before:

Name Macro
Copy site... wget.rexx Clip=wget-site Ask Further --recursive --no-parent
Continue copy site... wget.rexx Clip=wget-site Ask Further Continue --recursive --no-parent --no-clobber

The macro Copy site... will always pop up an empty requester, where you can specify additional options like --reject. The options you enter there will be stored in an ARexx clip called "wget-site". It does not really matter what you call this clip; the only important thing is that you use the same name for the macro that reads the clip.

And this is exactly what Continue copy site... does: because of Continue it does not clear the clip, but instead reads it and uses the value as the default text in the string requester. The additional parameter --no-clobber just tells Wget not to download again the files you already got.
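
The clip mechanism itself is plain ARexx: the interpreter's built-in SETCLIP() and GETCLIP() functions store and retrieve a named value that stays around between script runs. A tiny sketch of the idea, using the clip name and option from this example (the real script may handle the details differently):

CALL SETCLIP('wget-site', '--reject=pdf')   /* what Copy site... stores    */
previous = GETCLIP('wget-site')             /* what Continue reads back    */
SAY previous                                /* shows: --reject=pdf         */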

So how does the above example session look with Wget.rexx?

First, you browse to http://www.interesting.site/stuff/ and select Copy site. Soon you have to interrupt it, select Copy site... and enter --reject=pdf into the upcoming requester. So far, there is nothing you could not have already done with the old macros.

But when it turns out that the --reject was not enough, you only have to select Continue copy site..., and the --reject=pdf is already available in the requester. You just have to extend the text to --reject=pdf --exclude-directories=/stuff/crap/ and can go on.

And if later on some MPEG animations show up, you should know what to do: simply select Continue copy site... again...

Specify a URI

You can also specify a URI for Wget when invoking the script. In this case, the document currently viewed in the browser will be ignored.

Specifying more than one URI or a protocol different from http:// will result in an error message. This is a difference from Wget, which allows multiple URIs and also supports ftp://.

URIs have to be specified in their full form, which means that, for example, you cannot use www.cucug.org. Instead, the complete http://www.cucug.org/ is required.

Among other things, this can be useful for creating a macro to update a local copy of a previously downloaded site. For example, with IBrowse you could use:

Name Macro
Update biscuits wget.rexx --timestamping --recursive --no-parent --exclude-directories=/~hmhb/fallnet http://www.btinternet.com/~hmhb/hmhb.htm

This basically acts as if http://www.btinternet.com/~hmhb/hmhb.htm had been viewed in the browser and you had selected Copy site. Because of --timestamping, only new data are actually copied. Because of the --exclude-directories=/~hmhb/fallnet, some stuff that was considered unwanted at a former download is skipped.

Putting this into the macro menu spares you from remembering which options you used two months ago.

Outsourcing Macros

If you have many macros like Update biscuits from before, it does not make sense any more to put them into the macro menu of the browser. Such macros are not called very often and mostly serve the purpose of "remembering" options you used to copy a site.

Fortunately there are many different places where you can put such macros.

A straightforward approach would be to put them into shell scripts. For example, s:update-biscuits could hold the following line:

rx wget.rexx --timestamping --recursive --no-parent --exclude-directories=/~hmhb/fallnet http://www.btinternet.com/~hmhb/hmhb.htm

If you set the script protection bit by means of

protect s:update-biscuits ADD s

you can simply open a CLI and type

update-biscuits

to invoke the macro.

But this merely outlines the idea. Who wants to deal with the CLI if it can be avoided? If you have some experience, you can easily store such a call in a button for DirectoryOpus or ToolManager (available from aminet:util/wb/ToolManager#?.lha).

These even allow you to create dock hierarchies and sub-menus for your download macros. Refer to the manuals of these applications for more details.

Troubleshooting

Here are some common problems and some hints on how to solve them or where to get further information.

Problems with Wget.rexx

My browser is not supported!

Supporting new browsers should be easy, assuming their ARexx ports are powerful enough. In detail, the following data are needed:

You should be able to find this information in the manual of your browser. If you submit it to me, your browser will probably be supported within the next update of WgetRexx.

When interrupting Wget.rexx by pressing Control-C, my browser quits!

Sad but true. Contact your technical support and try to convince them that this sucks.

The macro Copy page with images does not seem to work with frames!

Of course not, because the technical implementation of the whole frame concept sucks.

As a workaround, you can view every frame in its own window and apply the macro to it. At the end, you will have all the frames the page consists of and can finally also copy the frame page itself.

Now what is case-sensitive and what is not?

WgetRexx uses several different mechanisms, and unfortunately they differ concerning case-sensitivity:

For URI values, it becomes even more confusing, as it depends on the server. Nevertheless, filenames are case insensitive as soon as they are on your hard drive.

Problems with Wget

I can't make Wget work at all, not even from the command line!

Bad luck. Not my problem.

Wget basically works, but I don't know how to ... with it.

Refer to the Wget manual for that. There is a command line option for nearly everything, so usually you only have to search long enough.

Wget works nicely, but it also downloads loads of bullshit I don't want to have on my hard drive!

There are several command line options to prevent Wget from downloading certain data. Most notably, there are --reject to skip certain types of files and --exclude-directories to skip whole directories.

Refer to the Wget manual for how to use them, and pass them via the requester of the Copy site... macro.

Wget downloads a site, but the links on my hard disk still all refer to the original in the WWW!

This is because the web author has decided to use global links (which always have the full http:// stuff) instead of relative links which only specify the filename.

A well known example for that is the CGX Support Page at http://www.vgr.com/.

See --convert-links to find out what to do about that.
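
As an illustration (not from the original manual), adding --convert-links to a recursive call like the one below makes Wget rewrite the links in the downloaded pages so that they point to the local copies where possible:

wget --recursive --no-parent --convert-links http://www.vgr.com/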

Wget refuses to download everything on a site!

First you should check whether the options you passed to Wget make it reject some stuff you actually want to have. For example, with --no-parent, "global" images like logos accessed by several sites on the same server are ignored. However, not specifying --no-parent when copying sites usually lets you end up in the deep mud of rejecting directories, domains and file patterns by hand, so it is rarely worth the trouble.

If there is more missing than just a couple of images, this can have several other reasons:

In most cases you can assume that the web author was a complete idiot. As a matter of fact, most of them are. Unfortunately, there is not much you can do about that.

Wget always asks me to "insert volume usr: in any drive"!

As Wget comes from the Unix-world, the AmigaOS binary uses the ixemul.library. It expects a Unix-like directory tree, and looks for an optional configuration file in usr:etc/.wgetrc.

It won't hurt if you assign usr: to wherever you like (e.g. t:) and do not provide this configuration file at all, as internal defaults will then be used. If you have an AssignWedge-alike installed, you can also deny this assign without facing any negative consequences for Wget.
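
For example (t: is just one possible target, as mentioned above), a single line like this in your s:user-startup satisfies ixemul without providing the configuration file:

assign usr: t: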

However, you should remember that other Unix ports you might install in the future could require usr: to point to some reasonable location. For example, when installing Geek Gadgets, it adds its own assign for usr: to your s:user-startup, and you should not overwrite this later on.

Problems with AWeb

AWeb does not allow me to modify the ARexx menu!

The full version does.

AWeb does not allow me to enter such long macros as Copy page with images!

It seems that, at least for AWeb 3.1, the maximum length of the input fields has an unreasonably small value. (The CLI accepts commands of up to 512 characters.)

This problem has been reported to the technical support and might be fixed in a future version. Until then, you can use the following workaround:

Do not enter the macro text in the settings requester, but instead write it to a script file and store it, for example, in s:wget-page-with-images. This script would consist of only a single line saying:

wget.rexx --recursive --level=1 --accept=png,jpg,jpeg,gif --convert-links

In the macro menu, you now only add:

Name Macro
Copy page with images s:wget-page-with-images >CON://640/200/Wget/CLOSE/WAIT

Again, this is only a workaround and not the way things should be.

Updates and Support

New versions of WgetRexx will be uploaded to aminet:comm/www/WgetRexx.lha. Optionally you can also obtain them from the WgetRexx Support Page at http://www.giga.or.at/~agi/wgetrexx/.

If you found a bug or something does not work as described, you can reach the author via e-mail at this address: Thomas Aglassinger <agi@sbox.tu-graz.ac.at>.

But before you contact me, check whether your problem is already covered in the chapter about troubleshooting.

When reporting problems, please include the name of the WWW browser and the version number of Wget.rexx you are using. You can find this out by taking a look at the source code or by typing version wget.rexx into the CLI.

And please don't bother me with problems like how to use a certain command line option of Wget to achieve a certain result on a certain site. This program has its own manual.

History

Version 2.1, 13-May-1998

Version 2.0, 8-May-1998

Version 1.1, 30-Apr-1998

Version 1.0, 23-Mar-1998


Thomas Aglassinger <agi@sbox.tu-graz.ac.at>