This manual describes WgetRexx, version 2.0 (8-May-1998).
WgetRexx is an ARexx script to invoke Wget from your favourite Amiga WWW browser. It is designed so that you can easily maintain a copy of the interesting parts of the WWW on your local hard drive.
This allows you to browse them without having to start your network, without having to worry whether documents are still in the browser's cache, without having to figure out the cryptic name of a certain document in your cache directory, and without having to deal with slow, overloaded or unstable connections.
WgetRexx is © Thomas Aglassinger 1998 and is copyrighted freeware. Therefore you are allowed to use it without paying and can also modify it to fit your own needs. You can redistribute it as long as the whole archive and its contents remain unchanged.
The current version of this document and the program should be available from http://www.giga.or.at/~agi/wgetrexx/.
For those who refuse to read this whole manual, here are the interesting parts: the requirements tell you where to get the needed stuff, an example configuration shows what your macro menu might look like (and should scare away those who don't know how to use Wget), a chapter about troubleshooting describes some common problems, and there are also some notes on updates and support.
Permission is granted to make and distribute verbatim copies of this manual provided the copyright notice and this permission notice are preserved on all copies.
WgetRexx is Copyright © 1998 by Thomas Aglassinger
No program, document, data file or source code from this software package, either in whole or in part, may be included or used in other software packages unless authorized by written permission from the author.
There is no warranty for this software package. Although the author has tried to prevent errors, he can't guarantee that the software package described in this document is 100% reliable. You are therefore using this material at your own risk. The author cannot be made responsible for any damage which is caused by using this software package.
This software package is freely distributable. It may be put on any media which is used for the distribution of free software, like Public Domain disk collections, CDROMs, FTP servers or bulletin board systems.
The author cannot be made responsible if this software package has become unusable due to modifications of the archive contents or of the archive file itself.
There is no limit on the costs of the distribution, e.g. for media such as floppy disks, streamer tapes or compact discs, or for the process of duplication.
Wget was written by Hrvoje Niksic and is Copyright © Free Software Foundation.
Wget.rexx uses ReqTools, which is Copyright © Nico François and Magnus Holmgren, RexxReqTools, which is Copyright © Rafael D'Halleweyn, and RexxDosSupport, which is Copyright © Hartmut Goebel.
For more details about these packages and where to obtain them, see below.
You will need a WWW browser with a reasonable ARexx port. AWeb and IBrowse meet this criterion. Voyager does not allow querying the currently viewed URI by means of ARexx.
Of course you will need Wget. As it is distributed under the terms of the GNU General Public License, you can obtain its source code from ftp://prep.ai.mit.edu/pub. A compiled binary for AmigaOS is part of Geek Gadgets and available from ftp://ftp.ninemoons.com/pub/geekgadgets/. You can also download the binary from aminet:dev/gg/wget-bin.lha.
You need reqtools.library, rexxreqtools.library and rexxdossupport.library to be installed in your libs: drawer. They are not included with the WgetRexx archive, but you can obtain them from aminet:util/libs/ReqToolsUsr.lha and aminet:util/rexx/rexxdossupport.lha.
As these libraries are very common, they may already be installed on your system. Simply check your sys:libs drawer to verify this.
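For example, from a Shell the Version command should report a version number for each library that is installed:
version libs:rexxreqtools.library
version libs:rexxdossupport.library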
The most difficult part of the installation by far is to make Wget work. As Wget is a tool coming from the Unix world, it is completely unusable, has 873265 command line options and a very short manual with only a few examples.
If you do not know how to install and use it, this script won't make it easier. It just makes it more efficient to invoke while browsing.
Recursive web sucking is a very powerful feature. However, if used in an improper way, it can cause harm to network traffic, and therefore it should only be used by experienced people. Successfully installing Wget on an Amiga is one requirement for calling yourself experienced.
Before you can use this script, you will have to decide where your local web should reside and create a directory for it. After that, you will have to create an assign named Web: pointing to this directory.
For example, if your local web should be in Work:Web, create a drawer called Web from the Workbench or enter
makedir Work:Web
into CLI. Then add the following line to your s:user-startup:
assign Web: Work:Web
Now you can copy the script wherever you like. As it does not depend on a certain browser, it can make sense to store it in rexx:. This spares you from specifying a full path when invoking rx, as rx automatically looks there for scripts.
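For example, assuming your current directory is the extracted WgetRexx archive, you could copy it from a Shell with:
copy wget.rexx rexx: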
All you have to do now is to assign it to the ARexx macro menu by changing the preferences of your browser. See the example configuration for a few suggestions.
The script automatically adds the options -x --directory-prefix=/Web/ when invoking Wget. The -x tells Wget to always expand file names, even if --recursive is not specified. The --directory-prefix=/Web/ is used to store the data in your local web. The strange-looking /Web/ is the ixemul-style notation for Web:. You should not worry too much about that, as you do not have to pass this option manually.
You can pass any other options allowed by Wget to the script. They are not validated and are simply passed on to Wget later.
The location currently shown in your browser will be used as argument for Wget. Only http:// is supported, other protocols will result in an error message.
For experimenting, it might be handy to first start the script from the CLI. The current directory does not matter, as data are always stored in Web: and communication with the browser takes place via ARexx.
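For example, a first test from the Shell might look like this (the Verbose switch, described later, only makes the script report what it is doing):
rx wget.rexx Verbose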
Normally you will probably assign wget.rexx with some options to the ARexx macro menu of your browser. In such a case make sure that your browser opens a console window so you can see the output of Wget while it is working.
Note that especially with --recursive it is important that you watch what Wget is currently doing so you can abort it in case it gets lost in unexpected directories.
Many errors are handled by the script before Wget is launched, and they usually result in a requester with a descriptive message.
However, once Wget is started, you are at the mercy of its non-existent or esoteric error messages. The script can only warn about errors in general if Wget returns with a non-zero exit code (which it does not do for all possible errors). In such a case, analyse the output in the console for problems.
If Wget finished successfully, the script tells the browser to load the data from your local web. If you iconified the browser, it will automatically be uniconified.
Below you will find some example macros you can assign to the ARexx menu of your browser. As there are small differences between the various browsers, there is a configuration for each of them.
It is recommended to use the clipboard to copy the macro texts into the preferences requester.
Also read about the usage of the example configuration to find out how it works. For further information about passing arguments to the script, see the command line options.
To make the script accessible from the Rexx menu, go to IBrowse's Preferences/General/Rexx and add the following entries:
Name | Macro |
---|---|
Copy single resource | wget.rexx |
Copy page with images | wget.rexx --recursive --level=1 --accept=png,jpg,jpeg,gif --convert-links |
Copy site | wget.rexx --recursive --no-parent --reject=wav,au,aiff,aif |
Copy site... | wget.rexx Ask --recursive --no-parent --reject=wav,au,aiff,aif |
Copy generic... | wget.rexx Ask |
This assumes that the script has been installed to rexx: or the browser directory. If this is not the case, you have to specify the full path name.
To change the default console the script will send its output to, modify Preferences/General/General/Output Window.
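A console specification similar to the one used in the AWeb examples below should work here as well, for example:
CON://640/200/Wget/CLOSE/WAIT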
To make the script accessible from the ARexx menu, go to AWeb's Settings/Program Settings/ARexx and add the following entries:
Name | Macro |
---|---|
Copy single resource | wget.rexx >CON://640/200/Wget/CLOSE/WAIT |
Copy page with images | wget.rexx >CON://640/200/Wget/CLOSE/WAIT --recursive --level=1 --accept=png,jpg,jpeg,gif --convert-links |
Copy site | wget.rexx >CON://640/200/Wget/CLOSE/WAIT --recursive --no-parent --reject=wav,au,aiff,aif |
Copy site... | wget.rexx >CON://640/200/Wget/CLOSE/WAIT Ask --recursive --no-parent --reject=wav,au,aiff,aif |
Copy generic... | wget.rexx >CON://640/200/Wget/CLOSE/WAIT Ask |
This assumes that the script has been installed to rexx: or the browser directory. If this is not the case, you have to specify the full path name.
Note that you have to redirect the output of every macro to a console. Otherwise you would not be able to see what Wget is currently doing. Therefore this looks a bit more confusing than the example for IBrowse.
See also the problems with AWeb for some further notes.
Here is a short description of the macros used for the example configuration.
This is comparable to the download function of your browser. The difference is that it will automatically be placed in your local web in a directory depending on the location where it came from.
For example, you could point your browser to http://www.cucug.org/aminew.html and would get a single file named aminew.html stored in the directory Web:www.cucug.org/. If such a directory does not yet exist, it will be created automatically. This is the same as if you had typed
wget -x --directory-prefix=/Web/ http://www.cucug.org/aminew.html
in the CLI. The difference is that you did not have to type a single letter.
This retrieves a page with all its images. Of course it only makes sense to call it when actually viewing an HTML page. With other types of data, like images, it will act the same as Copy single resource.
After this operation, inline images will still work in the local copy of the downloaded page.
This is a very powerful macro that will copy a whole site to your local web. It starts at the document you are currently viewing and will download all pages, images and some other data within the same or a deeper directory level.
The option --reject has been specified to refuse often unwanted music and sound data. You might want to specify further extensions here so that movies, archives and printable documents are also skipped, for example: --reject=mpg,mpeg,wav,au,aiff,aif,tgz,Z,gz,zip,lha,lzx,ps,pdf
Same as before, but because of the Ask it will pop up a requester where you can change the options for Wget before actually invoking it. For example, you can modify the --reject so that it does not refuse sound data, because once in a while you want to download from a music site.
You can also add further options like --no-clobber to continue an aborted Copy site from before, or --exclude-directories because you know that there is only crap in /poetry/.
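As an illustration (the values are only examples), the edited requester contents for such a continued Copy site might then read:
--recursive --no-parent --reject=wav,au,aiff,aif --no-clobber --exclude-directories=/poetry/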
This will simply pop up a requester where you can enter options for Wget. Except for the internal -x --directory-prefix, nothing is specified yet. It is useful for those occasions when none of the above macros is flexible enough.
As you just learned, it is possible to pass additional options to wget.rexx. There are two different kinds: options interpreted by wget.rexx itself, and options that are simply passed on to Wget.
The complete ReadArgs() template for wget.rexx is:
To/K,Ask/S,Further/S,Port/K,Continue/S,Clip/K,Verbose/S,Options/F
In most cases you will not need to specify any option except those for Wget.
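As an illustration only (all values here are examples), a call combining several of these options might look like:
wget.rexx To=ram:t Port=IBROWSE Verbose --recursive --no-parent
Here To, Port and Verbose are interpreted by the script itself, while everything from --recursive onwards is passed on to Wget.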
If you enable Verbose, wget.rexx will report some details about what it is doing.
Note that this does not influence the output of Wget itself.
One might not always be satisfied with a few standard macros and would like to pass different options to Wget. On the other hand, it does not make sense to clutter the ARexx menu of your browser with loads of only slightly different macros. Invoking the script like
wget.rexx Ask
will pop up a requester where you can enter options to be passed to Wget. If you already passed other Wget options via the command line, the requester will allow you to edit them before starting Wget:
wget.rexx Ask --recursive --no-parent
will bring up a requester where the text "--recursive --no-parent" is already available in the input field and can be extended or reduced by the user. Sometimes it may be more convenient to use
wget.rexx Ask Further --recursive --no-parent
This also brings up the requester, but this time the input field is empty. The options already specified on the command line will be passed in any case, and you can only enter additional options. If, for example, you now enter --reject=mpeg, this would be the same as if you had called
wget.rexx --recursive --no-parent --reject=mpeg
The main advantage is that the input field is not already cluttered with loads of confusing options. The drawback is that you cannot edit or remove options already passed from the command line.
The last part of the command line can contain additional options to be passed to Wget.
Important: You must not pass any wget.rexx specific options after the first option for Wget. For example,
wget.rexx Ask --quiet
tells wget.rexx to pop-up the options requester and Wget to not display download information. But on the other hand,
wget.rexx --quiet Ask
will pass both --quiet and Ask on to Wget, which of course does not know what to do with the Ask.
If you do not want the downloaded data to end up in Web:, you can use To to specify a different download directory. For example:
wget.rexx To=ram:t
The value denotes a path in AmigaOS format. Internally it will be converted to the ixemul-style before it is passed to Wget. Fortunately you do not have to know anything about that.
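Presumably the conversion follows the same pattern as Web: and /Web/ above, so the call
wget.rexx To=ram:t
should end up invoking Wget with something like --directory-prefix=/ram/t/.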
Normally you do not want to do this because wget.rexx figures out the ARexx port to use by itself.
First, it assumes that the host it was started from is a browser. In such a case, it will continue to talk to this host, no matter how many other browsers are running at the same time.
If this turns out to be wrong (e.g. because it was started in the CLI), it tries to find one of the supported browsers at its default port. If any such browser is running, it will use this.
If no browser is running, the script does not work at all, for obvious reasons.
The only potential problem is that several supported browsers are running at the same time and you do not start the script directly from one of them. In such a rare case, the browser checked first will be used, which is not necessarily the one you prefer. Therefore you can, for example, use
wget.rexx Port=IBROWSE
in the CLI to explicitly address IBrowse, even if AWeb is also running.
Especially when copying whole sites, it often happens that Wget ends up downloading stuff that you do not want. The usual procedure then is to interrupt Wget and specify some additional options to reject that stuff. For example, you found some site and started to download it:
wget --recursive --no-parent http://www.interesting.site/stuff/
But soon you notice that there are loads of redundant PDF documents which merely reproduce the information you are already obtaining as HTML. Therefore you interrupt and start again with more options:
wget --recursive --no-parent --no-clobber --reject=pdf http://www.interesting.site/stuff/
To your further annoyance, it turns out that the directory /stuff/crap/ holds nothing but things you are not interested in. So you restart again:
wget --recursive --no-parent --no-clobber --reject=pdf --exclude-directories=/stuff/crap/ http://www.interesting.site/stuff/
And so on. As you can see, it can take quite some effort before you find proper options for a certain site.
So how can the above procedure be performed with wget.rexx? After all, there is no history function like in the CLI, where you can switch back to the previous call and add additional options to it.
However, you can make wget.rexx store the options entered in the requester (when Ask was specified) in an ARexx clip. This clip is preserved, and its contents will be read again and used as the default value in the requester if you set Continue.
Now that sounds confusing, but let's see how it works in practice, using an extended version of the Copy site... macro from before:
Name | Macro |
---|---|
Copy site... | wget.rexx Clip=wget-site Ask Further --recursive --no-parent |
Continue copy site... | wget.rexx Clip=wget-site Ask Further Continue --recursive --no-parent --no-clobber |
The macro Copy site... will always pop up an empty requester, where you can specify additional options like --reject. The options you enter there will be stored in an ARexx clip called "wget-site". It does not really matter what you call this clip; the only important thing is that the macro reading the clip uses the same name.
And this is exactly what Continue copy site... does: because of Continue, it does not clear the clip, but instead reads it and uses its value as the default text in the string requester. The additional parameter --no-clobber just tells Wget not to download again the files you already got.
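For the curious: ARexx clips are simply named values kept by the ARexx resident process, accessible via the built-in functions SETCLIP() and GETCLIP(). The following is only a minimal sketch of the mechanism (the clip name and value are just examples), not the actual wget.rexx code:
/* Minimal ARexx sketch: store options in a clip and read them back */
clipname = 'wget-site'
CALL SETCLIP(clipname, '--reject=pdf')   /* what the Ask requester stored */
previous = GETCLIP(clipname)             /* read back as default text next time */
SAY 'Previous options:' previous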
So how does the above example session look with wget.rexx?
First, you browse to http://www.interesting.site/stuff/ and select Copy site. Soon you have to interrupt it, select Copy site... and enter --reject=pdf into the upcoming requester. Until now, there is nothing you could not have already done with the old macros.
But when it turns out that the --reject was not enough, you only have to select Continue copy site..., and the --reject=pdf is already available in the requester. You just have to extend the text to --reject=pdf --exclude-directories=/stuff/crap/ and can go on.
And if later on some MPEG animations show up, you should know what to do by reselecting Continue copy site... again...
You can also specify a URI for Wget when invoking the script. In this case, the document currently viewed in the browser will be ignored.
Specifying more than one URI or a protocol other than http:// will result in an error message. This differs from Wget itself, which allows multiple URIs and also supports ftp://.
URIs have to be specified in their full form, which means that, for example, you cannot use just www.cucug.org. Instead, the complete http://www.cucug.org/ is required.
Among other things, this can be useful for creating a macro that updates a local copy of a site downloaded earlier. For example, with IBrowse you could use:
Name | Macro |
---|---|
Update biscuits | wget.rexx --timestamping --recursive --no-parent --exclude-directories=/~hmhb/fallnet http://www.btinternet.com/~hmhb/hmhb.htm |
This basically acts as if http://www.btinternet.com/~hmhb/hmhb.htm had been viewed in the browser and you had selected Copy site. Because of --timestamping, only new data are actually copied. Because of --exclude-directories=/~hmhb/fallnet, some material deemed unwanted during an earlier download is skipped.
Putting this into the macro menu spares you from remembering which options you used two months ago.
Here are some common problems and some hints on how to solve them or where to get further information.
Supporting a new browser should be easy, assuming its ARexx port is powerful enough. In detail, the following data are needed:
You should be able to find this information in the manual of your browser. If you submit it to me, your browser will probably be supported in the next update of WgetRexx.
Sad but true. Contact your technical support and try to convince them that this sucks.
WgetRexx uses several different mechanisms, and unfortunately they differ concerning case-sensitivity:
For URI values, it becomes even more confusing, as it depends on the server. Nevertheless, filenames are case-insensitive as soon as they are on your hard drive.
Bad luck. Not my problem.
Refer to the Wget manual for that. There is a command line option for nearly everything, so usually you only have to search long enough.
There are several command line options to prevent Wget from downloading certain data, most notably options to reject certain file name patterns, directories and domains.
Refer to the Wget manual for how to use them, and pass them to the requester of the Copy site... macro.
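For example (the extensions and the recursion depth are only illustrative values), entering the following into the Copy site... requester would skip common archive formats and limit the recursion depth:
--recursive --no-parent --reject=lha,lzx,zip,tgz --level=3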
This is because the web author has decided to use absolute links (which start with http://) instead of relative links which only specify the filename.
See --convert-links to find out what to do about that.
First, check whether the options you passed to Wget cause it to reject material you actually want. For example, with --no-parent, very "global" images like logos accessed by several sites on the same server are ignored. However, not specifying --no-parent when copying sites usually lands you deep in the mud of rejecting directories, domains and file patterns, so it is usually not worth the trouble.
Apart from this, there are several other possible reasons:
In most cases you can assume that the web author was a complete idiot. As a matter of fact, most of them are. Unfortunately, there is not much you can do about that.
As Wget comes from the Unix-world, the AmigaOS binary uses the ixemul.library. It expects a Unix-like directory tree, and looks for an optional configuration file in usr:etc/.wgetrc.
It won't hurt if you assign usr: to wherever (e.g. t:) and do not provide this configuration file at all, as internal defaults will then be used. If you have an AssignWedge-alike installed, you can also deny this assign without facing any negative consequences for Wget.
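For example, adding a line like the following to your s:user-startup would do (t: is just one possible target, as noted above):
assign usr: t: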
However, you should remember that other Unix ports you might install in the future could require usr: to point to some reasonable location. For example, when installing Geek Gadgets, it adds its own assign for usr: to your s:user-startup, and you should not overwrite this later on.
The full version does.
It seems that, at least for AWeb 3.1, the maximum length of the input fields has an unreasonably small value (the CLI accepts commands of up to 512 characters).
This problem has been reported to the technical support and might be fixed in a future version. Until then, you can use the following workaround:
Do not enter the macro text in the settings requester; instead, write it to a script file and store it, for example, as s:wget-page-with-images. This script would consist of only a single line saying:
wget.rexx --recursive --level=1 --accept=png,jpg,jpeg,gif --convert-links
In the macro menu, you now only add:
Name | Macro |
---|---|
Copy page with images | s:wget-page-with-images >CON://640/200/Wget/CLOSE/WAIT |
Again, this is only a workaround and not the way things should be.
New versions of WgetRexx will be uploaded to aminet:comm/www/WgetRexx.lha. Optionally you can also obtain them from the WgetRexx Support Page at http://www.giga.or.at/~agi/wgetrexx/.
If you found a bug or something does not work as described, you can reach the author via e-mail at this address: Thomas Aglassinger <agi@sbox.tu-graz.ac.at>.
But before you contact me, check whether your problem is already covered in the chapter about troubleshooting.
When reporting problems, please include the name of your WWW browser and the version number of wget.rexx you are using. You can find the latter by taking a look at the source code or by typing version wget.rexx into the CLI.
And please don't bother me with problems like how to use a certain command line option of Wget to achieve a certain result on a certain site. Wget has its own manual.