News Clipper

 



User Guide: Version 1.33

© 1999, 2000 by Spinnaker Software Inc. All rights reserved.

No part of this documentation may be reproduced in any form or by any means or used to make any derivative work (such as translation, transformation or adaptation) without the written permission of Spinnaker Software Inc.

Spinnaker Software Inc. reserves the right to change specifications at any time without notice. Spinnaker Software Inc. reserves the right to revise this documentation and to make changes in content from time to time without obligation on the part of Spinnaker Software Inc. to make notifications of such revision or change.

Spinnaker Software Inc. has obtained information from sources believed to be reliable. However, because of the possibility of human or mechanical error, Spinnaker Software Inc. does not guarantee the accuracy, adequacy or completeness of any information and is not responsible for any errors or omissions or the results obtained from the use of such information.

Neither Spinnaker Software Inc. nor its dealers or Distributors shall be liable to the purchaser or any other person or entity with respect to any liability, loss or damage caused or alleged to have been caused directly or indirectly by the User Guide.

Windows and Windows NT are registered trademarks of Microsoft Corporation. Other brand and product names may be registered trademarks or trademarks of their respective holders.

Spinnaker Software Inc.

719F Mountainwood Road

Charlottesville, VA 22903


Table of Contents

 

SECTION 1 Introduction to News Clipper

Getting More Help

Frequently Asked Questions Database

Mailing Lists

Mailing List Archives

Technical Support

Reporting Bugs

News Clipper Consulting

Conventions Used in This Manual

Platform-Specific Information

Fonts

What is News Clipper?

Overview

What kind of information?

Integrate into what?

How does it work?

How do I run it?

A Note About Copyright

Fair Use

Getting Permission to Publish

How Does It Work?

Running News Clipper

The <!-- newsclipper ... --> Tag

Handlers

Caching

Timeouts

SECTION 2 Setting Up

Installing the Software

System Requirements

Instructions (Binary Installation)

Instructions (Source Code Installation)

Configuring the Software

SECTION 3 News Clipper Tutorial

A Simple Input File

Trying Handlers with the -e Flag

A Single Command in an Input File

The Commands

Default Commands

Snip and Ship Remote Information

Using Filter and Output Commands

Filter and Output Commands

SECTION 4 MakeHandler Tutorial

Overview

Step 1: General Information

Step 2: URL From Which to Fetch Data

Step 3: Choosing An Acquisition Function

Step 4: Specifying Starting and Ending Patterns

Step 5: Specifying Update Times

Step 6: Manual Editing of the Handler

SECTION 5 The News Clipper Tag Language

Overview

Input, Filter, Output

Types

Basic and Built-in Types

Type Checking

Overview of the Standard Handlers

Input Handlers

Filter Handlers

Output Handlers

Updating Handlers

Bugfix and Functional Updates

Updating a Handler Manually

If There is No Update…

Finding Out More About the Handlers

Examples

News Clipper Comments

SECTION 6 Running News Clipper

Overview

Debug Mode

Summary of Flags

Running At Set Times

Windows 95/98

Windows NT

Unix or Linux

FTP Files To Server

SECTION 7 Writing Custom Handlers

Introduction

Basic Handler Structure

Acquisition Handler Components

Other Handler Components

Choose Your News Source

Standards

Choosing the Acquisition Function

Choosing the Starting and Ending Patterns

GetUpdateTimes

Other Functionality

Additional Processing

Default Filter and Output Handlers

Submitting your handler

SECTION 8 Appendix

An Introduction to Open-Source Software

The Benefits of Collaborative Software Development

A Brief History of News Clipper

How to Build a Business

Index

Glossary



SECTION 1
Introduction to News Clipper

Getting More Help

Thank you for purchasing News Clipper™. This user’s manual is designed to answer most questions concerning News Clipper. If a question is not covered by this manual, a number of other resources are available.

Frequently Asked Questions Database

A database of common questions that have already been answered by the News Clipper staff is available at:

http://www.NewsClipper.com/faq.html

This database is a good source for information for general issues related to installing, configuring, and using News Clipper.

Mailing Lists

Two mailing lists are available for News Clipper users.

NEWS CLIPPER COMMUNITY – An interactive list for the mutual benefit of all News Clipper enthusiasts.  This list is useful to learn how other people use News Clipper, and to discuss particular problems.

NEWS CLIPPER DEVELOPMENT – An interactive list for those interested in the development of News Clipper and conversation of a more technical nature. Topics include potential features, changes to the tag language, suggested improvements, etc.

To join either or both of these lists, subscribe online at:

http://www.NewsClipper.com/techsup.html#MailingList

Mailing List Archives

All e-mail sent to the mailing lists is archived, so the archives can be quickly searched for conversations on any topic. To protect users’ privacy, e-mail addresses are automatically removed. To view the archives, visit:

http://www.mail-archive.com/newsclipperlist%40newsclipper.com/

Technical Support

If the answer to a question cannot be found using the resources above, our support staff can be reached through the on-line priority support form at:

http://www.newsclipper.com/techsup.htm#Priority_Support_Form

Our technical support staff will reply within 24 business hours.

Reporting Bugs

If you find a bug in News Clipper, send e-mail to bugreport@newsclipper.com. Please run News Clipper with the -d switch first, and attach the debug log that News Clipper generates and stores in $home/.NewsClipper/logs.

News Clipper Consulting

It’s sometimes easier and faster to pay an expert to handle particularly tricky problems. Spinnaker Software can connect you with one of our expert consultants. We provide support for custom handler development, News Clipper installation, and more. Just visit http://www.newsclipper.com/services.html.


Conventions Used in This Manual

This manual uses the following conventions.

Platform-Specific Information

All of the examples are given in terms of the Microsoft Windows distribution of News Clipper. This means that certain file names and paths might be slightly different for Unix-like platforms. For example, Windows users would run NewsClipper.exe, while Linux/Solaris users would run NewsClipper.

Likewise, per-user configuration information for Unix-like platforms is always stored in $home/.NewsClipper. On Windows, this information is stored in <INSTALLDIR>/.NewsClipper, where <INSTALLDIR> is the directory into which News Clipper was installed.

Other platform-specific issues will be described in the text.

Fonts

9-point Courier will be used for all commands and code:

This is an example of some code: <!--newsclipper ...-->.

Longer code fragments will be boxed and in 8-point Courier:

<!-- newsclipper
<input ...>
<filter ...>
<output ...>
-->

Extra information will be given in notes:

Note: This is a note.

Warnings will be boxed:

Warning: Warning, this is a warning!

The first time a new term is used, it will be italicized. The definition can be found in the glossary located in the appendix.


What is News Clipper?

Overview

News Clipper is a tool that allows for the processing and integration of dynamic information into a web page. This information might be something simple, like the date, or complex, like a set of links to recent Usenet postings about llamas. News Clipper allows the user to specify, using an HTML-like syntax, the source of data, how that data should be filtered, and how that data should be output.

What kind of information?

News Clipper has built-in support for over 200 information sources, and more are being added every day. In addition, the capabilities of News Clipper can be extended so that it can handle any information that is available on the Internet. For example, information on web pages, in Usenet articles, in text files on the Internet, at FTP sites, etc. can all be used by News Clipper.

It is also possible to extend News Clipper to be able to integrate with information stored locally. This means that News Clipper can integrate information in databases, text files, etc.

Integrate into what?

News Clipper outputs HTML that is suitable for integrating into web pages written by the user. At various points in the web page creation, special commands are added that tell News Clipper what information to integrate, and how it should look. When News Clipper is run, it looks for the special commands. When it finds one, it snips the information from the Internet source and pastes it into the HTML file. This output file is the one viewed by visitors to your site.

How does it work?

News Clipper is not a CGI script that gets executed when a person visits your website. CGI programs can be slow, and create extra load on your web server. Furthermore, some people cannot run CGI programs due to restrictions on their account. (Some people only have FTP access to their servers, for example.)

News Clipper usually operates in the background, updating the dynamic content in your web pages whenever needed. You provide a set of input files with special commands in them, and News Clipper converts those commands into information acquired and formatted from sources on the Internet. Visitors to your web pages will load static pages with fresh information.

Each of the News Clipper commands contains a sequence of input, filter, and output commands. These commands are implemented in terms of modules called handlers that encode the details of performing a News Clipper command. Handlers are written in a computer programming language called Perl.

Handlers can be written by users, but they are usually downloaded from a repository of pre-written handlers called the handler database. When News Clipper encounters a command it does not have a handler for, it automatically tries to download the handler from the handler database. When new handlers are added to the on-line database, they instantly become available to all users.

When News Clipper encounters an input command (and has downloaded the handler file if necessary), it then loads the handler into memory. The input handler contains instructions that tell News Clipper how to fetch information from a remote website. Once the information is fetched, News Clipper makes the information available to the user in a format that depends on the type of data fetched.

News Clipper then processes additional filter commands that manipulate the data in some way. For example, the filter handlers may extract only links with certain keywords. Finally, the output command causes an output handler to be loaded that sends the data to the output file in a particular format.

How do I run it?

News Clipper is run from the command line. One would normally install News Clipper on the web server, then telnet to the machine to run the program manually or configure the server to run it automatically.

Another way to use News Clipper is to install it on a local machine and configure it to automatically send the output files to the remote server. This is useful if your web account is unusually restrictive about which programs can be installed and run by users.

During web site development, it is best to run News Clipper manually. During production use, one should use a third-party utility to run News Clipper automatically every few hours.

The following sections describe the various aspects of News Clipper in more detail.


A Note About Copyright

Most information on the Internet is copyrighted. When something is copyrighted, it means that the expression of the idea is protected by law. The person who expressed the idea has the right to control how that expression is reproduced. So, for example, if a journalist writes a story, then that journalist holds the copyright to the story. The same is true for artwork and even headlines.

Fair Use

The law makes exceptions to copyrights, which it calls fair use. Examples are a teacher who makes photocopies of material for educational purposes, a musician who mimics a song for satire, or a movie reviewer who plays a film clip for an audience. Another important exception is that of personal use, which means that a reproduction can be made by an individual for the individual’s own use. Of course, the owners of copyrights can use the material however they want.

When agreeing to the licensing terms for News Clipper, users agreed that they would not use News Clipper to violate copyright laws. What this generally means is that material protected by copyright cannot be re-published unless prior agreement from the copyright holder has been received. In other words, do not make News Clipper-enhanced web pages available on the web without verifying that copyrights are not being violated.

Getting Permission to Publish

Spinnaker Software, Inc. is currently working with Internet publishers to acquire syndication rights for material that is to be re-published on the web. Most likely, end users will have to pay royalties to the copyright holder either when they acquire a handler that allows re-publishing of the material, or when they actually use the handler to re-publish the material. As these partnerships are established, announcements will be made on the News Clipper website.

In the meantime, companies that wish to re-publish news and other content are advised to syndicate that content from the provider. News Clipper can then be used to integrate that information into corporate web pages.

Of course, authors of copyrighted works can re-publish all they like.


How Does It Work?

This section describes the basic operation of News Clipper at a high level.

Running News Clipper

When you run News Clipper, it processes the input files and creates output files with the dynamic content. Each input file has embedded in it special News Clipper commands that are replaced with the dynamic content specified by the command. That is, when News Clipper encounters a command sequence, it uses that command sequence to determine how to find and snip content from the web and insert it into the output web page.

There are several ways to run News Clipper. You can either run News Clipper manually to cause it to create the output files immediately, or run News Clipper automatically. Running News Clipper manually is useful when you are designing your commands, or trying out a new information source. When you are ready to put News Clipper into production use, you simply use a third party tool that runs News Clipper periodically.
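On Unix-like systems, one widely available tool for periodic runs is cron. The sketch below is an illustration rather than part of the News Clipper distribution: it writes a crontab entry that runs News Clipper, with no flags, every three hours. The install path /usr/local/bin/NewsClipper and the file name newsclipper.cron are hypothetical.

```shell
#!/bin/sh
# Sketch: build a crontab entry that runs News Clipper every three hours.
# NC_BIN and CRON_FILE are hypothetical; adjust them for your system.
NC_BIN="${NC_BIN:-/usr/local/bin/NewsClipper}"
CRON_FILE="${CRON_FILE:-newsclipper.cron}"

# Fields: minute hour day-of-month month day-of-week command.
# "0 */3 * * *" means minute 0 of every third hour.
printf '0 */3 * * * %s\n' "$NC_BIN" > "$CRON_FILE"

echo "Wrote $CRON_FILE; review it, then install it with: crontab $CRON_FILE"
```

Installing a file with crontab replaces the user's existing schedule, so merge the line into any entries you already have before installing it.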

One simple way to test a new information source manually is by using the -e flag:

NewsClipper -e 'yahootopstories source=ap'

Running News Clipper this way tells it to execute the single command specified and output the results to the screen.

If you want News Clipper to process a single file manually, you can use the -i and -o flags to specify input and output files:

NewsClipper -i inputfile -o outputfile.html

Two special file names are recognized: STDIN and STDOUT. Specifying STDIN as the input file tells News Clipper to take its input from data that is “piped” to it from another program. Similarly, STDOUT tells News Clipper to write its output to standard output (normally the screen) instead of a file. The following example takes the input from STDIN and prints the output to STDOUT, which is in turn piped to another program.

prog1 | NewsClipper -i STDIN -o STDOUT | prog2

The usual way of running News Clipper is without any flags at all. In this case, the input and output files are specified in the configuration file NewsClipper.cfg, which is usually stored in $home/.NewsClipper. In addition to the default input and output files, this file contains a number of configuration options.

We recommend creating the directory $home/.NewsClipper/inputfiles in which to store your News Clipper input files. You may also wish to give them a particular extension like “.nc” to indicate that the files are “News Clipper enabled”. (However, keeping a .html extension may also be prudent so that your HTML editor will interpret the input files as HTML.) Be sure that when updating your website you update the News Clipper input files and not the output files, as the output files will be overwritten the next time News Clipper is executed.
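As a minimal sketch of this layout on a Unix-like system, the script below creates the recommended directory, writes one small input file, and processes it with the -i and -o flags if News Clipper is installed. It assumes that the manual's $home corresponds to the shell's HOME variable; the file names index.nc and index.html are hypothetical.

```shell
#!/bin/sh
# Sketch: set up the recommended input directory and one sample input file.
# NC_HOME mirrors the manual's $home/.NewsClipper (assumed to be under $HOME).
NC_HOME="${NC_HOME:-$HOME/.NewsClipper}"
INPUT_DIR="$NC_HOME/inputfiles"
INPUT_FILE="$INPUT_DIR/index.nc"          # hypothetical input file name
OUTPUT_FILE="${OUTPUT_FILE:-index.html}"  # hypothetical output file name

mkdir -p "$INPUT_DIR"

# Write a minimal page containing one News Clipper command.
# The quoted 'EOF' keeps the shell from expanding anything in the HTML.
cat > "$INPUT_FILE" <<'EOF'
<html>
<body>
<h1>Technology headlines</h1>
<!-- newsclipper
<input name=yahootopstories source=tech>
-->
</body>
</html>
EOF

# Process the file only if News Clipper is on the PATH.
if command -v NewsClipper >/dev/null 2>&1; then
  NewsClipper -i "$INPUT_FILE" -o "$OUTPUT_FILE"
else
  echo "NewsClipper not found; created $INPUT_FILE only."
fi
```

Edit index.nc when the page needs to change; index.html is regenerated on every run.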

More information about running News Clipper can be found in the section entitled Running News Clipper.

The <!-- newsclipper ... --> Tag

The “special commands” that News Clipper recognizes are actually hidden in HTML comments. When News Clipper analyzes the input file, it looks for comments where the first word is “newsclipper”. When it sees such a comment, it then interprets the remainder of the comment as a News Clipper command.
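Because the commands live in ordinary HTML comments, standard text tools can show which files contain them. The sketch below is only an approximation of the rule described above, not News Clipper's own parser: grep works line by line, so it finds comments in which “newsclipper” follows “<!--” on the same line. The function name and the demonstration files are hypothetical.

```shell
#!/bin/sh
# Sketch: list the files in a directory that contain a News Clipper comment.
# Approximation only: matches "<!--", optional whitespace, then "newsclipper"
# on a single line.
find_nc_files() {
  # -l prints only the names of matching files; -E enables extended regexes.
  grep -lE '<!--[[:space:]]*newsclipper' "$1"/* 2>/dev/null
}

# Demonstration in a scratch directory (hypothetical file names).
DEMO_DIR="$(mktemp -d)"
printf '<p>static page</p>\n' > "$DEMO_DIR/plain.html"
printf '<!-- newsclipper\n<input name=yahootopstories source=tech>\n-->\n' > "$DEMO_DIR/news.nc"
printf '<!-- an ordinary comment -->\n' > "$DEMO_DIR/other.html"

# Only news.nc is reported; the ordinary comment is ignored.
find_nc_files "$DEMO_DIR"
```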

Commands

There are three types of commands that News Clipper recognizes: input, filter, and output. The basic idea is that the commands tell News Clipper what kind of information is needed, how that information should be transformed, and how it is to be displayed. See the section The News Clipper Tag Language for more information.

Note: If there is only one command to be executed (an input command), News Clipper determines the filter and output handlers from the defaults given in the input handler. The resulting (expanded) command list will have an input command, zero or more filter commands, and an output command.

Below is an example News Clipper tag, with four commands. We’ll use it as a running example in the next few sections.

<!--newsclipper
<input name=yahootopstories source=tech>
<filter name=map filter=hash2string
  format='<a href="%{url}">%{headline}</a>'>
<filter name=grep words="microsoft,linux,y2k">
<output name=array numcols=3>
-->

By separating acquisition, filtering, and output of information, web designers are given more freedom to control the presentation.

Input

<input name=yahootopstories source=tech>

The input command tells News Clipper where to get its information. In this case, the news source is Yahoo’s top stories, technology section. At time of publication, News Clipper supports over 200 news sources.

Filter

<filter name=map filter=hash2string
  format='<a href="%{url}">%{headline}</a>'>

One or more filters can be used to modify the data that is collected. In the first filter, the array of hashes of information returned by yahootopstories is converted into an array of links.

<filter name=grep words="microsoft,linux,y2k">

The second filter, grep, takes the links created by the first filter and removes any link that does not contain at least one of the keywords microsoft, linux, or y2k.

Output

<output name=array numcols=3>

The output command prints the data in a particular format. In this case, the filtered links are printed in three columns.

See the section entitled The News Clipper Tag Language for more information.

Handlers

News Clipper uses handlers to support the input, filter, and output commands. Handlers tell News Clipper where the information is located and how to get it. People extend the capabilities of News Clipper by writing new handlers and submitting them to the News Clipper website.

Every News Clipper command has a corresponding handler, which is saved in a subdirectory of $home/.NewsClipper/NewsClipper/Handler. The subdirectory depends on the type of handler (Acquisition, Filter, or Output). The handler name is the same as the command name, plus the “.pm” extension.
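The naming rule can be turned into a quick check of which handlers are already present locally. This is a sketch under two assumptions: that the manual's $home corresponds to the shell's HOME variable, and that the subdirectories are named exactly Acquisition, Filter, and Output. The function names are hypothetical.

```shell
#!/bin/sh
# Sketch: compute where the handler for a command is expected to live.
# Assumes subdirectories named exactly Acquisition, Filter and Output.
NC_HOME="${NC_HOME:-$HOME/.NewsClipper}"

handler_path() {
  # $1 = handler type (Acquisition, Filter or Output), $2 = command name
  printf '%s/NewsClipper/Handler/%s/%s.pm\n' "$NC_HOME" "$1" "$2"
}

# Report whether a handler file is already on disk.
check_handler() {
  path="$(handler_path "$1" "$2")"
  if [ -f "$path" ]; then
    echo "present: $path"
  else
    echo "missing: $path"
  fi
}

# The handlers used by the running example above.
check_handler Acquisition yahootopstories
check_handler Filter grep
check_handler Output array
```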

When News Clipper encounters a command for which no local handler exists, it will contact the News Clipper server and ask for a suitable handler. If one is available, News Clipper will download it and use it to complete the command.

Using New Handlers

Using new handlers is easy – simply start writing commands with them. News Clipper will download the handlers as needed.

Checking for Updates

Occasionally, the news source for a handler will change, and the handler will stop working. When this happens, an HTML comment will be inserted in the output file that explains what happened. Sometimes the problem has already been fixed, and a new handler will be downloaded automatically. Other times, the handler writer will need to be contacted and asked to update the handler. Browse the handler database at http://www.newsclipper.com/handlers.htm to find the contact information for the author of a particular handler.

Writing Handlers

Sometimes there is a news source that is not currently supported by News Clipper. Maybe it is a website that does not have a handler yet, or maybe it is a news source that is internal to an organization. In either case, it is easy to extend News Clipper to support it.

News Clipper comes with a tool called MakeHandler, which automates some of the tedious aspects of writing a new handler. It will ask a series of questions, and write the bulk of the handler. (In most cases, the handler will work without any further modifications.) Handlers are written in the Perl programming language because it has powerful operations for manipulating text.

When the handler is finished, please consider submitting it to the News Clipper handler database. This will allow other people to use the handler too. The on-line submission form is at http://www.newsclipper.com/submithandler.html.

Sometimes it is possible to avoid writing a new handler for a news source by using one of the general news handlers such as the rss handler or the moreover handler. For more information about writing new handlers see the section entitled MakeHandler Tutorial.

Caching

News Clipper has built-in support for caching of information. This allows the same news source to be used in multiple places in the input files while being fetched from the remote location only once. Likewise, if the script is run several times a day, it will only update the data when the handlers require it. (During the creation of a handler, the writer specifies when data is updated at the news source.)

Images can also be cached, which will improve the speed of viewing HTML pages. It also lowers the load on the websites where the images come from.

Timeouts

Sometimes a news source is not available or is slow to respond. When this happens, News Clipper will “time out” after a specified number of seconds. When a timeout occurs, a message will be inserted in the run log explaining what happened.

In addition to the timeouts for each handler, News Clipper also has a timeout for the entire program (non-Windows platforms only). Keep this time limit in mind when asking News Clipper to process a large number of files.


SECTION 2
Setting Up


Installing the Software

System Requirements

News Clipper requires approximately 5 MB of hard-disk space. If images will be cached locally, News Clipper will require extra space. How much space to use for image caching can be specified in the cacheimages configuration file. Similarly, the size and number of log files that News Clipper creates can be specified in the configuration file.

News Clipper is written in Perl. Perl programs normally require a Perl interpreter, and users must install any extra modules the program needs. Because installing Perl and the supporting modules can be difficult, News Clipper is sold as pre-compiled binaries that require neither Perl nor any extra modules. Most users will want to install these.

However, if one chooses not to install the binaries, the source code can be installed from the files located in the “source” subdirectory of the distribution. Do this only if Perl is installed on the system, and you are familiar with the process for downloading and installing Perl modules.

Instructions (Binary Installation)

Windows™

News Clipper can run on any Windows 95/98/NT machine with a connection to the Internet.

1.        Download the distribution.

2.        Double-click the executable file that was downloaded.

3.        After agreeing to the terms of the license, click the browse button to select a different installation directory or click next to accept the default installation directory.

4.        Choose a program folder in which to place the icons.

5.        Click finish to complete the installation.

To Uninstall

To uninstall News Clipper in the future, double-click the Add/Remove Programs icon in the Control Panel. Select News Clipper from the selection box, and click “Add/Remove”.

Linux™ or Solaris™

1.        Download the distribution.

2.        Unpack the distribution by running zcat <file.tar.gz> | tar xvf -.

3.        Copy the files NewsClipper and MakeHandler to a location in the path.

4.        If installing News Clipper system-wide, edit the NewsClipper.cfg file to specify the location of handlers, the html cache, and the image cache.

5.        Make sure the permissions are set correctly to allow people running News Clipper to write to the directories.

6.        Copy NewsClipper.cfg to a convenient location, typically /etc/.

7.        Set the NEWSCLIPPER environment variable in every user’s configuration to point to the location of NewsClipper.cfg. This is typically done by editing a file such as /etc/profile.

Instructions (Source Code Installation)

A source code version of News Clipper can be installed if that is preferred over the pre-compiled version. In addition to the requirements above, Perl 5.004 or newer is required, as well as several Perl modules. See the README file located in the src directory of the distribution.

1.        Download the distribution.

2.        Unpack the distribution by running zcat <file.tar.gz> | tar xvf - in Unix, or using a compression manager like WinZip in Windows.

3.        Change directories into the NewsClipper-X.XX-os directory.

4.        Complete the installation according to the instructions in the README file.

 


Configuring the Software

All of News Clipper’s configuration information is stored in a file called NewsClipper.cfg. On Windows, this file is located in the $home/.NewsClipper directory. On Unix-based systems, this file is located either in $home/.NewsClipper or the system-wide location specified by the NEWSCLIPPER environment variable. The normal configuration file can be overridden using the -c flag.

The configuration file should be fairly self-explanatory. If you do not understand an option, it is safest to leave the option alone. Here is a description of the options that can be changed:

$ENV{TZ} = '' if ($^O eq 'MSWin32') || ($^O eq 'dos');

On Windows systems, the TZ environment variable must be set to the local time zone. Unix-based systems can ignore this option, as the time zone is obtained from the operating system. Time zones can take any of the following forms:

Universal: GMT, UT

US zones: EST, EDT, CST, CDT, MST, MDT, PST, PDT

Military: A to Z (except J)

Other: +HHMM or -HHMM

ISO 8601: +HH:MM, +HH, -HH:MM, -HH

Insert the time zone between the quotes.

'email' => '',
'registration_key' => '',

When registering News Clipper, enter the exact email you supplied during registration, and the registration key you received.

'input_files' =>
  ['c:/path/in1.nc','c:/path/in2.nc'],
'output_files' =>
  ['c:/path/out1.html','c:/path/out2.html'],

News Clipper can handle multiple input and output files. Enter the complete paths and names of the files, separated by commas. The first input file is used to generate the first output file in the output_files list, the second input file generates the second output file, and so on. Be sure that the number of input files equals the number of output files.
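The positional pairing can be pictured with a short Python sketch. This is only an illustration of the rule, not News Clipper's own code; the function name pair_files is hypothetical.

```python
def pair_files(input_files, output_files):
    """Pair each input file with the output file in the same position."""
    if len(input_files) != len(output_files):
        raise ValueError("input_files and output_files must have the same length")
    return list(zip(input_files, output_files))

pairs = pair_files(
    ["c:/path/in1.nc", "c:/path/in2.nc"],
    ["c:/path/out1.html", "c:/path/out2.html"],
)
for source, target in pairs:
    print(source, "->", target)
```

A mismatch in the two list lengths is reported as an error in this sketch, which mirrors the advice to keep the two lists the same length.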

'ftp_files' => [
  {'server' => 'www.myserver.com',
   'username' => 'myusername',
   'password' => 'mypassword',
   'dir' => '/home/myusername/public_html'},
  {'server' => 'www.myserver.com',
   'username' => 'myusername',
   'password' => 'mypassword',
   'dir' => '/home/myusername/public_html'},
],

You can tell News Clipper to FTP your output files to a remote server. For each output file, provide the server, username, password, and directory between the “{...}” as above. The first set of information applies to the first output file, the second set to the second, etc. If you do not want to transfer a particular file, just use empty brackets (“{}”) in that file’s position. See the section entitled FTP Files To Server for more information.

'email_files' => [
  {'From' => 'Your Name <user@server>',
   'To' => 'Recipient <recipient@server>',
   'Subject' => 'The Subject',
   'CC' => '',
   'BCC' => ''},
],

You can tell News Clipper to email your output files to one or more recipients. For each output file, provide the from header, to header, subject, cc, and bcc between the “{...}” as above. The first set of information applies to the first output file, the second set to the second, etc. If you do not want to email a particular file, just use empty brackets (“{}”) in that file’s position. See the section entitled Email Files To Users for more information.

'handler_locations' => ["$home/.NewsClipper"],

Additional paths can be added to this variable to tell News Clipper where to find additional handlers. ($home is the installation directory on Windows, and $home/.NewsClipper on Unix.) When installing News Clipper for multiple users on a Unix system, this value should point to a globally accessible directory.

You can specify multiple locations, separating each with commas. However, be aware that when News Clipper needs to install a new handler, it will save it in the first directory in the list. This means that this directory should be writable by the user.

'module_path' => '',

This option can safely be ignored for pre-compiled versions of News Clipper. For source code installations, this configuration option contains the path to the directory NewsClipper, which contains News Clipper’s source modules.

'cache_location' => "$home/.NewsClipper/cache",
'max_cache_size' => 5,

News Clipper caches remote web pages locally. cache_location specifies where that cache should be, and max_cache_size specifies how big the cache can get. The size of the cache is given in megabytes.

'script_timeout' => 360,
'socket_timeout' => 40,
'socket_tries' => 3,

Timeouts are used to prevent News Clipper from running too long, and to prevent unresponsive remote servers from slowing things down. Set socket_timeout to the maximum amount of time that News Clipper should wait for a response from a server. Set script_timeout to the maximum time that News Clipper should run. (Time values are in seconds.)

One problem that users have is that the main script times out because one or more handlers timed out. To help prevent this, script_timeout should be about equal to socket_timeout times the number of News Clipper tags in the input files. Note that on the Windows platform, timeouts for the entire script are not supported.
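The rule of thumb above is simple arithmetic. The Python sketch below is not part of News Clipper; it counts the newsclipper tags in some input text and multiplies by socket_timeout. The function names are hypothetical, and the regular expression is only an approximation of how tags are recognized.

```python
import re

def count_tags(text, tag_text="newsclipper"):
    """Count HTML comments that open with the tag keyword."""
    pattern = re.compile(r"<!--\s*" + re.escape(tag_text) + r"\b")
    return len(pattern.findall(text))

def suggested_script_timeout(texts, socket_timeout=40):
    """socket_timeout multiplied by the total number of tags in all inputs."""
    total = sum(count_tags(t) for t in texts)
    return socket_timeout * total

page = """<html><body>
<!-- newsclipper
  <input name=date>
-->
<!-- newsclipper
  <input name=yahootopstories source=ap>
-->
</body></html>"""

print(suggested_script_timeout([page]))
```

With two tags and a socket_timeout of 40 seconds, the estimate is 80 seconds. The sketch ignores socket_tries; if each fetch may be retried, a larger value may be appropriate.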

'proxy' => "your.proxy.com",
'proxy_username' => "username",
'proxy_password' => "password",

If a proxy is being used to communicate with the outside world, these values need to be set. proxy is the URL of the proxy server, such as http://proxy.host.com:8080/.

If a password is not supplied and News Clipper is not run from a command shell, News Clipper will attempt to log in to the proxy server without a password, which will most likely fail. However, if a password is not supplied and News Clipper is run from a command shell, News Clipper will prompt the user to enter a password.

If a password is supplied for proxy_password, this password will be used. If you use this option, take appropriate measures to ensure that the configuration file cannot be read by others.

'auto_download_bugfix_updates' => 'yes',

This setting tells News Clipper whether you want it to automatically download updates to handlers that correct bugs, but which do not contain changes that will break existing News Clipper commands.

'tag_text' => 'newsclipper',

The tag_text value allows you to specify what keyword News Clipper should use when determining if an HTML comment is really a set of News Clipper commands. For example, changing this value to “nccommand” would tell News Clipper to look for commands inside comments that look like “<!--nccommand ... -->”.
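To see what changing tag_text means in practice, here is a rough Python sketch that pulls the command text out of comments starting with a given keyword. It is an approximation for illustration only; find_command_blocks is a hypothetical name, and News Clipper's real parser may treat whitespace and nesting differently.

```python
import re

def find_command_blocks(html, tag_text="newsclipper"):
    """Return the text inside each <!--tag_text ... --> comment."""
    pattern = re.compile(
        r"<!--\s*" + re.escape(tag_text) + r"\s(.*?)-->", re.DOTALL
    )
    return [body.strip() for body in pattern.findall(html)]

html = "<p>Hi</p><!--nccommand <input name=date> --><!-- plain comment -->"

print(find_command_blocks(html, tag_text="nccommand"))
print(find_command_blocks(html))
```

With tag_text set to “nccommand”, only the first comment is treated as commands; with the default keyword, neither comment matches.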

'make_output_files_executable' => 'yes',

Some web servers are configured to process server-side includes or other dynamic content only if the file is executable. This setting tells News Clipper whether to make the output files that it generates executable.

'debug_log_file' => "$home/.NewsClipper/logs/debug.log",
'run_log_file' => "$home/.NewsClipper/logs/run.log",

These values determine the location of the debug and run logs.

'max_number_of_log_files' => '7',
'max_log_file_size' => '1000000',

These values control how log files are rotated. Old log files are compressed and saved. When the current log file reaches max_log_file_size (in bytes), the oldest log file is deleted and the remaining files are renamed to make room for a new log file. Up to max_number_of_log_files log files are kept.
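The rotation described above can be modeled in a few lines of Python. This is a simplified sketch under stated assumptions, not News Clipper's implementation: old logs are represented only by numbers (1 being the most recent), the limit is assumed to apply to the saved old logs, and the oldest log is assumed to be dropped only when the limit would otherwise be exceeded. The function name rotate is hypothetical.

```python
def rotate(saved, current_size, max_size=1000000, max_files=7):
    """Return the numbers of the saved logs after one rotation check.

    `saved` lists the numbers of the old logs; 1 is the most recent.
    """
    if current_size < max_size:
        return sorted(saved)                       # below the limit: no change
    renamed = [n + 1 for n in sorted(saved)]       # every old log moves up one
    kept = [n for n in renamed if n <= max_files]  # the oldest falls off the end
    return [1] + kept                              # the full log becomes number 1

print(rotate([1, 2, 3], 1000000))
print(rotate([1, 2, 3, 4, 5, 6, 7], 1000000))
print(rotate([1, 2, 3], 500))
```

The first call adds a fourth saved log, the second keeps the count at seven by discarding the oldest, and the third changes nothing because the current log is still small.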

The configuration file may also have one or more configuration sections for individual handlers. For example, News Clipper’s example configuration file already has configuration options for the cacheimages filter.

'imgcachedir' => "$home/path/to/imagecache",
'imgcacheurl' => "http://www.yourserver.com/imagecache",
'maximgcacheage' => 7,

The cacheimages handler caches remote images locally for faster viewing. Some input handlers use cacheimages internally, so these variables will need to be set even if you are not using the handler explicitly.

imgcachedir is the directory where image files should be stored. imgcacheurl is the URL that corresponds to this directory. maximgcacheage is the number of days that an image remains in the cache before being deleted.

For more information about these options, see the description of the cacheimages filter in the section entitled Overview of the Standard Handlers.



Trying Handlers with the -e Flag

First let’s use News Clipper to print the date using the date handler.

NewsClipper -e date

This will cause News Clipper to automatically download and install the handler if it has not already been installed. It will then run the date handler and output the results to the screen.

Normally the handler will be saved as $home/.NewsClipper/NewsClipper/Handler/Acquisition/date.pm. You can use a normal text editor to view this file. At the top of the file is a block of information about the handler that looks like this:

$handlerInfo{'Author Name'}    = 'David Coppit';
$handlerInfo{'Author Email'}   = 'david@coppit.org';
$handlerInfo{'Maintainer Name'}= 'David Coppit';
$handlerInfo{'Maintainer Email'}='david@coppit.org';
$handlerInfo{'Description'}    = <<'EOF';
The current date
EOF
$handlerInfo{'Category'}       = 'News Clipper';
$handlerInfo{'URL'}            = <<'EOF';
EOF
$handlerInfo{'License'}        = 'GPL';
$handlerInfo{'For News Clipper Version'} = '1.18';
$handlerInfo{'Language'}       = 'English';
$handlerInfo{'Notes'}          = <<'EOF';
EOF
$handlerInfo{'Syntax'}         = <<'EOF';
<input name=date style=X>
  Returns a string
  X=day: Wednesday, November 7 (default)
  X=numeric: 11/7/98
  X=long: November 7, 1995
  X: Or any strftime-like time specification
EOF

This information can be viewed in an easier-to-read format by searching for the date handler at http://www.newsclipper.com/handlers.html.

From the syntax description above, and from the description of the handler on the website, we see that the date handler supports a number of styles. We can try out the “numeric” style like so:

NewsClipper -e 'date style=numeric'

A Single Command in an Input File

Now let’s get News Clipper working for a trivial file. Create a file with the following text:

<html>
<head><title>A Test page</title></head>
<body>
Hello world! The current date is:
<!-- newsclipper
  <input name=date>
-->
</body>
</html>

Save this file as “input.txt”. The name really does not matter, but it will be referenced in the next step.

Now run News Clipper like this:

NewsClipper -i input.txt -o output.html

After the program runs, there will be a file called output.html in the directory. When this file is opened in a browser, the result will look similar to:

Hello world! The current date is: Sunday, June 20

Figure 31: Sample output for a trivial input file

 

The News Clipper input command used in the above example names the information source to be date. This means that when News Clipper encounters this command in the input file, it attempts to find a handler that is able to get this information. In this case, it finds the date.pm handler file in the $home/.NewsClipper/NewsClipper/Handler/Acquisition directory.

The Commands

Every HTML-like command in the newsclipper tag tells News Clipper how to get, process, or output information. This example has shown one command – an input command. There are also filter and output commands that can be used to customize the information that is inserted into a web page.

Default Commands

Almost every newsclipper tag consists of one input command, zero or more filter commands, and one output command. The previous example did not use an output command, but instead relied on the handler’s default filter and output commands.

Every input handler has a set of built-in defaults that are activated whenever the input command is the only command given. As a result, filter and output commands are needed only when you want a look that differs from the one the handler writer chose.

The default commands for the date handler simply consist of an output command for strings (a date is a string). That is, the following tells News Clipper explicitly that the output command is “string”:

NewsClipper -e 'date style=numeric,string'

In an input file, this would be specified as:

<!-- newsclipper
<input name=date style=numeric>
<output name=string>
-->

Warning: If a filter command is specified, an output command must also be specified.


The date example demonstrated News Clipper’s ability to insert dynamic information. It did not actually fetch content from a remote location.

Create a file with the following text:

<html>
<head><title>A Test page</title></head>
<body>
The current AP headlines are:
<!-- newsclipper
  <input name=yahootopstories source=ap>
-->
</body>
</html>

Save this file as “input2.txt”, and run News Clipper:

NewsClipper -i input2.txt -o output2.html

After the program has run, there will be a file called output2.html in the directory. When this file is opened in a browser, the result will look similar to:

The current AP headlines are:

NATO Officially Ends Air Campaign

Strong Action Pledged for Kosovo

KLA Agrees To Lay Down Weapons

Stephen King Alert After Surgery

Saddam Among Richest Leaders

Study: Minorities Prone to Diabetes

Red Cross Workers Leave Afghanistan

Stars 2, Sabres 1, 3OT

U.S. Sets Goal For Wind Power

Stewart Beats Par, Leads U.S. Open

Figure 32: Sample output for remote content


Below is a more complex example in which we use filters to modify the information to be displayed.

Filter and Output Commands

Create a file with the following text:

<html>
<head><title>A Test page</title></head>
<body>
The filtered AP headlines are:
<!-- newsclipper
  <input name=yahootopstories source=ap>
  <filter name=map filter=hash2string
    format='<a href="%{url}">%{headline}</a>'>
  <filter name=grep words="SOMEWORDS">
  <output name=array numcols=1>
-->
</body>
</html>

In the above file, replace SOMEWORDS with words that were contained in the headlines from the last example. In Figure 32, for example, “nato,kosovo” could be used.

After the program has run, the result shows that the grep filter has removed any headlines that did not have the words that were specified. Also, the array output handler displayed the results in a single column. It might look something like this:

The filtered AP headlines are:

NATO Officially Ends Air Campaign

Strong Action Pledged for Kosovo

KLA Agrees To Lay Down Weapons

Figure 33: Sample filtered output

(Although neither kosovo nor nato appears in the text of the last headline, the word kosovo does appear in its hyperlink.)

Note also that we specified the “numcols=1” option for the array output command, which caused the output to be displayed in a single column.
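The reason a headline can match through its hyperlink becomes clearer when the two filters are imitated by hand. The Python sketch below is an illustration only: the URLs are made up, the functions merely mimic hash2string and grep, and case-insensitive matching is an assumption suggested by the “nato,kosovo” example.

```python
def hash2string(item, fmt):
    """Fill each %{key} placeholder in fmt with the matching hash value."""
    out = fmt
    for key, value in item.items():
        out = out.replace("%{" + key + "}", value)
    return out

def grep(strings, words):
    """Keep the strings that contain any of the comma-separated words."""
    wanted = [w.strip().lower() for w in words.split(",")]
    return [s for s in strings if any(w in s.lower() for w in wanted)]

# Hypothetical story hashes; the real handler's URLs will differ.
stories = [
    {"url": "http://news.example.com/nato_ends.html",
     "headline": "NATO Officially Ends Air Campaign"},
    {"url": "http://news.example.com/kosovo_kla.html",
     "headline": "KLA Agrees To Lay Down Weapons"},
    {"url": "http://news.example.com/wind.html",
     "headline": "U.S. Sets Goal For Wind Power"},
]

fmt = '<a href="%{url}">%{headline}</a>'
links = [hash2string(s, fmt) for s in stories]   # the map/hash2string step
matches = grep(links, "nato,kosovo")             # the grep step

for link in matches:
    print(link)
```

Because grep runs after hash2string, it searches the whole link string, so the second story is kept on the strength of its URL alone.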

For more information on how to use the News Clipper commands effectively, see the section entitled The News Clipper Tag Language.



MakeHandler is a “wizard” that steps you through a series of questions, generating a handler based on your responses. The questions you are asked depend partly on the responses you give for earlier questions.

MakeHandler lets you enter most information using your favorite text editor. MakeHandler asks certain questions, then invokes the text editor with a file for you to edit. After you have finished answering the question, save the file and close your editor, and MakeHandler will continue. Responses should be kept between the prompt arrows located in the file.

The first question MakeHandler asks is the name of the text editor to use. On Windows, notepad is a good choice. On Unix, pico works well.

After all of the necessary questions have been answered, MakeHandler will write a handler to the disk. You can then copy the handler to your $home/.NewsClipper/NewsClipper/Handler/Acquisition directory and begin using it.


MakeHandler begins by asking for some general information:

name: Your name. You will be indicated as the creator and maintainer of the handler.

email: Your email address.

handler name: The handler’s name. This is a lowercase string without spaces or non-alphanumeric characters.

handler URL: A URL that best shows where the data comes from.

language: The language of the fetched data.

license: The license for the generated handler code. This describes how others can modify, copy, or otherwise use your handler.

category: The category helps people find your handler while browsing on the website.


The URL from which the handler should fetch information may depend on an attribute that the user supplies. For example, many news sites have different sections on different pages. You may wish to have an attribute for the handler (e.g. “source”) that determines the URL.

Answering “no” to the question “Will your URL depend on a parameter” will cause MakeHandler to ask you the URL of the website. Answering “yes” will cause MakeHandler to ask you for the attribute name, and for a list of the possible values and their associated URLs. MakeHandler will also ask you for the default value of the attribute. Make sure the default value is one of the possible attribute values you specified.
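The idea of a URL that depends on an attribute can be pictured as a simple lookup table. The following Python sketch is not the code that MakeHandler generates; the attribute name, values, and URLs are hypothetical, and it only shows why the default must be one of the listed values.

```python
def pick_url(urls_by_value, default, attributes, name="source"):
    """Choose the URL for the attribute value, falling back to the default."""
    if default not in urls_by_value:
        raise ValueError("the default must be one of the possible values")
    value = attributes.get(name, default)
    if value not in urls_by_value:
        raise ValueError("unsupported value for %s: %s" % (name, value))
    return urls_by_value[value]

# Hypothetical values and URLs for a news site with two sections.
urls = {
    "world": "http://news.example.com/world.html",
    "sports": "http://news.example.com/sports.html",
}

print(pick_url(urls, "world", {}))
print(pick_url(urls, "world", {"source": "sports"}))
```

When the user omits the attribute, the default value selects the URL; a default that is not in the table could never be resolved, which is why MakeHandler's advice matters.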


The next step is to choose the acquisition function based on the type of information to be acquired. See the section entitled Choosing the Acquisition Function in SECTION 7.


Depending on your choice, you may be asked to supply a starting and ending pattern. See the section entitled Choosing the Starting and Ending Patterns in SECTION 7 for more information on patterns.

Some web sites place different types of information on the same web page. In this case, you can fetch information from the web page based on a handler attribute. For example, you may want to fetch links from the “recent news” section or the “older news” section of a web page.

Answering “no” to the question “Do you want the grabbed data to depend on a parameter” will cause MakeHandler to ask you for a single pair of starting and ending patterns. Answering “yes” will cause MakeHandler to ask you for the attribute name, and for a list of the possible values and their associated starting and ending patterns. MakeHandler will also ask you for the default value of the attribute. Make sure the default value is one of the possible attribute values you specified.
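As a rough picture of what starting and ending patterns do, the Python sketch below cuts out the text between two markers, choosing the markers from a handler attribute. It assumes plain literal markers and an attribute called “section”; both are hypothetical, and the real patterns described in SECTION 7 may behave differently.

```python
def text_between(page, start, end):
    """Return the text after the first `start` and before the next `end`."""
    begin = page.find(start)
    if begin == -1:
        return None
    begin += len(start)
    finish = page.find(end, begin)
    if finish == -1:
        return None
    return page[begin:finish]

# Hypothetical markers for two sections of the same page.
PATTERNS = {
    "recent": ("<h2>Recent news</h2>", "<h2>Older news</h2>"),
    "older": ("<h2>Older news</h2>", "</body>"),
}
DEFAULT = "recent"

def grab(page, attributes, name="section"):
    """Pick the pattern pair for the attribute value and extract the text."""
    start, end = PATTERNS[attributes.get(name, DEFAULT)]
    return text_between(page, start, end)

page = ("<body><h2>Recent news</h2><a href='a.html'>A</a>"
        "<h2>Older news</h2><a href='b.html'>B</a></body>")

print(grab(page, {}))
print(grab(page, {"section": "older"}))
```

Each attribute value selects its own pair of patterns, so the same page yields different data depending on what the user asks for.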


As the handler writer, you specify the times when News Clipper should fetch new data from the remote website. See the section entitled GetUpdateTimes in SECTION 7 for more information on update times.


The final step is to edit the handler file manually, adding any additional processing. News Clipper will load the new handler into the text editor, adding some additional commentary that will be stripped out when the final handler file is written.

You can add additional processing to extract the data you want from the fetched HTML, or filter out certain types of links, etc. See the section entitled Additional Processing in SECTION 7 for more information.



News Clipper provides users with flexibility when it comes to choosing how data should be displayed on web pages. This is achieved by separating data acquisition, modification, and output into distinct steps.

Below is an example tag:

<!-- newsclipper
  <input name=slashdot type=articles>
  <filter name=slashdot type=LinksAndText>
  <filter name=limit number=4>
  <filter name=map filter=limit number=200 chars>
  <output name=array numcols=2 prefix="<p>--&gt;" suffix="</p>">
-->

This tag specifies nearly everything, including values that already have defaults. According to the documentation for the slashdot handler, the first command results in an array of hashes containing information about the current Slashdot articles. The next command is a filter, which uses one of the filters in the slashdot handler. According to the documentation, the slashdot filter returns an array of strings, which is then sent to the generic limit filter to reduce the number of strings to four.

At this point, there is an array of four (or fewer) strings containing Slashdot links and text. The next command is a map filter, which applies another filter to the contents of a data structure. In this case, the map filter is applying the limit filter to the text in each item of the array. (“number=200 chars” tells the limit filter to limit the number of characters, not the number of lines, which is the default behavior for strings.)

The final step is to print the array of shortened strings.  The data is sent to the array handler, and it is printed in two columns using the selected special bullets and spacing.

The output might look something like this:

 

->Is Red Hat the Next Microsoft?

Patrick Dunn writes “On ZDNET's Smart Reseller they have a story about Red Hat maybe being a mini-Microsoft by it's business practices.” I'd guess that the 2 most common c...

->Mozilla M3 Release Available Now

Makali writes “Just took a quick peek at the Sunsite FTP mirror of ftp://ftp.mozilla.org/pub/mozilla/releases/m3 and Sunsite.doc.ic.ac.uk is up and contains tarballs for several platforms. Fetch! “ ...

->Wired on Kipling

The Dodger writes “The Kipling 'Hacker' luggage debacle gets coverage in Wired, along with slightly derogatory references to the Slashdotters' ability (or rather lack of it) to 'crack ...

->CeBIT Tidbits

MadMan2 has sent us a report from CeBIT. Little bits about bigass Samsung Dimms, Not so upgradable Palm Pilots, SuSE, AOL-Scape and Applix. Hit the link below to read MadMan2's machine g...

Figure 51: Complex Slashdot output
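The two limit steps in the example tag do different jobs: one trims the array, the other trims each string. The Python sketch below imitates them on made-up data. It is only an illustration; the function names are hypothetical, and details such as the trailing “...” seen in the sample output are not modeled.

```python
def limit_items(items, number):
    """Keep at most `number` items of a list."""
    return items[:number]

def limit_chars(text, number):
    """Keep at most `number` characters of a string."""
    return text[:number]

# Six made-up article strings, each longer than 200 characters.
articles = ["Article %d: %s" % (i, "x" * 300) for i in range(1, 7)]

step1 = limit_items(articles, 4)              # <filter name=limit number=4>
step2 = [limit_chars(s, 200) for s in step1]  # <filter name=map filter=limit number=200 chars>

print(len(step2), [len(s) for s in step2])
```

Six articles go in, four come out, and each of the four is cut to 200 characters.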

 

 

Of course, the power that News Clipper gives you to format your output can be confusing. If this is too complicated, the default filters and output of the handlers can be used. In the case of Slashdot, it would look like this:

<!-- newsclipper
  <input name=slashdot>
-->

And the default output would look like this:

 


A newsclipper tag is composed of three types of commands: <input name=...>, <filter name=...>, and <output name=...>. The first part of the command tells News Clipper how to execute the command. The name attribute tells News Clipper the fetching, filtering, or output action to be taken. Additional attributes can also be specified for the command.

Each instruction in the command is implemented in terms of a handler. Handlers are bits of Perl code that encapsulate the details of a particular operation, and are stored in files that are in subdirectories of $home/.NewsClipper/NewsClipper/Handler. Usually one handler implements one News Clipper instruction, but occasionally a single handler can fulfill multiple roles. For example, a handler may be used as an input handler that fetches data in a particular format, then immediately used to filter the fetched data into a more standard format that can be used by standard output handlers.

Data processed by News Clipper commands is typed. A type is merely the name of a group of data. For example, News Clipper uses the type Link to describe strings of the form “<a href=http://www.yahoo.com/>Yahoo!</a>”. Types can also be subtypes of other types. For example, the Link type is a subtype of the more general SCALAR type.

News Clipper will complain if you try to give data of a particular type to a handler that does not support that type. For example, consider the following commands:

<!-- newsclipper
  <input name=date>
  <filter name=hash2string format='%{date}'>
  <output name=string>
-->

This tag will cause News Clipper to complain, because the date handler outputs a string, and the hash2string handler accepts only hashes. However, if the output of the date instruction were sent to the string handler, it would work correctly because a date is a subtype of the type SCALAR:

<!-- newsclipper
  <input name=date>
  <output name=string>
-->

The input and output types are documented in the comments of the handler.pm file located in the handlers directory, and also at the handler webpage.


Every input command generates data that is then passed on to the filters and output command that follow. Each filter takes data, modifies it, and then outputs the data (potentially of a different type). Finally, every output command takes data and inserts it into the HTML file.

Every time data is created, it is “stamped” with a type that describes the data. Because filters and output handlers can only work on certain types of data, take care in choosing which commands to use and the order in which to use them. For example, trying to use a filter on data of the wrong type results in an error, and News Clipper stops.

Basic and Built-in Types

There are several basic types: SCALAR (usually called a string), ARRAY, HASH, Table and Thread. (SCALAR, ARRAY, and HASH are capitalized because they are the building blocks for other complex types such as Table and Thread.) In addition, there are several subtypes. Subtypes are types that are considered to be more specialized than the general type. For example, a Link is a subtype of SCALAR, because a link is a SCALAR, but a special kind. Likewise, Image is a subtype of SCALAR.

Complex data structures are identified by their structure and the types of the data in the structures. For example, the type “array of links” is a subtype of “array of strings” because the structures are the same, and the type Link is a subtype of SCALAR. Some input handlers create their own types. For example, the slashdot handler creates a type called Slashdot, which is a subtype of HASH, and Slashdot Poll, which is a subtype of SCALAR.
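The subtype relationships named above can be written down as a small table and checked mechanically. This Python sketch is a model for illustration, not News Clipper's type system; the table lists only the types mentioned in this section, and the function names are hypothetical.

```python
# Each type maps to the more general type it specializes.
PARENT = {
    "Link": "SCALAR",
    "Image": "SCALAR",
    "Slashdot": "HASH",
    "Slashdot Poll": "SCALAR",
}

def is_subtype(actual, expected):
    """True if `actual` is `expected` or a specialization of it."""
    current = actual
    while current is not None:
        if current == expected:
            return True
        current = PARENT.get(current)
    return False

def array_is_subtype(actual_item, expected_item):
    """An array type follows the relationship between its item types."""
    return is_subtype(actual_item, expected_item)

print(is_subtype("Link", "SCALAR"))
print(is_subtype("SCALAR", "Link"))
print(array_is_subtype("Link", "SCALAR"))
```

A Link can be used wherever a SCALAR is expected, but not the other way around, and an array of links is accepted wherever an array of strings is.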

Type Checking

News Clipper checks types as it executes the commands in the <!-- newsclipper ... --> tags. When it discovers that the current data does not match the data expected by the next command, it aborts. Consider the following tag:

<!-- newsclipper
  <input name=slashdot type=articles>
  <filter name=slashdot type=LinksAndText>
  <output name=string>
-->

The data that results from the input command is of type “Array Of Slashdot Hash”. The built-in filter that comes with the Slashdot handler accepts data of type “Array Of Slashdot Hash”, so it can be applied. The result of the filter is data that is of type “Array Of SCALAR”. Unfortunately, the output handler string expects data of type SCALAR, not Array Of SCALAR, so News Clipper stops and displays an error.

If you are ever unsure about the data you are using, you can use the “dumpdata” output handler to find out the type and the contents of the data. (See below.)
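The type check on the example tag can be traced step by step. The Python sketch below is a simplified model, not News Clipper's checker: each command is described only by the type it accepts and the type it produces, and types are compared for exact equality (subtypes are left out to keep the sketch short).

```python
# (command, type accepted, type produced) for the example tag above.
COMMANDS = [
    ("input slashdot type=articles", None, "Array Of Slashdot Hash"),
    ("filter slashdot type=LinksAndText", "Array Of Slashdot Hash", "Array Of SCALAR"),
    ("output string", "SCALAR", None),
]

def check(commands):
    """Return an error message for the first mismatch, or None if all is well."""
    current = None
    for name, accepts, produces in commands:
        if accepts is not None and accepts != current:
            return "%s expects %s but got %s" % (name, accepts, current)
        current = produces
    return None

print(check(COMMANDS))
print(check(COMMANDS[:2]))
```

The full tag fails at the output command, while the input and filter commands on their own pass the check.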


Input Handlers

There are over 200 handlers that can be used in input commands. Some handlers also perform filtering and output commands if the data that they generate is very specific to the handler. The majority of handlers, however, generate strings, arrays, and hashes that can be manipulated using generic filters and output using generic output handlers.

Keep in mind that this list may not be complete, as new filters and output handlers are constantly being added to expand the capabilities of News Clipper. Visit the handler webpage to check for new handlers. Below are a few of the more interesting acquisition handlers.

<input name=slashdot type=X>
<filter name=slashdot type=X>
<output name=slashdot style=slashdot nolink noicons
  nocomments noposter nodate nodept nodesc>

Arguably the most complex News Clipper handler, the slashdot handler can fetch either a hash of article information or the daily poll based on the “type” parameter. When used as a filter, the “type” parameter specifies whether the array of hashes should be turned into an array of links, or an array of links with descriptions. (Applying the slashdot handler as a filter to the daily poll reformats the poll.) The handler can also function as an output handler, printing the array of hashes in different formats.

<input name=lastupdated luhandler=X attributes...>

This handler determines when the data for another handler was last updated. The luhandler parameter is the name of the acquisition handler. The remaining attributes of the lastupdated handler must match those of the handler of interest, since some attributes may determine when and where data was fetched.

<input name=uexpresscomic strip=X>

This handler can fetch dozens of comics and cache them locally. Also see the unitedmediacomic handler.

<input name=moreovercategory category=X>

This handler can fetch headlines from one of MoreOver.com’s many news categories.

<input name=moreoversearch query=X>

Performs a search on MoreOver.com’s website, retrieving all headlines that match the query pattern.

<input name=rdf source=X>

These handlers allow XML news feeds such as Netscape channels to be used by News Clipper, expanding the available news sites by hundreds. Also see the rss handler.

<input name=include file=X>

This input handler reads a file and outputs the contents of the file into the output web page.

<input name=usenet server=A group=B refs=C xposts=D
  since=E articles>
<filter name=usenet filter=X>

This handler fetches articles and article information from Usenet newsgroups, returning an array of hashes containing article information. The attributes allow the news server and newsgroup to be specified, as well as the depth at which replies are ignored, the number of crossposts before an article is ignored, the time at which to begin retrieving articles, and whether the text of the articles should be fetched.

When used as a filter, the handler accepts usenet information and converts it into a Thread data type suitable for output by the thread output handler. The filter attribute specifies the filter handler that should be used to convert the hash of article information into a string (typically hash2string).

Filter Handlers

<filter name=grep words=X invert>

grep is named after the Unix command for finding lines in a file that contain a pattern. It takes a string, array, or hash, and returns the data that contains at least one of a set of words. The “invert” attribute can be used to return the data that does not contain the words. (Note that in the case of the hash, it is not the keys, but the values that are searched.)

<filter name=caps case=X>

This filter changes the capitalization of a string. It is “smart” in that it ignores HTML tags.

<filter name=selectkeys keys=X invert>

This filter takes a hash and returns a smaller hash containing only the given keys. “invert” returns the hash without the given keys.

<filter name=highlight style=X words=Y>

Highlight surrounds the specified words with HTML tags. The style is “strong” by default.

<filter name=limit number=X chars>

Accepts a string, array, or hash, and returns the same. This filter trims the number of characters, lines, items, or keys to the number specified. “chars” must be specified to treat strings as sequences of characters instead of lines.

<filter name=hash2array order=X>

hash2array takes a hash and a given key ordering, and returns an array whose items are the hash values in the specified order.
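
In Perl terms, this amounts to a hash slice. The sketch below is an illustration only: the helper name hash_to_array and the sample keys are made up, and the key ordering is passed as an array reference rather than in whatever form the order attribute uses.

```perl
use strict;
use warnings;

# Hypothetical helper: return the hash values in the order given by @$order.
sub hash_to_array {
    my ($hash, $order) = @_;
    return [ @{$hash}{@$order} ];
}

my %story = (
    url      => 'http://www.example.com/',
    headline => 'Example headline',
    date     => '07/01',
);
my $row = hash_to_array(\%story, [ 'headline', 'date', 'url' ]);
print join(' | ', @$row), "\n";
```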

<filter name=map depth=X filter=Y [...]>

Suppose the data consists of an array of strings, and the highlight filter needs to be applied to the strings. Unfortunately, highlight does not take arrays of strings. That is the purpose of this filter. The “depth” tells “map” how many levels into the data structure to go before applying the filter given by “filter”. Any additional arguments are passed on to the filter.
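
The idea of descending into the data before applying a filter can be sketched in plain Perl. This is not News Clipper's implementation; it assumes that a depth of 1 means "apply the filter to each element of the top-level array", and the helper name map_at_depth is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not News Clipper's implementation: descend $depth
# levels into nested array references, then apply $filter to what is found.
sub map_at_depth {
    my ($data, $depth, $filter) = @_;
    return $filter->($data) if $depth == 0;
    return [ map { map_at_depth($_, $depth - 1, $filter) } @$data ];
}

my $strings = [ 'linux news', 'perl news' ];
my $shouted = map_at_depth($strings, 1, sub { uc $_[0] });
print join(', ', @$shouted), "\n";
```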

<filter name=maphash key=X hashfilter=Y [...]>

maphash applies another filter to only one member of a hash.

<filter name=cacheimages maxage=X dir=Y url=Z>

When given a bit of HTML like <img src="http://somelocation/image.jpg">, cacheimages stores the image pointed to by the URL into a local cache directory such as $home/public_html/imagecache and substitutes the URL that corresponds to that directory so that the HTML looks like <img src="http://www.yourserver.com/imagecache/image.jpg">.

The image cache directory is specified by the imgcachedir option in the News Clipper configuration file, or overridden with the dir attribute of the handler. The corresponding URL is specified by the imgcacheurl option in the configuration file, or overridden with the url attribute of the handler. The length of time that the image is cached is specified by the maximgcacheage option in the configuration file, or overridden with the maxage attribute of the handler. (The age is specified in days.)

<filter name=convertdate format=X zone=Y>

This handler converts dates from one format to another, where the format is specified using a special notation. See the handler documentation for details.

<filter name=array2string prefix=X suffix=Y infix=Z>

array2string provides a quick and easy way to convert an array of strings to a string. The user can control the text printed before, after, and between each item in the array.
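
A rough Perl equivalent is shown below. It assumes that the prefix and suffix wrap each item and that the infix goes between items; the helper name array_to_string is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap each item in $prefix/$suffix and join with $infix.
sub array_to_string {
    my ($items, $prefix, $suffix, $infix) = @_;
    return join $infix, map { "$prefix$_$suffix" } @$items;
}

print array_to_string([ 'one', 'two', 'three' ], '<b>', '</b>', ', '), "\n";
```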

<filter name=hash2string format=X>

This handler provides a quick and easy way to convert a hash to a string. It is often used with the map handler to convert data from specialized types like “array of FreshMeat hashes” to more standard types like “array of String” before printing them with handlers like array.

<filter name=string2hash regexp=X labels=Y>

This handler is used to turn a string into a hash. Every ( ) subexpression in the regular expression given by the regexp attribute is converted into a value in the hash. There should be the same number of labels as ( )’s, because each label is matched to the corresponding subexpression and placed into the hash. For example, a pattern of “(1.*)#(2.*)#(3.*)” and a labels attribute of “a,b,c” will result in the pattern starting with “1” being the value of the key “a”, the pattern starting with “2” being the value of the key “b”, and the pattern starting with “3” being the value of the key “c”.
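
The pairing of subexpressions and labels can be tried out in plain Perl. The sketch below is an illustration, not the handler's code: the helper name string_to_hash and the sample input string are made up, and the labels are passed as an array reference.

```perl
use strict;
use warnings;

# Hypothetical helper: pair each parenthesized capture with a label.
sub string_to_hash {
    my ($string, $regexp, $labels) = @_;
    my @captures = $string =~ /$regexp/ or return undef;
    my %hash;
    @hash{@$labels} = @captures;
    return \%hash;
}

my $hash = string_to_hash('1abc#2def#3ghi', '(1.*)#(2.*)#(3.*)',
                          [ 'a', 'b', 'c' ]);
print "$_ => $hash->{$_}\n" for sort keys %$hash;
```

With this input, key “a” holds “1abc”, key “b” holds “2def”, and key “c” holds “3ghi”.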

<filter name=searchrep search=X replace=Y>

This powerful handler allows all occurrences of a pattern to be replaced with a string.

<filter name=resizeimage height=X width=Y>
<filter name=resizeimage factor=X>

This handler takes an Image string and returns an Image string, but with the size of the image changed. Both the new height and width, or a multiplicative scaling factor can be specified.

Output Handlers

<output name=dumpdata>

The dumpdata output handler is a very useful output handler when you are debugging your sequence of News Clipper commands. It dumps the type of the current data and then the data itself.

<output name=string>

Prints a string.

<output name=table header=X border=Y>

Takes a two-dimensional array and outputs a table having a border size as given by “border”.  The “header” allows for specifying whether the top and/or left sides of the table should be headers.

<output name=array numcols=W prefix=X suffix=Y separator=Z>

Outputs an array of strings. The “numcols” is the number of columns. The “prefix”, “separator” and “suffix” are strings to print before, between, and after each item. If prefix is “ul” or “ol”, a bulleted or numbered list is created, respectively.

<output name=thread style=X>

Takes a Thread data type, similar to those seen in discussion lists. Outputs using numbered or unnumbered lists, depending on whether the style is “ol” or “ul”.


You may notice that the content generated by News Clipper is not being updated, or that no content is present at all. In this case, check the run log in your $home/.NewsClipper/logs directory to see if there are any errors. This is a text file that you can open using your favorite text editor.

Bugfix and Functional Updates

If you find an error, then it may be time to update your handler. There are two kinds of updates to a handler: bugfix updates and functional updates. Bugfix updates are updates to a handler that are guaranteed not to break existing commands. Functional updates change the handler in such a way that existing News Clipper commands may break. For example, the handler may change the type of the output data.

News Clipper is designed to automatically download and install bugfix handlers if the auto_download_bugfix_updates value in the configuration file is set to “yes”. If a bugfix update is available, News Clipper will recover automatically within a few hours of detecting the problem.

Functional updates are potentially dangerous because they may break your existing News Clipper commands. To force News Clipper to check for functional updates, run News Clipper with the -n flag.

Updating a Handler Manually

If you want to update a handler manually, first check the handler webpage on the News Clipper website. Compare the version number of the handler on the site to the version you have installed, keeping in mind that the version “1.3.5” is represented as “1.0305” in the handler database. You can download the new handler using your browser from the link provided in the handler description. Save the handler into the proper subdirectory of $home/.NewsClipper/NewsClipper/Handler, overwriting your old handler.
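
The conversion between the two version notations can be reproduced with a few lines of Perl, which is handy when comparing versions by hand. The sketch uses the same sprintf approach as the $VERSION line found in handler files; the function name encode_version is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a dotted version such as "1.3.5" into the
# decimal form "1.0305" (each part after the first padded to two digits).
sub encode_version {
    my ($dotted) = @_;
    my @parts = ($dotted =~ /\d+/g);
    return sprintf('%d.' . ('%02d' x $#parts), @parts);
}

print encode_version('1.3.5'), "\n";   # 1.0305
print encode_version('0.4.1'), "\n";   # 0.0401
```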

If There is No Update…

Sometimes a handler is broken but there is no update. In this case, contact the handler maintainer, whose email is listed in the text of the handler file as well as in the description of the handler on the website.


There are over 200 handlers that can be used in input commands. Some handlers also perform filtering and output commands if the data that they generate is very specific to the handler. The majority of handlers, however, generate strings, arrays, and hashes that can be manipulated using generic filters and output using generic output handlers.

One way to get information about the handlers on the system is to view them using a text editor. Usually they can be found in the directory $home/.NewsClipper/NewsClipper/Handler/Acquisition/. There is documentation at the top of each handler.

If an Internet connection is present, a better way to view the same information is to visit the handlers webpage. In addition to viewing information about the handlers already installed, information about other handlers can be viewed. There are sample pages set up which demonstrate the default behavior of the handlers.


This section shows several examples of News Clipper tags, which should help in understanding how commands work together. After each example is a short description explaining the commands.

<!--newsclipper
  <input name=weatherchannel zip=27612 numdays=5>
  <filter name=selectkeys keys=forecast>
  <filter name=cacheimages maxage=7
    dir=comm01/archane/public_html/DailyCache
    url=http://www.archane.com/DailyCache/>
  <output name=string>
-->

This tag uses selectkeys to choose the forecast from the output of the weatherchannel handler. Images in the forecast are then cached in the directory comm01/archane/public_html/DailyCache, and the data is modified to use the URL http://www.archane.com/DailyCache/ for images. During the caching process, any images older than 7 days are deleted.

<!--newsclipper
  <input name=bbcnews category=business>
  <filter name=map filter=string2hash
    regexp='<A HREF="([^\"]+)">(.*?)</A>'
    labels='url headline'>
  <filter name=map filter=hash2string
    format='URL:%{url} headline:%{headline}'>
  <output name=array numcols=1>
-->

For each BBC news link in the array returned by bbcnews, string2hash is used to turn the string into a hash containing the URL and headline. Then the hash2string filter is applied to each hash in the resulting array, creating a line with the URL and headline. Finally, the output is printed as a single column.

<!--newsclipper
  <input name=bbcnews category=business>
  <filter name=map filter=searchrep
    search='(?i)<a ' replace='<a target=_top '>
  <output name=array numcols=1>
-->

This tag adds “target=_top” to every link returned from bbcnews, and prints the results as a single column.

<!--newsclipper
  <input name=linuxtoday>
  <filter name=map filter=hash2string
    format='<a href="%{url}">
    %{name}</a><br>%{description}'>
  <filter name=limit number=20>
  <output name=array numcols=2>
-->

This tag takes the array of hashes returned by linuxtoday and converts it to an array of links. (“url”, “name” and “description” are keys in the hash.) The number of links is limited to 20, and the output is printed in two columns.

<!--newsclipper
  <input name=memepool>
  <filter name=map filter=maphash key=description
    hashfilter=limit number=30 chars>
  <filter name=map filter=maphash key=date
    hashfilter=convertdate format="%m/%d %H:%M:%S">
  <filter name=map filter=hash2string
    format="%{description}<br>%{date}">
  <output name=array numcols=3>
-->

The description in each hash of information is limited to thirty characters, and the date is reformatted as well. Then the array of hashes is converted into an array of strings, where the description is followed by the date. The result is printed in three columns.

<!--newsclipper
  <input name=slashdot>
  <filter name=map filter=hash2string
    format='[%{department}]<a
    href="%{url}">%{name}</a><br>%{description}'>
  <filter name=map filter=limit number=100 chars>
  <filter name=limit number=10>
  <output name=array numcols=2
    prefix="<font size=-1>-&gt;" suffix="</font><br>">
-->

The normal Slashdot output is converted to an array of strings in a custom format. Each string is limited to 100 characters, and the number of strings is limited to 10. The result is printed in two columns with special bullets and formatting.


When editing a News Clipper input file, it is sometimes useful to add “dummy content” that allows you to see how the real output will look. In this case, you can use the special News Clipper comment tag:

<!-- newsclipper startcomment -->
This is dummy content.
Which will be removed by <a
href="http://www.newsclipper.com">News Clipper</a>
during processing of the input file.
<ul>
  <li> <a href="http://dailynews.yahoo.com/h/ap/20000701/ts/clinton_gas_prices_1.html">Clinton Attacks GOP on Energy</a>
  <li> <a href="http://dailynews.yahoo.com/h/ap/20000701/ts/denmark_concert_deaths_10.html">8 Dead in Denmark Pearl Jam Concert</a>
</ul>
<!-- newsclipper endcomment -->
<!--newsclipper
  <input name=yahootopstories>
  <filter name=map filter=hash2string
    format='<a href="%{url}">%{headline}</a>'>
  <filter name=limit number=2>
  <output name=array numcols=1>
-->

Everything between the starting tag of the comment <!-- newsclipper startcomment --> and the ending tag of the comment <!-- newsclipper endcomment --> will be removed by News Clipper during processing.

 


SECTION 6
Running News Clipper


The easiest way to run News Clipper is to edit the configuration file to set the input and output files, and then set the program up to run periodically. During web page development, however, it is better to run News Clipper from the command line. If more information is needed, News Clipper can also be run in debug mode using the -d flag, or with the -v flag, which will mirror output to the screen as well as to the output file.


Logging

When News Clipper runs it logs errors and other routine information to its run log. This log is a text file named run.log, and is located in the directory $home/.NewsClipper/logs. Similarly, when you run News Clipper in debug mode, debugging information is stored in a debug log file named debug.log.

Each log entry begins with a line that says something like “News Clipper 1.33 started: Sat Apr  7 17:47:07 2001”. Additionally, any errors encountered while processing News Clipper will result in a message in the output file which references a log entry in the run log.

The log files are rotated to files named “run.log.0”, “run.log.1.gz”, etc. whenever the log file size exceeds the maximum size specified in the configuration file. Files ending with .gz are compressed in the gzip format. You will need a tool like Unix’s gunzip utility to uncompress the files for viewing.
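
If gunzip is not available, Perl itself can read the compressed logs. The sketch below assumes a Perl installation that includes the IO::Compress::Gzip and IO::Uncompress::Gunzip modules (these are not part of News Clipper). To stay self-contained, it compresses a sample log line in memory instead of reading a real rotated log.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Build a small gzip-compressed buffer in memory to stand in for a rotated log.
my $entry = "News Clipper 1.33 started: Sat Apr  7 17:47:07 2001\n";
gzip(\$entry => \my $compressed) or die "gzip failed: $GzipError\n";

# Decompress it again. Passing a file name such as 'run.log.1.gz' instead
# of \$compressed would read the rotated log from disk.
gunzip(\$compressed => \my $restored) or die "gunzip failed: $GunzipError\n";
print $restored;
```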

For more information about configuration options related to logging, see the section entitled Configuring the Software.


If there are problems with News Clipper, try running it in debug mode. This mode can be enabled with the -d flag on the command line. In addition to normal output, debug generates quite a bit of extra information about the execution of the program.

When running in debug mode, debug messages are saved in the file debug.log in the $home/.NewsClipper/logs directory. Additionally, the output created by News Clipper is sent to the screen instead of being saved to the output file. If the output of the program goes by too quickly and you cannot scroll back to read text that has moved off the top of the screen, you can redirect the output to a file using a command like NewsClipper -d > output.txt.

Be sure to send the debug log file when submitting a bug report or asking for technical support.


usage: NewsClipper.pl [-adnrv] [-i inputfile] [-o outputfile] [-c configfile] [-e command]

-i           Override the input file specified in the configuration file.

-o           Override the output file specified in the configuration file.

-e           Run the specified handler and output the results. (Overrides -i and -o.)

-c           Use the specified file as the configuration file, instead of NewsClipper.cfg.

-a           Automatically download all handlers that are not installed locally.

-n           Check for new versions of handlers while processing the input file. (This will cause News Clipper to download handlers that may break existing News Clipper commands.)

-r           Reload the content from the proxy server even on a cache hit. This prevents News Clipper from using stale data when constructing the output file.

-d           Enable debug mode, which prints extra information about the execution of News Clipper. Output is sent to the screen instead of the output file.

-P           Pause after completion.

-v           Verbose output. Output a copy of the information sent to the output file to standard output. (Does not work in Windows.)

-C           Clear the cache, handler state, News Clipper state, or debug and run logs. When run with this flag, News Clipper will ask the user if the HTML cache should be cleared, then the handler-specific information, then the News Clipper information, and lastly the log files. No files are processed when this flag is used.

-H           Allows the home directory to be specified on the command line.


Although News Clipper can be run manually only when needed, the best way to get up-to-date information is to run the program at regular intervals. Doing this typically requires a third-party program that is either already present on the platform or must be installed. News Clipper does not have built-in support for automatic execution because the method for achieving this varies depending on the operating system.

Windows 95/98/2000/ME

On Windows 95, 98, 2000, or ME, one way to run News Clipper periodically is to use the Task Scheduler. Follow these steps:

1.        Start Windows Explorer

2.        If necessary, expand the tree under “My Computer” by clicking the “+” next to it.

3.        Click on “Scheduled Tasks”, near the bottom of the subtree under My Computer.

4.        Double-click the “Add Scheduled Task” icon on the right pane of the Explorer window.

5.        Click “Next”.

6.        Click “Browse”, and select “News Clipper” from the directory in which the software is installed.

7.        Select the times at which to run the program.

8.        Check the “Open advanced properties...” checkbox and click “Finish”.

9.        Add any flags to the “Run” line that you want to use when running News Clipper, and click “OK”.

Windows NT

Windows NT has a command line scheduler called “at”. While this scheduler can be used to automatically run News Clipper, we recommend using a third-party scheduler that has a graphical interface. A list of third-party schedulers is at http://www.softseek.com/Utilities/Application_Schedulers/.

Unix or Linux

On Unix-style systems, the best way to run News Clipper periodically is with the “cron” utility. Cron runs commands in a restricted environment: it basically runs /bin/sh and does not process any .profile files. This means that environment variables such as PATH will not be set. The usual workaround is to put the commands in an sh script that “sources” your profile and then executes the command, like so:

. /etc/profile
. /home/username/.profile
/path/to/command/thecommand

If you want to set the NEWSCLIPPER environment variable (for specifying a global configuration file), you would have to use the above technique. Alternatively, the GNU version of cron that comes with operating systems like Linux allows you to set environment variables in the crontab file. You can simply insert "NEWSCLIPPER=/some/path" directly before you call News Clipper.

To set up a cron job, follow these steps:

1.        Create a file in your home directory called .crontab, or edit the existing one.

2.        Add the following line to the .crontab file:

0 7,10,13,16,19,22 * * * /path/NewsClipper.pl

3.        Replace path with the complete path to News Clipper. Make sure the entire text fits on one line.

4.        Save the file and exit the text editor.

5.        On the machine on which News Clipper will run, type crontab .crontab.

To test your cron job, first run date to get the current system time. Then re-edit the .crontab file and change the minute field (“0” above) to a time a couple of minutes from the current system time, and add the current hour to the hour field. Then re-run crontab .crontab. After the time has elapsed, check the date on an output file with ls -l outputfile. The last modified time should have changed. If something goes wrong with your cron job, an email will be automatically sent to your account. When you have verified that the cron job is working correctly, re-edit .crontab to restore the previous time values and re-run crontab .crontab.

Run man cron and man crontab for more information about the cron utility.


Sometimes you may not have telnet access to your web server, or may not have the ability to install programs such as News Clipper on it. In such cases, you can install News Clipper on a different machine, and have it transfer the output files to your web server. In fact, you can send each output file to a different server.

In the News Clipper configuration file is a section that allows you to specify the eventual destination of each output file. At installation, there is no FTP information specified, which means that no output files are sent to a remote server. However, you can add information for each file that specifies the remote server name (or IP address), username, password, and remote directory.

After News Clipper processes each output file, it will consult this information and attempt to send the output file via FTP to the remote server. If you do not want to FTP a particular output file, you can simply use empty brackets (“{}”) in the configuration information for that file. See the section entitled Configuring the Software.


News Clipper has the ability to automatically email output files to one or more users. This is perfect for using News Clipper to create a daily HTML newsletter for your users.

In the News Clipper configuration file is a section that allows you to specify the information for emailing each output file. At installation, there is no email information specified, which means that no output files are emailed.

After News Clipper processes each output file, it will consult this information and attempt to email the file. If you do not want to email a particular output file, you can simply use empty brackets (“{}”) in the configuration information for that file. See the section entitled Configuring the Software.

 

 



Writing handlers is considerably simplified by the MakeHandler utility. It semi-automatically generates handlers that grab text, HTML, links or images. In some special cases, however, more is needed than what MakeHandler can offer.

In these cases, the handler that is generated by MakeHandler may need to be edited by hand. Familiarity with the Perl programming language, or programming languages in general, is usually needed to accomplish this.

Users report that MakeHandler is easy to use for the cases for which it was designed. People who know Perl also report that editing the handler by hand is not that difficult. If trouble is encountered, please contact technical support or the mailing list. If you have no programming experience at all, and cannot make a handler using MakeHandler, you should consider hiring a consultant to help you. See http://www.newsclipper.com/services.html.

Every handler operates in the context of an interface that News Clipper provides. This interface provides functions for acquiring remote information, as well as functions for manipulating the acquired information. This chapter provides a tutorial on how to go about developing handlers.


Acquisition Handler Components

Below is the complete text of a skeleton acquisition handler, with commentary describing the various parts of the handler. MakeHandler would automatically generate the bulk of this computer code. We are presenting this handler in order to discuss the various components. The following subsections provide more detail about the key aspects of handlers.

# -*- mode: Perl; -*-

package NewsClipper::Handler::Acquisition::HANDLERNAME;
use vars qw( @ISA $VERSION %handlerInfo );

$handlerInfo{'Author Name'}              = 'Joe Shmoe';
$handlerInfo{'Author Email'}             = 'joe@shmoe.com';
$handlerInfo{'Maintainer Name'}          = 'Bob Maintainer';
$handlerInfo{'Maintainer Email'}         = 'bob@maintainer.com';
$handlerInfo{'Description'}              = <<'EOF';
This handler does not work. It's just a sample.
EOF
$handlerInfo{'Category'}                 = 'General';
$handlerInfo{'URL'}                      = <<'EOF';
http://www.thenewssource.com/
EOF
$handlerInfo{'License'}                  = 'GPL';
$handlerInfo{'For News Clipper Version'} = '1.18';
$handlerInfo{'Language'}                 = 'English';
$handlerInfo{'Notes'}                    = <<'EOF';
This handler was originally written by Joe Shmoe, and is now maintained
by Bob Maintainer.
EOF
$handlerInfo{'Syntax'}                   = <<'EOF';
<input name=HANDLERNAME source=X>
  Returns an array of links
  X: either headlines or sports (the default is headlines)
EOF

The package declaration describes the handler type (in this case an acquisition handler) and the handler name. The handler information block provides information about the handler, such as the handler’s maintainer, the language of the data it acquires, and the syntax describing how to use the handler.

From the syntax, we can see that the handler has an optional “source” attribute whose value can either be “headlines” or “sports”, and defaults to headlines.

use strict;
use NewsClipper::Handler;
@ISA = qw(NewsClipper::Handler);

# - The first number should be incremented when a change is made to the
#   handler that will break people's input files.
# - The second number should be incremented when a change is made that won't
#   break people's input files, but changes the functionality.
# - The third number should be incremented when only a bugfix is applied.

$VERSION = do {my @r=('0.4.1'=~/\d+/g);sprintf "%d."."%02d"x$#r,@r};

This section contains details that aren’t relevant to most users, except for the version number. The version number of this handler is 0.4.1.

sub ProcessAttributes
{
  my $self = shift;
  my $attributes = shift;
  my $handlerRole = shift;

  $attributes->{'source'} = 'headlines'
    unless defined $attributes->{'source'};

  unless ($attributes->{source} eq 'headlines' ||
          $attributes->{source} eq 'sports')
  {
    error "The \"source\" attribute for handler \"HANDLERNAME\" " .
      "should be either \"headlines\" or \"sports\".\n";
    return undef;
  }

  return $attributes;
}

The ProcessAttributes subroutine is used to provide attributes their default values, and verify that the attributes are valid. This subroutine is executed before the URL is computed, and before default handlers are computed (see below). The handler role determines how the handler is being used – the valid values are “input”, “filter”, and “output”.

This subroutine demonstrates the use of the error subroutine, which should be used any time an error occurs in the handler and a message is to be displayed to the user.

sub GetDefaultHandlers
{
  my $self = shift;
  my $attributes  = shift;

  my $returnVal = <<'  EOF';
    <filter name='limit' number='10'>
    <output name='array'>
  EOF

  return $returnVal;
}

GetDefaultHandlers describes the default filter and output handlers for the acquisition handler.

sub ComputeURL
{
  my $self = shift;
  my $attributes = shift;

  my $source = $attributes->{source};

  my %urlMap = (
    'headlines'  => 'hl/',
    'sports'     => 'sp/',
  );

  my $url = 'http://www.thenewssource.com/' . $urlMap{$source};

  return $url;
}

The ComputeURL subroutine is used to compute the URL from which to acquire the data. In this case, the URL is computed using the “source” attribute by looking up the URL ending in %urlMap and appending it to the base URL.

sub Get
{
  my $self = shift;
  my $attributes = shift;

  my $source = $attributes->{source};

  my %patternMap = (
    'headlines'  => ['headlines start','headlines end'],
    'sports'     => ['sports start','sports end'],
  );

  my $url = $self->ComputeURL($attributes);

  my $data = &GetLinks($url,$patternMap{$source}[0],$patternMap{$source}[1]);
  return undef unless defined $data;

  @$data = grep {$$_ !~ /<img/i} @$data;

  return $data;
}

The Get subroutine does most of the work of a handler. It calls ComputeURL to determine the URL from which to fetch information, and then calls one of the built-in routines to fetch the data. In this case, the routine is GetLinks, which is called with a starting and ending pattern that is dependent on the source of the news.

Additional processing can also occur in the Get subroutine. For example, this handler removes links to images from the array of links that is returned after GetLinks is called.
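
The image-removal step can be tried outside News Clipper. In the sketch below, the array of string references is a made-up stand-in for what GetLinks returns; only the grep line is taken from the handler above.

```perl
use strict;
use warnings;

# Made-up stand-in for the array of string references that a routine such
# as GetLinks returns; these links are illustrative only.
my @raw = (
    '<a href="story1.html">First story</a>',
    '<a href="photo.html"><img src="photo.gif"></a>',
    '<a href="story2.html">Second story</a>',
);
my $data = [ map { \$_ } @raw ];

# Same test as in Get: keep only links whose text has no <img tag.
@$data = grep { $$_ !~ /<img/i } @$data;

print ${$_}, "\n" for @$data;
```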

sub GetUpdateTimes
{
  my $self = shift;
  my $attributes = shift;

  return ['2,5,8,11,14,17,20,23'];
}

The GetUpdateTimes subroutine encodes the times at which the content on the remote server needs to be fetched.

1;

Every handler ends with a “1;”.

Other Handler Components

Filter and output handlers have components in addition to the ones listed above. Normally users will not need to know about these components. Advanced users will want to read this section, but other users should skip it.

sub FilterType
{
  my $self = shift;
  my $attributes = shift;
  my $data = shift;

  return '$Link | @Link';
}

Filter handlers have a FilterType function that describes the type of data that the handler accepts. In this case, the type description (or type signature) of data accepted by the handler is “$Link | @Link”, which means that the handler accepts either a link, or an array of links.

Type signatures describe the structure and data of a complex data structure. For example, “$” means a scalar (e.g. a string), “@” means an array, and “%” means a hash. The name of the subtype can follow the main type symbol for scalars. For example “$Link” is a link.

Complex data structures can be described using nested symbols. For example, “@$Link” indicates an array of links. One can also describe alternatives using the “|” symbol: “@Link | %” means an array of links or a hash. Mandatory elements can be expressed with the “&” symbol: “@($Link & %Slashdot)” means an array consisting of at least one link and at least one Slashdot hash. (Note the use of parentheses to group items.)

sub Filter
{
  my $self = shift;
  my $attributes = shift;
  my $data = shift;

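  if (TypesMatch($data,'@Link'))
  {
    # Modify each link in place. (Using grep here would drop any
    # link that does not match the pattern.)
    foreach my $link (@$data)
    {
      $$link =~ s/<a/<a target="_top"/si;
    }
  }
  elsif (TypesMatch($data,'$Link'))
  {
    $$data =~ s/<a/<a target="_top"/si;
  }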

  return $data;
}

Filter handlers have a Filter function that accepts data, transforms it in some way, and then returns the data. In this example we use the TypesMatch function to determine to which of the two possible types the data conforms.

sub OutputType
{
  my $self = shift;
  my $attributes = shift;
  my $data = shift;

  return '$Link | @Link';
}

Output handlers use the OutputType function to describe the type of data they accept.

sub Output
{
  my $self = shift;
  my $attributes = shift;
  my $data = shift;

  if (TypesMatch($data,'@Link'))
  {
    foreach my $link (@$data)
    {
      print "URL: $$link<br>\n";
    }
  }
  elsif (TypesMatch($data,'$Link'))
  {
    print "URL: $$data<br>\n";
  }
}

Output handlers have an Output function that accepts data and formats it for output.


Finding news to view every day is usually easy, but finding the best format for that information may not be. For example, the National Weather Service weather is available in many formats, but some users prefer “raw” NWS text, which is easier to parse. Conversely, others might prefer pictures, which means they should find a weather site that prints the NWS weather along with images.

Sometimes sites have “low graphics” versions of their web pages. This can be used to grab the information of interest, and then filter the results to have them point to the “high graphics” web pages. For example, some news sites simply use “low” or “hi” in the URL to distinguish the two types of web pages. Fetching the low graphics links and then replacing “low” with “hi” may be easier than extracting the links directly from the high graphics web page.

Other web sites actually provide a special back-end specifically designed to make the job of extracting information easier. They realize that the more people link to their sites and use their content, the more traffic they generate. This is the exception more than the rule.

One special type of back-end is one that is based on XML, a standard for interchange of data. The rss and rdf input handlers are designed to allow you to easily extract links from sites that use XML, and they significantly expand the set of news sources available to News Clipper. Lists of sites that export XML files with news information are located at http://www.xmltree.com/, http://static.userland.com/myUserLandServices/sericeList2.xml, and http://theweb.startshere.net/channels.phtml. See the documentation of the rdf and rss handlers for more information.

When deciding to build a handler for a specific site, avoid creating individual handlers for the different departments of a web site. Instead, try to exploit commonality in the web pages. Create one handler with a “source” attribute that allows people to select the department they want.


To keep the quality of handlers high, please follow these guidelines:

·         Every data element (and internal data element of a complex data structure) is a reference – Get and Filter functions return references to strings, arrays, and hashes, and arrays and hashes contain references.

·         Whenever possible, provide defaults for any attributes. The handler should generate output when run as “NewsClipper -e yourhandler”. (This is how the example web page for the handler will be generated when you submit the handler to the database.) When defaults are not possible, print an error message from within ProcessAttributes and return undef.

·         If there is a problem, use the error function to log the error, and do not insert visible text into the output.

·         While an acquisition handler can operate as a filter or output handler, try to avoid writing a Filter or Output function unless the data structure is totally unique.

·         If the data structure is unique, provide filters that let people translate it into common data structures. For example, the slashdot handler returns an array of hashes, and has a Filter function that can convert data of the type “array of Slashdot hashes” to “array of links”.

·         If writing a filter, be careful to change the output type if necessary. For example, if the filter turns an array of hashes into an array of links, “bless $link,’Link’” for each link in the array before returning the data. (But don’t bless the array as “ArrayOfLinks” or anything like that.)

·         Try to use the other built-in filters whenever possible. (See the uexpresscomic handler for an example.) News Clipper has a helper function called RunHandler (see below) which you can use to invoke other handlers to process the data.

·         Always include the line “return undef unless defined $data” after a call to GetHtml, GetText, etc., since these functions return undef when they fail, and you should too.

·         If a web site has common formatting, consider using a “source” parameter to choose among the different data types. (See the maximumpc handler, for example.)

·         Always return “clean” HTML without unopened or unclosed tags, like <b> but no </b>. See TrimOpenTags, as well as StripTags.

·         GetUrl is rarely the right way to grab HTML, because it does not make links absolute. Use GetHtml($url,'^','$') instead.

·         Try to specify the beginning of document (“^”) and end of document (“$”) for the start and end patterns of the acquisition functions whenever possible. Experience has shown that when handlers break, it’s usually because the start or end pattern doesn’t work anymore. A good strategy is to use “^” and “$” to grab everything on the page, identify something that is unique about the links or other data you are trying to capture, and weed out the results that do not match.

·         When checking attributes, do something like if (lc($attributes->{'source'}) eq 'headlines') to make the attribute case insensitive.

·         Try to make regular expressions robust. Generally, the longer the pattern, the greater the chance it will fail. Also, try to store matches in variables one at a time. If you try to match many items with one pattern, none of them will be captured if any part of the pattern does not match.

·         Every filter must have a FilterType function that returns a type specifier that says what types of input it can handle. Likewise, any handler with an Output function must also have an OutputType function.
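Several of the guidelines above can be seen together in a short sketch. It is not taken from a real handler: FakeGetLinks is a stand-in for GetLinks so that the code runs on its own, SketchGet is a plain function rather than a real handler's Get, and the URLs and the “source” values are hypothetical.

```perl
use strict;
use warnings;

# Stand-in for News Clipper's GetLinks. Like the real acquisition functions
# as described in this guide, it returns undef on failure and otherwise a
# reference to an array of references to strings. The links are made up.
sub FakeGetLinks {
  my ($url) = @_;
  return undef unless defined $url;
  my @links = (
    '<a href="http://www.example.com/news/a.html">A</a>',
    '<a href="http://www.example.com/about.html">About</a>',
  );
  return [ map { my $link = $_; \$link } @links ];
}

sub SketchGet {
  my ($attributes) = @_;

  # Compare attribute values case-insensitively, with a default.
  my $source = lc($attributes->{'source'} || 'headlines');
  my $url = $source eq 'headlines'
    ? 'http://www.example.com/headlines.html'
    : 'http://www.example.com/all.html';

  my $data = FakeGetLinks($url);
  # Acquisition functions return undef when they fail; pass that along.
  return undef unless defined $data;

  # Keep only the links of interest, then mark each element as a Link.
  # The array itself is not blessed.
  @$data = grep { $$_ =~ m{/news/}i } @$data;
  bless $_, 'Link' foreach @$data;

  return $data;
}

my $result = SketchGet({ 'source' => 'HEADLINES' });
print "$$_\n" foreach @$result;
```

The filter grabs everything first and weeds out afterwards, which follows the advice above to prefer “^” and “$” over fragile start and end patterns.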


First, find the web page that has the data. Then, decide what type of information News Clipper will retrieve, which will indicate which acquisition function to use:

GetUrl: Grabs all the content from a URL, in totally raw form. Usually this is used to grab a text file.

GetText: Grabs text data from a block of HTML, stripping HTML tags out.

GetHtml: Grabs a block of HTML from a URL's content, making links absolute rather than relative.

GetImages: Grabs images from a block of HTML, making links absolute.

GetLinks: Grabs hyperlinks from a block of HTML, and makes them absolute.

 


Suppose GetLinks is used to get the links off a web page, but only a few of those links are wanted. Consider the following HTML:

<html>
<head><title>Title</title></head>
<body>
<p>This is some text
An <a href="unwanted.html">unwanted link</a>.
<!-- Insert links here -->
<a href="/news/somewhere.html">Somewhere</a><br>
<a href="/news/somewhereelse.html">Somewhere Else</a><br>
<!-- End links -->
An <a href="mailto:webmaster@asdf.com">email link</a>.
</p>
</body>
</html>

If the HTML designers were nice enough to use the comments shown, simply use "Insert links here -->" and "<!-- End links" as the start and end patterns. Otherwise, some other marker text will need to be found.

The starting and ending patterns are expressed as Perl regular expressions. For example, “.” matches any single character, and “a.*b” matches an “a” followed by any number of characters (including an “a” or “b”) followed by a “b”. For more information about Perl’s support for regular expressions, run perldoc perlre.

Each of the acquisition functions works by first searching for the first match to the start pattern, and then searching for the first match of the end pattern after the start pattern. To find good starting and ending patterns, try the following.

1.        Use the "View Source" option of the web browser to see the HTML.

2.        Search for the information News Clipper should grab.

3.        Look for something right above the information that can be used as the start pattern. (The GetLinks or GetImages functions are less particular, since they generally ignore any extra items at the beginning of the grabbed data.)

4.        If it is a simple bit of text, and not a full-blown regular expression, scroll to the top of the HTML and use the browser's find feature to see if the chosen pattern shows up earlier in the HTML.

5.        Now go to the end of the content to be grabbed and find a good ending pattern.

6.        As before, use the browser's find feature to make sure the end pattern does not show up somewhere in the middle of the content to be grabbed.
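The search order described above (first match of the start pattern, then the first match of the end pattern after it) can be imitated in plain Perl to try out candidate patterns before putting them in a handler. The extract_between function below is a hypothetical helper, not part of News Clipper, and it is an assumption that the matched start and end text should be left out of the result.

```perl
use strict;
use warnings;

# Hypothetical helper that mimics the documented search order.
sub extract_between {
  my ($text, $start, $end) = @_;

  # Find the first match of the start pattern and remember where it ends.
  return undef unless $text =~ /$start/g;
  my $from = pos($text);

  # Find the first match of the end pattern after the start pattern.
  return undef unless $text =~ /$end/g;
  my $to = $-[0];    # offset where the end pattern's match begins

  return substr($text, $from, $to - $from);
}

my $html = <<'END_HTML';
<p>An <a href="unwanted.html">unwanted link</a>.
<!-- Insert links here -->
<a href="/news/somewhere.html">Somewhere</a><br>
<!-- End links -->
</p>
END_HTML

my $block = extract_between($html, 'Insert links here -->', '<!-- End links');
print $block;
```

If either pattern fails to match, the helper returns undef, which makes it easy to spot a start or end pattern that no longer fits the page.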

Unfortunately, start and end patterns often change on web sites, which means a handler that depends on them will break more easily. A better strategy is to see if the links have anything in common, fetch every link in the web page, and then filter out the unwanted ones.

For example, in the above sample web page, we see that all the interesting links have the string “/news/” in them. As a result, we can tell GetLinks to get every link on the page:

my $data = GetLinks($url,'^','$');

and then extract only the ones with “/news/” in them:

@$data = grep { $$_ =~ /\/news\//i } @$data;

Be sure to use “(?i)” at the start of patterns to make them case insensitive. “\n” can be used to indicate a newline in the pattern.
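The following sketch runs the same filter on stand-in data, so it can be tried without News Clipper. The array mimics what GetLinks might return for the sample page (a reference to an array of references to strings); the absolute URLs and the upper-case “NEWS” variant are made up to show the effect of “(?i)”.

```perl
use strict;
use warnings;

# Stand-in for the result of fetching every link on the sample page.
my @found = (
  '<a href="http://www.example.com/unwanted.html">unwanted link</a>',
  '<a href="http://www.example.com/news/somewhere.html">Somewhere</a>',
  '<a href="http://www.example.com/NEWS/somewhereelse.html">Somewhere Else</a>',
  '<a href="mailto:webmaster@asdf.com">email link</a>',
);
my $data = [ map { my $link = $_; \$link } @found ];

# "(?i)" at the start of a pattern has the same effect as a trailing /i
# modifier, which is useful when the pattern is passed around as a string.
my $wanted = '(?i)/news/';
@$data = grep { $$_ =~ /$wanted/ } @$data;

print "$$_\n" foreach @$data;
```

Both news links survive the filter, including the one whose path is written in upper case.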


To specify when the server updates its information, add a GetUpdateTimes function. This function tells News Clipper when to refresh its cached data. For example, when making a handler for a daily comic that changes at 6 am PST every day, consider using “7”.

It is important to set the times as close as possible to the actual update time. For example, if the data gets updated at 1am PST, and the update is set to 3am PST, visitors in England will have stale data between 9am and 11am. On the other hand, setting the time too early could mean getting the data from the day before.

It pays to be a little conservative here. Specifying every hour of the day means lots of people will be hitting the remote server when they probably are not even looking at their News Clipper web page.

Date specifications are of the form “[day] hour,hour,hour [time zone]”. If the day is omitted, every day is assumed. If the time zone is omitted, Pacific Standard Time is assumed. A special value of LOCAL_TIME_ZONE can be used to specify the local time zone. If no GetUpdateTimes function is provided, the default of “2,5,8,11,14,17,20,23 PST” is used.

The days are: sun, mon, tue, wed, thu, fri, sat. Multiple times can be specified, for example:

sub GetUpdateTimes
{
  return ['1 EST','mon 6,8 EST','tue 16 CST','20'];
}

will update every day at 1am EST, Mondays at 6am and 8am EST, Tuesdays at 4pm CST, and every day at 8pm PST.

You can specify “always” as the update time to have News Clipper update the data every time it is run. However, please do not do this unless the data being queried really does change by the minute. Headlines, for example, do not change constantly, and specifying “always” will cause News Clipper to hit the remote server repeatedly, which may prompt the system administrators to send you a nasty email.
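As a minimal sketch of the daily comic case mentioned earlier, the function below asks for one refresh a day, an hour after the comic changes. The second function is hypothetical and only shows a day-specific entry; a real handler has a single GetUpdateTimes function.

```perl
use strict;
use warnings;

# A daily comic that changes at 6 am PST: refresh at 7, every day.
# With no day and no time zone given, every day and PST are assumed.
sub GetUpdateTimes {
  return ['7'];
}

# Hypothetical second example: a weekly column published on Friday
# mornings, Eastern time. The name is for illustration only.
sub GetWeeklyUpdateTimes {
  return ['fri 9 EST'];
}

print join('; ', @{ GetUpdateTimes() }), "\n";
print join('; ', @{ GetWeeklyUpdateTimes() }), "\n";
```

Keeping the list this short means the remote server is contacted only when new data is likely to be there.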


News Clipper provides additional functionality for handler writers. This includes the following functions:

RunHandler($handlerName,$handlerType,$data,$attributes) : runs the handler specified by $handlerName as the type specified by $handlerType (“input”, “filter”, or “output”). $data is used to pass the data to the filter or output handler to be run, or undef should be used if the handler to be run is an input handler. Finally, $attributes should contain the hash of attributes for the handler.

TypesMatch($data,$typeSignature) : compares the actual type signature of $data to the type signature specified by $typeSignature. The function is useful for filter and output handlers that accept more than one kind of data.

MakeSubtype($subType,$baseType) : makes one type a subtype of another. For example, a handler that creates a hash of information from a website should call “MakeSubtype(’Name’, ’HASH’);” to let News Clipper know that data of type “Name” can be used wherever a hash is expected. As the hashes are created, they should be declared as being of type Name by calling bless: “bless \%hash, ’Name’;”.

dprint($text) : outputs a message to the debug log if debugging is enabled.

error($text) : logs an error that will be saved to the run log file, and which will result in an error comment in the output HTML.

ExtractText($text,$beginPattern,$endPattern) : extracts text between the beginning and ending patterns.

MakeLinksAbsolute($baseurl,$html) : Finds all “a href=” and “img src=” tags in the HTML and makes the URLs absolute.

EscapeHTMLChars($text) : Escapes all “<”, “>”, and “&” characters in the text.

lprint($text) : outputs a message to the run log file.

StripTags($html,’tag1’,’tag2’,…) : Removes the specified HTML tags from the HTML. This is normally used to remove formatting. The default tags are “strong”, “h1”, “h2”, “h3”, “h4”, “h5”, “h6”, “b”, “i”, “u”, “tt”, “font”, “big”, “small”, and “strike”.

StripAttributes($html,’att1’,’att2’,…) : Removes attributes from HTML tags. By default, the attributes are “alt” and “class”.

HTMLsubstr($html,$offset,$length) : Extracts a substring from the HTML, counting only the non-tag characters. Also removes open tags from the beginning and end.

TrimOpenTags($html,’tag1’,’tag2’,…) : Removes open tags from the beginning and end of a block of HTML. By default the tags are every possible HTML tag.

GetAttributeValue($html,’tag1’,’attr’) : Searches the HTML for a tag and an attribute, and returns the value of the attribute for the first tag encountered. Returns undef if the value can’t be found.
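The bless call mentioned under MakeSubtype is ordinary Perl, so its effect can be checked outside News Clipper. The sketch below leaves out the MakeSubtype call itself, since that function is only available inside News Clipper, and the hash contents are made up.

```perl
use strict;
use warnings;
use Scalar::Util qw(blessed reftype);

# A hash of information as a handler might build it. The keys and values
# are hypothetical.
my %story = (
  title => 'Example headline',
  url   => 'http://www.example.com/news/1.html',
);

# Declare the hash to be of type "Name".
my $item = bless \%story, 'Name';

print blessed($item), "\n";    # the declared type
print reftype($item), "\n";    # the underlying data structure
print $item->{title}, "\n";    # still usable as an ordinary hash
```

Blessing changes only the type label on the reference. The data underneath is still a hash, which is why it makes sense to register such a type as a subtype of HASH.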


Additional processing may be needed at the end of the Get function. For example, text can be split into several segments and stored in an array, or just the third image returned from GetImages could be used.

Keep in mind that the GetUrl, GetText, and GetHtml functions return a reference to text, and others return a reference to an array (GetImages, GetLinks). Use $$data to manipulate the text, and @$data to manipulate the array. Here are some examples:

@$data = grep { $$_ =~ /$pattern/ } @$data : weeds out all elements in the @$data array that do not match pattern $pattern. This is useful for removing all URLs from GetLinks that do not match a pattern.

@$data = grep { $$_ =~ /\d{5}\.gif/i } @$data : weeds out any images that do not have five digits followed by “.gif” in them.

@$data = grep { $$_ !~ /$pattern/ } @$data : weeds out all elements in the @$data array that match pattern $pattern. This is useful for removing all URLs from GetLinks that match a pattern.

$$_ =~ s/$pattern1/$pattern2/ foreach @$data : replaces $pattern1 with $pattern2 in every member of @$data. (Using grep here would also discard the members in which no substitution took place.)

$$data =~ s/sometext/othertext/gsi : replaces sometext with othertext everywhere it appears. The “g” in gsi means do it for all occurrences. The “s” lets “.” match newlines. The “i” means ignore case when matching sometext.
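The two kinds of extra processing mentioned at the start of this discussion (using only the third image, and splitting text into segments) might look like the sketch below. The data stands in for what GetImages and GetText return; the image tags and the text are made up.

```perl
use strict;
use warnings;

# Stand-in for a GetImages result: a reference to an array of references
# to strings.
my $images = [ map { my $tag = $_; \$tag } (
  '<img src="http://www.example.com/img/logo.gif">',
  '<img src="http://www.example.com/img/ad.gif">',
  '<img src="http://www.example.com/img/12345.gif">',
) ];

# Use only the third image.
my $third = $images->[2];
print "$$third\n";

# Stand-in for a GetText result: a reference to a string.
my $raw  = "First item\n\nSecond item\n\nThird item\n";
my $text = \$raw;

# Split the text at blank lines and store each segment as a reference,
# matching the convention that arrays contain references.
my @segments = map { my $part = $_; \$part } split /\n{2,}/, $$text;
print scalar(@segments), " segments\n";
```

Storing each segment as a reference keeps the result compatible with filters that expect an array of references.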

When referring to attributes that a user may set in the tag, refer to them as $attributes->{X}. Set a logical default in ProcessAttributes in case they do not specify the attribute.
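A minimal sketch of the defaulting logic follows. The apply_defaults function is a hypothetical plain function, not the real ProcessAttributes, whose exact arguments should be taken from an existing handler. The attribute name “source” and its allowed values are also assumptions.

```perl
use strict;
use warnings;

# Hypothetical helper showing only the defaulting and checking logic.
sub apply_defaults {
  my ($attributes) = @_;

  # Fall back to a logical default when the user gave no value.
  $attributes->{'source'} = 'headlines'
    unless defined $attributes->{'source'};

  # Normalize case so that later comparisons are case insensitive.
  $attributes->{'source'} = lc $attributes->{'source'};

  # Reject values the handler does not know about. A real handler would
  # also report the problem with the error function before returning.
  my %known = ( headlines => 1, sports => 1 );
  return undef unless $known{ $attributes->{'source'} };

  return $attributes;
}

my $checked = apply_defaults({ 'source' => 'Sports' });
print $checked->{'source'}, "\n";
```

Doing the normalization once, up front, means the rest of the handler can compare against lower-case values directly.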


Default Filter and Output Handlers

MakeHandler tries to make a logical guess for the default filter and output functions. See the section Overview of the Standard Handlers for a description of the built-in filters and output functions.

Only create a Filter and Output function if the data type being returned from Get is nonstandard. Do not restrict the user's ability to manipulate the data and output on their own. If filters are used, they should convert from special data to a standard one, like a string or array. See the freshmeat and slashdot handlers for examples.

Be sure that the handler works when run as below:

NewsClipper -e yourhandler

This is how the example file will be generated by our server for users who want to view the sample output.


Once the handler is finished, please consider submitting it to the News Clipper database. It will then be available for other people to use and enjoy. For instructions on how to submit, visit http://www.newsclipper.com/handlers.html.

 


SECTION 8
Appendix


The Benefits of Collaborative Software Development

“With enough eyes, all bugs are shallow.” Open Source is a logical choice in terms of technical excellence, because it allows you to leverage the aid of hundreds or thousands of people to increase the quality of the product. It yields very fast product development, because it allows anyone to contribute. It yields robust code. Every customer is a beta tester. Every customer is a potential code maintainer, because they can fix bugs that they see in the code.

Overhead is lower because you can outsource some of the work to people who will work for free because they have a personal stake in seeing the work done. The market will be broader because people will port the code to numerous platforms (for free).

Customers benefit because they can modify the code to deal with a particular need, and 9 times out of 10 they'll send the modifications back to the developers. Customers value the source code, because they know the software will live even if the company dies.

It's “proven”. See Linux, gcc, CVS, Perl (the "glue" of the Internet), Apache (web server with 53% of the market), Samba, Sendmail (backbone of email). Open Source development has helped News Clipper significantly. Besides the 200+ handlers that have been written by the News Clipper community, there have been a number of bug fixes and patches for new functionality like proxy support.

A Brief History of News Clipper

For the first few years of its life, News Clipper was known as Daily Update. Released from the start as Open Source software, Daily Update grew based on the suggestions and (in some cases) the modifications from the users. By the first half of 1999, Daily Update had reached a point where it was usable for more technical people, but sorely lacking for less experienced users.

In April of 1999, David Coppit, the author of Daily Update, was contacted by Binary Research International Incorporated about the possibility of marketing and selling the software as a commercial product. This proposal was attractive for a number of reasons. There was a certain amount of “grungy” work that had to be done, which meant that someone had to be paid to do it. If David Coppit was able to derive income from Daily Update, then he could devote more time to it. Lastly, making Daily Update a top-notch piece of software required the purchase of resources.

After consulting with the core members of the Daily Update community, David founded Spinnaker Software, Inc. and signed a marketing and distribution agreement with Binary Research International Incorporated. In June 1999, “News Clipper” was introduced at PCExpo in New York City.

How to Build a Business

Even as it became a commercial product, News Clipper did not forget its roots. The software is, and will remain, Open Source software. However, building a business around software whose source code is openly available is tricky. The key is to identify things that people are willing to pay for which your potential competitor has trouble producing.

In the short term, these offerings are: pre-compiled binaries, documentation, streamlined installation, and technical support. These benefits are supported by consulting services such as installation help, customization of the product, custom handler writing, web page development with the product, etc. In the long term, software “extras” will be offered, like a graphical configuration management tool, a graphical front-end to the software, a graphical version of MakeHandler, and a handler browser.

The GNU General Public License protects the source code from being blatantly stolen, because it requires that the code and all derivative works be made freely available. Any competitor who tries to co-opt the code will have to make their modifications free, and then they will be incorporated back into News Clipper.

 



$

$home, 2-4

<

<!-- newsclipper ... -->, 1-8

A

-a flag, 6-3

acquisition functions, 7-10

array handler, 1-8, 1-9, 3-5, 5-1, 5-10, 5-14, 5-15, 5-16, 7-3

array2string handler, 5-9

at scheduler, 6-4

auto_download_bugfix_updates, 2-5, 5-12

B

bbcnews handler, 5-14

bug reporting, 1-2

C

-c flag, 6-3

-C flag, 6-3

cache_location, 2-4

cacheimages, 2-6

cacheimages handler, 2-6, 5-9, 5-14

caching, 1-10, 2-4

caps handler, 5-8

CGI, 1-4

commands, 1-5, 1-8, 3-2, 3-5, 5-3

default filter and output handlers, 1-8, 3-2, 7-17

filter, 1-9, 3-5, 5-3, 5-8, 7-5

input, 1-9, 5-3, 5-7

output, 1-9, 3-5, 5-3, 5-10, 7-6

configuration, 2-3, 6-6

NewsClipper.cfg file, 1-7, 2-2, 2-3, 6-3

consulting services, 1-2

convertdate handler, 5-9, 5-15

copyrights, 1-6

fair use, 1-6

cron, 6-4

D

-d flag, 6-1, 6-3

date handler, 3-1, 3-3, 5-3

debug mode, 6-2, 6-3

debug_log_file, 2-6

dir, 2-4

dumpdata handler, 5-10

E

-e flag, 1-7, 3-1, 6-3

email, 2-3

endcomment, 5-16

error, 7-3, 7-8, 7-14

EscapeHTMLChars, 7-14

ExtractText, 7-14

F

FAQ. See Frequently Asked Questions

Filter, 7-5, 7-8, 7-17

FilterType, 7-5, 7-9

flags, 6-3

Frequently Asked Questions, 1-1

freshmeat handler, 7-17

FTP, 2-4, 6-6

ftp_files, 2-4

G

Get, 7-4, 7-8, 7-16

GetAttributeValue, 7-15

GetDefaultHandlers, 7-3

GetHtml, 7-8, 7-9, 7-10, 7-16

GetImages, 7-10, 7-11, 7-16

GetLinks, 7-4, 7-10, 7-11, 7-12, 7-16

GetText, 7-8, 7-10, 7-16

GetUpdateTimes, 4-6, 7-4, 7-13

GetUrl, 7-9, 7-10, 7-16

grep handler, 1-8, 1-9, 3-5, 5-8

H

-H flag, 6-3

handler_locations, 2-4

handlers, 1-9, 5-13

automatic update, 1-10

bugfix updates, 5-12

components, 7-2

database, 1-5, 1-10, 5-4, 5-7, 5-12, 5-13, 7-8, 7-18

functional updates, 5-12

standards, 7-8

submitting, 7-18

writing, 1-10, 7-1

hash2array handler, 5-9

hash2string handler, 5-3, 5-8, 5-9, 5-14, 5-15, 5-16

highlight handler, 5-8

HTMLsubstr, 7-14

I

-i and -o flags, 1-7, 6-3

imgcachedir, 2-6

imgcacheurl, 2-6

include handler, 5-8

input_files, 2-3

L

lastupdated handler, 5-7

limit handler, 5-1, 5-8, 5-15

linuxtoday handler, 5-15

M

mailing lists, 1-1

archives, 1-1

make_output_files_executable, 2-5

MakeHandler, 1-10, 4-1, 7-1, 7-17, 8-2

MakeLinksAbsolute, 7-14

MakeSubtype, 7-14

map handler, 5-1, 5-9, 5-14, 5-15

maphash handler, 5-9, 5-15

max_cache_size, 2-4

max_log_file_size, 2-6

max_number_of_log_files, 2-6

maximgcacheage, 2-6

maximumpc handler, 7-8

memepool handler, 5-15

module_path, 2-4

N

-n flag, 5-12, 6-3

News Clipper, 1-4

history, 8-1

installation, 2-1

operation, 1-4, 1-7

requirements, 2-1

running, 1-7, 1-8, 6-1

Running, 1-5

O

Open-Source Software, 8-1

Output, 7-6, 7-8, 7-9, 7-17

output_files, 2-3

OutputType, 7-6, 7-9

P

-P flag, 6-3

password, 2-4

Perl, 1-5, 1-10, 2-1, 5-3, 7-1, 7-11, 8-1

ProcessAttributes, 7-3, 7-8, 7-16

proxy, 2-5

proxy_password, 2-5

proxy_username, 2-5

R

-r flag, 6-3

rdf handler, 5-7, 7-7

registration, 2-3

registration_key, 2-3

regular expressions, 5-10, 7-9, 7-11

resizeimage handler, 5-10

rss handler, 5-7, 7-7

run_log_file, 2-6

RunHandler, 7-8, 7-14

S

script_timeout, 2-4

searchrep handler, 5-10, 5-14

selectkeys handler, 5-8, 5-14

server, 2-4

slashdot, 7-8

slashdot handler, 5-1, 5-5, 5-7, 7-17

socket_timeout, 2-4

socket_tries, 2-4

startcomment, 5-16

starting and ending patterns, 4-5, 7-9, 7-11

STDIN, 1-7

STDOUT, 1-7

string handler, 5-10

string2hash handler, 5-10, 5-14

StripAttributes, 7-14

StripTags, 7-8, 7-14

T

table handler, 5-10

tag_text, 2-5

Task Scheduler, 6-4

technical support, 1-2, 6-2

thread handler, 5-10

timeouts, 1-11

TrimOpenTags, 7-8, 7-15

types, 5-3, 5-5

ARRAY, 5-5

checking, 5-3, 5-5

HASH, 5-5

Image, 5-5

Link, 5-5

SCALAR, 5-5

signatures, 7-5

subtypes, 5-5

Table, 5-5

Thread, 5-5

TypesMatch, 7-5, 7-14

U

uexpress handler, 5-7

uexpresscomic handler, 7-8

unitedmediacomic handler, 5-7

Unix, 1-3, 2-2, 2-3, 2-4, 5-8, 5-13, 6-4

usenet handler, 5-8

username, 2-4

V

-v flag, 6-1, 6-3

W

weatherchannel handler, 5-14

Windows, 1-3, 1-11, 2-1, 2-2, 2-3, 2-4, 2-5, 5-13, 6-4

Y

yahootopstories handler, 1-7, 1-8, 1-9, 3-4, 3-5, 5-16


 


array:                          an ordered list of items

browser:                     a program used to interact with the Internet. Most browsers can view web pages, use FTP, and invoke CGI scripts.

caching:                     temporarily storing commonly used information in order to speed up the acquisition of that information.

CGI script:                 “common gateway interface”. This is a program that can be run by typing a URL in a browser.

cgiwrap:                     a program that adds security to the execution of CGI scripts.

commands:                commands in News Clipper are a set of instructions that tell News Clipper how to get, modify, and embed information into a web page.

command line:           the text-based interface to the operating system of a machine. In Windows, one runs the “MS-DOS Prompt” application from the start menu to use the command line. In Unix, one would normally telnet to the machine.

copyright:                  the exclusive legal right to reproduce, publish and sell works. Copyrights are automatically granted to the author/creator of a work once it is recorded.

debug log:                 a log containing debugging information which is generated when News Clipper is run in debug mode.

fair use:                      exceptions that the law provides so that copyrighted work can be reproduced. For example, educational purposes are “fair use” under copyright law.

FTP:                            “file transfer protocol”. A means of transferring files over the Internet.

handler:                      a “plugin” to News Clipper that extends its capabilities by adding new commands.

handler database:     an on-line repository of handlers maintained by Spinnaker Software.

hash:                          an unordered list where items can be retrieved by name

HTML:                       “hypertext markup language”. A formatting language used to write web pages so that they can be viewed using a variety of browsers.

HTML comment:      comments are used to embed notes in HTML, and are ignored by browsers. The HTML comment tag looks like this: <!-- this is a comment -->.

interpreter:                 an alternative to compilation, where a program’s source code is translated on-the-fly into machine code.

MakeHandler:           a “wizard” that helps people create new handlers.

Open Source:            a method of developing software in which the software is exposed to be reviewed and modified by anyone in the world.

Perl:                            a computer programming language.

personal use:            personal use is fair use under copyright law.

pipe:                           in a shell program, it is the ability to send STDOUT to the STDIN of another program. Signified by the vertical bar between programs “|”.

pre-compiled binaries: computer programs are written in source code, which is not directly usable by the computer. One method of making the source code usable is called compilation, where the source code is translated into machine code. Pre-compiled binaries are programs that have already been translated in this way.

proxy server:             a computer that manages communication between two other computers. A proxy server is often used to increase security by shielding company intranets from the world-wide Internet.

redirect:                      in a command shell, this is a method of sending STDOUT to a file to be read later.

regular expressions: a regular expression is a pattern expressed using a special notation.

scalar:                         a string or number.

run log:                      a log containing information related to the routine execution of News Clipper, as well as any errors encountered.

shell prompt:             an interface that allows the user to type commands to interact with the computer.

Slashdot:                   a popular news site for technical people.

source code:             the human-readable text format for describing the instructions that tell a computer how to execute.

STDIN:                       standard input. Typically the keyboard, as read from a shell prompt.

STDOUT:                  standard output. Typically the terminal screen.

subtype:                    one type of data can also be another type. This is similar to how a dog is a “canine” type, which is a subtype of the “animal” type.

tag:                             in HTML, a tag is used to describe the text being written. For example, a <p> tag documents the start of a paragraph.

telnet:                         a program for interacting with a remote machine via a text-based interface.

type signature:         an unambiguous, encoded description of the type of a data element.

timeout:                      a timer that, when it expires, causes some action to occur

type:                           a type specifies the “kind” of data. It is used to ensure that News Clipper commands can not be executed on data that is of the wrong type.

URL:                           “uniform resource locator”. This is an “address” on the Internet. It tells a browser where to find information.

Usenet:                      an Internet discussion forum, where people post messages on various topics.

web page:                  a document in HTML, which is available for viewing on the World Wide Web using a browser

XML:                          an emerging standard for the description and exchange of data between computers