INTRODUCTION
Writing handlers
is considerably simplified by the MakeHandler utility. It semi-automatically
generates handlers that grab text, HTML, links or images. In some special
cases, however, more is needed than what MakeHandler can offer.
In these cases,
the handler that is generated by MakeHandler may need to be edited by hand.
Familiarity with the Perl programming language, or programming languages
in general, is usually needed to accomplish this.
Users report
that MakeHandler is easy to use for the cases for which it was designed.
People who know Perl also report that editing the handler by hand is not
that difficult. If trouble is encountered, please contact technical support
or the mailing list. If you have no programming experience at all, and
cannot make a handler using MakeHandler, you should consider hiring a
consultant to help you. See the services webpage
for more information.
Every handler
operates in the context of an interface that News Clipper provides. This
interface provides functions for acquiring remote information, as well
as functions for manipulating the acquired information. This chapter provides
a tutorial on how to go about developing handlers.
BASIC HANDLER STRUCTURE
ACQUISITION HANDLER COMPONENTS
Below is the
complete text of a skeleton acquisition handler, with commentary describing
the various parts of the handler. MakeHandler would automatically generate
the bulk of this computer code. We are presenting this handler in order
to discuss the various components. The following subsections provide more
detail about the key aspects of handlers.
# -*- mode: Perl; -*-
package NewsClipper::Handler::Acquisition::HANDLERNAME;
use vars qw( @ISA $VERSION %handlerInfo );
$handlerInfo{'Author Name'} = 'Joe Shmoe';
$handlerInfo{'Author Email'} = 'joe@shmoe.com';
$handlerInfo{'Maintainer Name'} = 'Bob Maintainer';
$handlerInfo{'Maintainer Email'} = 'bob@maintainer.com';
$handlerInfo{'Description'} = <<'EOF';
This handler does not work. It’s just a sample
EOF
$handlerInfo{'Category'} = 'General';
$handlerInfo{'URL'} = <<'EOF';
http://www.thenewssource.com/
EOF
$handlerInfo{'License'} = 'GPL';
$handlerInfo{'For News Clipper Version'} = '1.18';
$handlerInfo{'Language'} = 'English';
$handlerInfo{'Notes'} = <<'EOF';
This handler was originally written by Joe Shmoe, and is now maintained
by Bob Maintainer.
EOF
$handlerInfo{'Syntax'} = <<'EOF';
<input name=HANDLERNAME source=X>
Returns an array of links
X: either headlines or sports (the default is headlines)
EOF
The package declaration
describes the handler type (in this case an acquisition handler) and the
handler name. The handler information block provides information about
the handler, such as the handler’s maintainer, the language of the data
it acquires, and the syntax describing how to use the handler.
From the syntax,
we can see that the handler has an optional "source" attribute whose value
can either be "headlines" or "sports", and defaults to headlines.
use strict;
use NewsClipper::Handler;
@ISA = qw(NewsClipper::Handler);
# - The first number should be incremented when a change is made to the
# handler that will break people's input files.
# - The second number should be incremented when a change is made that won't
# break people's input files, but changes the functionality.
# - The third number should be incremented when only a bugfix is applied.
$VERSION = do {my @r=('0.4.1'=~/\d+/g);sprintf "%d."."%02d"x$#r,@r};
This section contains
details that aren’t relevant to most users, except for the version number.
The version number of this handler is 0.4.1.
sub ProcessAttributes
{
my $self = shift;
my $attributes = shift;
my $handlerRole = shift;
$attributes->{'source'} = 'headlines'
unless defined $attributes->{'source'};
unless ($attributes->{source} eq 'headlines' ||
$attributes->{source} eq 'sports')
{
error "The \"source\" attribute for handler \"HANDLERNAME\" " .
"should be either \"headlines\" or \"sports\".\n";
return undef;
}
return $attributes;
}
The ProcessAttributes
subroutine gives attributes their default values and verifies
that the attributes are valid. This subroutine is executed before the URL
is computed, and before default handlers are computed (see below). The
handler role indicates how the handler is being used; the valid values
are "input", "filter", and "output".
This subroutine
demonstrates the use of the error subroutine, which should be used any
time an error occurs in the handler and a message is to be displayed to
the user.
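The default-and-validate logic above can be tried outside News Clipper. The following is a minimal sketch, assuming a hypothetical stand-alone helper named normalize_source (it is not part of News Clipper) that also lowercases the attribute so that the check is case insensitive:

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper (not part of News Clipper): applies the
# default and validates the "source" attribute, ignoring case.
sub normalize_source
{
  my $attributes = shift;

  # Supply the default when the user did not specify a source.
  $attributes->{'source'} = 'headlines'
    unless defined $attributes->{'source'};

  # Compare in lower case so that "Sports" and "SPORTS" are accepted.
  my $source = lc $attributes->{'source'};
  return undef unless $source eq 'headlines' || $source eq 'sports';

  $attributes->{'source'} = $source;
  return $attributes;
}

my $result = normalize_source({ 'source' => 'Sports' });
print "source: $result->{'source'}\n";
```

In a real handler the failure branch would also call the error subroutine, as ProcessAttributes does above, before returning undef.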
sub GetDefaultHandlers
{
my $self = shift;
my $attributes = shift;
my $returnVal = <<'EOF';
<filter name='limit' number='10'>
<output name='array'>
EOF
return $returnVal;
}
GetDefaultHandlers
describes the default filter and output handlers for the acquisition handler.
sub ComputeURL
{
my $self = shift;
my $attributes = shift;
my $source = $attributes->{source};
my %urlMap = (
'headlines' => 'hl/',
'sports' => 'sp/',
);
my $url = 'http://www.thenewssource.com/' . $urlMap{$source};
return $url;
}
The ComputeURL
subroutine is used to compute the URL from which to acquire the data. In
this case, the URL is computed using the "source" attribute by looking
up the URL ending in %urlMap and appending it to the base URL.
sub Get
{
my $self = shift;
my $attributes = shift;
my $source = $attributes->{source};
my %patternMap = (
'headlines' => ['headlines start','headlines end'],
'sports' => ['sports start','sports end'],
);
my $url = $self->ComputeURL($attributes);
my $data = &GetLinks($url,$patternMap{$source}[0],$patternMap{$source}[1]);
return undef unless defined $data;
@$data = grep {$$_ !~ /<img/i} @$data;
return $data;
}
The Get subroutine
does most of the work of a handler. It calls ComputeURL to determine the
URL from which to fetch information, and then calls one of the built-in
routines to fetch the data. In this case, the routine is GetLinks, which
is called with a starting and ending pattern that is dependent on the source
of the news.
Additional
processing can also occur in the Get subroutine. For example, this handler
removes links to images from the array of links that is returned after
GetLinks is called.
sub GetUpdateTimes
{
my $self = shift;
my $attributes = shift;
return ['2,5,8,11,14,17,20,23'];
}
The GetUpdateTimes
subroutine encodes the times at which the content on the remote server
needs to be fetched.
1;
Every handler
ends with a "1;".
OTHER HANDLER
COMPONENTS
Filter and
output handlers have components in addition to the ones listed above. Normally
users will not need to know about these components. Advanced users will
want to read this section, but other users should skip it.
sub FilterType
{
my $self = shift;
my $attributes = shift;
my $data = shift;
return '$Link | @Link';
}
Filter handlers
have a FilterType function that describes the type of data that the handler
accepts. In this case, the type description (or type signature) of data
accepted by the handler is "$Link | @Link", which means that the handler
accepts either a link, or an array of links.
Type signatures
describe the structure and data of a complex data structure. For example,
"$" means a scalar (e.g. a string), "@" means an array, and "%" means a
hash. The name of the subtype can follow the main type symbol for scalars.
For example "$Link" is a link.
Complex data
structures can be described using nested symbols. For example, "@$Link"
indicates an array of links. One can also describe alternatives using the
"|" symbol: "@Link | %" means an array of links or a hash. Mandatory elements
can be expressed with the "&" symbol: "@($Link & %Slashdot)" means
an array consisting of at least one link and at least one Slashdot hash.
(Note the use of parentheses to group items.)
sub Filter
{
my $self = shift;
my $attributes = shift;
my $data = shift;
if (TypesMatch($data,'@Link'))
{
# Apply the substitution to every link; grep would discard links that
# the pattern fails to match.
$$_ =~ s/<a/<a target="_top"/si foreach @$data;
}
elsif (TypesMatch($data,'$Link'))
{
$$data =~ s/<a/<a target="_top"/si;
}
return $data;
}
Filter handlers
have a Filter function that accepts data, transforms it in some way, and
then returns the data. In this example we use the TypesMatch function to
determine to which of the two possible types the data conforms.
sub OutputType
{
my $self = shift;
my $attributes = shift;
my $data = shift;
return '$Link | @Link';
}
Output handlers
use the OutputType function to describe the type of data they accept.
sub Output
{
my $self = shift;
my $attributes = shift;
my $data = shift;
if (TypesMatch($data,'@Link'))
{
foreach my $link (@$data)
{
print "URL: $$link<br>\n";
}
}
elsif (TypesMatch($data,'$Link'))
{
print "URL: $$data<br>\n";
}
}
Output handlers
have an Output function that accepts data and formats it for output.
CHOOSE YOUR
NEWS SOURCE
Usually finding
news to be viewed every day is easy. But finding the best format for that
information may not be. For example, the National Weather Service weather
is available in many formats, but some users prefer "raw" NWS text, which
is easier to parse. Conversely, others might prefer pictures, which means
they should find a weather site that prints the NWS weather along with
images.
Sometimes sites
have "low graphics" versions of their web pages. This can be used to grab
the information of interest, and then filter the results to have them point
to the "high graphics" web pages. For example, some news sites simply use
"low" or "hi" in the URL to distinguish the two types of web pages. Fetching
the low graphics links and then replacing "low" with "hi" may be easier
than extracting the links directly from the high graphics web page.
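As a rough illustration of the "low" to "hi" replacement, here is a minimal sketch that runs without News Clipper. The URLs are invented, and the array of string references is only a stand-in for what an acquisition function such as GetLinks returns:

```perl
use strict;
use warnings;

# Stand-in for the array of link references an acquisition function returns.
# The URLs are invented for illustration.
my @links = (
  '<a href="http://www.example.com/low/story1.html">Story 1</a>',
  '<a href="http://www.example.com/low/story2.html">Story 2</a>',
);
my $data = [ map { \$_ } @links ];

# Rewrite each link in place so that it points at the high graphics page.
foreach my $link (@$data)
{
  $$link =~ s{/low/}{/hi/};
}

print "$$_\n" foreach @$data;
```

Matching "/low/" with its surrounding slashes, rather than the bare word "low", keeps the substitution from touching link text that merely contains those letters.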
Other web sites
actually provide a special back-end specifically designed to make the job
of extracting information easier. They realize that the more people link
to their sites and use their content, the more traffic they generate. This
is the exception more than the rule.
One special
type of back-end is one that is based on XML, a standard for interchange
of data. The rss and rdf input handlers are designed to allow you to easily
extract links from sites that use XML, and they significantly expand the
set of news sources available to News Clipper. A list of sites that export
XML files with news information is located at http://www.xmltree.com/,
http://static.userland.com/myUserLandServices/serviceList2.xml/,
and http://theweb.startshere.net/channels.phtml/.
See the documentation of the rdf and rss handlers for more information.
When deciding
to build a handler for a specific site, avoid creating individual handlers
for the different departments of a web site. Instead, try to exploit commonality
in the web pages. Create one handler with a "source" attribute that allows
people to select the department they want.
STANDARDS
In order to
keep the quality of handlers high, please observe the following guidelines:
-
Every data element
(and internal data element of a complex data structure) is a reference
- Get and Filter functions return references to strings, arrays, and hashes,
and arrays and hashes contain references.
-
Whenever possible,
provide defaults for any attributes. The handler should generate output
when run as "NewsClipper -e yourhandler". (This is how the example web
page for the handler will be generated when you submit the handler to the
database.) When defaults are not possible, print an error message from
within ProcessAttributes and return undef.
-
If there is a
problem, use the error function to log the error, and do not insert visible
text into the output.
-
While an acquisition
handler can operate as a filter or output handler, try to avoid writing
a Filter or Output function unless the data structure is totally unique.
-
If the data structure
is unique, provide filters that let people translate it into common data
structures. For example, the slashdot handler returns an array of hashes,
and has a Filter function that can convert data of the type "array of Slashdot
hashes" to "array of links".
-
If writing a filter,
be careful to change the output type if necessary. For example, if the
filter turns an array of hashes into an array of links, "bless $link,'Link'"
for each link in the array before returning the data. (But don’t bless
the array as "ArrayOfLinks" or anything like that.)
-
Try to use the
other built-in filters whenever possible. (See the uexpresscomic handler
for an example.) News Clipper has a helper function called RunHandler (see
below) which you can use to invoke other handlers to process the data.
-
Always include
the line "return undef unless defined $data" after a call to GetHtml, GetText,
etc., since these functions return undef when they fail, and you should
too.
-
If a web site
has common formatting, consider using a "source" parameter to choose among
the different data types. (See the maximumpc handler, for example.)
-
Always return
"clean" HTML without unopened or unclosed tags, like <b> but no
</b>.
See TrimOpenTags, as well as StripTags.
-
Only rarely is
it necessary to use GetUrl to grab HTML, because it doesn't make links
absolute. Use GetHtml($url,'^','$') instead.
-
Try to specify
the beginning of document ("^") and end of document ("$") for the start
and end patterns of the acquisition functions whenever possible. Experience
has shown that when handlers break, it’s usually because the start or end
pattern doesn’t work anymore. A good strategy is to use "^" and "$" to
grab everything on the page, identify something that is unique about the
links or other data you are trying to capture, and weed out the results
that do not match.
-
When checking
attributes, do something like if (lc($attributes->{'source'}) eq 'headlines')
to make the attribute case insensitive.
-
Try to make regular
expressions robust. Generally, the longer the pattern, the more chance
it will fail. Also, try to store matches in variables one at a time. If
you try to match many items with one pattern, they will all fail if the
pattern does not match.
-
Every filter must
have a FilterType function that returns a type specifier that says what
types of input it can handle. Likewise, any handler with an Output function
must also have an OutputType function.
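The guideline about blessing each link can be sketched in plain Perl. The hash keys "title" and "url" and the sample data below are assumptions made for illustration; they are not the layout of any real handler's data:

```perl
use strict;
use warnings;

# Invented sample data standing in for an "array of hashes" data structure.
my $stories = [
  { 'title' => 'First story',  'url' => 'http://www.example.com/1.html' },
  { 'title' => 'Second story', 'url' => 'http://www.example.com/2.html' },
];

# Build one link per hash. Each link is a reference to a string, blessed
# as "Link"; the enclosing array is left as a plain array reference.
my @links;
foreach my $story (@$stories)
{
  my $html = "<a href=\"$story->{'url'}\">$story->{'title'}</a>";
  my $link = \$html;
  bless $link, 'Link';
  push @links, $link;
}
my $data = \@links;

print ref($_), ": $$_\n" foreach @$data;
```

Note that every element is itself a reference, in keeping with the first guideline, and that only the individual links are blessed.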
CHOOSING THE
ACQUISITION FUNCTION
First, find
the web page that has the data. Then, decide what type of information News
Clipper will retrieve, which will indicate which acquisition function to
use:
GetUrl:
Grabs all the content from a URL, in totally raw form. Usually this is
used to grab a text file.
GetText:
Grabs text data from a block of HTML, stripping out the HTML tags.
GetHtml:
Grabs a block of HTML from a URL's content, making links absolute rather
than relative.
GetImages:
Grabs images from a block of HTML, making links absolute.
GetLinks:
Grabs hyperlinks from a block of HTML, and makes them absolute.
CHOOSING THE
STARTING AND ENDING PATTERNS
Suppose GetLinks
is used to get the links off a web page, but only a few of the links are
wanted. Consider the following HTML:
<html>
<head><title>Title</title></head>
<body>
<p>This is some text
An <a href="unwanted.html">unwanted link</a>.
<!-- Insert links here -->
<a href="/news/somewhere.html">Somewhere</a><br>
<a href="/news/somewhereelse.html">Somewhere Else</a><br>
<!-- End links -->
An <a href="mailto:webmaster@asdf.com">email link</a>.
</p>
</body>
</html>
If the HTML designers
were nice enough to use the comments shown, simply use "Insert links here
-->" and "<!-- End links" as the start and end patterns. Otherwise, some
other marker text will need to be found.
The starting
and ending patterns are expressed as Perl regular expressions. For example,
"." matches any single character, "a.*b" matches an "a" followed by any
number of characters (including an "a" or "b") followed by a "b". For more
information about Perl’s support for regular expressions, run perldoc perlre.
Each of the
acquisition functions works by first searching for the first match to the
start pattern, and then searching for the first match of the end pattern
after the start pattern. To find good starting and ending patterns, try
the following.
-
Use the "View
Source" option of the web browser to see the HTML.
-
Search for the
information News Clipper should grab.
-
Look for something
right above the information that can be used as the start pattern. (The
GetLinks or GetImages functions are less particular, since they generally
ignore any extra items at the beginning of the grabbed data.)
-
If it is a simple
bit of text, and not a full-blown regular expression, scroll to the top
of HTML and use the browser's find feature to see if the chosen pattern
shows up earlier in the HTML.
-
Now go to the
end of the content to be grabbed and find a good ending pattern.
-
As before, use
the browser's find feature to make sure the end pattern does not show up
somewhere in the middle of the content to be grabbed.
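The search order described above (the first match of the start pattern, then the first match of the end pattern after it) can be sketched in plain Perl. The helper extract_between is hypothetical and is not the News Clipper implementation; in particular, the choice to exclude the matched marker text from the result is an assumption of this sketch:

```perl
use strict;
use warnings;

# Hypothetical helper, not the News Clipper implementation: returns the text
# between the first match of $start and the first later match of $end.
sub extract_between
{
  my ($text, $start, $end) = @_;

  # Find the first match of the start pattern and remember where it ends.
  return undef unless $text =~ /$start/g;
  my $from = pos($text);

  # Continue from that position to find the first match of the end pattern.
  return undef unless $text =~ /$end/g;
  my $to = $-[0];

  return substr($text, $from, $to - $from);
}

my $html = 'An <a href="unwanted.html">unwanted link</a>.'
         . '<!-- Insert links here -->'
         . '<a href="/news/somewhere.html">Somewhere</a>'
         . '<!-- End links -->';

print extract_between($html, 'Insert links here -->', '<!-- End links'), "\n";
```

Because only the first match of each pattern counts, a start pattern that also appears earlier in the page would move the beginning of the grabbed block, which is why the checklist above recommends searching for earlier occurrences.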
Unfortunately,
start and end patterns often change on web sites, so a handler that depends
on them breaks easily. A better strategy is to see if the links have anything
in common, fetch every link in the web page, and then filter out the
unwanted ones.
For example,
in the above sample web page, we see that all the interesting links have
the string "/news/" in them. As a result, we can tell GetLinks to get every
link on the page:
my $data = GetLinks($url,'^','$');
and then extract
only the ones with "/news/" in them:
@$data = grep { $$_ =~ /\/news\//i } @$data;
Be sure to use
"(?i)" at the start of patterns to make them case insensitive. "\n" can
be used to indicate a newline in the pattern.
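The fetch-everything-then-filter strategy can be tried without News Clipper. In this minimal sketch the array of string references is an invented stand-in for what GetLinks might return for the sample page, with one link deliberately capitalized to show the effect of "(?i)":

```perl
use strict;
use warnings;

# Invented stand-in for the result of GetLinks: a reference to an array of
# references to link strings.
my @links = (
  '<a href="http://www.example.com/unwanted.html">unwanted link</a>',
  '<a href="http://www.example.com/News/somewhere.html">Somewhere</a>',
  '<a href="http://www.example.com/news/somewhereelse.html">Somewhere Else</a>',
  '<a href="mailto:webmaster@asdf.com">email link</a>',
);
my $data = [ map { \$_ } @links ];

# "(?i)" makes the match case insensitive, so "/News/" is kept as well.
my $pattern = '(?i)/news/';
@$data = grep { $$_ =~ /$pattern/ } @$data;

print "$$_\n" foreach @$data;
```

Keeping the pattern in a variable makes it easy to adjust when the site changes its URL layout.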
GETUPDATETIMES
To specify
when the server updates its information, add a GetUpdateTimes function.
This function tells News Clipper when to refresh its cached data. For example,
when making a handler for a daily comic, consider using "7", since
the comic changes at 6 am PST every day.
It is important
to set the times as close as possible to the actual update time. For example,
if the data gets updated at 1am PST, and the update is set to 3am PST,
visitors in England will have stale data between 9am and 11am. On the other
hand, setting the time too early could mean getting the data from the day
before.
It pays to
be a little conservative here: specifying every hour of the day means
that many News Clipper installations will hit the remote server even when
nobody is looking at the generated webpage.
Date specifications
are of the form "[day] hour,hour,hour [time zone]". If the day is omitted,
every day is assumed. If the time zone is omitted, Pacific Standard Time
is assumed. If no GetUpdateTimes function is provided, the default of "2,5,8,11,14,17,20,23
PST" is used.
The days are:
sun, mon, tue, wed, thu, fri, sat. Multiple times can be specified, for
example:
sub GetUpdateTimes
{
return ['1 EST','mon 6,8 EST','tue 16 CST','20'];
}
will update every day
at 1am EST, Mondays at 6am and 8am EST, Tuesdays at 4pm CST, and every day at 8pm PST.
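To make the "[day] hour,hour,hour [time zone]" form concrete, here is a minimal sketch of a parser for a single specification. The function parse_update_spec is hypothetical; it is not how News Clipper reads these strings, and it does not handle the special "always" value:

```perl
use strict;
use warnings;

# Hypothetical parser, not News Clipper's own code: splits one update
# specification of the form "[day] hour,hour,hour [time zone]".
sub parse_update_spec
{
  my $spec = shift;

  my ($day, $hours, $zone) = $spec =~ /^\s*
      (?:(sun|mon|tue|wed|thu|fri|sat)\s+)?   # optional day
      (\d+(?:,\d+)*)                          # comma-separated hours
      (?:\s+([A-Za-z]+))?                     # optional time zone
      \s*$/ix
    or return undef;

  return {
    'day'   => defined($day)  ? lc($day)  : 'every day',
    'hours' => [ split /,/, $hours ],
    'zone'  => defined($zone) ? uc($zone) : 'PST',
  };
}

foreach my $spec ('1 EST', 'mon 6,8 EST', 'tue 16 CST', '20')
{
  my $parsed = parse_update_spec($spec);
  print "$parsed->{'day'} at hours @{$parsed->{'hours'}} $parsed->{'zone'}\n";
}
```

The defaults mirror the rules stated above: an omitted day means every day, and an omitted time zone means Pacific Standard Time.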
You can specify
"always" as the update time to have News Clipper update the data every
time it is run. However, please do not do this unless the data being queried
really does change by the minute. Headlines, for example, do not change
constantly, and specifying "always" will cause News Clipper to hit the
remote server repeatedly, causing the system administrators to send you
a nasty email.
OTHER FUNCTIONALITY
News Clipper
provides additional functionality for handler writers to help manipulate
acquired data. This includes the following functions:
RunHandler($handlerName,$handlerType,$data,$attributes)
: runs the handler specified by $handlerName as the type specified by $handlerType
("input", "filter", or "output"). $data is used to pass the data to the
filter or output handler to be run, or undef should be used if the handler
to be run is an input handler. Finally, $attributes should contain the
hash of attributes for the handler.
TypesMatch($data,$typeSignature)
: compares the actual type signature of $data to the type signature specified
by $typeSignature. The function is useful for filter and output handlers
that accept more than one kind of data.
MakeSubtype($subType,$baseType)
: makes one type a subtype of another. For example, a handler that creates
a hash of information from a website should call "MakeSubtype('Name', 'HASH');"
to let News Clipper know that data of type "Name" can be used wherever
a hash is expected. As the hashes are created, they should be declared
as being of type Name by calling bless: "bless \%hash, 'Name';".
error($text)
: logs an error that will be sent to the output as an HTML comment after
the News Clipper command has completed execution.
ExtractText($text,$beginPattern,$endPattern)
: extracts text between the beginning and ending patterns.
MakeLinksAbsolute($baseurl,$html)
: Finds all "a href=" and "img src=" tags in the HTML and makes the URLs
absolute.
EscapeHTMLChars($text)
: Escapes all "<", ">", and "&" characters in the text.
StripTags($html,'tag1','tag2',...)
: Removes the specified HTML tags from the HTML. This is normally used
to remove formatting. The default tags are "strong", "h1", "h2", "h3",
"h4", "h5", "h6", "b", "i", "u", "tt", "font", "big", "small", and "strike".
StripAttributes($html,'att1','att2',...)
: Removes attributes from HTML tags. By default, the attributes are "alt" and
"class".
HTMLsubstr($html,$offset,$length)
: Extracts a substring from the HTML, counting only the non-tag characters.
Also removes open tags from the beginning and end.
TrimOpenTags($html,'tag1','tag2',...)
: Removes open tags from the beginning and end of a block of HTML. By default
the tags are every possible HTML tag.
GetAttributeValue($html,'tag1','attr')
: Searches the HTML for a tag and an attribute, and returns the value of
the attribute for the first tag encountered. Returns undef if the value
can’t be found.
ADDITIONAL
PROCESSING
Additional
processing may be needed at the end of the Get function. For example, text
can be split into several segments and stored in an array, or just the
third image returned from GetImages could be used.
Keep in mind
that the GetUrl, GetText, and GetHtml functions return a reference to text,
and others return a reference to an array (GetImages, GetLinks). Use $$data
to manipulate the text, and @$data to manipulate the array. Here are some
examples:
@$data = grep
{ $$_ =~ /$pattern/ } @$data : weeds out all elements in the @$data array
that do not match pattern $pattern. This is useful for removing all URLs
from GetLinks that do not match a pattern.
@$data = grep
{ $$_ =~ /\d{5}\.gif/i } @$data : weeds out any images that do not have five
digits followed by ".gif" in them.
@$data = grep
{ $$_ !~ /$pattern/ } @$data : weeds out all elements in the @$data array
that match pattern $pattern. This is useful for removing all URLs from
GetLinks that match a pattern.
foreach (@$data)
{ $$_ =~ s/$pattern1/$pattern2/ } : replaces $pattern1 with $pattern2
in every member of @$data. (Using grep here would discard the members that
do not match $pattern1.)
$$data =~ s/sometext/othertext/gsi:
Replaces sometext with othertext everywhere it appears. The "g" in gsi means
do it for all occurrences. The "s" lets "." match newlines. The "i" means
ignore case when matching sometext.
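The other kinds of post-processing mentioned at the start of this section (splitting text into segments, keeping just the third image) can be sketched in the same way. The data below is invented and merely stands in for what GetText and GetImages return: a reference to text and a reference to an array of references.

```perl
use strict;
use warnings;

# Invented stand-ins for acquired data: a reference to text, and a reference
# to an array of image references.
my $text = "Headline one\nHeadline two\nHeadline three\n";
my $data = \$text;

my @tags = (
  '<img src="http://www.example.com/logo.gif">',
  '<img src="http://www.example.com/ad.gif">',
  '<img src="http://www.example.com/12345.gif">',
);
my $images = [ map { \$_ } @tags ];

# Split the text into segments, one reference per line.
my @segments = map { my $line = $_; \$line } split /\n/, $$data;

# Keep only the third image.
my $third = $images->[2];

# Apply a substitution to every element without dropping any of them.
$$_ =~ s/<img/<img alt=""/i foreach @$images;

print scalar(@segments), " segments, third image: $$third\n";
```

Each segment is stored as a reference so that the resulting array follows the same convention as the data returned by the acquisition functions.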
When referring
to attributes that a user may set in the tag, refer to them as $attributes->{X}.
Set a logical default in ProcessAttributes in case they do not specify
the attribute.
DEFAULT
FILTER AND OUTPUT HANDLERS
MakeHandler
tries to make a logical guess for the default filter and output functions.
Only create
a Filter and Output function if the data type being returned from Get is
nonstandard. Do not restrict the user's ability to manipulate the data
and output on their own. If filters are used, they should convert from
special data to a standard one, like a string or array. See the freshmeat
and slashdot handlers for examples.
Be sure that
the handler works when run as below:
NewsClipper -e yourhandler
This is how the
example file will be generated by our server for users who want to view
the sample output.
SUBMITTING
YOUR HANDLER
Once the handler
is finished, please consider submitting it to the News Clipper database.
It will then be available for other people to use and enjoy. For instructions
on how to submit, visit the handler submission
service webpage.