NewsClipper.com ~ Available Handlers

NEWS CLIPPER - Snip and Ship content to your Web Site




WRITING CUSTOM HANDLERS
INTRODUCTION

Writing handlers is considerably simplified by the MakeHandler utility. It semi-automatically generates handlers that grab text, HTML, links or images. In some special cases, however, more is needed than MakeHandler can offer, and the generated handler may need to be edited by hand. Familiarity with the Perl programming language, or programming languages in general, is usually needed to accomplish this.

Users report that MakeHandler is easy to use for the cases for which it was designed. People who know Perl also report that editing the handler by hand is not that difficult. If trouble is encountered, please contact technical support or the mailing list. If you have no programming experience at all, and cannot make a handler using MakeHandler, you should consider hiring a consultant to help you. See the services webpage for more information.

Every handler operates in the context of an interface that News Clipper provides. This interface provides functions for acquiring remote information, as well as functions for manipulating the acquired information. This chapter provides a tutorial on how to go about developing handlers.

BASIC HANDLER STRUCTURE

ACQUISITION HANDLER COMPONENTS

Below is the complete text of a skeleton acquisition handler, with commentary describing the various parts of the handler. MakeHandler would automatically generate the bulk of this code. We present this handler in order to discuss its components. The following subsections provide more detail about the key aspects of handlers.

    # -*- mode: Perl; -*-
    package NewsClipper::Handler::Acquisition::HANDLERNAME;

    use vars qw( @ISA $VERSION %handlerInfo );

    $handlerInfo{'Author Name'} = 'Joe Shmoe';
    $handlerInfo{'Author Email'} = 'joe@shmoe.com';
    $handlerInfo{'Maintainer Name'} = 'Bob Maintainer';
    $handlerInfo{'Maintainer Email'} = 'bob@maintainer.com';
    $handlerInfo{'Description'} = <<'EOF';
    This handler does not work.
    It's just a sample
    EOF
    $handlerInfo{'Category'} = 'General';
    $handlerInfo{'URL'} = <<'EOF';
    http://www.thenewssource.com/
    EOF
    $handlerInfo{'License'} = 'GPL';
    $handlerInfo{'For News Clipper Version'} = '1.18';
    $handlerInfo{'Language'} = 'English';
    $handlerInfo{'Notes'} = <<'EOF';
    This handler was originally written by Joe Shmoe, and is now maintained by
    Bob Maintainer.
    EOF
    $handlerInfo{'Syntax'} = <<'EOF';
    <input name=HANDLERNAME source=X>
      Returns an array of links
      X: either headlines or sports (the default is headlines)
    EOF

The package declaration describes the handler type (in this case an acquisition handler) and the handler name. The handler information block provides information about the handler, such as the handler's maintainer, the language of the data it acquires, and the syntax describing how to use the handler. From the syntax, we can see that the handler has an optional "source" attribute whose value can be either "headlines" or "sports", and defaults to "headlines".

    use strict;
    use NewsClipper::Handler;
    @ISA = qw(NewsClipper::Handler);

    # - The first number should be incremented when a change is made to the
    #   handler that will break people's input files.
    # - The second number should be incremented when a change is made that won't
    #   break people's input files, but changes the functionality.
    # - The third number should be incremented when only a bugfix is applied.
    $VERSION = do {my @r=('0.4.1'=~/\d+/g);sprintf "%d."."%02d"x$#r,@r};

This section contains details that aren't relevant to most users, except for the version number. The version number of this handler is 0.4.1.

    sub ProcessAttributes {
      my $self = shift;
      my $attributes = shift;
      my $handlerRole = shift;

      $attributes->{'source'} = 'headlines'
        unless defined $attributes->{'source'};

      unless ($attributes->{source} eq 'headlines' ||
              $attributes->{source} eq 'sports') {
        error "The \"source\" attribute for handler \"HANDLERNAME\" " .
              "should be either \"headlines\" or \"sports\".\n";
        return undef;
      }

      return $attributes;
    }

The ProcessAttributes subroutine gives attributes their default values and verifies that the attributes are valid. This subroutine is executed before the URL is computed, and before default handlers are computed (see below). The handler role determines how the handler is being used; the valid values are "input", "filter", and "output". This subroutine demonstrates the use of the error subroutine, which should be used any time an error occurs in the handler and a message is to be displayed to the user.

    sub GetDefaultHandlers {
      my $self = shift;
      my $attributes = shift;

      my $returnVal = <<'  EOF';
        <filter name='limit' number='10'>
        <output name='array'>
      EOF

      return $returnVal;
    }

GetDefaultHandlers describes the default filter and output handlers for the acquisition handler.

    sub ComputeURL {
      my $self = shift;
      my $attributes = shift;

      my $source = $attributes->{source};

      my %urlMap = (
        'headlines' => 'hl/',
        'sports' => 'sp/',
      );

      my $url = 'http://www.thenewssource.com/' . $urlMap{$source};

      return $url;
    }

The ComputeURL subroutine computes the URL from which to acquire the data. In this case, the URL is computed from the "source" attribute by looking up the URL ending in %urlMap and appending it to the base URL.

    sub Get {
      my $self = shift;
      my $attributes = shift;

      my $source = $attributes->{source};

      my %patternMap = (
        'headlines' => ['headlines start','headlines end'],
        'sports' => ['sports start','sports end'],
      );

      my $url = $self->ComputeURL($attributes);
      my $data = &GetLinks($url,$patternMap{$source}[0],$patternMap{$source}[1]);

      return undef unless defined $data;

      @$data = grep {$$_ !~ /<img/i} @$data;

      return $data;
    }

The Get subroutine does most of the work of a handler. It calls ComputeURL to determine the URL from which to fetch information, and then calls one of the built-in routines to fetch the data.
In this case, the routine is GetLinks, which is called with a starting and ending pattern that depends on the source of the news. Additional processing can also occur in the Get subroutine. For example, this handler removes links to images from the array of links that GetLinks returns.

    sub GetUpdateTimes {
      my $self = shift;
      my $attributes = shift;

      return ['2,5,8,11,14,17,20,23'];
    }

The GetUpdateTimes subroutine encodes the times at which the content on the remote server needs to be fetched.

    1;

Every handler ends with a "1;".

OTHER HANDLER COMPONENTS

Filter and output handlers have components in addition to the ones listed above. Normally users will not need to know about these components. Advanced users will want to read this section, but other users should skip it.

    sub FilterType {
      my $self = shift;
      my $attributes = shift;
      my $data = shift;

      return '$Link | @Link';
    }

Filter handlers have a FilterType function that describes the type of data that the handler accepts. In this case, the type description (or type signature) of data accepted by the handler is "$Link | @Link", which means that the handler accepts either a link, or an array of links.

Type signatures describe the structure and contents of a data structure. For example, "$" means a scalar (e.g. a string), "@" means an array, and "%" means a hash. The name of the subtype can follow the main type symbol for scalars. For example, "$Link" is a link. Complex data structures can be described using nested symbols. For example, "@$Link" indicates an array of links. One can also describe alternatives using the "|" symbol: "@Link | %" means an array of links or a hash. Mandatory elements can be expressed with the "&" symbol: "@($Link & %Slashdot)" means an array consisting of at least one link and at least one Slashdot hash. (Note the use of parentheses to group items.)
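To make the "$Link" and "@Link" shapes concrete, here is a minimal plain-Perl sketch of the data they describe. It assumes, as the guidelines in this chapter state, that a link is a reference to a string blessed as 'Link' and that the enclosing array is left unblessed. The helpers make_link and describe_shape are hypothetical and exist only for illustration; inside a real handler, News Clipper's own TypesMatch does the checking.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap a string in a reference and bless it as a 'Link'.
sub make_link {
    my ($html) = @_;
    my $link = \$html;              # every data element is a reference
    return bless $link, 'Link';
}

my $single = make_link('<a href="http://example.com/a.html">A</a>');
my $many   = [
    make_link('<a href="http://example.com/b.html">B</a>'),
    make_link('<a href="http://example.com/c.html">C</a>'),
];

# Illustrative stand-in only, NOT News Clipper's TypesMatch. It recognizes
# just the two shapes discussed above.
sub describe_shape {
    my ($data) = @_;
    return '$Link' if ref($data) eq 'Link';
    if (ref($data) eq 'ARRAY' && !grep { ref($_) ne 'Link' } @$data) {
        return '@Link';
    }
    return 'unknown';
}

print describe_shape($single), "\n";
print describe_shape($many), "\n";
print describe_shape({}), "\n";
```

Note that the array itself stays an ordinary array reference; only its elements carry the 'Link' type.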
    sub Filter {
      my $self = shift;
      my $attributes = shift;
      my $data = shift;

      if (TypesMatch($data,'@Link')) {
        foreach my $link (@$data) {
          $$link =~ s/<a/<a target="_top"/si;
        }
      }
      elsif (TypesMatch($data,'$Link')) {
        $$data =~ s/<a/<a target="_top"/si;
      }

      return $data;
    }

Filter handlers have a Filter function that accepts data, transforms it in some way, and then returns the data. In this example we use the TypesMatch function to determine which of the two possible types the data conforms to.

    sub OutputType {
      my $self = shift;
      my $attributes = shift;
      my $data = shift;

      return '$Link | @Link';
    }

Output handlers use the OutputType function to describe the type of data they accept.

    sub Output {
      my $self = shift;
      my $attributes = shift;
      my $data = shift;

      if (TypesMatch($data,'@Link')) {
        foreach my $link (@$data) {
          print "URL: $$link<br>\n";
        }
      }
      elsif (TypesMatch($data,'$Link')) {
        print "URL: $$data<br>\n";
      }
    }

Output handlers have an Output function that accepts data and formats it for output.

CHOOSE YOUR NEWS SOURCE

Usually finding news to be viewed every day is easy. But finding the best format for that information may not be. For example, the National Weather Service weather is available in many formats, but some users prefer "raw" NWS text, which is easier to parse. Conversely, others might prefer pictures, which means they should find a weather site that prints the NWS weather along with images.

Sometimes sites have "low graphics" versions of their web pages. This can be used to grab the information of interest, and then filter the results to have them point to the "high graphics" web pages. For example, some news sites simply use "low" or "hi" in the URL to distinguish the two types of web pages. Fetching the low graphics links and then replacing "low" with "hi" may be easier than extracting the links directly from the high graphics web page.

Other web sites actually provide a special back-end specifically designed to make the job of extracting information easier.
They realize that the more people link to their sites and use their content, the more traffic they generate. This is the exception more than the rule.

One special type of back-end is one that is based on XML, a standard for interchange of data. The rss and rdf input handlers are designed to allow you to easily extract links from sites that use XML, and they significantly expand the set of news sources available to News Clipper. Lists of sites that export XML files with news information are located at http://www.xmltree.com/, http://static.userland.com/myUserLandServices/serviceList2.xml/, and http://theweb.startshere.net/channels.phtml/. See the documentation of the rdf and rss handlers for more information.

When deciding to build a handler for a specific site, avoid creating individual handlers for the different departments of a web site. Instead, try to exploit commonality in the web pages. Create one handler with a "source" attribute that allows people to select the department they want.

STANDARDS

In order to keep the quality of handlers high, please follow these guidelines:

- Every data element (and internal data element of a complex data structure) is a reference: Get and Filter functions return references to strings, arrays, and hashes, and arrays and hashes contain references.
- Whenever possible, provide defaults for any attributes. The handler should generate output when run as "NewsClipper -e yourhandler". (This is how the example web page for the handler will be generated when you submit the handler to the database.)
- When defaults are not possible, print an error message from within ProcessAttributes and return undef.
- If there is a problem, use the error function to log the error, and do not insert visible text into the output.
- While an acquisition handler can operate as a filter or output handler, try to avoid writing a Filter or Output function unless the data structure is totally unique. If the data structure is unique, provide filters that let people translate it into common data structures. For example, the slashdot handler returns an array of hashes, and has a Filter function that can convert data of the type "array of Slashdot hashes" to "array of links".
- If writing a filter, be careful to change the output type if necessary. For example, if the filter turns an array of hashes into an array of links, "bless $link,'Link'" for each link in the array before returning the data. (But don't bless the array as "ArrayOfLinks" or anything like that.)
- Try to use the other built-in filters whenever possible. (See the uexpresscomic handler for an example.) News Clipper has a helper function called RunHandler (see below) which you can use to invoke other handlers to process the data.
- Always include the line "return undef unless defined $data" after a call to GetHtml, GetText, etc., since these functions return undef when they fail, and you should too.
- If a web site has common formatting, consider using a "source" parameter to choose among the different data types. (See the maximumpc handler, for example.)
- Always return "clean" HTML without unopened or unclosed tags, such as a <b> with no matching </b>. See TrimOpenTags, as well as StripTags.
- Only rarely is it necessary to use GetUrl to grab HTML, because it doesn't make links absolute. Use GetHtml($url,'^','$') instead.
- Try to specify the beginning of document ("^") and end of document ("$") for the start and end patterns of the acquisition functions whenever possible. Experience has shown that when handlers break, it's usually because the start or end pattern doesn't work anymore. A good strategy is to use "^" and "$" to grab everything on the page, identify something that is unique about the links or other data you are trying to capture, and weed out the results that do not match.
- When checking attributes, do something like if (lc($attributes->{'source'}) eq 'headlines') to make the attribute case insensitive.
- Try to make regular expressions robust. Generally, the longer the pattern, the more chance it will fail. Also, try to store matches in variables one at a time. If you try to match many items with one pattern, they will all fail if the pattern does not match.
- Every filter must have a FilterType function that returns a type specifier that says what types of input it can handle. Likewise, any handler with an Output function must also have an OutputType function.

CHOOSING THE ACQUISITION FUNCTION

First, find the web page that has the data. Then, decide what type of information News Clipper will retrieve, which will indicate which acquisition function to use:

- GetUrl: Grabs all the content from a URL, in totally raw form. Usually this is used to grab a text file.
- GetText: Grabs text data from a block of HTML, stripping HTML tags out.
- GetHtml: Grabs a block of HTML from a URL's content, making links absolute rather than relative.
- GetImages: Grabs images from a block of HTML, making links absolute.
- GetLinks: Grabs hyperlinks from a block of HTML, and makes them absolute.

CHOOSING THE STARTING AND ENDING PATTERNS

Suppose GetLinks is used to get the links off a web page, and only a few of the links are wanted. Consider the following HTML:

    <html>
    <head><title>Title</title></head>
    <body>
    <p>This is some text
    An <a href="unwanted.html">unwanted link</a>.
    <!-- Insert links here -->
    <a href="/news/somewhere.html">Somewhere</a><br>
    <a href="/news/somewhereelse.html">Somewhere Else</a><br>
    <!-- End links -->
    An <a href="mailto:webmaster@asdf.com">email link</a>.
    </p>
    </body>
    </html>

If the HTML designers were nice enough to use the comments shown, simply use "Insert links here -->" and "<!-- End links" as the start and end patterns. Otherwise, some other marker text will need to be found.

The starting and ending patterns are expressed as Perl regular expressions. For example, "." matches any single character, and "a.*b" matches an "a" followed by any number of characters (including an "a" or "b") followed by a "b". For more information about Perl's support for regular expressions, run perldoc perlre.

Each of the acquisition functions works by first searching for the first match of the start pattern, and then searching for the first match of the end pattern after the start pattern. To find good starting and ending patterns, try the following:

- Use the "View Source" option of the web browser to see the HTML. Search for the information News Clipper should grab.
- Look for something right above the information that can be used as the start pattern. (The GetLinks and GetImages functions are less particular, since they generally ignore any extra items at the beginning of the grabbed data.)
- If it is a simple bit of text, and not a full-blown regular expression, scroll to the top of the HTML and use the browser's find feature to see if the chosen pattern shows up earlier in the HTML.
- Now go to the end of the content to be grabbed and find a good ending pattern. As before, use the browser's find feature to make sure the end pattern does not show up somewhere in the middle of the content to be grabbed.

Unfortunately, start and end patterns often change on web sites, which means your handler will break more easily. A better strategy is to see if the links have anything in common, fetch every link in the web page, and then filter out the ones we don't want. For example, in the above sample web page, we see that all the interesting links have the string "/news/" in them. As a result, we can tell GetLinks to get every link on the page:

    my $data = GetLinks($url,'^','$');

and then extract only the ones with "/news/" in them:

    @$data = grep { $$_ =~ /\/news\//i } @$data;

Be sure to use "(?i)" at the start of patterns to make them case insensitive. "\n" can be used to indicate a newline in the pattern.
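The two strategies above can be tried outside News Clipper with plain Perl. The sketch below is an approximation under stated assumptions: between_patterns is a hypothetical helper that mimics the "first start match, then first end match after it" rule, and the regular expression used to pull out anchors is a simple stand-in for GetLinks (it does not make URLs absolute).

```perl
use strict;
use warnings;

my $html = <<'EOF';
<p>An <a href="unwanted.html">unwanted link</a>.
<!-- Insert links here -->
<a href="/news/somewhere.html">Somewhere</a><br>
<a href="/news/somewhereelse.html">Somewhere Else</a><br>
<!-- End links -->
An <a href="mailto:webmaster@asdf.com">email link</a>.</p>
EOF

# Hypothetical helper (not a News Clipper function): return the text between
# the first match of $start and the first match of $end that follows it.
sub between_patterns {
    my ($text, $start, $end) = @_;
    return undef unless $text =~ /$start/g;   # pos() now sits after the start match
    my $from = pos($text);
    return undef unless $text =~ /$end/g;     # search resumes from pos()
    my $to = $-[0];                            # offset where the end match begins
    return substr($text, $from, $to - $from);
}

# Strategy 1: rely on marker text around the wanted links.
my $block = between_patterns($html, 'Insert links here -->', '<!-- End links');
my @links = map { my $copy = $_; \$copy } ($block =~ /(<a\b.*?<\/a>)/gis);

# Strategy 2: take every link on the page, then keep those with "/news/".
my $data = [ map { my $copy = $_; \$copy } ($html =~ /(<a\b.*?<\/a>)/gis) ];
@$data = grep { $$_ =~ m{/news/}i } @$data;

print "marker-based: $$_\n" for @links;
print "filter-based: $$_\n" for @$data;
```

On this sample both strategies select the same two links, but only the second keeps working if the site removes or rewords its marker comments.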
GETUPDATETIMES

To specify when the server updates its information, add a GetUpdateTimes function. This function tells News Clipper when to refresh its cached data. For example, when you are making a handler for a daily comic, consider using "7" if the comic changes at 6 am PST every day.

It is important to set the times as close as possible to the actual update time. For example, if the data gets updated at 1am PST, and the update is set to 3am PST, visitors in England will have stale data between 9am and 11am. On the other hand, setting the time too early could mean getting the data from the day before. It pays to be a little conservative here: specifying every hour of the day means lots of people will be hitting the remote server when they probably are not even looking at their News Clipper webpage.

Date specifications are of the form "[day] hour,hour,hour [time zone]". If the day is omitted, every day is assumed. If the time zone is omitted, Pacific Standard Time is assumed. If no GetUpdateTimes function is provided, the default of "2,5,8,11,14,17,20,23 PST" is used. The days are: sun, mon, tue, wed, thu, fri, sat. Multiple times can be specified, for example:

    sub GetUpdateTimes {
      return ['1 EST','mon 6,8 EST','tue 16 CST','20'];
    }

will update every day at 1am EST, Mondays at 6am and 8am EST, Tuesdays at 4pm CST, and every day at 8pm PST.

You can specify "always" as the update time to have News Clipper update the data every time it is run. However, please do not do this unless the data being queried really does change by the minute. Headlines, for example, do not change constantly, and specifying "always" will cause News Clipper to hit the remote server repeatedly, causing the system administrators to send you a nasty email.

OTHER FUNCTIONALITY

News Clipper provides additional functionality for handler writers to help manipulate acquired data.
This includes the following functions:

- RunHandler($handlerName,$handlerType,$data,$attributes): runs the handler specified by $handlerName as the type specified by $handlerType ("input", "filter", or "output"). $data is used to pass the data to the filter or output handler to be run; use undef if the handler to be run is an input handler. Finally, $attributes should contain the hash of attributes for the handler.
- TypesMatch($data,$typeSignature): compares the actual type signature of $data to the type signature specified by $typeSignature. The function is useful for filter and output handlers that accept more than one kind of data.
- MakeSubtype($subType,$baseType): makes one type a subtype of another. For example, a handler that creates a hash of information from a website should call "MakeSubtype('Name', 'HASH');" to let News Clipper know that data of type "Name" can be used wherever a hash is expected. As the hashes are created, they should be declared as being of type Name by calling bless: "bless \%hash, 'Name';".
- error($text): logs an error that will be sent to the output as an HTML comment after the News Clipper command has completed execution.
- ExtractText($text,$beginPattern,$endPattern): extracts text between the beginning and ending patterns.
- MakeLinksAbsolute($baseurl,$html): finds all "a href=" and "img src=" tags in the HTML and makes the URLs absolute.
- EscapeHTMLChars($text): escapes all "<", ">", and "&" characters in the text.
- StripTags($html,'tag1','tag2',...): removes the specified HTML tags from the HTML. This is normally used to remove formatting. The default tags are "strong", "h1", "h2", "h3", "h4", "h5", "h6", "b", "i", "u", "tt", "font", "big", "small", and "strike".
- StripAttributes($html,'att1','att2',...): removes attributes from HTML tags. By default, the attributes are "alt" and "class".
- HTMLsubstr($html,$offset,$length): extracts a substring from the HTML, counting only the non-tag characters. Also removes open tags from the beginning and end.
- TrimOpenTags($html,'tag1','tag2',...): removes open tags from the beginning and end of a block of HTML. By default the tags are every possible HTML tag.
- GetAttributeValue($html,'tag1','attr'): searches the HTML for a tag and an attribute, and returns the value of the attribute for the first tag encountered. Returns undef if the value can't be found.

ADDITIONAL PROCESSING

Additional processing may be needed at the end of the Get function. For example, text can be split into several segments and stored in an array, or just the third image returned from GetImages could be used. Keep in mind that the GetUrl, GetText, and GetHtml functions return a reference to text, and the others (GetImages, GetLinks) return a reference to an array. Use $$data to manipulate the text, and @$data to manipulate the array. Here are some examples:

- @$data = grep { $$_ =~ /$pattern/ } @$data; weeds out all elements in the @$data array that do not match pattern $pattern. This is useful for removing all URLs from GetLinks that do not match a pattern.
- @$data = grep { $$_ =~ /\d{5}\.gif/i } @$data; weeds out any images whose names do not contain five digits followed by ".gif".
- @$data = grep { $$_ !~ /$pattern/ } @$data; weeds out all elements in the @$data array that match pattern $pattern. This is useful for removing all URLs from GetLinks that match a pattern.
- $$_ =~ s/$pattern1/$pattern2/ foreach @$data; replaces $pattern1 with $pattern2 in every member of @$data. (Wrapping the substitution in grep instead would also drop every member that does not contain $pattern1.)
- $$data =~ s/sometext/othertext/gsi; replaces sometext with othertext everywhere it appears. The "g" in gsi means do it for all occurrences. The "s" lets "." match newlines. The "i" means ignore case when matching sometext.

When referring to attributes that a user may set in the tag, refer to them as $attributes->{X}. Set a logical default in ProcessAttributes in case the user does not specify the attribute.

DEFAULT FILTER AND OUTPUT HANDLERS

MakeHandler tries to make a logical guess for the default filter and output functions.
Only create a Filter and Output function if the data type being returned from Get is nonstandard. Do not restrict the user's ability to manipulate the data and output on their own. If filters are used, they should convert from the special data type to a standard one, like a string or array. See the freshmeat and slashdot handlers for examples.

Be sure that the handler works when run as below:

    NewsClipper -e yourhandler

This is how the example file will be generated by our server for users who want to view the sample output.

SUBMITTING YOUR HANDLER

Once the handler is finished, please consider submitting it to the News Clipper database. It will then be available for other people to use and enjoy. For instructions on how to submit, visit the handler submission service webpage.
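As a recap of the update-time format described under GETUPDATETIMES, "[day] hour,hour,hour [time zone]", the following plain-Perl sketch splits a specification into its parts and applies the documented defaults (every day, PST). The function parse_update_spec is hypothetical and is not how News Clipper itself reads these strings; it is only meant for checking a specification by eye before putting it in a handler.

```perl
use strict;
use warnings;

# Hypothetical parser, not News Clipper's own code: split one update-time
# specification of the form "[day] hour,hour,hour [time zone]".
sub parse_update_spec {
    my ($spec) = @_;
    return { always => 1 } if lc($spec) eq 'always';

    my ($day, $hours, $zone) = $spec =~ /
        ^\s*
        (?: (sun|mon|tue|wed|thu|fri|sat) \s+ )?   # optional day
        ( \d{1,2} (?: \s*,\s* \d{1,2} )* )         # one or more hours
        (?: \s+ ([A-Za-z]+) )?                     # optional time zone
        \s*$
    /xi;
    return undef unless defined $hours;

    return {
        day   => defined $day  ? lc($day)  : 'every day',   # documented default
        hours => [ split /\s*,\s*/, $hours ],
        zone  => defined $zone ? uc($zone) : 'PST',         # documented default
    };
}

for my $spec ('1 EST', 'mon 6,8 EST', 'tue 16 CST', '20') {
    my $p = parse_update_spec($spec);
    printf "%-12s => day=%s hours=%s zone=%s\n",
        $spec, $p->{day}, join('/', @{ $p->{hours} }), $p->{zone};
}
```

A specification that does not fit the format makes the sketch return undef, which is a convenient way to catch typos such as a misspelled day name.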

MAKEHANDLER TUTORIAL
OVERVIEW

MakeHandler is a "wizard" that steps you through a series of questions, generating a handler based on your responses. The questions you are asked depend partly on the responses you give to earlier questions.

MakeHandler lets you enter most information using your favorite text editor. MakeHandler asks certain questions, then invokes the text editor with a file for you to edit. Responses should be kept between the prompt arrows located in the file. The first question MakeHandler asks is the name of the text editor to use. On Windows, notepad is a good choice. On Unix, pico works well.

After all of the necessary questions have been answered, MakeHandler will write a handler to the disk. You can then copy the handler to your $home/.NewsClipper/NewsClipper/Handler/Acquisition directory and begin using it.

Step 1: General Information

MakeHandler begins by asking for some general information:

- name: Your name. You will be indicated as the creator and maintainer of the handler.
- email: Your email address.
- handler name: The handler's name. This is a lowercase string without spaces or non-alphanumeric characters.
- handler URL: A URL that best shows where the data comes from.
- language: The language of the fetched data.
- license: The license for the generated handler code. This describes how others can modify, copy, or otherwise use your handler.
- category: The category helps people find your handler while browsing on the website.

Step 2: Specifying the URL

The URL from which the handler should fetch information may depend on an attribute that the user supplies. For example, many news sites have different sections on different pages. You may wish to have an attribute for the handler (e.g. "source") that determines the URL.

Answering "no" to the question "Will your URL depend on a parameter" will cause MakeHandler to ask you for the URL of the website. Answering "yes" will cause MakeHandler to ask you for the attribute name, and for a list of the possible values and their associated URLs. MakeHandler will also ask you for the default value of the attribute. Make sure the default value is one of the possible attribute values you specified.

Step 3: Choosing An Acquisition Function

The next step is to choose the acquisition function based on the type of information to be acquired. See the section entitled Choosing the Acquisition Function above.

Step 4: Specifying Starting and Ending Patterns

Depending on your choice, you may be asked to supply a starting and ending pattern. See the section entitled Choosing the Starting and Ending Patterns above for more information on patterns.

Some web sites place different types of information on the same web page. In this case, you can fetch information from the web page based on a handler attribute. For example, you may want to fetch links from the "recent news" section or the "older news" section of a web page.

Answering "no" to the question "Do you want the grabbed data to depend on a parameter" will cause MakeHandler to ask you for a single pair of starting and ending patterns. Answering "yes" will cause MakeHandler to ask you for the attribute name, and for a list of the possible values and their associated starting and ending patterns. MakeHandler will also ask you for the default value of the attribute. Make sure the default value is one of the possible attribute values you specified.

Step 5: Specifying Update Times

As the handler writer, you specify the times when News Clipper should fetch new data from the remote website. See the section entitled GetUpdateTimes above for more information on update times.

Step 6: Manual Editing of the Handler

The final step is to edit the handler file manually, adding any additional processing. News Clipper will load the new handler into the text editor, adding some additional commentary that will be stripped out when the final handler file is written. You can add additional processing to extract the data you want from the fetched HTML, or filter out certain types of links, etc. See the section entitled Additional Processing above for more information.
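As an illustration of the kind of processing that might be added by hand in Step 6, the sketch below works on a hand-built stand-in for the value GetLinks returns: a reference to an array of references to strings. The sample links and the "/low/" to "/hi/" rewrite are made up for the example; in a real handler these lines would sit inside Get, after the "return undef unless defined $data" check.

```perl
use strict;
use warnings;

# Made-up sample data standing in for the result of GetLinks.
my @raw = (
    '<a href="http://example.com/low/a.html">Story A</a>',
    '<a href="http://example.com/low/b.html"><img src="b.gif"></a>',
    '<a href="http://example.com/about.html">About</a>',
);
my $data = [ map { my $copy = $_; \$copy } @raw ];

# Weed out links that wrap an image.
@$data = grep { $$_ !~ /<img/i } @$data;

# Point the remaining links at the "high graphics" pages. A plain loop keeps
# elements that do not contain "/low/"; grep with s/// would drop them.
$$_ =~ s{/low/}{/hi/}i for @$data;

print "$$_\n" for @$data;
```

The substitution edits each string through its reference, so the array that Get eventually returns contains the rewritten links without any copying back.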