HTML::SimpleLinkExtor - Extract links from HTML
use HTML::SimpleLinkExtor;
my $extor = HTML::SimpleLinkExtor->new();
$extor->parse_file($filename);
#--or--
$extor->parse($html);
$extor->parse_file($other_file); # get more links
$extor->clear_links; # reset the link list
# extract all of the links
@all_links = $extor->links;
# extract the img links
@img_srcs = $extor->img;
# extract the frame links
@frame_srcs = $extor->frame;
# extract the hrefs
@area_hrefs = $extor->area;
@a_hrefs = $extor->a;
@base_hrefs = $extor->base;
@hrefs = $extor->href;
# extract the body background link
@body_bg = $extor->body;
@background = $extor->background;
@links = $extor->scheme( 'http' );
This is a simple HTML link extractor designed for the person who does
not want to deal with the intricacies of HTML::Parser or the
de-referencing needed to get links out of HTML::LinkExtor.
You can extract all the links or some of the links (based on the HTML
tag name or attribute name). If a <BASE HREF> tag is found, all of the
relative URLs will be resolved according to that reference.
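As a rough illustration of the <BASE HREF> behavior described above, the following sketch parses a small in-memory document and prints the extracted links. It assumes HTML::SimpleLinkExtor is installed from CPAN; the example.com URLs and file names are placeholders invented for this example, and the call to eof is the usual HTML::Parser way to signal the end of the input.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical document: the host and paths are placeholders.
my $html = <<'HTML';
<html>
<head><base href="http://www.example.com/docs/"></head>
<body>
<a href="intro.html">Intro</a>
<img src="/images/logo.png">
</body>
</html>
HTML

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse($html);
$extor->eof;    # inherited from HTML::Parser: no more input

# The relative links should come back resolved against the base URL.
my @a_hrefs  = $extor->a;
my @img_srcs = $extor->img;

print "a:   $_\n" for @a_hrefs;
print "img: $_\n" for @img_srcs;
```

If the document had no <BASE HREF> tag and no base was passed to the constructor, the same calls would be expected to return the links as written in the HTML.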
This module is simply a subclass of HTML::LinkExtor, so it can
only parse what that module can handle. Invalid HTML or XHTML may
cause problems.
If you parse multiple files, the link list grows and contains the aggregate list of links for all of the files parsed. If you want to reset the link list between files, use the clear_links method.
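A minimal sketch of that accumulation, assuming HTML::SimpleLinkExtor is installed. For brevity it feeds in-memory strings to parse rather than files to parse_file; the three snippets and their example.com URLs are made up for illustration.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical inputs standing in for separate files.
my $first  = '<a href="http://www.example.com/one">1</a>'
           . '<a href="http://www.example.com/two">2</a>';
my $second = '<img src="http://www.example.com/three.png">';
my $third  = '<a href="http://www.example.com/four">4</a>';

my $extor = HTML::SimpleLinkExtor->new();

# Parse two inputs in a row; the link list should hold links from both.
$extor->parse($first);
$extor->parse($second);
my @combined = $extor->links;
print "combined: ", scalar(@combined), " links\n";

# Reset the list before moving on to unrelated input.
$extor->clear_links;
$extor->parse($third);
my @fresh = $extor->links;
print "after clear_links: ", scalar(@fresh), " links\n";
```

Calling clear_links between inputs is only needed when the links must be kept apart; skip it to build one aggregate list.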
new()
new('')
new($base)
Create the link extractor object and do not resolve relative links.
LWP::UserAgent object.
HTML::SimpleLinkExtor keeps an internal list of HTML tags (such as
'a' and 'img') that have URLs as values. If you run into another tag
that this module doesn't handle, please send it to me and I'll add it.
Until then you can add that tag to the internal list. This affects
the entire class, including previously created objects.
HTML::SimpleLinkExtor keeps an internal list of HTML tag attributes
(such as 'href' and 'src') that have URLs as values. If you run into
another attribute that this module doesn't handle, please send it to
me and I'll add it. Until then you can add that attribute to the
internal list. This affects the entire class, including previously
created objects.
can()
can
that can tell which attributes are also methods.
HTML::SimpleLinkExtor uses to extract URLs. This affects the entire
class, including previously created objects.
HTML::SimpleLinkExtor uses to extract URLs. This affects the entire
class, including previously created objects.
HTML::SimpleLinkExtor pays attention to.
HTML::SimpleLinkExtor pays attention to.
These tags have convenience methods.
HTML::Parser.
$data. Inherited from HTML::Parser.
In list context it returns the links. In scalar context it returns the count of the matching links.
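The difference between the two contexts can be sketched as follows, assuming HTML::SimpleLinkExtor is installed. The img method is taken from the synopsis; the HTML snippet and its example.com URLs are invented for this example.

```perl
use strict;
use warnings;
use HTML::SimpleLinkExtor;

# Hypothetical input with two images and one anchor.
my $html = <<'HTML';
<p>
<img src="http://www.example.com/a.png">
<img src="http://www.example.com/b.png">
<a href="http://www.example.com/">home</a>
</p>
HTML

my $extor = HTML::SimpleLinkExtor->new();
$extor->parse($html);
$extor->eof;    # inherited from HTML::Parser: no more input

# List context: the matching links themselves.
my @img_srcs = $extor->img;

# Scalar context: only the number of matching links.
my $img_count = $extor->img;

print "found $img_count img links\n";
print "  $_\n" for @img_srcs;
```

The scalar form is convenient when only a count is needed, for example to check whether a page contains any links of a given kind.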
This module doesn't handle all of the HTML tags that might
have links. If someone wants those, I'll add them, or you
can edit %AUTO_METHODS in the source.
Will Crain who identified a problem with IMG links that had a USEMAP attribute.
This module is on GitHub:
https://github.com/briandfoy/html-simplelinkextor
brian d foy, <bdfoy@cpan.org>
Copyright (c) 2004-2014 brian d foy. All rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.