Interface LinkExtractorParser

All Known Implementing Classes:
BaseParser, CssParser, HTMLParser, JsoupBasedHtmlParser, LagartoBasedHtmlParser

public interface LinkExtractorParser
Interface specifying the contract for a content parser that extracts links.
Since:
3.0
  • Method Summary

    Modifier and Type | Method | Description
    ----------------- | ------ | -----------
    Iterator<URL> | getEmbeddedResourceURLs(String userAgent, byte[] responseData, URL baseUrl, String encoding) | Get the URLs for all the resources that a browser would automatically download following the download of the content, that is: images, stylesheets, JavaScript files, applets, etc.
    boolean | isReusable() |
  • Method Details

    • getEmbeddedResourceURLs

      Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] responseData, URL baseUrl, String encoding) throws LinkExtractorParseException
      Get the URLs for all the resources that a browser would automatically download following the download of the content, that is: images, stylesheets, JavaScript files, applets, etc.

      URLs should not appear twice in the returned iterator.

      Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing a LinkExtractorParseException.

      Parameters:
      userAgent - User Agent
      responseData - Response data
      baseUrl - Base URL from which the HTML code was obtained
      encoding - Charset
      Returns:
      an Iterator for the resource URLs
      Throws:
      LinkExtractorParseException - when extracting the links fails
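
      The contract above can be illustrated with a minimal sketch. It is not taken from any of the known implementing classes. To stay self-contained it declares local stand-ins for LinkExtractorParser and LinkExtractorParseException that mirror the documented signatures, and the class names LinkExtractorSketch and SrcAttributeParser are hypothetical. The sketch assumes that only double-quoted src attributes matter and silently skips links that cannot be resolved, which is a simplification of the malformed-URL handling described above.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractorSketch {

    // Local stand-in for the documented exception (assumption, not the real class).
    static class LinkExtractorParseException extends Exception {
        LinkExtractorParseException(Throwable cause) {
            super(cause);
        }
    }

    // Local stand-in mirroring the two methods documented on this page.
    interface LinkExtractorParser {
        Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] responseData,
                URL baseUrl, String encoding) throws LinkExtractorParseException;

        boolean isReusable();
    }

    // Hypothetical parser: collects double-quoted src="..." attribute values only.
    static class SrcAttributeParser implements LinkExtractorParser {
        private static final Pattern SRC =
                Pattern.compile("\\bsrc\\s*=\\s*\"([^\"]+)\"", Pattern.CASE_INSENSITIVE);

        @Override
        public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] responseData,
                URL baseUrl, String encoding) throws LinkExtractorParseException {
            String html;
            URI base;
            try {
                // Decode the raw response with the charset named by the caller.
                html = new String(responseData, Charset.forName(encoding));
                base = baseUrl.toURI();
            } catch (IllegalArgumentException | URISyntaxException e) {
                // Overall failure (bad charset name or base URL): report by throwing.
                throw new LinkExtractorParseException(e);
            }

            Set<String> seen = new HashSet<>();   // de-duplicate on the textual form
            List<URL> urls = new ArrayList<>();   // preserves document order
            Matcher matcher = SRC.matcher(html);
            while (matcher.find()) {
                try {
                    URL resolved = base.resolve(matcher.group(1)).toURL();
                    if (seen.add(resolved.toExternalForm())) {
                        urls.add(resolved);
                    }
                } catch (IllegalArgumentException | MalformedURLException e) {
                    // Simplification: an unresolvable link is skipped in this sketch.
                }
            }
            return urls.iterator();
        }

        @Override
        public boolean isReusable() {
            // This sketch keeps no per-call state, so it reports itself as reusable.
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        String html = "<img src=\"a.png\"><script src=\"/js/app.js\"></script><img src=\"a.png\">";
        URL base = URI.create("http://example.com/dir/page.html").toURL();
        LinkExtractorParser parser = new SrcAttributeParser();
        Iterator<URL> it = parser.getEmbeddedResourceURLs(
                "sketch-agent", html.getBytes("UTF-8"), base, "UTF-8");
        while (it.hasNext()) {
            System.out.println(it.next());
        }
        // prints:
        // http://example.com/dir/a.png
        // http://example.com/js/app.js
    }
}
```

      The second occurrence of a.png is dropped, in line with the requirement that URLs not appear twice in the returned iterator. De-duplication uses the textual form of each URL rather than URL.equals, which may trigger host name resolution.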
    • isReusable

      boolean isReusable()