Class HTMLParser

java.lang.Object
org.apache.jmeter.protocol.http.parser.BaseParser
org.apache.jmeter.protocol.http.parser.HTMLParser
All Implemented Interfaces:
LinkExtractorParser
Direct Known Subclasses:
JsoupBasedHtmlParser, LagartoBasedHtmlParser

public abstract class HTMLParser extends BaseParser
HTMLParser subclasses can parse HTML content to obtain URLs.
  • Field Details

  • Constructor Details

    • HTMLParser

      protected HTMLParser()
      Protected constructor to prevent instantiation except from within subclasses.
  • Method Details

    • getEmbeddedResourceURLs

      public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, String encoding) throws HTMLParseException
      Get the URLs of all the resources that a browser would automatically download after downloading the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.

      URLs should not appear twice in the returned iterator.

      Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.

      Parameters:
      userAgent - User Agent
      html - HTML code
      baseUrl - Base URL from which the HTML code was obtained
      encoding - Charset
      Returns:
      an Iterator for the resource URLs
      Throws:
      HTMLParseException - when parsing the html fails
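The "no duplicates" requirement above can be illustrated with a minimal, self-contained sketch. It is not JMeter code: the class DedupSketch and the method resolveUnique are hypothetical names, and java.net.URI is used here in place of URL purely to keep the example simple. It resolves each reference against the base address and keeps only the first occurrence, in document order.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of JMeter: shows one way to return each
// embedded resource only once while preserving document order.
final class DedupSketch {

    static Iterator<URI> resolveUnique(URI base, List<String> refs) {
        // LinkedHashSet drops repeats and keeps insertion order.
        Set<URI> seen = new LinkedHashSet<>();
        for (String ref : refs) {
            // Relative references are resolved against the page address.
            seen.add(base.resolve(ref));
        }
        return seen.iterator();
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/dir/page.html");
        List<String> refs = Arrays.asList("img/a.png", "/style.css", "img/a.png");
        Iterator<URI> it = resolveUnique(base, refs);
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```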
    • getEmbeddedResourceURLs

      public abstract Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding) throws HTMLParseException
      Get the URLs of all the resources that a browser would automatically download after downloading the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.

      All URLs should be added to the Collection.

      Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.

      N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.

      Parameters:
      userAgent - User Agent
      html - HTML code
      baseUrl - Base URL from which the HTML code was obtained
      coll - URLCollection
      encoding - Charset
      Returns:
      an Iterator for the resource URLs
      Throws:
      HTMLParseException - when parsing the html fails
    • getEmbeddedResourceURLs

      public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, Collection<URLString> coll, String encoding) throws HTMLParseException
      Get the URLs of all the resources that a browser would automatically download after downloading the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.

      N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.

      Parameters:
      userAgent - User Agent
      html - HTML code
      baseUrl - Base URL from which the HTML code was obtained
      coll - Collection - will contain URLString objects, not URLs
      encoding - Charset
      Returns:
      an Iterator for the resource URLs
      Throws:
      HTMLParseException - when parsing the html fails
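The note that the Iterator returns URLs while the Collection holds a different element type can be pictured as an iterator that converts elements on demand. The sketch below is only an illustration under stated assumptions: plain String values stand in for URLString objects, and UrlViewSketch and asUrlIterator are hypothetical names, not JMeter API.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;

// Hypothetical adapter, not part of JMeter: the backing collection stores
// string forms, and the iterator converts each one to a URL when asked.
final class UrlViewSketch {

    static Iterator<URL> asUrlIterator(Collection<String> stored) {
        final Iterator<String> it = stored.iterator();
        return new Iterator<URL>() {
            @Override
            public boolean hasNext() {
                return it.hasNext();
            }

            @Override
            public URL next() {
                try {
                    // Assumption: stored values are absolute URLs.
                    return URI.create(it.next()).toURL();
                } catch (MalformedURLException e) {
                    throw new IllegalStateException(e);
                }
            }
        };
    }

    public static void main(String[] args) {
        Iterator<URL> urls = asUrlIterator(Arrays.asList("http://example.com/a.png"));
        while (urls.hasNext()) {
            System.out.println(urls.next());
        }
    }
}
```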
    • isEnableConditionalComments

      protected static boolean isEnableConditionalComments(Float ieVersion)
      Parameters:
      ieVersion - Float IE version
      Returns:
      true if the IE version is lower than 10
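As a rough illustration of the documented contract, the check could look like the sketch below. The class name ConditionalCommentsSketch and the constant IE_10 are assumptions, and so is the handling of a null version (treated here as "not IE", so false); the documentation above only states the "lower than 10" rule.

```java
// Hypothetical stand-alone version of the documented check, not JMeter source.
final class ConditionalCommentsSketch {

    // Assumption: the threshold is IE 10, as stated in the Returns text.
    private static final float IE_10 = 10.0f;

    static boolean isEnableConditionalComments(Float ieVersion) {
        if (ieVersion == null) {
            // Assumption: no IE version means conditional comments are ignored.
            return false;
        }
        return ieVersion < IE_10;
    }

    public static void main(String[] args) {
        System.out.println(isEnableConditionalComments(9.0f));
        System.out.println(isEnableConditionalComments(10.0f));
        System.out.println(isEnableConditionalComments(null));
    }
}
```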
    • extractIEVersion

      protected Float extractIEVersion(String userAgent)
      Parameters:
      userAgent - User Agent
      Returns:
      the version number that follows "MSIE" in the User Agent, or null if the User Agent is not IE
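One plausible way to obtain the version that follows "MSIE" is a regular expression over the User Agent string. The sketch below is an assumption about how such an extraction might work, not the actual JMeter implementation; IeVersionSketch and the pattern are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone extraction, not JMeter source.
final class IeVersionSketch {

    // Assumption: the version appears as "MSIE <major>.<minor>".
    private static final Pattern MSIE = Pattern.compile("MSIE ([0-9]+\\.[0-9]+)");

    static Float extractIEVersion(String userAgent) {
        if (userAgent == null) {
            return null;
        }
        Matcher m = MSIE.matcher(userAgent);
        // null signals "not IE", matching the documented return value.
        return m.find() ? Float.valueOf(m.group(1)) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractIEVersion("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1)"));
        System.out.println(extractIEVersion("Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"));
    }
}
```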
    • normalizeUrlValue

      protected static String normalizeUrlValue(CharSequence url)
      Normalizes the URL value the way browsers do.
      Parameters:
      url - CharSequence
      Returns:
      normalized url
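The documentation does not spell out what the normalization involves. Browsers following the WHATWG URL parsing rules strip leading and trailing control characters and spaces, and remove tabs and newlines anywhere in the value; the sketch below assumes that behavior. NormalizeSketch is a hypothetical class, and returning null for null or blank input is likewise an assumption.

```java
// Hypothetical stand-alone normalization, not JMeter source.
final class NormalizeSketch {

    static String normalizeUrlValue(CharSequence url) {
        if (url == null) {
            return null;
        }
        int start = 0;
        int end = url.length();
        // Strip leading and trailing control characters and spaces.
        while (start < end && url.charAt(start) <= ' ') {
            start++;
        }
        while (end > start && url.charAt(end - 1) <= ' ') {
            end--;
        }
        if (start == end) {
            // Assumption: a blank value has no usable URL.
            return null;
        }
        StringBuilder sb = new StringBuilder(end - start);
        for (int i = start; i < end; i++) {
            char c = url.charAt(i);
            // Remove tabs and line breaks embedded in the value.
            if (c != '\t' && c != '\n' && c != '\r') {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalizeUrlValue("  http://example.com/a\n/b.png \t"));
    }
}
```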