java.lang.Object

org.apache.jmeter.protocol.http.parser.BaseParser

org.apache.jmeter.protocol.http.parser.HTMLParser

All Implemented Interfaces: LinkExtractorParser

Direct Known Subclasses: JsoupBasedHtmlParser, LagartoBasedHtmlParser

public abstract class HTMLParser extends BaseParser

HTMLParser subclasses can parse HTML content to obtain URLs.

Field Summary

Fields

| Modifier and Type | Field |
| --- | --- |
| protected static final String | ATT_ARCHIVE |
| protected static final String | ATT_BACKGROUND |
| protected static final String | ATT_CODE |
| protected static final String | ATT_CODEBASE |
| protected static final String | ATT_DATA |
| protected static final String | ATT_HREF |
| protected static final String | ATT_IS_IMAGE |
| protected static final String | ATT_REL |
| protected static final String | ATT_SRC |
| protected static final String | ATT_STYLE |
| protected static final String | ATT_TYPE |
| static final String | DEFAULT_PARSER |
| protected static final String | ICON |
| protected static final String | IE_UA |
| protected static final Pattern | IE_UA_PATTERN |
| static final String | PARSER_CLASSNAME |
| protected static final String | PRELOAD |
| protected static final String | SHORTCUT_ICON |
| protected static final String | STYLESHEET |
| protected static final String | TAG_APPLET |
| protected static final String | TAG_BASE |
| protected static final String | TAG_BGSOUND |
| protected static final String | TAG_BODY |
| protected static final String | TAG_EMBED |
| protected static final String | TAG_FRAME |
| protected static final String | TAG_IFRAME |
| protected static final String | TAG_IMAGE |
| protected static final String | TAG_INPUT |
| protected static final String | TAG_LINK |
| protected static final String | TAG_OBJECT |
| protected static final String | TAG_SCRIPT |

Constructor Summary

Constructors

| Modifier | Constructor | Description |
| --- | --- | --- |
| protected | HTMLParser() | Protected constructor to prevent instantiation except from within subclasses. |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| protected Float | extractIEVersion(String userAgent) | |
| Iterator<URL> | getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, String encoding) | Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc. |
| Iterator<URL> | getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, Collection<URLString> coll, String encoding) | Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc. |
| abstract Iterator<URL> | getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding) | Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc. |
| protected static boolean | isEnableConditionalComments(Float ieVersion) | |
| protected static String | normalizeUrlValue(CharSequence url) | Normalizes the URL as browsers do. |

Methods inherited from class org.apache.jmeter.protocol.http.parser.BaseParser
getParser, isReusable

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- ATT_ARCHIVE
  
  protected static final String ATT_ARCHIVE
  See Also:
  
  Constant Field Values
- ATT_BACKGROUND
  
  protected static final String ATT_BACKGROUND
  See Also:
  
  Constant Field Values
- ATT_CODE
  
  protected static final String ATT_CODE
  See Also:
  
  Constant Field Values
- ATT_CODEBASE
  
  protected static final String ATT_CODEBASE
  See Also:
  
  Constant Field Values
- ATT_DATA
  
  protected static final String ATT_DATA
  See Also:
  
  Constant Field Values
- ATT_HREF
  
  protected static final String ATT_HREF
  See Also:
  
  Constant Field Values
- ATT_REL
  
  protected static final String ATT_REL
  See Also:
  
  Constant Field Values
- ATT_SRC
  
  protected static final String ATT_SRC
  See Also:
  
  Constant Field Values
- ATT_STYLE
  
  protected static final String ATT_STYLE
  See Also:
  
  Constant Field Values
- ATT_TYPE
  
  protected static final String ATT_TYPE
  See Also:
  
  Constant Field Values
- ATT_IS_IMAGE
  
  protected static final String ATT_IS_IMAGE
  See Also:
  
  Constant Field Values
- TAG_APPLET
  
  protected static final String TAG_APPLET
  See Also:
  
  Constant Field Values
- TAG_BASE
  
  protected static final String TAG_BASE
  See Also:
  
  Constant Field Values
- TAG_BGSOUND
  
  protected static final String TAG_BGSOUND
  See Also:
  
  Constant Field Values
- TAG_BODY
  
  protected static final String TAG_BODY
  See Also:
  
  Constant Field Values
- TAG_EMBED
  
  protected static final String TAG_EMBED
  See Also:
  
  Constant Field Values
- TAG_FRAME
  
  protected static final String TAG_FRAME
  See Also:
  
  Constant Field Values
- TAG_IFRAME
  
  protected static final String TAG_IFRAME
  See Also:
  
  Constant Field Values
- TAG_IMAGE
  
  protected static final String TAG_IMAGE
  See Also:
  
  Constant Field Values
- TAG_INPUT
  
  protected static final String TAG_INPUT
  See Also:
  
  Constant Field Values
- TAG_LINK
  
  protected static final String TAG_LINK
  See Also:
  
  Constant Field Values
- TAG_OBJECT
  
  protected static final String TAG_OBJECT
  See Also:
  
  Constant Field Values
- TAG_SCRIPT
  
  protected static final String TAG_SCRIPT
  See Also:
  
  Constant Field Values
- STYLESHEET
  
  protected static final String STYLESHEET
  See Also:
  
  Constant Field Values
- SHORTCUT_ICON
  
  protected static final String SHORTCUT_ICON
  See Also:
  
  Constant Field Values
- ICON
  
  protected static final String ICON
  See Also:
  
  Constant Field Values
- PRELOAD
  
  protected static final String PRELOAD
  See Also:
  
  Constant Field Values
- IE_UA
  
  protected static final String IE_UA
  See Also:
  
  Constant Field Values
- IE_UA_PATTERN
  
  protected static final Pattern IE_UA_PATTERN
- PARSER_CLASSNAME
  
  public static final String PARSER_CLASSNAME
  See Also:
  
  Constant Field Values
- DEFAULT_PARSER
  
  public static final String DEFAULT_PARSER
  See Also:
  
  Constant Field Values
Constructor Details
- HTMLParser
  
  protected HTMLParser()
  
  Protected constructor to prevent instantiation except from within subclasses.
Method Details
- getEmbeddedResourceURLs
  
  public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, String encoding) throws HTMLParseException
  
  Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.
  URLs should not appear twice in the returned iterator.
  Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.
  
  Parameters:
  
  userAgent - User Agent
  
  html - HTML code
  
  baseUrl - Base URL from which the HTML code was obtained
  
  encoding - Charset
  
  Returns:
  
  an Iterator for the resource URLs
  
  Throws:
  
  HTMLParseException - when parsing the html fails
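The uniqueness requirement described for this method can be illustrated with a small, self-contained sketch. It does not use any JMeter classes: `EmbeddedUrlSketch` and `resolveUnique` are hypothetical names, the base address is passed as a `URI` rather than a `URL` to keep the example free of checked exceptions at the call site, and the input list stands in for the attribute values a real parser would extract from the HTML. Unlike the documented contract, this sketch simply skips links it cannot resolve.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; not part of JMeter.
final class EmbeddedUrlSketch {

    // Resolves each raw link against the base and returns every URL at most once,
    // in order of first appearance.
    static Iterator<URL> resolveUnique(URI base, List<String> rawLinks) {
        // Duplicates are detected on the string form, which avoids the
        // host-resolving behaviour of URL.equals().
        Set<String> seen = new LinkedHashSet<>();
        List<URL> unique = new ArrayList<>();
        for (String raw : rawLinks) {
            try {
                URL resolved = base.resolve(raw).toURL();
                if (seen.add(resolved.toExternalForm())) {
                    unique.add(resolved);
                }
            } catch (IllegalArgumentException | MalformedURLException e) {
                // Assumption of this sketch: unresolvable links are skipped.
            }
        }
        return unique.iterator();
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/app/index.html");
        Iterator<URL> it = resolveUnique(base,
                List.of("img/logo.png", "/css/site.css", "img/logo.png"));
        while (it.hasNext()) {
            System.out.println(it.next());
        }
    }
}
```

The repeated `img/logo.png` entry is emitted only once, which is the behaviour a caller of the real method may rely on.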
- getEmbeddedResourceURLs
  
  public abstract Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, URLCollection coll, String encoding) throws HTMLParseException
  
  Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.
  All URLs should be added to the Collection.
  Malformed URLs can be reported to the caller by having the Iterator return the corresponding URL String. Overall problems parsing the HTML should be reported by throwing an HTMLParseException.
  N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
  
  Parameters:
  
  userAgent - User Agent
  
  html - HTML code
  
  baseUrl - Base URL from which the HTML code was obtained
  
  coll - URLCollection
  
  encoding - Charset
  
  Returns:
  
  an Iterator for the resource URLs
  
  Throws:
  
  HTMLParseException - when parsing the html fails
- getEmbeddedResourceURLs
  
  public Iterator<URL> getEmbeddedResourceURLs(String userAgent, byte[] html, URL baseUrl, Collection<URLString> coll, String encoding) throws HTMLParseException
  
  Get the URLs for all the resources that a browser would automatically download following the download of the HTML content, that is: images, stylesheets, JavaScript files, applets, etc.
  N.B. The Iterator returns URLs, but the Collection will contain objects of class URLString.
  
  Parameters:
  
  userAgent - User Agent
  
  html - HTML code
  
  baseUrl - Base URL from which the HTML code was obtained
  
  coll - Collection - will contain URLString objects, not URLs
  
  encoding - Charset
  
  Returns:
  
  an Iterator for the resource URLs
  
  Throws:
  
  HTMLParseException - when parsing the html fails
- isEnableConditionalComments
  
  protected static boolean isEnableConditionalComments(Float ieVersion)
  
  Parameters:
  
  ieVersion - Float IE version
  
  Returns:
  
  true if the IE version is lower than 10
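The documented return value suggests a simple threshold check. The following is a minimal sketch of that rule, not JMeter's source: the class name `ConditionalCommentsSketch` and the constant `IE_10` are hypothetical, and treating a `null` version as "not IE, so no conditional comments" is an assumption that this page does not state.

```java
// Hypothetical illustration only; not part of JMeter.
final class ConditionalCommentsSketch {

    // Assumed threshold, taken from the documented "lower than IE 10" rule.
    private static final float IE_10 = 10.0f;

    static boolean isEnableConditionalComments(Float ieVersion) {
        // Assumption: a null version means the user agent is not IE.
        return ieVersion != null && ieVersion < IE_10;
    }

    public static void main(String[] args) {
        System.out.println(isEnableConditionalComments(9.0f));
        System.out.println(isEnableConditionalComments(10.0f));
        System.out.println(isEnableConditionalComments(null));
    }
}
```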
- extractIEVersion
  
  protected Float extractIEVersion(String userAgent)
  
  Parameters:
  
  userAgent - User Agent
  
  Returns:
  
  the version number that follows "MSIE" in the User-Agent string, or null if the user agent is not IE
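As an illustration of extracting the number after "MSIE", here is a hedged, stand-alone sketch. The regular expression is an assumption made for this example, since the actual values of `IE_UA` and `IE_UA_PATTERN` are not shown on this page, and `IeVersionSketch` is a hypothetical class name.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of JMeter.
final class IeVersionSketch {

    // Assumed pattern: "MSIE", a space, then a major.minor version number.
    private static final Pattern MSIE = Pattern.compile("MSIE ([0-9]+\\.[0-9]+)");

    static Float extractIEVersion(String userAgent) {
        if (userAgent == null) {
            return null; // assumption: no User-Agent is treated as "not IE"
        }
        Matcher matcher = MSIE.matcher(userAgent);
        // Return the captured version, or null when "MSIE x.y" is absent.
        return matcher.find() ? Float.valueOf(matcher.group(1)) : null;
    }

    public static void main(String[] args) {
        System.out.println(extractIEVersion(
                "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)"));
        System.out.println(extractIEVersion(
                "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"));
    }
}
```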
- normalizeUrlValue
  
  protected static String normalizeUrlValue(CharSequence url)
  
  Normalizes the URL as browsers do.
  
  Parameters:
  
  url - CharSequence
  
  Returns:
  
  normalized url
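This page does not say which normalization steps are applied. As one plausible reading of "as browsers do", the sketch below follows the whitespace handling of the WHATWG URL Standard: ASCII tabs and newlines are removed everywhere, and leading and trailing spaces and control characters are stripped. This is an assumption, not a description of JMeter's implementation, and `NormalizeUrlSketch` is a hypothetical class name.

```java
// Hypothetical illustration only; not part of JMeter.
final class NormalizeUrlSketch {

    static String normalizeUrlValue(CharSequence url) {
        if (url == null) {
            return null; // assumption: null input is passed through unchanged
        }
        StringBuilder cleaned = new StringBuilder(url.length());
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            // Drop ASCII tab and newline characters wherever they occur.
            if (c != '\t' && c != '\n' && c != '\r') {
                cleaned.append(c);
            }
        }
        // String.trim() removes leading and trailing characters up to U+0020,
        // that is, spaces and C0 control characters.
        return cleaned.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(normalizeUrlValue(" \thttp://example.com/a\n.png "));
    }
}
```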
