Mar 28, 2015 Tag: TYPO3

TYPO3 CMS: HTML Parsing

The TYPO3 core is full of useful, rock-solid functions. This post is an attempt to understand its HTML parsing capabilities better, and it shows how I make use of these core functions. It will be enhanced and updated as needed.

Updated on Apr 01, 2015

Cleaning HTML

Cleaning HTML with TypoScript and Fluid

This is a configuration for the HTMLparser in TypoScript style. It means: Take the current data and pass it through the HTMLparser function. Remove all tags except those listed at `allowTags`. Remove all attributes for tags listed at `noAttrib`. Definitely remove those specified at `removeTags`, overriding `allowTags`. Compare these settings with the HTMLparser section in the TypoScript reference:

mylib.cleanHtml = TEXT
mylib.cleanHtml {
   current = 1
   HTMLparser {
      allowTags  = b, br, h1, h2, h3, i, li, ol, p, u, ul
      noAttrib   = b, br, h1, h2, h3, i, li, ol, p, u, ul
      removeTags =
   }
}

Now you may use this parser in Fluid. You can expect that no HTML tags remain except <b>, <br>, <h1>, <h2>, <h3>, <i>, <li>, <ol>, <p>, <u>, <ul> and their closing tags, and that none of these tags will carry an attribute. So nobody can smuggle in an onclick="javascript:alert('hi');":

{htmlField -> f:cObject(typoscriptObjectPath:'mylib.cleanHtml')}

or <f:cObject typoscriptObjectPath="mylib.cleanHtml">{htmlField}</f:cObject>

or, for example, a textarea field:

<f:form.textarea property="htmlField" cols="40" rows="15"
    value="{htmlField -> f:cObject(typoscriptObjectPath:'mylib.cleanHtml')}" />
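
To see why the `noAttrib` list matters, it can help to compare with PHP's built-in strip_tags(). The following is a plain-PHP sketch, not TYPO3 code, and the `$input` string is made up for illustration. strip_tags() honours an allow list of tags but leaves the attributes of the allowed tags in place:

```php
<?php
// Made-up input for illustration only.
$input = '<p onclick="alert(1)">Hello <span>big</span> <b>world</b></p>';

// Keep only <p> and <b>; every other tag is stripped, its text content stays.
$stripped = strip_tags($input, '<p><b>');

echo $stripped, PHP_EOL;

// The onclick attribute survives, which is what noAttrib is meant to prevent.
echo (strpos($stripped, 'onclick') !== false) ? 'attribute kept' : 'attribute removed', PHP_EOL;
```

So an allow list of tags alone is not enough; the attributes have to be handled as well, which is the job of `noAttrib` in the configuration above.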

Cleaning HTML with PHP and configuration in TypoScript style

The HTMLparser_TSbridge method is a clever function that lets you use the HTML cleaner from within PHP with a TypoScript-style configuration:

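namespace Mbless\Project\Utility;

use TYPO3\CMS\Core\Utility\GeneralUtility;
use TYPO3\CMS\Frontend\ContentObject\ContentObjectRenderer;

/**
 * Class to process html
 */
class Html {
   /**
    * Usage: $cleanedHtml = Html::cleanOurHtml($content);
    *
    * @param string $content HTML to clean
    * @return string Cleaned HTML
    */
   public static function cleanOurHtml($content) {
      $conf = array(
         'allowTags' => 'b,br,h1,h2,h3,i,li,ol,p,u,ul',
         'noAttrib'  => 'b,br,h1,h2,h3,i,li,ol,p,u,ul',

         // if you only want to remove some tags and
         // you don't set 'allowTags'
         'removeTags' => '',
      );

      // HTMLparser_TSbridge() is declared as an instance method in the core,
      // so create a ContentObjectRenderer instead of calling it statically.
      /** @var ContentObjectRenderer $contentObject */
      $contentObject = GeneralUtility::makeInstance(
         'TYPO3\\CMS\\Frontend\\ContentObject\\ContentObjectRenderer'
      );
      return $contentObject->HTMLparser_TSbridge($content, $conf);
   }
}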

To use this function somewhere else in your project do:

use Mbless\Project\Utility\Html;

$content = '<body><p class="abc">Hello world!</p></body>';
$result = Html::cleanOurHtml($content);

// $result: '<p>Hello world!</p>'

Cleaning HTML with HTMLcleaner()

This chapter is mainly a rewrite of the existing documentation. Most of this is contained in the PhpDoc comment of the source code.

Compare:

  • API: TYPO3\CMS\Core\Html\HtmlParser::HTMLcleaner
  • API reference: Parsing HTML

HTMLcleaner() is a function that can clean up HTML content according to configuration given in the $tags array.

To initialize the $tags array to allow a list of tags (in this case <B>, <I>, <U> and <A>), set it like this:

$tags = array_flip(explode( ',', 'b,a,i,u' ));

print_r($tags);
Array
(
   [b] => 0
   [a] => 1
   [i] => 2
   [u] => 3
)
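
A TypoScript-style list such as `b, br, h1` contains spaces, and the tag names used as keys are expected in lowercase. Here is a minimal plain-PHP sketch of how such a list could be normalised before flipping it. The helper name `buildTagLookup()` is hypothetical and not part of the TYPO3 API:

```php
<?php
/**
 * Hypothetical helper (not part of TYPO3): turn a comma-separated tag list
 * into a key-based lookup array of the kind shown above.
 */
function buildTagLookup($tagList)
{
    // Split on commas, strip surrounding whitespace, lowercase every name.
    $names = array_map('strtolower', array_map('trim', explode(',', $tagList)));

    // Drop empty entries, for example those produced by a trailing comma.
    $names = array_values(array_filter($names, 'strlen'));

    // Tag names become the keys; the numeric values carry no meaning here.
    return array_flip($names);
}

$tags = buildTagLookup('B, a , i,u,');

// Checking whether a tag is allowed is a plain key lookup.
echo isset($tags['b']) ? 'b is allowed' : 'b is not allowed', PHP_EOL;
echo isset($tags['div']) ? 'div is allowed' : 'div is not allowed', PHP_EOL;
```

The point of array_flip() is exactly this: the tag names end up as array keys, so the lookup is an isset() call and the numeric values can be ignored.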

If the value of the $tags[$tagname] entry is an array, advanced processing of the tag is initialized. Here’s a visualisation of the options:

// Here we take the concrete tag 'b' as an example.

$tagname = 'b';
$tags[$tagname] = array();

// If set, this string is preset as the attributes of the tag
$tags['b']['overrideAttribs'] = ''; // example?

// '0' (zero) = no attributes allowed
$tags['b']['allowedAttribs' ] = '0';

// String with commalist of attributes: Allow only those attributes.
$tags['b']['allowedAttribs' ] = 'id,class';

// The empty string means: Allow all attributes.
$tags['b']['allowedAttribs' ] = '';

// Rules for fixing individual attributes, keyed by attribute name
$tags['b']['fixAttrib'] = array(); // array of attribute names

// each attribute has an array
$tags['b']['fixAttrib']['id'] = array();

// Force the attribute value to this value.
$tags['b']['fixAttrib']['id']['set'] = 'v';

// Boolean: If set, the attribute is unset.
$tags['b']['fixAttrib']['id']['unset'] = TRUE;

// If no attribute exists by this name, this value is set
// as default value, if this value is not blank
$tags['b']['fixAttrib']['id']['default'] = 'v';

// Boolean. If set, the attribute is always processed.
// Normally an attribute is processed only if it exists
$tags['b']['fixAttrib']['id']['always'] = TRUE;

// Boolean. If set, the value is passed through the PHP function trim().
$tags['b']['fixAttrib']['id']['trim'] = TRUE;

// Boolean. If set, the value is passed through the PHP function intval().
$tags['b']['fixAttrib']['id']['intval'] = TRUE;

// Boolean. If set, the value is passed through the PHP function strtolower().
$tags['b']['fixAttrib']['id']['lower'] = TRUE;

// Boolean. If set, the value is passed through the PHP function strtoupper().
$tags['b']['fixAttrib']['id']['upper'] = TRUE;

// Setting integer range.
$tags['b']['fixAttrib']['width']['range'] =
   array ('[low limit]','[high limit, optional]');

// Attribute must be in this list. If not, the value is set
// to the first element.
$tags['b']['fixAttrib']['width']['list'] =
   array ('[value1/default]', '[value2]', '[value3]');

// If set, then the attribute is removed if it is 'FALSE'.
$tags['b']['fixAttrib']['width']['removeIfFalse' ] = TRUE;

// If this value is set to 'blank' then the value must be a blank
// string (that means a 'zero' value will not be removed)
$tags['b']['fixAttrib']['width']['removeIfFalse' ] = 'blank';

// If the attribute value matches the value set here, then it is removed.
$tags['b']['fixAttrib']['width']['removeIfEquals'] = 'v';

// If set, then the removeIfEquals and list comparisons will be
// case sensitive. Otherwise not.
$tags['b']['fixAttrib']['width']['casesensitiveComp'] = 1;

// Boolean. If set, the tags <> are converted to &lt; and &gt;
$tags['b']['protect'] = '';

// String. If set, the tagname is remapped to this tagname
$tags['b']['remap'] = 'strong';

// Boolean. If set, then the tag is removed if no attributes are given
$tags['b']['rmTagIfNoAttrib'] = '';

// Boolean/'global'.
// If set TRUE, then this tag must have starting and ending tags in the correct order.
// Any tags not in this order will be discarded. Thus '</B><B><I></B></I></B>'
// will be converted to '<B><I></B></I>'.
$tags['b']['nesting'] = '';
$tags['b']['nesting'] = '1';

// If the value is 'global' then true nesting in relation to other
// tags marked for 'global' nesting control is preserved.
// This means that if <B> and <I> are set for global nesting then this string
// '</B><B><I></B></I></B>' is converted to '<B></B>'
$tags['b']['nesting'] = 'global';
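
To make the 'range', 'list' and 'casesensitiveComp' options from the listing above more tangible, here is a small plain-PHP sketch of how I read their descriptions. Both function names are hypothetical, and the code only mimics the documented behaviour; it is not the TYPO3 implementation:

```php
<?php
// Hypothetical helpers that mimic the documented 'range' and 'list' rules.
// They are NOT the TYPO3 implementation.

/**
 * Force a value into an integer range; the high limit is optional.
 */
function clampToRange($value, array $range)
{
    $value = max((int)$value, (int)$range[0]);
    if (isset($range[1])) {
        $value = min($value, (int)$range[1]);
    }
    return $value;
}

/**
 * Return $value if it is in $list, otherwise the first list element.
 * The comparison is case-insensitive unless $caseSensitive is set.
 */
function restrictToList($value, array $list, $caseSensitive = false)
{
    foreach ($list as $allowed) {
        $same = $caseSensitive
            ? ((string)$value === (string)$allowed)
            : (strtolower($value) === strtolower($allowed));
        if ($same) {
            return $value;
        }
    }
    return $list[0];
}

echo clampToRange('500', array(1, 100)), PHP_EOL;
echo clampToRange('-3', array(0)), PHP_EOL;
echo restrictToList('LEFT', array('left', 'right')), PHP_EOL;
echo restrictToList('LEFT', array('left', 'right'), true), PHP_EOL;
```

The last two calls show the effect of the case-sensitive comparison: without it 'LEFT' passes as a match for 'left', with it the value falls back to the first list element.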

// @param string $content
//    This is the HTML-content that is being processed.
//    It is also the result that is being returned. (?)

// @param array $tags
//    Is an array where each key is a tagname in lowercase.
//    Only tags present as keys in this array are preserved.
//    The value of the key can be an array with a vast number of options to configure.

// @param string $keepAll
//    Boolean/'protect', if set, then all tags are kept regardless of tags present
//    as keys in $tags-array. If $keepAll=='protect' then the preserved tags have
//    their <> converted to &lt; and &gt;

// @param integer $hSC
//    Values -1,0,1,2:
//    Set to zero = disabled,
//    set to 1 then the content BETWEEN tags is passed through htmlspecialchars(),
//    set to -1 it's the opposite and
//    set to 2 the content will be HSC'ed BUT with preservation for
//       real entities (eg. "&amp;" or "&#234;")

// @param array $addConfig
//    Configuration array sent along as $conf to the internal functions
//    ->processContent() and ->processTag()

// @return string
//    Processed HTML content

public function HTMLcleaner(
   $content,
   $tags = array(),
   $keepAll = 0,
   $hSC = 0,
   $addConfig = array()
) {

   // ... work ...
   return $result;

}
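
The difference between the `$hSC` values is easier to see with plain PHP. The sketch below does not call HTMLcleaner() at all; it only uses PHP's own htmlspecialchars() and htmlspecialchars_decode() as an analogy, on a made-up `$text`, under the assumption that the modes behave roughly like this:

```php
<?php
// Made-up sample text containing a bare ampersand and two real entities.
$text = 'Fish & chips &amp; peas &#234;';

// Roughly like $hSC = 1: everything is escaped, so existing entities
// are encoded a second time.
$escapedAll = htmlspecialchars($text, ENT_COMPAT, 'UTF-8');

// Roughly like $hSC = 2: the fourth argument (double_encode = false)
// leaves existing entities untouched and escapes only the bare ampersand.
$escapedKeep = htmlspecialchars($text, ENT_COMPAT, 'UTF-8', false);

// Roughly like $hSC = -1: the opposite direction, entities are decoded.
$decoded = htmlspecialchars_decode('&lt;b&gt; &amp; more', ENT_COMPAT);

echo $escapedAll, PHP_EOL;
echo $escapedKeep, PHP_EOL;
echo $decoded, PHP_EOL;
```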

Real Life Example

TypoScript in example/Configuration/TypoScript/setup.txt:

plugin.tx_example {
   view {
      templateRootPath = {$plugin.tx_example.view.templateRootPath}
      partialRootPath = {$plugin.tx_example.view.partialRootPath}
      layoutRootPath = {$plugin.tx_example.view.layoutRootPath}
   }
   settings {
      # what tags are allowed in some longtext fields
      cleanHtml {
         allowTags  = b, br, h1, h2, h3, i, li, ol, p, u, ul
         noAttrib   = b, br, h1, h2, h3, i, li, ol, p, u, ul
         removeTags =
      }
   }
}

# for use in fluid:
#   {object.field -> f:cObject(typoscriptObjectPath:'mylib.cleanHtml')}
mylib.cleanHtml = TEXT
mylib.cleanHtml {
   current = 1
   HTMLparser =< plugin.tx_example.settings.cleanHtml
}

This works together with the utility function Html::cleanOurHtml($content). This is also an example of how to access the ObjectManager and the ConfigurationManager from a standalone static function.
