How To Fix Buggy PHP strip_tags Function

By Angsuman Chakraborty, Gaea News Network
Friday, December 21, 2007

strip_tags() in PHP has several problems. It doesn't recognize that CSS inside style tags is not document text, so the CSS is left in the output. It likewise leaves the content of script tags in place, and it does not remove HTML entities. strip_tags() also fails on invalid HTML. In short, strip_tags() is not advisable except for trivial cases. The best solution I have come across is by uersoy at tnn dot net:


function html2txt($document){
  $search = array('@<script[^>]*?>.*?</script>@si',  // Strip out javascript
                 '@<style[^>]*?>.*?</style>@si',     // Strip style blocks (no U flag: it would make .*? greedy)
                 '@<[\/\!]*?[^<>]*?>@si',            // Strip out HTML tags
                 '@<![\s\S]*?--[ \t\n\r]*>@'         // Strip multi-line comments including CDATA
  );
  $text = preg_replace($search, '', $document);
  return $text;
}
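
To make the difference concrete, here is a small sketch that runs strip_tags() and html2txt() on the same input. The sample markup and the variable names are made up for illustration. The function is a copy of the one above, with the style pattern using plain lazy matching. The final html_entity_decode() call is my own addition to cover the entity problem, which html2txt() does not address on its own.

```php
<?php
// Copy of the article's function so this snippet runs on its own.
function html2txt($document) {
    $search = array(
        '@<script[^>]*?>.*?</script>@si',  // script blocks, content included
        '@<style[^>]*?>.*?</style>@si',    // style blocks, content included
        '@<[\/\!]*?[^<>]*?>@si',           // any remaining tags
        '@<![\s\S]*?--[ \t\n\r]*>@'        // multi-line comments
    );
    return preg_replace($search, '', $document);
}

// Illustrative input: one style block, one paragraph, one script block.
$html = '<style>p { color: red; }</style><p>Fish &amp; chips</p><script>alert(1);</script>';

$viaStripTags = strip_tags($html);   // removes the tags but keeps CSS and JS text
$viaHtml2txt  = html2txt($html);     // removes the style and script blocks entirely
$decoded      = html_entity_decode($viaHtml2txt, ENT_QUOTES, 'UTF-8');

echo $viaStripTags, "\n";  // p { color: red; }Fish &amp; chipsalert(1);
echo $viaHtml2txt, "\n";   // Fish &amp; chips
echo $decoded, "\n";       // Fish & chips
```

Because the patterns are applied in order, the script and style blocks are gone before the generic tag pattern runs, which is why their contents never reach the output.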

Discussion
May 9, 2010: 3:23 am

Your blog is good! Generally when I visit blogs, I just come across shit, but this time I was really surprised when I got your blog containing wonderful information. Thanks mate and keep this effort up.

May 6, 2010: 1:33 pm

Superb! Generally I never read whole articles but the way you wrote this information is simply amazing and this kept my interest in reading and I enjoyed it. You have got good writing skills.

March 14, 2009: 6:08 pm

This works well for me. It returns a string of up to 1000 characters made of the page's unique words separated by underscores, which is more than enough to work with.

$text = file_get_contents($url);
$text = strtolower($text);
// The element names in the patterns below were lost when this comment was posted;
// each presumably had the form '@<tag[^>]*?>.*?</tag>@siu'.
$text = preg_replace(
array(
'@]*?>.*?@siu',
'@]*?>.*?@siu',
'@]*?>.*?@siu',
'@]*?>.*?@siu',
'@]*?>.*?@siu',
'@]*?>.*?@siu',
'@]*?>.*?@siu',
'@]*?>.*?@siu',
'@]*?.*?@siu',
'@]*?.*?@siu',
'@]*?.*?@siu',
'@]*?.*?@siu',
'@]*?.*?@siu',
'@]*?.*?@siu',
'@]*?.*?@siu',
),
array(
' ', ' ', ' ', ' ', ' ', ' ', ' ', ' ',
' ', ' ', ' ', ' ', ' ', ' ', ' '),
$text );
$text = strip_tags($text);
// Replace everything that is not a letter or digit with a space
$text = preg_replace('/[^0-9a-zA-Z]/', ' ', $text);
$parts = explode(' ', $text);
$unique = array_unique($parts);
$str = implode('_', $unique);
$str = substr($str, 0, 1000);
echo $str;
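
For readers who want something runnable, here is a minimal sketch of the same idea wrapped in a function. The name html_to_keywords() and the list of elements to drop (head, style, script) are my assumptions, since the element names in the comment above did not survive. It also splits with preg_split() and PREG_SPLIT_NO_EMPTY instead of explode(), so runs of spaces do not produce empty words.

```php
<?php
// Hypothetical helper: reduce an HTML string to its unique lowercase words,
// joined by underscores and cut to at most $maxLength characters.
function html_to_keywords($html, $maxLength = 1000) {
    $text = strtolower($html);

    // Assumed element list: drop these elements together with their content.
    $text = preg_replace(
        array(
            '@<head[^>]*?>.*?</head>@si',
            '@<style[^>]*?>.*?</style>@si',
            '@<script[^>]*?>.*?</script>@si',
        ),
        ' ',
        $text
    );

    // Remove whatever tags are left.
    $text = strip_tags($text);

    // Split on runs of non-alphanumeric characters and discard empty pieces.
    $words  = preg_split('/[^0-9a-z]+/', $text, -1, PREG_SPLIT_NO_EMPTY);
    $unique = array_unique($words);

    return substr(implode('_', $unique), 0, $maxLength);
}

// Illustrative input, not taken from the original comment.
$html = '<html><head><title>Demo</title></head><body>'
      . '<p>Red fish, blue fish.</p><script>var x = 1;</script></body></html>';

echo html_to_keywords($html), "\n";  // red_fish_blue
```

Passing the HTML as a string keeps the helper independent of how the page is fetched; the result of file_get_contents($url) can be handed to it directly.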
