Tim Hastings - NonHostile (because there's no need)

Weblog and collection of geeky articles.

OK, so what I need to do is strip the markup tags and unnecessary whitespace out of HTML so that the remaining text can be indexed by a search algorithm.

To do this, I will mostly be using VB's own Replace function and the VBScript regular expression object (RegExp), which becomes available once you add a project reference (Project > References) to Microsoft VBScript Regular Expressions 5.5.

The Plan

  • Swap <BR> for a new-line so that occurrences of "hello<br>goodbye" don't become "hellogoodbye"
  • Swap &nbsp; for a space
  • Remove any remaining HTML tags, replacing them with nothing
  • Collapse each run of whitespace characters into a single space
  • Translate common named entities such as &amp; &gt; &lt; etc.
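
The last step of the plan is order-sensitive: &amp; has to be translated after every other named entity, otherwise text that was escaped once gets decoded twice. Here is a minimal sketch of the difference. DecodeAmpFirst, DecodeAmpLast and DemoEntityOrder are hypothetical names made up for this illustration, and only &lt; is handled to keep it short.

```vb
' Sketch: why "&amp;" should be decoded after the other named entities.
' All three procedure names are hypothetical, for illustration only.
Private Function DecodeAmpFirst(ByVal strText As String) As String
    ' wrong order: the "&" produced here can form a brand new entity
    strText = Replace$(strText, "&amp;", "&", 1, -1, vbTextCompare)
    strText = Replace$(strText, "&lt;", "<", 1, -1, vbTextCompare)
    DecodeAmpFirst = strText
End Function

Private Function DecodeAmpLast(ByVal strText As String) As String
    ' right order: nothing runs after "&amp;" is turned into "&"
    strText = Replace$(strText, "&lt;", "<", 1, -1, vbTextCompare)
    strText = Replace$(strText, "&amp;", "&", 1, -1, vbTextCompare)
    DecodeAmpLast = strText
End Function

Private Sub DemoEntityOrder()
    ' "&amp;lt;" is how HTML source spells the visible text "&lt;"
    Debug.Print DecodeAmpFirst("&amp;lt;")  ' prints <     (decoded twice)
    Debug.Print DecodeAmpLast("&amp;lt;")   ' prints &lt;  (decoded once)
End Sub
```

Run DemoEntityOrder from the Immediate window to compare the two results.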

Further Reading

For more information, see "Microsoft Beefs Up VBScript with Regular Expressions".

Code

Private Function HTMLClean(ByVal strText As String) As String

    Dim objRegEx As VBScript_RegExp_55.RegExp

    ' replace <br>'s for a newline
    strText = Replace$(strText, "<br>", Chr$(10), 1, -1, vbTextCompare)

    ' replace non-breaking spaces for a space
    strText = Replace$(strText, "&nbsp;", Chr$(32), 1, -1, vbTextCompare)

    ' create new regex object
    Set objRegEx = New VBScript_RegExp_55.RegExp
    objRegEx.Global = True ' don't just operate on first find.

    ' remove HTML tags
    objRegEx.Pattern = "<[^>]*>"
    strText = objRegEx.Replace(strText, "")

    ' ditch excessive white space
    objRegEx.Pattern = "\s+"
    strText = objRegEx.Replace(strText, " ")

    ' thanks, bye
    Set objRegEx = Nothing

    ' named-entities
    strText = Replace$(strText, "&gt;", ">", 1, -1, vbTextCompare)
    strText = Replace$(strText, "&lt;", "<", 1, -1, vbTextCompare)

    ' insert your favourite named-entities here.

    strText = Replace$(strText, "&amp;", "&", 1, -1, vbTextCompare)  ' must do last

    ' return
    HTMLClean = strText

End Function

Debug.Print HTMLClean("<greets>Hello &lt;oh&gt;&nbsp;&amp;" & vbNewLine & _
                      "<b>I</b> <a href=""#"">wonder</a>  " & vbNewLine & _
                      "   what<br>will become of all this?</greets>")


Hello <oh> & I wonder what will become of all this?
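
One gap worth knowing about: the literal Replace$ on "<br>" does not catch variants such as <br/>, <br /> or <br clear="all">. Those fall through to the tag-stripping pattern and are replaced with nothing, so "hello<br />goodbye" still ends up as "hellogoodbye". Below is a rough sketch of one way to cover them with the same RegExp object. BreaksToNewLines is a hypothetical helper name, not part of the function above, and the pattern is my assumption about which variants matter.

```vb
' Sketch: turn every <br> variant into a line feed before tags are stripped.
' Needs the project reference to Microsoft VBScript Regular Expressions 5.5.
' BreaksToNewLines is a hypothetical helper name, for illustration only.
Private Function BreaksToNewLines(ByVal strText As String) As String

    Dim objRegEx As VBScript_RegExp_55.RegExp

    Set objRegEx = New VBScript_RegExp_55.RegExp
    objRegEx.Global = True      ' replace every match, not just the first
    objRegEx.IgnoreCase = True  ' match <br>, <BR>, <Br> alike

    ' "<br", a word boundary (so <break> is left alone), then anything up to ">"
    objRegEx.Pattern = "<br\b[^>]*>"
    BreaksToNewLines = objRegEx.Replace(strText, Chr$(10))

    Set objRegEx = Nothing

End Function
```

Calling strText = BreaksToNewLines(strText) in place of the first Replace$ line in HTMLClean should leave the rest of the function unchanged, since the whitespace step collapses the line feed into a single space either way.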

3 comments, Visual Basic 6, Friday, May 7, 2004 12:05

VB6: Convert HTML into Searchable Text using Regular Expressions (this post, made Friday, May 7, 2004 12:05)


Comments
Tim,
This is a great post...even 3 years later. Thanks for posting...the code is just what I have been looking for. How could I execute a function like this by pointing to a file or group of files rather than the way you have shown it here (which I realize is just for demonstration purposes)? Could I feed this function a file name somehow? What would the code look like for that? Thanks.

Posted by: on Sunday, April 15, 2007 06:32
Great work!! it really solved my problem.

HTML to TEXT CONVERSION

I slightly modified it to use the '"' character.

rest worked fine.



Posted by: arjun on Thursday, July 26, 2007 06:33
Many thanks for this Tim. It's great.

ali

Posted by: ali on Friday, May 8, 2009 10:48
