OK, so what I need to do is remove HTML markup tags and unnecessary whitespace from HTML so that the text can be indexed by a search algorithm.
To do this, I will mostly be using VB's own Replace function and the scripting regular expression object, which is available by adding a Visual Basic project reference to Microsoft VBScript Regular Expressions 5.5.
The Plan
- Swap <BR> for a new-line so that occurrences of "hello<br>goodbye" don't become "hellogoodbye"
- Swap &nbsp; for a space
- Remove any HTML tags and replace with nothing
- Remove multiple whitespace characters and swap for a single space
- Translate common named entities such as &amp;, &gt;, &lt;, etc. back into the characters they stand for
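A detail worth noting for the last step: the order of the entity replacements matters. If the ampersand entity is decoded before the others, text that was escaped twice gets decoded twice. The following is only a small sketch to show the effect. DecodeEntities and its blnAmpFirst switch are hypothetical names made up for this illustration and are not part of the function in the Code section.

```vb
' Hypothetical helper for illustration only: decodes three named entities,
' handling "&amp;" either first or last depending on blnAmpFirst.
Private Function DecodeEntities(ByVal strText As String, _
                                ByVal blnAmpFirst As Boolean) As String
    If blnAmpFirst Then
        strText = Replace$(strText, "&amp;", "&", 1, -1, vbTextCompare)
    End If
    strText = Replace$(strText, "&gt;", ">", 1, -1, vbTextCompare)
    strText = Replace$(strText, "&lt;", "<", 1, -1, vbTextCompare)
    If Not blnAmpFirst Then
        strText = Replace$(strText, "&amp;", "&", 1, -1, vbTextCompare)
    End If
    DecodeEntities = strText
End Function
```

With the ampersand handled last, DecodeEntities("&amp;lt;b&amp;gt;", False) gives "&lt;b&gt;", which is the text the page author meant to display. With the ampersand handled first, DecodeEntities("&amp;lt;b&amp;gt;", True) gives "<b>", because the freshly decoded "&" forms new entities that the later replacements pick up.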
Further Reading
For more information, see Microsoft Beefs Up VBScript with Regular Expressions
Code
Private Function HTMLClean(ByVal strText As String) As String
    Dim objRegEx As VBScript_RegExp_55.RegExp
    ' replace <br>'s with a newline
    strText = Replace$(strText, "<br>", Chr$(10), 1, -1, vbTextCompare)
    ' replace non-breaking spaces with a space
    strText = Replace$(strText, "&nbsp;", Chr$(32), 1, -1, vbTextCompare)
    ' create new regex object
    Set objRegEx = New VBScript_RegExp_55.RegExp
    objRegEx.Global = True ' don't just operate on first find.
    ' remove HTML tags
    objRegEx.Pattern = "<[^>]*>"
    strText = objRegEx.Replace(strText, "")
    ' ditch excessive white space
    objRegEx.Pattern = "\s+"
    strText = objRegEx.Replace(strText, " ")
    ' thanks, bye (release the regex object)
    Set objRegEx = Nothing
    ' named-entities
    strText = Replace$(strText, "&gt;", ">", 1, -1, vbTextCompare)
    strText = Replace$(strText, "&lt;", "<", 1, -1, vbTextCompare)
    ' insert your favourite named-entities here.
    strText = Replace$(strText, "&amp;", "&", 1, -1, vbTextCompare) ' must do last
    ' return
    HTMLClean = strText
End Function
Debug.Print HTMLClean("<greets>Hello &lt;oh&gt; &amp;" & vbNewLine & _
    "<b>I</b> <a href=""#"">wonder</a> " & vbNewLine & _
    " what<br>will become of all this?</greets>")
Hello <oh> & I wonder what will become of all this?
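One gap in the first step: the literal "<br>" replacement does not match self-closing forms such as <br/> or <br />. Those fall through to the tag-stripping pattern and are replaced with nothing, so "hello<br />goodbye" still ends up as "hellogoodbye". Below is a minimal sketch of one way to catch them, assuming the same reference to Microsoft VBScript Regular Expressions 5.5. BreaksToNewlines is a hypothetical name, and the pattern is a suggestion rather than part of the original function.

```vb
' Hypothetical helper: turn <br>, <BR/>, <br /> and similar into a line feed.
Private Function BreaksToNewlines(ByVal strText As String) As String
    Dim objRegEx As VBScript_RegExp_55.RegExp
    Set objRegEx = New VBScript_RegExp_55.RegExp
    objRegEx.Global = True      ' replace every match, not just the first
    objRegEx.IgnoreCase = True  ' match <BR> as well as <br>
    ' "<br", a word boundary, then anything up to the closing ">"
    objRegEx.Pattern = "<br\b[^>]*>"
    BreaksToNewlines = objRegEx.Replace(strText, Chr$(10))
    Set objRegEx = Nothing
End Function
```

Calling BreaksToNewlines before the tag-stripping step would stand in for the literal "<br>" replacement. The \b keeps the pattern from matching other tags whose names merely start with "br".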


