Wednesday, September 17, 2008

Build your own VB.Net Search Engine (Part 1)

I wanted to build my own site search engine in VB.NET, and I chose:
  • Html Agility Pack for HTML page parsing
      - Extract word list for indexing
      - Extract a list of embedded links
  • Lucene.Net as the search engine

This post shows how to

  • parse the document
  • produce a word list
  • add the word list to Lucene.Net (in outline only;
    Part 2 is to come)
  • extract each link

Part 2 will describe Lucene.Net

  • Indexing
  • Lookup

The Main Process

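Once the HTML for a page has been obtained, it is passed as a string to ProcessHTML (below), along with the page's URL.

  • The document is parsed with LoadHtml
  • The words are extracted with ConvertContentTo
  • A collection of links is built using SelectNodes("//a[@href]")

    Imports System.IO
    Imports HtmlAgilityPack
    Imports Lucene.Net.Documents

    ' iw is assumed to be a Lucene.Net IndexWriter created elsewhere (indexing is covered in Part 2)
    Sub ProcessHTML(ByVal html As String, ByVal url As String)
        Dim doc As New HtmlDocument
        doc.LoadHtml(html)

        ' Flatten the page text into a single word list
        Dim sw As New StringWriter
        ConvertContentTo(doc.DocumentNode, sw)
        Dim wordlist As String = sw.ToString

        ' Store the word list and the URL in a Lucene document
        Dim lDoc As New Lucene.Net.Documents.Document
        lDoc.Add(New Field("text", wordlist, Field.Store.YES, Field.Index.TOKENIZED))
        lDoc.Add(New Field("url", url, Field.Store.YES, Field.Index.TOKENIZED))
        iw.AddDocument(lDoc)

        ' SelectNodes returns Nothing when no node matches
        Dim links As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If links Is Nothing Then Return
        For Each link As HtmlNode In links
            Dim att As HtmlAttribute = link.Attributes("href")
            Dim a As String = att.Value
            If Not a.StartsWith("#") Then
                ' ...
            End If
        Next
    End Sub
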
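The loop above skips in-page anchors (hrefs starting with "#"). What to do with the remaining links depends on the crawler. One possibility, offered as a sketch and not as the code this site uses, is to resolve each href against the page's URL with System.Uri so that relative links become absolute. ResolveLink and LinkHelper are hypothetical names, and the rule of keeping only http and https links is my assumption.

```vbnet
Imports System

Module LinkHelper
    ' Hypothetical helper: turn an href found on pageUrl into an absolute URL.
    ' Returns Nothing for in-page anchors and for values that cannot be resolved.
    Function ResolveLink(ByVal pageUrl As String, ByVal href As String) As String
        If String.IsNullOrEmpty(href) OrElse href.StartsWith("#") Then
            Return Nothing
        End If

        ' The page URL itself must be absolute to act as a base
        Dim baseUri As Uri = Nothing
        If Not Uri.TryCreate(pageUrl, UriKind.Absolute, baseUri) Then
            Return Nothing
        End If

        ' Combine base and href; an href that is already absolute is kept as it is
        Dim absolute As Uri = Nothing
        If Not Uri.TryCreate(baseUri, href, absolute) Then
            Return Nothing
        End If

        ' Assumption: only web links are wanted, so mailto:, javascript: etc. are dropped
        If absolute.Scheme <> Uri.UriSchemeHttp AndAlso absolute.Scheme <> Uri.UriSchemeHttps Then
            Return Nothing
        End If

        Return absolute.AbsoluteUri
    End Function
End Module
```

Inside the loop this could be called as ResolveLink(url, a), with any result other than Nothing added to the list of pages still to be fetched.
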
Extracting a list of words

The following code builds the list of words. It is adapted from C# code that also ships as one of the samples in the download.

    Sub ConvertContentTo(ByVal node As HtmlNode, ByVal tw As TextWriter)
        ' Walk every child of this node
        For Each subnode As HtmlNode In node.ChildNodes
            ConvertTo(subnode, tw)
        Next
    End Sub

    Sub ConvertTo(ByVal node As HtmlNode, ByVal tw As TextWriter)
        Dim html As String
        Select Case node.NodeType
            Case HtmlNodeType.Document
                ConvertContentTo(node, tw)
            Case HtmlNodeType.Text
                ' script and style must not be output
                Dim parentName As String = node.ParentNode.Name
                If parentName = "script" OrElse parentName = "style" Then
                    Return
                End If
                html = node.InnerText
                ' is it in fact a special closing node output as text?
                If HtmlNode.IsOverlappedClosingElement(html) Then
                    Return
                End If
                ' check the text is meaningful and not a bunch of whitespace
                If html.Trim().Length > 0 Then
                    tw.Write(HtmlEntity.DeEntitize(html) & " ")
                End If
            Case HtmlNodeType.Element
                ' start a new line for each paragraph
                If node.Name = "p" Then
                    tw.WriteLine("")
                End If
                If node.HasChildNodes Then
                    ConvertContentTo(node, tw)
                End If
        End Select
    End Sub
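
To try the word extraction on its own, without Lucene.Net, a small console module along these lines should do. It is only a sketch: it assumes ConvertContentTo and ConvertTo from the listing above have been pasted into the same module and that the Html Agility Pack assembly is referenced. The module name and the sample markup are made up.

```vbnet
Imports System
Imports System.IO
Imports HtmlAgilityPack

Module WordListDemo
    ' Assumes ConvertContentTo and ConvertTo from the listing above
    ' have been pasted into this module.

    Sub Main()
        ' Made-up sample page: one paragraph plus a script block
        Dim html As String = "<html><body><p>Hello &amp; welcome</p>" & _
                             "<script>var x = 1;</script></body></html>"

        Dim doc As New HtmlDocument
        doc.LoadHtml(html)

        Using sw As New StringWriter
            ConvertContentTo(doc.DocumentNode, sw)
            ' The paragraph text should be written; the script body should not
            Console.WriteLine(sw.ToString())
        End Using
    End Sub
End Module
```
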

The HTML parsing was pretty quick - much quicker than an earlier approach using regular expressions.
