John Sawyer's Techo Blog: Build your own VB.Net Search Engine (Part 1)

I wanted to produce my own site search engine using VB.Net, and I chose:

- Html Agility Pack for HTML page parsing
  - Extract a word list for indexing
  - Extract a list of embedded links
- Lucene.Net as the search engine

This post shows the steps I took to:

- parse the document
- produce a word list
- hint at how to add the word list to Lucene (Part 2 is to come)
- extract each link

Part 2 will describe Lucene.Net:

- Indexing
- Lookup

The Main Process

Once the HTML has been obtained, its text is passed to ProcessHTML below, which does three things:

- The document is parsed with LoadHtml
- The words are extracted with ConvertContentTo
- A collection of links is built using SelectNodes("//a[@href]")

    Sub ProcessHTML(ByVal html As String, ByVal url As String)
        Dim doc As New HtmlDocument
        doc.LoadHtml(html)
        Dim sw As New StringWriter
        ConvertContentTo(doc.DocumentNode, sw)
        Dim wordlist As String = sw.ToString
        Dim lDoc As New Lucene.Net.Documents.Document
        lDoc.Add(New Field("text", wordlist, Field.Store.YES, Field.Index.TOKENIZED))
        lDoc.Add(New Field("url", url, Field.Store.YES, Field.Index.TOKENIZED))
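        ' iw: presumably a Lucene.Net IndexWriter declared elsewhere (not shown in this post)
        iw.AddDocument(lDoc)

        ' SelectNodes returns Nothing when no node matches, so guard before looping
        Dim links As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If links Is Nothing Then Return
        For Each link As HtmlNode In links
            Dim att As HtmlAttribute = link.Attributes("href")
            Dim a As String = att.Value
            If Not a.StartsWith("#") Then
                ' ...
            End If
        Next
    End Sub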
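
The href values collected by the loop above are often relative, so they usually need resolving against the page URL before they can be fetched. The following is only a sketch of one way to do that with System.Uri. The module name LinkHelper, the function ResolveLinks and the example.com URLs are hypothetical and are not part of the original code.

```vbnet
Imports System
Imports System.Collections.Generic

Module LinkHelper
    ' Hypothetical helper (not from the original post): turns raw href values
    ' into absolute http/https URLs, skipping in-page anchors.
    Function ResolveLinks(ByVal pageUrl As String, ByVal hrefs As IEnumerable(Of String)) As List(Of String)
        Dim baseUri As New Uri(pageUrl)
        Dim result As New List(Of String)
        For Each href As String In hrefs
            ' Same test as the original loop: ignore links within the page
            If String.IsNullOrEmpty(href) OrElse href.StartsWith("#") Then
                Continue For
            End If
            Dim absolute As Uri = Nothing
            ' Uri.TryCreate resolves a relative reference against the page URL
            If Not Uri.TryCreate(baseUri, href, absolute) Then
                Continue For
            End If
            ' Skip mailto:, javascript: and other non-web schemes
            If absolute.Scheme <> Uri.UriSchemeHttp AndAlso absolute.Scheme <> Uri.UriSchemeHttps Then
                Continue For
            End If
            ' Keep everything up to the query string, which drops any #fragment
            Dim clean As String = absolute.GetLeftPart(UriPartial.Query)
            If Not result.Contains(clean) Then
                result.Add(clean)
            End If
        Next
        Return result
    End Function

    Sub Main()
        Dim hrefs As String() = {"#top", "page2.html", "/about.html#team", "mailto:someone@example.com"}
        For Each url As String In ResolveLinks("http://example.com/docs/index.html", hrefs)
            Console.WriteLine(url)
        Next
    End Sub
End Module
```

Inside ProcessHTML, each att.Value could be added to a list and passed to a helper like this along with the url parameter.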

Extracting a list of words

The following code builds a list of words. It has been adapted from C# code that is also one of the samples in the Html Agility Pack download.

    Sub ConvertContentTo(ByVal node As HtmlNode, ByVal tw As TextWriter)
        For Each subnode As HtmlNode In node.ChildNodes
            ConvertTo(subnode, tw)
        Next
    End Sub
    Sub ConvertTo(ByVal node As HtmlNode, ByVal tw As TextWriter)
        Dim html As String
        Select Case node.NodeType
            Case HtmlNodeType.Document
                ConvertContentTo(node, tw)
            Case HtmlNodeType.Text
                ' script and style must not be output
                Dim parentName As String = node.ParentNode.Name
                If parentName = "script" OrElse parentName = "style" Then
                    Return
                End If
                html = node.InnerText
                ' is it in fact a special closing node output as text?
                If (HtmlNode.IsOverlappedClosingElement(html)) Then
                    Return
                End If
                ' check the text is meaningful and not a bunch of whitespaces
                If (html.Trim().Length > 0) Then
                    tw.Write(HtmlEntity.DeEntitize(html) & " ")
                End If
            Case HtmlNodeType.Element
                If node.Name = "p" Then
                    tw.WriteLine("")
                End If
                If node.HasChildNodes Then
                    ConvertContentTo(node, tw)
                End If
        End Select
    End Sub

The HTML parsing was pretty quick - much quicker than an earlier approach using regular expressions.
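
The text written by ConvertContentTo is one long string, and with Field.Index.TOKENIZED Lucene's analyzer does its own splitting into terms. To inspect what was extracted as an actual list of distinct words, something like the following sketch could be used. The module name WordListHelper and the function DistinctWords are hypothetical, and the "\w+" pattern is only one possible definition of a word.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module WordListHelper
    ' Hypothetical helper (not from the original post): splits extracted page
    ' text into distinct lower-case words, in order of first appearance.
    Function DistinctWords(ByVal text As String) As List(Of String)
        Dim seen As New Dictionary(Of String, Boolean)
        Dim words As New List(Of String)
        ' \w+ matches runs of letters, digits and underscores
        For Each m As Match In Regex.Matches(text, "\w+")
            Dim word As String = m.Value.ToLowerInvariant()
            If Not seen.ContainsKey(word) Then
                seen.Add(word, True)
                words.Add(word)
            End If
        Next
        Return words
    End Function

    Sub Main()
        For Each w As String In DistinctWords("Build your own search engine. Your OWN engine!")
            Console.WriteLine(w)
        Next
    End Sub
End Module
```

In ProcessHTML this could be called on sw.ToString to check the extraction before the text is handed to Lucene.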

Wednesday, September 17, 2008
