Simplified Wikipedia Search Engine

Time

April 2017

Specification

Click here to see specification.

Summary

  • MapReduce Indexing
  • Index Server
  • Search Interface

Implementation

MapReduce Indexing

Four stages total (pipeline using given hadoop)

  • Stage0: count # documents
  • Stage1: count term frequency of each word in every doc
  • Stage2: calculate normalization factor for each doc
  • Stage3: print required inverted index

Index Server

  • Calculate score by the equation: Score = w * Pagerank + (1-w) * tfIdf
  • Return Docs by relevance (scores)

Search Interface

  • Specify query and weight by user
  • Request to index server to get top 10 relevant docs
  • For each doc, use doc_title as a new query, display top 10 similar docs

Related Code

You can contact me if you want to see the actual implementation. I can add you to the private Github repo.