Click here to see specification.
- MapReduce Indexing
- Index Server
- Search Interface
Four stages total (pipeline using given hadoop)
- Stage0: count # documents
- Stage1: count term frequency of each word in every doc
- Stage2: calculate normalization factor for each doc
- Stage3: print required inverted index
- Calculate score by the equation: Score = w * Pagerank + (1-w) * tfIdf
- Return Docs by relevance (scores)
- Specify query and weight by user
- Request to index server to get top 10 relevant docs
- For each doc, use doc_title as a new query, display top 10 similar docs
You can contact me if you want to see the actual implementation. I can add you to the private Github repo.