Monday, August 22, 2011

Topic-Specific Web Resource Crawling for Quality Control of Automated Search Engines – (UNDERGRADUATE RESEARCH PROJECT)

Search engines have played a vital role in my life ever since I entered university as an engineering undergraduate.

After completing my first year in university I realized that it is governed by “lecturing -> examination”, not “self-learning -> working hard on real things”, so at that time it was a “POOOOVH” on my dreams. I wanted to be someone who really works on advanced stuff and thinks differently from others, and simply not a student who knows everything in the books and knows how to solve differential equations but does not know how to use them to make something that makes human life more comfortable.

So I knew the time had come to make a turning point. I decided to learn software engineering subjects, computer languages, and frameworks by myself, because my subject list did not meet my requirements. Life was hard because of my decision, but I had the spirit to fight through it. On that quest Google was my friend and teacher: it helped me find e-books, tutorials, demos, the bleeding edge of technology, the movements of the world economy and how they affect my future career, what is happening in the current job market, job requirements, international-level qualifications, and much more…

That’s how I made an incomparable friendship with Google. After working together for more than three years, I know how Google behaves with different kinds of search methodologies. If you are not aware of them, please visit Basic search help : Google, where you can find a set of examples that describe how to use Google well. But Google still has some very significant problems.

1. Result pages full of junk

Here is an example. Just imagine you want to find an e-book on Eclipse IDE plug-in development. So I am using “eclipse plugin eBook download” as my search query.

image

This is Google's top link for it.

image

Just because the page contains the text “Eclipse plugins free eBook at …”, Google gives us a page with no information relevant to our search. How sad!

2. Wasting the valuable time of the seeker.

3. Not identifying the exact requirements of the seeker.

Normally when we search, we use just one or two keywords.

image

It pops up with a bunch of links in just 0.16 seconds. But the thing is, when we search we have a list of things we want to know that we did not enter in the search query. In the example above I was searching to buy a laptop, so I wanted to know the prices, brands, performance, etc. That information should be in the first 10 links Google gives us, but if you look at the Google results carefully enough you will realize the quality is not there.

4. Giving junk results repeatedly even when the seeker expands the query.

As I mentioned above, the junk results given by Google pop up repeatedly even if we change the length or the meaning of the search query.

1. Search using the query “Eclipse plugin eBook download free”.

image

The junk link is still there.

2. Search using the query “eclipse plugin development eBook download free”.

image

How sad. The junk link is still there.

I have already visited it and confirmed that the site does not contain the result that I want, but Google still suggests it to me. With a simple use of cookies, such already-visited results could be omitted.
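The idea of omitting visited results can be sketched in a few lines of Python. This is only an illustration of the filtering step: the function name `drop_visited` and the URLs are hypothetical, and an in-memory set stands in for whatever a real cookie would store.

```python
def drop_visited(results, visited):
    """Return results in their original order, skipping URLs already visited."""
    return [url for url in results if url not in visited]

# Hypothetical URLs, for illustration only.
results = [
    "http://example.com/junk-page",
    "http://example.org/eclipse-plugin-book",
]
visited = {"http://example.com/junk-page"}  # stand-in for cookie contents

filtered = drop_visited(results, visited)
print(filtered)
```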

So that is about my friend Google. Now let's talk about us. Our undergraduate research project is to build a “Quality Control” mechanism for the above results using “Topic-Specific Web Resource Crawling”. It was an idea from my friend Amith, who was also the navigator on this quest.

What is our Solution?

•An AI-based focused crawler automatically builds a web directory.

1. AI-based, with identified user expectations

2. Automated

3. Eliminates junk results as much as possible

4. The web directory is updated more frequently

•System Architecture

image

•How did we do it?

–Get information from a human-edited web directory, the ODP (Open Directory Project)

–Identify the most frequent keywords in a particular category

image
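Finding the most frequent keywords in a category could look roughly like the sketch below. It assumes the directory entries are available as plain-text descriptions; the stop-word list, the sample entries, and the function name `top_keywords` are illustrative assumptions, not the project's actual data or code.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption, not the project's list).
STOP_WORDS = {"a", "an", "and", "the", "of", "for", "in", "on", "to", "with"}

def top_keywords(descriptions, n=5):
    """Count words across all descriptions and return the n most common."""
    counts = Counter()
    for text in descriptions:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)

# Hypothetical directory entries for one category.
sample = [
    "Diamond rings and diamond jewelry for sale",
    "Guide to diamond grading and certification",
    "Jewelry store with rings and necklaces",
]
keywords = top_keywords(sample, n=3)
print(keywords)
```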

•URL Finder

Each time, submit a combination of keywords to Google and store the results.

image
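Generating the keyword combinations for the URL Finder might be sketched as follows. The combination size, the helper names `keyword_queries` and `search_url`, and the choice to only build the search URL (rather than fetch and store results) are assumptions made for this example.

```python
from itertools import combinations
from urllib.parse import urlencode

def keyword_queries(keywords, size=2):
    """Yield one query string per combination of `size` keywords."""
    for combo in combinations(keywords, size):
        yield " ".join(combo)

def search_url(query, base="https://www.google.com/search"):
    """Build a search URL for a query; fetching it is left out of this sketch."""
    return base + "?" + urlencode({"q": query})

queries = list(keyword_queries(["diamond", "rings", "jewelry"], size=2))
for q in queries:
    print(search_url(q))
```

Keeping query generation separate from fetching makes it easy to store each query next to the results it produced.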

•Web Crawler

Crawl the resulting websites and count the frequency of the related keywords, for use by the statistics analyzer.
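The counting step of the crawler could be sketched like this, using only the Python standard library. It assumes the page HTML has already been downloaded; the class `TextExtractor` and the function `keyword_counts` are hypothetical names for this illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth inside script/style elements

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

def keyword_counts(html, keywords):
    """Return ({keyword: occurrences}, total word count) for one page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    words = re.findall(r"[a-z]+", " ".join(parser.parts).lower())
    counts = Counter(words)
    return {k: counts[k.lower()] for k in keywords}, len(words)

page = ("<html><head><style>p{color:red}</style></head><body>"
        "<p>Diamond rings. A diamond!</p><script>var diamond=1;</script>"
        "</body></html>")
counts, total = keyword_counts(page, ["diamond", "rings", "gold"])
print(counts, total)
```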

•Link Analyzer

Eliminate duplicate results from the URL pool.
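One possible way to eliminate duplicates is to normalize each URL before comparing. The normalization rules below (lower-case host, drop the fragment, ignore a trailing slash) are assumptions for this sketch; the post does not say how the Link Analyzer decides that two results are the same.

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Reduce a URL to a comparison key: lower-case host, no fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )

def unique_urls(urls):
    """Keep the first occurrence of each normalized URL, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            unique.append(url)
    return unique

# Hypothetical URL pool.
pool = [
    "http://Example.com/a/",
    "http://example.com/a#top",
    "http://example.com/b",
]
deduped = unique_urls(pool)
print(deduped)
```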

•Calculate the keyword p.u values of each website

image

image

*The graph above (generated by submitting the system's results to MATLAB) shows that Google's first (top) result has a lower frequency of the searched keyword than the 2nd and 7th results. So, at least for this query, Google's top results are not necessarily the results that we expect as users.
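As a rough sketch of such a comparison, the snippet below reads “p.u” as “per unit” and divides the keyword count by the total number of words on the page. That reading is an assumption, and the counts are made-up numbers for illustration, not the project's measurements.

```python
def per_unit(keyword_count, total_words):
    """Keyword occurrences per word of page text; 0.0 for an empty page."""
    return keyword_count / total_words if total_words else 0.0

# Hypothetical counts: (keyword occurrences, total words) by result position.
pages = {1: (3, 1200), 2: (40, 1000), 7: (25, 500)}
values = {rank: per_unit(c, t) for rank, (c, t) in pages.items()}

for rank, value in values.items():
    print(f"result {rank}: {value:.4f}")
```

Normalizing by page length keeps a long page from looking relevant only because it contains more words overall.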

•Statistics Analyzer

image

image

*The screenshots above show the top-ranked sites according to our system's results for the keyword “Diamond”.
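Producing such a ranking from per-site scores can be sketched as below. The site names and scores are hypothetical, and sorting by a single score is an assumption; the actual Statistics Analyzer may combine several measures.

```python
def rank_sites(scores, top_n=10):
    """Return (url, score) pairs sorted by score, highest first."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ordered[:top_n]

# Hypothetical scores for the keyword "Diamond".
scores = {
    "http://example.com/shop": 0.012,
    "http://example.org/guide": 0.050,
    "http://example.net/blog": 0.031,
}
ranking = rank_sites(scores, top_n=2)
print(ranking)
```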

•Improvements

- Use more sources to find relevant keywords

- Add human supervision, which would give better results

- Use phrases rather than single keywords

So it was a privilege to contribute to this work, and it is one of the greatest things that I have done in my university life, other than my undergraduate project humiee. It was an incomparable experience that inspired me, because I always wanted to learn for the quest, not for the marks.
