Enhanced Methods of Job Offer Extraction from Web Using Crawler Automation

Doctoral Thesis
Jürgen Dorn
A Web crawler, also known as an automatic indexer, bot, Web robot, or web spider, is computer software that systematically searches the World Wide Web and stores the results in the form of indices. This process of searching is called Web crawling or spidering. Many search engines use web crawling to provide up-to-date data: they scan, or crawl, internet pages and build an index of the visited pages for efficient searching. Other uses of web crawlers include automating the maintenance of web sites, for example by verifying links or validating HTML code.
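The crawling process described above can be sketched as a breadth-first traversal that never fetches a page twice. This is a minimal illustration, not the crawler developed in this thesis; the `fetch` function and the page graph are stand-ins for real HTTP requests and link extraction.

```python
from collections import deque

def crawl(seed_url, fetch, max_pages=100):
    """Breadth-first crawl: visit pages, record an index of
    page -> outgoing links, and never fetch a URL twice."""
    index = {}
    frontier = deque([seed_url])
    seen = {seed_url}
    while frontier and len(index) < max_pages:
        url = frontier.popleft()
        links = fetch(url)          # stand-in for HTTP fetch + link extraction
        index[url] = links          # the "index" a search engine would store
        for link in links:
            if link not in seen:    # skip already-queued pages
                seen.add(link)
                frontier.append(link)
    return index

# Simulated web for illustration: fetch() looks up links in a dict.
pages = {
    "/": ["/jobs", "/about"],
    "/jobs": ["/jobs/1", "/jobs/2"],
    "/about": ["/"],
    "/jobs/1": [],
    "/jobs/2": ["/jobs/1"],
}
index = crawl("/", fetch=lambda url: pages.get(url, []))
```

A real crawler would additionally respect `robots.txt`, throttle requests, and normalize URLs before deduplication.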
The significance of employment in the structure of a society is evident. Methods of employment procurement are gradually shifting from conventional to digital, and the internet has become a prominent source of job procurement. Online job offers have opened research opportunities to explore methods for automating the classification and retrieval of online jobs. Classifying web documents as job opportunities requires a mechanism from machine learning or some other domain. Within machine learning, text classification is the only viable method for automating the retrieval of online job opportunities; semantic web mining is another possible solution for job offer classification. We studied different methods for job offer classification, drawn from machine learning and semantic web technologies. More than 5000 job offers were collected from multiple existing job offer websites for this study.

From the machine learning discipline, we investigated eight text classifiers to study their effectiveness and generalization performance on new data. The job offers dataset was pre-processed with different available methods and a newly defined method, and arranged into five groups for classification. The classifiers were regularized to avoid high variance, and their effectiveness parameters and generalization errors were evaluated. All classifiers showed >90% accuracy, but their generalization errors varied: Ridge Regression and Stochastic Gradient Descent generalized well on new data for all groups, whereas Random Forest and Perceptron remained prone to high variance. We thus found two classifiers that generalize well to new data.

From semantic web technology, we proposed a scalable ontology-based classifier. This enhanced classifier classifies generic as well as specific job offers. We used the ontology to extract concepts from the textual descriptions of job offers, to find the minimum threshold for classification, and to develop a classification model.
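The classifier comparison described above can be sketched with scikit-learn, whose `RidgeClassifier` and `SGDClassifier` implement the two regularized linear models that generalized well. The toy corpus, labels, and `alpha` values below are invented for illustration only; the thesis used 5000+ real job offers and its own pre-processing.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import RidgeClassifier, SGDClassifier
from sklearn.pipeline import make_pipeline

# Tiny invented corpus (the thesis dataset has 5000+ real offers).
docs = [
    "software engineer wanted full time position apply now",
    "hiring data analyst competitive salary benefits",
    "my holiday photos from the trip last summer",
    "recipe for chocolate cake with dark frosting",
    "senior developer job opening remote work possible",
    "blog post about my favourite hiking trails",
]
labels = [1, 1, 0, 0, 1, 0]   # 1 = job offer, 0 = other web page

# Both models are L2-regularized (alpha) to limit variance/overfitting.
ridge = make_pipeline(TfidfVectorizer(), RidgeClassifier(alpha=1.0))
sgd = make_pipeline(TfidfVectorizer(),
                    SGDClassifier(alpha=1e-3, random_state=0))

for model in (ridge, sgd):
    model.fit(docs, labels)

# An unseen document: generalization is what the thesis measured.
new_doc = ["open position for a backend engineer apply today"]
prediction = ridge.predict(new_doc)
```

In practice, generalization error would be estimated on a held-out test split rather than on a single document, and the regularization strength tuned by cross-validation.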
We did not use any machine learning algorithm to develop this classifier, but we evaluated it in the machine learning evaluation mode, with training and testing datasets. Our classifier showed >90% accuracy, precision, and recall on both the training and testing datasets. With these promising results of the defined methods, we can automate the categorization and retrieval of job offers from the World Wide Web.
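The ontology-based approach can be sketched as concept matching against a minimum threshold, with no learned parameters. The mini-ontology, the threshold value, and the class names below are hypothetical placeholders; the thesis ontology and its derived threshold are far richer than this illustration.

```python
import re

# Hypothetical mini-ontology: each class maps to the concepts
# (terms) that characterise it.
ontology = {
    "job_offer": {"salary", "apply", "position", "hiring",
                  "qualifications", "experience", "employer"},
    "software_job": {"developer", "programming", "python", "java"},
}

THRESHOLD = 0.3   # hypothetical minimum fraction of matched concepts

def classify(text, concepts, threshold=THRESHOLD):
    """Accept the class when the fraction of ontology concepts
    found in the document reaches the threshold. No ML involved."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    matched = concepts & tokens
    return len(matched) / len(concepts) >= threshold

doc = ("We are hiring: open position for an experienced developer. "
       "Apply now, competitive salary.")

is_generic = classify(doc, ontology["job_offer"])      # generic class
is_specific = classify(doc, ontology["software_job"])  # specific class
```

Because the decision rule is a simple set intersection per class, adding new job categories only means adding concept sets to the ontology, which is what makes this style of classifier scalable.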