Keywords document processing, web analysis, classification, natural language processing, web graph, link analysis, intellectual property protection

WebLab is an open-source platform (released under LGPL 2.1) for building intelligence systems that need to process multimedia data. A system based on WebLab thus tackles the problem of “unstructured document processing”, in particular the analysis of documents coming from the Internet. One of its typical applications is media monitoring, which can serve many different business needs.

In that context, a particular problem is that content found on the web may be protected by specific terms of use and/or copyright, which can prevent (or limit) the use of such content. Finding the legal terms that apply to a piece of content is thus a very important operational issue, and currently these terms can only be found through manual assessment.

The objective of this programming contest would be to try to automate the detection of the terms of use and copyright of content found on the web. This work would be organised in 3 steps:
1- learn to use a WebLab system to collect and process content from websites;
2- implement a specific analysis service that tries to detect whether a page contains legal terms. This problem will probably require the manual annotation of documents in order to create a corpus of samples. The analysis service will then apply different techniques (from rule-based to NLP) in order to tag documents;
3- when a document is not tagged as being “legal terms”, detect possible related or linked pages that point to the legal terms that could apply. This will probably include analysis of the website map and link graph, applying specific algorithms that deduce which terms apply to which documents.
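
As an illustration of steps 2 and 3, the following is a minimal rule-based sketch in Python. It is not part of WebLab and does not use its API. The cue phrases, the threshold and all function names are hypothetical choices made for this example; a real service would likely derive its rules or model from the annotated corpus.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical cue phrases (an assumption for this sketch, not from WebLab).
LEGAL_CUES = (
    "terms of use",
    "terms of service",
    "copyright",
    "all rights reserved",
    "license",
    "legal notice",
)


def legal_score(text):
    """Count how many distinct cue phrases occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sum(1 for cue in LEGAL_CUES if cue in lowered)


def is_legal_terms(text, threshold=2):
    """Step 2 (rule-based): tag a page as legal terms if enough cues match."""
    return legal_score(text) >= threshold


class LinkCollector(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from the <a> tags of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the URL of the page.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def candidate_legal_links(html, base_url):
    """Step 3 (simplified): outgoing links whose anchor text or URL looks legal."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    candidates = []
    for url, anchor_text in collector.links:
        # Treat hyphens and underscores in the URL as word separators.
        url_words = url.replace("-", " ").replace("_", " ")
        if legal_score(anchor_text + " " + url_words) >= 1:
            candidates.append(url)
    return candidates
```

This sketch only inspects the outgoing links of a single page. Deciding which terms apply to which documents across a whole site would require the site map and link graph analysis mentioned in step 3.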

Main Topic Contact Person Name Gérard Dupont
Main Topic Contact Person e-mail ger.dupont@gmail.com
Other Topic Contact Person(s) Name(s) (optional) weblab user mailing list
Other Topic Contact e-mail(s) (optional) user@weblab-project.org
Estimated Workload (total, in manmonths) 6
Targeted Contestants master/PhD

