First semester

Web Data Mining

Objectives

By the end of this course, students should know how to collect information from the web, be familiar with the concepts of Information Retrieval, and know how to build corpora and organize them for exploratory analysis. They should also master the algorithm used to rank web pages (PageRank) and techniques for classifying text documents.

In addition, they should be familiar with opinion mining (text classification, sentiment analysis, model evaluation).

All practical work is carried out in R.

Outline

Part 1 – Information Retrieval: Preprocessing, Extraction and PageRank
Keywords: Twitter, R, PageRank, corpus, term-document matrix, information retrieval, tf-idf, stemming, regex, k-means

Theoretical part (3h)
– Information Retrieval
° Concepts & definitions
° Term-document matrix
° Tf-idf, cosine index, Jaccard index
° Stemming
– Web search: Google
° Google and PageRank
° Yellow Pages (notion of alphabetical sorting)
° Notion of graphs and eigenvectors
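
To make the link between graphs, eigenvectors and PageRank concrete, here is a minimal base-R sketch of PageRank computed by power iteration. The four-page graph, the function name `pagerank` and the damping factor of 0.85 are illustrative assumptions, not course material, and the sketch assumes every page has at least one outgoing link.

```r
# Hypothetical 4-page web graph: links[i, j] = 1 means page i links to page j.
links <- matrix(c(0, 1, 1, 0,
                  0, 0, 1, 0,
                  1, 0, 0, 1,
                  0, 0, 1, 0),
                nrow = 4, byrow = TRUE,
                dimnames = list(LETTERS[1:4], LETTERS[1:4]))

pagerank <- function(links, damping = 0.85, tol = 1e-10, max_iter = 1000) {
  n <- nrow(links)
  out_degree <- rowSums(links)
  stopifnot(all(out_degree > 0))      # assumption: no dangling pages
  transition <- links / out_degree    # row i divided by its out-degree
  rank <- rep(1 / n, n)               # start from the uniform distribution
  for (i in seq_len(max_iter)) {
    # One power-iteration step on the damped transition matrix
    new_rank <- (1 - damping) / n +
      damping * as.vector(t(transition) %*% rank)
    converged <- sum(abs(new_rank - rank)) < tol
    rank <- new_rank
    if (converged) break
  }
  names(rank) <- rownames(links)
  rank
}

scores <- pagerank(links)
print(round(scores, 3))
```

The resulting vector approximates the dominant eigenvector of the damped transition matrix; page C, which receives links from A, B and D, ends up with the highest score.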

Practical part (9h)
– Practical exercise 1: Introduction to R for web mining (3h)
° Installation of the text mining libraries available in R
° Gathering information from the web: Twitter, Wikipedia
° Pre-processing: stemming, lemmatization
° Parsing HTML and XML
° Tokenization
° Introduction to the term-document matrix
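
As a rough idea of what tokenization and a term-document matrix look like, here is a small sketch in base R only. The three documents, the `tokenize` helper and the letters-only regex (which would drop accented characters) are assumptions made for illustration; in the exercise itself a dedicated text mining library would normally do this work.

```r
# Hypothetical toy corpus
docs <- c(d1 = "The cat sat on the mat.",
          d2 = "The dog chased the cat!",
          d3 = "Dogs and cats are pets.")

# Lowercase, replace anything that is not a-z with a space, split on spaces
tokenize <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z]+", " ", text)
  strsplit(trimws(text), " +")[[1]]
}

tokens <- lapply(docs, tokenize)
vocabulary <- sort(unique(unlist(tokens)))

# Term-document matrix: one row per term, one column per document
tdm <- vapply(tokens,
              function(tok) tabulate(match(tok, vocabulary),
                                     nbins = length(vocabulary)),
              integer(length(vocabulary)))
rownames(tdm) <- vocabulary
print(tdm)
```

No stemming is applied here, so "cat" and "cats" are counted as two distinct terms, which is exactly the problem that stemming and lemmatization address.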

– Practical exercise 2: Document similarity, applied to user searches on the pagesjaunes.fr website (3h)
° Similarity indices: tf, tf-idf, Jaccard, cosine
° Damerau-Levenshtein distance, Jaro distance
° Links between searches, notion of search graph
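
The similarity indices listed above can be sketched in a few lines of base R. The three queries are invented examples (not real pagesjaunes.fr data), the helper names `jaccard` and `cosine` are hypothetical, and the idf variant used here, log(N / df), is only one of several common definitions.

```r
# Hypothetical search queries
queries <- c(q1 = "plumber paris",
             q2 = "emergency plumber paris",
             q3 = "pizza lyon")
tokens <- strsplit(queries, " ", fixed = TRUE)

# Jaccard index: shared terms divided by all distinct terms
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Term frequencies (terms in rows, queries in columns)
vocabulary <- sort(unique(unlist(tokens)))
tf <- vapply(tokens,
             function(tok) tabulate(match(tok, vocabulary),
                                    nbins = length(vocabulary)),
             integer(length(vocabulary)))
rownames(tf) <- vocabulary

# tf-idf weighting with idf = log(N / document frequency)
idf <- log(ncol(tf) / rowSums(tf > 0))
tfidf <- tf * idf

# Cosine index between two weighted vectors
cosine <- function(x, y) sum(x * y) / sqrt(sum(x^2) * sum(y^2))

print(jaccard(tokens$q1, tokens$q2))
print(cosine(tfidf[, "q1"], tfidf[, "q2"]))
print(cosine(tfidf[, "q1"], tfidf[, "q3"]))
```

Queries q1 and q3 share no term, so both indices are zero for that pair, while q1 and q2 overlap on two of three distinct terms.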

– Practical exercise 3: Ordering search results (3h)
° PageRank
° Detecting keywords
° Introduction to keyword-based document classification

Part 2 – Opinion Mining: Text mining, sentiment analysis, classification and model evaluation
Keywords: Facebook, R, opinion mining, corpus, sentiment analysis, syntactic annotation

Theoretical part (4h)
– Introduction
° Fields of application

– State of the art (opinion mining, sentiment analysis, affective computing)
° Which descriptors for which types of data?
* Textual
* Audio
* Image
° Automatic descriptor selection (search space reduction)
° Which classification algorithms for which cases?

– Corpus construction
° General considerations on data quality and its impact
° Manual and automatic annotation (annotation scheme, computing an inter-annotator agreement score, etc.)
° Distribution of data across classes

– Pre-processing (text)
° Which granularity for the data (words, sentences, paragraphs)?
° Syntactic and semantic annotation (e.g. POS tagging, WordNet-Affect)

– Evaluation
° Which metrics should be used to measure the quality of a model (recall, precision, F-score, ROC, 95% confidence interval)?
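
For reference, precision, recall and F-score can be computed directly from gold labels and predictions, as in the base-R sketch below. The six labels, the `evaluate` helper and the choice of "pos" as the positive class are made-up assumptions for illustration only.

```r
# Hypothetical gold labels and model predictions for six reviews
gold      <- c("pos", "pos", "neg", "neg", "pos", "neg")
predicted <- c("pos", "neg", "neg", "pos", "pos", "neg")

evaluate <- function(gold, predicted, positive = "pos") {
  tp <- sum(predicted == positive & gold == positive)  # true positives
  fp <- sum(predicted == positive & gold != positive)  # false positives
  fn <- sum(predicted != positive & gold == positive)  # false negatives
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  f1        <- 2 * precision * recall / (precision + recall)
  c(precision = precision, recall = recall, f1 = f1)
}

metrics <- evaluate(gold, predicted)
print(round(metrics, 3))
```

With these labels there are two true positives, one false positive and one false negative, so precision, recall and F-score all equal 2/3.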

– Commercial products (examples)
° TEMIS product (sentiment cartridge)
° Sinequa product

Practical part (8h)

– Practical exercise 1: Classifying the valence of literary texts (film reviews)

– Practical exercise 2: Classifying the valence of texts from social networks (Twitter, Facebook)

– Practical exercise 3: Model merging (using the models created in practical exercise 2)

– Practical exercise 4 (optional): Building models from multimodal cues (text + audio)

Prerequisites

SQL