A multitude of linguistically-rich features for authorship attribution
Ludovic Tanguy,
Assaf Urieli,
Basilio Calderone,
Nabil Hathout and
Franck Sajous
2011
Notebook for PAN at CLEF 2011
Amsterdam, Netherlands
L. Tanguy, A. Urieli, B. Calderone, N. Hathout and F. Sajous (2011).
A multitude of linguistically-rich features for authorship attribution.
Notebook for PAN at CLEF 2011,
Amsterdam, Netherlands.
This paper reports on the procedure and learning models we adopted for the
‘PAN 2011 Author Identification’ challenge targeting real-world email messages.
The novelty of our approach lies in a design which combines shallow characteristics of the emails
(word and trigram frequencies) with a large number of ad hoc linguistically-rich features
addressing different language levels. For the author attribution tasks, all these features were
used to train a maximum entropy model which gave very good results.
For the single author verification tasks, a set of features based exclusively on the linguistic description of
the email messages was used as input for symbolic learning techniques (rules and
decision trees), which gave weak results. This paper presents in detail the features extracted
from the corpus, the learning models and the results obtained.
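
To make the "shallow characteristics" concrete, the following is a minimal sketch of how word and trigram frequencies might be computed for one email. It is not the authors' implementation: the function names, the regular-expression tokenizer, the lowercasing step, the `w=` / `t=` feature prefixes, and the reading of "trigrams" as character trigrams are all assumptions made for illustration.

```python
from collections import Counter
import re


def relative_frequencies(items):
    """Map each distinct item to its share of the total count."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()} if total else {}


def shallow_features(text):
    """Build one sparse feature dict of word and trigram frequencies.

    Assumptions (not taken from the paper): text is lowercased, words are
    runs of word characters, and trigrams are overlapping character trigrams.
    """
    lowered = text.lower()
    words = re.findall(r"\w+", lowered)
    trigrams = [lowered[i:i + 3] for i in range(len(lowered) - 2)]
    features = {f"w={w}": f for w, f in relative_frequencies(words).items()}
    features.update(
        {f"t={t}": f for t, f in relative_frequencies(trigrams).items()}
    )
    return features


# Hypothetical example message, invented for illustration only.
features = shallow_features("See you at the meeting. See you then.")
```

A sparse dictionary of this kind could be merged with the linguistically-rich features and passed to a classifier such as a maximum entropy model; that step is left out here because the abstract does not specify its details.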