Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce
Ludovic Tanguy,
Franck Sajous,
Basilio Calderone and
Nabil Hathout
2012
Notebook for PAN at CLEF 2012
Rome, Italy
L. Tanguy, F. Sajous, B. Calderone and N. Hathout (2012).
Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce.
Notebook for PAN at CLEF 2012,
Rome, Italy.
We describe here the technical details of our participation in PAN
2012's “traditional” authorship attribution tasks.
The main originality of our approach lies in the use of a large number of varied features to represent
textual data, processed by a maximum entropy machine learning tool. Most of these
features make intensive use of natural language processing annotation techniques
as well as generic language resources such as lexicons and other linguistic
databases. Some of the features were even designed specifically for the target data
type (contemporary fiction). Our belief is that richer features, which integrate external
knowledge about language, have an advantage over knowledge-poorer ones
(such as word and character n-gram frequencies) when training data is scarce
(both in raw volume and in the number of training items per target author).
Although overall results were average (66% accuracy on the main tasks for the
best run), we focus in this paper on the differences between feature sets. While
the “rich” linguistic features proved better than character trigrams
and word frequencies, the most effective features vary widely from task to task.
For the intrusive paragraphs tasks, we obtained better results (73% and 93%),
still using the maximum entropy engine, this time as an unsupervised clustering tool.
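As a point of reference for the knowledge-poor baseline mentioned above, the sketch below shows how character trigram and word frequency features can feed a maximum entropy (multinomial logistic regression) classifier. It is an illustrative assumption built on scikit-learn, not the authors' actual pipeline or feature set; the toy texts and author labels are invented for the example.

```python
# Hedged sketch: a knowledge-poor baseline of the kind the paper compares against,
# i.e. character trigram and word frequency features fed to a maximum entropy
# (multinomial logistic regression) classifier. Corpus and feature details here
# are illustrative assumptions, not the authors' pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.linear_model import LogisticRegression

# Toy training data: a few short texts per candidate author (invented examples).
train_texts = [
    "He walked slowly along the shore.",
    "The rain had stopped at dawn.",
    "She counted the coins twice, frowning.",
    "Money never lasted long with her.",
]
train_authors = ["A", "A", "B", "B"]

# Knowledge-poor feature sets: character trigram counts and word frequency counts.
features = make_union(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),
    CountVectorizer(analyzer="word"),
)

# Multinomial logistic regression is equivalent to a maximum entropy classifier
# over the extracted features.
model = make_pipeline(features, LogisticRegression(max_iter=1000))
model.fit(train_texts, train_authors)

print(model.predict(["The tide had turned before dawn."]))  # predicted author label
```

The rich linguistic features described in the paper (NLP annotations, lexicon-based cues, genre-specific markers) would replace or supplement these surface counts; the classifier itself stays the same.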