Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce
Ludovic Tanguy,
Franck Sajous,
Basilio Calderone and
Nabil Hathout
2012
Notebook for PAN at CLEF 2012
Rome, Italy
L. Tanguy, F. Sajous, B. Calderone and N. Hathout (2012).
Authorship Attribution: Using Rich Linguistic Features when Training Data is Scarce.
Notebook for PAN at CLEF 2012,
Rome, Italy.
We describe here the technical details of our participation in PAN
2012's “traditional” authorship attribution tasks.
The main originality of our approach lies in the use of a large number of varied features to represent
textual data, processed by a maximum entropy machine learning tool. Most of these
features make intensive use of natural language processing annotation techniques
as well as generic language resources such as lexicons and other linguistic
databases. Some of the features were even designed specifically for the target data
type (contemporary fiction). Our belief is that richer features, which integrate external
knowledge about language, have an advantage over knowledge-poorer ones
(such as word and character n-gram frequencies) when training data is scarce
(both in raw volume and in the number of training items per target author).
Although overall results were average (66% accuracy on the main tasks for the
best run), we focus in this paper on the differences between feature sets. While
the “rich” linguistic features proved better than character trigrams
and word frequencies, the most effective features vary widely from task to task.
For the intrusive paragraphs tasks, we obtained better results (73% and 93%),
still using the maximum entropy engine, this time as an unsupervised clustering tool.
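As a point of reference for the knowledge-poor baseline mentioned above, the sketch below shows how character trigram and word frequency features can feed a maximum entropy (multinomial logistic regression) classifier. It is an illustrative assumption built on scikit-learn, not the authors' actual pipeline or feature set; the toy texts and author labels are invented for the example.

```python
# Hedged sketch: a knowledge-poor baseline of the kind the paper compares against,
# i.e. character trigram and word frequency features fed to a maximum entropy
# (multinomial logistic regression) classifier. Corpus and feature details here
# are illustrative assumptions, not the authors' pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union
from sklearn.linear_model import LogisticRegression

# Toy training data: a few short texts per candidate author (invented examples).
train_texts = [
    "He walked slowly along the shore.",
    "The rain had stopped at dawn.",
    "She counted the coins twice, frowning.",
    "Money never lasted long with her.",
]
train_authors = ["A", "A", "B", "B"]

# Knowledge-poor feature sets: character trigram counts and word frequency counts.
features = make_union(
    CountVectorizer(analyzer="char", ngram_range=(3, 3)),
    CountVectorizer(analyzer="word"),
)

# Multinomial logistic regression is equivalent to a maximum entropy classifier
# over the extracted features.
model = make_pipeline(features, LogisticRegression(max_iter=1000))
model.fit(train_texts, train_authors)

print(model.predict(["The tide had turned before dawn."]))  # predicted author label
```

The rich linguistic features described in the paper (NLP annotations, lexicon-based cues, genre-specific markers) would replace or supplement these surface counts; the classifier itself stays the same.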