Computational Stylistic Profiles for the Analysis of Modern Hebrew Prose


Israel Ministry of Science and Technology grant, December 2020, 320,000 NIS for three years

Prof. Avi Shmidman, Bar-Ilan University


Computational stylometry is the quantitative study of recurring features in our language that we use automatically, and that are as unique to each one of us as our fingerprints. It is a powerful method for “distant reading,” which, following Franco Moretti, offers a new approach to the analysis of literary texts, one that replaces the selective reading of a canon.

The primary purpose of this research is to leverage stylometric methods in order to contribute a new perspective on the study of modern Hebrew prose. Our approach includes computational analysis of linguistic features, such as morphology and syntax, and of semantic features that represent temporal and spatial dimensions and relations. To achieve this goal, we will develop cutting-edge natural language processing algorithms that automatically extract these features from each text. We will then relate the features to the metadata of the work and the author, and ultimately run clustering and classification algorithms that isolate the features of any given stylistic profile.

Our project contributes both to Hebrew literary stylometry and to the historiography of Hebrew literature. Furthermore, the algorithms we plan to develop will form a basis for the automated analysis of any modern Hebrew text, and will contribute to the ever-growing field of Hebrew natural language processing.


Preliminary results on gender differences in twentieth-century Hebrew prose: 141 books by female authors and 178 books by male authors.

The investigation examined masculine and feminine verbs in prose written by male and by female authors in three periods: authors born prior to 1930, authors born between 1930 and 1960, and authors born after 1960.


This graph shows that literature written by both male and female authors contains dramatically fewer feminine verbs than masculine verbs. Prose written by women contains more feminine verbs than prose written by men. We also see an increase in feminine verbs in prose written by women authors born after 1960. The counts are normalized per 10,000 words.
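The normalization and the grouping by birth period can be illustrated with a minimal Python sketch. This is not the project's code: the record fields (`author_gender`, `birth_year`, `total_words`, `feminine_verbs`, `masculine_verbs`) and the function names are hypothetical, the numbers in the usage example are invented toy values, and treating 1930 and 1960 as part of the middle period is an assumption, since the text does not say where the boundary years fall.

```python
from collections import defaultdict


def birth_cohort(birth_year: int) -> str:
    """Map an author's birth year to one of the three periods.

    Assumption: the boundary years 1930 and 1960 belong to the middle period.
    """
    if birth_year < 1930:
        return "before 1930"
    if birth_year <= 1960:
        return "1930-1960"
    return "after 1960"


def rate_per_10k(count: int, total_words: int) -> float:
    """Normalize a raw count to occurrences per 10,000 words."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return count * 10_000 / total_words


def cohort_rates(books):
    """Pool books by (author gender, birth period) and return verb rates.

    `books` is an iterable of dicts with the hypothetical keys
    author_gender, birth_year, total_words, feminine_verbs, masculine_verbs.
    """
    totals = defaultdict(lambda: {"words": 0, "feminine": 0, "masculine": 0})
    for book in books:
        key = (book["author_gender"], birth_cohort(book["birth_year"]))
        totals[key]["words"] += book["total_words"]
        totals[key]["feminine"] += book["feminine_verbs"]
        totals[key]["masculine"] += book["masculine_verbs"]
    return {
        key: {
            "feminine": rate_per_10k(t["feminine"], t["words"]),
            "masculine": rate_per_10k(t["masculine"], t["words"]),
        }
        for key, t in totals.items()
    }


if __name__ == "__main__":
    # Invented toy records, for illustration only.
    toy_books = [
        {"author_gender": "F", "birth_year": 1965, "total_words": 80_000,
         "feminine_verbs": 400, "masculine_verbs": 800},
        {"author_gender": "M", "birth_year": 1925, "total_words": 100_000,
         "feminine_verbs": 200, "masculine_verbs": 1_200},
    ]
    for group, rates in cohort_rates(toy_books).items():
        print(group, rates)
```

The sketch pools all words of a group before normalizing, so long books weigh more than short ones. Averaging per-book rates instead is an equally plausible choice; the text does not say which one the graph uses.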

Speaking verbs are the most frequent verbs. The verbs “say” and “ask” are the most-used verbs in literature written by both male and female authors. When grouping speaking verbs, we discovered that in early writings, female characters spoke less than male characters in works by both male and female authors (light green and light orange). However, the situation changed dramatically in later periods, especially in literature written by women. In later women’s literature, women talk more than men (see the light green line going up). In men’s literature the change is minor. This is the only case in which women’s actions outnumber men’s actions.
 
Preliminary results on stylistic profiles

This graph presents a distant reading map of Jerusalemite authors (Haim Be'er, Amos Oz, Haim Sabato, David Grossman, Emuna Elon, Dan Benaya Seri, Shifra Horn, Rivka Miriam, Brenner and Agnon), based on morphosyntactic features. Each author's books are fed to the computational system as sequences of morphological categories, without preserving any content words or lexemes whatsoever. Nevertheless, as we can see here, the books of any given author all cluster together; that is, the morphosyntactic patterns of each author are distinctive enough to set that author's books apart. At the same time, the graph also gives us insight into the extent to which any two authors tend to use similar morphosyntactic patterns.
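To make the idea of comparing books by morphological categories alone more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the system behind the graph: it assumes each book has already been reduced to a list of category tags (the tag names below are hypothetical), represents a book by the relative frequencies of its tag n-grams, and compares two books with cosine similarity. The actual features, tag set, and clustering method used in the project are not specified in the text.

```python
import math
from collections import Counter


def ngram_profile(tags, n=3):
    """Relative frequencies of tag n-grams in one book.

    `tags` is a list of morphological category labels; no words are kept.
    """
    grams = Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))
    total = sum(grams.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in grams.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse n-gram profiles (0.0 if either is empty)."""
    dot = sum(value * q.get(gram, 0.0) for gram, value in p.items())
    norm_p = math.sqrt(sum(value * value for value in p.values()))
    norm_q = math.sqrt(sum(value * value for value in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


if __name__ == "__main__":
    # Hypothetical tag sequences standing in for three books.
    book_a = ["NOUN", "VERB", "PREP", "NOUN", "VERB", "PREP", "NOUN", "ADJ"]
    book_b = ["NOUN", "VERB", "PREP", "NOUN", "ADJ", "VERB", "PREP", "NOUN"]
    book_c = ["CONJ", "ADV", "VERB", "PRON", "CONJ", "ADV", "VERB", "PRON"]

    profiles = {name: ngram_profile(tags, n=2)
                for name, tags in [("a", book_a), ("b", book_b), ("c", book_c)]}
    print("a vs b:", cosine_similarity(profiles["a"], profiles["b"]))
    print("a vs c:", cosine_similarity(profiles["a"], profiles["c"]))
```

A pairwise similarity matrix of this kind could then be passed to any standard clustering or dimensionality-reduction method to draw a map of books; which method produced the graph is not stated here.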