Wei Aiyun, Luo Yinlin
Abstract: As two central representatives of the American “Lost Generation,” Hemingway and Fitzgerald developed distinct writing styles despite their shared thematic concerns. Previous studies of their styles have been largely qualitative and therefore lack sufficient objective quantitative support. This study constructs two specialized corpora based on their representative works and adopts a mixed quantitative-qualitative approach to compare their stylistic differences at the lexical, syntactic, and textual levels using Python, SPSS, and related tools. The results show that: (1) at the lexical level, word length measured by letters significantly distinguishes the two corpora, whereas word length measured by syllables and lexical density do not; Fitzgerald also displays higher lexical diversity, reflecting his preference for delicate and varied vocabulary in depicting the complex social world of the Jazz Age; (2) at the syntactic level, most indicators reveal significant differences, with Fitzgerald’s works showing greater syntactic complexity and a longer mean sentence length; and (3) at the textual level, no significant difference is found in overall paragraph-length distribution, and both writers tend to favor short, tightly structured paragraphs. This study confirms the effectiveness of quantitative stylistic methods, provides objective evidence for comparing the two authors’ styles, enriches stylistic research on modernist literature, and further promotes empirical approaches to literary studies.
Keywords: Ernest Hemingway; F. Scott Fitzgerald; novels; stylometry
1. Introduction
Stylistic studies have gradually developed into a more systematic and rigorous field, and an increasing number of scholars now focus on their quantitative dimension. The quantitative exploration of style in natural language has thus given rise to stylometry (Tuldava 141). To date, stylometry has centered its applications in linguistics on three core directions: first, authorship attribution based on the quantitative analysis of linguistic features, which plays a pivotal role in areas such as identifying anonymous texts and authenticating disputed works; today, this method has also been extended to a wide range of text genres (Puspitasari, Fakhrurroja and Sutrisno; Azimov; Mikros; Zhu, Lei and Craig; Savoy); second, the comparative study of authors’ stylistic features, which reveals stylistic differences and shared characteristics across texts through multi-dimensional quantitative indicators (Hameed and Ali; Yasin and Faizullah; Ren, Liu and Zhang; Shen and Wu; Tu and Liu); and third, the identification and systematic classification of distinct genres and registers (Hou, Yang and Jiang; Mandravickaite and Krilavicius). In summary, stylometry, with its strength in quantitative analysis, has demonstrated substantial value and broad prospects in linguistic research.
As important representatives of the “Lost Generation,” Ernest Hemingway and F. Scott Fitzgerald played a crucial role in the development of modernist literature in the United States. Existing domestic and international studies on the stylistic features of their novels are abundant, but both strands of research remain dominated by qualitative analysis, while quantitative and corpus-based studies are relatively limited and insufficiently developed. Linguistic studies (Wen; Li; Tian) and narrative studies (Chen; C. Wu; X. Cheng; Muradian) constitute the main research directions, and studies of rhetorical devices (Prasanwon; Donald) are also relatively numerous. In addition, a number of studies have adopted quantitative approaches to examine the works of the two authors, with some focusing on the lexical level (Deli; Liu and Li) and others using a broader set of indicators (Ihrmark and Nilsson).
Overall, existing studies share several limitations: their research objects are mostly confined to classic individual texts such as The Old Man and the Sea and Tender Is the Night; their methods are still dominated by traditional qualitative analysis; and quantitative studies largely focus on surface linguistic features such as word length. In response to these gaps, this study adopts a stylometric perspective and constructs a multi-level analytical framework encompassing lexical, syntactic, and textual dimensions in order to conduct a systematic quantitative comparison of representative works by the two authors. To achieve this objective, the study addresses three core research questions:
(1) What are the similarities and differences between Hemingway’s and Fitzgerald’s novels at the lexical level?
(2) What are the similarities and differences between Hemingway’s and Fitzgerald’s novels at the syntactic level?
(3) What are the similarities and differences between Hemingway’s and Fitzgerald’s novels at the textual level?
2. Research Materials and Methods
2.1 Research Materials
This study adopts quantitative methods to conduct a comparative stylistic analysis of Ernest Hemingway’s and F. Scott Fitzgerald’s works. Since general corpora do not meet the needs of this research, two specialized corpora—the Hemingway Corpus (HC) and the Fitzgerald Corpus (FC)—were compiled, with texts selected for their representativeness of the authors’ core stylistic and thematic features. Both corpora include novels and short stories to enable cross-length linguistic analysis while maintaining comparable scale, and all materials were drawn from digital archives in order to ensure textual quality.
Specifically, the HC comprises three representative novels (A Farewell to Arms, For Whom the Bell Tolls, and Across the River and Into the Trees), the novella The Old Man and the Sea, and the major short-story collection Winner Take Nothing. The FC includes three iconic novels (This Side of Paradise, The Beautiful and Damned, and Tender Is the Night), the novella The Great Gatsby, and the early short-story collection Flappers and Philosophers, in order to reflect the poetic and ornate prose style that established Fitzgerald as a chronicler of the Jazz Age.
All downloaded texts were saved in TXT format. Texts originally in other formats (such as EPUB) were converted to TXT to facilitate subsequent code-based reading and processing. This standardization ensures compatibility with our custom Python scripts for lexical and syntactic analysis and eliminates potential parsing errors caused by formatting metadata in the original files.
Table 1 Basic Information on HC and FC
| Corpus | Book Name | Word Count | Total Word Count |
|---|---|---|---|
| HC | A Farewell to Arms | 88544 | 395826 |
| | Across the River and Into the Trees | 67727 | |
| | For Whom the Bell Tolls | 174130 | |
| | The Old Man and the Sea | 26584 | |
| | Winner Take Nothing | 38841 | |
| FC | Tender Is the Night | 108260 | 420759 |
| | This Side of Paradise | 80648 | |
| | The Beautiful and Damned | 123087 | |
| | The Great Gatsby | 48327 | |
| | Flappers and Philosophers | 60437 | |
2.2 Research Methods
Python serves as the primary tool for most data-processing tasks in this study. To measure word length, a custom Python program was developed. Two distinct methods were implemented: one calculates word length on the basis of the number of syllables using the textstat package, while the other counts the number of letters in each word.
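The letter-based measure can be sketched as follows. This is a minimal illustration rather than the study's actual program: the tokenization rule, the function names, and the decision to ignore apostrophes when counting letters are all assumptions, and the syllable-based measure (computed with textstat in the study) is not reproduced here.

```python
import re
from collections import Counter

# Assumed tokenization: runs of letters, optionally joined by an apostrophe.
WORD_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*")

def tokenize(text):
    """Return word tokens; apostrophes inside a word are kept (e.g. "didn't")."""
    return WORD_RE.findall(text)

def letter_length(word):
    """Count alphabetic characters only, so "didn't" has length 5."""
    return sum(ch.isalpha() for ch in word)

def mean_word_length(words):
    """Mean number of letters per token."""
    return sum(letter_length(w) for w in words) / len(words)

def length_distribution(words):
    """Map each word length to its share of all tokens."""
    counts = Counter(letter_length(w) for w in words)
    total = len(words)
    return {length: counts[length] / total for length in sorted(counts)}

sample = "He was an old man who fished alone in a skiff"
words = tokenize(sample)
print(round(mean_word_length(words), 2))  # 3.18
print(length_distribution(words))
```

Applied to a whole corpus, `mean_word_length` would yield a figure comparable to the letter-based means in Table 2, and `length_distribution` the proportions plotted in a word-length distribution chart.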
For syntactic complexity analysis, this study employs the L2 Syntactic Complexity Analyzer (L2SCA) developed by Professor Xiaofei Lu. The tool was downloaded and integrated into our analytical workflow. It processes input text files automatically and generates a comprehensive set of syntactic indices, which are then exported for subsequent statistical analysis, as follows:
self.l2sca_dir = os.path.join(base_path, "L2SCA-minimal")
Moving Average Type-Token Ratio (MATTR) is used to measure lexical diversity. This metric was calculated using custom code. Michael A. Covington and Joe D. McFall recommend a window size of 500 words for stylometric research, but a larger window may be adopted when text length permits (97). Accordingly, a window size of 1,000 words is used in this study.
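A moving-average TTR of this kind can be sketched in a few lines. The version below is an assumption-laden illustration, not the study's code: it lowercases tokens before counting, advances the window one token at a time, and falls back to the plain TTR when a text is shorter than the window.

```python
def mattr(tokens, window=1000):
    """Moving-average TTR: the mean TTR over every window of `window` tokens."""
    tokens = [t.lower() for t in tokens]  # assumed normalization
    n = len(tokens)
    if n == 0:
        raise ValueError("empty token list")
    if n <= window:
        return len(set(tokens)) / n  # text too short: plain TTR

    # Type counts for the first window.
    counts = {}
    for t in tokens[:window]:
        counts[t] = counts.get(t, 0) + 1
    total = len(counts) / window

    # Slide the window: drop the oldest token, add the newest one.
    for i in range(window, n):
        out_tok, in_tok = tokens[i - window], tokens[i]
        counts[out_tok] -= 1
        if counts[out_tok] == 0:
            del counts[out_tok]
        counts[in_tok] = counts.get(in_tok, 0) + 1
        total += len(counts) / window

    return total / (n - window + 1)

# Toy check with a window of 2: windows are "a a", "a b", "b b".
print(mattr(["a", "a", "b", "b"], window=2))
```

Because every window has the same size, the resulting value does not shrink as the text grows, which is the property that motivates the measure.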
In addition, textual-level features were also calculated using Python.
For statistical analysis, SPSS was used. Two main statistical procedures were conducted: independent-samples t-tests and non-parametric tests.
3. Data Analysis
3.1 Comparative Features of the HC and FC at Lexical Level
Vocabulary serves as a central carrier of authors’ creative styles, textual emotions, and thematic expression. This section focuses on the lexical dimension of Hemingway’s and Fitzgerald’s works, comparing lexical complexity, word-length distribution, lexical diversity, and lexical density. It aims to reveal how their distinct lexical features underpin Hemingway’s objective, iceberg-like narration and Fitzgerald’s poetic, emotionally textured expression, thereby providing lexical evidence for the subsequent analysis of stylistic differences between their texts.
3.1.1 Word Length and Word-Length Distribution
Generally, the longer a word is, the greater its complexity and the more difficult it is to master (Wang, Yu, and Wu 115). The average word length of ordinary texts is approximately four letters. An average word length of fewer than four letters indicates relatively simple and easily comprehensible vocabulary, whereas a longer average word length suggests more complex lexical choices. Thus, average word length can reflect the lexical complexity of an author’s word choice in a given work (Zhu and Li 78).
Heng Chen and Haitao Liu argue that selecting an appropriate unit of measurement is a critical prerequisite for the validity of research on word-length distribution (8). Current studies of word length have attempted to measure this feature from multiple dimensions, including letters, phonemes, and syllables. Although most studies use the number of letters to measure word length, the number of syllables also affects the difficulty of English texts. Ioan-Iovitz Popescu et al. maintain that, when assessing word length, the only viable measurement unit is the number of syllables (225). Therefore, drawing on existing research findings and perspectives, this study adopts two approaches to measuring word length: the number of letters and the number of syllables per word.
Table 2 Mean Word Length in HC and FC
| Corpus | Mean Word Length (by letters) | Mean Word Length (by syllables) |
|---|---|---|
| HC | 3.92 | 1.26 |
| FC | 4.27 | 1.4 |
Based on the mean word-length data, it is clear that FC exceeds HC in both letter-based and syllable-based mean word length (MWL), with the difference being more noticeable in the letter-based measure. This discrepancy aligns with the authors’ distinct stylistic traits. Hemingway’s MWL of 3.92 letters and 1.26 syllables reflects his iconic “Iceberg Theory”: he strips language to its essentials and uses short, forceful words to convey emotion beneath the surface. His prose, rooted in journalistic brevity, relies on terse dialogue and economical description, in which even a single-syllable word can carry significant weight. In contrast, Fitzgerald’s averages of 4.27 letters and 1.4 syllables mirror his lyrical and ornate style. He favors words with subtle cadence to depict the glittering yet fragile world of the Jazz Age. His longer words and richer syllabic patterns help evoke the complexity of social aspiration and inner turmoil, often through layered adjectives and rhythmic phrasing that create atmospheric depth.
Figure 1 Word-Length Distribution of HC and FC (by Letters)
In terms of letter-based word-length distribution (WLD), the works of Hemingway and Fitzgerald show a similar overall trend with subtle differences. Both writers predominantly use words of one to five letters, and both reach their peak at three-letter words. Specifically, Hemingway has the highest proportion of three-letter words (more than 26%), which is noticeably higher than Fitzgerald’s. As word length increases beyond five letters, the proportion of words used by both writers drops sharply, but Fitzgerald uses longer words (such as five-to-thirteen-letter words) in a slightly higher proportion than Hemingway. Overall, short words dominate in both writers’ works, which accords with the general law of word-length distribution. Because both datasets conform to a normal distribution, an independent-samples t-test was conducted to examine whether the difference between them was statistically significant. The results show that p < 0.001 (< 0.05), indicating a statistically significant difference in word length measured by letters, with Fitzgerald’s words being significantly longer on average than Hemingway’s.
Figure 2 Word-Length Distribution of HC and FC (by Syllables)
In terms of syllable-based word-length distribution, the works of Hemingway and Fitzgerald show both similarities and differences. Both writers predominantly use monosyllabic words, which account for more than 80% of HC and more than 70% of FC, and both distributions peak at one syllable. More specifically, Hemingway has a markedly higher proportion of monosyllabic words, whereas Fitzgerald has a higher proportion of disyllabic words, approaching 20%, which is notably above Hemingway’s level. In the use of polysyllabic words, both writers show low proportions, but Fitzgerald uses slightly more three-syllable words than Hemingway. The proportion of words with four or more syllables is nearly identical in the two corpora and remains very low. Overall, both writers prefer words with fewer syllables, but Hemingway relies more heavily on monosyllabic words, whereas Fitzgerald uses relatively more disyllabic words, a difference that aligns with their stylistic characteristics. Because HC conforms to a normal distribution at the syllable level while FC does not, a chi-square test was used to examine whether the difference was statistically significant. The results show that p = 0.247 (> 0.05), indicating that there is no statistically significant difference between HC and FC in syllable-based word length.
Based on the above results, word length measured by letters serves as an effective indicator for distinguishing the works of the two writers, whereas word length measured by syllables does not.
3.1.2 Vocabulary Richness
After examining the average word length of the two corpora, this section turns to lexical richness. Lexical richness is a crucial composite indicator in the study of authorial style and remains a focal topic in quantitative stylistics (Smith and Kelly 411). Batia Laufer and Paul Nation argue that lexical richness can be examined through lexical originality, lexical density, lexical sophistication, and lexical variation (309). This paper mainly adopts two indicators: lexical diversity and lexical density.
(i) Lexical Diversity
Wendell Johnson was the first to employ the Type-Token Ratio (TTR) to quantify the breadth of lexical range (1). A type is a distinct word form in a text, while a token is a single occurrence of a word. The TTR is calculated as the percentage of types relative to tokens, with the formula expressed as follows:
TTR = Number of Types ÷ Number of Tokens × 100%
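The formula translates directly into code. The sketch below is illustrative only: the lowercasing step and the toy token lists are assumptions, not part of the study's data.

```python
def ttr(tokens):
    """Type-Token Ratio as a percentage: distinct word forms over all tokens."""
    types = {t.lower() for t in tokens}  # assumed case-insensitive matching
    return len(types) / len(tokens) * 100

short = "the old man fished alone".split()
longer = short * 4  # same vocabulary, four times as many tokens

print(ttr(short))   # 100.0
print(ttr(longer))  # 25.0
```

The two calls use the same five types, yet the ratio falls from 100% to 25% once the text is four times as long.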
Although the TTR formula is straightforward, it is highly sensitive to text length. Longer texts tend to contain more repeated words, which lowers the proportion of types. To address this limitation, this study adopts the Moving Average Type-Token Ratio (MATTR) proposed by Covington and McFall (96). For subsequent normality analysis, each corpus theoretically requires a minimum of 30 samples. To ensure text coherence, textual integrity, and manageable variation in sample size, the two corpora were divided into 62 independent samples (31 per corpus), with each sample containing 12,000 to 15,000 words. An excerpt of the output produced by the segmentation script for HC is presented below:
"split_info": "divide evenly into 31 parts",
"total_segments": 31,
"total_words": 395826,
"segment_stats": [
"segment_id": 1,
"word_count": 13664,
"char_count": 67536,
"text_preview": "He was an old man who fished alone in a skiff in the Gulf Stream and he had gone eighty-four days no..."
After the texts were segmented into the specified number of parts by code, the 62 samples were manually checked to ensure the coherence and integrity of each segment. Subsequently, the MATTR value of each sample was calculated using code, with the sliding-window size set to 1,000. The detailed results are presented in the following table:
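The automatic part of this step can be sketched as below. This is an assumption-based illustration rather than the study's script: it splits purely by token count, whereas the reported sample sizes vary by several hundred words, which suggests the actual segments were aligned to textual boundaries or adjusted during the manual check. The field names mirror those in the output excerpt; the helper names are hypothetical.

```python
def split_evenly(words, parts=31):
    """Split a token list into `parts` contiguous segments whose sizes differ by at most one."""
    if parts <= 0 or parts > len(words):
        raise ValueError("parts must be between 1 and the number of words")
    base, extra = divmod(len(words), parts)
    segments, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # the first `extra` segments get one more word
        segments.append(words[start:start + size])
        start += size
    return segments

def describe(segments, preview_chars=100):
    """Build a summary similar in shape to the segmentation output excerpt."""
    stats = []
    for idx, seg in enumerate(segments, start=1):
        text = " ".join(seg)
        stats.append({
            "segment_id": idx,
            "word_count": len(seg),
            "char_count": len(text),
            "text_preview": text[:preview_chars],
        })
    return {
        "total_segments": len(segments),
        "total_words": sum(len(seg) for seg in segments),
        "segment_stats": stats,
    }

# Hypothetical input: 100 placeholder tokens.
words = [f"w{i}" for i in range(100)]
summary = describe(split_evenly(words, parts=31))
print(summary["total_segments"], summary["total_words"])  # 31 100
```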
Table 3 MATTR Values of Individual Samples in HC and FC (Window Size = 1000)
| Sample (HC) | Word Number | MATTR | Sample (FC) | Word Number | MATTR |
|---|---|---|---|---|---|
| 1 | 13511 | 0.349 | 1 | 13631 | 0.467 |
| 2 | 13794 | 0.336 | 2 | 13574 | 0.45 |
| 3 | 12959 | 0.355 | 3 | 13420 | 0.44 |
| 4 | 13138 | 0.346 | 4 | 13575 | 0.44 |
| 5 | 13416 | 0.333 | 5 | 13626 | 0.462 |
| 6 | 13068 | 0.33 | 6 | 13720 | 0.451 |
| 7 | 13217 | 0.331 | 7 | 13684 | 0.453 |
| 8 | 13199 | 0.334 | 8 | 13810 | 0.44 |
| 9 | 13523 | 0.326 | 9 | 13875 | 0.464 |
| 10 | 12821 | 0.393 | 10 | 14129 | 0.459 |
| 11 | 12852 | 0.361 | 11 | 14400 | 0.44 |
| 12 | 12827 | 0.367 | 12 | 12713 | 0.456 |
| 13 | 12764 | 0.38 | 13 | 14158 | 0.452 |
| 14 | 13142 | 0.375 | 14 | 13853 | 0.458 |
| 15 | 12878 | 0.35 | 15 | 13857 | 0.44 |
| 16 | 12867 | 0.352 | 16 | 14093 | 0.445 |
| 17 | 12758 | 0.343 | 17 | 13103 | 0.494 |
| 18 | 12934 | 0.332 | 18 | 13498 | 0.463 |
| 19 | 12726 | 0.338 | 19 | 13463 | 0.458 |
| 20 | 13133 | 0.345 | 20 | 13634 | 0.431 |
| 21 | 12650 | 0.347 | 21 | 13585 | 0.47 |
| 22 | 12805 | 0.356 | 22 | 13253 | 0.48 |
| 23 | 12767 | 0.358 | 23 | 13182 | 0.471 |
| 24 | 12840 | 0.36 | 24 | 13309 | 0.47 |
| 25 | 13303 | 0.339 | 25 | 13480 | 0.464 |
| 26 | 12676 | 0.368 | 26 | 13289 | 0.468 |
| 27 | 12766 | 0.366 | 27 | 13349 | 0.46 |
| 28 | 13105 | 0.341 | 28 | 13089 | 0.473 |
| 29 | 13196 | 0.341 | 29 | 13562 | 0.472 |
| 30 | 12841 | 0.37 | 30 | 13294 | 0.464 |
| 31 | 13096 | 0.347 | 31 | 13533 | 0.45 |
| Mean | | 0.351 | Mean | | 0.458 |
As shown in Table 3, the average MATTR value of Fitzgerald’s works is higher than that of Hemingway’s. Because both groups of data conform to a normal distribution, an independent-samples t-test was conducted to examine whether the difference was statistically significant. The results are presented as follows:
Table 4 Results of the Independent-Samples t-Test (MATTR)
| Corpus | Mean | Standard Deviation | Sig. |
|---|---|---|---|
| HC | 0.351 | 0.016 | <0.001 |
| FC | 0.458 | 0.014 | |
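As a rough plausibility check, the test statistic can be recomputed from the summary values in Table 4 with 31 samples per corpus. The sketch below uses Welch's form of the t statistic, which is an assumption about presentation rather than a description of the SPSS procedure; with equal group sizes it gives the same t value as the pooled-variance form. The p-value uses a normal approximation because the standard library has no t distribution, so it is indicative only.

```python
import math
from statistics import NormalDist

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Return Welch's t statistic and its approximate degrees of freedom."""
    v1, v2 = sd1 ** 2 / n1, sd2 ** 2 / n2
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df

# Summary values from Table 4 (FC first, so t is positive).
t, df = welch_t(0.458, 0.014, 31, 0.351, 0.016, 31)

# Two-sided p-value under a normal approximation (indicative only).
p_approx = 2 * (1 - NormalDist().cdf(abs(t)))

print(round(t, 1), round(df, 1), p_approx < 0.001)
```

A t value of this size is far beyond any conventional critical value, which agrees with the reported p < 0.001.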
According to the test results (p < 0.001), there is a statistically significant difference between HC and FC. Fitzgerald’s works exhibit significantly higher lexical diversity than Hemingway’s. This finding directly reflects the distinct linguistic styles of the two authors at the lexical level. Fitzgerald’s writing is consistently marked by lexical delicacy and diversity. Taking his masterpiece The Great Gatsby as an example, descriptions of the opulent mansions in East Egg and of characters’ dress and demeanor are replete with precise and varied vocabulary. He uses adjectives of different semantic gradations, such as “gorgeous,” “sumptuous,” and “resplendent,” to portray scenes of luxury. Fitzgerald also paid meticulous attention to lexical selection, using precise diction to delineate characters’ external appearances and inner personalities with subtlety (Sun 63). For instance, words such as “turbulent,” “melancholy,” and “yearning” are used to progressively reveal the complexity of characters’ inner worlds. Furthermore, his works frequently employ imagistic vocabulary; symbols such as the “green light,” the “dock,” and various colors carry thematic weight. Through the intricate interweaving of symbolic devices related to objects, characters, and colors, the novel constructs a rich symbolic system and achieves remarkable artistic success (F. Cheng 58).
In contrast, Hemingway’s iconic “Iceberg Principle” is closely associated with lexical simplicity and repetition, which constitutes a key reason for his lower MATTR values. This finding is consistent with the research of Daniel Ihrmark and Johan Nilsson, who concluded that Hemingway’s lexical choices tend to be less varied and more repetitive (83). Hemingway advocated writing that reveals only the tip of the iceberg, leaving deeper meanings for readers to infer. This philosophy is reflected in his lexical choices, which favor short, precise, and high-frequency basic words. In A Farewell to Arms, when describing war scenes and romantic entanglements, he predominantly uses semantically clear and easily comprehensible words such as “cold,” “hard,” “pain,” and “love.” To strengthen emotional and situational effects, he also consciously repeats core vocabulary. For example, the image of “rain” appears multiple times in the novel. Rain not only creates a bleak and sorrowful atmosphere but also serves as a concrete projection of the protagonist’s inner melancholy and loss (Zhao 65). While this repetitive use of vocabulary strengthens textual coherence and emotional resonance, it also results in a lower number of distinct types.
A further comparison of the thematic differences between the two authors provides additional context for the divergence in MATTR values. Fitzgerald focused on upper-class social life in the Jazz Age, a theme that inherently involves multiple domains such as fashion, architecture, and social etiquette and therefore requires a diverse vocabulary to support descriptions of varied settings. In contrast, Hemingway’s works primarily center on themes such as war, hunting, and bullfighting, with vocabulary leaning more toward action description and realistic scene depiction, resulting in a relatively narrower lexical scope. This relationship between thematic choice and lexical usage further widens the gap in lexical diversity between their works.
(ii) Lexical Density
Most commonly, lexical density denotes the ratio of content words (nouns, verbs, adjectives, and often adverbs as well) to the total number of words in a given context (Johansson 65). Therefore, this study adopts Johansson’s method for calculating lexical density. The formula is as follows:
Lexical Density = Number of Content Words ÷ Total Number of Words × 100%
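The calculation can be sketched for tokens that already carry part-of-speech tags. The tagger itself is outside this sketch; the Universal POS tag names, the hand-tagged example, and the exclusion of auxiliaries and proper nouns from the content-word set are all assumptions made for illustration.

```python
# Assumed content-word classes, following the definition above.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}

def lexical_density(tagged_tokens):
    """Percentage of content words among (word, pos) pairs."""
    tagged = list(tagged_tokens)
    content = sum(1 for _, pos in tagged if pos in CONTENT_TAGS)
    return content / len(tagged) * 100

# Hand-tagged toy example (hypothetical tags).
tagged = [
    ("He", "PRON"), ("was", "AUX"), ("an", "DET"), ("old", "ADJ"),
    ("man", "NOUN"), ("who", "PRON"), ("fished", "VERB"), ("alone", "ADV"),
]
print(lexical_density(tagged))  # 50.0

# The same ratio for the first HC row of Table 5, expressed as a proportion.
print(round(3849 / 9202, 2))  # 0.42
```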
The lexical density of all 62 samples was calculated, and the results are presented in the table below:
Table 5 Lexical Density of Individual Samples in HC and FC
| Sample (HC) | Content Words | Total Words | Lexical Density | Sample (FC) | Content Words | Total Words | Lexical Density |
|---|---|---|---|---|---|---|---|
| 1 | 3849 | 9202 | 0.42 | 1 | 3320 | 8801 | 0.38 |
| 2 | 3863 | 9351 | 0.41 | 2 | 4680 | 10148 | 0.46 |
| 3 | 2842 | 7978 | 0.36 | 3 | 4795 | 9852 | 0.49 |
| 4 | 4032 | 9589 | 0.42 | 4 | 3485 | 8910 | 0.39 |
| 5 | 3273 | 8542 | 0.38 | 5 | 5278 | 10853 | 0.49 |
| 6 | 5567 | 10813 | 0.51 | 6 | 3478 | 8993 | 0.39 |
| 7 | 4465 | 9868 | 0.45 | 7 | 6077 | 11787 | 0.52 |
| 8 | 5690 | 11129 | 0.51 | 8 | 5994 | 11717 | 0.51 |
| 9 | 5823 | 11179 | 0.52 | 9 | 4567 | 10280 | 0.44 |
| 10 | 4078 | 9351 | 0.44 | 10 | 5161 | 10983 | 0.47 |
| 11 | 3235 | 8305 | 0.39 | 11 | 3510 | 9346 | 0.38 |
| 12 | 4214 | 9574 | 0.44 | 12 | 4594 | 9765 | 0.47 |
| 13 | 3069 | 8112 | 0.38 | 13 | 3579 | 9347 | 0.38 |
| 14 | 5311 | 10740 | 0.49 | 14 | 3241 | 9187 | 0.35 |
| 15 | 3083 | 8208 | 0.38 | 15 | 3313 | 8950 | 0.37 |
| 16 | 2749 | 8281 | 0.33 | 16 | 3413 | 9353 | 0.36 |
| 17 | 4077 | 9057 | 0.45 | 17 | 3503 | 8726 | 0.40 |
| 18 | 2532 | 7839 | 0.32 | 18 | 3345 | 8942 | 0.37 |
| 19 | 4134 | 9252 | 0.45 | 19 | 4695 | 10267 | 0.46 |
| 20 | 3943 | 9152 | 0.43 | 20 | 3543 | 8946 | 0.40 |
| 21 | 4332 | 9127 | 0.47 | 21 | 3432 | 8909 | 0.39 |
| 22 | 3862 | 8841 | 0.44 | 22 | 3434 | 8442 | 0.41 |
| 23 | 4188 | 9180 | 0.46 | 23 | 3487 | 8888 | 0.39 |
| 24 | 3862 | 9044 | 0.43 | 24 | 3345 | 8682 | 0.39 |
| 25 | 2947 | 8407 | 0.35 | 25 | 3467 | 8741 | 0.40 |
| 26 | 4136 | 9036 | 0.46 | 26 | 4553 | 9900 | 0.46 |
| 27 | 4339 | 9466 | 0.46 | 27 | 5029 | 10420 | 0.48 |
| 28 | 4252 | 9453 | 0.45 | 28 | 3382 | 8533 | 0.40 |
| 29 | 4166 | 9427 | 0.44 | 29 | 5417 | 10859 | 0.50 |
| 30 | 4216 | 9298 | 0.45 | 30 | 5061 | 10388 | 0.49 |
| 31 | 3214 | 8226 | 0.39 | 31 | 4552 | 10123 | 0.45 |
Because the data from HC were normally distributed whereas those from FC were not, a non-parametric test was employed to examine whether there were significant differences between the two groups. The results are presented as follows:
Table 6 Results of the Non-Parametric Test
| Corpus | Median | Standard Deviation | P-value |
|---|---|---|---|
| HC | 0.440 | 0.050 | 0.949 |
| FC | 0.400 | 0.051 | |
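The study does not name the non-parametric procedure it used. A common choice for two independent samples is the Mann-Whitney U test, and the sketch below shows how such a test works under that assumption. It uses a normal approximation without tie or continuity correction, so its p-values are approximate; the inputs are toy values, not the study's data.

```python
import math
from statistics import NormalDist

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U, approximate two-sided p) for two independent samples."""
    n1, n2 = len(a), len(b)
    ranks = average_ranks(list(a) + list(b))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)  # no tie correction
    z = (u - mu) / sigma                              # z <= 0 because u is the smaller U
    return u, min(2 * NormalDist().cdf(z), 1.0)

# Toy samples: completely separated groups give U = 0.
u, p = mann_whitney_u([1, 2, 3], [4, 5, 6])
print(u, round(p, 3))
```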
Based on the test results (p = 0.949 > 0.05), there is no statistically significant difference in lexical density between the works of Hemingway and Fitzgerald. From a quantitative perspective, their works exhibit a high degree of similarity on this indicator. Ran Tian notes that although Fitzgerald favored poetic language and expression, he also valued concision in writing (63). This similarity can be explained from a literary perspective: both authors belonged to the “Lost Generation” of American modernist writers in the twentieth century. Their shared creative background, target audience, and overlap in thematic concerns and narrative demands contributed to a convergence in the proportion of content words they used. Despite the difference between Hemingway’s “Iceberg Principle” and Fitzgerald’s lyrical style, these stylistic divergences are not reflected in the quantitative measure of lexical density. It is important to clarify that a statistically non-significant difference does not imply identity; it merely indicates that the observed difference between the two groups did not reach the threshold of statistical significance. Therefore, further exploration of their stylistic distinctions should turn to other metrics, such as syntactic complexity.
3.2 Comparative Features of the HC and FC at Syntactic Level
3.2.1 Sentence Length and Sentence-Length Distribution
Mean sentence length (MLS) is a pivotal quantitative metric for measuring the linguistic style of novels at the sentence level, and it has been widely applied in analyses of authors’ stylistic features in literary works. As an indirect indicator of an author’s habits of syntactic segmentation, this metric can reveal a writer’s preference for shorter or longer sentences and thereby clarify whether a novel’s overall style tends toward concision or complexity. Research on sentence length dates back to the late nineteenth century. G. Udny Yule, using the English works of Bacon, Coleridge, and Lamb as analytical corpora, demonstrated that distribution-free sentence-length statistics could serve as an effective indicator for distinguishing authorial style. Beyond stylistic identification, sentence length has also been used to assess text readability. Sentences are typically delimited by periods, question marks, and exclamation marks, and sentence count is positively associated with syntactic structure (Zhu and Li 79). Therefore, in this study, code is used to segment the text into sentences by identifying sentence boundaries marked by periods, question marks, and exclamation marks. The spaCy library is employed for sentence segmentation because it effectively handles boundary-related issues such as abbreviations (e.g., “Mr.” and “U.S.A.”) and non-standard punctuation, which are not mistakenly treated as sentence endings.
Table 7 Sentence-Length Information for HC and FC
| Measure | HC | FC |
|---|---|---|
| Total Word Number | 395826 | 420759 |
| Sentence Number | 27665 | 22184 |
| MLS | 14.31 | 18.97 |
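The MLS values follow directly from the two counts above them, as the short check below shows. The function name is hypothetical; the inputs are the totals reported in Table 7.

```python
def mean_sentence_length(total_words, sentence_count):
    """Mean length of sentence: total words divided by number of sentences."""
    return total_words / sentence_count

hc_mls = mean_sentence_length(395826, 27665)
fc_mls = mean_sentence_length(420759, 22184)

print(round(hc_mls, 2), round(fc_mls, 2))  # 14.31 18.97
```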
Because neither dataset conformed to a normal distribution, a non-parametric test was employed to examine whether there were significant differences in sentence length. The results are presented in Table 8.
Table 8 Results of the Non-Parametric Test
| Corpus | Median | Standard Deviation | P-value |
|---|---|---|---|
| HC | 14.050 | 2.087 | <0.001 |
| FC | 19.320 | 1.841 | |
Based on the test results (p < 0.001), there is a statistically significant difference between the two authors in sentence usage, with sentence length in Fitzgerald’s works being notably greater than that in Hemingway’s.
Hemingway’s and Fitzgerald’s choices of sentence length have shaped distinctly different literary styles. Hemingway’s MLS is only 14.31 words, and concise, straightforward short sentences pervade his texts. Walu Liu notes that the sentences in Hemingway’s novels are succinct and never deliberately exaggerated (26). This stylistic trait finds clear expression in the following excerpt from For Whom the Bell Tolls:
The old man’s doing very well. He’s in quite a place up there. He hated to shoot that sentry. So did I but I didn’t think about it. Nor do I think about it now. You have to do that. But then Anselmo got a cripple. I know about cripples. I think that killing a man with an automatic weapon makes it easier. I mean on the one doing it. (437–438)
Such sentences are structurally simple and free of redundant modification. Fei Wu argues that in The Old Man and the Sea Hemingway placed particular emphasis on short sentences, whose style is closer to spoken language and which usually contain no more than ten words (47). This tendency accords with his creative concern with themes such as war and adventure, while also enhancing the authenticity and impact of the narration.
Fitzgerald’s MLS reaches 18.97 words, significantly higher than Hemingway’s. His texts are dominated by medium and long sentences, and he excels at creating atmospheres and depicting delicate emotions through complex syntactic structures and detailed descriptions. An example from The Great Gatsby illustrates this point:
The groups change more swiftly, swell with new arrivals, dissolve and form in the same breath; already there are wanderers, confident girls who weave here and there among the stouter and more stable, become for a sharp, joyous moment the center of a group, and then, excited with triumph, glide on through the sea-change of faces and voices and color under the constantly changing light (40).
Figure 3 Sentence-Length Distributions of HC and FC
Figure 3 presents the sentence-length distribution curves for HC and FC, offering quantitative insight into the syntactic differences between the two authors. In HC, the distribution has a sharp and prominent peak concentrated in the five-to-six-word range, and the frequency declines rapidly as sentence length increases, with sentences exceeding 30 words accounting for less than 1% of HC. In contrast, FC exhibits a broader and lower peak shifted slightly to the six-to-seven-word range; the frequency decreases more gradually with length, and sentences of more than 25 words appear twice as often in FC as in HC, which aligns with Fitzgerald’s tendency toward more elaborate and information-dense syntactic structures.
Christopher Butler classifies sentences into three categories based on length: short sentences (containing 1–9 words), medium sentences (10–24 words), and long sentences (25 words or more) (16). Following Butler’s classification, the sentences in Hemingway’s and Fitzgerald’s works are further categorized, with the results presented in Figure 4.
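Butler's three bands can be applied to a list of sentence lengths as sketched below. The function names and the toy list of lengths are hypothetical; only the thresholds come from the classification described above.

```python
from collections import Counter

def butler_category(word_count):
    """Classify a sentence by length: short (1-9), medium (10-24), long (25+)."""
    if word_count < 1:
        raise ValueError("a sentence must contain at least one word")
    if word_count <= 9:
        return "short"
    if word_count <= 24:
        return "medium"
    return "long"

def category_shares(sentence_lengths):
    """Percentage of sentences falling into each of the three bands."""
    counts = Counter(butler_category(n) for n in sentence_lengths)
    total = sum(counts.values())
    return {label: counts[label] / total * 100 for label in ("short", "medium", "long")}

# Toy sentence lengths in words (hypothetical).
lengths = [5, 6, 9, 10, 24, 25, 40, 3]
print(category_shares(lengths))  # {'short': 50.0, 'medium': 25.0, 'long': 25.0}
```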
Figure 4 Different Sentence Types in HC and FC
As can be seen from Figure 4, there are significant differences in the writing styles of Hemingway and Fitzgerald. Hemingway predominantly uses short sentences: short sentences account for 61.25%, medium sentences for 31.52%, and long sentences for only 7.23%. This pattern indicates his marked preference for concise expression through short sentences, producing a brisk and straightforward linguistic rhythm, few complex syntactic structures, and relatively easy readability. Such a style aligns with his concise and restrained writing characteristics and enables efficient transmission of information and emotion through simple language. In contrast, Fitzgerald’s sentence-length distribution is more balanced: short sentences account for 43.7%, medium sentences for 38.19%, and long sentences for 18.1%—more than twice the proportion found in Hemingway’s works. This demonstrates that Fitzgerald more frequently employs medium and long sentences, resulting in a more leisurely linguistic rhythm. He tends to depict scenes and elaborate details through medium and long sentences, thereby producing more delicate and dense expression capable of carrying richer content and more complex emotional layers.
3.2.2 Syntactic Complexity
Syntactic complexity refers to the diversity and complexity of syntactic structures in language production (Ortega 492). Quantitative stylistic studies of literary works have tended to focus on mean sentence length or subordinating conjunctions and have paid insufficient attention to other indicators of syntactic complexity. The L2 Syntactic Complexity Analyzer (L2SCA) was designed to address data analysis issues in research on the syntactic complexity of L2 writing, enabling researchers to study syntactic complexity in depth (Lu 491–492). This study therefore adopts the 14 syntactic complexity indices proposed by Lu, which fall into five categories: length of production unit, sentence complexity, subordination, coordination, and particular structures. Two of the underlying units require definition. A T-unit is a main clause together with any subordinate clauses attached to it (Hunt 198–199). A complex T-unit is a T-unit that “contains at least one subordinate clause or embedded clause” (Casanave 186).
(i) Length of Production Unit
Unit length includes mean length of clause (MLC), mean length of sentence (MLS), and mean length of T-unit (MLT). Since MLS has been discussed in Section 3.2.1, only MLC and MLT are addressed in this section.
Because neither dataset conforms to a normal distribution, a non-parametric test was adopted to examine whether the differences were statistically significant. The results are presented in Table 9.
Table 9 Results of the Non-Parametric Test
| Index | Corpus | Median | Standard Deviation | P-value |
|---|---|---|---|---|
| MLT | HC | 9.580 | 0.918 | <0.001 |
| MLT | FC | 11.760 | 0.720 | |
| MLC | HC | 8.260 | 0.675 | <0.001 |
| MLC | FC | 9.830 | 0.511 | |
Based on the test results for MLT and MLC (p < 0.001), it can be concluded that there are statistically significant differences between the two writers in their use of MLT and MLC, with Fitzgerald’s MLT and MLC being significantly greater than Hemingway’s.
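The text does not name the specific non-parametric test used. As one common choice for comparing two independent samples, a Mann-Whitney U test can be sketched as follows. This is an illustrative assumption rather than the procedure run in SPSS; the sketch uses a normal approximation without tie or continuity correction, and all names are hypothetical.

```python
import math

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U, two-sided p) using the normal approximation (assumption)."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    u = min(u1, u2)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    # Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return u, p
```

In practice, each list would hold one index value (for example, MLT) per text sample from a corpus. For small samples or many ties, an exact or tie-corrected procedure from a statistics package would be preferable to this approximation.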
In Fitzgerald’s syntactic construction, T-units often take the main clause as a skeletal framework, embedding multiple layers of subordinate clauses, modifiers, and detailed descriptions to form lengthy yet information-dense expressions. This structure is exemplified in the following passage from This Side of Paradise:
From his fourth to his tenth year he did the country with his mother in her father’s private car, from Coronado, where his mother became so bored that she had a nervous breakdown in a fashionable hotel, down to Mexico City, where she took a mild, almost epidemic consumption. (6)
Centered on the spatial movement depicted in the main clause, the sentence supplements key experiences along the journey through two locative relative clauses, with a causal adverbial clause nested within one of the relative clauses, resulting in the superposition of multiple modifiers. This syntactic design achieves a high unity between textual information density and narrative coherence, fully embodying Fitzgerald’s writing characteristic of utilizing syntactic complexity to serve scene construction and emotional conveyance.
Hemingway, however, differs significantly. He strictly restricts the expressive focus to the most critical actions and core facts, forming concise and succinct linguistic units. This creates a sharp contrast between the two writers in terms of syntactic complexity and information presentation methods.
At the level of MLC, their differences further highlight a divergence in their creative philosophies. Fitzgerald’s clauses are notably longer, primarily because he habitually embeds abundant modifying elements and participial structures within clauses, making the clauses themselves the core carrier of semantic meaning. Superimposing modifying components raises internal information density and naturally extends clause length. In contrast, Hemingway’s clauses adhere to a function-first principle and have a significantly shorter MLC. He retains only core functional components such as the subject and finite predicate and compresses modifying elements to the extreme, completing semantic transmission through the most concise syntactic structure. This conciseness is not a lack of expression but a linguistic embodiment of his “Iceberg Theory”: emotional connotations and contextual backgrounds beyond the core information are deliberately obscured, leaving readers to perceive and fill them in through textual gaps.
(ii) Coordination
Coordinate structure is one of the core linguistic devices with a very wide range of applications in the grammatical expression systems of human languages (Chen and Wei 56). Dexi Zhu proposed that conjunctions and adverbs can serve the function of connecting coordinate constituents (156). In this study, the indicators of coordinate structure are CP_C, CP_T, and T_S.
CP_C (coordinate phrases per clause) refers to the number of coordinate phrases in each clause; CP_T (coordinate phrases per T-unit) denotes the number of coordinate phrases in each T-unit; T_S (T-units per sentence) represents the number of T-units in each sentence (Lu 478).
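Each of these indices is a simple ratio of two unit counts. The following minimal Python sketch makes the arithmetic explicit; the function and its parameter names are hypothetical and do not reproduce L2SCA's own output format, and the counts in the usage note are invented for illustration.

```python
def coordination_indices(sentences, t_units, clauses, coordinate_phrases):
    """Compute CP_C, CP_T, and T_S from raw unit counts for one text sample."""
    if min(sentences, t_units, clauses) <= 0:
        raise ValueError("sentence, T-unit, and clause counts must be positive")
    return {
        "CP_C": coordinate_phrases / clauses,   # coordinate phrases per clause
        "CP_T": coordinate_phrases / t_units,   # coordinate phrases per T-unit
        "T_S": t_units / sentences,             # T-units per sentence
    }
```

For a hypothetical sample with 100 sentences, 150 T-units, 200 clauses, and 120 coordinate phrases, the function returns CP_C = 0.6, CP_T = 0.8, and T_S = 1.5. The remaining ratio indices used in this section follow the same pattern with different numerators and denominators.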
Because HC does not conform to a normal distribution at the levels of CP_C and CP_T whereas FC does, and because HC conforms to a normal distribution at the level of T_S whereas FC does not, non-parametric tests were adopted to determine whether the differences were statistically significant. The results are shown in Table 10.
Table 10 Results of the Non-Parametric Test
| Index | Corpus | Median | Standard Deviation | P-value |
|---|---|---|---|---|
| CP_C | HC | 0.600 | 0.024 | <0.001 |
| CP_C | FC | 0.650 | 0.027 | |
| CP_T | HC | 0.700 | 0.031 | <0.001 |
| CP_T | FC | 0.780 | 0.040 | |
| T_S | HC | 1.470 | 0.069 | <0.001 |
| T_S | FC | 1.640 | 0.061 | |
The non-parametric tests indicate that there are significant differences between the two writers in the use of CP_C, CP_T, and T_S (p < 0.001), with Fitzgerald’s levels on all three indicators being significantly higher than Hemingway’s.
The significantly higher CP_C value in Fitzgerald’s works indicates that he is more inclined to adopt coordinate phrase structures to expand and enrich the internal components of clauses, thereby creating a delicate sense of imagery, flowing images, and layered emotional atmosphere. John M. Norris and Lourdes Ortega claimed that the developmental sequence of learners’ syntactic complexity proceeds from coordinate structures to subordinate structures, and ultimately to phrasal structures (563). Xuelan Li and Huiping Zhang pointed out that with the improvement of English proficiency, third-language beginners gradually reduce their use of coordinate clauses; as their exposure to language input continues to increase, they will gradually attempt to use more complex syntactic structures, resulting in a decreased use of coordinate structures (116). It can thus be concluded that although Fitzgerald’s works contain a relatively high frequency of coordinate phrases, non-native English speakers will not find it strenuous to read his works.
The high CP_T value further indicates that the parallel phrase structure constitutes a pervasive syntactic habit across the vast majority of sentential units, thereby fostering a dense, rich, and highly consistent textual texture in which language itself becomes the primary vehicle for conveying illusion and loss. In contrast, Hemingway’s low CP_C and CP_T values align with his “Iceberg Theory” and telegraphic style, as he deliberately pares down syntactic embellishment to make each noun and verb carry maximum semantic weight, resulting in sentences that are terse and sharply sculpted.
Fitzgerald’s T_S is significantly higher than that of Hemingway, indicating that he integrates a greater number of T-unit structures within individual sentences. This elevated T_S value not only quantitatively reflects the stronger structural extensibility of his sentences but also helps explain their longer average sentence length. In contrast to Hemingway’s characteristic use of shorter sentences with fewer T-units to achieve concise parataxis, Fitzgerald tends to employ extended sentences through the layering and nesting of multiple T-units. This syntactic strategy enables him to construct nuanced descriptive levels, convey complex psychological states, and maintain the distinctive flowing rhythm of his prose. Thus, a higher T_S value serves as a key syntactic means through which Fitzgerald achieves his intended aesthetic effects.
(iii) Subordination
A subordinate structure refers to the combination of two semantically related linguistic units at different hierarchical levels, characterized syntactically by the embedding of one structure within another as an integral component of the latter (Zhang 661). The main indicators of subordination are C_T (clauses per T-unit), CT_T (complex T-units per T-unit), DC_C (dependent clauses per clause), and DC_T (dependent clauses per T-unit) (Lu 478).
Because FC does not conform to a normal distribution at the levels of DC_C and DC_T whereas HC does, and because neither C_T nor CT_T conforms to a normal distribution, non-parametric tests were adopted to determine whether the differences were statistically significant. The results are shown in Table 11.
Table 11 Results of the Non-Parametric Test
| Index | Corpus | Median | Standard Deviation | P-value |
|---|---|---|---|---|
| C_T | HC | 0.480 | 0.046 | <0.001 |
| C_T | FC | 0.590 | 0.035 | |
| CT_T | HC | 0.333 | 0.0002 | 0.365 |
| CT_T | FC | 0.333 | 0.0002 | |
| DC_C | HC | 0.410 | 0.339 | <0.001 |
| DC_C | FC | 0.490 | 0.025 | |
| DC_T | HC | 1.160 | 0.159 | <0.001 |
| DC_T | FC | 1.200 | 0.129 | |
As indicated by the data in Table 11, statistically significant differences were observed in C_T, DC_C, and DC_T (p < 0.001), with Fitzgerald’s medians on these three indices consistently higher than Hemingway’s.
The relatively higher DC_C and DC_T values directly demonstrate Fitzgerald’s intensive utilization of hypotactic structures, such as adverbial clauses and attributive clauses. Through the nesting and interweaving of multiple dependent clauses, his sentences closely integrate semantic relations including temporality, causality, modification, and psychological activities into the framework of main clauses, thereby constructing a narrative network characterized by intricate logical hierarchies and highly integrated information.
The higher C_T value further reinforces this stylistic trait, indicating that Fitzgerald was adept at incorporating multiple layers of structural elaboration under a single main clause to achieve the progressive accumulation and superposition of semantic meaning. In comparison, Hemingway stripped away redundant details to restore the inherent conciseness of literary expression.
No statistically significant difference was found between the two writers in the employment of CT_T, with both recording an identical median of 0.333. This result suggests that neither writer tended to superimpose complex T-units in their writing. A study by Tan and Bi revealed that the mean CT_T values of English translations from Chinese and original English texts were 0.394 and 0.545, respectively (18). It can thus be inferred that the works of Hemingway and Fitzgerald exhibit a simpler usage of complex T-units per T-unit compared to these two text categories.
(iv) Particular Structures
According to the definition proposed by Lu, particular structures are operationalized through three syntactic indices, namely CN_C, CN_T, and VP_T. CN_C denotes the number of complex nominal phrases per clause; CN_T refers to the number of complex nominal phrases per T-unit; VP_T represents the number of verbal phrases per T-unit (478).
Because FC does not conform to a normal distribution at the levels of CN_C, CN_T, and VP_T whereas HC does, non-parametric tests were adopted to determine whether the differences were statistically significant. The results are shown in Table 12.
Table 12 Results of the Non-Parametric Test
| Index | Corpus | Median | Standard Deviation | P-value |
|---|---|---|---|---|
| CN_C | HC | 0.830 | 0.068 | <0.001 |
| CN_C | FC | 0.980 | 0.508 | |
| CN_T | HC | 0.960 | 0.092 | <0.001 |
| CN_T | FC | 1.180 | 0.072 | |
| VP_T | HC | 1.910 | 0.183 | <0.001 |
| VP_T | FC | 2.350 | 0.145 | |
As indicated by the data in Table 12, highly significant differences were observed between HC and FC in CN_C, CN_T, and VP_T (p < 0.001). These indices therefore serve as effective discriminators of the authors’ syntactic features, with FC consistently showing higher medians than HC on all three measures. Specifically, Fitzgerald’s higher CN_T value, which corresponds to a greater number of complex nominal phrases per T-unit, indicates that the T-units in his literary works contain a denser distribution of modifying elements.
This finding indicates Fitzgerald’s stylistic preference for the frequent use of complex nominal phrases, which implies that his texts are replete with concrete and abstract entities, conceptual constructs, and modifying details. This linguistic trait is closely intertwined with his literary effort to construct a worldview characterized by material affluence, vivid sensory perception, and overlapping symbolic layers. In addition, Fitzgerald also made more extensive use of verbal phrases. This not only enriched the predicative structures of his sentences but also, through the coordination of multiple verbal phrases, enabled him to construct a narrative rhythm that is more dynamic, hierarchically nuanced, and lyrical in tone. The rich visual and auditory narratives embedded in Fitzgerald’s works allow readers to immerse themselves deeply in the social landscape of 1920s America (R. Wu 111). However, although Hemingway employed far fewer nominal and verbal phrases, this by no means suggests a deficiency in linguistic expressiveness or a reduction in the artistic merit of his works. Instead, Hemingway was more inclined to invite readers to perceive the deeper implications underlying his stories, and he opposed the excessive accumulation of ornate rhetoric. As a result, his texts, while succinct in form, are by no means lacking in profound connotations. In a comparative study of syntactic complexity in thematically identical writings by Asian EFL learners and native English speakers, Sheng Yan pointed out that native speakers achieved mean values of 1.081, 2.302, and 3.088 for CN_C, CN_T, and VP_T, respectively, whereas the learners recorded lower medians of 0.980, 1.708, and 2.415 for the same three indices (49). 
This suggests that, for both native English speakers and non-native learners, these three syntactic indices do not pose a substantial barrier to the comprehension of either Hemingway’s or Fitzgerald’s works, as both groups are sufficiently proficient to master these structures.
(v) Sentence Complexity
The sentence complexity index C_S denotes the number of clauses per sentence (Lu 478).
Because the HC dataset conformed to a normal distribution whereas the FC dataset did not, a non-parametric test was employed to examine whether the difference between the two groups was statistically significant. The results are presented in Table 13.
Table 13 Results of the Non-Parametric Test
| Corpus | Median | Standard Deviation | P-value |
|---|---|---|---|
| HC | 1.700 | 0.105 | <0.001 |
| FC | 1.970 | 0.093 | |
The data reveal a statistically significant difference between HC and FC in terms of the sentence complexity index C_S, with FC yielding a notably higher median than HC. Specifically, Hemingway’s works are predominantly characterized by short sentences with a highly streamlined number of clauses. In contrast, Fitzgerald exhibited a distinct stylistic preference for the employment of subordinate clauses. On average, each sentence in Fitzgerald’s writings contains nearly two clauses, whereas the corresponding figure for Hemingway’s works stands at merely 1.7 clauses. This finding is consistent with the results presented in the preceding sections of this study, further confirming that Fitzgerald favored the use of multi-clause constructions, which consequently contributed to the formation of longer sentence structures in his literary output.
3.3 Comparative Features of the HC and FC at the Textual Level
The previous two sections have explored the writing styles and stylistic features of the two authors’ works at the lexical and syntactic levels. This section investigates their distinct choices in textual organization from the textual perspective. By focusing on paragraph length as an objective quantitative indicator and drawing on empirical corpus data, it further clarifies the core differences between Hemingway and Fitzgerald in terms of paragraph construction, information density, and narrative rhythm.
3.3.1 Paragraph Length
As an intermediate unit of textual meaning transmission, the paragraph is not only a natural aggregation of lexical and syntactic features but also a direct externalization of an author’s creative thinking and narrative intention. In essence, it extends the author’s core creative philosophy at the level of textual structure. A paragraph can be regarded as a condensed version of a complete text because it satisfies the structural and content criteria of a full-length article (X. Liu 113). In this section, the statistical results for the average numbers of sentences and words per paragraph are first presented for each individual work and for the overall output of the two authors, as follows:
Table 14 Mean Paragraph Lengths of HC and Individual Works
| HC | MPL (by sentences) | MPL (by words) |
|---|---|---|
| A Farewell to Arms | 2.53 | 22.29 |
| Across the River and Into the Trees | 2.08 | 23.47 |
| For Whom the Bell Tolls | 2.64 | 30.83 |
| The Old Man and the Sea | 3.01 | 43.01 |
| Winner Take Nothing | 2.26 | 24.37 |
| Mean | 2.47 | 26.89 |
Table 15 Mean Paragraph Lengths of FC and Individual Works
| FC | MPL (by sentences) | MPL (by words) |
|---|---|---|
| Tender Is the Night | 50.49 | 802.61 |
| This Side of Paradise | 1.82 | 26.5 |
| The Beautiful and Damned | 2.18 | 34.37 |
| The Great Gatsby | 1.97 | 29.45 |
| Flappers and Philosophers | 2.05 | 27.6 |
| Mean | 2.65 | 40.25 |
According to the statistical results, HC records an average of 2.47 sentences per paragraph, whereas FC reaches 2.65. In terms of the number of words per paragraph, HC averages only 26.89 words, in contrast to FC’s markedly higher mean of 40.25. Together, these two indicators suggest that Fitzgerald shows a stronger preference for longer paragraph structures, whereas Hemingway tends to adopt a more concise approach to paragraph segmentation.
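The two mean paragraph length (MPL) measures reported above can be computed along the following lines. This is a minimal sketch under stated assumptions: paragraphs are taken to be separated by blank lines, sentences are split naively on terminal punctuation, and words are whitespace-separated tokens. None of these choices is confirmed as the procedure used for HC and FC, and the function names are hypothetical.

```python
import re

def split_paragraphs(text):
    """Treat blank-line-separated blocks as paragraphs (assumption about file layout)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def count_sentences(paragraph):
    """Naive count: split after ., ! or ? followed by whitespace."""
    return len([s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s])

def mean_paragraph_length(text):
    """Return (mean sentences per paragraph, mean words per paragraph)."""
    paragraphs = split_paragraphs(text)
    n = len(paragraphs)
    if n == 0:
        return 0.0, 0.0
    total_sentences = sum(count_sentences(p) for p in paragraphs)
    total_words = sum(len(p.split()) for p in paragraphs)
    return total_sentences / n, total_words / n
```

Because dialogue often ends sentences inside quotation marks, a production pipeline would likely need a more robust sentence tokenizer than the regular expression used here.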
3.3.2 Paragraph-Length Distribution
To further analyze the paragraph features of the two corpora, this section presents their respective paragraph-length distributions. These distributions are measured in two ways: by the number of sentences per paragraph (Table 16) and by the number of words per paragraph (Table 18).
Table 16 Paragraph-Length Distribution of HC and FC (by Sentences)
| PL (sentences) | HC Number | HC Percentage | FC Number | FC Percentage |
|---|---|---|---|---|
| 1 | 6473 | 42.27% | 5595 | 51.70% |
| 2 | 4282 | 27.96% | 2434 | 22.49% |
| 3 | 1977 | 12.91% | 1357 | 12.54% |
| 4 | 972 | 6.35% | 646 | 5.97% |
| 5 | 533 | 3.48% | 388 | 3.59% |
| 6 | 327 | 2.14% | 173 | 1.60% |
| 7 | 179 | 1.17% | 115 | 1.06% |
| 8 | 140 | 0.91% | 52 | 0.48% |
| 9 | 110 | 0.72% | 21 | 0.19% |
| 10 | 72 | 0.47% | 18 | 0.17% |
| 11 | 55 | 0.36% | 5 | 0.05% |
| 12 | 43 | 0.28% | 6 | 0.06% |
| 13 | 29 | 0.19% | 2 | 0.02% |
| 14 | 18 | 0.12% | 2 | 0.02% |
| 15 | 16 | 0.10% | 0 | 0.00% |
| 16 | 15 | 0.10% | 2 | 0.02% |
| 17 | 11 | 0.07% | 0 | 0.00% |
| 18 | 9 | 0.06% | 1 | 0.01% |
| 19 | 10 | 0.07% | 1 | 0.01% |
| 20 | 2 | 0.01% | 0 | 0.00% |
| 21 | 9 | 0.06% | 0 | 0.00% |
| 22 | 5 | 0.03% | 0 | 0.00% |
| 23 | 3 | 0.02% | 0 | 0.00% |
| 24 | 2 | 0.01% | 0 | 0.00% |
| 25 | 2 | 0.01% | 0 | 0.00% |
| 26 | 2 | 0.01% | 1 | 0.01% |
| 27 | 2 | 0.01% | 1 | 0.01% |
| 28 | 2 | 0.01% | 1 | 0.01% |
| 29 | 2 | 0.01% | 0 | 0.00% |
| 30 | 2 | 0.01% | 0 | 0.00% |
| 31 | 3 | 0.02% | 0 | 0.00% |
| 32 | 1 | 0.01% | 0 | 0.00% |
| 33 | 0 | 0.00% | 0 | 0.00% |
| 34 | 0 | 0.00% | 0 | 0.00% |
| 35 | 2 | 0.01% | 0 | 0.00% |
| 36 | 0 | 0.00% | 0 | 0.00% |
| 37 | 0 | 0.00% | 1 | 0.01% |
| 38 | 3 | 0.02% | 0 | 0.00% |
| 39 | 0 | 0.00% | 0 | 0.00% |
| 40 | 2 | 0.01% | 0 | 0.00% |
Figure 5 Paragraph-Length Distribution of HC and FC (by Sentences)
As indicated by the statistical data, the distribution of paragraph lengths in both HC and FC is predominantly characterized by short paragraphs, with the core concentration falling within the range of one to three sentences.
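A distribution of this kind can be tabulated from a list of per-paragraph sentence counts. The sketch below is illustrative only; the function names are hypothetical, and the sample lengths in the usage note are invented rather than drawn from HC or FC.

```python
from collections import Counter

def length_distribution(lengths, max_length):
    """Map each length 1..max_length to (count, percentage of all paragraphs)."""
    counts = Counter(lengths)
    total = len(lengths)
    if total == 0:
        return {k: (0, 0.0) for k in range(1, max_length + 1)}
    return {
        k: (counts[k], round(100 * counts[k] / total, 2))
        for k in range(1, max_length + 1)
    }

def cumulative_share(distribution, upto):
    """Sum the percentages for lengths 1..upto."""
    return round(sum(distribution[k][1] for k in range(1, upto + 1)), 2)
```

For the invented list `[1, 1, 1, 1, 2, 2, 3, 5, 5, 8]`, paragraphs of one to three sentences account for 70% of the total. The same cumulative calculation applied to Table 16 is what supports the observation that short paragraphs dominate both corpora.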
Table 17 Statistical Results of the Paired-Samples t-Test of Paragraph Length (by Sentences)
| Corpus | Mean Value | t Value | Sig. |
|---|---|---|---|
| HC | 0.024 | -0.03 | 0.998 |
| FC | 0.025 | | |
The paired-samples t-test yielded p = 0.998, indicating that there was no statistically significant difference between Fitzgerald’s and Hemingway’s works in the distribution of paragraph lengths measured by sentences. This finding suggests that sentence-based paragraph length is not an effective indicator for distinguishing the two authors’ literary styles. As Figure 5 shows, the paragraph-length distributions in both authors’ works follow the same declining trend, with the majority of paragraphs concentrated in the range of one to three sentences and accounting for more than 80% of all paragraphs in each corpus. When paragraph length exceeds three sentences, the distribution curve levels off and remains relatively low. This stylistic similarity can largely be attributed to the extensive use of dialogue in the works of both authors. An illustration of this trait appears in Fitzgerald’s This Side of Paradise, as seen in the following passage:
“You shouldn’t smoke, Amory,” she whispered. “Don’t you know that?”
He shook his head.
“Nobody cares.”
Myra hesitated.
“I care.” (13)
This conversation marks a pivotal juncture in Amory’s emotional connection with his new environment. Caught in the dislocation of his identity and social standing, Amory struggles to maintain his distinctive elite demeanor while simultaneously feeling alienated and awkward in the face of such unadorned adolescent social interactions. In contrast, Myra, an ordinary girl he encounters in this new setting, harbors the innocent infatuation typical of adolescence toward him—a narrative detail that lays crucial groundwork for Amory’s subsequent self-awareness and emotional maturation. Despite its brevity, this exchange enables readers to fully grasp the personalities and respective predicaments of the two teenagers.
Dialogue occupies a more substantial proportion in Hemingway’s literary works, and he is particularly adept at advancing the plot through deliberate use of dialogue (Gao 48). This characteristic is clearly illustrated in the following excerpt from Winner Take Nothing:
“Last week he tried to commit suicide,” one waiter said.
“Why?”
“He was in despair.”
“What about?”
“Nothing.”
“How do you know it was nothing?”
“He has plenty of money.” (17)
This concise dialogue between two waiters centers on an elderly patron in the café. It establishes the contradictory image of a wealthy man who nevertheless attempted to take his own life, thereby effectively propelling the development of the narrative plot.
Next, the distribution of paragraph length measured by word count is examined. Owing to space constraints, only the 60 most frequent paragraph lengths are selected for detailed analysis, as presented in Table 18.
Table 18 Paragraph-Length Distribution of HC and FC (by Words)
| PL (words) | HC Number | HC Percentage | FC Number | FC Percentage |
|---|---|---|---|---|
| 1 | 142 | 1.03% | 108 | 1.18% |
| 2 | 349 | 2.53% | 400 | 4.36% |
| 3 | 456 | 3.31% | 480 | 5.23% |
| 4 | 703 | 5.10% | 482 | 5.26% |
| 5 | 740 | 5.37% | 544 | 5.93% |
| 6 | 814 | 5.91% | 467 | 5.09% |
| 7 | 722 | 5.24% | 486 | 5.30% |
| 8 | 623 | 4.52% | 448 | 4.88% |
| 9 | 646 | 4.69% | 380 | 4.14% |
| 10 | 576 | 4.18% | 330 | 3.60% |
| 11 | 564 | 4.10% | 276 | 3.01% |
| 12 | 486 | 3.53% | 277 | 3.02% |
| 13 | 452 | 3.28% | 254 | 2.77% |
| 14 | 454 | 3.30% | 198 | 2.16% |
| 15 | 435 | 3.16% | 210 | 2.29% |
| 16 | 393 | 2.85% | 194 | 2.12% |
| 17 | 405 | 2.94% | 190 | 2.07% |
| 18 | 317 | 2.30% | 210 | 2.29% |
| 19 | 339 | 2.46% | 184 | 2.01% |
| 20 | 240 | 1.74% | 179 | 1.95% |
| 21 | 236 | 1.71% | 145 | 1.58% |
| 22 | 231 | 1.68% | 156 | 1.70% |
| 23 | 205 | 1.49% | 139 | 1.52% |
| 24 | 226 | 1.64% | 130 | 1.42% |
| 25 | 213 | 1.55% | 132 | 1.44% |
| 26 | 197 | 1.43% | 116 | 1.26% |
| 27 | 157 | 1.14% | 117 | 1.28% |
| 28 | 155 | 1.13% | 111 | 1.21% |
| 29 | 148 | 1.07% | 102 | 1.11% |
| 30 | 155 | 1.13% | 85 | 0.93% |
| 31 | 119 | 0.86% | 97 | 1.06% |
| 32 | 140 | 1.02% | 77 | 0.84% |
| 33 | 108 | 0.78% | 91 | 0.99% |
| 34 | 87 | 0.63% | 72 | 0.78% |
| 35 | 92 | 0.67% | 60 | 0.65% |
| 36 | 104 | 0.76% | 83 | 0.90% |
| 37 | 98 | 0.71% | 52 | 0.57% |
| 38 | 82 | 0.60% | 58 | 0.63% |
| 39 | 80 | 0.58% | 61 | 0.67% |
| 40 | 61 | 0.44% | 66 | 0.72% |
| 41 | 63 | 0.46% | 64 | 0.70% |
| 42 | 77 | 0.56% | 51 | 0.56% |
| 43 | 60 | 0.44% | 57 | 0.62% |
| 44 | 60 | 0.44% | 58 | 0.63% |
| 45 | 74 | 0.54% | 45 | 0.49% |
| 46 | 62 | 0.45% | 60 | 0.65% |
| 47 | 41 | 0.30% | 52 | 0.57% |
| 48 | 64 | 0.46% | 45 | 0.49% |
| 49 | 57 | 0.41% | 56 | 0.61% |
| 50 | 48 | 0.35% | 37 | 0.40% |
| 51 | 50 | 0.36% | 48 | 0.52% |
| 52 | 44 | 0.32% | 38 | 0.41% |
| 53 | 48 | 0.35% | 36 | 0.39% |
| 54 | 38 | 0.28% | 48 | 0.52% |
| 55 | 44 | 0.32% | 49 | 0.53% |
| 56 | 35 | 0.25% | 51 | 0.56% |
| 57 | 32 | 0.23% | 33 | 0.36% |
| 58 | 35 | 0.25% | 29 | 0.32% |
| 59 | 33 | 0.24% | 35 | 0.38% |
| 60 | 57 | 0.41% | 33 | 0.36% |
Figure 6 Paragraph-Length Distribution of the Two Authors (by Words)
As can be observed in Figure 6, the paragraph-length distributions of the two corpora measured by word count follow the same overall pattern: they rise steadily to a peak and then decline gradually.
Table 19 Statistical Results of the Paired-Samples t-Test of Paragraph Length (by Words)
| Corpus | Mean Value | t Value | Sig. |
|---|---|---|---|
| HC | 0.0167 | 0.000 | 1.000 |
| FC | 0.0167 | | |
The paired-samples t-test yielded p = 1.000, indicating that there was no statistically significant difference in the distribution of paragraph lengths measured by word count between Hemingway’s and Fitzgerald’s works. This finding further demonstrates that word-count-based paragraph length is not an effective indicator for distinguishing the two authors’ literary corpora. By examining Figure 6 and Table 18, we can see that the two authors share a similar overall trend in the distribution of word-count-based paragraph lengths, especially in the range of shorter paragraphs.
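The paired-samples t statistic underlying Tables 17 and 19 can be sketched as follows, assuming that each pair consists of the HC and FC proportions for the same paragraph-length bin. This pairing is an inference from the mean values reported in the tables, not a documented detail of the SPSS procedure. The function name is hypothetical, and the sketch returns only the statistic and degrees of freedom, since the p-value requires the t distribution.

```python
import math

def paired_t(a, b):
    """Return (t, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length, at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean_d = sum(diffs) / n
    # Sample variance of the paired differences (n - 1 in the denominator).
    var_d = sum((d - mean_d) ** 2 for d in diffs) / (n - 1)
    if var_d == 0:
        raise ValueError("all paired differences are identical")
    t = mean_d / math.sqrt(var_d / n)
    return t, n - 1
```

One property of this setup is worth keeping in mind when reading the results. If both samples are proportions over the same bins and each sums to the same total, the mean paired difference is zero by construction, so t is close to zero whatever the shapes of the two distributions. This is consistent with the t values of -0.03 and 0.000 reported in Tables 17 and 19.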
Conclusion
This study employs a mixed quantitative-qualitative method to compare the linguistic and stylistic features of Hemingway’s and Fitzgerald’s works at the lexical, syntactic, and textual levels.
At the lexical level, both authors predominantly use short words and words with relatively few syllables, and no significant differences are found in syllable-based word length or lexical density. However, a significant difference is observed in letter-based word length, and Fitzgerald’s works exhibit significantly higher lexical diversity, reflecting his preference for delicate and varied vocabulary in depicting the Jazz Age, whereas Hemingway’s adherence to the “Iceberg Theory” results in more concise and repetitive lexical use.
At the syntactic level, notable differences emerge: Fitzgerald’s works have a longer mean sentence length, higher syntactic complexity (including more coordinate structures, subordinate clauses, and complex phrases), and a stronger preference for medium and long sentences. In contrast, Hemingway favors concise short sentences with streamlined structures, which aligns with his telegraphic writing style.
At the textual level, both authors rely heavily on short paragraphs because of their extensive use of dialogue, and no significant difference is found in overall paragraph-length distribution.
In summary, the two authors’ distinct stylistic traits, Hemingway’s conciseness and Fitzgerald’s delicacy, are systematically verified through quantitative data, thereby enriching empirical research on modernist literary stylistics.
This study also has limitations. The scope of the corpus can be expanded further, and future research can incorporate discourse-level and semantic indicators, as well as more advanced technologies, to improve the comprehensiveness and accuracy of the analysis.
Works Cited
Azimov, Rustam B. “Comparative Analysis of Using Different Text Features, Models, and Methods in Text Author Recognition.” Cybernetics and Systems Analysis, vol. 60, no. 5, 2021, pp. 711–725.
Butler, Christopher. Statistics in Linguistics. Basil Blackwell, 1985.
Casanave, Christine Pearson. “Language Development in Students’ Journals.” Journal of Second Language Writing, vol. 3, no. 3, 1994, pp. 179–201.
Chen, Guoqing, and Wei Deming. “Types and Characteristics of Coordinate Structures in the Va Language.” Minority Languages of China, no. 6, 2025, pp. 56–62.
Chen, Heng, and Liu Haitao. “How to Measure Word Length in Spoken and Written Chinese.” Journal of Quantitative Linguistics, vol. 23, no. 1, 2016, pp. 5–29.
Chen, Lihua. “An Analysis of the Narrative Language in In Another Country.” Language Planning, no. 24, 2017, pp. 43–44.
Cheng, Fei. “On the Application of Symbolic Artistic Techniques in The Great Gatsby.” Language Planning, no. 5, 2016, pp. 57–58.
Cheng, Xilin. “Narratology of Verbal Imagery in Tender Is the Night.” Foreign Literature, no. 5, 2015, pp. 38–46, 157.
Covington, Michael A., and Joe D. McFall. “Cutting the Gordian Knot: The Moving-Average Type-Token Ratio (MATTR).” Journal of Quantitative Linguistics, vol. 17, no. 2, 2010, pp. 94–100.
Deli, Zsolt P. “The Lexical Analysis of Two Works by Ernest Hemingway and F. Scott Fitzgerald.” Porta Lingua, no. 1, 2021, pp. 199–209.
Donald, David E., and Hiram E. Essex. “Pressure Studies after Inactivation of the Major Portion of the Canine Right Ventricle.” American Journal of Physiology-Legacy Content, vol. 176, no. 1, 1953, pp. 155–161.
Fitzgerald, F. Scott. The Great Gatsby. Charles Scribner’s Sons, 1925.
———. This Side of Paradise. Alma Classics, 2012.
Gao, Wenmei. “Language Features of Hemingway’s The Old Man and the Sea.” Language Planning, no. 3, 2016, pp. 47–48.
Hameed, Mishal, and Hira Ali. “A Comparative Stylistic Analysis of Selected Short Stories by Yiyun Li and Zadie Smith.” Journal of Political Stability Archive, vol. 3, no. 4, 2025, pp. 556–569.
Hemingway, Ernest. For Whom the Bell Tolls. Scribner, 1995.
———. Winner Take Nothing. Charles Scribner’s Sons, 1933.
Hou, Renkui, Jiang Yang, and Minghu Jiang. “A Study on Chinese Quantitative Stylistic Features and Relation among Different Styles Based on Text Clustering.” Journal of Quantitative Linguistics, vol. 21, no. 3, 2014, pp. 246–280.
Hunt, Kellogg W. “Do Sentences in the Second Language Grow Like Those in the First?” TESOL Quarterly, vol. 4, no. 3, 1970, pp. 195–202.
Ihrmark, Daniel, and Johan Nilsson. “A Corpus Stylistic Analysis of Development in Hemingway’s Literary Production.” The Hemingway Review, vol. 40, no. 2, 2021, pp. 71–93.
Johansson, Victoria. “Lexical Diversity and Lexical Density in Speech and Writing: A Developmental Perspective.” Working Papers, Department of Linguistics and Phonetics, Lund University, vol. 53, 2008, pp. 61–79.
Johnson, Wendell. “Studies in Language Behavior: A Program of Research.” Psychological Monographs, vol. 56, no. 2, 1944, pp. 1–15.
Laufer, Batia, and Paul Nation. “Vocabulary Size and Use: Lexical Richness in L2 Written Production.” Applied Linguistics, vol. 16, no. 3, 1995, pp. 307–322.
Li, Xuelan, and Zhang Huiping. “Developmental Features of Syntactic Complexity in L3 English Written Production by Chinese Beginners.” Journal of PLA University of Foreign Languages, vol. 45, no. 3, 2022, pp. 111–119.
Li, Yuanyuan. “A Study on the Linguistic Features of Hemingway’s A Day’s Wait.” Language Planning, no. 12, 2008, pp. 25–26.
Liu, Haiyan, and Li Manhui. “The Changes of Fitzgerald’s Creative Ability—A Statistical Analysis Based on Lexical Measures.” Statistics & Information Forum, vol. 25, no. 10, 2010, pp. 108–112.
Liu, Walu. “On the Linguistic Style of The Old Man and the Sea from the Perspective of Cognitive Linguistics.” Language Planning, no. 27, 2017, pp. 26–27.
Liu, Xinzi. “Research on College English Writing Training Based on Paragraph Awareness Cultivation.” Journal of Teaching and Management, no. 27, 2016, pp. 113–115.
Lu, Xiaofei. “Automatic Analysis of Syntactic Complexity in Second Language Writing.” International Journal of Corpus Linguistics, vol. 15, no. 4, 2010, pp. 474–496.
Mandravickaite, Justina, and Tomas Krilavicius. “Quantitative Analysis of Textual Genres: Comparison of English and Lithuanian.” Proceedings of the International Conference on Information Technologies, 2018.
Mikros, George K. “Content Words in Authorship Attribution: An Evaluation of Stylometric Features in a Literary Corpus.” Studies in Quantitative Linguistics, vol. 5, 2009, pp. 61–75.
Muradian, Gaiane, and Zaruhi Antonyan. “Author and Text Interpretation: F. Scott Fitzgerald and the Great Gatsby.” Bulletin of Yerevan University B: Philology, vol. 16, no. 3 (48), 2025, pp. 136–142.
Norris, John M., and Lourdes Ortega. “Towards an Organic Approach to Investigating CAF in Instructed SLA: The Case of Complexity.” Applied Linguistics, vol. 30, no. 4, 2009, pp. 555–578.
Ortega, Lourdes. “Syntactic Complexity Measures and Their Relationship to L2 Proficiency: A Research Synthesis of College-Level L2 Writing.” Applied Linguistics, vol. 24, no. 4, 2003, pp. 492–518.
Popescu, Ioan-Iovitz, et al. “Word Length: Aspects and Languages.” Issues in Quantitative Linguistics, vol. 3, 2013, pp. 224–281.
Prasanwon, Pisut. “A Literary Stylistic Analysis of Ernest Hemingway’s Short Stories: What Might be Hiding Beneath Linguistics in ‘Big Two-Hearted River’ Sequels?” Journal of Studies in the Field of Humanities, vol. 23, no. 2, 2016, pp. 272–290.
Puspitasari, Devi Ambarwati, Hanif Fakhrurroja, and Adi Sutrisno. “Authorship Analysis in Electronic Texts Using Similarity Comparison Method.” Linguistik Indonesia, vol. 42, no. 1, 2024, pp. 91–112.
Ren, Donghai, Zhongbao Liu, and Bowen Zhang. “A Quantitative Analysis and Comparative Study of Literary Works’ Styles from the Perspective of Digital Humanities—Taking Jin Yong and Gu Long’s Novels as Examples.” Journal of Information Engineering, vol. 11, no. 4, 2025, pp. 50–64.
Savoy, Jacques. “Authorship Attribution: A Comparative Study of Three Text Corpora and Three Languages.” Journal of Quantitative Linguistics, vol. 19, no. 2, 2012, pp. 132–161.
Shen, Wei, and Wu Yurong. “A Quantitative Stylistic Study on the Vocabulary in Ba Jin and Mao Dun’s Novels.” Journal of Sichuan University of Arts and Science, vol. 31, no. 4, 2021, pp. 63–71.
Smith, Joseph A., and Coleen Kelly. “Stylistic Constancy and Change across Literary Corpora: Using Measures of Lexical Richness to Date Works.” Computers and the Humanities, vol. 36, no. 4, 2002, pp. 411–430.
Sun, Na. “The Narrative Art of The Great Gatsby.” Fiction Review, vol. S2, 2008, p. 63.
Tan, Hua, and Bi Yude. “A Study of Syntactic Complexity in English Translations of National Translation Projects.” Foreign Language Research, no. 6, 2024, pp. 15–22.
Tian, Ran. “An Interpretation of the Linguistic Features in Fitzgerald’s Novels.” Language Planning, no. 23, 2016, pp. 63–64.
Tu, Mengchun, and Liu Ying. “Quantitative Statistics and Analysis of Yu Hua and Mo Yan’s Full-Length Novels.” Journal of Chinese Information Processing, vol. 33, no. 2, 2019, pp. 131–142.
Tuldava, Juhan. “The Development of Statistical Stylistics (A Survey).” Journal of Quantitative Linguistics, vol. 11, nos. 1–2, 2004, pp. 141–151.
Wang, Jinquan, Yu Xiang, and Wu Wanneng. “Translation Quality Evaluation Based on Lexical Quantitative Features.” Chinese Translators Journal, vol. 42, no. 5, 2021, pp. 113–120.
Wen, Hua. “An Interpretation of In Another Country from the Perspective of Cognitive Linguistics.” Language Planning, no. 8, 2016, pp. 25–26.
Wu, Cong. “Narrative Analysis of The Old Man and the Sea under Defamiliarization Theory.” Language Planning, no. 30, 2017, pp. 43–44.
Wu, Fei. “A Linguistic Analysis of The Old Man and the Sea.” Language Planning, no. 36, 2016, pp. 46–47.
Wu, Rong. “A Study of the Narrative Style Interweaving Sense and Symbolism in Fitzgerald’s Novels.” Journal of Social Science of Jiamusi University, vol. 40, no. 5, 2022, pp. 110–113.
Yan, Sheng. “A Comparative Study of Syntactic Complexity in Timed Writing Between Asian EFL Learners and Native English Speakers.” Shandong Foreign Language Teaching Journal, vol. 43, no. 5, 2022, pp. 44–55.
Yasin, Sidra, and Hira Haroon Faizullah. “A Comparative Study of Cohesion and Coherence in Male and Female Authored Novels: Joyce’s A Portrait of the Artist as a Young Man and Woolf’s To the Lighthouse: A Corpus-Based Analysis.” Journal of Applied Linguistics and TESOL (JALT), vol. 7, no. 4, 2024, pp. 685–692.
Yule, G. Udny. “On Sentence-Length as a Statistical Characteristic of Style in Prose: With Application to Two Cases of Disputed Authorship.” Biometrika, vol. 30, nos. 3–4, 1939, pp. 363–390.
Zhang, Zhenbang. A New English Grammar. Shanghai Translation Publishing House, 1981.
Zhao, Chunxia. “Linguistic Features and Artistic Representation in A Farewell to Arms.” Language Planning, no. 32, 2016, pp. 65–66.
Zhu, Chaowei, and Li Runfeng. “A Corpus-Based Study of Ezra Pound’s Translating Style in Rendering Chinese Classics.” Foreign Language Education, vol. 44, no. 4, 2023, pp. 75–82.
Zhu, Dexi. Lectures on Grammar. The Commercial Press, 1982.
Zhu, Haoran, Lei Lei, and Hugh Craig. “Prose, Verse and Authorship in Dream of the Red Chamber: A Stylometric Analysis.” Journal of Quantitative Linguistics, vol. 28, no. 4, 2021, pp. 289–305.
The Authors
Wei Aiyun, female, Ph.D. and professor. Head of the Institute of Foreign Language and Literature, Guangxi Normal University; Visiting Scholar at the University of Cambridge.
She received her Doctor of Literature degree from Zhejiang University in 2020 under the supervision of Professor Liu Haitao, an internationally renowned and highly cited scholar in quantitative linguistics. Her primary research areas include applied linguistics (especially quantitative linguistics), English teaching, and the study of literature and translation from the perspective of quantitative linguistics. She has completed eight projects (including one funded by the National Social Science Foundation of China), and her publications include one translated work (published by The Commercial Press of China) and more than 20 academic papers in scholarly journals such as the Journal of Quantitative Linguistics.
Email: wendyaiyun@163.com
Luo Yinlin, female, a postgraduate student at the College of Foreign Studies, Guangxi Normal University. Her primary research field is foreign linguistics and applied linguistics. She has published one academic paper and participated in one research project. She is currently a teacher at Qinzhou Preschool Education College.
Email: 494357917@qq.com