Literary and Linguistic Computing

In this slightly modified version of my 2013 Roberto Busa Prize lecture, I look from the first four decades of digital humanities through its present toward a possible future. I find a means to construct this future by paying close attention to the enemy we need in order to grow: the fear that closed down the horizons of imaginative exploration during the years of the Cold War and that re-presents itself now clothed in numerous techno-scientific challenges to the human. 2014/08/23 - 19:58

Digital Humanities (DH) has come a long way towards establishing itself as a dynamic and innovative field of study. However, it has been pointed out that the DH community predominantly comprises scholars from a handful of mainly English-speaking countries, and a current challenge is achieving a broader internationalization of the DH community. This article provides an overview of the landscape in terms of geo-linguistic diversity, as well as reviewing current DH initiatives to broaden regional and linguistic diversity and identifies some of the main challenges ahead. The aim of this article is to serve as a benchmark of the current situation and suggest areas where further research is required. 2014/08/23 - 19:58

This article provides a brief description of Mapping the Catalogue of Ships, which maps the towns and contingents of Homer’s Catalogue of Ships, analyzing the poet’s knowledge and use of ancient Greek geography. We offer a brief account of the questions that drive our research, detail our novel method to analyze Homer’s poetry in terms of geospatial organization, and summarize the geospatial organizational principles that we have discovered. We discuss the necessity of a digital format to our research and the presentation of our argument, which requires simultaneous attention to literary, geographical, archival, and bibliographical material. The article also details the Neatline platform that allows us to achieve these goals. We end by outlining future directions for our research and user interface. 2014/08/23 - 19:58

This paper charts the origins, trajectory, development, challenges, and conclusion of Project Bamboo, a humanities cyberinfrastructure initiative funded by the Andrew W. Mellon Foundation between 2008 and 2012. Bamboo aimed to enhance arts and humanities research through the development of infrastructure and support for shared technology services. Its planning phase brought together scholars, librarians, and IT staff from a wide range of institutions, in order to gain insight into the scholarly practices Bamboo would support, and to build a community of future developers and users for Bamboo’s technical deliverables. From its inception, Bamboo struggled to define itself clearly and in a way that resonated with scholars, librarians, and IT staff alike. The early emphasis on a service-oriented architecture approach to supporting humanities research failed to connect with scholars, and the scope of Bamboo’s ambitions expanded to include scholarly networking, sharing ideas and solutions, and demonstrating how digital tools and methodologies can be applied to research questions. Funding constraints for Bamboo’s implementation phase led to the near-elimination of these community-oriented aspects of the project, but the lack of a shared vision that could supersede the individual interests of partner institutions resulted in a scope around which it was difficult to articulate a clear narrative. When Project Bamboo ended in 2012, it had failed to realize its most ambitious goals; this article explores the reasons for this, including technical approaches, communication difficulties, and challenges common to projects that bring together teams from different professional communities. 2014/08/23 - 19:58

When Robert Coover anointed Michael Joyce the ‘granddaddy’ of hypertext literature in a 1992 New York Times article, it could scarcely have been imagined that this pronouncement would come to define the origin of electronic literature. This short article examines the human and machinic operations obscuring Judy Malloy's Uncle Roger, a hypertext that predates afternoon. Malloy's reputation was stunted because Uncle Roger was algorithmically invisible, a factor that became increasingly important as the Web's commercial capacities matured. afternoon's endurance can be traced to its ISBN, which made afternoon easy for readers to find and united disparate stewards in preserving access to this work. Malloy's programming expertise and the goodwill among hypertext authors were insufficient to protect her against sexist exclusions that, in aggregate, fostered enduring disequilibria. While some male pioneers of hypertext are now full professors, Malloy and other early female hypertext pioneers are adjuncts or are otherwise at a remove from the academic power base. Ironically, Judy Malloy's papers—13,200 items, 15.6 linear feet—are collected at Duke University's Rubenstein Library, but Judy herself still seeks sustained academic employment. This gesture is read in the context of pursuing the digital humanities ‘for love’ in a higher education environment that's increasingly neoliberal in its financial allegiances. 2014/08/23 - 19:58

Integrating data from different sources represents a tremendous research opportunity across the humanities, social, and natural sciences. However, repurposing data for uses not imagined or anticipated by their creators involves conceptual, methodological, and theoretical challenges. These are acute in archaeology, a discipline that straddles the humanities and sciences. Heritage protection laws shape archaeological practice and generate large bodies of data, largely untapped for research or other purposes. The Digital Index of North American Archaeology (DINAA) project adapts heritage management data sets for broader open and public uses. DINAA’s initial goal is to integrate government-curated public data from off-line and online digital repositories in up to twenty US states; these data qualitatively and quantitatively describe over 500,000 archaeological sites in eastern North America. DINAA hopes to promote extension and reuse by government personnel, as well as by domestic and international researchers interested in the cultures, histories, artifacts, and behaviors described within these public data sets. DINAA innovatively applies methodologies and workflows typical of many ‘open science’ and digital humanities programs to these data sets. The distributed nature of data production, coupled with protections for sensitive data, adds layers of complexity. Ethically negotiating these issues can widen the collaboration between stakeholder communities and offer an unprecedented new view on human use of the North American landscape across vast regions and time scales. 2014/08/23 - 19:58

This article presents computational techniques for analyzing soundplay in a corpus and applies them to a corpus of Biblical Hebrew poetry, namely, the Book of Psalms. Evidence is presented to show that there is soundplay in the Book of Psalms, and computational techniques are presented to evaluate a poetic passage proposed by a scholar as having soundplay. That is, the computational techniques, though not definitive, help to distinguish between artistic soundplay and the results of chance and a limited phonemic inventory. In addition, visualization tools are presented to aid the researcher in finding soundplay in a corpus. 2014/08/23 - 19:58

Graduate student fellows of the Praxis Program at the University of Virginia Library have created Prism, a digital project to explore the possibilities of collaborative interpretation of texts, or ‘crowdsourced interpretation’. Prism was developed by two discrete teams of 1-year fellows with the Scholars’ Lab and is freely available online. This article describes Prism’s intervention into current crowdsourcing debates. First, we demonstrate that where other crowdsourcing projects have tended to ask users to compile data or perform other mechanistic tasks such as optical character recognition correction or manuscript transcription, Prism enables community-generated interpretation along discrete parameters. In addition, we describe how Prism challenges two common approaches to crowdsourcing in the digital humanities that are characterized as microtasking and macrotasking. We also explore the ways in which Prism’s user interface and design respect the role of the individual in crowdsourcing and how future developments of the tool might expand on these possibilities. 2014/08/23 - 19:58

The stereotype of the multi-authored Digital Humanities paper is well known but has not, until now, been empirically investigated. Here we present the results of a statistical analysis of collaborative publishing patterns in Computers and the Humanities (CHum) (1966–2004); Literary and Linguistic Computing (LLC) (1986–2011); and, as a control, the Annals of the Association of American Geographers (AAAG) (1966–2013) in order to take a first step towards investigating concepts of ‘collaboration’ in Digital Humanities. We demonstrate that in two core Digital Humanities journals, CHum and LLC, single-authored papers predominate. In AAAG, single-authored papers are also predominant. In regard to multi-authored papers the statistically significant increases are more wide-ranging in AAAG than in either LLC or CHum, with increases in all forms of multi-authorship. The author connectivity scores show that in CHum, LLC, and AAAG, there is a relatively small cohort of authors who co-publish with a wide set of other authors, and a longer tail of authors for whom co-publishing is less common. 2014/08/23 - 19:58

This article describes the development of a geographical information system (GIS) at Språkbanken as part of a visualization solution to be used in an archive of historical Swedish literary texts. The research problems we are aiming to address concern orthographic and morphological variation, missing place names, and missing place name coordinates. Some of these problems form a central part in the development of methods and tools for the automatic analysis of historical Swedish literary texts at our research unit. We discuss the advantages and challenges of covering large-scale spelling variation in place names from different sources and in generating maps with focus on different time periods. 2014/08/23 - 19:58

This article addresses the ‘meaning problem’ of unsupervised topic modeling algorithms using a tool called the Networked Corpus, which offers a way to visualize topic models alongside the texts themselves. We argue that the relationship between quantitative methods and qualitative interpretation can be reframed by investigating the long history of machine learning procedures and their historical antecedents. The new method of visualization presented by the Networked Corpus enables users to compare the results of topic models with earlier methods of topical representation such as the 18th-century subject index. Although the article provides a brief description of the tool, the primary focus is to describe an argument for this kind of comparative analysis between topic models and older genres that perform similar tasks. Such comparative analysis provides a new method for developing conceptual histories of the categories of meaning on which the topic model and the index depend. These devices are linked by a shared attempt to represent what a text is ‘about’, but the concept of ‘aboutness’ has evolved over time. The Networked Corpus enables researchers to discover congruities and contradictions in how topic models and indexes represent texts in order to examine what kinds of information each historically situated device prioritizes. 2014/08/23 - 19:58

For more than 40 years now, modern theories of literature have insisted on the role of paraphrases, rewritings, citations, reciprocal borrowings, and mutual contributions of many kinds. The notions of ‘intertextuality’, ‘transtextuality’, and ‘hypertextuality/hypotextuality’ were introduced in the seventies and eighties to approach these phenomena. Through the Phœbus project, computer scientists from the computer science laboratory of the University Pierre and Marie Curie collaborate with the literary teams of Paris-Sorbonne University to develop efficient tools for literary studies that take advantage of modern computer science techniques to detect borrowings in huge masses of texts and to help put them in context. As part of this project, we have developed a piece of software that automatically detects and explores networks of textual reuses in classical literature. This article describes the principles on which our program is based, the significant results that have already been obtained, and the prospects for the near future. It is divided into four parts. The first part recalls the distinction between various types of borrowings such as plagiarism, pastiches, and citations. The second enumerates the criteria that are retained to characterize the reuses and citations on which we are focusing here. The third part describes the implementation and shows its efficiency by comparison with manual detection. Finally, we show some of the results that have already been obtained with the Phœbus program. 2014/08/23 - 19:58

Based on Burrows's measure of stylometric difference that uses frequencies of most frequent words, Rolling Delta is a method for revealing stylometric signals of two (or more) authors in a collaborative text. It is applied here to study the texts written jointly by Joseph Conrad and Ford Madox Ford, producing results that generally confirm the usual critical consensus on the visibility of the two authors' hands. It also confirms that Ford's claims to a sizeable fragment in Nostromo are unfounded. 2014/08/23 - 19:58
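
As an illustration of the idea behind Delta-style measures, here is a minimal Python sketch. It is not the authors' implementation: the function names (`relative_freqs`, `burrows_delta`, `rolling_delta`), the tiny word lists, and the choice of the population standard deviation are all assumptions made for the example. Delta is computed as the mean absolute difference between z-scores of the most frequent words, and the "rolling" variant simply repeats that comparison over a sliding window of the collaborative text.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_freqs(tokens, vocab):
    """Relative frequency of each vocabulary word in a token list."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # guard against an empty window
    return [counts[w] / total for w in vocab]


def burrows_delta(text_a, text_b, reference_texts, n_words=3):
    """Mean absolute difference of z-scores over the most frequent words."""
    # The most frequent words are taken from the whole reference corpus.
    corpus_counts = Counter(t for text in reference_texts for t in text)
    vocab = [w for w, _ in corpus_counts.most_common(n_words)]
    ref = [relative_freqs(text, vocab) for text in reference_texts]
    means = [mean(col) for col in zip(*ref)]
    sds = [pstdev(col) for col in zip(*ref)]
    fa = relative_freqs(text_a, vocab)
    fb = relative_freqs(text_b, vocab)
    # Words with zero spread in the reference corpus carry no signal.
    diffs = [abs((x - m) / s - (y - m) / s)
             for x, y, m, s in zip(fa, fb, means, sds) if s > 0]
    return mean(diffs) if diffs else 0.0


def rolling_delta(text, candidate, reference_texts, window, step, n_words=3):
    """Delta between each sliding window of `text` and one candidate author."""
    return [burrows_delta(text[i:i + window], candidate, reference_texts, n_words)
            for i in range(0, len(text) - window + 1, step)]


# Toy data (hypothetical): three short "texts" as token lists.
t1 = "the cat and the dog".split()
t2 = "the bird and a fish and a frog".split()
t3 = "a dog and a cat".split()
refs = [t1, t2, t3]
print(burrows_delta(t1, t3, refs))
print(rolling_delta(t2, t1, refs, window=4, step=2))
```

In practice one would use far longer texts and a few hundred most frequent words; a low Delta for a window suggests stylistic closeness to the candidate in that stretch of text.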

Computer simulation is the only practical way to model diffusion of cultural features, including speech. We describe the use of a cellular automaton to model feature diffusion as the adaptive aspect of the complex system of speech. Throughout hundreds of iterations that correspond to the daily interaction of speakers across time, we can watch regional distributional patterns emerge as a consequence of simple update rules. A key feature of our simulations is validation with respect to distributions known to occur in survey data. We focus on the importance of appropriate visualizations to observe what is happening during the process of diffusion, with comparison between visualizations of actual survey data and visualizations applied to our simulation. In this way, we believe that we are breaking new ground in simulation of cultural interactions as complex systems. The study of speech as a complex system addresses language as an aspect of culture that emerges from human interaction. We believe that successful simulation of speech in cultural interaction as a complex system can suggest how other aspects of humanities, such as sites, artifacts, or styles in archaeology, can diffuse and change across space and time. Our successful simulation confirms our complex systems approach, and indicates how appropriate use of visualizations makes this possible. 2014/08/23 - 19:58
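
To make the notion of "simple update rules" concrete, the following is a minimal sketch of one cellular automaton step in Python. The abstract does not state the rule the authors used, so the local-majority rule below, the grid encoding (1 = a speaker at that location uses the feature, 0 = does not), and the function name `step` are assumptions chosen only for illustration.

```python
def step(grid):
    """Apply one update: a cell uses the feature if a strict majority of its
    neighbourhood (itself plus adjacent cells, clipped at the edges) does."""
    rows, cols = len(grid), len(grid[0])
    new = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            cells = [grid[i][j]
                     for i in range(max(0, r - 1), min(rows, r + 2))
                     for j in range(max(0, c - 1), min(cols, c + 2))]
            new[r][c] = 1 if 2 * sum(cells) > len(cells) else 0
    return new


# A lone user of the feature is absorbed by the surrounding majority.
isolated = [[0, 0, 0],
            [0, 1, 0],
            [0, 0, 0]]
print(step(isolated))
```

Iterating `step` hundreds of times on a larger grid seeded from survey data is the kind of process in which regional patterns could be observed to emerge; any validation against real distributions would need the actual rule and data.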

The article focuses on two related issues: authorship distinction and the analysis of characters’ voices in fiction. It deals with the case of Elisabeth Wolff and Agatha Deken, two women writers from the Netherlands who collaboratively published several epistolary novels at the end of the 18th century. First, the task division between the two authors is analysed based on their usage of words and their frequencies. Next, any stylistic differences between the characters (letter writers) are dealt with. The focus lies on Wolff’s and Deken’s first joint novel, Sara Burgerhart (1782). As to the authorship, nothing pointed to a clear task division, which implies that Deken’s and Wolff’s writing styles are very much alike. This confirms findings of other scholars, who found that collaborating authors jointly produce a style that is distinguishable from both authors’ personal styles. As to stylistic differences in the voices of the characters in Sara Burgerhart, it was found that only a couple of the letter writers are clearly distinguishable compared with the main characters in the novel. I experimented with two possible tools to zoom in on the exact differences between those characters, but the methods are still too subjective for my taste. In follow-up research, I will look further than words and their frequencies as building blocks of literary style. 2014/08/23 - 19:58

This paper examines prospects and limitations of citation studies in the humanities. We begin by presenting an overview of bibliometric analysis, noting several barriers to applying this method in the humanities. Following that, we present an experimental tool for extracting and classifying citation contexts in humanities journal articles. This tool reports the bibliographic information about each reference, as well as three features about its context(s): frequency, location-in-document, and polarity. We found that extraction was highly successful (above 85%) for three of the four journals, and statistics for the three citation figures were broadly consistent with previous research. We conclude by noting several limitations of the sentiment classifier and suggesting future areas for refinement. 2014/08/23 - 19:58

The frequencies of individual words have been the mainstay of computer-assisted authorial attribution over the past three decades. The usefulness of this sort of data is attested in many benchmark trials and in numerous studies of particular authorship problems. It is sometimes argued, however, that since language as spoken or written falls into word sequences, on the ‘idiom principle’, and since language is characteristically produced in the brain in chunks, not in individual words, n-grams with n higher than 1 are superior to individual words as a source of authorship markers. In this article, we test the usefulness of word n-grams for authorship attribution by asking how many good-quality authorship markers are yielded by n-grams of various types, namely 1-grams, 2-grams, 3-grams, 4-grams, and 5-grams. We use two ways of formulating the n-grams, two corpora of texts, and two methods for finding and assessing markers. We find that when using methods based on regularly occurring markers, and drawing on all the available vocabulary, 1-grams perform best. With methods based on rare markers, and all the available vocabulary, strict 3-gram sequences perform best. If we restrict ourselves to a defined word-list of function-words to form n-grams, 2-grams offer a striking improvement on 1-grams. 2014/05/17 - 12:45
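
For readers unfamiliar with the term, a word n-gram is a run of n consecutive words. The sketch below shows one straightforward way to extract and count them in Python; the function name `word_ngrams` and the example sentence are illustrative assumptions, and the article's two specific ways of formulating n-grams are not reproduced here.

```python
from collections import Counter


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = "to be or not to be".split()
bigram_counts = Counter(word_ngrams(tokens, 2))
print(bigram_counts.most_common(1))
```

Counts like these, gathered per author, are the raw material from which candidate authorship markers can be selected.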

Pearson’s chi-squared test is probably the most popular statistical test used in corpus linguistics, particularly for studying linguistic variations between corpora. Oakes and Farrow (2007) proposed various adaptations of this test to allow for the simultaneous comparison of more than two corpora while also yielding an almost correct Type I error rate (i.e. claiming that a word is most frequently found in a variety of English, when in actuality this is not the case). By means of resampling procedures, the present study shows that when used in this context, the chi-squared test produces far too many significant results, even in its modified version. Several potential approaches to circumventing this problem are discussed in the conclusion. 2014/05/17 - 12:45
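
As background, the basic two-corpus form of the test can be sketched as follows. This is the textbook Pearson statistic for a 2×2 table, not the multi-corpus adaptations or the resampling procedure discussed in the article; the function name `chi_squared_2x2` and the counts are hypothetical.

```python
def chi_squared_2x2(word_a, total_a, word_b, total_b):
    """Pearson's chi-squared for one word's frequency in two corpora.

    word_x  = occurrences of the word in corpus x
    total_x = total number of tokens in corpus x
    """
    a, b = word_a, total_a - word_a  # corpus A: the word / all other tokens
    c, d = word_b, total_b - word_b  # corpus B: the word / all other tokens
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


# Hypothetical counts: 20 hits in 100 tokens versus 10 hits in 100 tokens.
stat = chi_squared_2x2(20, 100, 10, 100)
print(round(stat, 3))
```

With one degree of freedom, a statistic above 3.841 is conventionally called significant at the 5% level; the article's point is that such verdicts are reached far too readily when the test is applied to corpus data.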

The political negotiation, erection, and fall of national and cultural borders represent an issue that frequently occupies the media. Given the historical importance of boundaries as a marker of cultural identity, as well as their function to separate and unite people, the Body Type Dictionary (BTD; Wilson, 2006) represents a suitable computerized content analysis measure to analyse vocabulary qualified to measure body boundaries and their penetrability. Against this background, this study aimed to assess the inter-method reliability of the BTD (Wilson, 2006) in relation to Fisher and Cleveland’s (1956, 1958) manual scoring system for high and low barrier personalities. The results indicated that Fisher and Cleveland’s manually coded barrier and penetration imagery scores showed an acceptable positive correlation with the computerized frequency counts of the BTD’s coded barrier and penetration imagery scores, thereby indicating inter-method reliability. In addition, barrier and penetration imagery correlated positively with primordial thought language in the picture response test, and narratives of everyday and dream memories, thereby indicating correlational validity. 2014/05/17 - 12:45

We define a model of discourse coherence based on Barzilay and Lapata’s entity grids as a stylometric feature for authorship attribution. Unlike standard lexical and character-level features, it operates at a discourse (cross-sentence) level. We test it against and in combination with standard features on nineteen book-length texts by nine nineteenth-century authors. We find that coherence alone often performs as well as, and sometimes better than, standard features, though a combination of the two has the highest performance overall. We observe that despite the difference in levels, there is a correlation in performance of the two kinds of features. 2014/05/17 - 12:45

Most authorship attribution studies have focused on works that are available in the language used by the original author (Holmes, 1994; Juola, 2006) because this provides a direct way of examining an author's linguistic habits. Sometimes, however, questions of authorship arise regarding a work only surviving in translation. One example is ‘Constance’, the putative ‘last play’ of Oscar Wilde, only existing in a supposed French translation of a lost English original.
The present study aims to take a step towards dealing with cases of this kind by addressing two related questions: (1) to what extent are authorial differences preserved in translation; (2) to what extent does this carry-over depend on the particular translator?
With these aims, we analysed 262 letters written by Vincent van Gogh and by his brother Theo, dated between 1888 and 1890, each available in the original French and in an English translation. We also performed a more intensive investigation of a subset of this corpus, comprising forty-eight letters, for which two different English translations were obtainable. Using three different indices of discriminability (classification accuracy, Hedges' g, and area under the receiver operating characteristic curve), we found that much of the stylistic discriminability between the two brothers was preserved in the English translations. Subsidiary analyses were used to identify which lexical features were contributing most to inter-author discriminability.
Discrimination between translation sources was possible, although less effective than between authors. We conclude that ‘handprints’ of both author and translator can be found in translated texts, using appropriate techniques. 2014/05/17 - 12:45
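
Of the three indices mentioned, Hedges' g is the least self-explanatory: it is a standardized mean difference (Cohen's d) multiplied by a small-sample correction factor. The Python sketch below uses the common approximation J = 1 − 3 / (4(n1 + n2) − 9); the function name `hedges_g` and the sample values are hypothetical and not drawn from the study.

```python
from math import sqrt
from statistics import mean, variance


def hedges_g(x, y):
    """Standardized mean difference with the small-sample correction."""
    n1, n2 = len(x), len(y)
    # Pooled standard deviation from the two sample variances.
    pooled = sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y))
                  / (n1 + n2 - 2))
    d = (mean(x) - mean(y)) / pooled      # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2) - 9)       # approximate correction factor
    return j * d


# Hypothetical per-letter scores on some stylistic feature for two writers.
print(hedges_g([2, 4, 6], [1, 3, 5]))
```

A larger absolute g means the two writers are more clearly separated on that feature, which is why it can serve as an index of discriminability.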

In this article I develop a set of simple algorithms for deriving syllable count information for words from fixed-meter poetry. The focus is on the determination of what features of language or meter might be most useful. I therefore first review what factors might be useful for this, selecting those that require as little information as possible about the language in question and making as few computational demands as possible. I end up with algorithms based on: (i) the number of syllables in each line, (ii) the number of words in each line, (iii) the number of letters in those words, and (iv) the frequency of those words.
I test these algorithms on corpora from English and Welsh, getting parallel results in both cases. The results establish that the variables I identify do have significant success in deriving syllable count, but that work remains to be done. 2014/05/17 - 12:45

In this article, the application of Ant-Colony Optimization (ACO) to a morphological segmentation task is described, where the aim is to analyse a set of words into their constituent stem and ending. A number of criteria for determining the optimal segmentation are evaluated comparatively while at the same time investigating more comprehensively the effectiveness of the ACO system in defining appropriate values for system parameters. Owing to the characteristics of the task at hand, particular emphasis is placed on studying the ACO process for learning sessions of a limited duration. Morphological segmentation becomes hardest in highly inflectional languages, where each stem is associated with a large number of distinct endings. Consequently, the present article investigates morphological segmentation of words from a highly inflectional language, specifically Ancient Greek, by combining pattern-recognition principles with limited linguistic knowledge. To weigh these sources of knowledge, a set of weights is used as a set of system parameters, to be optimized via ACO. ACO-based experimental results are shown to be of a higher quality than those achieved by manual optimisation or ‘randomised generate and test’ methods. This illustrates the applicability of the ACO-based approach to the morphological segmentation task. 2014/05/17 - 12:45

Four centuries before modern statistical linguistics was born, Leon Battista Alberti (1404–72) compared the frequency of vowels in Latin poems and orations. Alberti’s observations are statistically assessed using a corpus of twenty Latin texts. Letter counts prove that poets used significantly more a’s, e’s, and y’s, whereas orators used more of the other vowels. The sample sizes needed to justify the assertions are studied, and proved to be within the reach of Alberti’s scholarship. Alberti appears to have made the first quantified observation of a stylistic difference ever, anticipating by more than four centuries the developments of stylometry and statistics. 2014/05/17 - 12:45

Quantifying the similarity or dissimilarity between documents is an important task in authorship attribution, information retrieval, plagiarism detection, text mining, and many other areas of linguistic computing. Numerous similarity indices have been devised and used, but relatively little attention has been paid to calibrating such indices against externally imposed standards, mainly because of the difficulty of establishing agreed reference levels of inter-text similarity. The present article introduces a multi-register corpus gathered for this purpose, in which each text has been located in a similarity space based on ratings by human readers. This provides a resource for testing similarity measures derived from computational text-processing against reference levels derived from human judgement, i.e. external to the texts themselves. We describe the results of a benchmarking study in five different languages in which some widely used measures perform comparatively poorly. In particular, several alternative correlational measures (Pearson r, Spearman rho, tetrachoric correlation) consistently outperform cosine similarity on our data. A method of using what we call ‘anchor texts’ to extend this method from monolingual inter-text similarity-scoring to inter-text similarity-scoring across languages is also proposed and tested. 2014/03/21 - 00:21
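
To see why a correlational measure can behave differently from cosine similarity, consider the sketch below. It relies only on the standard definitions (Pearson r is the cosine of the mean-centred vectors); the function names and the toy frequency vectors are assumptions for illustration and are unrelated to the article's corpus.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(u, v):
    """Pearson r: cosine similarity after subtracting each vector's mean."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])


# Two toy word-frequency profiles with opposite rankings of the same words.
doc_a = [1, 2, 3]
doc_b = [3, 2, 1]
print(cosine(doc_a, doc_b), pearson(doc_a, doc_b))
```

Because word frequencies are never negative, cosine similarity stays fairly high here even though the two profiles rank the words in reverse order, whereas Pearson r reports a perfect negative correlation. This is one plausible reason the two kinds of measure can diverge, though the article's benchmark results rest on its own human-rated data.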

MONK is a web-based text mining software application hosted by the University of Illinois Library that enables researchers to analyze encoded digital texts from select databases and digital archives. This study examines sets of quantitative and qualitative data to explore the usage of MONK as a research tool: the author analyzes eighteen months of web analytics data from the MONK website and responses from five interviews with MONK users to examine the ways in which MONK has been most commonly used by researchers. In the paper's analysis, the author considers the implications of MONK's use in digital humanities research and teaching, and how a digital humanities tool such as MONK can be maintained for public use. This study ultimately explores how user studies of digital humanities tools can reveal insights into humanities scholars' needs for using digital tools to pursue new research methodologies, and argues that studying the usability and preservation of digital humanities tools will enable information professionals to address humanities scholars' needs for their digital scholarship. 2014/03/21 - 00:21

We compute the rate of textual signals of risk of war recognizable in series of consecutive political speeches about a disputed issue serious enough to entail an international conflict. The speeches concern Iran’s nuclear program. We trace textual signals forewarning of risks of war that reactions to this affair lead to. The thrust of the textual analysis rests on the interplay of affiliation and power words in continuous texts, following D. C. McClelland’s model for anticipating wars. The speeches are those of Iranian President Mahmoud Ahmadinejad, US Secretary of State Hillary R. Clinton, Iranian Grand Ayatollah Ali Khamenei, and Israeli Prime Minister Benjamin Netanyahu. Prefiguring a military confrontation before it occurs involves structuring information from unstructured data. Despite such imperfect knowledge, by the end of January 2012, our results show a receding risk of war on the Iranian side, but an increasing risk on the American one, while remaining ambiguous on the Israeli one. 2014/03/21 - 00:21

In recent years, the wide availability of language resources in different forms, together with the rapid development of computer technology and programming skills, has led researchers in linguistics and computer science to cooperate in solving problems of computational linguistics and natural language processing. Building large monolingual and bilingual corpora in digital form and storing them in computer memory has enabled linguists and language engineers to explore techniques for processing information automatically with the help of computer programs, without any need to collect and analyze data manually.
One of the main applications of monolingual corpora is the development of automatic spell-checking systems. In such systems, a large monolingual corpus can function as a database instead of a monolingual dictionary. In the present study, we attempt to demonstrate the effectiveness of a large monolingual corpus of Persian in improving the output quality of a spell-checker developed for this language.
In the present spelling correction system, the three phases of error detection, making suggestions, and ranking suggestions are performed in three separate stages. An experiment was carried out to evaluate the performance of the spell-checking system. 2014/03/21 - 00:21

One of the major challenges in the process of machine translation is word sense disambiguation (WSD), which is defined as choosing the correct meaning of a multi-meaning word in a text. Supervised learning methods are usually used to solve this problem. The disambiguation task is performed using the statistics of the translated documents (as training data) or dual corpora of source and target languages. In this article, we present a supervised learning method for WSD, which is based on the K-nearest neighbor algorithm. As the first step, we extract two sets of features: the set of words that have occurred frequently in the text and the set of words surrounding the ambiguous word. In order to improve the classification accuracy, we perform a feature selection process and then propose a feature weighting strategy to tune the classifier. In order to show that the proposed schemes are not language dependent, we apply them to two sets of data, i.e. English and Persian corpora. The evaluation results show that the feature selection and feature weighting strategies have a significant effect on the accuracy of the classification system. The results are also encouraging compared with the state of the art. 2014/03/21 - 00:21

Although the aggregation of many linguistic variables has provided new insights into the structure of language varieties, aggregation studies have been criticized for obscuring the behavior of individual input variables. Previous solutions to this criticism consisted of extensive post-hoc calculations, simple correlation measures, or highly complex algorithms. We think that these solutions can be improved. Therefore, the current article proposes a creative use of Individual Differences Scaling (INDSCAL) as an alternative, more straightforward solution. INDSCAL is a branch of Multidimensional Scaling, which is currently the preferred dimension reduction technique for most aggregation studies. The link to the existing methodology and the simplicity of its rationale are the main advantages of INDSCAL. The article introduces INDSCAL by means of a non-linguistic example, a discussion of the mathematical properties, and a case study on the lexical convergence between Belgian and Netherlandic Dutch in a corpus of language from 1950 and 1990. The case study shows how INDSCAL reproduces the results of a typical aggregation study, but elegantly keeps open the possibility of investigating the behavior of individual variables. 2014/03/21 - 00:21
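
For orientation, the standard INDSCAL model (a weighted Euclidean model) can be written as follows. The notation is the conventional one, not taken from the article, and reading each "source" $k$ as an individual linguistic variable is an assumption about how the article applies the model.

```latex
% x_{ia}: coordinate of object i (e.g. a language variety) on dimension a
%         of the shared "group" space
% w_{ka}: non-negative weight of source k on dimension a
% d_{ij}^{(k)}: distance between objects i and j as seen by source k
d_{ij}^{(k)} = \sqrt{\sum_{a=1}^{r} w_{ka}\,\bigl(x_{ia} - x_{ja}\bigr)^{2}}
```

The group space plays the role of the aggregate configuration, while the weights $w_{ka}$ show how strongly each source contributes to each dimension. This is how the behavior of individual sources stays open to inspection.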

This study investigates data from the BBC Voices project, which contains a large amount of vernacular data collected by the BBC between 2004 and 2005. The project was designed primarily to collect information on vernacular speech around the UK for broadcasting purposes. As part of the project, a web-based questionnaire was created, to which tens of thousands of people supplied their way of denoting thirty-eight variables that were known to exhibit marked lexical variation. Along with their variants, those responding to the online prompts provided information on their age, gender, and—significantly for this study—their location, this being recorded by means of their postcode. In this study, we focus on the relative frequency of the top ten variants for all variables in every postcode area. By using hierarchical spectral partitioning of bipartite graphs, we are able to identify four contemporary geographical dialect areas together with their characteristic lexical variants. Even though these variants can be said to characterize their respective geographical area, they also occur in other areas, and not all people in a certain region use the characteristic variant. This supports the view that dialect regions are not clearly defined by strict borders, but are fuzzy at best. 2014/03/21 - 00:21
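
A single step of spectral partitioning on a bipartite graph can be sketched as below, following the commonly used co-clustering formulation based on a singular value decomposition of the degree-normalized matrix. Whether the study uses exactly this formulation is an assumption, as are the function name and the sign-based split; a hierarchical version would apply the same step recursively to each resulting part.

```python
import numpy as np


def bipartition(counts):
    """Split rows (e.g. postcode areas) and columns (e.g. lexical variants)
    of a non-negative matrix into two co-clusters.

    Assumes every row and every column has a positive sum.
    Returns two boolean arrays marking the side of each row and column.
    """
    A = np.asarray(counts, dtype=float)
    d_rows = A.sum(axis=1)  # degree of each row vertex
    d_cols = A.sum(axis=0)  # degree of each column vertex

    # Normalize: An = D_rows^{-1/2} A D_cols^{-1/2}
    An = A / np.sqrt(np.outer(d_rows, d_cols))

    # The second pair of singular vectors carries the partition information;
    # the first pair is trivial (it only reflects the degrees).
    U, s, Vt = np.linalg.svd(An, full_matrices=False)
    u2 = U[:, 1] / np.sqrt(d_rows)
    v2 = Vt[1, :] / np.sqrt(d_cols)

    # Simplest possible split: by sign. Which side is True is arbitrary.
    return u2 >= 0, v2 >= 0
```

Because rows and columns are embedded together, each group of areas comes out paired with the variants that characterize it, which matches the abstract's description of dialect areas identified "together with their characteristic lexical variants".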

This study draws on a large corpus of Congressional speeches from the 101st to the 110th Congress (1989–2008) to examine gender differences in language use in political debate. Female legislators’ speeches demonstrated characteristics of both a feminine language style (e.g. more use of emotion words, fewer articles) and a masculine one (e.g. more nouns and long words, fewer personal pronouns). A trend analysis found that these gender differences persisted in Congressional speeches throughout the 20-year period, regardless of the topic of debate. The findings lend support to the argument that gender differences in language use persist in professional settings like the floor of Congress. 2014/03/21 - 00:21