Hi Friends,

Even as I launch this today (my 80th birthday), I realize that there is yet so much to say and do. There is just no time to look back, no time to wonder, "Will anyone read these pages?"

With regards,
Hemen Parekh
27 June 2013

Now, as I approach my 90th birthday (27 June 2023), I invite you to visit my Digital Avatar (www.hemenparekh.ai) and continue chatting with me, even when I am no longer here physically.

Thursday 18 July 2013

Data Mining of 5 Million Job Adverts


Rohini

Even though 5 million job adverts may contain 500 million “words”, these are not unique.

Most of them are used again and again, hundreds or thousands of times.

Through data mining, it is not difficult to compute their “frequency of usage”.

These frequencies can then be plotted graphically over any particular time period.
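The counting step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the sample adverts, the (date, text) record shape, the monthly grouping and the names `tokenize` and `frequency_by_month` are all hypothetical and not taken from any actual system.

```python
import re
from collections import Counter, defaultdict
from datetime import date

# Hypothetical sample adverts: (posting date, advert text).
ADVERTS = [
    (date(2013, 5, 3), "Wanted: Java developer with SQL experience"),
    (date(2013, 5, 20), "Senior Java developer, Mumbai"),
    (date(2013, 6, 11), "Sales manager with SQL reporting skills"),
]

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def frequency_by_month(adverts):
    """Return {(year, month): Counter mapping word -> occurrences}."""
    counts = defaultdict(Counter)
    for posted, text in adverts:
        counts[(posted.year, posted.month)].update(tokenize(text))
    return counts

monthly = frequency_by_month(ADVERTS)

# One point per month for a single word, ready to hand to any plotting tool.
java_series = [(period, c["java"]) for period, c in sorted(monthly.items())]
```

The resulting `java_series` is a list of (period, count) pairs, which is the shape most charting libraries expect for a line plot over time.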

Such graphical representations can be further broken up by:

- City names
- Company names
- Industry names
- Function names
- Designations (vacancy names), etc.
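A breakdown along any one of these dimensions might look like the sketch below. It assumes each advert has already been tagged with fields such as city and industry; the field names, the sample records and the helper `keyword_by_field` are hypothetical.

```python
import re
from collections import Counter

# Hypothetical advert records; the field names are assumptions.
ADVERTS = [
    {"city": "Mumbai", "industry": "IT", "text": "Java developer, Java architect"},
    {"city": "Pune", "industry": "IT", "text": "Java tester"},
    {"city": "Mumbai", "industry": "Banking", "text": "Credit analyst"},
]

def keyword_by_field(adverts, keyword, field):
    """Count occurrences of `keyword`, grouped by the value of `field`."""
    totals = Counter()
    for advert in adverts:
        words = re.findall(r"[a-z]+", advert["text"].lower())
        totals[advert[field]] += words.count(keyword.lower())
    return totals

by_city = keyword_by_field(ADVERTS, "java", "city")
by_industry = keyword_by_field(ADVERTS, "java", "industry")
```

The same function serves every dimension in the list, since only the `field` argument changes.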

 

Such graphical analysis can be done not only for “keywords” but even for “key phrases” and “sentences”!
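Key phrases can be counted in much the same way as single words by treating every run of n consecutive words as one unit (an n-gram). The following is a minimal sketch; the sample texts and the helper name `ngrams` are hypothetical.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Yield each run of n consecutive words in `text` as one string."""
    words = re.findall(r"[a-z]+", text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

# Hypothetical sample advert texts.
TEXTS = [
    "Excellent communication skills required",
    "Must have excellent communication skills",
]

# Count two-word phrases across all texts.
phrase_counts = Counter()
for text in TEXTS:
    phrase_counts.update(ngrams(text, 2))
```

Raising `n` moves the unit of analysis from keywords towards longer phrases, at the cost of a much larger number of distinct entries to store.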

Regards,

Hcp


 A Google database Ngram helps to understand American novels better

By New York Times | 15 Jul, 2013, 05.00AM IST
By examining the changing frequencies of key words in books published in the US, researchers can gain new perspectives on America and its novels.
ET SPECIAL:
By Marc Egnal

Can the technologies of Big Data, which are transforming so many areas of life, change our understanding of American novels? After conducting research with Google's Ngram database, which tabulates the frequency of words used in over five million books, I believe the answer is yes.

Consider the question of which themes and books characterise a literary era. The time-honoured approach to this problem has been for a critic or a group of scholars to select and analyse key novels. That methodology has its flaws. No one person or team of readers can do more than dip their toes into the vast sea of literary works. By the 1840s, Americans wrote more than 100 novels annually; by the 1880s, more than 1,000; by the early 21st century, more than 10,000. In addition, there is the threat of subjective bias. Not long ago, for example, critics focused their attention almost exclusively on white male authors.

The Ngram database offers an alternative approach. By examining the changing frequencies of key words in books published in the US, researchers can gain new perspectives on America and its novels. There are important caveats in using this source. The "American English" subset of the Ngram database includes a broad selection of books published in the US — not just fiction or writings by American authors. It excludes the dime novels favoured by the lower class, and so has a middle-class bias. But as a guide to the works that middle-class Americans read, it is a fruitful source of hypotheses and a healthy check on subjective opinion.

In a number of instances, Ngram data suggests challenges to common assumptions.

Word Processing

Take the role of women in mid-19th century American novels. Scholars have argued that domesticity shaped the world of middle-class women. Women were supposed to be submissive, pious, domestic and pure. But Ngram indicates that the use of those words peaked, respectively, in 1807, 1814, 1835 and 1847. All fell off by 1950.

By contrast, striking gains were recorded in the usage of woman's rights. Virtually unknown before the 1840s, the term soared in frequency after the Seneca Falls Convention in 1848. Perhaps we need to invert conventional wisdom and declare as "representative" those mid-century novels criticising domesticity and celebrating independent women, like Fanny Fern's Ruth Hall (1854) and Emma Southworth's Hidden Hand.

Ngram data also provides a new perspective on the novels of the 1930s. These years are traditionally viewed as the heyday of the proletarian novel, a time of gloom and a period when business leaders were despised. John Steinbeck's 1939 novel, The Grapes of Wrath, is considered a quintessential novel of the decade. But according to Ngram data, the use of businessman, a term virtually unknown before 1930, surged during the decade. Of course, you might guess that those citations were negative, but trends in other terms point to a more positive reading.

Mentions of the American dream, a term rarely seen before 1930, also climbed precipitously. So instead of Steinbeck's novel, works highlighting scrappy entrepreneurs may best mark this decade. In Their Eyes Were Watching God (1937), for example, the heroine's first two husbands were successful businessmen who overcame racial prejudice. Similarly, Gone With the Wind (1936) details Scarlett regaining the affluence she once enjoyed.

Our view of postmodern fiction might also need adjusting. Chaos, conspiracy and nihilism are thought to reign in this literary world. Word usage, however, indicates the growing attention paid to children. Among the terms whose frequency escalates after 1960 are caring, nurturing, infant, toddler and childhood.

Perhaps the representative works of this era are novels like Toni Morrison's Beloved, Philip Roth's American Pastoral and Cormac McCarthy's The Road, all of which feature deep parent-child bonds. These hypotheses are only suggestive, but as tools like Ngram improve, they should encourage scholars to revisit longstanding assumptions.

(The writer is a professor of history at York University, Toronto)


