Text Mining

Weeks 7 and 8 of our course are taught by Dr. Carolyn Penstein Rosé, who is on the computer and languages faculties at Carnegie Mellon University. She’s particularly interested in collaborative learning.

The unit is on “text mining.” Text mining is a subset of “data mining.” As far as I can tell (relying heavily on Wikipedia) the goal of data mining is to simplify and visualize the data in big datasets. In essence, I think that “data mining” is really “predictive analysis.”

Text mining is used for a number of tasks, including classifying movie reviews, understanding medical records, and tracking consumer purchasing.

A potentially very interesting application is automated essay grading (AES) or improvement. Lightside Labs has software that does this in the K-12 realm.

Our assignment is to learn to use LightSIDE, which is an open source machine learning program (no mention of the “data mining” buzzword in the user manual). LightSIDE is built on WEKA, but I’m not yet sure what its advantages are over WEKA.

LightSIDE seems to do the basic linear and binomial regressions, tree models, etc. It also has a lot of text processing capabilities. I haven’t had time to read the entire user manual yet, but doing so would be like taking an entire course in text analysis – it seems very good.
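To give a flavor of what “text processing” means here: a common first step in text mining is to turn each document into word counts (a “bag of words”) that a model can then use as predictors. The sketch below is plain Python, not LightSIDE's own implementation; the function names and the two sample reviews are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text and keep runs of letters/apostrophes as tokens.
    return re.findall(r"[a-z']+", text.lower())

def bag_of_words(docs):
    # Build a sorted vocabulary, then one row of word counts per document.
    vocab = sorted({tok for doc in docs for tok in tokenize(doc)})
    rows = []
    for doc in docs:
        counts = Counter(tokenize(doc))
        rows.append([counts[tok] for tok in vocab])
    return vocab, rows

# Hypothetical mini-corpus, just to show the shape of the output.
reviews = ["A great, great movie.", "Not a great movie."]
vocab, rows = bag_of_words(reviews)
print(vocab)
print(rows)
```

Each row can then be paired with a label (for example a review's rating or an essay's score) and fed to a classifier or regression.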

My next step is to choose some kind of analysis to do. I have a bunch of scored essays – I might start with that. Back soon.









Are the effects of exercise due to placebo?

Interesting new post that provides a way to study a potential placebo effect.  The idea is that, if a placebo effect can explain a finding (in this case the effect of exercise on memory),  then people need to believe in the potential relationship.  The authors argue that the effects of exercise on memory can’t be placebo effects, because people don’t believe exercise affects memory.


More than that, the article has really good ancillary material, including the dataset and the R code that runs the analyses.


Predicting post-bacc outcomes

My research on predicting outcomes was based on 373 students who graduated from college over the past two years. I had information on two binary variables: whether or not they had received a job offer by the day of their graduation, and whether or not they had been admitted to graduate school by the day of graduation. I predicted each of these binary outcomes using these variables:

  • Student Gender
  • Cumulative GPA
  • SAT total scores (from admission)
  • Did the student participate as a Research Assistant in college (Y/N)
  • Did the student participate as a Teaching Assistant in college (Y/N)
  • Did the student participate in service activities in college (Y/N)

I made the predictions in R using the glm function with family = binomial, i.e. logistic regression.

I also did the prediction using regular linear modeling.
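For readers who want to see what a model for a yes/no outcome does under the hood, here is a minimal logistic regression sketch. It is written in plain Python with gradient descent rather than R, it is not the code used for this analysis, and the data are randomly generated for illustration (they are NOT the 373 students). The function names, the two predictors, and the "true" coefficients are all assumptions.

```python
import math
import random

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(w, x):
    # w[0] is the intercept; w[1:] are the slopes for the predictors in x.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    # Batch gradient descent on the average log-loss.
    n, k = len(X), len(X[0])
    w = [0.0] * (k + 1)
    for _ in range(epochs):
        grad = [0.0] * (k + 1)
        for xi, yi in zip(X, y):
            err = predict_proba(w, xi) - yi
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w

# Made-up data: GPA (centered at 3.0) and a 0/1 research-assistant flag.
random.seed(0)
X, y = [], []
for _ in range(200):
    gpa = random.uniform(2.0, 4.0)
    ra = random.choice([0, 1])
    true_p = sigmoid(2.0 * (gpa - 3.0) + 0.5 * ra - 0.5)
    X.append([gpa - 3.0, ra])
    y.append(1 if random.random() < true_p else 0)

w = fit_logistic(X, y)
print("intercept, GPA slope, RA slope:", [round(v, 2) for v in w])
print("P(admitted | GPA 4.0, RA):", round(predict_proba(w, [1.0, 1]), 2))
```

Unlike a regular linear model, the predictions here are always probabilities between 0 and 1, which is the main reason to prefer this kind of model for binary outcomes.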

The report for predicting admission to grad school is below:










Although the overall model and two of the predictors (chiefly cumulative GPA) are statistically significant, the model explains only a small amount of the variance.

Tools for Learning Analytics

My course asks me to keep track of the software that I am using in my learning analytics activities. I will do so here.

My major tool so far is the programming language R. It’s open source and is one of the most popular programming languages worldwide. It does a lot of things – I use it to organize and statistically analyze my data. It’s taken me years to learn R but now that I have the language in hand it’s really useful for me.

I’m experimenting with Tableau which is part of the course. Tableau is data visualization software designed to allow people to easily get into and visualize their data. It’s a lot of point-and-click activity. Really very powerful but also very expensive.

EdX course on learning analytics

So I’ve been taking this course on Data, Analytics, and Learning from EdX. It’s taught by a team of researchers from UT Arlington. The course got off to a rocky start for me because they were trying to do too much and because I wasn’t really very keen on their approach. They have a visual syllabus:


which threw me a bit until I realized what it was. It’s pretty, but I’d prefer to be able to view it in text format.

Here’s an image of the course structure:

Honestly, I think it’s probably a good try, but really too confusing.


Somewhere in there you’ll see this thing called “ProSolo.” I tried to check that out, but it turns out you can finish the course without it, so I stopped. One problem is that the instructions for it are all in video format. I try not to do video, and there is no accompanying transcript. This annoys me and is also not consistent with accessibility standards. There also seems to be some concern from students about the privacy of data published to ProSolo.

The intro to the course presents a “red pill” and a “blue pill” path choice (we begin with drugs!). One is more individual, one is more social. I think I can do either, but they want me to be social. I’ll try to be. The instructors believe that participation in social networks improves learning. They may convince me – more on that later.

A big thing is the #DALMOOC Twitter hashtag. I suppose I should Tweet – maybe I will.

Will there be group assignments? Not clear yet.

I’m looking forward to the course.




So we’re analyzing social networks using Gephi in the #dalmooc learning analytics course:

If I have this right,

Actors (aka nodes) are the people (e.g., email addresses).

Relations (aka edges) are the links among people. They can be signed (positive or negative): for instance, advice (+) vs. annoyance (-).

Relations can have weights.

Relations can be directed or undirected.

There are various measures that describe a group structure:

Diameter: The longest of the shortest paths between any pair of nodes in the network.

Density: The proportion of possible relations that are actually present.

Degree centrality: The number of connections for each actor.

In-degree centrality: How many other nodes are trying to establish communication or are talking to a particular node.

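Out-degree centrality: How many outgoing connections a node has, for example how many e-mails an individual sent to somebody else.

Betweenness centrality: How often a node lies on the shortest paths between other pairs of nodes, i.e. how much it acts as a bridge.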

Closeness centrality: The ease or the shortest distance of a node to anybody else in the network.

Communities (modules) can form. They are subgroups that are more densely connected internally than to the rest of the network.

The “giant component” is the largest connected subgroup.
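To check my understanding of these measures, here is a small sketch that computes a few of them by hand on a toy undirected network. It is plain Python, not Gephi, and the seven-node edge list is invented for illustration. One assumption to flag: closeness is computed only over the nodes a given node can actually reach, which is one of several conventions for disconnected networks.

```python
from collections import deque

# Hypothetical toy network: one group of five people plus a separate pair.
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("B", "D"), ("D", "E"), ("F", "G")]

def build_graph(edge_list):
    # Undirected adjacency sets: node -> set of neighbours.
    graph = {}
    for u, v in edge_list:
        graph.setdefault(u, set()).add(v)
        graph.setdefault(v, set()).add(u)
    return graph

def bfs_distances(graph, source):
    # Shortest path length (in edges) from source to every reachable node.
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in graph[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist

def density(graph):
    # Edges present divided by edges possible among n nodes.
    n = len(graph)
    m = sum(len(nbs) for nbs in graph.values()) // 2
    return m / (n * (n - 1) / 2)

def diameter(graph):
    # Longest shortest path over all pairs that can reach each other.
    return max(max(bfs_distances(graph, s).values()) for s in graph)

def closeness(graph, node):
    # Reachable nodes divided by the sum of distances to them.
    dist = bfs_distances(graph, node)
    reachable = len(dist) - 1
    return reachable / sum(dist.values()) if reachable else 0.0

def giant_component(graph):
    # Largest set of nodes that are all connected to one another.
    seen, best = set(), set()
    for node in graph:
        if node not in seen:
            comp = set(bfs_distances(graph, node))
            seen |= comp
            if len(comp) > len(best):
                best = comp
    return best

graph = build_graph(edges)
degree = {node: len(nbs) for node, nbs in graph.items()}
print("density:", round(density(graph), 3))
print("degree:", degree)
print("diameter:", diameter(graph))
print("closeness of B:", closeness(graph, "B"))
print("giant component:", sorted(giant_component(graph)))
```

Working through even a tiny example like this made it clearer to me why the pictures alone are hard to interpret: the numbers stay the same no matter which layout is used.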

I did an analysis on one of the sample datasets from a prior course (CCK11 Blogs, 6 weeks) and it produces (YES!) lovely pictures, BUT

  •     The pictures are completely different given different “layouts”
  •     It’s not very clear what it means without some identifying information – who is who?


If you want to try some other datasets, there are a bunch listed here:


NB: Some of them are too big for Gephi.


Although it’s been in development for 10 years it seems that Tableau is suddenly making a splash in the education community.  It’s a great product — makes it easy to create pretty complex analyses using point and click.

The software seems to be super expensive — hundreds of thousands of dollars for an enterprise license.

I didn’t find it all that intuitive at first, and I got tired of so much pointing and clicking.  But I can see it’s going to revolutionize how people interact with data.

I found that it was missing some obvious stuff (you can’t compute a standard error directly) and in the end I may prefer to just continue working with R and ggplot.
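For reference, the standard error of a mean is just the sample standard deviation divided by the square root of the sample size. A minimal Python sketch follows; the numbers are made up, and the function name is my own.

```python
import math
import statistics

def standard_error(values):
    # Sample standard deviation (n - 1 denominator) over sqrt(n).
    return statistics.stdev(values) / math.sqrt(len(values))

# Hypothetical sample, for illustration only.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
se = standard_error(sample)
print("mean:", statistics.mean(sample))
print("standard error:", round(se, 3))
```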

I began by looking at this article in the New York Times on racial gaps in higher education:


It turns out that the website contains data on 999 universities in the US.

I used the R package XML to read the table into JSON format and then the R package jsonlite to turn the table into a .csv file. Then I opened the file in Tableau.

It was easy to see the proportion of students at each university who were African American:

    Drag a discrete variable “school name” to columns

    Drag a continuous variable “percent black” to rows

    Sort descending on “percent black.”

NB (from http://www.theinformationlab.co.uk/2011/09/23/blue-things-and-green-things/)



Let’s put them on a tree map and add a filter that allows you to choose the size of school. This allows you to do something pretty important – namely to see schools that meet criteria.



So, if you’re an African American student looking for a smaller college (fewer than about 10,000 students) and you want (1) a high graduation rate and (2) a high proportion of black students, you’d look for schools on the left side of the map that are also in a dark color. Howard University pops out (you have to mouse over the smaller squares to see the school name).