Analysis of my personal journal


[ Python | NLTK | textgenrnn | TextBlob | Stanford CoreNLP ]

Wan Shen Lim

May 2018 @ Carnegie Mellon University


I’ve been consistently journaling since the start of my high school senior year (Sept 2014), producing around 337,000 words (about 320 single-spaced pages at 12 pt font) of unfiltered thoughts. This is a unique source of personal data that I used to reveal trends in my life that promote happier writing, which may be correlated with a happier life!



There is a non-trivial amount of processing to be done on the journals, as my journals for different years come in different file types and formats: a single .docx file covering 2014, 2015, and 2016; a single .tex file for 2017; and my 2018 entries, which have all been in Dropbox Paper. Each of these file formats has its own scheme for formatting journal entries. Wan and I needed to design a data structure that makes analysis possible across varying time frames (i.e., weeks, months, years) while remaining compatible with various natural language processing libraries.
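The project’s actual schema isn’t shown here, but one minimal way to meet both requirements is to normalize every source format down to dated plain-text entries, then bucket them by whatever time frame an analysis needs. The names below (Entry, group_by_month) and the sample entries are illustrative assumptions, not the project’s real code:

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date

@dataclass
class Entry:
    """One journal entry, normalized out of .docx / .tex / Dropbox Paper sources."""
    day: date
    text: str

def group_by_month(entries):
    """Bucket entries by (year, month) so analyses can run per time frame.

    The same pattern works for weeks (date.isocalendar()) or years.
    """
    buckets = defaultdict(list)
    for entry in entries:
        buckets[(entry.day.year, entry.day.month)].append(entry)
    return dict(buckets)

# Illustrative sample data, not real journal content.
entries = [
    Entry(date(2014, 9, 1), "First entry of senior year."),
    Entry(date(2014, 9, 2), "Second entry."),
    Entry(date(2017, 1, 5), "A LaTeX-era entry."),
]
by_month = group_by_month(entries)
```

Because each bucket is just a list of plain-text entries, NLP libraries that expect raw strings can consume a bucket by joining or iterating its texts.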

We experimented with TextBlob, NLTK’s VADER, and Stanford’s CoreNLP sentiment analyzers. We tested all three on a random sample of my journal entries and found that VADER was the best at accurately reflecting my sentiment.
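Unlike the machine-learned CoreNLP model, VADER (exposed in NLTK as nltk.sentiment.vader.SentimentIntensityAnalyzer, whose polarity_scores method returns neg/neu/pos proportions plus a compound score in [-1, 1]) is lexicon- and rule-based. As a dependency-free sketch of the core idea, summing per-token valences and squashing the total into [-1, 1], here is a toy scorer; the three-word lexicon is a made-up stand-in for VADER’s large human-rated lexicon, and real VADER also applies rules for negation, intensifiers, punctuation, and so on:

```python
import math

# Toy valence lexicon: values here are invented for illustration only.
# The real VADER lexicon contains thousands of human-rated entries.
LEXICON = {"great": 3.1, "happy": 2.7, "terrible": -2.1}

def compound_score(text, alpha=15.0):
    """Sum token valences, then squash into [-1, 1].

    VADER normalizes with sum / sqrt(sum^2 + alpha); alpha damps
    how quickly short texts saturate toward +/-1.
    """
    total = sum(LEXICON.get(token, 0.0) for token in text.lower().split())
    return total / math.sqrt(total * total + alpha)

compound_score("what a great happy day")  # positive
compound_score("a terrible day")          # negative
```

Comparing analyzers then comes down to running each one over the same sampled entries and checking whose scores best match a human reading, which is how VADER won out here.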

We also trained RNNs and LSTMs via textgenrnn to generate Jason-like entries.

Lately, I’ve been incorporating metadata about my day into each daily entry. There isn’t yet enough of it to warrant analysis, but in the future I’d like to perform similar analyses on this metadata.