Who Do You Write Like?

Via a couple of Facebook friends, I just stumbled upon this website, which seeks to statistically answer the above question. You put in a sample of your writing, and it tells you which author you resemble stylistically. Apparently it’s been making its way around the interwebs for some months now, but I guess I’m getting old enough that it’s okay if I’m not always an early adopter anymore, right?

Anyway, the website is frustratingly “black boxish” for a geek like me who likes to know how things work. It tells us only that, through some sort of statistical analysis, it looks at a writing sample you submit and tells you which famous author has the most stylistically similar prose. Fair enough; it’s easy enough to imagine a number of ways in which such a program could be written. But that’s not the interesting part. What sort of statistics do they use? What is their metric of “distance” between styles? I wish it would tell me. It’s not as though that could be some kind of proprietary secret. But being a bit of a dork, I’ve tried to find out anyway. The results so far are curious but inconclusive.

The first thing I did, of course, was pop in a sample of my own writing, a few paragraphs of one of my more “flowery” blog posts, and I was a bit disappointed that its result was Edgar Allan Poe (I have nothing against Poe; I love him, in fact. But the friend who helped me discover this apparently writes like P.G. Wodehouse, which is infinitely cooler). But if I wanted to disagree with the assessment, I was going to need to know a bit more about how it was made, and that’s when I got to start having fun.

Having established that “Poe” was one of the authors it had indexed, I did the next most obvious thing and dropped in a paragraph from The Murders in the Rue Morgue. The website (which I will henceforth refer to as “IWL”) correctly identified that Poe, like me, writes like himself.

I tried the same thing with a snippet of Wodehouse from Carry On, Jeeves; same result. This wasn’t going anywhere. Was IWL just comparing whatever text you gave it directly to the stored texts of dozens of classic novels and stories to see who you had the most consecutive words in common with?

I went in search of some of the more obscure Poe stories out there. Gave it “Manuscript Found in a Bottle”; couldn’t fool it. Then I hit it with the first paragraph of “The Balloon Hoax” (no, not the inevitable tell-all book by Richard and Mayumi Heene). Same result. So I tried “The Angel of the Odd” and I managed to throw it for a loop. IWL thought it was H. P. Lovecraft. Good guess, but no. But this didn’t really prove much other than that its database of authors’ writings wasn’t exhaustive. It was time to up the ante a bit.

First, I revisited some of the Poe snippets it had correctly identified before. If it was doing a sophisticated statistical analysis of things like the frequency of adjectives versus nouns, placement of clauses, etc., then I reasoned that its ability to correctly identify an author would break down as the sample got shorter. On the other hand, if its method relied predominantly on just looking for near matches to strings of text, then it should be able to tell me that Poe is Poe even from just a single sentence. I gave it the first line of Rue Morgue, the rather unremarkable sentence “The mental features discoursed of as the analytical, are, in themselves, but little susceptible of analysis.” IWL still knew who to pin the blame on. Seemed like clear evidence that it was just looking for the closest text match. Just to be sure, I gave it a single sentence from Wodehouse (“He was a pal of my cousin Gussie, who was in with a lot of people down Washington Square way”) but found that it could not be fooled. It was really beginning to look like cheap parlor tricks after all.
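For the curious, here is roughly what the “closest text match” hypothesis would look like in code. This is only a sketch of my guess, not anything IWL has confirmed: the function names, the tokenizer, and the two-sentence “corpus” are all made up for illustration. It scores each author by the longest run of consecutive words the sample shares with that author’s stored text.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def longest_shared_run(sample, reference):
    """Length of the longest run of consecutive words found in both texts."""
    a, b = tokenize(sample), tokenize(reference)
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous word of each text.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def closest_author(sample, corpus):
    """Pick the author whose stored text shares the longest word run."""
    return max(corpus, key=lambda author: longest_shared_run(sample, corpus[author]))


# Hypothetical two-entry "database": just the two sentences quoted above.
CORPUS = {
    "Poe": "The mental features discoursed of as the analytical, are, "
           "in themselves, but little susceptible of analysis.",
    "Wodehouse": "He was a pal of my cousin Gussie, who was in with a lot "
                 "of people down Washington Square way",
}

if __name__ == "__main__":
    print(closest_author("are, in themselves, but little", CORPUS))
```

A matcher like this would need only a handful of words to name the right author, which is why a single sentence seemed like a fair test. Note that it depends entirely on word order.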

But then it occurred to me that just because string-matching was part of its algorithm didn’t necessarily mean it was the only component, or even, in general, the dominant one. Maybe it just does that first in case you write so much like somebody that you’re subconsciously plagiarizing them. Or maybe it does it just to mess with nerds who try to reverse-engineer its inner workings. Either way, I had to try something different. Using this little applet, I scrambled the words of the paragraphs mentioned above into a random order, and then fed them into IWL. All of the results were the same as last time. Score one for the website: if it can correctly recognize Poe by the words he uses, regardless of the order they are in, then it must have at least a few genuinely meaningful metrics of style similarity up its sleeve. Impressively, all of the single sentences it had correctly identified before also still produced the correct results when scrambled. So the program must at least be comparing to a vocab list built from the authors’ writings.
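To see why scrambling is a useful test, consider a pure “bag of words” comparison, which throws away word order entirely. The sketch below is my own illustration, not IWL’s code: the helper names and the choice of cosine similarity as the “distance” are assumptions. The point is only that any score built from word counts cannot change when the words are shuffled.

```python
import math
import random
import re
from collections import Counter


def bag_of_words(text):
    """Count how often each word occurs, ignoring order and punctuation."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 to 1.0)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def scramble(text, seed=0):
    """Shuffle the whitespace-separated words of a text into a random order."""
    words = text.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)


if __name__ == "__main__":
    sentence = ("The mental features discoursed of as the analytical, are, "
                "in themselves, but little susceptible of analysis.")
    reference = bag_of_words(sentence)
    shuffled = scramble(sentence)
    print(shuffled)
    # Same words, same counts, so the similarity to the original is unchanged.
    print(cosine_similarity(bag_of_words(shuffled), reference))
```

If IWL relied only on consecutive-word matching, scrambling should have broken it; a count-based score like this one would not even notice.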

To test this “vocab comparison” theory a little further, I revisited my Wodehouse sentence. Surely, I figured, there was nothing particularly “Wodehousian” about it besides the name “Gussie,” a name so quintessentially Jeeves-and-Wooster that it’s almost painful to type without adding “Fink-Nottle” at the end. Sure enough, with that single word removed, the resulting sentence (“He was a pal of my cousin, who was in with a lot of people down Washington Square way”) comes up not as Wodehouse, but as Raymond Chandler. And I confess, as soon as I re-read it in a slightly nasal Brooklyn accent, I could see exactly why. Just to be on the safe side, I googled “Raymond Chandler” and the sentence above, both with and without quotes. As far as I could tell, nothing exactly like it appears in any of Chandler’s writings, so it must be making its claims off of word analysis alone.
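The “remove one word and see what happens” trick can be automated. Below is a small harness, purely my own sketch, that drops each word in turn and records which removals change the verdict. IWL has no API that I know of, so `toy_classify` is a made-up stand-in that merely mimics the one behavior I observed; you would have to swap in whatever way you have of querying the real thing.

```python
def leave_one_out(sentence, classify):
    """Remove each word in turn; report removals that change the label.

    `classify` is any function mapping a text to an author name.
    Returns the baseline label and a list of (removed_word, new_label).
    """
    words = sentence.split()
    baseline = classify(sentence)
    flips = []
    for i, word in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        label = classify(reduced)
        if label != baseline:
            flips.append((word, label))
    return baseline, flips


def toy_classify(text):
    """Hypothetical stand-in for IWL, NOT its real method.

    It only reproduces the single observation above: the sentence reads
    as Wodehouse with "Gussie" in it and as Chandler without.
    """
    return "Wodehouse" if "gussie" in text.lower() else "Chandler"


if __name__ == "__main__":
    sentence = ("He was a pal of my cousin Gussie, who was in with a lot "
                "of people down Washington Square way")
    baseline, flips = leave_one_out(sentence, toy_classify)
    print(baseline, flips)
```

Run against a real classifier, the list of flips would tell you which individual words are carrying the attribution.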

And this is where I’ve been forced by schedule for the rest of the day to curtail my curiosity. As I said above, this isn’t really conclusive in either direction. Since it can identify a writer correctly by a single, word-for-word sentence, and also by a large paragraph with the words scrambled up, it can’t be relying on simple pattern-matching or on naive vocabulary analysis alone. But as to whether this means it’s using some brilliant third method or simply a dumb combination of the two, I can’t say. Which means it’s up to you. Admit it, you were going to go play around with it anyhow. Here are some thoughts to get you started: what happens when you blend the paragraphs of two authors, alternating sentences? Is it possible to add two of them together like this and wind up with a third? What if you inject a distinctive word or name from one author (like “Gussie,” “Bertie,” or “Jeeves” from Wodehouse) into another author’s writing? How many times do you have to add it before it tips the balance? And, on a different note, how good are you at fooling our friend IWL? If I tell you to take a passage from Margaret Atwood and re-write it as Orwell, can you fool it?

Have fun. If you find anything good, be sure to comment here.

UPDATE: Apparently, Ms. Atwood has actually seen IWL, has tried submitting a sample of her own (unpublished) writing, and was told she writes most like Stephen King. And yes, that’s how slow I am on the uptake with this particular internet sensation: a 71-year-old woman beat me to it.

UPDATE II: In this particular post, apparently I’m writing a lot like Lovecraft.

About Colin West
Colin West is a graduate student in quantum information theory, working at the Yang Institute for Theoretical Physics at Stony Brook University. Originally from Colorado (where he attended college), his interests outside of physics include politics, paper-folding, puzzles, playing-cards, and apparently, plosives.

2 Responses to Who Do You Write Like?

  1. Moominmamma says:

    I took you up on your challenge to play around with the IWL site, and, though I didn’t have a lot of time, I did find some interesting results. I used the first line of Poe’s Rue Morgue, as you did, and played around with changes in it. There seems to be a strong vocabulary component, and part of it is just a simple word match. The more words matched exactly, the more likely the system was to identify it as Poe. For instance, when I broke the Poe sentence into smaller bits, my first sentence, “We shall discuss analyzing mental features” was attributed to Vladimir Nabokov. When I added “They are not very susceptible to analysis”, Poe returned as the author. However, when I substituted different nouns and adjectives in the Poe sentence, the system identified it as from different authors. Thus, “The car wheels discoursed of as the fastest, are, in themselves, but little amenable to speed” was identified as being similar to Dickens, while “The Martian spaceships discoursed of as the most advanced are, in themselves, but little gray machines” was Jonathan Swift. The antiquated verb “discoursed of” seemed to have high priority in the identification scheme, because you’ll notice that regardless of the modernity of the other vocabulary, the authors identified were all from the Victorian period or earlier (Swift 1667-1745, Poe 1809-1849, Dickens 1812-1870). Curiously, when I substituted the word “discussed” for “discoursed of” in the Martian ships example, the system leapt ahead 300 years and identified the style as that of Arthur C. Clarke. He was also the stylist when “vehicles” was substituted for “car wheels” in the “Dickens” sentence described above. These observations lead me to think there is probably some kind of frequency-of-use criterion for vocabulary (at least for nouns) assigned to each author. However, since you rarely write about the same kinds of topics as Poe did, there is probably also a syntactic complexity component.

    You mentioned that the sample you submitted was “flowery”: How do you mean this? If you identify your sample, I might have more ideas about how to tease out the complexities of the system. “Flowery” is definitely an adjective used to describe Victorian writing, referring to all the embedded modifications, and you write with that kind of complexity. I haven’t had time to play with the sentence length/syntactic complexity aspects of it, but I’ll try to get to it. Thanks for the fun diversion!

    P.S. Your previous post is written in the style of H.P. Lovecraft. Send me the “Poe” post so I can try to identify the difference.

  2. Katrina Nilsson-Gorman says:

    HaHA! I’m glad to hear this was such a hit, Mr. Poe.
