I initially thought that myself, but this is covered in end note [4]:
"[4] I think I once heard of an MSN chat transcript dataset that was really awesome, but I can't seem to find mention of it anymore. Let me know if you know where I can find that or any other instant message datasets. I know that some IRC rooms get publicly logged — is there a single place where one could grab all of them at once?"
I wouldn't assume you have to grab an existing log; wouldn't it be just like Twitter? (i.e., just "follow" a bunch of conversations/channels and build your own corpus by "tapping into the firehose").
You could use pre- or post-filtering to weed out connect/disconnect messages and other noise.
I only thought of this because there were like two "TWSS" replies in the past couple of days on #rubyonrails.
You could write a simple client that just ignores connect/disconnect messages from the server.
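To make the filtering idea concrete, here is a minimal sketch in Python, assuming the client already receives raw lines in the standard IRC wire format (`:nick!user@host COMMAND params :trailing`). The function names `parse_irc_line` and `chat_messages` are made up for illustration, and the sample lines are invented. It keeps only `PRIVMSG` lines, so JOIN/PART/QUIT and server PINGs drop out on their own.

```python
def parse_irc_line(line):
    """Split one raw IRC line into (prefix, command, params)."""
    line = line.rstrip("\r\n")
    prefix = ""
    if line.startswith(":"):
        # The optional prefix identifies the sender: nick!user@host
        prefix, _, line = line[1:].partition(" ")
    # The trailing parameter starts at the first " :" and may contain spaces.
    head, sep, trailing = line.partition(" :")
    parts = head.split()
    if not parts:
        return prefix, "", []
    params = parts[1:]
    if sep:
        params.append(trailing)
    return prefix, parts[0].upper(), params


def chat_messages(lines):
    """Yield (nick, channel, text) for chat lines only, skipping everything else."""
    for line in lines:
        prefix, command, params = parse_irc_line(line)
        if command != "PRIVMSG" or len(params) < 2:
            continue  # JOIN, PART, QUIT, PING, numerics, etc. are ignored
        nick = prefix.split("!", 1)[0]
        yield nick, params[0], params[1]


if __name__ == "__main__":
    sample = [
        ":alice!a@host JOIN :#rubyonrails\r\n",
        ":alice!a@host PRIVMSG #rubyonrails :that's a long migration\r\n",
        ":bob!b@host QUIT :Ping timeout\r\n",
    ]
    for nick, channel, text in chat_messages(sample):
        print(channel, nick, text)
```

The same function works as a post-filter over a saved raw log, since it only needs an iterable of lines.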
Reading this post, I remembered a small project called sociograph that a friend of mine created. It's an IRC bot that logs messages and draws a graph of the people communicating with each other in real time; for an example, see http://www.youtube.com/watch?v=A_ah-SE-cNY.
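I don't know how sociograph actually decides who is talking to whom, but one common heuristic is to treat a message that starts with `nick:` or `nick,` as addressed to that nick. A rough sketch in Python, assuming you already have `(nick, text)` pairs from a log (`reply_edges` is a hypothetical name, and this works on a finished batch rather than in real time):

```python
import re
from collections import Counter

# A leading token followed by ':' or ',' and whitespace, e.g. "bob: hi".
ADDRESS = re.compile(r"^([^\s:,]+)[:,]\s")


def reply_edges(messages):
    """Count directed (speaker, addressee) pairs in a list of (nick, text)."""
    messages = list(messages)
    known = {nick for nick, _ in messages}  # only nicks that spoke in this batch
    edges = Counter()
    for nick, text in messages:
        match = ADDRESS.match(text)
        if match and match.group(1) in known and match.group(1) != nick:
            edges[(nick, match.group(1))] += 1
    return edges


if __name__ == "__main__":
    log = [
        ("alice", "bob: did you run the migration?"),
        ("bob", "alice, yes"),
        ("carol", "hello all"),
    ]
    for (speaker, addressee), count in reply_edges(log).items():
        print(speaker, "->", addressee, count)
```

The edge counts could then be handed to any graph-drawing tool; nicks are compared case-sensitively here, which a real bot would probably want to relax.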