You Ask, I Answer: Twitter Bot Detection Algorithms?

Joanna asks, “In your investigation of automated accounts on Twitter, how do you define a bot?”

This is an important question because very often, we will take for granted what a software package’s definitions are. The ONLY way to know what a definition is when it comes to a software model is to look in the code itself.

Can’t see anything? Watch it on YouTube here.

Listen to the audio here:

Download the MP3 audio here.

Machine-Generated Transcript

What follows is an AI-generated transcript. The transcript may contain errors and is not a substitute for watching the video.


In today’s episode, Joanna asks, in your investigation of automated accounts on Twitter, how do you define a bot? So this is a really important question.

A lot of the time when we use software packages that are trying to do detection of something and are using machine learning in it, we have a tendency to just kind of accept the outcome of the software, especially if we’re not technical people.

And it says, this is a bot, this is not a bot, and we kind of accept it. That is really dangerous, because it’s not clear how a model is making its decisions, or what goes into it as it makes its decisions.

How accurate is it? And without that understanding, it’s very easy for things like errors to creep in for bias to creep in.

For all sorts of things to go wrong and we don’t know it.

Because we don’t know enough about what’s going on under the hood to be able to say, Hey, this is clearly not right, except to inspect the outputs.

And then again, if you’re not technical, you are kind of stuck in the situation of either I accept that the outputs are wrong or I find another piece of software.

So, in our Saturday night data parties that we’ve been doing identifying Twitter accounts that may be automated in some fashion, there are a lot of different things that go into it.

Now, this is not my software.

This is software by Michael Kennedy from the University of Nebraska.

It’s open source and free to use; it’s an R package, so it uses the R programming language.

And that means that because it’s free and open source, we can actually go under the hood and inspect what goes into the model and how the model works.

So let’s, let’s move this around here.

If you’re unfamiliar with open source software, particularly uncompiled code: the R programming language is a scripting language, and therefore it is uncompiled. It’s not a binary piece of code, so you can look at not just the documentation, where the author goes through and explains how to use the software.

But if you’re a technical person, you can actually click into the software itself and see what’s under the hood, see what the software uses to make decisions.

And this is why open source software is so powerful: I can go in as another user and see how it works.

How do you work as a piece of software? How are the pieces being put together? And do they use a logic that I agree with? Now, we can have a debate about whether my opinions about how well the software works should be part of the software, but at the very least, I can know how this works.

So let’s go into the features.

And every piece of software is going to be different.

This is just this particular author’s syntax and he has done a really good job with it.

We can see the data it’s collecting.

If I scroll down here: time since the last tweet, time of day, the number of retweets, the number of quotes, all these things, the different clients the account uses, tweets per year, years on Twitter, friends count, followers count, ratios.

Many of these are numeric features that the software is going to tabulate, essentially creating a gigantic numerical spreadsheet.
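To make the "gigantic numerical spreadsheet" idea concrete, here is a small illustrative sketch in Python. This is not the package's actual code (the real package is in R and uses many more features); the field names and features here are assumptions for illustration only.

```python
# Illustrative only: turn raw tweet metadata into one row of numeric features.
from datetime import datetime
from statistics import mean

def featurize(tweets):
    """tweets: list of dicts with 'created_at' (datetime), 'is_retweet', 'is_quote'."""
    times = sorted(t["created_at"] for t in tweets)
    # seconds between consecutive tweets
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return {
        "n_tweets": len(tweets),
        "retweet_ratio": sum(t["is_retweet"] for t in tweets) / len(tweets),
        "quote_ratio": sum(t["is_quote"] for t in tweets) / len(tweets),
        "mean_gap_seconds": mean(gaps) if gaps else 0.0,
        "mean_hour_of_day": mean(t["created_at"].hour for t in tweets),
    }

tweets = [
    {"created_at": datetime(2020, 1, 1, 9), "is_retweet": True, "is_quote": False},
    {"created_at": datetime(2020, 1, 2, 9), "is_retweet": False, "is_quote": False},
    {"created_at": datetime(2020, 1, 3, 9), "is_retweet": True, "is_quote": True},
]
row = featurize(tweets)
```

Run over every account, rows like this become the numeric table the classifier is trained on.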

And then it’s going to use an algorithm called gradient boosting machines to attempt to classify whether or not an account is likely a bot based on some of these features. And there are actually two sets of features.
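If gradient boosting is unfamiliar, here is a miniature sketch of the idea in Python. This is purely illustrative and nothing like the package's actual implementation: it fits a sequence of weak learners (here, one-feature decision stumps) where each new learner is trained on the residual error of the model so far. The toy feature and labels are invented.

```python
# Illustrative gradient boosting with two-leaf regression stumps.
def fit_stump(x, residuals):
    """Find the two-leaf split that best fits the residuals (squared error)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi <= t else rm

def gradient_boost(x, y, rounds=20, lr=0.3):
    """Each round fits a stump to the remaining residuals and adds it in."""
    pred = [0.0] * len(y)
    stumps = []
    for _ in range(rounds):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * s(xi) for s in stumps)

# Hypothetical single feature (variance of inter-tweet gaps, in hours);
# label 1 = bot, 0 = human. Scores near 1 mean "probably automated."
x = [0.1, 0.2, 0.15, 5.0, 4.2, 6.1]
y = [1, 1, 1, 0, 0, 0]
score = gradient_boost(x, y)
```

The real model does this over hundreds of features at once, but the sequential residual-fitting structure is the same idea.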

There’s that initial file, and then there’s another file that looks at things like sentiment and tone, the use of different emotions and emotional keywords, and the range of what’s called emotional valence within an author’s tweets.

So if you’re sharing, for example, in an automated fashion, a particular point of view, let’s say it’s propaganda for the fictional state of Wadiya from the movie The Dictator, and you are just promoting Admiral General Aladeen over and over and over again, you’re going to have a very narrow range of emotional expression, right? And there’s a good chance you’re going to use one of these pieces of scheduling software, and a good chance that you will have automated posting on a certain time interval.
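A toy sketch of the emotional valence range idea, with a tiny made-up word lexicon (real sentiment lexicons are far larger; everything here is a hypothetical stand-in): score each tweet, then measure how wide the spread of scores is. Repeating the same message produces a very narrow range.

```python
# Hypothetical mini-lexicon: word -> valence score.
LEXICON = {"great": 2, "love": 2, "good": 1, "bad": -1, "hate": -2, "awful": -2}

def valence(text):
    """Sum the valence of known words; unknown words score 0."""
    return sum(LEXICON.get(w, 0) for w in text.lower().split())

def valence_range(tweets):
    """Spread between the most positive and most negative tweet."""
    scores = [valence(t) for t in tweets]
    return max(scores) - min(scores)

propaganda = ["great leader great nation"] * 4           # same message on repeat
human = ["love this", "awful day today", "good game", "hate traffic"]
```

The propaganda account's range collapses to zero, while the human account swings across the scale; that narrowness is one of the bot-like signals described here.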

And those are all characteristics that this model is looking for to say, you know what, this looks kind of like an automated account: your posts are at the same time every single day.

The amount of time between tweets is the exact same amount each time.

The emotional range and the content are all very narrow, almost all the same: probably a bot. That’s as opposed to the way a normal human user functions, where the space between tweets is not regular, because you’re interacting and participating in conversations, and the words you use and the emotions and the sentiment of those words are going to vary, sometimes substantially, because somebody may anger you or somebody may make you really happy.

And that will be reflected in the language that you use.
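The tweet-spacing signal described above can be sketched with one statistic. A scheduled account posts at nearly identical intervals, so the spread of its inter-tweet gaps is tiny relative to the average gap, while a human's gaps are bursty. The coefficient of variation (standard deviation divided by mean) is one simple way to quantify that; the numbers and thresholds here are invented for illustration, not taken from the package.

```python
from statistics import mean, pstdev

def gap_regularity(gap_seconds):
    """Coefficient of variation of inter-tweet gaps; lower = more machine-like."""
    return pstdev(gap_seconds) / mean(gap_seconds)

scheduled = [3600, 3600, 3601, 3599]   # posts almost exactly every hour
human = [120, 5400, 40, 86400, 900]    # replies, bursts, overnight silence

scheduled_cv = gap_regularity(scheduled)
human_cv = gap_regularity(human)
```

The scheduled account's score sits near zero; the human's is orders of magnitude higher.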

And so the way the software works is essentially by quantifying all these different features, hundreds of them, and then using this machine learning technique, gradient boosting machines, to build sequential models of how likely each one is a contributor to a bot-like outcome. How regularly is this data spaced apart? Now the question is, once you know how the model works, do you agree with it? Do you agree that all these different characteristics are relevant?

Do you agree that all of these are important? In going through this, I have seen some things where I thought, I don’t agree with that.

Now, here’s the really cool part about open source software: I can take the software and what’s called fork it, basically make a variant of it that is mine.

And I can make changes to it.

So there are, for example, some Twitter clients in here that aren’t really used anymore, like the companies that made them have gone out of business.

So you won’t be seeing those in current-day tweets, but we still want to leave those in for historical Twitter data.

But I also want to go into Twitter now, pull a list of the most common Twitter clients being used today, and make sure that they’re accounted for in the software, make sure that we’re not missing features that could help us identify bots. One of the things I saw in the model itself: they made a very specific choice about the number of cross-validation folds in the gradient boosted tree.

If that was just a bunch of words to you: cross-validation is basically trying over and over again, running the experiment multiple times to see, is the result substantially similar to what happened the last time? Or is there wide variance, like, hey, it seems like what happened these two or three times, or however many times, was random chance, and is not a repeatable result.
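Here is a minimal sketch of k-fold cross-validation as just described: split the data into k slices, hold out a different slice each round, train on the rest, and check whether the held-out scores are substantially similar. The model and data below are toy stand-ins, not the package's actual classifier or fold count.

```python
import statistics

def k_fold_scores(data, labels, k, train_and_score):
    """Run the experiment k times, each with a different held-out slice."""
    scores = []
    for i in range(k):
        test_idx = set(range(i, len(data), k))  # every k-th row held out
        train = [(d, l) for j, (d, l) in enumerate(zip(data, labels)) if j not in test_idx]
        test = [(d, l) for j, (d, l) in enumerate(zip(data, labels)) if j in test_idx]
        scores.append(train_and_score(train, test))
    return scores

# Toy model: predict "bot" when the feature exceeds the training mean.
def train_and_score(train, test):
    threshold = sum(d for d, _ in train) / len(train)
    correct = sum((d > threshold) == l for d, l in test)
    return correct / len(test)

data = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]
labels = [False, False, False, False, True, True, True, True]
scores = k_fold_scores(data, labels, 4, train_and_score)
spread = statistics.pstdev(scores)
```

A small spread across folds suggests a repeatable result; a wide spread suggests the outcome may be random chance. More folds give more evidence of stability, which is why the fold count is a choice worth scrutinizing.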

They use a specific number in the software; I think it’s a little low, and I would tune that up in my own version.

And then what I would do is submit that back to the author as what’s called a pull request, and say, hey, I made these changes.

What do you think? And the author may go, yep, I think that’s a sensible change.

Yep, I think that Twitter client should be included.

No, I disagree with you about how many iterations we need, or how many trees we need, or how many cross-validation folds we need.

And that’s the beauty of open source software: I can contribute to it and make those changes.

But to Joanna’s original question:

This is how we define a bot.

Right? The software has an algorithm in it, and an algorithm, as my friend Tom Webster says, is data plus opinions, the choices we make.

And so by being able to deconstruct the software and see the choices that were made, the opinions that were encoded into code and the data that it relies on, we can say, yes, this is a good algorithm, or no, this algorithm could use some work.

So that’s how we define a bot here.

Maybe in another Saturday night data party we’ll actually hack on the algorithm some and see if it comes up with different results.

I think that would be a fun, very, very, very, very technical Saturday night party.

But it’s a good question.

It’s a good question, I would urge you to ask all of the machine learning systems that you interact with on a regular basis, all the software you interact with on a regular basis.

Is there a bias? Is there an opinion being expressed by the developer? What is it, and do you agree with it? Does it fit your needs? And if it doesn’t, you may want to consider a solution like open source software, where you can customize it to the way you think the system should function.

So good question.

If you have follow-up questions, leave them in the comments box below.

Subscribe to the YouTube channel and the newsletter.

I’ll talk to you soon.

Take care.

Want help solving your company’s data analytics and digital marketing problems? Visit TrustInsights.ai today and let us know how we can help you.




Want to read more like this from Christopher Penn? Get updates here:

subscribe to my newsletter here


AI for Marketers Book
Get your copy of AI For Marketers

Analytics for Marketers Discussion Group
Join my Analytics for Marketers Slack Group!

