Analyzing and classifying educational YouTube videos based on their visual characteristics

Video classification is a fairly common task. Our goal in this post is to explore how we can label different styles of educational videos. The project is part of a larger effort to create a recommendation algorithm for learning materials that is ethical and transparent.

Here we focus on analyzing the video data. Detecting topics automatically is a separate task that we will discuss later.

Our motivation for distinguishing video styles, as we do below, rests on the belief that students have different learning styles, so different types of videos can be useful to different students (or to the same student at different stages of learning). This assumption is partially supported empirically by the large variety of popular educational YouTube videos on the same topic.

Video / data processing

Our focus here is on the style of the video, not the content. Thus, we want to highlight the differences between the visual features of the videos themselves and see how they correspond to other YouTube characteristics such as the number of views and likes.

We use YOLOv5 to detect objects in individual frames sampled at 1 fps. In our experiments on the given set of videos, the most relevant marker turned out to be the ‘person’ class. This is intuitive: some videos show a live lecturer in front of a blackboard, while others show no people at all and rely on background narration only.
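As a rough illustration of the sampling step, the sketch below picks which frame indices to analyze so that about one frame per second is kept. The function name and the rounding rule are assumptions made for this example, not the code used in the project. The detector call itself (YOLOv5 in our case) is left out.

```python
def sample_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> list[int]:
    """Return the indices of frames to analyze.

    Keeps roughly `target_fps` frames per second from a video recorded
    at `native_fps`. Hypothetical helper, for illustration only.
    """
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Distance (in frames) between two analyzed frames; never below 1.
    step = max(1, round(native_fps / target_fps))
    return list(range(0, total_frames, step))


# A 10-second clip recorded at 30 fps, analyzed at 1 fps.
indices = sample_frame_indices(total_frames=300, native_fps=30.0, target_fps=1.0)
print(len(indices), indices[:3])
```

Each selected frame would then be passed to the object detector, and only the detections for the classes of interest (here, ‘person’) would be kept.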

Another important feature is the amount of text on the screen. We used VGG-16 at 0.2 fps to identify text blocks. Again, this is easy to imagine: some videos have very little on-screen text, while others contain long, tedious derivations.

Together, these markers let us estimate how visually busy each video is.

With these markers available for many frames per video, we can assign labels to each video based on how often objects appear and how much text is on the screen.
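One possible way to turn per-frame markers into per-video labels is sketched below. The feature names and the choice of statistics (fraction of frames with a person, mean counts) are assumptions for illustration. The actual set of roughly 60 labels is not reproduced here.

```python
from statistics import mean


def summarize_video(person_counts: list[int], text_block_counts: list[int]) -> dict[str, float]:
    """Collapse per-frame markers into a few per-video features.

    person_counts[i]:     number of 'person' detections in sampled frame i
    text_block_counts[j]: number of text blocks found in sampled frame j
    The two lists may differ in length because the two detectors
    run at different sampling rates.
    """
    if person_counts:
        person_fraction = float(mean(1.0 if c > 0 else 0.0 for c in person_counts))
        mean_persons = float(mean(person_counts))
    else:
        person_fraction, mean_persons = 0.0, 0.0
    mean_text = float(mean(text_block_counts)) if text_block_counts else 0.0
    return {
        "person_frame_fraction": person_fraction,  # how often someone is on screen
        "mean_person_count": mean_persons,         # how many people on average
        "mean_text_blocks": mean_text,             # how text-heavy the frames are
    }


features = summarize_video(person_counts=[0, 1, 1, 2], text_block_counts=[3, 5])
print(features)
```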

As a proxy for how fast people talk in the video, we also added characteristics such as subtitle speed and video length.
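A minimal sketch of a subtitle-speed feature follows, assuming subtitles are available as (start, end, text) cues with times in seconds. The cue format and the decision to divide by spoken time rather than total video length are assumptions for this example.

```python
def words_per_minute(cues: list[tuple[float, float, str]]) -> float:
    """Estimate speaking rate from subtitle cues.

    Each cue is (start_seconds, end_seconds, text). Only the time covered
    by cues is counted, so silent stretches do not lower the rate.
    """
    total_words = sum(len(text.split()) for _, _, text in cues)
    spoken_seconds = sum(max(0.0, end - start) for start, end, _ in cues)
    if spoken_seconds == 0:
        return 0.0
    return total_words * 60.0 / spoken_seconds


cues = [
    (0.0, 3.0, "Let us solve this equation"),
    (3.0, 6.0, "by adding two to both sides"),
]
print(words_per_minute(cues))
```

Dividing by the full video length instead would mix speaking rate with the amount of silence, which may or may not be desirable depending on what the feature is meant to capture.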

Dimensionality reduction

Now we have a lot of labels for each video and our task is to interpret them. Some of these labels are likely to be unimportant or highly correlated with each other. To extract the most important features that set videos apart, we look for a two-dimensional subspace that lets us see how videos cluster based on their properties.

The number of points in this study is too low for using an autoencoder to find a latent space. Thus, we use a simpler, linear algorithm for dimensionality reduction.

We used Linear Discriminant Analysis to reduce the ~60 feature labels we had per video to 2 components for visualization purposes. With this fairly straightforward approach we can immediately see clusters of different styles.
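The projection step can be sketched with scikit-learn as below. Linear Discriminant Analysis is supervised, so it needs a class label per video. The three classes and the random data here are purely synthetic stand-ins, since the labels used for the real dataset are not described in this post.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data: 90 videos x 60 features,
# drawn around three hypothetical class centers.
n_per_class, n_features, n_classes = 30, 60, 3
class_means = rng.normal(scale=2.0, size=(n_classes, n_features))
X = np.vstack([
    rng.normal(loc=center, scale=1.0, size=(n_per_class, n_features))
    for center in class_means
])
y = np.repeat(np.arange(n_classes), n_per_class)  # hypothetical class labels

# LDA can produce at most (n_classes - 1) components, so 3 classes
# are the minimum needed for a 2D embedding.
lda = LinearDiscriminantAnalysis(n_components=2)
embedding = lda.fit_transform(X, y)

print(embedding.shape)
```

The two columns of `embedding` can then be used as the x and y coordinates of a scatter plot, with one point per video.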

To test the validity of the approach, we can select creators with multiple videos in our dataset and color the points accordingly. The image below confirms the expected result that videos from the same creator are visually similar.

We can examine the videos in the clusters to understand how the videos are clustered:

There is a lot of fine structure, but the most obvious trend is the number of people present in the videos. Roughly, the number of people present increases from left to right. In the figure below, the gray region represents videos with a blackboard, the orange region videos with people occasionally appearing in the frame, and the red region videos with people constantly on screen.

The y-axis in these figures corresponds to the ‘complexity’ of the video, which is encoded in the amount of text on the screen, the number of words spoken per minute, etc.


The parameter space we defined here is incomplete because we only considered K-12 math videos, while other educational videos, for instance those involving physical experiments, can be stylistically completely different. This is something we will study in future posts.

Also, we mentioned only two general trends, related to the number of people and to complexity, even though the distribution has more features embedded in it that might be essential for recommendation algorithms. We will discuss them in future studies.

Discussion & next steps

We have shown that the style space of educational videos can be determined automatically with readily available computer vision algorithms, and that the result is easily interpretable by a human.

The islands in this style space represent certain styles of filming and editing. This distribution does not yet tell a user where their preferred or most efficient learning materials are. However, it is a very useful layer of abstraction in our journey of creating an ethical and transparent recommendation algorithm for learning materials.

The visual complexity of a video can help some people select the most appropriate learning material, and it can be essential in cases where cognitive load must be controlled.

This space can also be used for analytical purposes. For instance, we can see which regions of the parameter space are less competitive, or whether there are any ‘missing’ videos of a certain style about a certain concept. In the figure below we color the videos by number of views and see a clear trend for ‘simpler’ videos (at the bottom) to be more popular. This may be a result of the YouTube algorithm or of the presence of a younger audience on YouTube.
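One simple way to look for sparsely populated or ‘missing’ regions is to lay a coarse grid over the 2D style space and count videos per cell. The sketch below does this in plain Python. The grid size, coordinate ranges, and function names are assumptions for illustration, not part of our pipeline.

```python
from collections import Counter


def occupancy_grid(points, bins, x_range, y_range):
    """Count how many videos fall into each cell of a bins x bins grid.

    points:  iterable of (x, y) coordinates in the 2D style space
    x_range, y_range: (min, max) extent of the grid along each axis
    Returns a Counter keyed by (row, col); points outside the ranges
    are clamped to the nearest edge cell.
    """
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    counts = Counter()
    for x, y in points:
        col = min(bins - 1, max(0, int((x - x_min) / (x_max - x_min) * bins)))
        row = min(bins - 1, max(0, int((y - y_min) / (y_max - y_min) * bins)))
        counts[(row, col)] += 1
    return counts


def empty_cells(counts, bins):
    """List the grid cells that contain no videos at all."""
    return [(r, c) for r in range(bins) for c in range(bins) if counts[(r, c)] == 0]


points = [(0.1, 0.1), (0.2, 0.3), (0.9, 0.8)]
grid = occupancy_grid(points, bins=2, x_range=(0.0, 1.0), y_range=(0.0, 1.0))
print(empty_cells(grid, bins=2))
```

Running the same count separately for each topic would show which style and topic combinations have few or no videos.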