Rider recognition for synchronising Flemish cycling heritage with races on TV (Rennerherkenning voor synchronisatie van Vlaams Wielererfgoed met koers op TV)

Student: Simon De Smul
Programme: Master of Science in de industriële wetenschappen: informatica (Master of Science in Industrial Sciences: Informatics)
Abstract: Simon De Smul, Master of Science in de industriële wetenschappen: informatica, Universiteit Gent. Abstract belonging to the master thesis, submitted 21 August 2019. Rennerherkenning voor synchronisatie van Vlaams Wielererfgoed met koers op TV. The goal of this master thesis is to recognise cycling teams in video footage, more specifically in the broadcast of a cycling race as aired by a given broadcaster. This was achieved by designing and training a Convolutional Neural Network (CNN). A CNN is a kind of Artificial Neural Network (ANN). An ANN is best described as a collection of algorithms that work together to process a quantity of complex data. In the case of a CNN, this complex data is a set of annotated images, in which each (image, annotation) pair is analysed in turn. In the first phase of this master thesis the CNN was trained to classify images; in the second phase the approach shifted to regression. This transition is discussed in more detail later in this paper. The images were derived from the video footage by periodically saving a frame of the footage as an image. These images were then used as input for the CNN, which indicated which cycling team(s) was/were visible at that moment. Keywords describing the subject: artificial intelligence, machine learning, deep learning, image classification
Abstract (Eng): Extended abstract

Simon De Smul
Supervisor(s): Prof. Dr. Steven Verstockt, Prof. Dr. Nico Van de Weghe

Abstract: This article explains the goal of this master thesis,

Keywords: Machine learning, Convolutional Neural Networks, Deep learning, Artificial intelligence

I. INTRODUCTION

The sport of cycling is not what it used to be. The times when cyclists such as Eddy Merckx, Bernard Hinault or Fausto Coppi were treated as demigods have passed. There are still great athletes such as Chris Froome or Tom Boonen, yet they do not seem to equal their predecessors in fame. Young people seem to have lost interest in the sport and cycling tourism is on the decline. Cycling museums, too, seem to have lost their appeal to the public. But what if such a museum could show its impressive collection in a different, perhaps more modern way? Cycling museum “KOERS” in Roeselare decided to do just that. In collaboration with IDLab Ghent, a project was proposed to digitize the collection and present it to the public. The idea is as follows. The position of the lead group or lead rider of the race is monitored at every moment. When this lead group or rider is shown to the audience, a piece of the collection that is relevant to that position is shown as well.

II. AIM OF MASTER THESIS

This master thesis proposes the use of a Convolutional Neural Network (CNN) to detect riders in frames of a television broadcast of a given race. If the detected riders match the description of the riders in the lead group, information should be shown. The riders are detected based on the team jersey; as such, each rider is assigned to a cycling team.

III. METHODOLOGY AND RESULTS

During the course of the master thesis, both classification and regression techniques were applied to process the data.

A. Classification

The classification model that was used underwent small changes due to disappointing results.
These changes ranged from the number of layers in the model to the number of nodes assigned to each layer. Another change, which resulted in an enormous improvement in model accuracy, was the activation function chosen for the last layer of the model. Initially the Rectified Linear Unit (ReLU) function was used; it was replaced with the sigmoid function. The sigmoid function was chosen because its output always lies between zero and one, which made it a suitable choice for the given problem [5]. Figure 1 shows the difference in accuracy between a model using the ReLU activation function and one using the sigmoid activation function.

Figure 1: Difference in model accuracy due to the change in activation function

The input data of the model also underwent changes. In a first stage the input data consisted of images collected from different data sources. Among these sources were Flickr and Google Images, as well as manually collected images. In a second stage, the individual images were processed using You Only Look Once (YOLO), an object detection framework accessible in Python. YOLO was used to detect the individual cyclists in the images and discard the background. As a result, the quality of the dataset improved dramatically. Figure 2 shows an example of cyclist detection using YOLO.

Figure 2: Cyclist detection using YOLO

As figure 2 shows, cyclist detection consisted of detecting a person and a bike. If the relative position of these two detected objects resembled the relative position of a cyclist and their bike, the person was assumed to be a cyclist. Using the coordinates of the bounding box of the cyclist, shown in red in figure 2, the image was cropped to contain only that part of the image. The best result obtained using classification techniques was a model accuracy of 73.91%.
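The person-and-bike pairing described above can be sketched as follows. This is a minimal illustration, not the code used in the thesis: the `Box` type, the labels `"person"` and `"bicycle"`, the `min_overlap` threshold and the geometric rule itself (bike centre below the person centre, with enough horizontal overlap) are all assumptions. The text only states that the relative position had to resemble that of a cyclist and a bike. Detections are represented as plain labelled boxes instead of actual YOLO output.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Axis-aligned bounding box in pixel coordinates (origin at top-left)."""
    label: str
    x1: float
    y1: float
    x2: float
    y2: float


def horizontal_overlap(a: Box, b: Box) -> float:
    """Fraction of the narrower box's width that overlaps the other box."""
    overlap = min(a.x2, b.x2) - max(a.x1, b.x1)
    narrower = min(a.x2 - a.x1, b.x2 - b.x1)
    return max(0.0, overlap) / narrower if narrower > 0 else 0.0


def is_cyclist(person: Box, bike: Box, min_overlap: float = 0.5) -> bool:
    """Assumed heuristic: the bike sits under the person and the boxes overlap."""
    person_cy = (person.y1 + person.y2) / 2
    bike_cy = (bike.y1 + bike.y2) / 2
    boxes_meet_vertically = bike.y1 < person.y2
    return (
        horizontal_overlap(person, bike) >= min_overlap
        and bike_cy > person_cy
        and boxes_meet_vertically
    )


def cyclist_boxes(detections: List[Box]) -> List[Box]:
    """Keep only the person boxes that can be paired with a bicycle box."""
    people = [d for d in detections if d.label == "person"]
    bikes = [d for d in detections if d.label == "bicycle"]
    return [p for p in people if any(is_cyclist(p, b) for b in bikes)]


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Crop a row-major image (a list of pixel rows) to the given box."""
    rows = image[int(box.y1):int(box.y2)]
    return [row[int(box.x1):int(box.x2)] for row in rows]
```

With a real image library the `crop` helper would be replaced by that library's own slicing or cropping call; the nested-list version only keeps the sketch free of dependencies.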
In the hope of improving this accuracy and gaining more insight into the inner workings of the model, the decision was made to transition from classification to regression.

B. Regression

As a result of the shift from classification to regression, the model no longer predicted a single class per image but a collection of probabilities, each representing the likelihood that the image contained a rider of the corresponding class. The regression model also underwent a few changes; these were, however, not always related to disappointing results. Most of them were made in the hope of improving results whose accuracy was already acceptable. Examples of such changes are the number of layers in the model, the activation functions applied and the kernel dimensions. The decision to transition to regression went hand in hand with the decision to further manipulate the dataset in order to improve the accuracy of the model. To implement this, OpenPose was used. OpenPose, a human pose estimation library accessible in Python, allows for detection of human posture in an image.

Figure 3: Human posture detection using OpenPose

As shown in figure 3, OpenPose overlays the image with a skeleton corresponding to the person shown. This skeleton, more specifically its keypoints and their coordinates, was used to further crop the image to the upper body of the cyclist. This ensured that the dataset now contained cropped images showing only the essential part of the original images. After applying the changes to both the model and the dataset, an optimal accuracy of 79.23% was reached. In an effort to improve this result, the model and dataset were analyzed further. Because the model had already been modified in multiple ways without any improvement worth mentioning, the result could only be improved by modifying the dataset.
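The upper-body crop derived from pose keypoints, as described above, can be sketched as follows. This is an illustration under stated assumptions, not the thesis implementation: the keypoint names, the `(x, y, confidence)` tuple format, the choice of neck, shoulders and hips as the upper body, and the `min_conf` and `margin` parameters are all hypothetical. The actual OpenPose output format and the keypoints used in the thesis may differ.

```python
from typing import Dict, Optional, Tuple

Keypoint = Tuple[float, float, float]  # (x, y, confidence), assumed format

# Assumed set of keypoints that delimit the upper body.
UPPER_BODY = ("neck", "right_shoulder", "left_shoulder", "right_hip", "left_hip")


def upper_body_box(
    keypoints: Dict[str, Keypoint],
    image_size: Tuple[int, int],
    min_conf: float = 0.1,
    margin: float = 0.1,
) -> Optional[Tuple[int, int, int, int]]:
    """Return (x1, y1, x2, y2) around the confident upper-body keypoints.

    Returns None when fewer than two usable keypoints are available.
    """
    width, height = image_size
    points = []
    for name in UPPER_BODY:
        x, y, conf = keypoints.get(name, (0.0, 0.0, 0.0))
        if conf >= min_conf:
            points.append((x, y))
    if len(points) < 2:
        return None

    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Pad the box by a fraction of its size so the jersey is not cut off.
    pad_x = margin * (max(xs) - min(xs))
    pad_y = margin * (max(ys) - min(ys))
    x1 = max(0, int(min(xs) - pad_x))
    y1 = max(0, int(min(ys) - pad_y))
    x2 = min(width, int(max(xs) + pad_x) + 1)
    y2 = min(height, int(max(ys) + pad_y) + 1)
    return x1, y1, x2, y2
```

The returned box is clamped to the image borders, so it can be passed directly to whichever cropping routine the image library provides.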
The dataset was then transformed into four different datasets, each containing a different set of images. Three of these datasets contained images showing cyclists in the same pose. These three poses, called ‘front pose’, ‘back pose’ and ‘rest pose’, covered cyclists shown in front view, in back view and in all other possible views, respectively. The fourth dataset contained the combination of all images, divided into three times the number of classes. This allowed every team to be translated into three pose-specific versions of that team. After training the model on the combined dataset, the accuracy shown in figure 4 was achieved.

Figure 4: Model accuracy on the pose-specific dataset

Remarkably, this accuracy was lower than the one previously achieved. This was unexpected, because pose-specific datasets should rule out a loss in accuracy caused by cyclists being shown in different poses. The loss in accuracy, estimated at 15%, was attributed to a lack of data.

IV. CONCLUSION AND DISCUSSION

The goal of this master thesis was the design and training of a CNN that would allow the end user to detect cyclists in any given frame of a television broadcast of a cycling race, based on the team jersey. Although this was realized by training a model with an accuracy of 79.23%, it has to be acknowledged that the accuracy could be improved if more data were gathered. Independent of possible improvements, the model suffices for use in future applications. An example of such an application is using the model during a race to determine which teams were shown most during the broadcast.

V. ACKNOWLEDGEMENTS

I would like to thank Prof. Dr. Steven Verstockt and Prof. Dr. Nico Van de Weghe for providing me with a clear project structure during the course of this master thesis. I would like to give special thanks to Prof. Dr. Steven Verstockt for recommending different data manipulation techniques at crucial times during the course of the master thesis.
In particular I would also like to thank ir. Jelle De Bock for offering advice on the subject of deep learning. Finally I would also like to thank Thomas Ameye, museum coordinator of cycling museum KOERS, for the enormous hospitality he has shown.

REFERENCES

[1] C. Enyinna Nwankpa, W. Ijomah, A. Gachagan and S. Marshall, “Activation Functions: Comparison of Trends in Practice and Research for Deep Learning,” 2018.