Lately, I have been reading about YOLO and this note is about how the annotations are stored for training YOLO.

Each image in the dataset should have a corresponding txt file. The bounding box in the image is represented by each line in the text file. The syntax of the line is as follows:

class_id x y w h

x and y are coordinates of the mid point of the bounding box. w and h are the width and height of the bounding box. The values for x, y, w and h are expressed relative to the image(ratio).

Let’s say, im_w and im_h are the width and height of an image, and (x_min, y_min) and (x_max, y_max) are two diagonally opposite coordinates of a bounding box. To convert them into YOLO metrics, we find the midpoint of the bounding box- ((x_min + x_max) / 2 , (y_min + y_max) / 2). The width and height of bounding box is given by x_max - x_min and y_max - y_min. Then to express the values relative to image, we divide these values by im_w and im_h.

x = (x_min + x_max) / (2 * im_w)

y = (y_min + y_max) / (2 * im_h)

w = (x_max - x_min) / im_w

h = (y_max - y_min) / im_h

Below is the Python code.

def convert(im_w, im_h, x_min, x_max, y_min, y_max):
    dw = 1./im_w
    dh = 1./im_h
    x = (x_min + x_max)/2.0
    y = (y_min + y_max)/2.0
    w = x_max - x_min
    h = y_max - y_min
    x = x*dw
    w = w*dw
    y = y*dh
    h = h*dh
    return (x,y,w,h)

def deconvert(im_w, im_h, x, y, w, h):
    ox = float(x)
    oy = float(y)
    ow = float(w)
    oh = float(h)
    x = ox*im_w
    y = oy*im_h
    w = ow*im_w
    h = oh*im_h
    xmax = (((2*x)+w)/2)
    xmin = xmax-w
    ymax = (((2*y)+h)/2)
    ymin = ymax-h
    return [int(xmin),int(ymin),int(xmax),int(ymax)]

Manivannan has shared an annotation tool and wrote about how to use it. Please note that his tool is in Python 2 and with some edits will work on Python 3.