Data labeling techniques

Crowdsourcing:
Amazon Mechanical Turk is a good example of crowdsourcing. The annotations it provides are done by freelancers.

Outsourcing:
Similar to crowdsourcing, except that you look for the workers yourself, for instance by visiting freelancer sites.

Synthetic labeling:
This approach entails generating data that imitates real data. The data is produced by a generative model trained and validated on an original dataset.
Three types of generative models: generative adversarial networks (GANs), autoregressive models and variational autoencoders (VAEs).
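
As a rough illustration of the fit-then-sample idea, the sketch below swaps in a deliberately simple stand-in for a GAN, autoregressive model or VAE: one Gaussian per label, fitted on the original data and then sampled to produce new labeled rows. The dataset, the labels and the function names (fit_class_gaussians, sample_synthetic) are all hypothetical and chosen only for this example.

```python
import random
import statistics
from collections import defaultdict


def fit_class_gaussians(rows):
    """Fit one Gaussian per label. rows is a list of (value, label) pairs."""
    by_label = defaultdict(list)
    for value, label in rows:
        by_label[label].append(value)
    # Each label needs at least two values so that stdev is defined.
    return {
        label: (statistics.mean(values), statistics.stdev(values))
        for label, values in by_label.items()
    }


def sample_synthetic(params, n_per_label, seed=0):
    """Draw n_per_label synthetic (value, label) pairs from each fitted Gaussian."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    synthetic = []
    for label, (mean, stdev) in params.items():
        for _ in range(n_per_label):
            synthetic.append((rng.gauss(mean, stdev), label))
    return synthetic


# Hypothetical original dataset: transaction amounts with a known label.
original = [
    (12.0, "legit"), (15.5, "legit"), (9.8, "legit"), (14.1, "legit"),
    (950.0, "fraud"), (1200.0, "fraud"), (870.0, "fraud"),
]

params = fit_class_gaussians(original)
synthetic = sample_synthetic(params, n_per_label=5)
```

Every synthetic row comes out already labeled, because the label is chosen before the value is sampled. A real generative model would replace the per-label Gaussian, but the two steps (fit on the original data, then sample) stay the same.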

Advantage:
It saves time and protects data privacy (for instance, fintech companies generate synthetic data to train models for fraud detection, and medical researchers generate data to save time and protect clients' privacy).

Disadvantage:
Huge computational resources are required (renting cloud services can cover this). Synthetic data is not real data, so the model will need to be trained on real data as soon as it becomes available, in order to improve it.

Data Programming

Write scripts that create labels and data programmatically. Because of noise, the results can be less accurate than manually labeled data, so this approach is used for weak supervision.
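
A minimal sketch of what such scripts could look like, assuming a text-classification task: each labeling function encodes one noisy rule and may abstain, and the votes are combined by simple majority. The rules, the label names ("complaint", "praise") and the function names are hypothetical; majority voting is only one possible way to combine the votes.

```python
from collections import Counter

ABSTAIN = None  # returned when a rule has no opinion about the input


def lf_mentions_refund(text):
    """Noisy rule: texts that mention a refund are probably complaints."""
    return "complaint" if "refund" in text.lower() else ABSTAIN


def lf_mentions_terrible(text):
    """Noisy rule: the word 'terrible' suggests a complaint."""
    return "complaint" if "terrible" in text.lower() else ABSTAIN


def lf_mentions_thanks(text):
    """Noisy rule: texts that say thanks are probably praise."""
    return "praise" if "thank" in text.lower() else ABSTAIN


LABELING_FUNCTIONS = [lf_mentions_refund, lf_mentions_terrible, lf_mentions_thanks]


def weak_label(text):
    """Combine the labeling functions by majority vote, ignoring abstentions."""
    votes = [lf(text) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote is not ABSTAIN]
    if not votes:
        return ABSTAIN  # no rule fired, so the sample stays unlabeled
    # On a tie, Counter keeps the label that was voted for first.
    return Counter(votes).most_common(1)[0][0]


unlabeled = [
    "I want a refund, this is terrible",
    "Thank you for the quick delivery",
    "Where is my order?",
]
weak_labels = [weak_label(text) for text in unlabeled]
```

Samples on which every rule abstains stay unlabeled, and the labels that are produced inherit the noise of the rules, which is why they are treated as weak supervision rather than ground truth.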

Objective

Take an unlabeled dataset as input, together with a list of its representative labels, and label it.

Issue

Datasets vary widely in structure, content, category and attribute type. We need to specialize in one of these first, then assess feasibility.


Other issues

  • We will need to study the demand for each data type, category and required feature, in order to automatically generate and label a representative
  • Most of the work would go into creating single data samples for the client's approval first, before moving on to generating a batch automatically