[Teachable NLP Challenge] What should I consider before acquiring data?

I don't usually collect a whole dataset by myself. Are there any recommendations or guidelines I should follow (e.g., regarding copyright)?