Cleaning Twitter location data
I came across an interesting data pre-processing problem while looking at the Pfizer vaccination tweets dataset from Kaggle. In this dataset, the user location column contains (1) hashtags, (2) irrelevant entries like “Your Bed” and “Global”, and (3) inconsistent location formats such as (country), (state, country), (country, country), and (country code). Here’s a snippet of what the data looks like:
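Since the original snippet is an image, here is a small hypothetical DataFrame (my own made-up rows, not the actual Kaggle data) that mirrors the kinds of values you’ll find in the `user_location` column:

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's user_location column
loc_df = pd.DataFrame({
    "user_location": [
        "New York, USA",   # (state, country)
        "India",           # (country)
        "Your Bed",        # irrelevant entry
        "#StaySafe",       # hashtag
        "UK",              # country code
    ]
})
print(loc_df)
```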
First, let’s remove those rows containing hashtags.
loc_df = loc_df[~loc_df.user_location.str.contains("#", na=False)]  # na=False keeps rows with missing locations from raising errors
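As a quick sanity check on a hypothetical two-row frame (plus a missing value), the filter keeps only rows whose location does not contain a “#”:

```python
import pandas as pd

df = pd.DataFrame({"user_location": ["#CovidVaccine", "London, UK", None]})
# na=False treats missing locations as "no hashtag", so they are kept rather than raising
filtered = df[~df.user_location.str.contains("#", na=False)]
print(filtered.user_location.tolist())  # ['London, UK', None]
```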
Next, we can remove the irrelevant words using named entity recognition (the `nlp` object below is spaCy’s language pipeline). If spaCy finds no entity in a location string, we replace it with a missing value so the row can be dropped later.
# Requires: import spacy; import numpy as np; nlp = spacy.load("en_core_web_sm")
loc_df['extracted_user_location'] = loc_df['user_location'].apply(
    lambda x: list(nlp(x).ents) or np.nan  # return NaN if no entities are found
)
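The pattern above (keep the entity list, or NaN when nothing is recognized) can be sketched without loading a full spaCy model by swapping in a stand-in extractor — `fake_ents` below is a hypothetical placeholder for `nlp(x).ents`:

```python
import numpy as np
import pandas as pd

def fake_ents(text):
    """Stand-in for spaCy's nlp(text).ents: pretend only known places are entities."""
    places = {"Toronto", "Canada", "London"}
    return [w.strip() for w in text.split(",") if w.strip() in places]

df = pd.DataFrame({"user_location": ["Toronto, Canada", "Your Bed"]})
df["extracted_user_location"] = df["user_location"].apply(
    lambda x: fake_ents(x) or np.nan  # NaN when no entity is recognized
)
# Rows with no recognized entity can then be dropped
df = df.dropna(subset=["extracted_user_location"])
print(df["extracted_user_location"].tolist())  # [['Toronto', 'Canada']]
```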
Lastly, the inconsistent formatting makes it hard to extract the location information directly. I’ve attempted to extract the location coordinates using the GeoPy library instead. Although this method is not 100% accurate, it gives back reasonably good results. Note that this step will take a long time if you have a lot of data, since each lookup is rate-limited.
# Ref: https://geopy.readthedocs.io/en/stable/#usage-with-pandas
from geopy.extra.rate_limiter import RateLimiter
from geopy.geocoders import Nominatim
from tqdm import tqdm

tqdm.pandas()  # enables progress_apply

geolocator = Nominatim(user_agent="my-application")
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=3, max_retries=5)

# Some locations are in Hindi or Chinese; language="en" returns them in English
loc_df['location'] = loc_df['user_location'].progress_apply(geocode, language="en")
loc_df['coordinates'] = loc_df['location'].apply(lambda loc: tuple(loc.point) if loc else None)
# loc[0] is the full address string returned by Nominatim
loc_df['state'] = loc_df['location'].apply(lambda loc: loc[0].split(',')[0].strip() if loc else None)
loc_df['country'] = loc_df['location'].apply(lambda loc: loc[0].split(',')[-1].strip() if loc else None)
loc_df
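The state/country split simply takes the first and last comma-separated pieces of the geocoded address string. On a hypothetical Nominatim-style address it behaves like this:

```python
# Hypothetical address string of the kind Nominatim returns
address = "Ontario, Canada"
parts = address.split(",")
state = parts[0].strip()    # first piece
country = parts[-1].strip() # last piece
print(state, "|", country)  # Ontario | Canada
```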
Hope this helps! :)
If you found this article helpful, I would really appreciate if you could follow my account and give this article a clap! Thank you!! :D