Wednesday, July 29, 2020

Solve NLP task : PyTorch Transformer Pipeline





Question answering (QA) is a computer science discipline within the fields of information retrieval and natural language processing (NLP), which is concerned with building systems that automatically answer questions posed by humans in a natural language.

Sentiment analysis (SA) is the interpretation and classification of emotions (positive, negative and neutral) within text data using text analysis techniques. Sentiment analysis tools allow businesses to identify customer sentiment toward products, brands or services in online feedback.


Here we are going to try out the PyTorch-based transformer pipeline (deep learning NLP) from Hugging Face, starting with question answering, for which the SQuAD data set from Stanford is the standard benchmark (References [1]).


# Q&A pipeline
!pip install -qq transformers
from transformers import pipeline
qapipe = pipeline('question-answering')
qapipe({
 'question': """how can question answering service produce answers""",
 'context': """One such task is reading comprehension. Given a passage of text, we can ask questions about the passage that can be answered by referencing short excerpts from the text. For instance, if we were to ask about this paragraph, "how can a question be answered in a reading comprehension task" ..."""
})

Output:

{'score': 0.38941961529900837,
 'start': 128,
 'end': 169,
 'answer': 'referencing short excerpts from the text.'}
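The 'start' and 'end' values in the output are character offsets into the context string. As a small sketch (using only the first two sentences of the context, since the answer falls inside them, and the result dictionary copied from the output above), a plain Python slice recovers the same answer without running the model again:

```python
# Shortened copy of the context passed to the pipeline above
# (first two sentences only; the rest is not needed for this check).
context = (
    "One such task is reading comprehension. Given a passage of text, "
    "we can ask questions about the passage that can be answered by "
    "referencing short excerpts from the text."
)

# Result dictionary as printed in the output above.
result = {
    "score": 0.38941961529900837,
    "start": 128,
    "end": 169,
    "answer": "referencing short excerpts from the text.",
}

# 'start' is inclusive and 'end' is exclusive, so a slice returns the answer span.
extracted = context[result["start"]:result["end"]]
print(extracted)
```

This is handy when you want to highlight the answer inside the original passage rather than just print it.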


# Sentiment Analysis pipeline
from transformers import pipeline
sentiment_pipe = pipeline('sentiment-analysis')
sentiment_pipe ("I sure would like to see a resurrection of a up dated Seahunt series with the tech they have today it would bring back the kid excitement in me.I grew up on black and white TV and Seahunt with Gunsmoke were my hero's every week.You have my vote for a comeback of a new sea hunt.We need a change of pace in TV and this would work for a world of under water adventure.Oh by the way thank you for an outlet like this to view many viewpoints about TV and the many movies.So any ole way I believe I've got what I wanna say.Would be nice to read some more plus points about sea hunt.If my rhymes would be 10 lines would you let me submit,or leave me out to be in doubt and have me to quit,If this is so then I must go so lets do it.")

Output:
[{'label': 'POSITIVE', 'score': 0.91602623462677}]


Experiments: The Python code above was run in Google Colab. You can try other pipelines as well, such as NER (Named Entity Recognition) and feature extraction. The Hugging Face documentation is a useful starting point (References [2]).

Conclusion:
With deep learning in NLP advancing rapidly through transformer-driven architectures,
it's pretty convenient to put these models into practice with minimal coding, while still getting good accuracy
for a given NLP task. It is also feasible to take an existing model and fine-tune it on a custom data set
by running training epochs (train/test) on that data, and in most cases you can achieve higher accuracy than the
baseline model. So keep exploring the new transformer-based models!

References:
1. The Stanford data set for Q&A (SQuAD): https://rajpurkar.github.io/SQuAD-explore
2. More on Transformer Pipeline: https://huggingface.co/transformers/main_classes/pipelines.html


Monday, March 23, 2020

COVID-19 data analysis using Pentaho tools



The world is going through a pandemic caused by the novel coronavirus (the disease is COVID-19). To help understand its impact around the world, Johns Hopkins University's (JHU) Coronavirus Resource Center has provided data sets.
Data Source: https://data.humdata.org/dataset/novel-coronavirus-2019-ncov-cases

For the analysis here, I have chosen the global narrow data sets (all dates are in one column) for ETL processing. Wide data sets, where each date is its own column, are also available.
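To make the narrow versus wide distinction concrete, here is a minimal Python sketch that reshapes a wide table (one column per date) into the narrow layout (one row per location and date). The sample rows, the place names and the numbers are made up for illustration, and the "Date" and "Value" column names are assumptions, not necessarily the exact headers in the downloaded files.

```python
import csv
import io

# Hypothetical sample in the "wide" layout: one column per date.
# Place names and numbers are invented for illustration only.
WIDE_CSV = """Province/State,Country/Region,Lat,Long,1/22/20,1/23/20,1/24/20
,Exampleland,10.0,20.0,1,2,4
North,Sampleria,30.0,40.0,0,3,5
"""

# Columns that identify a location; every other column is treated as a date.
ID_COLUMNS = ["Province/State", "Country/Region", "Lat", "Long"]


def wide_to_narrow(wide_text, id_columns=ID_COLUMNS):
    """Turn one row per location into one row per (location, date)."""
    narrow_rows = []
    for row in csv.DictReader(io.StringIO(wide_text)):
        ids = {name: row[name] for name in id_columns}
        for column, value in row.items():
            if column in id_columns:
                continue
            narrow_rows.append({**ids, "Date": column, "Value": int(value)})
    return narrow_rows


narrow = wide_to_narrow(WIDE_CSV)
for record in narrow:
    print(record)
```

The narrow layout is easier to load into a single database table, because new dates add rows instead of new columns.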


Pentaho Data Integration (PDI) is used here. A job was designed to download the files (Confirmed, Recovered, Deaths), and a transformation was created to load the data into a MySQL table. I then used the DSW [Data Source Wizard] in PUC [Pentaho User Console], which builds a Mondrian-based model, to generate Pentaho Analyzer (PAZ) reports and place them in a PDD (Pentaho Dashboard Designer) dashboard. Snapshots of the process are below.

ETL via Pentaho Data Integration:

















Dashboard for analysis:

As you can see, three reports are collected here, controlled through two dashboard prompts (country and date). The confirmed cases make it clear that China's curve has already flattened (stable health), whereas other countries are still going upward. The next most prominent country is Italy, where we have also seen many deaths, and where medical recovery (recovered) is still lagging.

Mar 22, 2020




Mar 27, 2020

As you can see, the US and Italy have surpassed China in confirmed cases and deaths, but their recovery numbers have yet to grow.


Hoping the recovery curve will rise in the weeks to come (the sooner the better).

Apr 3, 2020

You can see that confirmed cases are still rising (each country is scaled independently). It will take some time for the recovery and confirmed lines to merge (or come close, as in China). Hopefully soon.



Social Distancing Does Matter - Running a Python program via PDI shows how a spread variable can change the size of the infected population:
- Spread = 1 [no social distancing]: all of the 100K population was infected.
- Spread = 0.5 [1 person infected, 1 other person maintained social distancing]: the infected population came down to 80K.
- Spread = 0.25 [1 infected, 3 maintained social distancing]: the infected population came down to 35K.
- Spread = 0.2 [1 infected, 4 maintained social distancing]: the infected population came down to 20K.
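The Python program used for the numbers above is not shown in this post. As a rough sketch of the idea only, the snippet below uses a simple daily SIR-style model in which the spread variable scales the transmission rate. The function name, base_rate, recovery_rate and days are all assumed values chosen for illustration, so this sketch will not reproduce the exact 80K, 35K and 20K figures; it only shows the same direction of effect, where a lower spread gives fewer people infected.

```python
def simulate_outbreak(population=100_000, spread=1.0, base_rate=0.5,
                      recovery_rate=0.1, days=3650):
    """Return how many people were ever infected in a simple SIR-style model.

    All parameters except `population` and `spread` are hypothetical:
    base_rate     - new infections per infected person per day with no distancing
    recovery_rate - fraction of infected people who recover each day
    days          - number of simulated days
    """
    susceptible = population - 1.0  # everyone except the first case
    infected = 1.0
    for _ in range(days):
        # Spread scales down the chance that a contact passes on the infection.
        new_infections = spread * base_rate * infected * susceptible / population
        new_recoveries = recovery_rate * infected
        susceptible -= new_infections
        infected += new_infections - new_recoveries
    return population - susceptible


for spread in (1.0, 0.5, 0.25, 0.2):
    total = simulate_outbreak(spread=spread)
    print(f"spread={spread}: about {total:,.0f} people infected")
```

A deterministic model like this is only one possible design; an agent-based or randomized simulation would give different absolute numbers.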



Apr 13, 2020:

This chart captures the daily delta in confirmed COVID-19 cases. As you can see, there is an early sign that the US curve has started flattening, and several European countries show a similar trend. Over the next few weeks, it will be important to keep up social distancing so that these curves flatten out completely!
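The daily delta is simply the difference between each day's cumulative count and the previous day's. A minimal Python sketch, using made-up cumulative numbers rather than the real data:

```python
def daily_delta(cumulative):
    """Return day-over-day increases for a cumulative series."""
    return [today - yesterday
            for yesterday, today in zip(cumulative, cumulative[1:])]


# Made-up cumulative confirmed counts for six consecutive days.
cumulative_confirmed = [100, 150, 230, 300, 350, 380]

deltas = daily_delta(cumulative_confirmed)
print(deltas)  # [50, 80, 70, 50, 30]
```

When the deltas start shrinking, as in the last three values here, the cumulative curve is beginning to flatten.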

Apr 27, 2020:

Each country is scaled independently.
The recovery curve is still progressing slowly in the US. The UK's recovery count is very small. Germany and Spain are doing better on the recovery trajectory.

July 29, 2020:

As you can see, cases are now growing in countries like Brazil and India. Here is a chart of the daily delta in confirmed cases.














Here is the trend by country as of almost the end of July 2020. The US, India, and Brazil are going upward, along with Spain. It's clear that Germany, Italy, and the UK are tending towards a stable curve.



Chart on Deaths trend by country:

Sep 10, 2020:

When the number of cases is charted as the delta count compared to the previous day, India has crossed Brazil and the US.




















Recovery pattern analysis: Brazil, India, and Germany follow a pattern where the number of recoveries is close to the number of cases, whereas in Spain, the US, and Italy the cases and recoveries are not close. The UK recovery number may be an exception to this.


















Nov 11, 2020:

US confirmed cases are increasing rapidly. The current daily number is more than twice the peak from July.
















Nov 19, 2020:

Looking at the 5-day moving average of confirmed cases, it's clear that cases in the US are going up.
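A 5-day moving average smooths out day-to-day noise by averaging each day with the four days before it. A minimal Python sketch with made-up daily counts (the function name and the sample numbers are illustrative, not taken from the dashboard):

```python
def moving_average(values, window=5):
    """Return the trailing moving average; the first window-1 days are skipped."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


# Made-up daily confirmed counts for seven consecutive days.
daily_cases = [10, 20, 30, 40, 50, 60, 70]

smoothed = moving_average(daily_cases)
print(smoothed)  # [30.0, 40.0, 50.0]
```

A longer window gives a smoother line but reacts more slowly to a real change in trend.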



















Apr 12, 2021:

As the world gradually moves towards vaccination and reopening the economy, the outcome is mixed across countries. Cases are spiking in India and Brazil, as is clear from the charts below when the vaccinated individuals are compared to population size.

-- 5-day moving average of confirmed cases
















-- Daily delta of confirmed cases













Apr 20, 2021:

As coronavirus cases spike in India's second wave, guidelines have come from the CDC: CDC Guidelines on travel to India

It's going to be some time until we see the curve flatten for India.

Apr 27, 2021:

As you can see, daily cases in India are still growing. With the second wave of the virus, cases have surged in a matter of weeks.