2016年8月13日土曜日

「Titanic: Machine Learning from Disaster-Getting Started With Excel」翻訳してみた2(ExcelでTitanicの予想モデルを向上してみる編)

 ※訳注:記事(https://www.kaggle.com/c/titanic/details/getting-started-with-excel )の前半、ExcelでTitanicモデルを作ってKaggleに応募してみるまでの流れの翻訳・試行は、http://techinfo4dog.blogspot.jp/2016/08/titanic-machine-learning-from.html 参照。
 ここからは、Kaggle用の訓練ファイルなどが手元にあり、一度Kaggleに応募したことがあることを前提に話を進めていますので、まだ準備ができていない人は、上記前半記事などを参考に準備・試行を行った上で、読み進めてください。


****

元記事:https://www.kaggle.com/c/titanic/details/getting-started-with-excel (後半)

2つ目の予想:より正答率を上げるために

1つめの予想では、コンペの終わりに良い成績は取れないでしょう。コンペの成長、発展に応じ、コンペ参加者はデータについて学び、フォーラムで技術について語り、モデルを改良する方法について考えます。ではいよいよ改良していきましょう。
改良を施すため、train.csv(Excelで開き、ピボットテーブルにSurvived、sexの分析が行われている)に戻ります。最初の仮説をよりよくするために、年齢も予想に入れていきましょう。先ほどのピボットテーブルに年齢を追加することができます。
ageを列ラベル(すでにsexがある部分) にドラッグし、性と年齢で一致する船客の数が表示されます。



しかし、お気づきの通り、これではどうしようもありません。年齢をグループ分け、たとえば大人と子供、のように分ける必要があります。ifを使って新しい列(訳注:ここではAdultChildとしてM列に追加)を作ります。新しい列には、F2セルが年齢の場合、「=IF(F2>18,"adult","child")」を入力します。そして数式を最終列まで指定し、ピボットテーブルを作り直し(全部のデータを選択して挿入→ピボットテーブル。前半と同様)、新しい列を見ていきます。




 

このテーブルによると、大人の女性(18歳以上)の生き残る確立は78%ありましたが、大人の男性はたった18%です。ただこれでは元の予測とあまり変わっていません。つまり年齢という変数は追加の情報としては価値が多くありません。ではさらにひとつ変数を追加してみましょう。(訳注:先ほど入れたAdultChildは一度はずす。)船客のクラス(訳注:Pclass。1,2,3。)です。ピボットテーブルに追加してみると結果が分かります。





 




Adding Class to the pivot table




ごらんの通り、予測が劇的に変わり、この変数を追加する意味があったことが分かります(訳注:女性のPclassが3が、生存率50%)。ですが、まだ男女の分け方のほうに軍配が上がっています。では、次に乗船チケットの値段(訳注:Fare)について見ていきましょう。各クラスはさらにチケットの値段で分けられます。
 (i) $10以下
 (ii) $10から$20の間
 (iii) $20から$30の間
 (iv) $30以上

(訳注:test.csvの N列に新しくFareClassという行を作り、J列のFareを上記4つに分けるため、N2列に
「=IF(J2<10,"1",IF(J2<20,"2",IF(J2<30,"3","4")))」
と記載し、最終行までセルコピーしてみた。
そしてまたピボットテーブルをつくり、Sexと当FareClassとを入れる。)






Price, Class and gender in my pivot table
たくさんに分かれましたね。半数以上が生き残っているグループと、そうではないグループとがあります。 ご覧の通り、男性は相変わらずあまり生き残っていませんが、女性の3rdクラス、$20以上払ったグループもあまり生き残っていないことがわかりました。この小さな発見がleaderboardの結果に響くでしょう。test.csvを開き、新たに1つ目の列「Survived」を作り、以下のIf分を入力します。
=IF(E2="male",0,IF(C2=3,IF(J2>20,0,1),1))
そして、この列の値だけを別の新たな列「Survived」にコピーし、新たなSurvived列とPassengerIdの列だけにして、csv形式で保存して、Kaggleの結果として送信してください。

(訳注:作ったcsvファイルは、test.csvなどを取得した場所にあるgenderclassmodel.csvと同じ。前回と同様に、Titanicトップページの左上のほうにあるMake a submissionからファイルを選択すると数秒でアップされてスコアが出る。)

  


(訳注:前回の男女別だけのスコアが 0.76555だったのと比べ、0.77990に向上している。)



ご覧の通り、3rdクラスに多く払うか払わなかったかの差が、送信したスコアに響いたことが分かります。
I am number 1!
終わりに
ここでは、Excelがデータの基本的な理解に使えることを示しました。この少しの作業で、チケットを購入するのに多く払う必要はなかったことが分かります。ある人たちは、1stクラスの人より多く支払って3rdクラスのチケットを購入したのです。
初回のチュートリアルの締めくくりとして、データをダウンロードし、問題に対する最初の理解を元に簡単なモデルを作りました。そしてKaggleに結果を投稿してみて、さらに向上させたモデルを作りました。さらにモデルをつなげたり、複雑な思い付きをモデルにするために、Excelより洗練されたソフトを使いたくなってくるでしょう。次のチュートリアル(https://www.kaggle.com/c/titanic/details/getting-started-with-python)では、シンプルなスクリプト言語であるPythonを使い、分析を自動化してみます。












****以下、翻訳時点での記事コピー(元記事:https://www.kaggle.com/c/facial-keypoints-detection/details/submission-instructions)*****

Getting Started With Excel

Getting Started with Excel: Kaggle's Titanic Competition

For those who are not experienced with handling large data sets, logging into the Kaggle website for the first time may be slightly daunting. Many of these competitions have a six figure prize and data which can, at times, be extremely involved. Here at Kaggle, we understand that this may seem like an insurmountable barrier to entry, so we have created a "getting started" competition to guide you through the initial steps required to get your first decent submission on the board.

The challenge

The competition is simple: we want you to use the Titanic passenger data (name, age, gender, socio-economic class, etc) to try to predict who will survive and who will die.
I want to compete! What do I do next?
The first thing to do is get the data from the Kaggle website. You will need two files:train.csv and test.csv. The .csv filename stands for comma separated values, where every value (name, age, gender, etc) in a row is separated by a comma. This allows Excel to interpet the data as columns.  If you load the file up in notepad--or your text editor of choice--it would look like this:
CSV in notepad
 However, in Excel it looks like this:
CSV in Excel
To get the data…Click on Get The Data, and this will take you to the Data page:
Download the data
In order to download, click on the blue file extensions (.csv). The first time you do this you will be taken to the rules acceptance page. You must accept the competition rules in order to download the data. These rules govern how many submissions you can make per day, the maximum team size, and other competition specific details. Click "I Understand and Accept" and then re-click the filenames.  You will need both files.
I have the data what do I have to do?!
You have two files, train.csv and test.csv. Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and will tell you their details and whether they survived or not. Using the patterns you find in the train.csv data, you will have to predict whether the other 418 passengers on board (found in test.csv) survived. 
So I need to find patterns in the data… Where do I start?!
Open up train.csv in Excel. As you can see, we have given you the details of a number of passengers on board the Titanic. We have given you the Pclass, the class of their ticket (1st, 2nd or 3rd), their name, their sex, age, Subsp (which is the number of siblings / spouses they had on board with them), Parch (which is the number of parents / children they had on board with them), their ticket type, the fare they paid for their ticket, their cabin number, and Embarcation point (where they got on: Queenstown,Cherbourg or Southampton). We also give you whether they survived or not (1 = yes, 0 = sadly, not). This is the information you are going to use to make your predictions. We want to find if there is a relationship between one of the variables and ultimate survival.
In data science, your intuition is often a great place to start. If you have ever seen the film Titanic, they try to save the women and children first.  This would be a good guess to start! Excel has a helpful tool for this kind of exploration called a pivot table. Highlight the entire set of data and go to 'Insert'--> 'Pivot Tables'.
Insert a Pivot Table!
This should create a new spreadsheet in your document. On the right should be all the variables that you selected and four boxes at the bottom. We are interested in who survived, so you want to see how this value varies as you select other variables. On the right hand side, drag the word 'survived', which has a check box next to it, down to the bottom right hand box. This will show the sum of the survived box. Now, to see how many women and men survived, drag the 'Sex' variable to the 'Row labels'. Since Survived is a 1 or 0, the sum of this is the total number who survived. If you want to find the proportion, drag the Survived variable again into the values box (so now there are two in there) click on the down arrow, select value field settings, and change 'Sum' to 'count'. This will tell you the total number of rows (or passengers in this case). Now, in the cell next to the table you can just type = B4 / C4, which would show the proportion on females that survived.
Work out the proportion of survivors
From this you can see that almost 75% of the females survived! However, only 19% of the males lived to tell about it. This is quite a promising first guess!
Making my predictions
Gender seems to be a strong indicator of survival, with women having a much better chance! To make your first predictions based on this, now open test.csv in Excel and insert a new column in the first column, and give it the header Survived. Your submission to Kaggle MUST have your predictions in a column named 'Survived' (more info here)!
To make a model which states that if the passenger is female then she survives in the first cell write a logical 'if' statement. Type =if(E2="female",1,0) which means if(sex=female, then make this cell = 1, if not then make this cell = 0). Then double click on the bottom right hand corner of this box so it drags down through all rows. You've now created a calculated formula for each row in Excel, but for Kaggle you want a definitive value, 0 or 1. So re-paste the same column onto itself with Paste As > Values.  Save the file as something memorable. I'll use 'genderbasedmodel.csv'. 
A gender based model!
I’ve made my predictions, how well have I done?
One last step:  Kaggle can only accept 2 columns in your submitted answer, PassengerId and Survived. Delete any extra columns, and save (or Save As) this new version of your csv which only has those 2 columns. Now go back to your internet browser and the competition page and click 'Make a submission'.
  • This should bring up your team page. This is where you can set up your team. (On Kaggle you are always a member of a team, whether it is one person or 20 people.) Each team has a team leader. Teams can be added to at a later date, however members cannot be removed, so choose carefully! You can compete anonymously as well, if don’t want people to see your name.
  • Hit 'Continue' and the first thing you may notice is 'You have 2 (of 2 entries) left today'. Kaggle limits the number of submissions you make so you can't use the leaderboard score to gain and unfair advantage with your submissions. Click on the button 'Click or Drop Submission Here' and select the file you would like to submit (here, 'genderbasedmodel.csv') and click 'Submit'.
The Kaggle Leaderboard!
  • Once you have submitted, you will be taken to the leaderboard and shown your score and ranking on the leaderboard. Each competition is scored based on different evaluation metrics, whose details are described on theEvaluation page.  In this competition the metric is simply the fraction of passengers you got correct.

My second submission: I want to do better!

Your first submission will likely not be your best at the end of the competition. As the competition grows and evolves, people learn about the data, posts appear on the forum discussing techniques which give insight, and you will think of new ways to improve your model. Let's try to improve ours.
Going back to the train.csv, you want to improve what you currently have. Extending the earlier hypothesis that maybe age may also be predictive, you can add age to the pivot table. Drag age to the Row labels (where sex was) and it will show the number of passengers who were match on both gender and an age. However, as you may notice, the age is not binned up in any way. You may want to group these up, perhaps starting with just adults and children. Just like before, make an extra column with a 'if' statement. In this case, it will read =IF(F2>18,"adult","child"), where F2 was the age column. Now recreate the pivot table (note you have to re-make it) and add the same variables before, you can see if there are more patterns! 
According to the table, adult women (over 18) had a 78% chance of survival and male adults only had an 18% chance of survival. You can see that this isn’t much changed from the original proportions. This tells us that there is not much additional information in the age variable. Let's look at one more variable: the class of passenger. Adding this to the pivot table we can see the results:
Adding Class to the pivot table
As you can see, the proportions have now changed dramatically, meaning that there is some predictive merit to this variable. However, this still doesn't surpass the male/female divide. Now let's bin up what people paid for their ticket, so that each class is split into payments of: (i) less than $10, (ii) between $10 and $20, (iii) between $20 and $30, and (iv) greater than $30:
Price, Class and gender in my pivot table
Now we have the proportions for many different variables. I decide to just assume that any group with more than half survivors I will model to always survive, and all those with less than half will model to never survive. So as you can see, all the males will still not survive; however now women in third class who paid more than $20 will also not survive. This small improvement should make a difference on the leaderboard! So now going to test.csv  I do a nested IF statement in each cell of a new first column I create called Survived:
=IF(E2="male",0,IF(C2=3,IF(J2>20,0,1),1))
and then re-paste my formula column with Paste As Values, reduce my file to only 2 columns, Survived & PassengerId, then save to a csv file, then submit! As you see, we now differentiated between those who paid a lot for their third class ticket, and those who did not, and look at the effect on my submitted score!
I am number 1!
Conclusion
Here Excel was used to get the basic understanding of the data. From just this basic bit of work, we have found that people did not necessarily pay the same amount of money for a ticket; some people paid more for a third class than a first class!
This concludes our first tutorial. We have been able to download the data and make a simple model based on our initial understanding of the problem. We made a submission to Kaggle, and then built on it and made an improvement.   As we incorporate more models and more complicated ideas, we may want to move to something slightly more sophisticated than Excel. The next tutorial will look at using Python, a simple scripting language which can help automate our analysis.





0 件のコメント:

コメントを投稿