2016年8月14日日曜日

「Titanic: Machine Learning from Disaster－Getting Started With Python」翻訳してみた１（Pythonを使って解いてみる編）

Titanic: Machine Learning from Disaster

（タイタニック号沈没：災害の機械学習）

Getting Started With Python
（Pythonを使った入門）

元記事：https://www.kaggle.com/c/titanic/details/getting-started-with-python

Excelでの作業の振り返り：

（訳注：当ブログの翻訳記事：

　＊前半（とりあえず解いてUpする）：http://techinfo4dog.blogspot.jp/2016/08/titanic-machine-learning-from.html

　＊後半（少しモデルの向上をはかって再度Upする）：http://techinfo4dog.blogspot.jp/2016/08/titanic-machine-learning-from_13.html

）

ここまでで、コンペティション（コンペ）のデータをダウンロードし、２つのモデルをKaggleにUp（送信）しました。１つは、性別だけのモデル、もうひとつは、性別とクラス、チケットの価格も考慮したモデルです。これは初回の提案としては良いものですが、多少複雑なモデルに対応しようとするのは難しいですし、Excelでこのように解いていくのは時間がかかります。より複雑なモデルに対応したくなり、ただ与えられた変数に対して生存率を手動で考えていくほどの時間が無い場合にはどうすればよいでしょうか。このような場合、コンピュータが役立ちます。

もっと変数を追加したいけど時間かかるよね

プログラミングのスクリプト言語は、計算を早くし、ピボットテーブルの割合を手動で計算する面倒な作業を避けるすばらしい方法です。スクリプト言語はいろいろな種類があり、それぞれ利点と欠点があります。ここでは、簡単に使えるスクリプト言語として、Python version 2.7を使っていきます。もしまだPythonをインストールしていない場合、「 Python web site （https://www.python.org/downloads/）」にアクセスし、記載されている内容に従ってください。あるいは「 Anaconda （https://www.continuum.io/downloads）」を使うと、データサイエンスにかなり有用なライブラリが既にバンドルされています。（Anacondaのさらに良いところとしては、一行一行順番に実行するのに便利なインタフェースを提供する iPython（インタラクティブなPython）が含まれていることです。）

他のケース：もし Python version 3.xを使う場合、2.x系と文法が違います。チュートリアルで出るエラーへの対応については、「 as people point out in the forum　（https://www.kaggle.com/c/titanic/forums/t/4937/titanic-problem-with-getting-started-with-python）」を参照してください。

インストール完了後、「 python 」、あるいは「 ipython 」、「 ipython notebook 」と入力することで開始できます。

Python の大きな利点としては、Pythonが持つパッケージです。これらのパッケージの中で、Kaggleのコンペで最も利用できるものは「 Numpy」、「 Scipy」、「 Pandas」、「 matplotlib」、「 csv package」です。これらのパッケージが使えるかどうかは、Pythonのコマンドラインで、「 import numpy 」などと入力するだけで確認できます。このチュートリアルでは、Pythonを使って前回（Excelの場合）と同じことをしてみます。

Python: train.csvを読む

Pythonには、csvを呼んでメモリに格納する良いcsv readerがあります。各列をリストに入れているように読めます。下記のようにすばやく配列に入れることもできます。

# The first thing to do is to import the relevant packages
# that I will need for my script, 
# these include the Numpy (for maths and arrays)
# and csv for reading and writing csv files
# If i want to use something from this I need to call 
# csv.[function] or np.[function] first

import csv as csv 
import numpy as np

# Open up the csv file in to a Python object
csv_file_object = csv.reader(open('../csv/train.csv', 'rb')) 
header = csv_file_object.next()  # The next() command just skips the 
                                 # first line which is a header
data=[]                          # Create a variable called 'data'.
for row in csv_file_object:      # Run through each row in the csv file,
    data.append(row)             # adding each row to the data variable
data = np.array(data)           # Then convert from a list to an array
            # Be aware that each item is currently
                                 # a string in this format

既にExcelでデータの中身を見ているから分かるでしょうが、Pythonでも中身を見てみましょう。 「print data 」と入力すると以下のように表示されます。

[['1' '0' '3' ..., '7.25' '' 'S']
 ['2' '1' '1' ..., '71.2833' 'C85' 'C']
 ['3' '1' '3' ..., '7.925' '' 'S']
 ..., 
 ['889' '0' '3' ..., '23.45' '' 'S']
 ['890' '1' '1' ..., '30' 'C148' 'C']
 ['891' '0' '3' ..., '7.75' '' 'Q']]

見ると、ヘッダが無く、ただ値だけのようです。そして、各値は引用符（”）でくくられているのが分かるでしょう。つまり、各値は文字列型（String型）で格納されているということです。残念なことに、データの一部は「...」になっています。1行目をすべて見たい場合には、「 print data[0] 」と入力します。

['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171' '7.25' '' 'S']

最後の行を見たい場合には、「 print data[-1] 」と入力します。

['891' '0' '3' 'Dooley, Mr. Patrick' 'male' '32' '0' '0' '370376' '7.75' '' 'Q']

1行目の4番目の列を見たい場合「 print data[0,3] 」と入力します。

Braund, Mr. Owen Harris

データは入ったからいろいろやってみたいね

特定の列を呼び出せることが分かったところで、性別の列を呼ぶために「 data[0::,4] 」を使います。「0::」は、すべて（最初から最後まで）を意味します。Pythonは1ではなく0から始まります。csv readerはデフォルトでは文字列型で格納するため、数値計算が必要な部分はfloats型に変換します。たとえば、「Pclass」の列を floatsに変換する場合、「

data[0::,2].astype(np.float)」です。これを使って、以下のようにTitanicの生存率を計算できます。

# The size() function counts how many elements are in
# in the array and sum() (as you would expects) sums up
# the elements in the array.

number_passengers = np.size(data[0::,1].astype(np.float))
number_survived = np.sum(data[0::,1].astype(np.float))
proportion_survivors = number_survived / number_passengers

「Numpy」は幾つか使いやすい機能があります。たとえば、性別の列を探し、どの要素が「female（あるいはmale。femaleに該当しないもの）」に該当するかを探し、幾人の女性、あるいは男性が生き残ったかを確認することができます。

women_only_stats = data[0::,4] == "female" # This finds where all 
                                           # the elements in the gender
                                           # column that equals “female”
men_only_stats = data[0::,4] != "female"   # This finds where all the 
                                           # elements do not equal 
                                           # female (i.e. male)

オリジナルのtrain.csvデータに、「mask」として新しく２つの変数を追加し、男女の生き残った割合を計算します。

# Using the index from above we select the females and males separately
women_onboard = data[women_only_stats,1].astype(np.float)     
men_onboard = data[men_only_stats,1].astype(np.float)

# Then we finds the proportions of them that survived
proportion_women_survived = \
                       np.sum(women_onboard) / np.size(women_onboard)  
proportion_men_survived = \
                       np.sum(men_onboard) / np.size(men_onboard) 

# and then print it out
print 'Proportion of women who survived is %s' % proportion_women_survived
print 'Proportion of men who survived is %s' % proportion_men_survived

訓練データで見ると、女性の生き残った割合が高いことが分かります。

テストデータを読んで、性別によるモデルをcsvとして出力する

前回と同様に、今度はPythonを使ってtest.csvファイルを読み、別のcsvファイルに結果を書き出す必要があります。まず、test.csvファイルを読んで、ヘッダをスキップします。

test_file = open('../csv/test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

新しいファイル（まだ中身がからっぽのもの）を書き込みできるように開きます。結果を記入してKaggleにアップロードするためのファイルです。

prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)

test用のファイルを１行ごとに読んで、「female」なら1、「male」なら0を、生存予測として新しいファイルに記入します。

prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       # For each row in test.csv
    if row[3] == 'female':         # is it a female, if yes then                                       
        prediction_file_object.writerow([row[0],'1'])    # predict 1
    else:                              # or else if male,       
        prediction_file_object.writerow([row[0],'0'])    # predict 0
test_file.close()
prediction_file.close()

さて、これでKaggleに登録するための「genderbasedmodel.csv」ができました。
データページ（訳注：test.csvとかダウンロードできるところ）に、「gendermodel.py」というファイル名でこれらのステップを記載したファイルをダウンロードできます。Pythonを使うことの利点として、たとえばもし新たな訓練データを入手した際に、ここまでのステップをもう一度すぐに実施できることがあります。

＊＊＊＊以下、翻訳時点での記事コピー（元記事：https://www.kaggle.com/c/titanic/details/getting-started-with-python）＊＊＊＊＊

Getting Started With Python

Getting Started with Python: Kaggle's Titanic Competition

Recapping our work with Excel: we have been able to successfully download the competition data and submit 2 models: one based on just the gender and another on the gender, class, and price of ticket. This is good for an initial submission, however problems arise when we move to slightly more complicated models and the time taken to formulate approaches in Excel takes longer. What do you do if you want to make a more complicated model but don’t have the time to manually find the proportion of survivors for a given variable? We should make the computer do the hard work for us!

I want to add more variables but it takes so much time!

Programming scripts are a great way to speed up your calculations and avoid the arduous task of manually calculating the pivot table ratios. There are many languages out there, each with its own advantages and disadvantages. Here we are going to use Python version 2.7, which is an easy to use scripting language. If you do not have this installed, please visit the Python web site and folllow the instructions there -- or, you could install one specific distribution of Python called Anaconda that already bundles the most useful libraries for data science. (Another advantage of Anaconda is that it includes iPython (Interactive Python) which makes the interface easier for stepping through lines of programming one by one.)

NOTE in either case: if you use Python version 3.x, you may discover some Python syntax has changed in that version, which can cause errors on this tutorial as people point out in the forum.

When you have things installed, to begin just type python , or ipython , or ipython notebook.

One of the great advantages of Python is its packages. Of these packages, the most useful (for Kaggle competitions) are the Numpy, Scipy, Pandas, matplotlib and csv package. In order to check whether you have these, just go to your python command line and type import numpy (and so on). If you don’t you will need them! This tutorial is going to guide you through making the same submissions as before, only this time using Python.

Python: Reading in your train.csv

Python has a nice csv reader, which reads each line of a file into memory. You can read in each row and just append a list. From there, you can quickly turn it into an array.

# The first thing to do is to import the relevant packages
# that I will need for my script, 
# these include the Numpy (for maths and arrays)
# and csv for reading and writing csv files
# If i want to use something from this I need to call 
# csv.[function] or np.[function] first

import csv as csv 
import numpy as np

# Open up the csv file in to a Python object
csv_file_object = csv.reader(open('../csv/train.csv', 'rb')) 
header = csv_file_object.next()  # The next() command just skips the 
                                 # first line which is a header
data=[]                          # Create a variable called 'data'.
for row in csv_file_object:      # Run through each row in the csv file,
    data.append(row)             # adding each row to the data variable
data = np.array(data)           # Then convert from a list to an array
            # Be aware that each item is currently
                                 # a string in this format

Although you've seen this data before in Excel, just to be sure let's look at how it is stored now in Python. Type print data and the output should be something like

[['1' '0' '3' ..., '7.25' '' 'S']
 ['2' '1' '1' ..., '71.2833' 'C85' 'C']
 ['3' '1' '3' ..., '7.925' '' 'S']
 ..., 
 ['889' '0' '3' ..., '23.45' '' 'S']
 ['890' '1' '1' ..., '30' 'C148' 'C']
 ['891' '0' '3' ..., '7.75' '' 'Q']]

You can see this is an array with just values (no descriptive header). And you can see that each value is being shown in quotes, which means it is stored as a string. Unfortunately in the output above, the full set of columns is being obscured with "...," so let's print the first row to see it clearly. Type print data[0]

['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171' '7.25' '' 'S']

and to see the last row, type print data[-1]

['891' '0' '3' 'Dooley, Mr. Patrick' 'male' '32' '0' '0' '370376' '7.75' '' 'Q']

and to see the 1st row, 4th column, type print data[0,3]

Braund, Mr. Owen Harris

I have my data now I want to play with it

Now if you want to call a specific column of data, say, the gender column, I can just type data[0::,4], remembering that "0::" means all (from start to end), and Python starts indices from 0 (not 1). You should be aware that the csv reader works by default with strings, so you will need to convert to floats in order to do numerical calculations. For example, you can turn the Pclass variable into floats by usingdata[0::,2].astype(np.float). Using this, we can calculate the proportion of survivors on the Titanic:

# The size() function counts how many elements are in
# in the array and sum() (as you would expects) sums up
# the elements in the array.

number_passengers = np.size(data[0::,1].astype(np.float))
number_survived = np.sum(data[0::,1].astype(np.float))
proportion_survivors = number_survived / number_passengers

Numpy has some lovely functions. For example, we can search the gender column, find where any elements equal female (and for males, 'do not equal female'), and then use this to determine the number of females and males that survived:

women_only_stats = data[0::,4] == "female" # This finds where all 
                                           # the elements in the gender
                                           # column that equals “female”
men_only_stats = data[0::,4] != "female"   # This finds where all the 
                                           # elements do not equal 
                                           # female (i.e. male)

We use these two new variables as a "mask" on our original train data, so we can select only those women, and only those men on board, then calculate the proportion of those who survived:

# Using the index from above we select the females and males separately
women_onboard = data[women_only_stats,1].astype(np.float)     
men_onboard = data[men_only_stats,1].astype(np.float)

# Then we finds the proportions of them that survived
proportion_women_survived = \
                       np.sum(women_onboard) / np.size(women_onboard)  
proportion_men_survived = \
                       np.sum(men_onboard) / np.size(men_onboard) 

# and then print it out
print 'Proportion of women who survived is %s' % proportion_women_survived
print 'Proportion of men who survived is %s' % proportion_men_survived

Now that I have my indication that women were much more likely to survive, I am done with the training set.

Reading the test data and writing the gender model as a csv

As before, we need to read in the test file by opening a python object to read and another to write. First, we read in the test.csv file and skip the header line:

test_file = open('../csv/test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

Now, let's open a pointer to a new file so we can write to it (this file does not exist yet). Call it something descriptive so that it is recognizable when we upload it:

prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)

We now want to read in the test file row by row, see if it is female or male, and write our survival prediction to a new file.

prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       # For each row in test.csv
    if row[3] == 'female':         # is it a female, if yes then                                       
        prediction_file_object.writerow([row[0],'1'])    # predict 1
    else:                              # or else if male,       
        prediction_file_object.writerow([row[0],'0'])    # predict 0
test_file.close()
prediction_file.close()

Now you have a file called 'genderbasedmodel.csv', which you can submit!

On the Data page you will find all of the steps above in a single python script named 'gendermodel.py'. One advantage of python is that you can quickly run all of the steps you did again in the future -- if you receive a new training file, for example.

2016年8月13日土曜日

「Titanic: Machine Learning from Disaster－Getting Started With Excel」翻訳してみた２（ExcelでTitanicの予想モデルを向上してみる編）

※訳注：記事（https://www.kaggle.com/c/titanic/details/getting-started-with-excel　）の前半、ExcelでTitanicモデルを作ってKaggleに応募してみるまでの流れの翻訳・試行は、ｈttp://techinfo4dog.blogspot.jp/2016/08/titanic-machine-learning-from.html　参照。
　ここからは、Kaggle用の訓練ファイルなどが手元にあり、一度Kaggleに応募したことがあることを前提に話を進めていますので、まだ準備ができていない人は、上記前半記事などを参考に準備・試行を行った上で、読み進めてください。

＊＊＊＊

元記事：https://www.kaggle.com/c/titanic/details/getting-started-with-excel　（後半）

2つ目の予想：より正答率を上げるために

１つめの予想では、コンペの終わりに良い成績は取れないでしょう。コンペの成長、発展に応じ、コンペ参加者はデータについて学び、フォーラムで技術について語り、モデルを改良する方法について考えます。ではいよいよ改良していきましょう。

改良を施すため、train.csv（Excelで開き、ピボットテーブルにSurvived、sexの分析が行われている）に戻ります。最初の仮説をよりよくするために、年齢も予想に入れていきましょう。先ほどのピボットテーブルに年齢を追加することができます。
ageを列ラベル（すでにsexがある部分）にドラッグし、性と年齢で一致する船客の数が表示されます。

しかし、お気づきの通り、これではどうしようもありません。年齢をグループ分け、たとえば大人と子供、のように分ける必要があります。ifを使って新しい列（訳注：ここではAdultChildとしてM列に追加）を作ります。新しい列には、F2セルが年齢の場合、「=IF(F2>18,"adult","child")」を入力します。そして数式を最終列まで指定し、ピボットテーブルを作り直し（全部のデータを選択して挿入→ピボットテーブル。前半と同様）、新しい列を見ていきます。

このテーブルによると、大人の女性（18歳以上）の生き残る確立は78%ありましたが、大人の男性はたった18%です。ただこれでは元の予測とあまり変わっていません。つまり年齢という変数は追加の情報としては価値が多くありません。ではさらにひとつ変数を追加してみましょう。（訳注：先ほど入れたAdultChildは一度はずす。）船客のクラス（訳注：Pclass。1,2,3。）です。ピボットテーブルに追加してみると結果が分かります。

ごらんの通り、予測が劇的に変わり、この変数を追加する意味があったことが分かります（訳注：女性のPclassが3が、生存率50%）。ですが、まだ男女の分け方のほうに軍配が上がっています。では、次に乗船チケットの値段（訳注：Fare）について見ていきましょう。各クラスはさらにチケットの値段で分けられます。
　(i) $10以下
　(ii) $10から$20の間
　(iii) $20から$30の間
　(iv) $30以上

（訳注：test.csvの N列に新しくFareClassという行を作り、J列のFareを上記４つに分けるため、N2列に
「=IF(J2<10,"1",IF(J2<20,"2",IF(J2<30,"3","4")))」
と記載し、最終行までセルコピーしてみた。
そしてまたピボットテーブルをつくり、Sexと当FareClassとを入れる。）

Price, Class and gender in my pivot table

たくさんに分かれましたね。半数以上が生き残っているグループと、そうではないグループとがあります。ご覧の通り、男性は相変わらずあまり生き残っていませんが、女性の3rdクラス、$20以上払ったグループもあまり生き残っていないことがわかりました。この小さな発見がleaderboardの結果に響くでしょう。test.csvを開き、新たに1つ目の列「Survived」を作り、以下のIf分を入力します。

=IF(E2="male",0,IF(C2=3,IF(J2>20,0,1),1))

そして、この列の値だけを別の新たな列「Survived」にコピーし、新たなSurvived列とPassengerIdの列だけにして、csv形式で保存して、Kaggleの結果として送信してください。

（訳注：作ったcsvファイルは、test.csvなどを取得した場所にあるgenderclassmodel.csvと同じ。前回と同様に、Titanicトップページの左上のほうにあるMake a submissionからファイルを選択すると数秒でアップされてスコアが出る。）

（訳注：前回の男女別だけのスコアが 0.76555だったのと比べ、0.77990に向上している。）

ご覧の通り、3rdクラスに多く払うか払わなかったかの差が、送信したスコアに響いたことが分かります。

終わりに

ここでは、Excelがデータの基本的な理解に使えることを示しました。この少しの作業で、チケットを購入するのに多く払う必要はなかったことが分かります。ある人たちは、1stクラスの人より多く支払って3rdクラスのチケットを購入したのです。

初回のチュートリアルの締めくくりとして、データをダウンロードし、問題に対する最初の理解を元に簡単なモデルを作りました。そしてKaggleに結果を投稿してみて、さらに向上させたモデルを作りました。さらにモデルをつなげたり、複雑な思い付きをモデルにするために、Excelより洗練されたソフトを使いたくなってくるでしょう。次のチュートリアル（https://www.kaggle.com/c/titanic/details/getting-started-with-python）では、シンプルなスクリプト言語であるPythonを使い、分析を自動化してみます。

＊＊＊＊以下、翻訳時点での記事コピー（元記事：https://www.kaggle.com/c/facial-keypoints-detection/details/submission-instructions）＊＊＊＊＊

Getting Started With Excel

Getting Started with Excel: Kaggle's Titanic Competition

For those who are not experienced with handling large data sets, logging into the Kaggle website for the first time may be slightly daunting. Many of these competitions have a six figure prize and data which can, at times, be extremely involved. Here at Kaggle, we understand that this may seem like an insurmountable barrier to entry, so we have created a "getting started" competition to guide you through the initial steps required to get your first decent submission on the board.

The challenge

The competition is simple: we want you to use the Titanic passenger data (name, age, gender, socio-economic class, etc) to try to predict who will survive and who will die.

I want to compete! What do I do next?

The first thing to do is get the data from the Kaggle website. You will need two files:train.csv and test.csv. The .csv filename stands for comma separated values, where every value (name, age, gender, etc) in a row is separated by a comma. This allows Excel to interpet the data as columns. If you load the file up in notepad--or your text editor of choice--it would look like this:

However, in Excel it looks like this:

To get the data…Click on Get The Data, and this will take you to the Data page:

In order to download, click on the blue file extensions (.csv). The first time you do this you will be taken to the rules acceptance page. You must accept the competition rules in order to download the data. These rules govern how many submissions you can make per day, the maximum team size, and other competition specific details. Click "I Understand and Accept" and then re-click the filenames. You will need both files.

I have the data what do I have to do?!

You have two files, train.csv and test.csv. Train.csv will contain the details of a subset of the passengers on board (891 to be exact) and will tell you their details and whether they survived or not. Using the patterns you find in the train.csv data, you will have to predict whether the other 418 passengers on board (found in test.csv) survived.

So I need to find patterns in the data… Where do I start?!

Open up train.csv in Excel. As you can see, we have given you the details of a number of passengers on board the Titanic. We have given you the Pclass, the class of their ticket (1^st, 2^nd or 3^rd), their name, their sex, age, Subsp (which is the number of siblings / spouses they had on board with them), Parch (which is the number of parents / children they had on board with them), their ticket type, the fare they paid for their ticket, their cabin number, and Embarcation point (where they got on: Queenstown,Cherbourg or Southampton). We also give you whether they survived or not (1 = yes, 0 = sadly, not). This is the information you are going to use to make your predictions. We want to find if there is a relationship between one of the variables and ultimate survival.

In data science, your intuition is often a great place to start. If you have ever seen the film Titanic, they try to save the women and children first. This would be a good guess to start! Excel has a helpful tool for this kind of exploration called a pivot table. Highlight the entire set of data and go to 'Insert'--> 'Pivot Tables'.

This should create a new spreadsheet in your document. On the right should be all the variables that you selected and four boxes at the bottom. We are interested in who survived, so you want to see how this value varies as you select other variables. On the right hand side, drag the word 'survived', which has a check box next to it, down to the bottom right hand box. This will show the sum of the survived box. Now, to see how many women and men survived, drag the 'Sex' variable to the 'Row labels'. Since Survived is a 1 or 0, the sum of this is the total number who survived. If you want to find the proportion, drag the Survived variable again into the values box (so now there are two in there) click on the down arrow, select value field settings, and change 'Sum' to 'count'. This will tell you the total number of rows (or passengers in this case). Now, in the cell next to the table you can just type = B4 / C4, which would show the proportion on females that survived.

From this you can see that almost 75% of the females survived! However, only 19% of the males lived to tell about it. This is quite a promising first guess!

Making my predictions

Gender seems to be a strong indicator of survival, with women having a much better chance! To make your first predictions based on this, now open test.csv in Excel and insert a new column in the first column, and give it the header Survived. Your submission to Kaggle MUST have your predictions in a column named 'Survived' (more info here)!

To make a model which states that if the passenger is female then she survives in the first cell write a logical 'if' statement. Type =if(E2="female",1,0) which means if(sex=female, then make this cell = 1, if not then make this cell = 0). Then double click on the bottom right hand corner of this box so it drags down through all rows. You've now created a calculated formula for each row in Excel, but for Kaggle you want a definitive value, 0 or 1. So re-paste the same column onto itself with Paste As > Values. Save the file as something memorable. I'll use 'genderbasedmodel.csv'.

I’ve made my predictions, how well have I done?

One last step: Kaggle can only accept 2 columns in your submitted answer, PassengerId and Survived. Delete any extra columns, and save (or Save As) this new version of your csv which only has those 2 columns. Now go back to your internet browser and the competition page and click 'Make a submission'.

This should bring up your team page. This is where you can set up your team. (On Kaggle you are always a member of a team, whether it is one person or 20 people.) Each team has a team leader. Teams can be added to at a later date, however members cannot be removed, so choose carefully! You can compete anonymously as well, if don’t want people to see your name.
Hit 'Continue' and the first thing you may notice is 'You have 2 (of 2 entries) left today'. Kaggle limits the number of submissions you make so you can't use the leaderboard score to gain and unfair advantage with your submissions. Click on the button 'Click or Drop Submission Here' and select the file you would like to submit (here, 'genderbasedmodel.csv') and click 'Submit'.

Once you have submitted, you will be taken to the leaderboard and shown your score and ranking on the leaderboard. Each competition is scored based on different evaluation metrics, whose details are described on theEvaluation page. In this competition the metric is simply the fraction of passengers you got correct.

My second submission: I want to do better!

Your first submission will likely not be your best at the end of the competition. As the competition grows and evolves, people learn about the data, posts appear on the forum discussing techniques which give insight, and you will think of new ways to improve your model. Let's try to improve ours.

Going back to the train.csv, you want to improve what you currently have. Extending the earlier hypothesis that maybe age may also be predictive, you can add age to the pivot table. Drag age to the Row labels (where sex was) and it will show the number of passengers who were match on both gender and an age. However, as you may notice, the age is not binned up in any way. You may want to group these up, perhaps starting with just adults and children. Just like before, make an extra column with a 'if' statement. In this case, it will read =IF(F2>18,"adult","child"), where F2 was the age column. Now recreate the pivot table (note you have to re-make it) and add the same variables before, you can see if there are more patterns!

According to the table, adult women (over 18) had a 78% chance of survival and male adults only had an 18% chance of survival. You can see that this isn’t much changed from the original proportions. This tells us that there is not much additional information in the age variable. Let's look at one more variable: the class of passenger. Adding this to the pivot table we can see the results:

As you can see, the proportions have now changed dramatically, meaning that there is some predictive merit to this variable. However, this still doesn't surpass the male/female divide. Now let's bin up what people paid for their ticket, so that each class is split into payments of: (i) less than $10, (ii) between $10 and $20, (iii) between $20 and $30, and (iv) greater than $30:

Now we have the proportions for many different variables. I decide to just assume that any group with more than half survivors I will model to always survive, and all those with less than half will model to never survive. So as you can see, all the males will still not survive; however now women in third class who paid more than $20 will also not survive. This small improvement should make a difference on the leaderboard! So now going to test.csv I do a nested IF statement in each cell of a new first column I create called Survived:

=IF(E2="male",0,IF(C2=3,IF(J2>20,0,1),1))

and then re-paste my formula column with Paste As Values, reduce my file to only 2 columns, Survived & PassengerId, then save to a csv file, then submit! As you see, we now differentiated between those who paid a lot for their third class ticket, and those who did not, and look at the effect on my submitted score!

Conclusion

Here Excel was used to get the basic understanding of the data. From just this basic bit of work, we have found that people did not necessarily pay the same amount of money for a ticket; some people paid more for a third class than a first class!

This concludes our first tutorial. We have been able to download the data and make a simple model based on our initial understanding of the problem. We made a submission to Kaggle, and then built on it and made an improvement. As we incorporate more models and more complicated ideas, we may want to move to something slightly more sophisticated than Excel. The next tutorial will look at using Python, a simple scripting language which can help automate our analysis.

登録: 投稿 (Atom)

2016年8月14日日曜日

「Titanic: Machine Learning from Disaster－Getting Started With Python」翻訳してみた１（Pythonを使って解いてみる編）

Titanic: Machine Learning from Disaster

（タイタニック号沈没：災害の機械学習）

Getting Started With Python（Pythonを使った入門）

Getting Started With Python

Getting Started with Python: Kaggle's Titanic Competition

2016年8月13日土曜日

「Titanic: Machine Learning from Disaster－Getting Started With Excel」翻訳してみた２（ExcelでTitanicの予想モデルを向上してみる編）

2つ目の予想：より正答率を上げるために

Getting Started With Excel

Getting Started with Excel: Kaggle's Titanic Competition

The challenge

My second submission: I want to do better!

Getting Started With Python
（Pythonを使った入門）