Sunday, August 14, 2016

I translated "Titanic: Machine Learning from Disaster - Getting Started With Python", part 1 (solving it with Python)


Titanic: Machine Learning from Disaster

Getting Started With Python

Original article: https://www.kaggle.com/c/titanic/details/getting-started-with-python

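Recapping our work with Excel:
(Translator's note: the translations of the Excel articles on this blog are:
 * Part 1 (solve it and submit for the first time): http://techinfo4dog.blogspot.jp/2016/08/titanic-machine-learning-from.html
 * Part 2 (improve the model a little and submit again): http://techinfo4dog.blogspot.jp/2016/08/titanic-machine-learning-from_13.html)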
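Up to this point we have downloaded the competition data and submitted two models to Kaggle: one based on gender alone, and another that also takes passenger class and ticket price into account. That is fine for a first submission, but moving to even slightly more complicated models is difficult, and working through them in Excel takes a long time. What do you do if you want a more complicated model but don't have the time to work out the survival rate for each given variable by hand? This is where the computer helps.
I want to add more variables, but it takes so much time!
Scripting languages are a great way to speed up your calculations and avoid the tedious task of working out pivot-table ratios by hand. There are many scripting languages, each with its own advantages and disadvantages. Here we will use Python version 2.7, an easy-to-use scripting language. If you do not have Python installed yet, visit the Python web site (https://www.python.org/downloads/) and follow the instructions there. Alternatively, Anaconda (https://www.continuum.io/downloads) already bundles the libraries most useful for data science. (Another advantage of Anaconda is that it includes iPython (interactive Python), which provides a convenient interface for running code one line at a time.)
Note, in either case: if you use Python version 3.x, its syntax differs from the 2.x series. For how to deal with the errors this causes in the tutorial, see the forum thread (https://www.kaggle.com/c/titanic/forums/t/4937/titanic-problem-with-getting-started-with-python).
Once everything is installed, start by typing "python", "ipython", or "ipython notebook".
One of Python's great advantages is its packages. Of these, the most useful for Kaggle competitions are Numpy, Scipy, Pandas, matplotlib, and the csv package. To check whether they are available, just type "import numpy" (and so on) at the Python command line. This tutorial does the same things as last time (the Excel version), this time with Python.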
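Instead of typing each import by hand, the check can be scripted. The following is a minimal sketch, assuming Python 3.4 or later (not the 2.7 used in this tutorial); the helper name check_packages is made up for illustration.

```python
import importlib.util

# Module names as they are imported (assumption: these match the packages
# listed in the text above).
PACKAGES = ["numpy", "scipy", "pandas", "matplotlib", "csv"]


def check_packages(names):
    """Return a dict mapping each module name to True if it can be found."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


availability = check_packages(PACKAGES)
for name, found in availability.items():
    print("%-12s %s" % (name, "available" if found else "MISSING"))
```

Anything reported as MISSING has to be installed before the rest of the tutorial will run. csv is part of the standard library, so it should always be found.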
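Python: Reading in train.csv
Python has a nice csv reader that reads a csv file into memory. You can read the file row by row, appending each row to a list, and then quickly turn that list into an array, as follows.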
# The first thing to do is to import the relevant packages
# that I will need for my script, 
# these include the Numpy (for maths and arrays)
# and csv for reading and writing csv files
# If i want to use something from this I need to call 
# csv.[function] or np.[function] first

import csv as csv 
import numpy as np

# Open up the csv file in to a Python object
csv_file_object = csv.reader(open('../csv/train.csv', 'rb')) 
header = csv_file_object.next()  # The next() command just skips the 
                                 # first line which is a header
data=[]                          # Create a variable called 'data'.
for row in csv_file_object:      # Run through each row in the csv file,
    data.append(row)             # adding each row to the data variable
data = np.array(data)            # Then convert from a list to an array
                                 # Be aware that each item is currently
                                 # a string in this format
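For readers on Python 3.x, here is a rough equivalent of the reading step. It is only a sketch: the two data rows are the ones printed later in this article, the header names are written from memory of the Kaggle file (treat them as an assumption), and an in-memory io.StringIO stands in for '../csv/train.csv' so the snippet runs on its own. The function name read_csv_as_array is made up.

```python
import csv
import io

import numpy as np

# Stand-in for the real file: a header plus the first and last rows shown
# in this article. With the real file you would use
# open('../csv/train.csv', newline='') instead.
SAMPLE = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S\n'
    '891,0,3,"Dooley, Mr. Patrick",male,32,0,0,370376,7.75,,Q\n'
)


def read_csv_as_array(handle):
    """Read a csv file object; return (header, 2-D array of strings)."""
    reader = csv.reader(handle)
    header = next(reader)             # Python 3: next(reader), not reader.next()
    rows = [row for row in reader]    # every remaining row, as a list of strings
    return header, np.array(rows)


header, data = read_csv_as_array(SAMPLE)
print(header[:3])
print(data.shape)
```

The main differences from the 2.7 code are next(reader) in place of reader.next(), print as a function, and opening the file in text mode with newline='' rather than 'rb'.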
You have already seen this data in Excel, but let's look at how it is stored in Python as well. Type "print data" and the output should look like this:
[['1' '0' '3' ..., '7.25' '' 'S']
 ['2' '1' '1' ..., '71.2833' 'C85' 'C']
 ['3' '1' '3' ..., '7.925' '' 'S']
 ..., 
 ['889' '0' '3' ..., '23.45' '' 'S']
 ['890' '1' '1' ..., '30' 'C148' 'C']
 ['891' '0' '3' ..., '7.75' '' 'Q']]

You can see that there is no header, just values, and that each value is shown in quotes ('), which means it is stored as a string. Unfortunately, part of the data is hidden behind "...". To see the whole first row, type "print data[0]".
['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171' '7.25' '' 'S']

To see the last row, type "print data[-1]".
['891' '0' '3' 'Dooley, Mr. Patrick' 'male' '32' '0' '0' '370376' '7.75' '' 'Q']

To see the 4th column of the 1st row, type "print data[0,3]".
Braund, Mr. Owen Harris
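The three indexing forms above can be tried without the full file. This is a small sketch in Python 3 syntax, using only the two rows printed in this article as the array.

```python
import numpy as np

# The first and last rows shown above, stored as strings just like the
# output of the csv reader.
data = np.array([
    ['1', '0', '3', 'Braund, Mr. Owen Harris', 'male', '22', '1', '0',
     'A/5 21171', '7.25', '', 'S'],
    ['891', '0', '3', 'Dooley, Mr. Patrick', 'male', '32', '0', '0',
     '370376', '7.75', '', 'Q'],
])

first_row = data[0]     # index 0 is the first row
last_row = data[-1]     # negative indices count back from the end
name = data[0, 3]       # row 0, column 3 (i.e. the 4th column)

print(first_row)
print(last_row)
print(name)
```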

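I have my data, now I want to play with it
Now that we know how to pick out specific cells, we can use "data[0::,4]" to get the gender column. "0::" means everything (from start to end), and Python indices start from 0, not 1. Because the csv reader stores everything as strings by default, anything that needs numerical calculation must be converted to floats. For example, to convert the Pclass column to floats, use "data[0::,2].astype(np.float)". Using this, we can calculate the proportion of survivors on the Titanic as follows.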
# The size() function counts how many elements are in
# the array and sum() (as you would expect) sums up
# the elements in the array.

number_passengers = np.size(data[0::,1].astype(np.float))
number_survived = np.sum(data[0::,1].astype(np.float))
proportion_survivors = number_survived / number_passengers
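A note for newer setups: the np.float alias was removed in NumPy 1.24, so on a current NumPy the same conversion is written with the built-in float. The sketch below shows the calculation on four made-up rows (PassengerId, Survived, Pclass), which are not real Titanic data.

```python
import numpy as np

# Toy rows, made up for illustration: PassengerId, Survived, Pclass.
# Everything is a string, as it would be coming from the csv reader.
data = np.array([
    ['1', '0', '3'],
    ['2', '1', '1'],
    ['3', '1', '3'],
    ['4', '0', '2'],
])

# data[:, 1] selects the same column as data[0::, 1]
survived = data[:, 1].astype(float)   # built-in float replaces np.float

number_passengers = np.size(survived)
number_survived = np.sum(survived)
proportion_survivors = number_survived / number_passengers

print(proportion_survivors)
```

With two survivors out of four toy passengers, the proportion comes out as 0.5.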

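Numpy has some handy features. For example, we can search the gender column, find which elements equal "female" (or, for males, do not equal "female"), and use that to work out how many women and how many men survived.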
women_only_stats = data[0::,4] == "female" # This finds where all 
                                           # the elements in the gender
                                           # column that equals “female”
men_only_stats = data[0::,4] != "female"   # This finds where all the 
                                           # elements do not equal 
                                           # female (i.e. male)

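We use these two new variables as a "mask" on the original train.csv data to select only the women or only the men, and then calculate the proportion of each group that survived.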
# Using the index from above we select the females and males separately
women_onboard = data[women_only_stats,1].astype(np.float)     
men_onboard = data[men_only_stats,1].astype(np.float)

# Then we find the proportions of them that survived
proportion_women_survived = \
                       np.sum(women_onboard) / np.size(women_onboard)  
proportion_men_survived = \
                       np.sum(men_onboard) / np.size(men_onboard) 

# and then print it out
print 'Proportion of women who survived is %s' % proportion_women_survived
print 'Proportion of men who survived is %s' % proportion_men_survived
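To see what the mask actually does, it helps to run it on a handful of rows. The following sketch uses Python 3 syntax and six made-up passengers (not real data). Only the first five columns of the train.csv layout are kept, so that column 4 is still the gender column.

```python
import numpy as np

# Toy rows, made up for illustration:
# PassengerId, Survived, Pclass, Name, Sex
data = np.array([
    ['1', '0', '3', 'Passenger A', 'male'],
    ['2', '1', '1', 'Passenger B', 'female'],
    ['3', '1', '3', 'Passenger C', 'female'],
    ['4', '0', '2', 'Passenger D', 'male'],
    ['5', '1', '1', 'Passenger E', 'male'],
    ['6', '0', '3', 'Passenger F', 'female'],
])

# Comparing a column with a string gives one True/False per row.
women_mask = data[:, 4] == "female"
men_mask = data[:, 4] != "female"

# Indexing with the mask keeps only the rows where it is True.
women_onboard = data[women_mask, 1].astype(float)
men_onboard = data[men_mask, 1].astype(float)

proportion_women_survived = np.sum(women_onboard) / np.size(women_onboard)
proportion_men_survived = np.sum(men_onboard) / np.size(men_onboard)

print(women_mask)
print('Proportion of women who survived is %s' % proportion_women_survived)
print('Proportion of men who survived is %s' % proportion_men_survived)
```

In this toy data two of the three women and one of the three men survive, so the two proportions are about 0.67 and 0.33.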

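The training data shows that women were much more likely to survive.
Reading the test data and writing the gender model as a csv
As before, we now need to use Python to read the test.csv file and write the results to another csv file. First, we read the test.csv file and skip the header line.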
test_file = open('../csv/test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()

Next, open a new file (still empty) for writing. This is the file the results go into for uploading to Kaggle.
prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)

Read the test file row by row, and write 1 for "female" or 0 for "male" to the new file as the survival prediction.
prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       # For each row in test.csv
    if row[3] == 'female':         # is it a female, if yes then                                       
        prediction_file_object.writerow([row[0],'1'])    # predict 1
    else:                              # or else if male,       
        prediction_file_object.writerow([row[0],'0'])    # predict 0
test_file.close()
prediction_file.close()
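For Python 3.x, the read-and-write loop can be sketched as below. The two test rows are made up, only the first four columns of the test.csv layout are included (there is no Survived column, which is why gender sits at index 3), and io.StringIO objects stand in for the real files. The function name write_gender_predictions is also made up.

```python
import csv
import io

# Made-up rows in the test.csv column order (truncated after Sex).
TEST_CSV = (
    "PassengerId,Pclass,Name,Sex\n"
    '892,3,"Passenger A",male\n'
    '893,3,"Passenger B",female\n'
)


def write_gender_predictions(test_handle, out_handle):
    """Predict 1 for every female row and 0 otherwise; write a submission csv."""
    reader = csv.reader(test_handle)
    next(reader)                                  # skip the header line
    writer = csv.writer(out_handle)
    writer.writerow(["PassengerId", "Survived"])
    for row in reader:
        prediction = "1" if row[3] == "female" else "0"
        writer.writerow([row[0], prediction])


out = io.StringIO()
write_gender_predictions(io.StringIO(TEST_CSV), out)
print(out.getvalue())
```

With real files, the handles would come from open('../csv/test.csv', newline='') and open('genderbasedmodel.csv', 'w', newline=''), ideally inside a with block so both files are closed automatically.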

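We now have a "genderbasedmodel.csv" file to submit to Kaggle.
On the Data page (translator's note: where you download test.csv and so on), you can download a file named "gendermodel.py" that contains all of these steps. One advantage of using Python is that you can quickly run all of these steps again, for example if you get new training data.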




 **** Below is a copy of the article as it was at the time of translation (original article: https://www.kaggle.com/c/titanic/details/getting-started-with-python) *****



Getting Started With Python

Getting Started with Python: Kaggle's Titanic Competition

Recapping our work with Excel: we have been able to successfully download the competition data and submit 2 models: one based on just the gender and another on the gender, class, and price of ticket. This is good for an initial submission, however problems arise when we move to slightly more complicated models and the time taken to formulate approaches in Excel takes longer. What do you do if you want to make a more complicated model but don’t have the time to manually find the proportion of survivors for a given variable? We should make the computer do the hard work for us!
I want to add more variables but it takes so much time!
Programming scripts are a great way to speed up your calculations and avoid the arduous task of manually calculating the pivot table ratios. There are many languages out there, each with its own advantages and disadvantages. Here we are going to use Python version 2.7, which is an easy to use scripting language. If you do not have this installed, please visit the Python web site and follow the instructions there -- or, you could install one specific distribution of Python called Anaconda that already bundles the most useful libraries for data science. (Another advantage of Anaconda is that it includes iPython (Interactive Python) which makes the interface easier for stepping through lines of programming one by one.) 
NOTE in either case: if you use Python version 3.x, you may discover some Python syntax has changed in that version, which can cause errors on this tutorial as people point out in the forum.
When you have things installed, to begin just type  python , or  ipython , or ipython notebook.
One of the great advantages of Python is its packages. Of these packages, the most useful (for Kaggle competitions) are the Numpy, Scipy, Pandas, matplotlib and csv package. In order to check whether you have these, just go to your python command line and type  import numpy  (and so on). If you don’t you will need them! This tutorial is going to guide you through making the same submissions as before, only this time using Python.
Python: Reading in your train.csv
Python has a nice csv reader, which reads each line of a file into memory. You can read in each row and just append a list. From there, you can quickly turn it into an array. 
# The first thing to do is to import the relevant packages
# that I will need for my script, 
# these include the Numpy (for maths and arrays)
# and csv for reading and writing csv files
# If i want to use something from this I need to call 
# csv.[function] or np.[function] first

import csv as csv 
import numpy as np

# Open up the csv file in to a Python object
csv_file_object = csv.reader(open('../csv/train.csv', 'rb')) 
header = csv_file_object.next()  # The next() command just skips the 
                                 # first line which is a header
data=[]                          # Create a variable called 'data'.
for row in csv_file_object:      # Run through each row in the csv file,
    data.append(row)             # adding each row to the data variable
data = np.array(data)            # Then convert from a list to an array
                                 # Be aware that each item is currently
                                 # a string in this format
Although you've seen this data before in Excel, just to be sure let's look at how it is stored now in Python.  Type  print data  and the output should be something like
[['1' '0' '3' ..., '7.25' '' 'S']
 ['2' '1' '1' ..., '71.2833' 'C85' 'C']
 ['3' '1' '3' ..., '7.925' '' 'S']
 ..., 
 ['889' '0' '3' ..., '23.45' '' 'S']
 ['890' '1' '1' ..., '30' 'C148' 'C']
 ['891' '0' '3' ..., '7.75' '' 'Q']]
You can see this is an array with just values (no descriptive header). And you can see that each value is being shown in quotes, which means it is stored as a string. Unfortunately in the output above, the full set of columns is being obscured with "...," so let's print the first row to see it clearly.  Type  print data[0]
['1' '0' '3' 'Braund, Mr. Owen Harris' 'male' '22' '1' '0' 'A/5 21171' '7.25' '' 'S']
and to see the last row, type  print data[-1]
['891' '0' '3' 'Dooley, Mr. Patrick' 'male' '32' '0' '0' '370376' '7.75' '' 'Q']
and to see the 1st row, 4th column, type  print data[0,3]
Braund, Mr. Owen Harris
I have my data now I want to play with it
Now if you want to call a specific column of data, say, the gender column, you can just type data[0::,4], remembering that "0::" means all (from start to end), and Python starts indices from 0 (not 1). You should be aware that the csv reader works by default with strings, so you will need to convert to floats in order to do numerical calculations. For example, you can turn the Pclass variable into floats by using data[0::,2].astype(np.float). Using this, we can calculate the proportion of survivors on the Titanic: 
# The size() function counts how many elements are in
# the array and sum() (as you would expect) sums up
# the elements in the array.

number_passengers = np.size(data[0::,1].astype(np.float))
number_survived = np.sum(data[0::,1].astype(np.float))
proportion_survivors = number_survived / number_passengers
Numpy has some lovely functions. For example, we can search the gender column, find where any elements equal female (and for males, 'do not equal female'), and then use this to determine the number of females and males that survived: 
women_only_stats = data[0::,4] == "female" # This finds where all 
                                           # the elements in the gender
                                           # column that equals “female”
men_only_stats = data[0::,4] != "female"   # This finds where all the 
                                           # elements do not equal 
                                           # female (i.e. male)
We use these two new variables as a "mask" on our original train data, so we can select only those women, and only those men on board, then calculate the proportion of those who survived:
# Using the index from above we select the females and males separately
women_onboard = data[women_only_stats,1].astype(np.float)     
men_onboard = data[men_only_stats,1].astype(np.float)

# Then we find the proportions of them that survived
proportion_women_survived = \
                       np.sum(women_onboard) / np.size(women_onboard)  
proportion_men_survived = \
                       np.sum(men_onboard) / np.size(men_onboard) 

# and then print it out
print 'Proportion of women who survived is %s' % proportion_women_survived
print 'Proportion of men who survived is %s' % proportion_men_survived
Now that I have my indication that women were much more likely to survive, I am done with the training set.
Reading the test data and writing the gender model as a csv
As before, we need to read in the test file by opening a python object to read and another to write. First, we read in the test.csv file and skip the header line: 
test_file = open('../csv/test.csv', 'rb')
test_file_object = csv.reader(test_file)
header = test_file_object.next()
Now, let's open a pointer to a new file so we can write to it (this file does not exist yet). Call it something descriptive so that it is recognizable when we upload it:
prediction_file = open("genderbasedmodel.csv", "wb")
prediction_file_object = csv.writer(prediction_file)
We now want to read in the test file row by row, see if it is female or male, and write our survival prediction to a new file. 
prediction_file_object.writerow(["PassengerId", "Survived"])
for row in test_file_object:       # For each row in test.csv
    if row[3] == 'female':         # is it a female, if yes then                                       
        prediction_file_object.writerow([row[0],'1'])    # predict 1
    else:                              # or else if male,       
        prediction_file_object.writerow([row[0],'0'])    # predict 0
test_file.close()
prediction_file.close()
Now you have a file called 'genderbasedmodel.csv', which you can submit!
On the Data page you will find all of the steps above in a single python script named 'gendermodel.py'. One advantage of python is that you can quickly run all of the steps you did again in the future -- if you receive a new training file, for example.
