Thinking Notes: MCA (Correspondence Analysis) in Python

Doing correspondence analysis in Python.

I am running this on a Mac.

$ python3 plain_mca3.py cross_table_for_mca.csv

Contents of plain_mca3.py

import sys

# Import mca and pandas
import mca
import pandas as pd

# Read the CSV data
# index_col: column number to use as the row index (default: None)
df = pd.read_csv(sys.argv[1], index_col=0)

# Correspondence analysis
# ncol = df.shape[1]
# Benzécri correction (turned off here)
# mca_ben = mca.MCA(df, ncols=ncol, benzecri=False, TOL=1e-8)
mca_ben = mca.MCA(df, benzecri=False, TOL=1e-8)


# Write out the row scores (coordinates)
result_row = pd.DataFrame(mca_ben.fs_r(N=2))
result_row.index = list(df.index)
print("Rows:")
print(result_row)
print()

# Write out the column scores (coordinates)
result_col = pd.DataFrame(mca_ben.fs_c(N=2))
result_col.index = list(df.columns)
print("Columns:")
print(result_col)
print()



# Compute N (the number of components, i.e. eigenvalues):
# one less than the smaller of the number of columns and the number of rows
cnt_column = len(df.columns)
cnt_index = len(df.index)

if cnt_column >= cnt_index:
    cnt_eigenvalue = cnt_index - 1
else:
    cnt_eigenvalue = cnt_column - 1


# Eigenvalues and contribution ratios (explained variance of the eigenvectors)
data = {'value': pd.Series(mca_ben.L),
        'ratio': mca_ben.expl_var(greenacre=False, N=cnt_eigenvalue)}
columns = ['value', 'ratio']
table2 = pd.DataFrame(data=data, columns=columns).fillna(0)
table2.index += 1
table2.loc['Σ'] = table2.sum()
table2.index.name = 'Factor'
print("Principal inertias(eigenvalues):")
print(table2)
print()



# Plotting libraries
import matplotlib.pyplot as plt
import matplotlib
# import random as rnd  # used when placing labels

# To display plots inside Jupyter, put %matplotlib inline at the top of the program.
# Plots are then shown inline (otherwise a separate window opens).
# %matplotlib inline


# Set the figure size
plt.rcParams["figure.figsize"] = [7, 7]

fig, ax = plt.subplots()

# print(matplotlib.colors.cnames)  # check the available colors

# Plot the column headings
result_col.plot(0, 1, kind='scatter', ax=ax, color='C0', s=20, marker="o")
for k, v in result_col.iterrows():
    ax.annotate(k, v)

# Plot the row headings
result_row.plot(0, 1, kind='scatter', ax=ax, color='#FFA500', s=20, marker='.')
for k, v in result_row.iterrows():
    ax.annotate(k, v)

# plt.rcParams['font.family'] = 'IPAexGothic'  # set the font for the whole figure
# plt.rcParams['font.size'] = 12  # set the font size (default: 12)
# plt.rcParams['xtick.labelsize'] = 10  # font size of the x-axis tick labels
# plt.rcParams['ytick.labelsize'] = 10  # font size of the y-axis tick labels
# matplotlib.font_manager._rebuild()

# Axis lines through zero and axis labels
plt.axhline(0, color='gray')
plt.axvline(0, color='gray')
plt.xlabel('Factor 1')
plt.ylabel('Factor 2')

# Optional (figure settings)
# plt.figure(figsize=(4,4))  # figure settings
# plt.rcParams["font.size"] = 10  # some setting or other


# Show the figure
plt.show()
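
For reference, the scores and eigenvalues that the script prints can be cross-checked without the mca package. The following is a minimal sketch of simple correspondence analysis via SVD, using numpy only. It rests on my assumption that calling mca.MCA(df, benzecri=False) on a cross table amounts to ordinary correspondence analysis; the function name simple_ca is hypothetical. The sign of each axis may come out flipped relative to another implementation, because an SVD determines each axis only up to sign.

```python
import numpy as np


def simple_ca(table):
    """Simple correspondence analysis of a contingency table (hypothetical helper)."""
    X = np.asarray(table, dtype=float)
    P = X / X.sum()                 # correspondence matrix (relative frequencies)
    r = P.sum(axis=1)               # row masses
    c = P.sum(axis=0)               # column masses
    expected = np.outer(r, c)       # proportions expected under independence
    S = (P - expected) / np.sqrt(expected)   # standardized residuals
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    eigenvalues = s ** 2            # principal inertias
    row_coords = (U * s) / np.sqrt(r)[:, None]      # principal row coordinates
    col_coords = (Vt.T * s) / np.sqrt(c)[:, None]   # principal column coordinates
    return eigenvalues, row_coords, col_coords


# First three rows of cross_table_for_mca.csv, just to exercise the function
demo = [[80, 30, 50, 90, 70],
        [20, 80, 60, 30, 50],
        [20, 90, 70, 10, 60]]
eigenvalues, rows, cols = simple_ca(demo)
print(eigenvalues / eigenvalues.sum())   # contribution ratio of each axis
```

The sum of the eigenvalues equals the chi-square statistic of the table divided by the grand total, which gives an independent way to sanity-check the total inertia.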

----

Contents of cross_table_for_mca.csv

----

,yamamoto,instant,twice,facecare,linestone
strong,80,30,50,90,70
beautiful,20,80,60,30,50
cute,20,90,70,10,60
clever,60,20,50,70,40
big,80,10,50,90,40
smart,60,30,20,70,40
charming,30,60,80,10,90
lovely,40,90,70,50,60
fresh,10,80,40,20,30
traditional,70,10,20,50,40

----

Execution result

----

Rows:
                    0         1
strong      -0.345675 -0.033748
beautiful    0.418075  0.078905
cute         0.584282 -0.040474
clever      -0.356028 -0.006797
big         -0.541367  0.010732
smart       -0.401739  0.149244
charming     0.390651 -0.354082
lovely       0.244705  0.090889
fresh        0.596563  0.282539
traditional -0.550051 -0.081841

Columns:
                  0         1
yamamoto  -0.514181 -0.033323
instant    0.624395  0.192199
twice      0.251629 -0.113170
facecare  -0.523083  0.164357
linestone  0.110477 -0.198570

Principal inertias(eigenvalues):
           value     ratio
Factor
1       0.197552  0.856996
2       0.023801  0.103250
3       0.007040  0.030540
4       0.002124  0.009214
Σ       0.230517  1.000000
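
The ratio column in the result is each eigenvalue divided by the sum of all eigenvalues. A minimal check in plain Python, using the four values printed in the result (the variable names are only for illustration):

```python
# Eigenvalues copied from the "value" column of the execution result
eigenvalues = [0.197552, 0.023801, 0.007040, 0.002124]

total = sum(eigenvalues)                      # total inertia
ratios = [ev / total for ev in eigenvalues]   # contribution ratio per factor

cumulative = 0.0
for factor, (ev, ratio) in enumerate(zip(eigenvalues, ratios), start=1):
    cumulative += ratio
    print(f"Factor {factor}: value={ev:.6f} ratio={ratio:.6f} cumulative={cumulative:.6f}")
```

Factors 1 and 2 together account for about 96% of the total inertia, which is why a two-dimensional plot is a reasonable summary of this table.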

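If the cross table is about the size of this CSV, the code above produces proper results.

Depending on the data, however, the output does not always come out properly.

In that case, the TOL threshold in the following line needs to be loosened.

mca_ben = mca.MCA(df, benzecri=False, TOL=1e-8)

As for the value I finally settled on: I think I either removed this argument altogether or used more decimal places. Probably...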
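
To make the role of a tolerance such as TOL concrete: a common pattern is to treat eigenvalues below the tolerance as numerically zero and drop them. The sketch below illustrates only that general idea. That mca uses TOL in a comparable way is an assumption on my part, and keep_significant is a hypothetical helper, not part of mca.

```python
def keep_significant(eigenvalues, tol=1e-8):
    """Return the eigenvalues that are at least `tol` (hypothetical helper)."""
    return [ev for ev in eigenvalues if ev >= tol]


# Eigenvalues from the result above, plus one made-up value standing in for rounding noise
values = [0.197552, 0.023801, 0.007040, 0.002124, 3e-17]

print(len(keep_significant(values, tol=1e-8)))   # only the noise value is dropped
print(len(keep_significant(values, tol=1e-2)))   # a large tolerance also drops real factors
```

If the output looks wrong for some table, comparing the number of retained factors against min(rows, columns) - 1 is one way to see whether the tolerance is the culprit.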

References and notes made along the way follow

Correspondence analysis of Google Analytics data in Python

https://www.monotalk.xyz/blog/google-analytics-のテータを-python-てコレスホンテンス分析する/

Let's analyze the taste of wine and ruin the happy couples' Christmas dinners

https://qiita.com/nabesaan/items/f88bbacdd4f9217cd802

Reading CSV files and text files with Pandas

https://pythondatascience.plavox.info/pandas/csv%E3%83%95%E3%82%A1%E3%82%A4%E3%83%AB%E3%81%AE%E8%AA%AD%E3%81%BF%E8%BE%BC%E3%81%BF

[Python] How to use pip

https://www.task-notes.com/entry/20150810/1439175600

Python Matplotlib pyplot

http://www.ne.jp/asahi/hishidama/home/tech/python/matplotlib/pyplot.html

pyplot is a global object that can draw graphs.

Something like an explanation of MCA contribution ratios

http://vxy10.github.io/2016/06/10/intro-MCA/

Building a student intervention system: MCA for dimensionality reduction

http://vxy10.github.io/2016/06/24/si-mca/

Annotate Explain

https://matplotlib.org/gallery/userdemo/annotate_explain.html

mca prince

https://github.com/MaxHalford/prince#multiple-correspondence-analysis-mca

The default blue for graphs (Matplotlib, Mathematica, Matlab)

http://cyanatlas.hatenablog.com/entry/2018/01/17/180000

Counting the occurrences of each element of a list with Python's Counter

https://note.nkmk.me/python-collections-counter/

Drawing nice labeled scatter plots in R and Python

It shifts the labels as needed so that the plotted points and their labels do not overlap


https://upura.hatenablog.com/entry/2018/07/05/181500

Attaching text to each element of a scatter plot.

http://nekoyukimmm.hatenablog.com/entry/2015/10/08/224607

mca usage

https://github.com/esafak/mca/blob/master/docs/usage.rst

Fixing garbled Japanese text in matplotlib (Windows edition)

https://datumstudio.jp/blog/matplotlib%E3%81%AE%E6%97%A5%E6%9C%AC%E8%AA%9E%E6%96%87%E5%AD%97%E5%8C%96%E3%81%91%E3%82%92%E8%A7%A3%E6%B6%88%E3%81%99%E3%82%8Bwindows%E7%B7%A8

On Windows 10

I put ipaexg.ttf here:

C:/users/[user_name]/AppData/Local/Continuum/anaconda3/Lib/site-packages/matplotlib/mpl-data/fonts

In

C:/users/[user_name]/AppData/Local/Continuum/anaconda3/Lib/site-packages/matplotlib/mpl-data/matplotlibrc

I added the line

font.family : IPAexGothic

and saved the file.

Then delete the following:

C:/users/[user_name]/.matplotlib/tex.cache

C:/users/[user_name]/.matplotlib/font_List.json

Toggling the display of hidden files and folders on a Mac

command+shift+.(dot)

https://qiita.com/TsukasaHasegawa/items/fa8e783a556dc1a08f51

Fixing garbled text in graphs drawn with matplotlib

https://qiita.com/hatunina/items/a77128c7f50b19ad2c51

https://gcbgarden.com/2017/05/04/matplotlib-japanese/

http://akiyoko.hatenablog.jp/entry/2017/04/11/080446

https://openbook4.me/sections/1674

I put ipaexg.ttf here:

/usr/local/lib/python3.6/site-packages/matplotlib/mpl-data/fonts

I copied

/usr/local/lib/python3.6/site-packages/matplotlib/matplotlibrc

into /Users/[user_name]/.matplotlib and edited the copy so that it contains

font.family : IPAexGothic

Then delete fontlist-v300.json, tex.cache and fontList.py3k.cache, which are in the same folder.
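
The cache cleanup described above can also be scripted. This is a minimal sketch using only the standard library. The helper name clear_font_cache and the glob patterns are my own choices, based on the file names mentioned in these notes; cache file names differ between matplotlib versions, so treat the patterns as an assumption.

```python
import shutil
from pathlib import Path

# Patterns based on the cache names mentioned in these notes (an assumption;
# other matplotlib versions may use different names).
CACHE_PATTERNS = ["fontlist-*.json", "fontList*.cache", "tex.cache"]


def clear_font_cache(config_dir):
    """Delete matplotlib font cache entries in config_dir and return the removed paths."""
    removed = []
    for pattern in CACHE_PATTERNS:
        for path in Path(config_dir).glob(pattern):
            if path.is_dir():
                shutil.rmtree(path)   # tex.cache is a directory
            else:
                path.unlink()
            removed.append(path)
    return removed


# Example call for the Mac setup in these notes (commented out because it deletes files):
# clear_font_cache(Path.home() / ".matplotlib")
```

matplotlib rebuilds the font cache the next time it is imported, so the new font.family setting should be picked up after this.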

----

# Start the terminal

# cd to the load folder (go to the directory that contains the CSV to read)

# Check that the tools are installed

MacBook:load [user_name]$ which brew

/usr/local/bin/brew

MacBook:load [user_name]$ which python3

/usr/local/bin/python3

MacBook:load [user_name]$ which pip3

/usr/local/bin/pip3

# Install mca

MacBook:load [user_name]$ pip3 install mca
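
Besides which, it can be worth confirming from inside Python that the packages are importable by the same interpreter that python3 starts. A minimal sketch with the standard library (missing_modules is a hypothetical helper name):

```python
import importlib.util


def missing_modules(names):
    """Return the names from `names` that cannot be imported by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Packages used in these notes; an empty list means all of them are importable
print(missing_modules(["mca", "pandas", "matplotlib"]))
```

If mca shows up in the list even after pip3 install mca, pip3 and python3 are probably pointing at different Python installations.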

# Start Python

MacBook:load [user_name]$ python3

# Import mca

>>> import mca

# Import pandas under the name pd

>>> import pandas as pd

# Read the CSV data

## index_col: column number to use as the row index (default: None)

>>> df = pd.read_csv("cross_table_for_mca.csv", index_col=0)

# Check the contents

>>> df.head(5)

# Correspondence analysis

>>> ncol = df.shape[1]

# Benzécri correction on or off? Perhaps because of the data structure, True raised an error, so I set it to False

>>> mca_ben = mca.MCA(df, ncols=ncol, benzecri=False)

>>> mca_ben.fs_r(N=2)
array([[-0.34567532, -0.0337482 ],
       [ 0.41807479,  0.07890458],
       [ 0.58428204, -0.04047382],
       [-0.35602754, -0.00679698],
       [-0.54136724,  0.01073183],
       [-0.40173869,  0.14924389],
       [ 0.39065106, -0.35408174],
       [ 0.24470519,  0.09088925],
       [ 0.59656262,  0.28253887],
       [-0.55005132, -0.08184098]])

>>> mca_ben.fs_c(N=2)
array([[-0.51418126, -0.03332318],
       [ 0.62439464,  0.19219933],
       [ 0.25162872, -0.11316976],
       [-0.52308313,  0.16435747],
       [ 0.11047684, -0.19856991]])

# Write out the row coordinates

>>> result_row = pd.DataFrame(mca_ben.fs_r(N=2))

>>> result_row.index = list(df.index)

>>> result_row
                    0         1
strong      -0.345675 -0.033748
beautiful    0.418075  0.078905
cute         0.584282 -0.040474
clever      -0.356028 -0.006797
big         -0.541367  0.010732
smart       -0.401739  0.149244
charming     0.390651 -0.354082
lovely       0.244705  0.090889
fresh        0.596563  0.282539
traditional -0.550051 -0.081841

# Optional: name the columns (either line works). The plotting code further down refers to columns 0 and 1, so it would need adjusting if this is applied.

# result_row.columns = ['dim1', 'dim2']

# result_row = result_row.rename(columns={0: 'dim1', 1: 'dim2'})

# Write out the column coordinates

>>> result_col = pd.DataFrame(mca_ben.fs_c(N=2))

>>> result_col.index = list(df.columns)

>>> result_col
                  0         1
yamamoto  -0.514181 -0.033323
instant    0.624395  0.192199
twice      0.251629 -0.113170
facecare  -0.523083  0.164357
linestone  0.110477 -0.198570

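# Plotting libraries

>>> import matplotlib.pyplot as plt

>>> import matplotlib

# To display plots inside Jupyter, put %matplotlib inline at the top of the program.

# Plots are then shown inline (otherwise a separate window opens).

# Note: this is an IPython/Jupyter magic command; the plain python3 REPL rejects it as a syntax error.

>>> %matplotlib inline

# Optional (I do not have this font, so probably unnecessary)

>>> plt.rcParams['font.family'] = 'IPAPGothic'  # set the font for the whole figure

# Set the figure size

# Better to set this (the figure can otherwise come out too big)

>>> plt.rcParams["figure.figsize"] = [7, 7]

# Probably unnecessary

>>> plt.rcParams['font.size'] = 12  # set the font size (default: 12)

>>> plt.rcParams['xtick.labelsize'] = 10  # font size of the x-axis tick labels

>>> plt.rcParams['ytick.labelsize'] = 10  # font size of the y-axis tick labels

>>> matplotlib.font_manager._rebuild()  # private function; may not exist in newer matplotlib versions

# I do not really understand this one (it is used below to nudge the label positions randomly)

import random as rnd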

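# Optional (figure settings). Create the figure first so that the lines and labels below are drawn on it.

>>> plt.figure(figsize=(4,4))

>>> plt.rcParams["font.size"] = 10

# Axis lines through zero and axis labels

plt.axhline(0, color='gray')

plt.axvline(0, color='gray')

plt.xlabel('Factor 1')

plt.ylabel('Factor 2')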

# Plot the column headings

## scatter = scatter plot

>>> plt.scatter(result_col[0], result_col[1], s=20, marker="o")

# Add labels

# The * 0.01 part seems to control how far each label is offset from its point

# Label each point and draw a short line from the label to the point
cnt = 0
for label in list(result_col.index):
    r = rnd.random() * 0.01
    plt.text(result_col.iloc[cnt, 0]+r, result_col.iloc[cnt, 1]+r, label)
    plt.plot([result_col.iloc[cnt, 0]+r, result_col.iloc[cnt, 0]], [result_col.iloc[cnt, 1]+r, result_col.iloc[cnt, 1]])
    cnt += 1

# Label each point again, this time in color C0 and without the connecting line
cnt = 0
for label in list(result_col.index):
    r = rnd.random() * 0.01
    plt.text(result_col.iloc[cnt, 0]+r, result_col.iloc[cnt, 1]+r, label, color='C0')
    cnt += 1

# Plot the row headings

plt.scatter(result_row[0], result_row[1], s=20, marker='.')

# Add labels

# Label each point and draw a short line from the label to the point
cnt = 0
for label in list(result_row.index):
    r = rnd.random() * 0.01
    plt.text(result_row.iloc[cnt, 0]+r, result_row.iloc[cnt, 1]+r, label)
    plt.plot([result_row.iloc[cnt, 0]+r, result_row.iloc[cnt, 0]], [result_row.iloc[cnt, 1]+r, result_row.iloc[cnt, 1]])
    cnt += 1

# Label each point again, this time in black and without the connecting line
cnt = 0
for label in list(result_row.index):
    r = rnd.random() * 0.01
    plt.text(result_row.iloc[cnt, 0]+r, result_row.iloc[cnt, 1]+r, label, color='k')
    cnt += 1

# adjustText

from adjustText import adjust_text

# Show the figure

plt.show()
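
As an alternative to the random offsets used in the labeling loops, matplotlib's annotate can place each label at a fixed offset from its point, measured in points rather than in data units. A minimal sketch, using three of the column coordinates from the result above; the offset of (5, 5) points is an arbitrary choice, and the Agg backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Three of the column coordinates from the result above
points = {
    "yamamoto": (-0.514181, -0.033323),
    "instant": (0.624395, 0.192199),
    "twice": (0.251629, -0.113170),
}

fig, ax = plt.subplots(figsize=(4, 4))
ax.scatter([x for x, _ in points.values()], [y for _, y in points.values()], s=20)
for label, (x, y) in points.items():
    # textcoords="offset points" makes xytext an offset from xy, in points
    ax.annotate(label, xy=(x, y), xytext=(5, 5), textcoords="offset points")
ax.axhline(0, color="gray")
ax.axvline(0, color="gray")
```

Because the offset is fixed, the figure looks the same on every run. It does not resolve overlaps between labels that sit close together.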

# Eigenvalues (inertia) and contribution ratios (explained variance of the eigenvectors)

# N (the number of components, i.e. eigenvalues) is one less than the smaller of the number of columns and the number of rows

data = {'Iλ': pd.Series(mca_ben.L),
        'τI': mca_ben.expl_var(greenacre=False, N=4)}

columns = ['Iλ', 'τI']

table2 = pd.DataFrame(data=data, columns=columns).fillna(0)

table2.index += 1

table2.loc['Σ'] = table2.sum()

table2.index.name = 'Factor'

table2

----

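I will probably need

import sys

at the top.

tname = sys.argv[1]

lname = sys.argv[2]

[1] the CSV to read

[2] N

It would be good to be able to pass these as arguments,

and even nicer if N could be derived by calculation.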
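
One way to take these two arguments is argparse from the standard library, with N made optional so that it can be derived when omitted. This is only a sketch; the argument names csv_path and n are my own, not something the script above uses.

```python
import argparse


def parse_args(argv=None):
    """Parse the CSV path and an optional number of components."""
    parser = argparse.ArgumentParser(description="Correspondence analysis of a cross table")
    parser.add_argument("csv_path", help="CSV file to read")
    parser.add_argument("n", nargs="?", type=int, default=None,
                        help="number of components; derived from the table if omitted")
    return parser.parse_args(argv)


args = parse_args(["cross_table_for_mca.csv", "4"])
print(args.csv_path, args.n)
```

Calling parse_args() with no list reads sys.argv, so the same function works both from the command line and from a test.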

# It would be nice if N could be derived by calculation

cnt_column = len(list(df.columns))

cnt_index = len(list(df.index))

print(cnt_column)

print(cnt_index)

if cnt_column >= cnt_index:
    cnt_eigenvalue = cnt_index - 1
else:
    cnt_eigenvalue = cnt_column - 1

cnt_eigenvalue

print(cnt_eigenvalue)

len(list(df.columns)) > len(list(df.index))

then

>>> print(len(list(df.columns)))

>>> print(len(list(df.index)))
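
The same count can be taken straight from the CSV text without loading pandas, for example to validate an N passed on the command line. A minimal sketch with the standard library; count_components is a hypothetical helper and it assumes the layout used here (one header row and one label column).

```python
import csv
import io


def count_components(csv_text):
    """Return min(rows, columns) - 1 for a cross table given as CSV text."""
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]  # skip blank lines
    n_columns = len(rows[0]) - 1   # header row minus the label column
    n_rows = len(rows) - 1         # all rows minus the header row
    return min(n_rows, n_columns) - 1


# First three data rows of cross_table_for_mca.csv
sample = """,yamamoto,instant,twice,facecare,linestone
strong,80,30,50,90,70
beautiful,20,80,60,30,50
cute,20,90,70,10,60
"""
print(count_components(sample))
```

For the full 10-row, 5-column table this gives 4, matching the four factors in the result.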

# I want to convert float to int

df = pd.read_csv("cross_table_for_mca_float.csv",index_col=0)

>>> df.dtypes

yamamoto float64

instant float64

twice float64

facecare float64

linestone float64

dtype: object

# Convert to int by specifying a column name

print(df.astype({'yamamoto': int}))

# Selecting the second column onward

df.iloc[:, 1:]

print(df.astype({col: int for col in df.columns[1:]}))


Saturday, December 29, 2018
