SageMaker

SageMaker用のコードをローカルで動かす – scikit-learnの決定木でアヤメの種類を分類

投稿日 2021年05月23日

SageMakerはローカルで使うことができるので、それを試してみた。
この記事を書くにあたって以下の公式の記事を参考にしています。
オンプレミス環境から Amazon SageMaker を利用する

機械学習のHelloWorld

アヤメデータをschikit-learnの決定木分類器で学習して種類を予測する.
結構いろいろなところで機械学習のHelloWorldとして使われている例題を題材にしていく.
sagemaker-python-sdk/scikit_learn_iris

既に SageMaker用のサンプルコードがあるので、
これをローカルで学習・推論できるように修正していく。

構成は以下の通り.
公式のブログの通り、SageMaker Notebook用に書かれた.ipynb を Local用に微修正するだけで動く。

SageMaker Notebookで動かすための.ipynb
.ipynbから呼ぶschikit-learnコード

SageMakerのサンプルはSageMakerのJupyterNotebookで動くように書かれているがが、
ちょっと修正するだけでローカルで動くようになる様子。(1つしか試してないけど)

前準備

学習と推論をローカルで行うが、そのために裏でDockerのコンテナが走る。
ローカルコンピュータ用にDockerをインストールしておく必要がある。

CredentialsとIAM

以下が必要。

AmazonSageMakerFullAccess 権限をもった IAM ユーザの Credential
AmazonSageMakerFullAccess の IAM ロール

ローカルコードからAWSリソースにアクセスするために aws configure を使って設定する。
Credentialsが書かれたcsvをダウンロードし aws configure の応答に答えていく。


$ pip install awscli --upgrade --user
$ aws configure
AWS Access Key ID [None]: ******************
AWS Secret Access Key [None]: ************************************
Default region name [None]: ap-northeast-1
Default output format [None]: json

SageMaker PythonSDKインストール

SageMaker PythonSDKをインストールする。
実行するコードに応じてSDKのバージョンを指定することができる。


$ pip install -U sagemaker >=2.15

バージョンを指定しない場合は以下の通り。


$ pip install sagemaker

ローカルのJupyter Notebookでファイルを修正

scikit_learn_estimator_example_with_batch_transform.ipynb を
ローカルのJupyter Notebookで修正していく。

SageMaker ローカルSessionを開始

SageMakerを想定したコードは以下。


# S3 prefix
prefix = "Scikit-iris"

import sagemaker
from sagemaker import get_execution_role

sagemaker_session = sagemaker.Session()

# Get a SageMaker-compatible role used by this Notebook Instance.
role = get_execution_role()

それをローカルで動かすために以下のように修正する
localSession()というセッションが用意されているのでそれを使用する。
ローカルでは get_execution_role()ではロールを取得できないので直接ロールのARNを指定する。


# S3 prefix
prefix = "Scikit-iris"

import sagemaker
from sagemaker import get_execution_role

# LocalSession()を使用する
sagemaker_session = sagemaker.local.LocalSession() # sagemaker.Session()から変更

# Get a SageMaker-compatible role used by this Notebook Instance.
# ローカルでは get_execution_role()は使えない。直接ロールのARNを指定する。
# role = get_execution_role()
role = 'arn:aws:iam::(12桁のAWSアカウントID):role/(ロール名)'

学習用データの準備 (変更なし)

学習用データが巨大であればS3にデータを準備する(と書かれている).
アヤメデータは軽量なので、ローカルファイルに保存する。


import numpy as np
import os
from sklearn import datasets

# Load Iris dataset, then join labels and features
iris = datasets.load_iris()
joined_iris = np.insert(iris.data, 0, iris.target, axis=1)

# Create directory and write csv
os.makedirs("./data", exist_ok=True)
np.savetxt("./data/iris.csv", joined_iris, delimiter=",", fmt="%1.1f, %1.3f, %1.3f, %1.3f, %1.3f")

その後、用意したローカルデータをSageMaker Python SDKに食わせる。


WORK_DIRECTORY = "data"

train_input = sagemaker_session.upload_data(
    WORK_DIRECTORY, key_prefix="{}/{}".format(prefix, WORK_DIRECTORY)
)

Scikit learn Estimator

scikit-learnの機械学習は以下の3段構成になっている。

Estimator: 与えられたデータから学習(fit)する
Transformer: 与えられたデータを変換(transform)する
Predictor: 与えられたデータから結果を予測(Predict)する

SageMakerは機械学習プラットフォームであって、かなり多くのライブラリや手法がサポートされている。
その中で、scikit-learnもサポートされていて、SKLearn Estimatorとして使用できる。
要は、schikit-learn のI/Fに準じたコードを SageMaker に内包することができる。
SKLearn Estimatorに scikit-learn コードを食わせると SageMakerから SKLearn インスタンスとして操作できる.

例えば、.ipynbで以下のように書く.


from sagemaker.sklearn.estimator import SKLearn

FRAMEWORK_VERSION = "0.23-1"
script_path = "scikit_learn_iris.py"

sklearn = SKLearn(
    entry_point=script_path,
    framework_version=FRAMEWORK_VERSION,
    instance_type="local",
    role=role,
    sagemaker_session=sagemaker_session,
    hyperparameters={"max_leaf_nodes": 30},
)

SKLearnにentry_pointとして渡しているのがscikit-learnのコード本体。
内容は以下。普通の決定木分類のコードにSageMakerとのIFに関わるコードが追加されている。

実行時引数として、SM_MODEL_DIR、SM_OUTPUT_DATA_DIR、SM_CHANNEL_TRAINが渡される。
fitで学習した結果(つまり係数)をシリアライズしSM_MODEL_DIRに保存する。

model_fnでは、SM_MODEL_DIRにシリアライズされた係数をデシリアライズし、
scikit-learnの決定木分類木オブジェクトを返す。


#  Copyright 2018 Amazon.com, Inc. or its affiliates. All Rights Reserved.
#
#  Licensed under the Apache License, Version 2.0 (the "License").
#  You may not use this file except in compliance with the License.
#  A copy of the License is located at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
#  or in the "license" file accompanying this file. This file is distributed
#  on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either
#  express or implied. See the License for the specific language governing
#  permissions and limitations under the License.

from __future__ import print_function

import argparse
import os

import joblib
import pandas as pd
from sklearn import tree

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    # Hyperparameters are described here. In this simple example we are just including one hyperparameter.
    parser.add_argument("--max_leaf_nodes", type=int, default=-1)

    # Sagemaker specific arguments. Defaults are set in the environment variables.
    parser.add_argument("--output-data-dir", type=str, default=os.environ["SM_OUTPUT_DATA_DIR"])
    parser.add_argument("--model-dir", type=str, default=os.environ["SM_MODEL_DIR"])
    parser.add_argument("--train", type=str, default=os.environ["SM_CHANNEL_TRAIN"])

    args = parser.parse_args()

    # Take the set of files and read them all into a single pandas dataframe
    input_files = [os.path.join(args.train, file) for file in os.listdir(args.train)]
    if len(input_files) == 0:
        raise ValueError(
            (
                "There are no files in {}.\n"
                + "This usually indicates that the channel ({}) was incorrectly specified,\n"
                + "the data specification in S3 was incorrectly specified or the role specified\n"
                + "does not have permission to access the data."
            ).format(args.train, "train")
        )
    raw_data = [pd.read_csv(file, header=None, engine="python") for file in input_files]
    train_data = pd.concat(raw_data)

    # labels are in the first column
    train_y = train_data.iloc[:, 0]
    train_X = train_data.iloc[:, 1:]

    # Here we support a single hyperparameter, 'max_leaf_nodes'. Note that you can add as many
    # as your training my require in the ArgumentParser above.
    max_leaf_nodes = args.max_leaf_nodes

    # Now use scikit-learn's decision tree classifier to train the model.
    clf = tree.DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    clf = clf.fit(train_X, train_y)

    # Print the coefficients of the trained classifier, and save the coefficients
    joblib.dump(clf, os.path.join(args.model_dir, "model.joblib"))


def model_fn(model_dir):
    """Deserialized and return fitted model

    Note that this should have the same name as the serialized model in the main method
    """
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf

学習

SageMaker(またはローカル)の.ipynb は、SKLearnインスタンスに対して fit() を実行するだけで良い。


sklearn.fit({"train": train_input})

推論

なんと、推論はWebインターフェースになっている。
推論コンテナ内でnginxが動作し、PythonWebAppが wsgi(gunicorn) を介してnginxから入力/応答する。

SageMaker(またはローカル)の.ipynbからは、SKLearnインスタンスに対してdeploy()を実行する。
推論コンテナへのインターフェースとなるインスタンスが生成され、後はこのインスタンスに対してpredict()を呼ぶ。

推論用にデータを集めて predict()を実行する例。
テストデータと推論の結果を並べて表示している。うまくいっていれば同じになるはず。
(こんな風に訓練データとテストデータを拾って良いのかはさておき…)


predictor = sklearn.deploy(initial_instance_count=1, instance_type="local")

import itertools
import pandas as pd

shape = pd.read_csv("data/iris.csv", header=None)

a = [50 * i for i in range(3)]
b = [40 + i for i in range(10)]
indices = [i + j for i, j in itertools.product(a, b)]

test_data = shape.iloc[indices[:-1]]
test_X = test_data.iloc[:, 1:]
test_y = test_data.iloc[:, 0]

print(predictor.predict(test_X.values))
print(test_y.values)

/invocationsというURLに対してPOSTリクエストが発行されている。
応答は以下の通り、
テストデータの説明変数と、predict()の結果得られた値が一致していそう。


hqy7i6eoyi-algo-1-vq8df | 2021-05-22 16:26:50,617 INFO - sagemaker-containers - No GPUs detected (normal if no gpus installed)
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2.
 2. 2. 2. 2. 2.]hqy7i6eoyi-algo-1-vq8df | 172.23.0.1 - - [22/May/2021:16:26:51 +0000] "POST /invocations HTTP/1.1" 200 360 "-" "python-urllib3/1.26.4"

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 2. 2. 2. 2.
 2. 2. 2. 2. 2.]

まとめ

SageMaker用の機械学習のHelloWorldをローカルで動かしてみた。
（かなり雑だけども), SageMakerのサンプルコードをちょっと修正するだけでローカルで動くことがわかった。

MacでDockerお砂場を作る

docker on vagrant Docker for Macが遅いので,Webのコード書くのはvagrant+ansibleで済ませていたのだけれども, vagrant上にdockerを立てることでLinux上で走らせるのと同レベルの速度を得られるようなので, docker on vagrantを立ててdockerに入門してみる。 Web開発していない人から見ると,なんでdocker使わないの? ってことなのだが, 気持ちは以下の通り. 工夫無しだと1度試したら2度とやりたくないレベルで使い物にならない. Docker For Macが遅い：対策の実験 Macのdockerが遅いストレスから解放されよう ansibleで自動化してたから手間に気づかなかったけど, 公式がイメージ配ってるんだから,手間なしなのはdocker. 本番環境と開発環境を揃えにくいかなとも思うけど, AWS ECS的なものもあって,そもそもdockerだけで完結する世界が主流になりそうな感. docker on vagrantで遅さを回避できるので, 言い訳してないでアップデートしていく.. なぜ遅いのか Docker for Macは, AlpineLinuxベースのHyperkitVMの上で透過的にContainerを扱う. Mac上でDockerコマンドを実行すると,透過的にHyperkitVM上に反映される. Macの上で直接Containerが動いているのではなくこのVMの上でContainerを動かしている. HostとGuestのファイルシステムをマウントする観点ではvagrantも同じで, 実際,VagrantfileでマウントオプションとしてNFSを指定したとしてもNativeよりかなり遅い. ファイルシステムの対称性がある分,vagrantはHostとGuestをSyncする方法に手を入れやすく, HostとGuestのファイルシステムをNativeと同等レベルの速度でSyncする手段を導入できる. この仕組みによると, vagrantの上でDockerを動かすパフォーマンスがNative並に速くなる. Mutagen HostとGuestのファイルシステムをSyncするためにMutagenを利用する. Overviewによると,Mutagenは双方向の同期ツールで,低レイテンシをうたっている. もともとHostToGuestというよりはLocalToCloudの統合を目指している感じ. キーボードを打ってから反映されるまでのラグが気になる, とかOSが違う場合のアレコレ(permissionとかシンボリックリンックとか),とか, そういうところの解消を目指している様子. エージェントレス,TCP,でリモートへの導入コストが少ないのが良い. VagrantにはMutagen over SSHの形を取る. インストール手順こちらが参考になりました。わかりやすく,手順通りやれば10分かかりません. Vagrantを使う「Mac最速のDocker環境」を初心者向けに解説【遅いMac for Dockerを卒業】とりあえず箱だけ作った.Laravelだと分かりづらいのでRailsで試したところ激しく速かった. 副次的な効果として、Macを汚さないのでお砂場にぴったり.

Ward距離を使った階層型クラスタリング (お試し実行なし)

[mathjax] 階層型クラスタリングについてわかりやすい解説を聞いたので頭の中にあるものを書き出してみます。 (せっかく聞いたシャープな解説が台無しになってしまっているかと思いますが...) 本当はもっとアクティブなアウトプット方法があれば良いのですが、今のところはブログしかないのと、 WordPressのエントリという単位が、アウトプットの量として個人的に丁度良いなと思いますので。初学なので解釈の誤りとか検討不足があるかと思いますが、本日も書いていきます！試しに実行してみるためのデータを作る面倒さの洗礼を受けまして、実際に試すのは別記事になります。 [clink url=\"https://ikuty.com/2019/07/16/hierarchical_clustering_by_method/\"] 階層型クラスタリング小さいクラスタを大きいクラスタで内包する様が木構造(階層構造)になることから、階層型クラスタリング、という名前が付いているんでしょうかね。最初にk個のクラスタに分割する！みたいな問題であればk-means法、何個のクラスタに分割すべきかはやってみて考えるのであれば、階層型クラスタリング。木の深さとクラスタ数を対応させるところに見事な美しさを感じます。クラスタ間距離として以下のWard距離を使うことで空間拡張性と非単調性に優れた結果を得られやすく、よく傾向が分かっていないデータに対して一発目に適用すべき方法の様子。 begin{eqnarray} d_{ward} (C,C\') = sum_{iin Ccup C\'} || x_i - bar{x}_{Ccup C\'}||^2 - left( sum_{i in C} ||x_i-bar{x}_C||^2 + sum_{iin C\'}||x_i - bar{x}_{C\'}||^2 right) end{eqnarray} 階層型クラスタリングのアルゴリズム以下の通り。最初、全てのデータポイントをクラスタとみなすところからスタートする。クラスタとクラスタを併合してより大きいクラスタを作る処理であって、繰り返しを止めなければ最終的に一つのクラスタまで成長する。つまり、どこかで止めることで、その時点のクラスタ分割を得ることができる。全データポイントそれぞれをクラスタとする全てのデータポイントが初期クラスタ以外に含まれるまで以下の計算を繰り返す各クラスタ間の距離を計算する。最もクラスタ間距離が小さいクラスタを併合する。クラスタを成長させる際に、うまい具合に併合相手を選ばないとよくない。全体としてクラスタがバラけて欲しいのだけれども、クラスタが隣のクラスタを併合した結果、次の併合先としてその隣のクラスタが対象となってしまうかもしれない。そうすると、単調に端っこから順番に成長することになり、「大きなクラスタ」と「その他の小さなクラスタ」みたいになる。何らかの距離に基づいてクラスタとクラスタを併合するのだけれども、クラスタを併合した結果、併合後のクラスタと別のクラスタの距離が、より短くなってしまう、というケースも起こってしまう。クラスタ内のデータポイント間の距離、とか、クラスタの重心間の距離とかでは上記の問題が起こりやすい。先に距離があって次に併合を考えるのではなく、併合前後の条件を使って距離を作ることで、上記の問題が起こりにくくなる。その一つがWard距離。 Ward距離データポイント間の距離はユークリッド距離を使うとして、クラスタ間の距離をどう決めるのか？データポイントをクラスタ(C)とクラスタ(C\')に分けるとする。分ける前は(Ccup C\')。 (逆? クラスタ(C)とクラスタ(C\')を併合してクラスタ(Ccup C\')を作る)。クラスタ(C)とクラスタ(C\')の距離(d_{ward}(C,C\'))を以下のように定義する。 begin{eqnarray} d_{ward} (C,C\') = sum_{iin Ccup C\'} || x_i - bar{x}_{Ccup C\'}||^2 - left( sum_{i in C} ||x_i-bar{x}_C||^2 + sum_{iin C\'}||x_i - bar{x}_{C\'}||^2 right) end{eqnarray} 併合後のクラスタ(Ccup C\')における重心との距離の2乗和( sum_{iin Ccup C\'} || x_i - bar{x}_{Ccup C\'}||^2)から、併合前のクラスタ(C)における重心との距離の2乗和(sum_{i in C} ||x_i-bar{x}_C||^2)、 (C\')における重心との距離の2乗和(sum_{iin C\'}||x_i - bar{x}_{C\'}||^2)を引いたもの。当たり前だけれども、大きいクラスタよりも小さいクラスタの方が凝集度が小さい。つまり、併合前のクラスタ(C)、(C\')を併合してクラスタ(Ccup C\')を作るとなると、 (Ccup C\')の凝集度よりも、(C)、(C\')の凝集度の方が必ず小さい。 (C)の凝集度と(C\')の凝集度の和よりも、(Ccup C\')の凝集度の方が大きくなることがポイント。要するに(Ccup C\')を(C)、(C\')に分割することは、全体の凝集度が小さくなるような操作を行うということ。無数にある分割の仕方の中から、凝集度の総和が小さくなるような「マシな分割」を決めたい！ (sum_{i in C} ||x_i-bar{x}_C||^2+sum_{iin C\'}||x_i - bar{x}_{C\'}||^2)よりも( sum_{iin Ccup C\'} || x_i - bar{x}_{Ccup C\'}||^2)の方が大きいから、 (d_{ward})は正の値となる。(d_{ward})が大きければそれはマシな分割。そもそも、データ達が何をもってクラスタを形成しているのかって考えるとかなり微妙で、人間が解釈するための合理的な根拠を作り出す手段でしかないことがわかった。データポイント間の\"距離\"を何らかの方法で定義し、\"距離が近しい塊をクラスタと言う\" という話であるとすれば、確かに合理的なのだろうけども。モヤモヤしているのは、k個のクラスタ集合とk+1個のクラスタ集合を比較したときに、前者と後者で異なるクラスタに属するデータポイントが発生するところ。クラスタリングとは単に境界を引く処理なのであって、それにどういう意味を付けるかは目的次第なのだろうと改めて認識した次第。そもそも、これらは単なる道具でしかないので、他の手段も同じく出力結果にどう意味付けするのかは目的次第なんだろうなと。最初から何個のクラスタに分割すべきか分かっているときはk-means法を使えば良さそう。今回は、そもそも何個に分割すべきか分かっていないときに使う手法。 [clink url=\"https://ikuty.com/2019/06/29/k-means_idea/\"]

AWS SAM CLIを使ってローカルでLambda関数をビルド・実行・デプロイする

Lambdaで何かをするときチマチマAWSコンソールを触らないといけないとなると面倒すぎる。ローカルでデバッグ・デプロイできるとかなり楽になる。AWS SAMを使ってみる。 AWS SAM(Serverless Application Model)。広くAWSのServerlessサービスがまとめられている。 AWS SAM CLIのGAは2020年8月。それから何回かアップデートされている。 AWS SAMの実体はCloudFormation。CloudFormationを使ってリソースの構築が走る。普段CloudFormationを使っていないとSAMのコマンドがコケた時に意味不明なエラーで悩むことになる。で、悩みながらHelloWorldしてみた。 [arst_toc tag=\"h4\"] Permissions CloudFormationで各種リソースを作る仕組みであるため、同等のPermission設定が必要。 https://docs.aws.amazon.com/ja_jp/serverless-application-model/latest/developerguide/sam-permissions.html AWS SAM は、AWS リソースへのアクセスを制御するために、AWS CloudFormation と同じメカニズムを使用します。詳細については、AWS CloudFormation ユーザーガイドの「Controlling access with AWS Identity and Access Management」を参照してください。サーバーレスアプリケーションを管理するためのユーザー権限の付与には、3 つの主なオプションがあります。各オプションは、ユーザーに異なるレベルのアクセスコントロールを提供します。 - 管理者権限を付与する。 - 必要な AWS 管理ポリシーをアタッチする。 - 特定の AWS Identity and Access Management (IAM) 許可を付与する。必要な管理ポリシーは以下。 AWSCloudFormationFullAccess IAMFullAccess AWSLambda_FullAccess AmazonAPIGatewayAdministrator AmazonS3FullAccess AmazonEC2ContainerRegistryFullAccess 触るユーザーにロールを割り当て、上記の管理ポリシーをアタッチしておくこと。 aws configureでprofileを設定しておいて、samコマンドのオプションにprofileを渡せる。インストール homebrewでインストール。 $ brew tap aws/tap $ brew install aws-sam-cli $ sam --version SAM CLI, version 1.37.0 初期化 sam initでプロジェクトディレクトリを作成できる。対話的に雛形を作るか、またはテンプレートを読み込む。 Lambdaで使える言語は割と多いが、NodejsとPythonがほとんどとのこと。 NodejsがMost popular runtimeとして扱われてるんだな。 Python書きたくないなというか。all right $ mkdir samtest && cd samtest $ sam init Which template source would you like to use? 1 - AWS Quick Start Templates 2 - Custom Template Location Choice: 1 Cloning from https://github.com/aws/aws-sam-cli-app-templates Choose an AWS Quick Start application template 1 - Hello World Example 2 - Multi-step workflow 3 - Serverless API 4 - Scheduled task 5 - Standalone function 6 - Data processing 7 - Infrastructure event management 8 - Machine Learning Template: 1 Use the most popular runtime and package type? (Nodejs and zip) [y/N]: y Project name [sam-app]: ----------------------- Generating application: ----------------------- Name: sam-app Runtime: nodejs14.x Architectures: x86_64 Dependency Manager: npm Application Template: hello-world Output Directory: . Next steps can be found in the README file at ./sam-app/README.md プロジェクト内は以下のような構成となった。 sam-app/ │ .gitignore │ README.md │ template.yaml ├─events │ event.json └─hello-world │ .npmignore │ app.js │ package.json └─tests └─unit test-handler.js app.jsがコード本体. Hello World.が書かれている。 eventを受け取るlambdaHandlerというアロー関数があって200を返してる。 // const axios = require(\'axios\') // const url = \'http://checkip.amazonaws.com/\'; let response; /** * * Event doc: https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-lambda-proxy-integrations.html#api-gateway-simple-proxy-for-lambda-input-format * @param {Object} event - API Gateway Lambda Proxy Input Format * * Context doc: https://docs.aws.amazon.com/lambda/latest/dg/nodejs-prog-model-context.html * @param {Object} context * * Return doc: https://docs.aws.amazon.com/apigateway/latest/developerguide/set-up-lambda-proxy-integrations.html * @returns {Object} object - API Gateway Lambda Proxy Output Format * */ exports.lambdaHandler = async (event, context) => { try { // const ret = await axios(url); response = { \'statusCode\': 200, \'body\': JSON.stringify({ message: \'hello world\', // location: ret.data.trim() }) } } catch (err) { console.log(err); return err; } return response }; ビルドそもそもどういう仕組みなのかというと、Lambdaの実行環境をエミュレートしたコンテナが背後にあり、その中でコードを実行する、ということになっている。それがゴニョゴニョと隠蔽されている。 Lambda関数のコードをビルドしてデプロイ用の「アーティファクト」を作る。 $ sam build Building codeuri: /Users/ikuty/ikuty/samtest/sam-app/hello-world runtime: nodejs14.x metadata: {} architecture: x86_64 functions: [\'HelloWorldFunction\'] Running NodejsNpmBuilder:NpmPack Running NodejsNpmBuilder:CopyNpmrc Running NodejsNpmBuilder:CopySource Running NodejsNpmBuilder:NpmInstall Running NodejsNpmBuilder:CleanUpNpmrc Build Succeeded Built Artifacts : .aws-sam/build Built Template : .aws-sam/build/template.yaml Commands you can use next ========================= [*] Invoke Function: sam local invoke [*] Test Function in the Cloud: sam sync --stack-name {stack-name} --watch [*] Deploy: sam deploy --guided ローカルで実行そしてローカルで実行。 Lambdaをエミュレートするコンテナが動いてapp.jsにあるアロー関数が評価される。 1発目は重いが2発目以降は結構速い。 $ sam local invoke Invoking app.lambdaHandler (nodejs14.x) Image was not found. Removing rapid images for repo public.ecr.aws/sam/emulation-nodejs14.x Building image..................................................................................................................................................................................................................................................................................................................................................................................................................... Skip pulling image and use local one: public.ecr.aws/sam/emulation-nodejs14.x:rapid-1.37.0-x86_64. Mounting /Users/ikuty/ikuty/samtest/sam-app/.aws-sam/build/HelloWorldFunction as /var/task:ro,delegated inside runtime container START RequestId: e0bbec88-dafd-4e3c-8b5e-5fcb0f38f1fa Version: $LATEST END RequestId: e0bbec88-dafd-4e3c-8b5e-5fcb0f38f1fa REPORT RequestId: e0bbec88-dafd-4e3c-8b5e-5fcb0f38f1fa Init Duration: 0.47 ms Duration: 195.40 ms Billed Duration: 196 ms Memory Size: 128 MB Max Memory Used: 128 MB {\"statusCode\":200,\"body\":\"{\"message\":\"hello world\"}\"}⏎ デプロイ以前はもっと面倒だったらしい。新しいSAMではコマンド1発でデプロイできる。ただし、1回目と2回目以降でフローが異なる。 1回目ではsamconfig.tomlという設定ファイルを作成する。 2回目以降、作成済みのsamconfig.tomlを使ってデプロイが行われる。 $ sam deploy -g Configuring SAM deploy ====================== Looking for config file [samconfig.toml] : Not found Setting default arguments for \'sam deploy\' ========================================= Stack Name [sam-app]: AWS Region [ap-northeast-1]: #Shows you resources changes to be deployed and require a \'Y\' to initiate deploy Confirm changes before deploy [y/N]: y #SAM needs permission to be able to create roles to connect to the resources in your template Allow SAM CLI IAM role creation [Y/n]: y #Preserves the state of previously provisioned resources when an operation fails Disable rollback [y/N]: y HelloWorldFunction may not have authorization defined, Is this okay? [y/N]: y Save arguments to configuration file [Y/n]: y SAM configuration file [samconfig.toml]: SAM configuration environment [default]: Looking for resources needed for deployment: Creating the required resources... Successfully created! Managed S3 bucket: aws-sam-cli-managed-default-samclisourcebucket-h0aw0pxx8pxv A different default S3 bucket can be set in samconfig.toml Saved arguments to config file Running \'sam deploy\' for future deployments will use the parameters saved above. The above parameters can be changed by modifying samconfig.toml Learn more about samconfig.toml syntax at https://docs.aws.amazon.com/serverless-application-model/latest/developerguide/serverless-sam-cli-config.html ... (省略) 最後の文節にあるように、samconfig.tomlを変更することで構成を変更できる。この後、実際にCloudFormationスタックのアップロード/実行が走りリソースが組み上がる。 2回目以降、-gオプション抜きでsam deployを実行すると以下。 $ sam deploy File with same data already exists at sam-app/e32dcdf231268fbcad9915436e787001, skipping upload Deploying with following values =============================== Stack name : sam-app Region : ap-northeast-1 Confirm changeset : True Disable rollback : True Deployment s3 bucket : aws-sam-cli-managed-default-samclisourcebucket-h0aw0pxx8pxv Capabilities : [\"CAPABILITY_IAM\"] Parameter overrides : {} Signing Profiles : {} Initiating deployment ===================== File with same data already exists at sam-app/9a813032a850e5b7fb214dffc5ac5783.template, skipping upload Waiting for changeset to be created.. Error: No changes to deploy. Stack sam-app is up to date Webコンソールで動作確認 Webコンソール上、生成されたLambda関数を確認できる。 HelloWorldが書かれたapp.jsが見える。 Testでサンプルイベントを送るとHelloWorldが200で返ってきた。OK。

Fellegi-Sunterモデルに基づく確率的名寄せパッケージ Splinkを試してみる

[mathjax] Record Linkage、Entity Recognitionなど、いわゆる「名寄せ」問題には、割とさまざまな解決策が提案されている。その1つに確率論的な数学的背景を持つFellegi-Sunterモデルを実装したSplinkがある。数学的背景について咀嚼して説明してくれているサイトはあまりなく、Splinkの公式ドキュメントが唯一に近いソースだろうと思う。今回は5万人の人物データを使って重複を除去するSplinkの主要な機能のサンプルを追ってみた。数式を読んだところで実りは薄い、ということもあり、なるべく数式を読まずに読み進めたが、モデルのパラメタ、予測結果を解釈する際にどうしても数式を読まないとわからない箇所があった。特にベイズ統計について理解がかなり浅くて、正直なところ理解しきれてはいない。 Splinkの公式ドキュメントを読み進めていった際に気付いたメモを書いてみることにした。 [arst_toc tag=\"h4\"] 環境構築・準備環境構築の実装部分は主な学習要素ではないのでClaude Codeでサクッと作ってみた。まずはPlan modeで計画を立ててもらう。これぐらいだとほぼ言い値で使える。 DuckDB、Splinkを使う最小構成の環境を作ります。環境はDocker composeで構築します。計画を立ててください。 ⏺ DuckDB、Splinkを使うDocker Compose環境の計画を完了しました。計画概要 1. 設計文書: .tmp/design.mdで要件、アーキテクチャ、技術仕様を定義 2. タスク管理: .tmp/task.mdで詳細な実装手順を3フェーズに分けて整理主要コンポーネント - Jupyter Lab環境: DuckDB、Splink、必要なライブラリがプリインストール - データボリューム: サンプルデータ、notebook、出力用のディレクトリ - 最小構成: 開発・学習用途に特化した軽量環境次のステップ実装を開始する場合は、Phase 1から順番に進めます： 1. Docker設定ファイル作成 2. ディレクトリ構造構築 3. サンプル実装 4. テストと検証計画の詳細は.tmp/design.mdと.tmp/task.mdをご確認ください。実装を開始しますか？実際にファイルを作ってもらうと以下のような構成となる。得られる中間成果物である docker-compose.yml、Dockerfile は本記事では省略する。 . ├── data ├── docker-compose.yml ├── Dockerfile ├── notebooks ├── output └── requirements.txt 普通に http://localhost:8888 で JupyterLab が開く。使用するサンプルデータ 5万人の人物データを使って名寄せを行うサンプル。おそらくSplinkの用途として最初に思いつくやつ。 Splinkにデータをロードする前に必要なデータクリーニング手順について説明がある。公式によると、まずは行に一意のIDを割り当てる必要がある。データセット内で一意となるIDであって、重複除去した後のエンティティを識別するIDのことではない。 [clink implicit=\"false\" url=\"https://moj-analytical-services.github.io/splink/demos/tutorials/01_Prerequisites.html\" imgurl=\"https://user-images.githubusercontent.com/7570107/85285114-3969ac00-b488-11ea-88ff-5fca1b34af1f.png\" title=\"Data Prerequisites\" excerpt=\"Splink では、リンクする前にデータをクリーンアップし、行に一意の ID を割り当てる必要があります。このセクションでは、Splink にデータをロードする前に必要な追加のデータクリーニング手順について説明します。\"] 使用するサンプルデータは以下の通り。 from splink import splink_datasets df = splink_datasets.historical_50k df.head() データの分布を可視化 splink.exploratoryのprofile_columnsを使って分布を可視化してみる。 from splink import DuckDBAPI from splink.exploratory import profile_columns db_api = DuckDBAPI() profile_columns(df, db_api, column_expressions=[\"first_name\", \"substr(surname,1,2)\"]) 同じ姓・名の人が大量にいることがわかる。ブロッキングとブロッキングルールの評価テーブル内のレコードが他のレコードと「同一かどうか」を調べるためには、基本的には、他のすべてのレコードとの何らかの比較操作を行うこととなる。全てのレコードについて全てのカラム同士を比較したいのなら、対象のテーブルをCROSS JOINした結果、各カラム同士を比較することとなる。 SELECT ... FROM input_tables as l CROSS JOIN input_tables as r あるカラムが条件に合わなければ、もうその先は見ても意味がない、というケースは多い。例えば、まず first_name 、surname が同じでなければ、その先の比較を行わない、というのはあり得る。 SELECT ... FROM input_tables as l INNER JOIN input_tables as r ON l.first_name = r.first_name AND l.surname = r.surname このような考え方をブロッキング、ON句の条件をブロッキングルールと言う。ただ、これだと性と名が完全一致していないレコードが残らない。そこで、ブロッキングルールを複数定義し、いずれかが真であれば残すことができる。ここでポイントなのが、ブロッキングルールを複数定義したとき、それぞれのブロッキングルールで重複して選ばれるレコードが発生した場合、 Splinkが自動的に排除してくれる。このため、ブロッキングルールを重ねがけすると、最終的に残るレコード数は一致する。ただ、順番により、同じルールで残るレコード数は変化する。逆に言うと、ブロッキングルールを足すことで、重複除去後のOR条件が増えていく。積算グラフにして、ブロッキングルールとその順番の効果を見ることができる。 from splink import DuckDBAPI, block_on from splink.blocking_analysis import ( cumulative_comparisons_to_be_scored_from_blocking_rules_chart, ) blocking_rules = [ block_on(\"substr(first_name,1,3)\", \"substr(surname,1,4)\"), block_on(\"surname\", \"dob\"), block_on(\"first_name\", \"dob\"), block_on(\"postcode_fake\", \"first_name\"), block_on(\"postcode_fake\", \"surname\"), block_on(\"dob\", \"birth_place\"), block_on(\"substr(postcode_fake,1,3)\", \"dob\"), block_on(\"substr(postcode_fake,1,3)\", \"first_name\"), block_on(\"substr(postcode_fake,1,3)\", \"surname\"), block_on(\"substr(first_name,1,2)\", \"substr(surname,1,2)\", \"substr(dob,1,4)\"), ] db_api = DuckDBAPI() cumulative_comparisons_to_be_scored_from_blocking_rules_chart( table_or_tables=df, blocking_rules=blocking_rules, db_api=db_api, link_type=\"dedupe_only\", ) 積算グラフは以下の通り。積み上がっている数値は「比較の数」。要は、論理和で条件を足していって、次第に緩和されている様子がわかる。 DuckDBでは比較の数を2,000万件以内、Athena,Sparkでは1億件以内を目安にせよとのこと。比較の定義 Splinkは Fellegi-Sunter model モデル (というかフレームワーク) に基づいている。 https://moj-analytical-services.github.io/splink/topic_guides/theory/fellegi_sunter.html 各カラムの同士をカラムの特性に応じた距離を使って比較し、重みを計算していく。各カラムの比較に使うためのメソッドが予め用意されているので、特性に応じて選んでいく。以下では、first_name, sur_name に ForenameSurnameComparison が使われている。 dobにDateOfBirthComparison、birth_place、ocupationにExactMatchが使われている。 import splink.comparison_library as cl from splink import Linker, SettingsCreator settings = SettingsCreator( link_type=\"dedupe_only\", blocking_rules_to_generate_predictions=blocking_rules, comparisons=[ cl.ForenameSurnameComparison( \"first_name\", \"surname\", forename_surname_concat_col_name=\"first_name_surname_concat\", ), cl.DateOfBirthComparison( \"dob\", input_is_string=True ), cl.PostcodeComparison(\"postcode_fake\"), cl.ExactMatch(\"birth_place\").configure(term_frequency_adjustments=True), cl.ExactMatch(\"occupation\").configure(term_frequency_adjustments=True), ], retain_intermediate_calculation_columns=True, ) # Needed to apply term frequencies to first+surname comparison df[\"first_name_surname_concat\"] = df[\"first_name\"] + \" \" + df[\"surname\"] linker = Linker(df, settings, db_api=db_api) ComparisonとComparison Level ここでSplinkツール内の比較の概念の説明。以下の通り概念に名前がついている。 Data Linking Model ├─-- Comparison: Date of birth │ ├─-- ComparisonLevel: Exact match │ ├─-- ComparisonLevel: One character difference │ ├─-- ComparisonLevel: All other ├─-- Comparison: First name │ ├─-- ComparisonLevel: Exact match on first_name │ ├─-- ComparisonLevel: first_names have JaroWinklerSimilarity > 0.95 │ ├─-- ComparisonLevel: first_names have JaroWinklerSimilarity > 0.8 │ ├─-- ComparisonLevel: All other モデルのパラメタ推定モデルの実行に必要なパラメタは以下の3つ。Splinkを用いてパラメタを得る。ちなみに u は \"\'U\'nmatch\"、m は \"\'M\'atch\"。背後の数式の説明で現れる。 No パラメタ説明 1 無作為に選んだレコードが一致する確率入力データからランダムに取得した2つのレコードが一致する確率 (通常は非常に小さい数値) 2 u値(u確率) 実際には一致しないレコードの中で各 ComparisonLevel に該当するレコードの割合。具体的には、レコード同士が同じエンティティを表すにも関わらず値が異なる確率。例えば、同じ人なのにレコードによって生年月日が違う確率。これは端的には「データ品質」を表す。名前であればタイプミス、別名、ニックネーム、ミドルネーム、結婚後の姓など。 3 m値(m確率) 実際に一致するレコードの中で各 ComparisonLevel に該当するレコードの割合。具体的には、レコード同士が異なるエンティティを表すにも関わらず値が同じである確率。例えば別人なのにレコードによって性・名が同じ確率 (同姓同名)。性別は男か女かしかないので別人でも50%の確率で一致してしまう。無作為に選んだレコードが一致する確率入力データからランダムに抽出した2つのレコードが一致する確率を求める。値は0.000136。すべての可能なレコードのペア比較のうち7,362.31組に1組が一致すると予想される。合計1,279,041,753組の比較が可能なため、一致するペアは合計で約173,728.33組になると予想される、とのこと。 linker.training.estimate_probability_two_random_records_match( [ block_on(\"first_name\", \"surname\", \"dob\"), block_on(\"substr(first_name,1,2)\", \"surname\", \"substr(postcode_fake,1,2)\"), block_on(\"dob\", \"postcode_fake\"), ], recall=0.6, ) > Probability two random records match is estimated to be 0.000136. > This means that amongst all possible pairwise record comparisons, > one in 7,362.31 are expected to match. > With 1,279,041,753 total possible comparisons, > we expect a total of around 173,728.33 matching pairs u確率の推定実際には一致しないレコードの中でComparisonの評価結果がPositiveである確率。基本、無作為に抽出したレコードは一致しないため、「無作為に抽出したレコード」を「実際には一致しないレコード」として扱える、という点がミソ。 probability_two_random_records_match によって得られた値を使ってu確率を求める。 estimate_u_using_random_sampling によって、ラベルなし、つまり教師なしでu確率を得られる。レコードのペアをランダムでサンプルして上で定義したComparisonを評価する。ランダムサンプルなので大量の不一致が発生するが、各Comparisonにおける不一致の分布を得ている。これは、例えば性別について、50%が一致、50%が不一致である、という分布を得ている。一方、例えば生年月日について、一致する確率は 1%、1 文字の違いがある確率は 3%、その他はすべて 96% の確率で発生する、という分布を得ている。 linker.training.estimate_u_using_random_sampling(max_pairs=5e6) > ----- Estimating u probabilities using random sampling ----- > > Estimated u probabilities using random sampling > > Your model is not yet fully trained. Missing estimates for: > - first_name_surname (no m values are trained). > - dob (no m values are trained). > - postcode_fake (no m values are trained). > - birth_place (no m values are trained). > - occupation (no m values are trained). m確率の推定「実際に一致するレコード」の中で、Comparisonの評価がNegativeになる確率。そもそも、このモデルを使って名寄せ、つまり「一致するレコード」を見つけたいのだから、モデルを作るために「実際に一致するレコード」を計算しなければならないのは矛盾では..となる。無作為抽出結果から求められるu確率とは異なり、m確率を求めるのは難しい。もしラベル付けされた「一致するレコード」、つまり教師データセットがあるのであれば、そのデータセットを使ってm確率を求められる。例えば、日本人全員にマイナンバーが振られて、全てのレコードにマイナンバーが振られている、というアナザーワールドがあるのであれば、マイナンバーを使ってm確率を推定する。(どういう状況??) ラベル付けされたデータがないのであれば、EMアルゴリズムでm確率を求めることになっている。 EMアルゴリズムは反復的な手法で、メモリや収束速度の点でペア数を減らす必要があり、例ではブロッキングルールを設定している。以下のケースでは、first_nameとsurnameをブロッキングルールとしている。つまり、first_name, surnameが完全に一致するレコードについてペア比較を行う。この仮定を設定したため、first_name, surname (first_name_surname) のパラメタを推定できない。 training_blocking_rule = block_on(\"first_name\", \"surname\") training_session_names = ( linker.training.estimate_parameters_using_expectation_maximisation( training_blocking_rule, estimate_without_term_frequencies=True ) ) > ----- Starting EM training session ----- > > Estimating the m probabilities of the model by blocking on: > (l.\"first_name\" = r.\"first_name\") AND (l.\"surname\" = r.\"surname\") > > Parameter estimates will be made for the following comparison(s): > - dob > - postcode_fake > - birth_place > - occupation > > Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: > - first_name_surname > > Iteration 1: Largest change in params was 0.248 in probability_two_random_records_match > Iteration 2: Largest change in params was 0.0929 in probability_two_random_records_match > Iteration 3: Largest change in params was -0.0237 in the m_probability of birth_place, level `Exact match on > birth_place` > Iteration 4: Largest change in params was 0.00961 in the m_probability of birth_place, level `All other >comparisons` > Iteration 5: Largest change in params was -0.00457 in the m_probability of birth_place, level `Exact match on birth_place` > Iteration 6: Largest change in params was -0.00256 in the m_probability of birth_place, level `Exact match on birth_place` > Iteration 7: Largest change in params was 0.00171 in the m_probability of dob, level `Abs date difference Iteration 8: Largest change in params was 0.00115 in the m_probability of dob, level `Abs date difference Iteration 9: Largest change in params was 0.000759 in the m_probability of dob, level `Abs date difference Iteration 10: Largest change in params was 0.000498 in the m_probability of dob, level `Abs date difference Iteration 11: Largest change in params was 0.000326 in the m_probability of dob, level `Abs date difference Iteration 12: Largest change in params was 0.000213 in the m_probability of dob, level `Abs date difference Iteration 13: Largest change in params was 0.000139 in the m_probability of dob, level `Abs date difference Iteration 14: Largest change in params was 9.04e-05 in the m_probability of dob, level `Abs date difference <= 10 year` 同様にdobをブロッキングルールに設定して実行すると、dob以外の列についてパラメタを推定できる。 training_blocking_rule = block_on(\"dob\") training_session_dob = ( linker.training.estimate_parameters_using_expectation_maximisation( training_blocking_rule, estimate_without_term_frequencies=True ) ) > ----- Starting EM training session ----- > > Estimating the m probabilities of the model by blocking on: > l.\"dob\" = r.\"dob\" > > Parameter estimates will be made for the following comparison(s): > - first_name_surname > - postcode_fake > - birth_place > - occupation > > Parameter estimates cannot be made for the following comparison(s) since they are used in the blocking rules: > - dob > > Iteration 1: Largest change in params was -0.474 in the m_probability of first_name_surname, level `Exact match on first_name_surname_concat` > Iteration 2: Largest change in params was 0.052 in the m_probability of first_name_surname, level `All other comparisons` > Iteration 3: Largest change in params was 0.0174 in the m_probability of first_name_surname, level `All other comparisons` > Iteration 4: Largest change in params was 0.00532 in the m_probability of first_name_surname, level `All other comparisons` > Iteration 5: Largest change in params was 0.00165 in the m_probability of first_name_surname, level `All other comparisons` > Iteration 6: Largest change in params was 0.00052 in the m_probability of first_name_surname, level `All other comparisons` > Iteration 7: Largest change in params was 0.000165 in the m_probability of first_name_surname, level `All other comparisons` > Iteration 8: Largest change in params was 5.29e-05 in the m_probability of first_name_surname, level `All other comparisons` > > EM converged after 8 iterations > > Your model is not yet fully trained. Missing estimates for: > - first_name_surname (some u values are not trained). モデルパラメタの可視化 m確率、u確率の可視化。マッチウェイトの可視化。マッチウェイトは (log_2 (m / u))で計算される。 linker.visualisations.match_weights_chart() モデルの保存と読み込み以下でモデルを保存できる。 settings = linker.misc.save_model_to_json( \"./saved_model_from_demo.json\", overwrite=True ) 以下で保存したモデルを読み込める。 import json settings = json.load( open(\'./saved_model_from_demo.json\', \'r\') ) リンクするのに十分な情報が含まれていないレコード「John Smith」のみを含み、他のすべてのフィールドがnullであるレコードは、他のレコードにリンクされている可能性もあるが、潜在的なリンクを明確にするには十分な情報がない。以下により可視化できる。 linker.evaluation.unlinkables_chart() 横軸は「マッチウェイトの閾値」。縦軸は「リンクするのに十分な情報が含まれないレコード」の割合。マッチウェイト閾値=6.11ぐらいのところを見ると、入力データセットのレコードの約1.3%がリンクできないことが示唆される。訓練済みモデルを使って未知データのマッチウェイトを予測上で構築した推定モデルを使用し、どのペア比較が一致するかを予測する。内部的には以下を行うとのこと。 blocking_rules_to_generate_predictionsの少なくとも1つと一致するペア比較を生成 Comparisonで指定されたルールを使用して、入力データの類似性を評価推定された一致重みを使用し、要求に応じて用語頻度調整を適用して、最終的な一致重みと一致確率スコアを生成 df_predictions = linker.inference.predict(threshold_match_probability=0.2) df_predictions.as_pandas_dataframe(limit=1) > Blocking time: 0.88 seconds > Predict time: 1.91 seconds > > -- WARNING -- > You have called predict(), but there are some parameter estimates which have neither been estimated or > specified in your settings dictionary. To produce predictions the following untrained trained parameters will > use default values. > Comparison: \'first_name_surname\': > u values not fully trained records_to_plot = df_e.to_dict(orient=\"records\") linker.visualisations.waterfall_chart(records_to_plot, filter_nulls=False) predictしたマッチウェイトの可視化、数式との照合 predictしたマッチウェイトは、ウォーターフォール図で可視化できる。マッチウェイトは、モデル内の各特徴量によって一致の証拠がどの程度提供されるかを示す中心的な指標。 (lambda)は無作為抽出した2つのレコードが一致する確率。(K=m/u)はベイズ因子。 begin{align} M &= log_2 ( frac{lambda}{1-lambda} ) + log_2 K \\ &= log_2 ( frac{lambda}{1-lambda} ) + log_2 m - log_2 u end{align} 異なる列の比較が互いに独立しているという仮定を置いていて、 2つのレコードのベイズ係数が各列比較のベイズ係数の積として扱う。 begin{eqnarray} K_{feature} = K_{first_name_surname} + K_{dob} + K_{postcode_fake} + K_{birth_place} + K_{occupation} + cdots end{eqnarray} マッチウェイトは以下の和。 begin{eqnarray} M_{observe} = M_{prior} + M_{feature} end{eqnarray} ここで begin{align} M_{prior} &= log_2 (frac{lambda}{1-lambda}) \\ M_{feature} &= M_{first_name_surname} + M_{dob} + M_{postcode_fake} + M_{birth_place} + M_{occupation} + cdots end{align} 以下のように書き換える。 begin{align} M_{observe} &= log_2 (frac{lambda}{1-lambda}) + sum_i^{feature} log_2 (frac{m_i}{u_i}) \\ &= log_2 (frac{lambda}{1-lambda}) + log_2 (prod_i^{feature} (frac{m_i}{u_i}) ) end{align} ウォーターフォール図の一番左、赤いバーは(M_{prior} = log_2 (frac{lambda}{1-lambda}))。特徴に関する追加の知識が考慮されていない場合のマッチウェイト。横に並んでいる薄い緑のバーは (M_{first_name_surname} + M_{dob} + M_{postcode_fake} + M_{birth_place} + M_{occupation} + cdots)。各特徴量のマッチウェイト。一番右の濃い緑のバーは2つのレコードの合計マッチウェイト。 begin{align} M_{feature} &= M_{first_name_surname} + M_{dob} + M_{postcode_fake} + M_{birth_place} + M_{occupation} + cdots \\ &= 8.50w end{align} まとめ長くなったのでいったん終了。この記事では教師なし確率的名寄せパッケージSplinkを使用してモデルを作ってみた。次の記事では、作ったモデルを使用して実際に名寄せをしてみる。途中、DuckDBが楽しいことに気づいたので、DuckDBだけで何個か記事にしてみようと思う。

SageMaker用のコードをローカルで動かす – scikit-learnの決定木でアヤメの種類を分類

機械学習のHelloWorld

前準備

CredentialsとIAM

SageMaker PythonSDKインストール

ローカルのJupyter Notebookでファイルを修正

SageMaker ローカルSessionを開始

学習用データの準備 (変更なし)

Scikit learn Estimator

学習

推論

まとめ

React+Next.jsでDummy JSONのCRUDをCSR/SSRの両方で作成して違いを調べてみた話

go-txdbを使ってgolang, gin, gorm(gen)+sqlite構成のAPI をテストケース毎に管理する

gorm互換の型安全なORMであるgenでCRUD APIを試作

Golang + Gin カスタムバリデーション

Golang + Gin Framework で Hello World してみた話〜基本的なルーティング、バスパラメタ・クエリパラメタ・JSON Req/Res、フォームデータ

Snowflake MCPサーバを試してみた

Fellegi-Sunterモデルに基づく確率的名寄せパッケージ Splinkを試してみる

AirflowでEnd-To-End Pipeline Testsを行うためにAirflow APIを調べてみた話

CustomOperatorのUnitTestを理解するためGCSToBigQueryOperatorのUnitTestを読んでみた話

GoogleによるAirflow DAG実装のベスプラ集を読んでみた – その1

機械学習のHelloWorld

前準備

CredentialsとIAM

SageMaker PythonSDKインストール

ローカルのJupyter Notebookでファイルを修正

SageMaker ローカルSessionを開始

学習用データの準備 (変更なし)

Scikit learn Estimator

学習

推論

まとめ

関連記事