{ "nbformat": 4, "nbformat_minor": 0, "metadata": { "colab": { "name": "04-intro-pandas-functions.ipynb", "provenance": [] }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.7.6" } }, "cells": [ { "cell_type": "markdown", "metadata": { "id": "HHvqvbiUBFcW" }, "source": [ "[![AnalyticsDojo](https://github.com/rpi-techfundamentals/spring2019-materials/blob/master/fig/final-logo.png?raw=1)](http://rpi.analyticsdojo.com)\n", "

Introduction to Python - Introduction to Apply Function\n", "\n", "rpi.analyticsdojo.com

\n", "\n" ] }, { "cell_type": "markdown", "metadata": { "id": "Wjg1PgSoBFcb" }, "source": [ "# Introduction to Apply Function\n", "- Don't loop over a dataframe. \n", "- Instead, use the `apply` function to run a function on each value. \n", "\n" ] }, { "cell_type": "code", "metadata": { "id": "MiuYwjIEBFcd", "colab": { "base_uri": "https://localhost:8080/", "height": 447 }, "outputId": "53986440-5251-4414-dfc2-a12368fbfc74" }, "source": [ "import pandas as pd\n", "\n", "# Load the Titanic training data straight from GitHub.\n", "df = pd.read_csv('https://raw.githubusercontent.com/rpi-techfundamentals/spring2019-materials/master/input/train.csv')\n", "df" ], "execution_count": 1, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>PassengerId</th><th>Survived</th><th>Pclass</th><th>Name</th><th>Sex</th><th>Age</th><th>SibSp</th><th>Parch</th><th>Ticket</th><th>Fare</th><th>Cabin</th><th>Embarked</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>1</td><td>0</td><td>3</td><td>Braund, Mr. Owen Harris</td><td>male</td><td>22.0</td><td>1</td><td>0</td><td>A/5 21171</td><td>7.2500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>1</th><td>2</td><td>1</td><td>1</td><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>female</td><td>38.0</td><td>1</td><td>0</td><td>PC 17599</td><td>71.2833</td><td>C85</td><td>C</td></tr>\n",
"    <tr><th>2</th><td>3</td><td>1</td><td>3</td><td>Heikkinen, Miss. Laina</td><td>female</td><td>26.0</td><td>0</td><td>0</td><td>STON/O2. 3101282</td><td>7.9250</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>3</th><td>4</td><td>1</td><td>1</td><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>female</td><td>35.0</td><td>1</td><td>0</td><td>113803</td><td>53.1000</td><td>C123</td><td>S</td></tr>\n",
"    <tr><th>4</th><td>5</td><td>0</td><td>3</td><td>Allen, Mr. William Henry</td><td>male</td><td>35.0</td><td>0</td><td>0</td><td>373450</td><td>8.0500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td><td>...</td></tr>\n",
"    <tr><th>886</th><td>887</td><td>0</td><td>2</td><td>Montvila, Rev. Juozas</td><td>male</td><td>27.0</td><td>0</td><td>0</td><td>211536</td><td>13.0000</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>887</th><td>888</td><td>1</td><td>1</td><td>Graham, Miss. Margaret Edith</td><td>female</td><td>19.0</td><td>0</td><td>0</td><td>112053</td><td>30.0000</td><td>B42</td><td>S</td></tr>\n",
"    <tr><th>888</th><td>889</td><td>0</td><td>3</td><td>Johnston, Miss. Catherine Helen \"Carrie\"</td><td>female</td><td>NaN</td><td>1</td><td>2</td><td>W./C. 6607</td><td>23.4500</td><td>NaN</td><td>S</td></tr>\n",
"    <tr><th>889</th><td>890</td><td>1</td><td>1</td><td>Behr, Mr. Karl Howell</td><td>male</td><td>26.0</td><td>0</td><td>0</td><td>111369</td><td>30.0000</td><td>C148</td><td>C</td></tr>\n",
"    <tr><th>890</th><td>891</td><td>0</td><td>3</td><td>Dooley, Mr. Patrick</td><td>male</td><td>32.0</td><td>0</td><td>0</td><td>370376</td><td>7.7500</td><td>NaN</td><td>Q</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>891 rows × 12 columns</p>\n",
"</div>" ], "text/plain": [ " PassengerId Survived Pclass ... Fare Cabin Embarked\n", "0 1 0 3 ... 7.2500 NaN S\n", "1 2 1 1 ... 71.2833 C85 C\n", "2 3 1 3 ... 7.9250 NaN S\n", "3 4 1 1 ... 53.1000 C123 S\n", "4 5 0 3 ... 8.0500 NaN S\n", ".. ... ... ... ... ... ... ...\n", "886 887 0 2 ... 13.0000 NaN S\n", "887 888 1 1 ... 30.0000 B42 S\n", "888 889 0 3 ... 23.4500 NaN S\n", "889 890 1 1 ... 30.0000 C148 C\n", "890 891 0 3 ... 7.7500 NaN Q\n", "\n", "[891 rows x 12 columns]" ] }, "metadata": {}, "execution_count": 1 } ] },
{ "cell_type": "markdown", "metadata": { "id": "2WhYu3NcYE3j" }, "source": [ "### Make it easy with the lambda function.\n", "- Create an `age-squared` column by squaring each value of `Age`." ] },
{ "cell_type": "code", "metadata": { "id": "jkzsuCmaYCef" }, "source": [ "df['age-squared'] = df['Age'].apply(lambda x: x**2)" ], "execution_count": 2, "outputs": [] },
{ "cell_type": "markdown", "metadata": { "id": "ss9fW77EYuDQ" }, "source": [ "### Or define an entire function.\n", "- Define a function to get the title from the name. \n", "- Always test your function on a single entry before passing it to `apply`." ] },
{ "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 57 }, "id": "kYKg79qaYsMd", "outputId": "1f356417-c3ee-4cf6-d168-4e99c1568c70" }, "source": [
"def get_title(x):\n",
"    # Extract the title (e.g. 'Mr.') from a 'Last, Title First' name.\n",
"    x = str(x)\n",
"    x = x.split(',')  # Split at the comma\n",
"    x = x[1].strip()  # Remove leading and trailing spaces\n",
"    x = x.split(' ')  # Split at the spaces\n",
"    return x[0]  # The first word is the title\n",
"\n",
"# Always test your function on a single entry before using apply.\n",
"get_title('Dooley, Mr. Patrick')" ], "execution_count": 8, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'Mr.'" ] }, "metadata": {}, "execution_count": 8 } ] },
{ "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 447 }, "id": "sfEpRTCfZ0EU", "outputId": "7fda0f72-5b65-425f-ea05-a16df5aeaf92" }, "source": [ "df['Title'] = df['Name'].apply(get_title)\n", "df[['Name', 'Title']]" ], "execution_count": 12, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Name</th><th>Title</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Braund, Mr. Owen Harris</td><td>Mr.</td></tr>\n",
"    <tr><th>1</th><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>Mrs.</td></tr>\n",
"    <tr><th>2</th><td>Heikkinen, Miss. Laina</td><td>Miss.</td></tr>\n",
"    <tr><th>3</th><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>Mrs.</td></tr>\n",
"    <tr><th>4</th><td>Allen, Mr. William Henry</td><td>Mr.</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
"    <tr><th>886</th><td>Montvila, Rev. Juozas</td><td>Rev.</td></tr>\n",
"    <tr><th>887</th><td>Graham, Miss. Margaret Edith</td><td>Miss.</td></tr>\n",
"    <tr><th>888</th><td>Johnston, Miss. Catherine Helen \"Carrie\"</td><td>Miss.</td></tr>\n",
"    <tr><th>889</th><td>Behr, Mr. Karl Howell</td><td>Mr.</td></tr>\n",
"    <tr><th>890</th><td>Dooley, Mr. Patrick</td><td>Mr.</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>891 rows × 2 columns</p>\n",
"</div>" ], "text/plain": [ " Name Title\n", "0 Braund, Mr. Owen Harris Mr.\n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... Mrs.\n", "2 Heikkinen, Miss. Laina Miss.\n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Mrs.\n", "4 Allen, Mr. William Henry Mr.\n", ".. ... ...\n", "886 Montvila, Rev. Juozas Rev.\n", "887 Graham, Miss. Margaret Edith Miss.\n", "888 Johnston, Miss. Catherine Helen \"Carrie\" Miss.\n", "889 Behr, Mr. Karl Howell Mr.\n", "890 Dooley, Mr. Patrick Mr.\n", "\n", "[891 rows x 2 columns]" ] }, "metadata": {}, "execution_count": 12 } ] },
{ "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "oMwxfTSybYwQ", "outputId": "a536154c-501e-4634-dc24-3f12b81fd9fe" }, "source": [ "df['Title'].unique()" ], "execution_count": 13, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "array(['Mr.', 'Mrs.', 'Miss.', 'Master.', 'Don.', 'Rev.', 'Dr.', 'Mme.',\n", " 'Ms.', 'Major.', 'Lady.', 'Sir.', 'Mlle.', 'Col.', 'Capt.', 'the',\n", " 'Jonkheer.'], dtype=object)" ] }, "metadata": {}, "execution_count": 13 } ] },
{ "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "ohcAkm0CbfsQ", "outputId": "74612e04-e843-41ac-fe04-b465cdf62efe" }, "source": [ "df['Title'].value_counts()" ], "execution_count": 14, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Mr. 517\n", "Miss. 182\n", "Mrs. 125\n", "Master. 40\n", "Dr. 7\n", "Rev. 6\n", "Major. 2\n", "Col. 2\n", "Mlle. 2\n", "Lady. 1\n", "Capt. 1\n", "Jonkheer. 1\n", "Don. 1\n", "Ms. 1\n", "the 1\n", "Mme. 1\n", "Sir. 1\n", "Name: Title, dtype: int64" ] }, "metadata": {}, "execution_count": 14 } ] },
{ "cell_type": "markdown", "metadata": { "id": "vuTDAoxEcjAL" }, "source": [ "### Pass Additional Values\n", "`apply` forwards extra keyword arguments to your function, so you can pass it additional values." ] },
{ "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 57 }, "id": "SkLFCJSPbsTZ", "outputId": "f3b82df1-b5b2-4771-a3e8-89860e22c463" }, "source": [
"RECODE_MRS = ['Lady.', 'Mme.']\n",
"RECODE_MISS = ['Ms.']\n",
"RECODE_MR = ['Sir.', 'the', 'Don.', 'Jonkheer.', 'Capt.']\n",
"\n",
"def get_title2(x, recode_mrs, recode_miss, recode_mr):\n",
"    # Extract the title, then recode rare titles to a common one.\n",
"    x = str(x)\n",
"    x = x.split(',')  # Split at the comma\n",
"    x = x[1].strip()  # Remove leading and trailing spaces\n",
"    x = x.split(' ')  # Split at the spaces\n",
"    x = x[0]  # Select the first word\n",
"    if x in recode_mrs:\n",
"        x = 'Mrs.'\n",
"    elif x in recode_miss:\n",
"        x = 'Miss.'\n",
"    elif x in recode_mr:\n",
"        x = 'Mr.'\n",
"    return x\n",
"\n",
"# Always test your function on a single entry before using apply.\n",
"get_title2('Dooley, Sir. Patrick', recode_mrs=RECODE_MRS, recode_miss=RECODE_MISS, recode_mr=RECODE_MR)" ], "execution_count": 19, "outputs": [ { "output_type": "execute_result", "data": { "application/vnd.google.colaboratory.intrinsic+json": { "type": "string" }, "text/plain": [ "'Mr.'" ] }, "metadata": {}, "execution_count": 19 } ] },
{ "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 447 }, "id": "FV-SYLJtdjgi", "outputId": "7736af49-09e9-4cbd-bc86-6bb36ca1fcc9" }, "source": [ "df['Title'] = df['Name'].apply(get_title2, recode_mrs=RECODE_MRS, recode_miss=RECODE_MISS, recode_mr=RECODE_MR)\n", "df[['Name', 'Title']]" ], "execution_count": 20, "outputs": [ { "output_type": "execute_result", "data": { "text/html": [ "
<div>\n",
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>Name</th><th>Title</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>Braund, Mr. Owen Harris</td><td>Mr.</td></tr>\n",
"    <tr><th>1</th><td>Cumings, Mrs. John Bradley (Florence Briggs Th...</td><td>Mrs.</td></tr>\n",
"    <tr><th>2</th><td>Heikkinen, Miss. Laina</td><td>Miss.</td></tr>\n",
"    <tr><th>3</th><td>Futrelle, Mrs. Jacques Heath (Lily May Peel)</td><td>Mrs.</td></tr>\n",
"    <tr><th>4</th><td>Allen, Mr. William Henry</td><td>Mr.</td></tr>\n",
"    <tr><th>...</th><td>...</td><td>...</td></tr>\n",
"    <tr><th>886</th><td>Montvila, Rev. Juozas</td><td>Rev.</td></tr>\n",
"    <tr><th>887</th><td>Graham, Miss. Margaret Edith</td><td>Miss.</td></tr>\n",
"    <tr><th>888</th><td>Johnston, Miss. Catherine Helen \"Carrie\"</td><td>Miss.</td></tr>\n",
"    <tr><th>889</th><td>Behr, Mr. Karl Howell</td><td>Mr.</td></tr>\n",
"    <tr><th>890</th><td>Dooley, Mr. Patrick</td><td>Mr.</td></tr>\n",
"  </tbody>\n",
"</table>\n",
"<p>891 rows × 2 columns</p>\n",
"</div>" ], "text/plain": [ " Name Title\n", "0 Braund, Mr. Owen Harris Mr.\n", "1 Cumings, Mrs. John Bradley (Florence Briggs Th... Mrs.\n", "2 Heikkinen, Miss. Laina Miss.\n", "3 Futrelle, Mrs. Jacques Heath (Lily May Peel) Mrs.\n", "4 Allen, Mr. William Henry Mr.\n", ".. ... ...\n", "886 Montvila, Rev. Juozas Rev.\n", "887 Graham, Miss. Margaret Edith Miss.\n", "888 Johnston, Miss. Catherine Helen \"Carrie\" Miss.\n", "889 Behr, Mr. Karl Howell Mr.\n", "890 Dooley, Mr. Patrick Mr.\n", "\n", "[891 rows x 2 columns]" ] }, "metadata": {}, "execution_count": 20 } ] },
{ "cell_type": "code", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "id": "9oZJQ3yWdzE_", "outputId": "66de1d56-6dde-460d-93df-5c0effe79a2a" }, "source": [ "df['Title'].value_counts()" ], "execution_count": 21, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "Mr. 521\n", "Miss. 183\n", "Mrs. 127\n", "Master. 40\n", "Dr. 7\n", "Rev. 6\n", "Mlle. 2\n", "Major. 2\n", "Col. 2\n", "Capt. 1\n", "Name: Title, dtype: int64" ] }, "metadata": {}, "execution_count": 21 } ] },
{ "cell_type": "markdown", "metadata": { "id": "-8n8t68ed8m0" }, "source": [ "### Using Values from more than one column\n", "- Apply the function to the entire dataframe with `axis=1` when the calculation involves more than one column." ] },
{ "cell_type": "code", "metadata": { "id": "9BP4gx8LXGWh", "colab": { "base_uri": "https://localhost:8080/" }, "outputId": "89688c26-bfb4-44b2-94fc-168308e86735" }, "source": [
"def complex_process(row):\n",
"    # Combine values from two columns of the same row.\n",
"    return row['Sex'] + str(row['Age'])\n",
"\n",
"# axis=1 passes each row (rather than each column) to the function.\n",
"df.apply(complex_process, axis=1)" ], "execution_count": 24, "outputs": [ { "output_type": "execute_result", "data": { "text/plain": [ "0 male22.0\n", "1 female38.0\n", "2 female26.0\n", "3 female35.0\n", "4 male35.0\n", " ... \n", "886 male27.0\n", "887 female19.0\n", "888 femalenan\n", "889 male26.0\n", "890 male32.0\n", "Length: 891, dtype: object" ] }, "metadata": {}, "execution_count": 24 } ] } ] }